You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by AlexeyK <le...@gmail.com> on 2013/05/23 09:02:38 UTC

Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Consider the following:
Solr 4.3, 2 node test cluster, each is a leader.
During (or immediately after, before hard commit) indexing I shutdown one of
them and restart later.
The tlog is about 200Mb size.
I see recurring 'Reordered DBQs detected' in the log, seems like an endless
loop because THE VERY SAME update query appears thousands of times, runs for
a long time now.
In the meanwhile, the node is inaccessible (obviously) but in the Zk state
it appears as active, NOT in recovery mode or down.
It seems that this is caused by a recent changed in ZkController which adds
recovery logic into 'register' routine.

Regards,
Alexey




--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

I have opened https://issues.apache.org/jira/browse/SOLR-4870


On Tue, May 28, 2013 at 5:53 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> This sounds like a bug. I'll open an issue. Thanks!
>
>
> On Tue, May 28, 2013 at 2:29 PM, AlexeyK <le...@gmail.com> wrote:
>
>> The cluster state problem reported above is not an issue - it was caused
>> by
>> our own code.
>> Speaking about the update log - i have noticed a strange behavior
>> concerning
>> the replay. The replay is *supposed* to be done for a predefined number of
>> log entries, but actually it is always done for the whole last 2 tlogs.
>> RecentUpdates.update() reads log within  while (numUpdates <
>> numRecordsToKeep), while numUpdates is never incremented, so it exits when
>> the reader reaches EOF.
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4066452.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

This sounds like a bug. I'll open an issue. Thanks!


On Tue, May 28, 2013 at 2:29 PM, AlexeyK <le...@gmail.com> wrote:

> The cluster state problem reported above is not an issue - it was caused by
> our own code.
> Speaking about the update log - i have noticed a strange behavior
> concerning
> the replay. The replay is *supposed* to be done for a predefined number of
> log entries, but actually it is always done for the whole last 2 tlogs.
> RecentUpdates.update() reads log within  while (numUpdates <
> numRecordsToKeep), while numUpdates is never incremented, so it exits when
> the reader reaches EOF.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4066452.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by AlexeyK <le...@gmail.com>.

The cluster state problem reported above is not an issue - it was caused by
our own code.
Speaking about the update log - i have noticed a strange behavior concerning
the replay. The replay is *supposed* to be done for a predefined number of
log entries, but actually it is always done for the whole last 2 tlogs.
RecentUpdates.update() reads log within  while (numUpdates <
numRecordsToKeep), while numUpdates is never incremented, so it exits when
the reader reaches EOF. 



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4066452.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by Erick Erickson <er...@gmail.com>.

Unfortunately I don't quite know the internals of this code well. I
vaguely remember
a problem with insuring that deletes were handled correctly, so this may be a
manifestation of that fix. As I remember optimistic locking is mixed
up in this too.

But all that means is that I really can't answer your question, I'll
have to leave that
to people more familiar with the code.

Best
Erick

On Thu, May 23, 2013 at 9:30 AM, AlexeyK <le...@gmail.com> wrote:
> the hard commit is set to about 20 minutes, while ram buffer is 256Mb.
> We will add more frequent hard commits without refreshing the searcher, that
> for the tip.
>
> from what I understood from the code, for each 'add' command there is a test
> for a 'delete by query'. if there is an older dbq, it's run after the 'add'
> operation if its version > 'add' version.
> in my case, there are a lot of documents to be inserted, and a single large
> DBQ. My question is: shouldn't this be done in bulks? Why is it necessary to
> run the DBQ after each insertion? Supposedly there are 1000 insertions it's
> run 1000 times.
>
>
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4065628.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Thu, May 23, 2013 at 7:00 PM, AlexeyK <le...@gmail.com> wrote:

<snip />

> from what I understood from the code, for each 'add' command there is a
> test
> for a 'delete by query'. if there is an older dbq, it's run after the 'add'
> operation if its version > 'add' version.
> in my case, there are a lot of documents to be inserted, and a single large
> DBQ. My question is: shouldn't this be done in bulks? Why is it necessary
> to
> run the DBQ after each insertion? Supposedly there are 1000 insertions it's
> run 1000 times.
>
>
>
As I understand it, this is done to handle out-of-order updates. Suppose a
client makes a few add requests and then invokes a DBQ but the DBQ reaches
the replicas before the last add request. In such a case, the DBQ is
executed after the add request to preserve consistency. We don't do that in
bulk because we don't know how long to wait for all add requests to arrive.
Also, the individual add requests may arrive via different threads (think
connection reset from leader to replica).

That being said, the scenario you describe of a 1000 insertions causing
DBQs to be run a large number of times (on recovery after restarting) could
be optimized. Note that the bug you discovered (SOLR-4870) does not affect
log replay because log replay on startup will replay all of the last two
transaction logs (unless they end with a commit). Only PeerSync is affected
by SOLR-4870.

You say that both nodes are leaders but the comment inside
DirectUpdateHandler2.addDoc() says that deletesAfter (i.e. reordered DBQs)
should always be null on leaders. So there's definitely something fishy
here. A quick review of the code leads me to believe that reordered DBQs
can happen on a leader as well. I'll investigate further.

> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4065628.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by AlexeyK <le...@gmail.com>.

the hard commit is set to about 20 minutes, while ram buffer is 256Mb. 
We will add more frequent hard commits without refreshing the searcher, that
for the tip.

from what I understood from the code, for each 'add' command there is a test
for a 'delete by query'. if there is an older dbq, it's run after the 'add'
operation if its version > 'add' version.
in my case, there are a lot of documents to be inserted, and a single large
DBQ. My question is: shouldn't this be done in bulks? Why is it necessary to
run the DBQ after each insertion? Supposedly there are 1000 insertions it's
run 1000 times.





--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4065628.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by Jan Høydahl <ja...@cominvent.com>.

Huge tlogs seems to be a common problem. Should we make it flush automatically on huge file size? Could be configurable on the <updateLog> tag?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

23. mai 2013 kl. 14:03 skrev Erick Erickson <er...@gmail.com>:

> Tangential to the issue you raise is that this is a huge tlog. It indicates that
> you aren't doing a hard commit (openSearcher=false) very often. That
> operation will truncate your tlog which should speed recovery/startup.
> You're also chewing up some memory with a tlog that size since pointers
> to the tlog are kept for each document.
> 
> This comment doesn't address your comment about the change to
> ZkController, I'll leave that to someone who knows the code.
> 
> Best
> Erick
> 
> On Thu, May 23, 2013 at 3:14 AM, AlexeyK <le...@gmail.com> wrote:
>> a small change: it's not an endless loop, but a painfully slow processing
>> which includes running a delete query and then insertion. Each document from
>> the tlog takes tens of seconds to process (more than 100 times slower than
>> during normal insertion process)
>> 
>> 
>> 
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4065551.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by Erick Erickson <er...@gmail.com>.

Tangential to the issue you raise is that this is a huge tlog. It indicates that
you aren't doing a hard commit (openSearcher=false) very often. That
operation will truncate your tlog which should speed recovery/startup.
You're also chewing up some memory with a tlog that size since pointers
to the tlog are kept for each document.

This comment doesn't address your comment about the change to
ZkController, I'll leave that to someone who knows the code.

Best
Erick

On Thu, May 23, 2013 at 3:14 AM, AlexeyK <le...@gmail.com> wrote:
> a small change: it's not an endless loop, but a painfully slow processing
> which includes running a delete query and then insertion. Each document from
> the tlog takes tens of seconds to process (more than 100 times slower than
> during normal insertion process)
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4065551.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 4.3: node is seen as active in Zk while in recovery mode + endless recovery

Posted by AlexeyK <le...@gmail.com>.

a small change: it's not an endless loop, but a painfully slow processing
which includes running a delete query and then insertion. Each document from
the tlog takes tens of seconds to process (more than 100 times slower than
during normal insertion process)



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-3-node-is-seen-as-active-in-Zk-while-in-recovery-mode-endless-recovery-tp4065549p4065551.html
Sent from the Solr - User mailing list archive at Nabble.com.