Posted to solr-user@lucene.apache.org by Timothy Potter <th...@gmail.com> on 2013/04/20 00:37:26 UTC

Rogue query killed several replicas with OOM, after recovering - match all docs query problem

We had a rogue query take out several replicas in a large 4.2.0 cluster
today, due to OOMs (we use the JVM args to kill the process on OOM).
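Concretely, the JVM arg in question is the standard HotSpot on-OOM hook;
the heap size and launcher below are illustrative rather than our exact
command line:

java -Xmx8g \
     -XX:OnOutOfMemoryError="kill -9 %p" \
     -jar start.jar

With that hook a node dies immediately on OutOfMemoryError instead of
limping along in an unpredictable state.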

After recovering, when I execute the match all docs query (*:*), I get a
different count each time.

In other words, if I execute q=*:* several times in a row, then I get a
different count back for numDocs.
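For concreteness, the check is just the select handler with rows=0,
reading numFound from the response; the host and collection names in
this sketch are placeholders:

import json
from urllib.request import urlopen

# placeholder host/collection; rows=0 so we only fetch the count
url = "http://solr-host:8983/solr/collection1/select?q=*:*&rows=0&wt=json"
print(json.load(urlopen(url))["response"]["numFound"])

Running that repeatedly is what keeps returning a different numFound.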

This was not the case prior to the failure as that is one thing we monitor
for.

I think I should be worried ... any ideas on how to troubleshoot this? One
thing to mention is that several of my replicas had to do full recoveries
from the leader when they came back online. Indexing was happening when the
replicas failed.

Thanks.
Tim

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Timothy Potter <th...@gmail.com>.
Hi Sudhakar,

Unfortunately, we don't know the underlying cause and I lost the logs
that could have helped diagnose further. FWIW, I think this is an
extreme case as I've lost nodes before and haven't had any
discrepancies after recovering. In my case, it was a perfect storm of
high-throughput indexing at 8-10K docs/sec and a very nasty query that
OOM'd on about half the nodes in my cluster. Of course we want to get
to the bottom of this but it's going to be hard to reproduce. The good
news is I'm recovered and the cluster is consistent again.

Cheers,
Tim

On Mon, Apr 22, 2013 at 5:18 PM, Sudhakar Maddineni
<ma...@gmail.com> wrote:
> We had encountered similar issue few days back with 4.0- Beta version.
> We have 6 node - 3 shard cluster setup.And, one of our replica
> servers[tomcat] was not responding to any requests because it reached the
> max no of the threads[200 -default]. To temporarily fix the issue, we had
> to restart the server.After restarting, we realized that there were 2
> tomcat processes running[old one + new one].So, we manually killed the two
> tomcat processes and had a clean start.And, we observed the numDocs of
> replica server not matching to the count on leader.
> So, this discrepancy is because we manually killed the process which
> interrupted the sync process?
>
> Thx,Sudhakar.
>
>
>
>
> On Mon, Apr 22, 2013 at 3:28 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> No worries, thanks for the info. Let me know if you gain any more insight!
>> I'd love to figure out what happened here and address it. And I'm
>> especially interested in knowing if you lost any updates if you are able to
>> determine that.
>>
>> - Mark
>>
>> On Apr 22, 2013, at 5:02 PM, Timothy Potter <th...@gmail.com> wrote:
>>
>> > I ended up just nuking the index on the replica with less docs and
>> > restarting it - which triggered the snap pull from the leader. So now
>> > I'm in sync and have better processes in place to capture the
>> > information if it happens again, which given some of the queries my UI
>> > team develops, is highly likely ;-)
>> >
>> > Also, all our input data to Solr lives in Hive so I'm doing some id
>> > -to- id comparisons of what is in Solr vs. what is in Hive to find any
>> > discrepancies.
>> >
>> > Again, sorry about the loss of the logs. This is a tough scenario to
>> > try to re-create as it was a perfect storm of high indexing throughput
>> > and a rogue query.
>> >
>> > Tim
>> >
>> > On Mon, Apr 22, 2013 at 2:41 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >> What do you know about the # of docs you *should*? Do you have that
>> mean when taking the bad replica out of the equation?
>> >>
>> >> - Mark
>> >>
>> >> On Apr 22, 2013, at 4:33 PM, Mark Miller <ma...@gmail.com> wrote:
>> >>
>> >>> Bummer on the log loss :(
>> >>>
>> >>> Good info though. Somehow that replica became active without actually
>> syncing? This is heavily tested (though not with OOM's I suppose), so I'm a
>> little surprised, but it's hard to speculate how it happened without the
>> logs. Specially, the logs from the node that is off would be great - we
>> would see what it did when it recovered and why it might think it was in
>> sync :(
>> >>>
>> >>> - Mark
>> >>>
>> >>> On Apr 22, 2013, at 2:19 PM, Timothy Potter <th...@gmail.com>
>> wrote:
>> >>>
>> >>>> nm - can't read my own output - the leader had more docs than the
>> replica ;-)
>> >>>>
>> >>>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <
>> thelabdude@gmail.com> wrote:
>> >>>>> Have a little more info about this ... the numDocs for *:* fluctuates
>> >>>>> between two values (difference of 324 docs) depending on which nodes
>> I
>> >>>>> hit (distrib=true)
>> >>>>>
>> >>>>> 589,674,416
>> >>>>> 589,674,092
>> >>>>>
>> >>>>> Using distrib=false, I found 1 shard with a mis-match:
>> >>>>>
>> >>>>> shard15: {
>> >>>>> leader = 32,765,254
>> >>>>> replica = 32,764,930 diff:324
>> >>>>> }
>> >>>>>
>> >>>>> Interesting that the replica has more docs than the leader.
>> >>>>>
>> >>>>> Unfortunately, due to some bad log management scripting on my part,
>> >>>>> the logs were lost when these instances got re-started, which really
>> >>>>> bums me out :-(
>> >>>>>
>> >>>>> For now, I'm going to assume the replica with more docs is the one I
>> >>>>> want to keep and will replicate the full index over to the other one.
>> >>>>> Sorry about losing the logs :-(
>> >>>>>
>> >>>>> Tim
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <
>> thelabdude@gmail.com> wrote:
>> >>>>>> Thanks for responding Mark. I'll collect the information you asked
>> >>>>>> about and open a JIRA once I have a little more understanding of
>> what
>> >>>>>> happened. Hopefully I can piece together some story after going over
>> >>>>>> the logs.
>> >>>>>>
>> >>>>>> As for replica / leader, I suspect some leaders went down but
>> >>>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes
>> at
>> >>>>>> once and continued to serve queries, which is awesome.
>> >>>>>>
>> >>>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <
>> markrmiller@gmail.com> wrote:
>> >>>>>>> Yeah, thats no good.
>> >>>>>>>
>> >>>>>>> You might hit each node with distrib=false to get the doc counts.
>> >>>>>>>
>> >>>>>>> Which ones have what you think are the right counts and which the
>> wrong - eg is it all replicas that are off, or leaders as well?
>> >>>>>>>
>> >>>>>>> You say several replicas - do you mean no leaders went down?
>> >>>>>>>
>> >>>>>>> You might look closer at the logs for a node that has it's count
>> off.
>> >>>>>>>
>> >>>>>>> Finally, I guess I'd try and track it in a JIRA issue.
>> >>>>>>>
>> >>>>>>> - Mark
>> >>>>>>>
>> >>>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>>> We had a rogue query take out several replicas in a large 4.2.0
>> cluster
>> >>>>>>>> today, due to OOM's (we use the JVM args to kill the process on
>> OOM).
>> >>>>>>>>
>> >>>>>>>> After recovering, when I execute the match all docs query (*:*),
>> I get a
>> >>>>>>>> different count each time.
>> >>>>>>>>
>> >>>>>>>> In other words, if I execute q=*:* several times in a row, then I
>> get a
>> >>>>>>>> different count back for numDocs.
>> >>>>>>>>
>> >>>>>>>> This was not the case prior to the failure as that is one thing
>> we monitor
>> >>>>>>>> for.
>> >>>>>>>>
>> >>>>>>>> I think I should be worried ... any ideas on how to troubleshoot
>> this? One
>> >>>>>>>> thing to mention is that several of my replicas had to do full
>> recoveries
>> >>>>>>>> from the leader when they came back online. Indexing was
>> happening when the
>> >>>>>>>> replicas failed.
>> >>>>>>>>
>> >>>>>>>> Thanks.
>> >>>>>>>> Tim
>> >>>>>>>
>> >>>
>> >>
>>
>>

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Sudhakar Maddineni <ma...@gmail.com>.
We encountered a similar issue a few days back with the 4.0-Beta version.
We have a 6-node, 3-shard cluster setup. One of our replica
servers [Tomcat] was not responding to any requests because it had reached
the max number of threads [200 by default]. To temporarily fix the issue,
we had to restart the server. After restarting, we realized that there were
2 Tomcat processes running [the old one + the new one], so we manually
killed both Tomcat processes and had a clean start. After that, we observed
that the numDocs on the replica server did not match the count on the leader.
So, is this discrepancy because we manually killed the process, which
interrupted the sync process?
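For reference, the thread cap involved is the Tomcat HTTP connector
setting in server.xml; a sketch of raising it (port and values are
illustrative):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="400" />

The default maxThreads is 200, which is the limit our replica hit.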

Thx,Sudhakar.




On Mon, Apr 22, 2013 at 3:28 PM, Mark Miller <ma...@gmail.com> wrote:

> No worries, thanks for the info. Let me know if you gain any more insight!
> I'd love to figure out what happened here and address it. And I'm
> especially interested in knowing if you lost any updates if you are able to
> determine that.
>
> - Mark
>
> On Apr 22, 2013, at 5:02 PM, Timothy Potter <th...@gmail.com> wrote:
>
> > I ended up just nuking the index on the replica with less docs and
> > restarting it - which triggered the snap pull from the leader. So now
> > I'm in sync and have better processes in place to capture the
> > information if it happens again, which given some of the queries my UI
> > team develops, is highly likely ;-)
> >
> > Also, all our input data to Solr lives in Hive so I'm doing some id
> > -to- id comparisons of what is in Solr vs. what is in Hive to find any
> > discrepancies.
> >
> > Again, sorry about the loss of the logs. This is a tough scenario to
> > try to re-create as it was a perfect storm of high indexing throughput
> > and a rogue query.
> >
> > Tim
> >
> > On Mon, Apr 22, 2013 at 2:41 PM, Mark Miller <ma...@gmail.com>
> wrote:
> >> What do you know about the # of docs you *should*? Do you have that
> mean when taking the bad replica out of the equation?
> >>
> >> - Mark
> >>
> >> On Apr 22, 2013, at 4:33 PM, Mark Miller <ma...@gmail.com> wrote:
> >>
> >>> Bummer on the log loss :(
> >>>
> >>> Good info though. Somehow that replica became active without actually
> syncing? This is heavily tested (though not with OOM's I suppose), so I'm a
> little surprised, but it's hard to speculate how it happened without the
> logs. Specially, the logs from the node that is off would be great - we
> would see what it did when it recovered and why it might think it was in
> sync :(
> >>>
> >>> - Mark
> >>>
> >>> On Apr 22, 2013, at 2:19 PM, Timothy Potter <th...@gmail.com>
> wrote:
> >>>
> >>>> nm - can't read my own output - the leader had more docs than the
> replica ;-)
> >>>>
> >>>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <
> thelabdude@gmail.com> wrote:
> >>>>> Have a little more info about this ... the numDocs for *:* fluctuates
> >>>>> between two values (difference of 324 docs) depending on which nodes
> I
> >>>>> hit (distrib=true)
> >>>>>
> >>>>> 589,674,416
> >>>>> 589,674,092
> >>>>>
> >>>>> Using distrib=false, I found 1 shard with a mis-match:
> >>>>>
> >>>>> shard15: {
> >>>>> leader = 32,765,254
> >>>>> replica = 32,764,930 diff:324
> >>>>> }
> >>>>>
> >>>>> Interesting that the replica has more docs than the leader.
> >>>>>
> >>>>> Unfortunately, due to some bad log management scripting on my part,
> >>>>> the logs were lost when these instances got re-started, which really
> >>>>> bums me out :-(
> >>>>>
> >>>>> For now, I'm going to assume the replica with more docs is the one I
> >>>>> want to keep and will replicate the full index over to the other one.
> >>>>> Sorry about losing the logs :-(
> >>>>>
> >>>>> Tim
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <
> thelabdude@gmail.com> wrote:
> >>>>>> Thanks for responding Mark. I'll collect the information you asked
> >>>>>> about and open a JIRA once I have a little more understanding of
> what
> >>>>>> happened. Hopefully I can piece together some story after going over
> >>>>>> the logs.
> >>>>>>
> >>>>>> As for replica / leader, I suspect some leaders went down but
> >>>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes
> at
> >>>>>> once and continued to serve queries, which is awesome.
> >>>>>>
> >>>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <
> markrmiller@gmail.com> wrote:
> >>>>>>> Yeah, thats no good.
> >>>>>>>
> >>>>>>> You might hit each node with distrib=false to get the doc counts.
> >>>>>>>
> >>>>>>> Which ones have what you think are the right counts and which the
> wrong - eg is it all replicas that are off, or leaders as well?
> >>>>>>>
> >>>>>>> You say several replicas - do you mean no leaders went down?
> >>>>>>>
> >>>>>>> You might look closer at the logs for a node that has it's count
> off.
> >>>>>>>
> >>>>>>> Finally, I guess I'd try and track it in a JIRA issue.
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>>> We had a rogue query take out several replicas in a large 4.2.0
> cluster
> >>>>>>>> today, due to OOM's (we use the JVM args to kill the process on
> OOM).
> >>>>>>>>
> >>>>>>>> After recovering, when I execute the match all docs query (*:*),
> I get a
> >>>>>>>> different count each time.
> >>>>>>>>
> >>>>>>>> In other words, if I execute q=*:* several times in a row, then I
> get a
> >>>>>>>> different count back for numDocs.
> >>>>>>>>
> >>>>>>>> This was not the case prior to the failure as that is one thing
> we monitor
> >>>>>>>> for.
> >>>>>>>>
> >>>>>>>> I think I should be worried ... any ideas on how to troubleshoot
> this? One
> >>>>>>>> thing to mention is that several of my replicas had to do full
> recoveries
> >>>>>>>> from the leader when they came back online. Indexing was
> happening when the
> >>>>>>>> replicas failed.
> >>>>>>>>
> >>>>>>>> Thanks.
> >>>>>>>> Tim
> >>>>>>>
> >>>
> >>
>
>

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Mark Miller <ma...@gmail.com>.
No worries, thanks for the info. Let me know if you gain any more insight! I'd love to figure out what happened here and address it. And I'm especially interested in knowing whether you lost any updates, if you are able to determine that.

- Mark

On Apr 22, 2013, at 5:02 PM, Timothy Potter <th...@gmail.com> wrote:

> I ended up just nuking the index on the replica with less docs and
> restarting it - which triggered the snap pull from the leader. So now
> I'm in sync and have better processes in place to capture the
> information if it happens again, which given some of the queries my UI
> team develops, is highly likely ;-)
> 
> Also, all our input data to Solr lives in Hive so I'm doing some id
> -to- id comparisons of what is in Solr vs. what is in Hive to find any
> discrepancies.
> 
> Again, sorry about the loss of the logs. This is a tough scenario to
> try to re-create as it was a perfect storm of high indexing throughput
> and a rogue query.
> 
> Tim
> 
> On Mon, Apr 22, 2013 at 2:41 PM, Mark Miller <ma...@gmail.com> wrote:
>> What do you know about the # of docs you *should*? Do you have that mean when taking the bad replica out of the equation?
>> 
>> - Mark
>> 
>> On Apr 22, 2013, at 4:33 PM, Mark Miller <ma...@gmail.com> wrote:
>> 
>>> Bummer on the log loss :(
>>> 
>>> Good info though. Somehow that replica became active without actually syncing? This is heavily tested (though not with OOM's I suppose), so I'm a little surprised, but it's hard to speculate how it happened without the logs. Specially, the logs from the node that is off would be great - we would see what it did when it recovered and why it might think it was in sync :(
>>> 
>>> - Mark
>>> 
>>> On Apr 22, 2013, at 2:19 PM, Timothy Potter <th...@gmail.com> wrote:
>>> 
>>>> nm - can't read my own output - the leader had more docs than the replica ;-)
>>>> 
>>>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <th...@gmail.com> wrote:
>>>>> Have a little more info about this ... the numDocs for *:* fluctuates
>>>>> between two values (difference of 324 docs) depending on which nodes I
>>>>> hit (distrib=true)
>>>>> 
>>>>> 589,674,416
>>>>> 589,674,092
>>>>> 
>>>>> Using distrib=false, I found 1 shard with a mis-match:
>>>>> 
>>>>> shard15: {
>>>>> leader = 32,765,254
>>>>> replica = 32,764,930 diff:324
>>>>> }
>>>>> 
>>>>> Interesting that the replica has more docs than the leader.
>>>>> 
>>>>> Unfortunately, due to some bad log management scripting on my part,
>>>>> the logs were lost when these instances got re-started, which really
>>>>> bums me out :-(
>>>>> 
>>>>> For now, I'm going to assume the replica with more docs is the one I
>>>>> want to keep and will replicate the full index over to the other one.
>>>>> Sorry about losing the logs :-(
>>>>> 
>>>>> Tim
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <th...@gmail.com> wrote:
>>>>>> Thanks for responding Mark. I'll collect the information you asked
>>>>>> about and open a JIRA once I have a little more understanding of what
>>>>>> happened. Hopefully I can piece together some story after going over
>>>>>> the logs.
>>>>>> 
>>>>>> As for replica / leader, I suspect some leaders went down but
>>>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>>>>> once and continued to serve queries, which is awesome.
>>>>>> 
>>>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
>>>>>>> Yeah, thats no good.
>>>>>>> 
>>>>>>> You might hit each node with distrib=false to get the doc counts.
>>>>>>> 
>>>>>>> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>>>>>>> 
>>>>>>> You say several replicas - do you mean no leaders went down?
>>>>>>> 
>>>>>>> You might look closer at the logs for a node that has it's count off.
>>>>>>> 
>>>>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>>>>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>>>>>>> 
>>>>>>>> After recovering, when I execute the match all docs query (*:*), I get a
>>>>>>>> different count each time.
>>>>>>>> 
>>>>>>>> In other words, if I execute q=*:* several times in a row, then I get a
>>>>>>>> different count back for numDocs.
>>>>>>>> 
>>>>>>>> This was not the case prior to the failure as that is one thing we monitor
>>>>>>>> for.
>>>>>>>> 
>>>>>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>>>>>> thing to mention is that several of my replicas had to do full recoveries
>>>>>>>> from the leader when they came back online. Indexing was happening when the
>>>>>>>> replicas failed.
>>>>>>>> 
>>>>>>>> Thanks.
>>>>>>>> Tim
>>>>>>> 
>>> 
>> 


Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Timothy Potter <th...@gmail.com>.
I ended up just nuking the index on the replica with fewer docs and
restarting it - which triggered the snap pull from the leader. So now
I'm in sync and have better processes in place to capture the
information if it happens again, which, given some of the queries my UI
team develops, is highly likely ;-)
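In case it helps anyone else, "nuking the index" was nothing fancier
than this (core name and path are illustrative - use whatever your
instanceDir actually is):

1. Stop Solr on the bad node.
2. rm -rf <solr_home>/<core_name>/data/index
3. Start Solr again - the core comes up empty, goes into recovery, and
   pulls a full copy of the index from the shard leader.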

Also, all our input data to Solr lives in Hive, so I'm doing some
id-to-id comparisons of what is in Solr vs. what is in Hive to find any
discrepancies.
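The comparison itself is nothing sophisticated - pull the ids for a
bounded slice (say, the window when the nodes were down) from both
sides and diff the sets. A rough sketch; every field, host, and file
name here is made up:

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ids Solr indexed during the suspect window (placeholder names)
params = urlencode({
    "q": "*:*",
    "fq": "indexed_at:[2013-04-19T00:00:00Z TO 2013-04-21T00:00:00Z]",
    "fl": "id",
    "wt": "json",
    "rows": 1000000,
})
resp = json.load(urlopen("http://solr-host:8983/solr/collection1/select?" + params))
solr_ids = {doc["id"] for doc in resp["response"]["docs"]}

# the same ids exported from Hive, one per line
with open("hive_ids.txt") as f:
    hive_ids = {line.strip() for line in f if line.strip()}

print("in Hive but missing from Solr:", len(hive_ids - solr_ids))
print("in Solr but not in Hive:", len(solr_ids - hive_ids))

In practice this has to be chunked by shard or by time range rather
than done in one request, but that's the idea.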

Again, sorry about the loss of the logs. This is a tough scenario to
try to re-create as it was a perfect storm of high indexing throughput
and a rogue query.

Tim

On Mon, Apr 22, 2013 at 2:41 PM, Mark Miller <ma...@gmail.com> wrote:
> What do you know about the # of docs you *should*? Do you have that mean when taking the bad replica out of the equation?
>
> - Mark
>
> On Apr 22, 2013, at 4:33 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> Bummer on the log loss :(
>>
>> Good info though. Somehow that replica became active without actually syncing? This is heavily tested (though not with OOM's I suppose), so I'm a little surprised, but it's hard to speculate how it happened without the logs. Specially, the logs from the node that is off would be great - we would see what it did when it recovered and why it might think it was in sync :(
>>
>> - Mark
>>
>> On Apr 22, 2013, at 2:19 PM, Timothy Potter <th...@gmail.com> wrote:
>>
>>> nm - can't read my own output - the leader had more docs than the replica ;-)
>>>
>>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <th...@gmail.com> wrote:
>>>> Have a little more info about this ... the numDocs for *:* fluctuates
>>>> between two values (difference of 324 docs) depending on which nodes I
>>>> hit (distrib=true)
>>>>
>>>> 589,674,416
>>>> 589,674,092
>>>>
>>>> Using distrib=false, I found 1 shard with a mis-match:
>>>>
>>>> shard15: {
>>>> leader = 32,765,254
>>>> replica = 32,764,930 diff:324
>>>> }
>>>>
>>>> Interesting that the replica has more docs than the leader.
>>>>
>>>> Unfortunately, due to some bad log management scripting on my part,
>>>> the logs were lost when these instances got re-started, which really
>>>> bums me out :-(
>>>>
>>>> For now, I'm going to assume the replica with more docs is the one I
>>>> want to keep and will replicate the full index over to the other one.
>>>> Sorry about losing the logs :-(
>>>>
>>>> Tim
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <th...@gmail.com> wrote:
>>>>> Thanks for responding Mark. I'll collect the information you asked
>>>>> about and open a JIRA once I have a little more understanding of what
>>>>> happened. Hopefully I can piece together some story after going over
>>>>> the logs.
>>>>>
>>>>> As for replica / leader, I suspect some leaders went down but
>>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>>>> once and continued to serve queries, which is awesome.
>>>>>
>>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
>>>>>> Yeah, thats no good.
>>>>>>
>>>>>> You might hit each node with distrib=false to get the doc counts.
>>>>>>
>>>>>> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>>>>>>
>>>>>> You say several replicas - do you mean no leaders went down?
>>>>>>
>>>>>> You might look closer at the logs for a node that has it's count off.
>>>>>>
>>>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>>>>>>
>>>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>>>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>>>>>>
>>>>>>> After recovering, when I execute the match all docs query (*:*), I get a
>>>>>>> different count each time.
>>>>>>>
>>>>>>> In other words, if I execute q=*:* several times in a row, then I get a
>>>>>>> different count back for numDocs.
>>>>>>>
>>>>>>> This was not the case prior to the failure as that is one thing we monitor
>>>>>>> for.
>>>>>>>
>>>>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>>>>> thing to mention is that several of my replicas had to do full recoveries
>>>>>>> from the leader when they came back online. Indexing was happening when the
>>>>>>> replicas failed.
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Tim
>>>>>>
>>
>

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Mark Miller <ma...@gmail.com>.
What do you know about the # of docs you *should* have? Do you have that many when taking the bad replica out of the equation?

- Mark

On Apr 22, 2013, at 4:33 PM, Mark Miller <ma...@gmail.com> wrote:

> Bummer on the log loss :(
> 
> Good info though. Somehow that replica became active without actually syncing? This is heavily tested (though not with OOM's I suppose), so I'm a little surprised, but it's hard to speculate how it happened without the logs. Specially, the logs from the node that is off would be great - we would see what it did when it recovered and why it might think it was in sync :(
> 
> - Mark
> 
> On Apr 22, 2013, at 2:19 PM, Timothy Potter <th...@gmail.com> wrote:
> 
>> nm - can't read my own output - the leader had more docs than the replica ;-)
>> 
>> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <th...@gmail.com> wrote:
>>> Have a little more info about this ... the numDocs for *:* fluctuates
>>> between two values (difference of 324 docs) depending on which nodes I
>>> hit (distrib=true)
>>> 
>>> 589,674,416
>>> 589,674,092
>>> 
>>> Using distrib=false, I found 1 shard with a mis-match:
>>> 
>>> shard15: {
>>> leader = 32,765,254
>>> replica = 32,764,930 diff:324
>>> }
>>> 
>>> Interesting that the replica has more docs than the leader.
>>> 
>>> Unfortunately, due to some bad log management scripting on my part,
>>> the logs were lost when these instances got re-started, which really
>>> bums me out :-(
>>> 
>>> For now, I'm going to assume the replica with more docs is the one I
>>> want to keep and will replicate the full index over to the other one.
>>> Sorry about losing the logs :-(
>>> 
>>> Tim
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <th...@gmail.com> wrote:
>>>> Thanks for responding Mark. I'll collect the information you asked
>>>> about and open a JIRA once I have a little more understanding of what
>>>> happened. Hopefully I can piece together some story after going over
>>>> the logs.
>>>> 
>>>> As for replica / leader, I suspect some leaders went down but
>>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>>> once and continued to serve queries, which is awesome.
>>>> 
>>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
>>>>> Yeah, thats no good.
>>>>> 
>>>>> You might hit each node with distrib=false to get the doc counts.
>>>>> 
>>>>> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>>>>> 
>>>>> You say several replicas - do you mean no leaders went down?
>>>>> 
>>>>> You might look closer at the logs for a node that has it's count off.
>>>>> 
>>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>>>>> 
>>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>>>>> 
>>>>>> After recovering, when I execute the match all docs query (*:*), I get a
>>>>>> different count each time.
>>>>>> 
>>>>>> In other words, if I execute q=*:* several times in a row, then I get a
>>>>>> different count back for numDocs.
>>>>>> 
>>>>>> This was not the case prior to the failure as that is one thing we monitor
>>>>>> for.
>>>>>> 
>>>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>>>> thing to mention is that several of my replicas had to do full recoveries
>>>>>> from the leader when they came back online. Indexing was happening when the
>>>>>> replicas failed.
>>>>>> 
>>>>>> Thanks.
>>>>>> Tim
>>>>> 
> 


Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Mark Miller <ma...@gmail.com>.
Bummer on the log loss :(

Good info though. Somehow that replica became active without actually syncing? This is heavily tested (though not with OOMs, I suppose), so I'm a little surprised, but it's hard to speculate about how it happened without the logs. In particular, the logs from the node that is off would be great - we would see what it did when it recovered and why it might have thought it was in sync :(

- Mark

On Apr 22, 2013, at 2:19 PM, Timothy Potter <th...@gmail.com> wrote:

> nm - can't read my own output - the leader had more docs than the replica ;-)
> 
> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <th...@gmail.com> wrote:
>> Have a little more info about this ... the numDocs for *:* fluctuates
>> between two values (difference of 324 docs) depending on which nodes I
>> hit (distrib=true)
>> 
>> 589,674,416
>> 589,674,092
>> 
>> Using distrib=false, I found 1 shard with a mis-match:
>> 
>> shard15: {
>>  leader = 32,765,254
>>  replica = 32,764,930 diff:324
>> }
>> 
>> Interesting that the replica has more docs than the leader.
>> 
>> Unfortunately, due to some bad log management scripting on my part,
>> the logs were lost when these instances got re-started, which really
>> bums me out :-(
>> 
>> For now, I'm going to assume the replica with more docs is the one I
>> want to keep and will replicate the full index over to the other one.
>> Sorry about losing the logs :-(
>> 
>> Tim
>> 
>> 
>> 
>> 
>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <th...@gmail.com> wrote:
>>> Thanks for responding Mark. I'll collect the information you asked
>>> about and open a JIRA once I have a little more understanding of what
>>> happened. Hopefully I can piece together some story after going over
>>> the logs.
>>> 
>>> As for replica / leader, I suspect some leaders went down but
>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>> once and continued to serve queries, which is awesome.
>>> 
>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
>>>> Yeah, thats no good.
>>>> 
>>>> You might hit each node with distrib=false to get the doc counts.
>>>> 
>>>> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>>>> 
>>>> You say several replicas - do you mean no leaders went down?
>>>> 
>>>> You might look closer at the logs for a node that has it's count off.
>>>> 
>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>> 
>>>> - Mark
>>>> 
>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>>>> 
>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>>>> 
>>>>> After recovering, when I execute the match all docs query (*:*), I get a
>>>>> different count each time.
>>>>> 
>>>>> In other words, if I execute q=*:* several times in a row, then I get a
>>>>> different count back for numDocs.
>>>>> 
>>>>> This was not the case prior to the failure as that is one thing we monitor
>>>>> for.
>>>>> 
>>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>>> thing to mention is that several of my replicas had to do full recoveries
>>>>> from the leader when they came back online. Indexing was happening when the
>>>>> replicas failed.
>>>>> 
>>>>> Thanks.
>>>>> Tim
>>>> 


Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Timothy Potter <th...@gmail.com>.
nm - can't read my own output - the leader had more docs than the replica ;-)

On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <th...@gmail.com> wrote:
> Have a little more info about this ... the numDocs for *:* fluctuates
> between two values (difference of 324 docs) depending on which nodes I
> hit (distrib=true)
>
> 589,674,416
> 589,674,092
>
> Using distrib=false, I found 1 shard with a mis-match:
>
> shard15: {
>   leader = 32,765,254
>   replica = 32,764,930 diff:324
> }
>
> Interesting that the replica has more docs than the leader.
>
> Unfortunately, due to some bad log management scripting on my part,
> the logs were lost when these instances got re-started, which really
> bums me out :-(
>
> For now, I'm going to assume the replica with more docs is the one I
> want to keep and will replicate the full index over to the other one.
> Sorry about losing the logs :-(
>
> Tim
>
>
>
>
> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <th...@gmail.com> wrote:
>> Thanks for responding Mark. I'll collect the information you asked
>> about and open a JIRA once I have a little more understanding of what
>> happened. Hopefully I can piece together some story after going over
>> the logs.
>>
>> As for replica / leader, I suspect some leaders went down but
>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>> once and continued to serve queries, which is awesome.
>>
>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
>>> Yeah, thats no good.
>>>
>>> You might hit each node with distrib=false to get the doc counts.
>>>
>>> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>>>
>>> You say several replicas - do you mean no leaders went down?
>>>
>>> You might look closer at the logs for a node that has it's count off.
>>>
>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>
>>> - Mark
>>>
>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>>>
>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>>>
>>>> After recovering, when I execute the match all docs query (*:*), I get a
>>>> different count each time.
>>>>
>>>> In other words, if I execute q=*:* several times in a row, then I get a
>>>> different count back for numDocs.
>>>>
>>>> This was not the case prior to the failure as that is one thing we monitor
>>>> for.
>>>>
>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>> thing to mention is that several of my replicas had to do full recoveries
>>>> from the leader when they came back online. Indexing was happening when the
>>>> replicas failed.
>>>>
>>>> Thanks.
>>>> Tim
>>>

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Timothy Potter <th...@gmail.com>.
Have a little more info about this ... the numDocs for *:* fluctuates
between two values (difference of 324 docs) depending on which nodes I
hit (distrib=true)

589,674,416
589,674,092

Using distrib=false, I found 1 shard with a mismatch:

shard15: {
  leader = 32,765,254
  replica = 32,764,930 diff:324
}
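For anyone who wants to repeat the check, a sketch of the idea (not my
actual script; base URLs are placeholders, add a pair per shard):

import json
from urllib.request import urlopen

# placeholder base URLs: one (leader, replica) pair per shard
shards = {
    "shard15": ("http://solr15a:8983/solr/collection1",
                "http://solr15b:8983/solr/collection1"),
}

def num_docs(core_url):
    # distrib=false keeps the count local to that core
    url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
    return json.load(urlopen(url))["response"]["numFound"]

for shard, (leader, replica) in sorted(shards.items()):
    l, r = num_docs(leader), num_docs(replica)
    note = "" if l == r else " <-- mismatch, diff=%d" % abs(l - r)
    print("%s: leader=%d replica=%d%s" % (shard, l, r, note))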

Interesting that the replica has more docs than the leader.

Unfortunately, due to some bad log management scripting on my part,
the logs were lost when these instances got re-started, which really
bums me out :-(

For now, I'm going to assume the replica with more docs is the one I
want to keep and will replicate the full index over to the other one.
Sorry about losing the logs :-(

Tim




On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <th...@gmail.com> wrote:
> Thanks for responding Mark. I'll collect the information you asked
> about and open a JIRA once I have a little more understanding of what
> happened. Hopefully I can piece together some story after going over
> the logs.
>
> As for replica / leader, I suspect some leaders went down but
> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
> once and continued to serve queries, which is awesome.
>
> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
>> Yeah, thats no good.
>>
>> You might hit each node with distrib=false to get the doc counts.
>>
>> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>>
>> You say several replicas - do you mean no leaders went down?
>>
>> You might look closer at the logs for a node that has it's count off.
>>
>> Finally, I guess I'd try and track it in a JIRA issue.
>>
>> - Mark
>>
>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>>
>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>>
>>> After recovering, when I execute the match all docs query (*:*), I get a
>>> different count each time.
>>>
>>> In other words, if I execute q=*:* several times in a row, then I get a
>>> different count back for numDocs.
>>>
>>> This was not the case prior to the failure as that is one thing we monitor
>>> for.
>>>
>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>> thing to mention is that several of my replicas had to do full recoveries
>>> from the leader when they came back online. Indexing was happening when the
>>> replicas failed.
>>>
>>> Thanks.
>>> Tim
>>

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Timothy Potter <th...@gmail.com>.
Thanks for responding Mark. I'll collect the information you asked
about and open a JIRA once I have a little more understanding of what
happened. Hopefully I can piece together some story after going over
the logs.

As for replica / leader, I suspect some leaders went down but
fail-over to new leaders seemed to work fine. We lost about 9 nodes at
once and continued to serve queries, which is awesome.

On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <ma...@gmail.com> wrote:
> Yeah, thats no good.
>
> You might hit each node with distrib=false to get the doc counts.
>
> Which ones have what you think are the right counts and which the wrong - eg is it all replicas that are off, or leaders as well?
>
> You say several replicas - do you mean no leaders went down?
>
> You might look closer at the logs for a node that has it's count off.
>
> Finally, I guess I'd try and track it in a JIRA issue.
>
> - Mark
>
> On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:
>
>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>> today, due to OOM's (we use the JVM args to kill the process on OOM).
>>
>> After recovering, when I execute the match all docs query (*:*), I get a
>> different count each time.
>>
>> In other words, if I execute q=*:* several times in a row, then I get a
>> different count back for numDocs.
>>
>> This was not the case prior to the failure as that is one thing we monitor
>> for.
>>
>> I think I should be worried ... any ideas on how to troubleshoot this? One
>> thing to mention is that several of my replicas had to do full recoveries
>> from the leader when they came back online. Indexing was happening when the
>> replicas failed.
>>
>> Thanks.
>> Tim
>

Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

Posted by Mark Miller <ma...@gmail.com>.
Yeah, that's no good.

You might hit each node with distrib=false to get the doc counts.

Which ones have what you think are the right counts and which the wrong ones - e.g. is it all replicas that are off, or leaders as well?

You say several replicas - do you mean no leaders went down?

You might look closer at the logs for a node that has its count off.

Finally, I guess I'd try and track it in a JIRA issue.

- Mark

On Apr 19, 2013, at 6:37 PM, Timothy Potter <th...@gmail.com> wrote:

> We had a rogue query take out several replicas in a large 4.2.0 cluster
> today, due to OOM's (we use the JVM args to kill the process on OOM).
> 
> After recovering, when I execute the match all docs query (*:*), I get a
> different count each time.
> 
> In other words, if I execute q=*:* several times in a row, then I get a
> different count back for numDocs.
> 
> This was not the case prior to the failure as that is one thing we monitor
> for.
> 
> I think I should be worried ... any ideas on how to troubleshoot this? One
> thing to mention is that several of my replicas had to do full recoveries
> from the leader when they came back online. Indexing was happening when the
> replicas failed.
> 
> Thanks.
> Tim