Posted to solr-user@lucene.apache.org by Avishai Ish-Shalom <av...@fewbytes.com> on 2014/03/12 22:06:35 UTC

single node causing cluster-wide outage

Hi all!

After upgrading to Solr 4.6.1 we encountered a situation where a cluster
outage was traced to a single misbehaving node; after restarting the node,
the cluster immediately returned to normal operation.
The bad node had ~420 threads locked on FastLRUCache, and most
httpShardExecutor threads were waiting on Apache Commons HTTP futures.
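
For anyone wanting to get a similar count from their own thread dump, here is
a minimal sketch. It assumes a jstack-style text dump in which each thread
starts with a quoted name and waiting threads have a line such as
"- waiting to lock <addr> (a some.Class)". The function name, the sample dump,
and the thread names in it are made up for illustration; they are not taken
from the actual incident.

```python
import re
from collections import Counter

# Matches jstack lines such as:
#   - waiting to lock <0x00000007f0a1b2c8> (a org.apache.solr.search.FastLRUCache)
WAIT_RE = re.compile(
    r"-\s+(?:waiting to lock|parking to wait for|waiting on)\s+<[^>]+>\s+\(a ([\w.$]+)\)"
)


def count_waiting_threads(dump_text):
    """Count threads per class of the object they are currently waiting on.

    Only the first (topmost) wait line of each thread is counted, so a
    thread contributes at most once.
    """
    counts = Counter()
    in_thread = False
    counted = False
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A quoted name at column 0 marks the start of a new thread.
            in_thread, counted = True, False
        elif in_thread and not counted:
            match = WAIT_RE.search(line)
            if match:
                counts[match.group(1)] += 1
                counted = True
    return counts


# Illustrative dump only (hypothetical thread names, addresses and frames).
SAMPLE_DUMP = '''
"qtp-example-1" prio=10 tid=0x01 nid=0x11 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    - waiting to lock <0x00000007f0a1b2c8> (a org.apache.solr.search.FastLRUCache)

"qtp-example-2" prio=10 tid=0x02 nid=0x12 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
    - waiting to lock <0x00000007f0a1b2c8> (a org.apache.solr.search.FastLRUCache)

"httpShardExecutor-example-3" prio=10 tid=0x03 nid=0x13 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    - parking to wait for  <0x00000007f0c3d4e0> (a java.util.concurrent.FutureTask)

"main" prio=10 tid=0x04 nid=0x14 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    for cls, n in count_waiting_threads(SAMPLE_DUMP).most_common():
        print(n, cls)
```

Running it against a real dump (read from a file instead of SAMPLE_DUMP)
gives a quick view of which monitor most threads are piled up on.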

Has anyone encountered such a situation? What can we do to prevent
misbehaving nodes from bringing down the entire cluster?

Cheers,
Avishai

Re: single node causing cluster-wide outage

Posted by Erick Erickson <er...@gmail.com>.
Right, after an OOM error, the state of the
machine may be wonky. So the obvious thing
is to try to get rid of the OOM error....

How many unique values do you have in the
field you're faceting on?

Not too helpful, but the best I can do.
Erick

On Thu, Mar 13, 2014 at 3:15 PM, Avishai Ish-Shalom
<av...@fewbytes.com> wrote:
> A little more information: it seems the issue happens after we get an
> OutOfMemory error on a facet query.

Re: single node causing cluster-wide outage

Posted by Avishai Ish-Shalom <av...@fewbytes.com>.
A little more information: it seems the issue happens after we get an
OutOfMemory error on a facet query.

