Posted to solr-user@lucene.apache.org by Timothy Potter <th...@gmail.com> on 2014/11/21 19:54:05 UTC

Dealing with bad apples in a SolrCloud cluster

Just soliciting some advice from the community ...

Let's say I have a 10-node SolrCloud cluster and a single collection with
2 shards and a replication factor of 10, so each shard has one replica on
each of my nodes.

Now imagine one of those nodes gets into a bad state and becomes slow at
serving queries (not bad enough to crash outright, though) ... I'm sure we
could ponder any number of ways a box might slow down without crashing.

From my calculations, about 2/10ths of the queries will now be affected,
since:

1/10 queries from client apps will hit the bad apple
  +
1/10 queries from other replicas will hit the bad apple (distrib=false)


If QPS is high enough and the bad apple is slow enough, things can start to
get out of control pretty fast, esp. since we've set max threads so high to
avoid distributed deadlock.

What have others done to mitigate this risk? Anything we can do in Solr to
help deal with this? It seems reasonable that nodes could identify a bad
apple by keeping track of query times and looking for nodes that are
significantly outside (>=2 stddev) what the other nodes are doing. Then
maybe mark that node as down in ZooKeeper so clients and other nodes stop
trying to send requests to it; or maybe just a simple policy of not sending
requests to that node for a few minutes.
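
A minimal sketch of the kind of check I have in mind, assuming you already
have a way to collect a recent mean query time per node (the metric
gathering itself is not shown, and none of this is an existing Solr API):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Flags nodes whose recent mean query time is >= 2 standard deviations above
 *  the cluster-wide mean. Illustrative only; the per-node timings would have to
 *  come from your own metrics collection, and the action taken (mark down in
 *  ZooKeeper, or just skip the node for a few minutes) is up to the caller. */
public class BadAppleDetector {

  /** @param meanQueryTimeMs node name -> recent mean query time in milliseconds */
  public static List<String> findBadApples(Map<String, Double> meanQueryTimeMs) {
    int n = meanQueryTimeMs.size();
    if (n < 3) return Collections.emptyList(); // too few nodes to call anything an outlier

    double mean = meanQueryTimeMs.values().stream()
        .mapToDouble(Double::doubleValue).average().orElse(0.0);
    double variance = meanQueryTimeMs.values().stream()
        .mapToDouble(v -> (v - mean) * (v - mean)).sum() / n;
    double stddev = Math.sqrt(variance);

    List<String> bad = new ArrayList<>();
    for (Map.Entry<String, Double> e : meanQueryTimeMs.entrySet()) {
      if (e.getValue() >= mean + 2 * stddev) {  // the ">= 2 stddev" rule
        bad.add(e.getKey());
      }
    }
    return bad;
  }
}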

Re: Dealing with bad apples in a SolrCloud cluster

Posted by Mark Miller <ma...@gmail.com>.
bq.  esp. since we've set max threads so high to avoid distributed
deadlock.

We should fix this for 5.0: add a second thread pool that is used for
internal requests. We can make it optional if necessary (for simpler default
container support), but I think it's a fairly easy improvement.
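
A rough sketch of what that separation buys, using plain java.util.concurrent
rather than whatever the servlet container actually provides (the pool sizes
and the routing flag are placeholders): with one shared pool, top-level
requests can occupy every thread while each waits on sub-requests that have
nowhere to run; with two pools, internal shard requests always have threads
of their own.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: separate bounded pools so internal (node-to-node) shard
 *  requests never compete with external client requests for the same threads,
 *  which is what makes the distributed deadlock possible in the first place. */
public class RequestPools {
  private final ExecutorService externalPool = Executors.newFixedThreadPool(200); // client-facing
  private final ExecutorService internalPool = Executors.newFixedThreadPool(200); // shard fan-out

  /** isInternal would be derived from something like the isShard=true request param. */
  public void submit(Runnable handler, boolean isInternal) {
    (isInternal ? internalPool : externalPool).execute(handler);
  }
}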

- Mark


Re: Dealing with bad apples in a SolrCloud cluster

Posted by ralph tice <ra...@gmail.com>.
bq. We recently ran into one of those failure modes that only AWS can dream
up, where for an extended amount of time, two nodes in the same placement
group couldn't talk to one another, but they could both see ZooKeeper, so
nothing was marked as down.

I had something similar happen recently with one of my SolrCloud clusters
in AWS: a failure in split DNS resolution caused machines to resolve to
their public IPs instead of their private IPs, which in turn caused traffic
to be blocked by security group rules.

Back on topic:

I don't know if anyone has had a chance to look at Thoth yet (
https://github.com/trulia/thoth ), but one of its use cases is analyzing
query patterns in order to feed an ML model for intelligent routing.  I
could envision a design where this routing layer is a feature within
SolrCloud.  A coprocess (ThothML, for example) would train a model that you
would periodically publish to some shared storage for a routing-layer
component to load and reload.  With a high enough QPS load you should be
able to do reasonable anomaly detection in near real time and route around
bad nodes, and the model for classifying requests might become good enough
after a period of training to not require continual updates.
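
To make that concrete, here is a very rough sketch of the routing-layer half
only; the model format, the shared-storage path, and the idea that the
training coprocess publishes a list of suspect nodes are all made up for
illustration, not anything Thoth provides today:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: periodically reload a model artifact from shared storage and use it
 *  to drop suspect nodes from the candidate replica list. */
public class ModelBackedRouter {
  private volatile Set<String> suspectNodes = Collections.emptySet();

  public ModelBackedRouter(Path sharedModelDir) {
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
      try {
        // Imagine the training coprocess writes a newline-separated list of
        // nodes the model currently classifies as anomalous.
        suspectNodes = new HashSet<>(
            Files.readAllLines(sharedModelDir.resolve("suspect-nodes.txt")));
      } catch (Exception e) {
        // keep the previous model on failure
      }
    }, 0, 5, TimeUnit.MINUTES);
  }

  /** Prefer healthy replicas; fall back to everything if the model flags them all. */
  public List<String> routableReplicas(List<String> replicas) {
    List<String> healthy = new ArrayList<>();
    for (String r : replicas) {
      if (!suspectNodes.contains(r)) healthy.add(r);
    }
    return healthy.isEmpty() ? replicas : healthy;
  }
}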

What do you all think?


RE: Dealing with bad apples in a SolrCloud cluster

Posted by steve <sc...@hotmail.com>.
"Last Gasp" is the last message that Sun Storage controllers would send to each other when things whet sideways...
For what it's worth.


Re: Dealing with bad apples in a SolrCloud cluster

Posted by Michael Della Bitta <mi...@appinions.com>.
Good discussion topic.

I'm wondering if Solr doesn't need some sort of "shoot the other node in 
the head" functionality.

We recently ran into one of those failure modes that only AWS can dream up, 
where for an extended amount of time, two nodes in the same placement 
group couldn't talk to one another, but they could both see ZooKeeper, 
so nothing was marked as down.

I've written a basic monitoring script that periodically tries to access 
every node in the cluster from every other node, but I haven't gotten to 
the point of automating anything based on it. It does trigger now and 
again for brief moments of time.
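
For what it's worth, the core of that kind of check is tiny; a sketch of the
per-node probe, assuming you point it at some cheap admin endpoint each node
already serves (adjust the URL and timeouts to taste):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a cross-node reachability check: from this node, try every other
 *  node's base URL and report the ones that fail or time out. The endpoint and
 *  thresholds are placeholders for whatever your cluster actually exposes. */
public class ReachabilityCheck {
  public static List<String> unreachable(List<String> nodeBaseUrls) {
    List<String> bad = new ArrayList<>();
    for (String base : nodeBaseUrls) {
      try {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(base + "/solr/admin/info/system").openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(5000);
        if (conn.getResponseCode() != 200) bad.add(base);
      } catch (Exception e) {
        bad.add(base);  // connect/read failure counts as unreachable from here
      }
    }
    return bad;
  }
}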

It'd be nice if there were some way the cluster could reach a consensus 
that a particular node is a bad apple, and evict it from collections that 
have other active replicas. I'm not sure what the logic would be for 
letting it rejoin those collections after the situation passes, however.

Michael



Re: Dealing with bad apples in a SolrCloud cluster

Posted by "Ramkumar R. Aiyengar" <an...@gmail.com>.
As Eric mentions, his change to add a state where indexing happens but
querying doesn't would surely help in this case.

But these are still boolean send-vs-don't-send decisions. In general, it
would be nice to abstract the routing policy so that it is pluggable. You
could then do things like a "least pending" policy for choosing replicas:
instead of choosing a replica at random, you maintain a pending-response
count per replica and always send to the one with the fewest pending (or
pick randomly among a set of replicas if there is a tie).
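
A minimal sketch of what such a "least pending" policy could look like; the
interface and bookkeeping are made up for illustration, and this is not an
existing SolrCloud extension point:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Pick the replica with the fewest in-flight requests, breaking ties randomly.
 *  Callers must pair every choose() with a finished() once the response arrives.
 *  Assumes replicaUrls is non-empty. */
public class LeastPendingPolicy {
  private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();
  private final Random random = new Random();

  public String choose(List<String> replicaUrls) {
    int min = Integer.MAX_VALUE;
    List<String> best = new ArrayList<>();
    for (String url : replicaUrls) {
      int p = pending.computeIfAbsent(url, u -> new AtomicInteger()).get();
      if (p < min) { min = p; best.clear(); best.add(url); }
      else if (p == min) { best.add(url); }
    }
    String chosen = best.get(random.nextInt(best.size()));
    pending.get(chosen).incrementAndGet();
    return chosen;
  }

  public void finished(String replicaUrl) {
    pending.get(replicaUrl).decrementAndGet();
  }
}

Pairing choose() with finished() when the response (or timeout) comes back is
what keeps a slow node's pending count high, so it is naturally starved of new
requests for as long as it is struggling.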

Also, the chance that your distrib=false case gets hit is actually about 1/5
(or something like that, I have forgotten my probability theory), because you
have two shards and so you get two chances at hitting the bad apple. This was
one of the reasons we got SOLR-6730 in, to use replica and host affinity.
Under good enough load the overall load distribution is more or less the same
with this change, but the chances of hitting bad apples are lower.
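
For the record, the back-of-the-envelope version of that number, assuming the
replica for each shard's sub-request is picked uniformly at random from the
10 replicas:

P(\text{bad node serves at least one sub-request}) = 1 - \left(\tfrac{9}{10}\right)^2 = 0.19 \approx \tfrac{1}{5}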