Posted to dev@lucene.apache.org by Ben DeMott <be...@gmail.com> on 2017/03/13 23:11:32 UTC

How Zookeeper (and Puppet) brought down our Solr Cluster

So I wanted to throw this out there and get any feedback.

We had a persistent issue with our Solr clusters doing crazy things, from
running out of file descriptors, to having replication issues, to filling
up the /overseer/queue.  Just some of the log exceptions:

o.e.j.s.ServerConnector java.io.IOException: Too many open files

o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error
trying to proxy request for
url: http://10.50.64.4:8983/solr/efc-jobsearch-col/select

o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
ClusterState says we are the leader
(http://10.50.64.4:8983/solr/efc-jobsearch-col_shard1_replica2), but
locally we don't think so. Request came from null

o.a.s.c.Overseer Bad version writing to ZK using compare-and-set, will
force refresh cluster state: KeeperErrorCode = BadVersion for
/collections/efc-jobsearch-col/state.json

IndexFetcher File _5oz.nvd did not match. expected checksum is 3661731988
and actual is checksum 840593658. expected length is 271091 and actual
length is 271091

...

I'll get to the point quickly.  This was all caused by the Zookeeper
configuration on a particular node getting reset (by Puppet) for a period
of seconds, and the service being restarted automatically.  When this
happened, Solr's connection to Zookeeper would be reset, and Solr would
reconnect to that Zookeeper node, which now had a blank configuration and
was running in "STANDALONE" mode.  Any changes written to ZK over that
Solr connection would never be propagated to the rest of the ensemble.
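
To make the failure mode concrete, here is roughly what the two states
would look like (hostnames and paths are illustrative, not our actual
config):

    # healthy zoo.cfg - the node participates in a 3-node quorum
    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zoo1:2888:3888
    server.2=zoo2:2888:3888
    server.3=zoo3:2888:3888

    # reset zoo.cfg - no server.N entries, so on restart the node comes up
    # alone in standalone mode but still happily accepts client connections
    tickTime=2000
    dataDir=/var/lib/zookeeper
    clientPort=2181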

As a result the cversion of /live_nodes would be ahead of the other
servers by a version or two, but the zxids would all be in sync.  The
nodes would never re-synchronize; as far as Zookeeper is concerned
everything is synced up properly.  Also /live_nodes would be a
mismatched mess: empty or inconsistent depending on which ZK node Solr's
connections were pointed at, so client connections would return some,
wrong, or no "live nodes".
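
If you want to look for this kind of divergence yourself, something like
the following works (hostnames are placeholders); compare the output
across every ensemble member:

    # last committed zxid and current mode, per server
    echo srvr | nc zoo1 2181 | grep -E 'Zxid|Mode'
    echo srvr | nc zoo2 2181 | grep -E 'Zxid|Mode'

    # cversion (child version) of /live_nodes, per server
    zkCli.sh -server zoo1:2181 stat /live_nodes
    zkCli.sh -server zoo2:2181 stat /live_nodes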

Now, the Zookeeper documentation specifically tells you never to connect
to an inconsistent group of servers, as it will play havoc with
Zookeeper, and it did exactly that.

As of Zookeeper 3.5 there is an option to NEVER ALLOW IT TO RUN IN
STANDALONE mode, which we will be using once a stable 3.5 version is
released.
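
For reference, I believe the relevant 3.5 setting is the
standaloneEnabled flag in zoo.cfg (it came in with dynamic
reconfiguration; check the 3.5 docs before relying on it):

    # zoo.cfg (ZooKeeper 3.5+): never fall back to standalone mode
    standaloneEnabled=false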

It caused absolute havoc within our cluster.

So to summarize: if a Zookeeper ensemble host ever goes into "Standalone"
mode, even temporarily, Solr will be disconnected and may then reconnect
(depending on which ZK node it picks) to the standalone server, and its
updates will never be synchronized.  It also won't be able to coordinate
any of its Cloud operations.

So in the interest of being a good internet citizen I'm writing this up:
is there any desire for a patch that would provide a configuration or JVM
option to refuse to connect to ZK nodes running in standalone mode?
Obviously the built-in ZK server that comes with Solr runs in standalone
mode, so this could only be an optional setting in solr.in.sh.  But it
would prevent Solr from bringing the entire cluster down in the event a
single ZK server was temporarily misconfigured or lost its configuration
for some reason.
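
As a rough illustration of the kind of check I mean (purely a sketch
using the "srvr" four-letter command; the host list and port are
placeholders, and this isn't anything Solr ships today):

    # refuse to start Solr if any configured ZK host reports standalone mode
    for host in zoo1 zoo2 zoo3; do
      if echo srvr | nc "$host" 2181 | grep -q 'Mode: standalone'; then
        echo "ERROR: $host is a standalone ZooKeeper, refusing to start" >&2
        exit 1
      fi
    done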

Maybe this isn't worth addressing.  Thoughts?

Re: How Zookeeper (and Puppet) brought down our Solr Cluster

Posted by Ben DeMott <be...@gmail.com>.
Hi Jan,

I created a Jira issue, and proposed a possible solution there.

Please feel free to comment if you have your own ideas.

Thanks for the response.

https://issues.apache.org/jira/browse/SOLR-10284

On Mon, Mar 13, 2017 at 5:52 PM, Jan Høydahl <ja...@cominvent.com> wrote:

> Hi
>
> Thanks for reporting.
> As it may take some time before we get ZK 3.5.x out there, it would be
> nice to have a fix.
> Do you plan to make our zkClient somehow explicitly validate that all
> given zk nodes are “good”?
>
> Or is there some way we could fix this with documentation?
> I imagine, if we always propose to use a chroot, e.g.
> ZK_HOST=zoo1,zoo2,zoo3/solr then it would be a requirement to do a
> mkroot before being able to use ZK. And I assume that in that case if
> one of the ZK nodes got restarted without or with wrong configuration, it
> would start up with some other data folder(?) and refuse to serve any
> data whatsoever since the /solr root would not exist?
>
> I’d say, even if this is not a Solr bug per se, it is still worthy of a
> JIRA issue.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>

Re: How Zookeeper (and Puppet) brought down our Solr Cluster

Posted by Jan Høydahl <ja...@cominvent.com>.
Hi

Thanks for reporting.
As it may take some time before we get ZK 3.5.x out there, it would be nice to have a fix.
Do you plan to make our zkClient somehow explicitly validate that all given zk nodes are “good”?

Or is there some way we could fix this with documentation?
I imagine, if we always propose to use a chroot, e.g. ZK_HOST=zoo1,zoo2,zoo3/solr then it would be a requirement to do a mkroot before being able to use ZK. And I assume that in that case if one of the ZK nodes got restarted without or with wrong configuration, it would start up with some other data folder(?) and refuse to serve any data whatsoever since the /solr root would not exist?
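
Something like this is what I have in mind (hostnames are just examples;
check the Ref Guide for the exact mkroot syntax in your Solr version):

    # solr.in.sh - point every Solr node at the ensemble with a /solr chroot
    ZK_HOST="zoo1:2181,zoo2:2181,zoo3:2181/solr"

    # create the chroot once, before starting Solr against it
    bin/solr zk mkroot /solr -z zoo1:2181,zoo2:2181,zoo3:2181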

I’d say, even if this is not a Solr bug per se, it is still worthy of a JIRA issue.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
