Posted to solr-user@lucene.apache.org by Garth Grimm <Ga...@averyranchconsulting.com> on 2013/11/19 19:58:40 UTC

Zookeeper down question

Consider a 4-node Solr cluster (i.e. 2 shards, 2 replicas per shard) with a standalone ZooKeeper.

Correct me if any of my understanding of the following is incorrect.
If ZK goes down, most normal operations will still function, since my understanding is that ZK isn't involved on a transaction-by-transaction basis in any of these:
Document adds, updates, and deletes on an existing collection will still work as expected.
Queries will still get processed as expected.
Is the above correct?

But adding new collections, changing configs, etc., will all fail while ZK is down (or at least leave things in an inconsistent state)?
Is that correct?

If, while ZK is down, one of the 4 Solr nodes also goes down, will all normal operations fail? Will they all continue to succeed? I.e., will each of the nodes realize which node is down and route indexing and query requests around it, or is that impossible while ZK is down? Will some queries succeed (because they happened to be routed to the surviving replica of the affected shard) while others fail (because they were routed to that shard's replica that is down)?

Thanks,
Garth Grimm



Re: Zookeeper down question

Posted by Otis Gospodnetic <ot...@gmail.com>.
Garth,

Here is something else related to help push the upgrade further:

http://search-lucene.com/m/gUajqxuETB1/&subj=Re+SolrCloud+and+split+brain

Monitor your beast keeper: http://search-lucene.com/m/R9vEg2JmiR91

Otis


On Tue, Nov 19, 2013 at 5:56 PM, Garth Grimm <
GarthGrimm@averyranchconsulting.com> wrote:

> Thanks Mark and Tim.  My understanding has been upgraded.

RE: Zookeeper down question

Posted by Garth Grimm <Ga...@averyranchconsulting.com>.
Thanks Mark and Tim.  My understanding has been upgraded.

-----Original Message-----
From: Mark Miller [mailto:markrmiller@gmail.com] 
Sent: Tuesday, November 19, 2013 1:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Zookeeper down question


Re: Zookeeper down question

Posted by Mark Miller <ma...@gmail.com>.
On Nov 19, 2013, at 2:24 PM, Timothy Potter <th...@gmail.com> wrote:

> Good questions. From my understanding, queries will work if ZK goes down,
> but writes will not work without ZooKeeper. Reads keep working because the
> cluster state is cached on each node, so ZooKeeper doesn't participate
> directly in query and indexing requests. Solr deliberately stops accepting
> writes if it loses its connection to ZooKeeper, as a safeguard. In other
> words, Solr assumes it's reasonably safe to allow reads when the cluster
> doesn't have a healthy coordinator, but refuses writes to stay safe.

Right: we currently stop accepting writes when Solr cannot talk to ZooKeeper. This is because we can no longer count on learning about changes to the cluster, no new leaders can be elected, etc. Things get tricky fast if you consider allowing updates without ZooKeeper connectivity for any length of time.
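To make the leader-election point concrete, here is a toy Python model of the usual ZooKeeper leader-election recipe: each candidate registers an ephemeral sequential entry, the lowest sequence number leads, and when the leader's session dies its entry disappears and the next candidate takes over. ToyElectionZk and its methods are hypothetical stand-ins, not Solr's or ZooKeeper's API, and this is not a claim about Solr's exact implementation. The sketch only shows that once the coordinator is unreachable, no node can even ask who the leader is, let alone elect a replacement.

```python
import itertools
from typing import Dict, Optional


class ToyElectionZk:
    """Hypothetical in-memory stand-in for ZooKeeper's election recipe.

    Real ZooKeeper would hold ephemeral sequential znodes that vanish when
    the owning session dies; here a dict of sequence number -> replica name
    plays that role.
    """

    def __init__(self) -> None:
        self.available = True
        self._seq = itertools.count()
        self._candidates: Dict[int, str] = {}

    def _require_zk(self) -> None:
        # Every step of the election goes through the coordinator.
        if not self.available:
            raise ConnectionError("ZooKeeper unreachable: cannot observe or elect leaders")

    def join(self, replica: str) -> None:
        # A replica registers as a candidate and gets the next sequence number.
        self._require_zk()
        self._candidates[next(self._seq)] = replica

    def session_expired(self, replica: str) -> None:
        # ZooKeeper notices the dead session and drops its ephemeral entry.
        self._require_zk()
        self._candidates = {s: r for s, r in self._candidates.items() if r != replica}

    def leader(self) -> Optional[str]:
        # The lowest surviving sequence number is the leader.
        self._require_zk()
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


zk = ToyElectionZk()
zk.join("shard1_replicaA")
zk.join("shard1_replicaB")
print(zk.leader())  # shard1_replicaA
zk.session_expired("shard1_replicaA")
print(zk.leader())  # shard1_replicaB (re-election happened through ZooKeeper)

zk.available = False
try:
    zk.leader()  # with ZooKeeper gone, nobody can learn who leads
except ConnectionError as err:
    print(err)
```

The design choice being illustrated is that the coordinator, not the replicas, is the source of truth for leadership; without it, replicas would have to guess, which is exactly the situation Solr avoids by refusing writes.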

> 
> If a Solr node goes down while ZK is unavailable, the leader/replica
> distinction doesn't really matter, since Solr no longer accepts writes
> anyway. I'd venture to guess there is some failover logic built in when
> executing distributed queries, but I'm not as familiar with that part of
> the code (I'll brush up on it, though, as I'm now curious as well).

Right: query requests will fail over to other replicas. This is important in general because a Solr instance's copy of the cluster state can be a bit stale, so a request might hit a replica that has gone down, and then another replica in the same shard can be tried. We use the load-balancing SolrJ client for these internal requests. CloudSolrServer handles failover for user (non-internal) requests, or you can use your own external load balancer.

- Mark



Re: Zookeeper down question

Posted by Timothy Potter <th...@gmail.com>.
Good questions. From my understanding, queries will work if ZK goes down,
but writes will not work without ZooKeeper. Reads keep working because the
cluster state is cached on each node, so ZooKeeper doesn't participate
directly in query and indexing requests. Solr deliberately stops accepting
writes if it loses its connection to ZooKeeper, as a safeguard. In other
words, Solr assumes it's reasonably safe to allow reads when the cluster
doesn't have a healthy coordinator, but refuses writes to stay safe.
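A toy Python model of the behavior described above, assuming a node keeps a local snapshot of the cluster state (shard name to replica list) that was last refreshed from ZooKeeper. SolrNodeModel, zk_connected and ZkUnavailableError are hypothetical names, not Solr classes: reads are routed from the cached snapshot alone, while writes are refused once the ZooKeeper connection is lost.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class ZkUnavailableError(Exception):
    """Hypothetical error for a write refused while ZooKeeper is unreachable."""


@dataclass
class SolrNodeModel:
    """Toy model of one SolrCloud node; not Solr's actual implementation."""

    # Local snapshot of the cluster state, last refreshed from ZooKeeper:
    # shard name -> replica node names.
    cluster_state: Dict[str, List[str]]
    zk_connected: bool = True
    docs: List[dict] = field(default_factory=list)

    def route_query(self) -> Dict[str, str]:
        # Reads need only the cached snapshot: pick one replica per shard.
        return {shard: replicas[0] for shard, replicas in self.cluster_state.items()}

    def add_document(self, doc: dict) -> None:
        # Safeguard: refuse writes once the ZooKeeper connection is lost.
        if not self.zk_connected:
            raise ZkUnavailableError("no ZooKeeper connection")
        self.docs.append(doc)


node = SolrNodeModel({"shard1": ["nodeA", "nodeB"], "shard2": ["nodeC", "nodeD"]})
node.add_document({"id": "1"})  # accepted while ZooKeeper is reachable
node.zk_connected = False       # ZooKeeper goes down
print(node.route_query())       # {'shard1': 'nodeA', 'shard2': 'nodeC'}
try:
    node.add_document({"id": "2"})
except ZkUnavailableError as err:
    print("refused:", err)      # refused: no ZooKeeper connection
```

The asymmetry is the point: routing a read needs nothing beyond the cached map, whereas a write that goes unchecked could diverge from a cluster state the node can no longer observe.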

If a Solr node goes down while ZK is unavailable, the leader/replica
distinction doesn't really matter, since Solr no longer accepts writes
anyway. I'd venture to guess there is some failover logic built in when
executing distributed queries, but I'm not as familiar with that part of
the code (I'll brush up on it, though, as I'm now curious as well).

Cheers,
Tim

