Posted to solr-user@lucene.apache.org by Dennis Haller <dh...@talenttech.com> on 2013/05/03 21:21:39 UTC

disaster recovery scenarios for solr cloud and zookeeper

Hi,

Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is
expected to have very high (perfect?) availability. With 3 or 5 Zookeeper
nodes, it is possible to handle Zookeeper maintenance while keeping online
availability close to 100%. But what is the worst case for Solr if, for
some unanticipated reason, all Zookeeper nodes go offline?

Could someone comment on a couple of possible scenarios in which all ZK
nodes are offline? What would happen to Solr, and what would be needed to
recover in each case?
1) a brief interruption, say < 2 minutes
2) longer downtime, say 60 minutes

Thanks
Dennis

Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Jack Krupansky <ja...@basetechnology.com>.
From the wiki: "SolrCloud can continue to serve results without interruption
as long as at least one server hosts every shard. You can demonstrate this 
by judiciously shutting down various instances and looking for results. If 
you have killed all of the servers for a particular shard, requests to other 
servers will result in a 503 error. To return just the documents that are 
available in the shards that are still alive (and avoid the error), add the 
following query parameter: shards.tolerant=true"

That doesn't completely answer your question, but is an important part of 
the puzzle.

-- Jack Krupansky
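The shards.tolerant=true parameter from the wiki quote is just an extra request parameter on an ordinary select request. A minimal Python sketch that only builds such a URL (it does not contact Solr); the host, port, and the collection name "collection1" are placeholder assumptions, and tolerant_select_url is a hypothetical helper name:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tolerant_select_url(base_url: str, collection: str, query: str) -> str:
    """Build a select URL that asks for results from the shards that are still
    alive instead of failing the whole request when a shard has no live server."""
    params = {
        "q": query,
        "shards.tolerant": "true",  # parameter name taken from the wiki quote
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Placeholder host and collection; substitute your own.
url = tolerant_select_url("http://localhost:8983/solr", "collection1", "*:*")
print(url)
print(parse_qs(urlsplit(url).query)["shards.tolerant"])  # ['true']
```

Because shards with no live server are skipped rather than reported as an error, a response to such a request may cover only part of the index.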



Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Mark Miller <ma...@gmail.com>.
ClusterState is kept in memory, and ZooKeeper notifies Solr when the ClusterState changes; Solr then grabs the latest ClusterState. If ZooKeeper goes down, Solr keeps using the in-memory ClusterState it already has and simply stops getting any new ClusterState updates until ZooKeeper comes back.

- Mark
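The behavior described above can be illustrated with a toy model. This is hypothetical Python, not Solr's implementation: ToyClusterState, ToySolrNode, and the node names are invented for illustration. The node routes every query from its cached copy; a cluster change that happens while ZooKeeper is unreachable is never delivered, so the cached copy goes stale until the node reconnects and re-reads the state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToyClusterState:
    version: int
    shards: Dict[str, List[str]]  # shard name -> nodes hosting a replica


class ToySolrNode:
    """Hypothetical model of a node that routes from a cached cluster state."""

    def __init__(self, state: ToyClusterState) -> None:
        self.cached = state  # in-memory copy used for all routing
        self.zk_connected = True

    def on_cluster_change(self, latest: ToyClusterState) -> None:
        # Change notifications only arrive while the ZooKeeper session is up.
        if self.zk_connected:
            self.cached = latest

    def on_zk_reconnect(self, latest: ToyClusterState) -> None:
        # Once ZooKeeper is reachable again, the node re-reads the current state.
        self.zk_connected = True
        self.cached = latest

    def replicas_for(self, shard: str) -> List[str]:
        # Routing never waits on ZooKeeper; it only reads the cached copy.
        return self.cached.shards.get(shard, [])


v1 = ToyClusterState(1, {"shard1": ["nodeA", "nodeC"], "shard2": ["nodeB"]})
v2 = ToyClusterState(2, {"shard1": ["nodeA"], "shard2": ["nodeB"]})  # nodeC gone

node = ToySolrNode(v1)
node.zk_connected = False           # every ZooKeeper server is unreachable
node.on_cluster_change(v2)          # nodeC goes away; this node never hears of it
print(node.replicas_for("shard1"))  # ['nodeA', 'nodeC'] (stale)
node.on_zk_reconnect(v2)
print(node.replicas_for("shard1"))  # ['nodeA']
```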

On May 6, 2013, at 2:59 AM, Furkan KAMACI <fu...@gmail.com> wrote:

> Hi Mark;
> 
> You said: "So it's pretty simple - when you lose the ability to talk to ZK,
> everything keeps working based on the most recent clusterstate - except
> that updates are blocked and you cannot add new nodes to the cluster."
> Where do the nodes keep the cluster state? When a query comes to a node that
> hosts a replica of another shard, how will the query return accurate results?


Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Erick Erickson <er...@gmail.com>.
If I understand correctly, each of the nodes has a copy of the state
as of the last time there was a ZK quorum and operates off that,
so the cluster can keep chugging along with updates disabled.

Of course, if the state of your cluster changes (i.e. nodes come or
go), ZK is no longer available to tell everyone about the change, etc.

Best
Erick

On Mon, May 6, 2013 at 2:59 AM, Furkan KAMACI <fu...@gmail.com> wrote:
> Hi Mark;
>
> You said: "So it's pretty simple - when you lose the ability to talk to ZK,
> everything keeps working based on the most recent clusterstate - except
> that updates are blocked and you cannot add new nodes to the cluster."
> Where do the nodes keep the cluster state? When a query comes to a node that
> hosts a replica of another shard, how will the query return accurate results?

Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Mark;

You said: "So it's pretty simple - when you lose the ability to talk to ZK,
everything keeps working based on the most recent clusterstate - except
that updates are blocked and you cannot add new nodes to the cluster."
Where do the nodes keep the cluster state? When a query comes to a node that
hosts a replica of another shard, how will the query return accurate results?

2013/5/5 Jack Krupansky <ja...@basetechnology.com>

> Is soul retrieval possible when ZooKeeper is down?
>
> -- Jack Krupansky

Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Jack Krupansky <ja...@basetechnology.com>.
Is soul retrieval possible when ZooKeeper is down?

-- Jack Krupansky

-----Original Message----- 
From: Mark Miller
Sent: Sunday, May 05, 2013 2:19 PM
To: solr-user@lucene.apache.org
Subject: Re: disaster recovery scenarios for solr cloud and zookeeper

When Solr loses its connection to ZooKeeper, updates will start being
rejected. Read requests will continue as normal. This is regardless of how
long ZooKeeper is down.

So it's pretty simple - when you lose the ability to talk to ZK, everything
keeps working based on the most recent clusterstate - except that updates
are blocked and you cannot add new nodes to the cluster. You are essentially
in steady state.

The ZK clients will continue trying to reconnect, so that when ZK comes back,
updates will start being accepted again and new nodes may join the cluster.

- Mark



Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Mark Miller <ma...@gmail.com>.
When Solr loses its connection to ZooKeeper, updates will start being rejected. Read requests will continue as normal. This is regardless of how long ZooKeeper is down.

So it's pretty simple - when you lose the ability to talk to ZK, everything keeps working based on the most recent clusterstate - except that updates are blocked and you cannot add new nodes to the cluster. You are essentially in steady state.

The ZK clients will continue trying to reconnect, so that when ZK comes back, updates will start being accepted again and new nodes may join the cluster.

- Mark
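The summary above (reads keep working, updates and new nodes are refused, the client keeps trying to reconnect) can be written down as a small decision table plus a retry loop. This is a hypothetical Python sketch: Kind, handle_request, reconnect, and the zk1/zk2/zk3 addresses are invented for illustration, and the round-robin retry order is an assumption about how a client might cycle through the ensemble, not a description of the ZooKeeper client's internals.

```python
from enum import Enum
from itertools import cycle, islice
from typing import Callable, List, Optional


class Kind(Enum):
    QUERY = "query"
    UPDATE = "update"
    ADD_NODE = "add_node"


def handle_request(kind: Kind, zk_connected: bool) -> str:
    """Hypothetical decision table for one SolrCloud node."""
    if kind is Kind.QUERY:
        # Reads are routed from the cached cluster state, so they keep working.
        return "served"
    # Updates and new nodes joining both need a live ZooKeeper session.
    return "served" if zk_connected else "rejected"


def reconnect(servers: List[str], is_up: Callable[[str], bool],
              max_attempts: int) -> Optional[str]:
    """Cycle through the ensemble until some server answers.

    max_attempts only bounds this demo; per the thread, clients keep trying.
    """
    for server in islice(cycle(servers), max_attempts):
        if is_up(server):
            return server
    return None


ensemble = ["zk1:2181", "zk2:2181", "zk3:2181"]  # placeholder addresses
for kind in Kind:
    print(kind.value, handle_request(kind, zk_connected=False))
print(reconnect(ensemble, lambda s: s == "zk3:2181", max_attempts=10))  # zk3:2181
```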



Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Gopal Patwa <go...@gmail.com>.
Agree with Anshum. Netflix has a very nice supervisor system for ZooKeeper,
Exhibitor: if ZooKeeper servers go down, it will restart them automatically.

http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html
https://github.com/Netflix/exhibitor
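A supervisor of this kind needs some signal that a ZooKeeper server is unhealthy. One simple signal, shown here as a hypothetical Python sketch rather than anything Exhibitor is known to use, is ZooKeeper's "ruok" four-letter command, to which a healthy server replies "imok". The function names are invented, the port 2181 default is an assumption, and newer ZooKeeper releases may require the command to be enabled through 4lw.commands.whitelist.

```python
import socket
from typing import Callable, List, Tuple

Server = Tuple[str, int]


def zk_is_ok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
    """True if the server at host:port answers ZooKeeper's 'ruok' with 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            while len(reply) < 4:  # 'imok' is four bytes
                data = sock.recv(4 - len(reply))
                if not data:  # server closed the connection
                    break
                reply += data
    except OSError:  # refused, unreachable, name lookup failure, or timeout
        return False
    return reply == b"imok"


def servers_to_restart(ensemble: List[Server],
                       check: Callable[[str, int], bool] = zk_is_ok) -> List[Server]:
    """Hypothetical supervisor step: the servers that failed the health check."""
    return [(host, port) for host, port in ensemble if not check(host, port)]
```

A supervisor loop could call servers_to_restart([("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]) periodically and restart whatever it returns; how the restart itself is done depends on the deployment and is left out here.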




On Fri, May 3, 2013 at 6:53 PM, Anshum Gupta <an...@anshumgupta.net> wrote:

> If all your ZK nodes go down, querying will continue to work
> fine (as long as no Solr nodes fail), but you won't be able to add docs.

Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Anshum Gupta <an...@anshumgupta.net>.
If all your ZK nodes go down, querying will continue to work fine (as long as no Solr nodes fail), but you won't be able to add docs.


On 03-May-2013, at 17:52, Shawn Heisey <so...@elyograg.org> wrote:

> I have read a few things saying that things go read-only when the ZooKeeper
> ensemble loses quorum.  I'm not sure whether that means that Solr goes
> read-only or ZooKeeper goes read-only.  I would be interested in knowing
> exactly what happens when ZooKeeper loses quorum, as well as what happens
> if all three (or more) ZooKeeper nodes in the ensemble go away entirely.
> 
> I have a SolrCloud I can experiment with, but I need to find a
> maintenance window for testing, so I can't check right now.
> 
> Thanks,
> Shawn
> 

Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Shawn Heisey <so...@elyograg.org>.
On 5/3/2013 6:07 PM, Walter Underwood wrote:
> Ideally, the Solr nodes should be able to continue as long as no node fails. Failure of a leader would be bad; failure of non-leader replicas might cause some timeouts but could be survivable.
> 
> Of course, nodes could not be added.

I have read a few things saying that things go read-only when the ZooKeeper
ensemble loses quorum.  I'm not sure whether that means that Solr goes
read-only or ZooKeeper goes read-only.  I would be interested in knowing
exactly what happens when ZooKeeper loses quorum, as well as what happens
if all three (or more) ZooKeeper nodes in the ensemble go away entirely.

I have a SolrCloud I can experiment with, but I need to find a
maintenance window for testing, so I can't check right now.

Thanks,
Shawn
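On the quorum question: a ZooKeeper ensemble of n servers needs a strict majority, n // 2 + 1 servers, to keep operating, so it tolerates n - (n // 2 + 1) failed servers. The Python below just tabulates that arithmetic (the function name quorum is made up for illustration). It also shows why the 3- or 5-node ensembles mentioned in the original question are the usual sizes: a 4th server adds no fault tolerance over 3.

```python
def quorum(n: int) -> int:
    """Smallest strict majority of an n-server ensemble."""
    if n < 1:
        raise ValueError("an ensemble needs at least one server")
    return n // 2 + 1


for n in range(1, 8):
    q = quorum(n)
    print(f"{n} servers: quorum {q}, survives {n - q} failure(s)")
```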


Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Walter Underwood <wu...@wunderwood.org>.
Ideally, the Solr nodes should be able to continue as long as no node fails. Failure of a leader would be bad; failure of non-leader replicas might cause some timeouts but could be survivable.

Of course, nodes could not be added.

wunder

On May 3, 2013, at 5:05 PM, Otis Gospodnetic wrote:

> I *think* at this point SolrCloud without ZooKeeper is like a .....
> body without a head?
> 
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/

--
Walter Underwood
wunder@wunderwood.org




Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Otis Gospodnetic <ot...@gmail.com>.
I *think* at this point SolrCloud without ZooKeeper is like a .....
body without a head?

Otis
--
Solr & ElasticSearch Support
http://sematext.com/






Re: disaster recovery scenarios for solr cloud and zookeeper

Posted by Jason Hellman <jh...@innoventsolutions.com>.
I have to imagine I'm quibbling with the original assertion that "Solr 4.x is architected with a dependency on Zookeeper" when I say the following:

Solr 4.x is not architected with a dependency on Zookeeper.  SolrCloud, however, is.  As such, if your concern is more about Zookeeper's availability than (necessarily) about Solr's resiliency, you can clearly opt to use Solr 4.x without Zookeeper.

I further imagine that isn't really the point of the original message.  Unfortunately for me, I'm somehow obsessing over saying it :)
