Posted to solr-user@lucene.apache.org by pramodEbay <pr...@ebay.com> on 2016/06/28 06:24:22 UTC

Help with recovering shard range after zookeeper disaster

We recently experienced a case where a ZooKeeper snapshot became corrupt and
ZooKeeper would not restart.
zkCli.sh (ZooKeeper's command-line client) would fail with an error: unable to connect to /
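
For reference, a quick way to check whether a ZooKeeper server is responding at all (hostname and port below are placeholders):

  echo ruok | nc zk1.example.com 2181    # a healthy server answers "imok"
  zkCli.sh -server zk1.example.com:2181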

We have a SolrCloud cluster with two shards (keys are auto-sharded), running
Solr version 4.10.1.

Unfortunately, we did not have a good snapshot to recover from. We are
planning to create a brand new ZooKeeper ensemble and have the Solr nodes
reconnect. We do not have a good clusterstate.json to upload to ZooKeeper.

Our current state is: all Solr nodes are operating in read-only mode. No
updates are possible.

This is what we are planning to do now (a rough sketch of the commands follows the list):
1. Delete the snapshots and transaction logs from the ZooKeeper nodes
2. Create a brand new data folder
3. Upload the Solr configurations into ZooKeeper
4. With the Solr nodes running, have them reconnect to ZooKeeper.
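
Concretely, something like this (paths and hostnames are placeholders, and the zkcli.sh here is the one that ships with Solr, not ZooKeeper's zkCli.sh):

  # on each ZooKeeper node, with the ensemble stopped:
  rm -rf /var/zookeeper/data/version-2     # snapshots + transaction logs

  # after restarting the (now empty) ensemble, from a Solr node:
  zkcli.sh -zkhost zk1.example.com:2181 \
      -cmd upconfig -confdir /path/to/solr/conf -confname myconf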

What I am not clear on is: as each Solr node attempts to reconnect, will it
identify which shard it originally belonged to? Will the clusterstate.json
get recreated? I don't know the hash ranges, since there is no
clusterstate.json. Or do I need to manually create a clusterstate.json
and upload it to ZooKeeper?
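
For reference, a hand-built 4.x clusterstate.json would have to look roughly like this (every name and URL below is a made-up placeholder, and the ranges shown are the defaults Solr assigns a fresh two-shard collection, which may not match an existing cluster):

  {"collection1":{
      "shards":{
        "shard1":{
          "range":"80000000-ffffffff",
          "state":"active",
          "replicas":{
            "core_node1":{
              "core":"collection1_shard1_replica1",
              "base_url":"http://solr1.example.com:8983/solr",
              "node_name":"solr1.example.com:8983_solr",
              "state":"active",
              "leader":"true"}}},
        "shard2":{
          "range":"0-7fffffff",
          "state":"active",
          "replicas":{
            "core_node2":{
              "core":"collection1_shard2_replica1",
              "base_url":"http://solr2.example.com:8983/solr",
              "node_name":"solr2.example.com:8983_solr",
              "state":"active",
              "leader":"true"}}}},
      "router":{"name":"compositeId"},
      "replicationFactor":"1",
      "maxShardsPerNode":"1"}}

  # uploaded with Solr's zkcli.sh (hostname again a placeholder):
  zkcli.sh -zkhost zk1.example.com:2181 -cmd putfile /clusterstate.json ./clusterstate.json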

What is our best recourse now? Any help with disaster recovery is much
appreciated.

Thanks,
Pramod


Re: Help with recovering shard range after zookeeper disaster

Posted by pramodEbay <pr...@ebay.com>.
Thanks for the reply. Since we don't have a working snapshot, we are
creating brand new ZooKeeper nodes, re-uploading the Solr configurations, and
manually creating a clusterstate.json. Fortunately, by running a combination of
grep and awk over the corrupt snapshot, we figured out what the shard ranges were
for each of the shards. Hopefully we will be back up again soon.
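
For anyone else in the same spot: the serialized clusterstate.json is stored as text inside the binary snapshot, so something along these lines pulls the ranges out (the snapshot filename is a placeholder):

  strings /var/zookeeper/data/version-2/snapshot.9600000a1 \
      | grep -o '"range":"[^"]*"' | sort -u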

Lesson learned: always keep a backup of a working ZooKeeper snapshot.




Re: Help with recovering shard range after zookeeper disaster

Posted by Jeff Wartes <jw...@whitepages.com>.
This might come a little late to be helpful, but I had a similar situation with Solr 5.4 once.

We ended up finding a ZK snapshot we could restore, but we did also get the cluster back up for most of the interim by taking the now-empty ZK cluster, re-uploading the configs the collections used, and then restarting the nodes one at a time. I did find that the cluster state rewrote itself, including my shard ranges (although I didn't have anything special in there: no shard-splitting history, etc.).
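
If you want to double-check what got rewritten, you can pull the state back out with the Collections API (hostname is a placeholder; CLUSTERSTATUS has been available since Solr 4.8):

  curl 'http://solr1.example.com:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'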

There were a few other side effects: aliases needed to be recreated, there were issues around leader election, and there was an odd increase in latency until we got the backup restored.
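
Recreating an alias is just another Collections API call, roughly like this (alias and collection names are placeholders):

  curl 'http://solr1.example.com:8983/solr/admin/collections?action=CREATEALIAS&name=search&collections=collection1'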

Good luck.

