Posted to solr-user@lucene.apache.org by Bill Au <bi...@gmail.com> on 2012/11/08 21:10:49 UTC

best practice for restarting the entire SolrCloud cluster

I have a simple SolrCloud cluster with 4 Solr instances and 1 shard.  I can
start and stop individual Solr instances without any problem, but not when
I have to shut down all the Solr instances at the same time.

After shutting down all the Solr instances, the first instance that starts
up waits for all the replicas:

INFO: Waiting until we see more replicas up: total=4 found=3 timeoutin=169243
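
As a rough illustration of how to read that log line, the sketch below pulls out the three counters with a regular expression. It assumes that timeoutin is a countdown in milliseconds, which is consistent with the 3-minute leaderVoteWait mentioned later in this message but is not stated in the log itself. The function name parse_wait_line is made up for this example.

```python
import re

# Matches the counters in the "Waiting until we see more replicas up" line.
# \s+ also tolerates the line being wrapped by a mail client.
WAIT_PATTERN = re.compile(r"total=(\d+)\s+found=(\d+)\s+timeoutin=(\d+)")


def parse_wait_line(line):
    """Return the counters from a wait line, or None if it does not match.

    Assumption: timeoutin is in milliseconds.
    """
    match = WAIT_PATTERN.search(line)
    if match is None:
        return None
    total, found, timeout_ms = (int(g) for g in match.groups())
    return {
        "total": total,                  # replicas the shard expects
        "found": found,                  # replicas seen so far
        "missing": total - found,        # replicas still being waited for
        "timeout_s": timeout_ms / 1000.0,  # time left before giving up
    }


if __name__ == "__main__":
    sample = ("INFO: Waiting until we see more replicas up: "
              "total=4 found=3 timeoutin=169243")
    print(parse_wait_line(sample))
```

Under that assumption, the line says the instance is still missing one replica and will keep waiting for roughly 169 more seconds.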

In the meantime, any additional Solr instances that start up while the
first one is waiting can't get the leader from ZooKeeper:

SEVERE: Error getting leader from zk
org.apache.solr.common.SolrException: Could not get leader props

When the first Solr instance sees all the replicas, it becomes the leader:

INFO: Enough replicas found to continue.
INFO: I may be the new leader - try and sync

But it fails to sync with the instances that had failed to get the leader
before:

WARNING: PeerSync: core=collection1 url=http://host2:8983/solr  exception talking to http://host2:8983/solr/collection1/, failed
org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://host2:8983/solr/collection1

So I ended up with one or more replicas down after the restart.  I had to
figure out which replicas were down and restart them.

What I also discovered is that if I start the first Solr instance and wait
until it returns after the leaderVoteWait of 3 minutes, the rest of the
Solr instances can be started without any problem, since by then they can get
the leader from ZooKeeper.
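
The workaround described above could be scripted roughly as follows. This is only a sketch of the ordering (start one instance, wait for a leader, then start the rest), not a Solr API: start_node and leader_elected are hypothetical hooks that would have to be supplied for a given environment, for example an ssh command and a check against the cluster state in ZooKeeper. The default timeout of 200 seconds is an assumption chosen to be slightly longer than the 3-minute leaderVoteWait.

```python
import time
from typing import Callable, Sequence


def staggered_start(
    hosts: Sequence[str],
    start_node: Callable[[str], None],    # hypothetical: starts Solr on a host
    leader_elected: Callable[[], bool],   # hypothetical: True once a leader exists
    timeout_s: float = 200.0,             # assumption: a bit over leaderVoteWait
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Start hosts[0], wait until a leader is elected, then start the others."""
    if not hosts:
        return
    first, rest = hosts[0], hosts[1:]

    # Bring up a single instance so it can become leader on its own.
    start_node(first)

    # Poll until the leader is visible or the deadline passes.
    deadline = clock() + timeout_s
    while not leader_elected():
        if clock() >= deadline:
            raise TimeoutError(f"no leader elected after starting {first}")
        sleep(poll_s)

    # Only now start the remaining instances, so they can find the leader.
    for host in rest:
        start_node(host)
```

Passing sleep and clock as parameters is only there so the ordering can be exercised without real delays; in normal use the defaults are enough.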

Is there a better way to restart an entire SolrCloud cluster?

Bill

Re: best practice for restarting the entire SolrCloud cluster

Posted by Bill Au <bi...@gmail.com>.
My replicas are actually on different machines, so they do come up.  The
problem I found is that since they can't get the leader, they just come up
but are not part of the cluster.  I can still do a local search with
distrib=false.  They do not retry to get the leader, so I have to restart
them after the leader has started in order to get them back into the
cluster.

Bill


On Thu, Nov 8, 2012 at 4:02 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Hi - i think you're seeing:
> https://issues.apache.org/jira/browse/SOLR-3993

RE: best practice for restarting the entire SolrCloud cluster

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - I think you're seeing:
https://issues.apache.org/jira/browse/SOLR-3993