Posted to solr-user@lucene.apache.org by Joseph Lorenzini <ja...@gmail.com> on 2020/02/04 13:38:49 UTC

Handling All Replicas Down in Solr 8.3 Cloud Collection

Hi all,

I have a 3-node SolrCloud instance with a single collection. The Solr
nodes are pointed to a 3-node ZooKeeper ensemble. I was doing some basic
disaster recovery testing and have encountered a problem that hasn't been
obvious to me how to fix.

After I started back up the three Solr Java processes, I can see that they
are registered back in the Solr UI. However, each replica is in a down
state permanently. There are no logs in either Solr or ZooKeeper that might
indicate what the problem is -- neither exceptions nor warnings.

So is there any way to collect more diagnostics to figure out what's going
on? Short of deleting and recreating the replicas, is there any way to fix
this?
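
For reference, here is a minimal sketch of one way to pull a bit more state
out of the cluster: the Collections API CLUSTERSTATUS action reports each
replica's state as recorded in ZooKeeper. The host, port, and the collection
name "example" below are assumptions for illustration, not details taken
from the question.

import requests

SOLR_URL = "http://localhost:8983/solr"

# Ask for the cluster status of one collection (name assumed here).
resp = requests.get(
    f"{SOLR_URL}/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": "example"},
    timeout=30,
)
resp.raise_for_status()
collection = resp.json()["cluster"]["collections"]["example"]

# Print the state of every replica in every shard (active, down, recovering, ...).
for shard_name, shard in collection["shards"].items():
    for replica_name, replica in shard["replicas"].items():
        print(shard_name, replica_name, replica["state"],
              "leader" if replica.get("leader") == "true" else "")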

Thanks,
Joe

Re: Handling All Replicas Down in Solr 8.3 Cloud Collection

Posted by Joseph Lorenzini <ja...@gmail.com>.
Here's roughly what was going on:


   1. Set up a three-node cluster with a collection. The collection has one
   shard and three replicas for that shard.
   2. Shut down two of the nodes and verify the remaining node is the
   leader; verify the other two nodes are registered as dead in the Solr UI.
   3. Bulk import several million documents into Solr from a CSV file (see
   the sketch after this list).
   4. Shut down the remaining node.
   5. Start up all three nodes.
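
For step 3, the bulk load is just a POST of the CSV file to the collection's
update handler. A minimal sketch, assuming a local node on port 8983, the
collection name "example", and a hypothetical data.csv whose header row
names the schema fields:

import requests

SOLR_URL = "http://localhost:8983/solr/example"

# Stream the CSV file to the update handler and commit at the end.
# File name, host, and collection name are assumptions for illustration.
with open("data.csv", "rb") as f:
    resp = requests.post(
        f"{SOLR_URL}/update",
        params={"commit": "true"},
        headers={"Content-Type": "text/csv"},
        data=f,
        timeout=600,
    )
resp.raise_for_status()
print(resp.json())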

Even after three minutes, no leader was active. I executed the FORCELEADER
API call, which completed successfully, and waited three minutes -- still no
replica was elected leader. I then compared my Solr 8 cluster to a
different Solr cluster. I noticed that the znode
/collections/example/leaders/shard1 existed on both clusters, but in the
Solr 8 cluster the znode was empty. I manually uploaded a JSON document
with the proper settings to that znode and then called the FORCELEADER API
again and waited 3 minutes.
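
For anyone retracing this, a rough sketch of those two steps follows: issue
FORCELEADER through the Collections API, then read the leader znode
directly. The host names, ZooKeeper connection string, and collection name
are assumptions, and the znode read uses the kazoo client rather than
anything Solr ships with.

import requests
from kazoo.client import KazooClient

# 1. Ask Solr to force a leader election for shard1 of the collection.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={"action": "FORCELEADER", "collection": "example", "shard": "shard1"},
    timeout=60,
)
resp.raise_for_status()

# 2. Inspect the leader znode mentioned above (its data plus any child znodes).
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
try:
    path = "/collections/example/leaders/shard1"
    if zk.exists(path):
        data, stat = zk.get(path)
        print("data:", data.decode("utf-8", errors="replace"))
        print("children:", zk.get_children(path))
    else:
        print(path, "does not exist")
finally:
    zk.stop()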

A leader still wasn't elected.

Then I removed the replica on the node that I had imported all the documents
into and added the replica back in. At that point, a leader was elected.
I am not sure I have exact steps to reproduce, but I did get it working.
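
For completeness, the remove-and-re-add was done through the Collections
API. A rough sketch; the replica name "core_node3" and the node name are
placeholders that would come from CLUSTERSTATUS on a real cluster:

import requests

ADMIN_URL = "http://localhost:8983/solr/admin/collections"

# Drop the suspect replica from shard1 (replica name is a placeholder).
requests.get(ADMIN_URL, params={
    "action": "DELETEREPLICA",
    "collection": "example",
    "shard": "shard1",
    "replica": "core_node3",
}, timeout=120).raise_for_status()

# Add a fresh replica back, optionally pinning it to a specific node
# (node name is a placeholder).
requests.get(ADMIN_URL, params={
    "action": "ADDREPLICA",
    "collection": "example",
    "shard": "shard1",
    "node": "host1:8983_solr",
}, timeout=120).raise_for_status()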

Thanks,
Joe

On Tue, Feb 4, 2020 at 7:54 AM Erick Erickson <er...@gmail.com>
wrote:

> First, be sure to wait at least 3 minutes before concluding the replicas
> are permanently down, that’s the default wait period for certain leader
> election fallbacks. It’s easy to conclude it’s never going to recover, 180
> seconds is an eternity ;).
>
> You can try the collections API FORCELEADER command. Assuming a leader is
> elected and becomes active, you _may_ have to restart the other two Solr
> nodes.
>
> How did you stop the servers? You mention disaster recovery, so I’m
> thinking you did a “kill -9” or similar? Were you actively indexing at the
> time? Solr _should_ manage the recovery even in that case, I’m mostly
> wondering what the sequence of events that led up to this was…
>
> Best,
> Erick
>
> > On Feb 4, 2020, at 8:38 AM, Joseph Lorenzini <ja...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I have a 3-node SolrCloud instance with a single collection. The Solr
> > nodes are pointed to a 3-node ZooKeeper ensemble. I was doing some basic
> > disaster recovery testing and have encountered a problem that hasn't been
> > obvious to me how to fix.
> >
> > After I started back up the three Solr Java processes, I can see that they
> > are registered back in the Solr UI. However, each replica is in a down
> > state permanently. There are no logs in either Solr or ZooKeeper that might
> > indicate what the problem is -- neither exceptions nor warnings.
> >
> > So is there any way to collect more diagnostics to figure out what's going
> > on? Short of deleting and recreating the replicas, is there any way to fix
> > this?
> >
> > Thanks,
> > Joe
>
>

Re: Handling All Replicas Down in Solr 8.3 Cloud Collection

Posted by Erick Erickson <er...@gmail.com>.
First, be sure to wait at least 3 minutes before concluding the replicas are permanently down, that’s the default wait period for certain leader election fallbacks. It’s easy to conclude it’s never going to recover, 180 seconds is an eternity ;).
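
If it helps take the guesswork out of the wait, here is a small sketch that
polls CLUSTERSTATUS for up to 180 seconds and reports once shard1 has an
active leader. The host, port, and collection name "example" are
assumptions.

import time
import requests

ADMIN_URL = "http://localhost:8983/solr/admin/collections"
deadline = time.time() + 180  # the three-minute wait mentioned above

while time.time() < deadline:
    resp = requests.get(ADMIN_URL, params={
        "action": "CLUSTERSTATUS", "collection": "example"}, timeout=30)
    resp.raise_for_status()
    shard = resp.json()["cluster"]["collections"]["example"]["shards"]["shard1"]
    leaders = [r for r in shard["replicas"].values()
               if r.get("leader") == "true" and r["state"] == "active"]
    if leaders:
        print("active leader:", leaders[0]["core"], "on", leaders[0]["node_name"])
        break
    time.sleep(5)
else:
    print("no active leader for shard1 after 180 seconds")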

You can try the collections API FORCELEADER command. Assuming a leader is elected and becomes active, you _may_ have to restart the other two Solr nodes.

How did you stop the servers? You mention disaster recovery, so I’m thinking you did a “kill -9” or similar? Were you actively indexing at the time? Solr _should_ manage the recovery even in that case, I’m mostly wondering what the sequence of events that led up to this was…

Best,
Erick

> On Feb 4, 2020, at 8:38 AM, Joseph Lorenzini <ja...@gmail.com> wrote:
> 
> Hi all,
> 
> I have a 3-node SolrCloud instance with a single collection. The Solr
> nodes are pointed to a 3-node ZooKeeper ensemble. I was doing some basic
> disaster recovery testing and have encountered a problem that hasn't been
> obvious to me how to fix.
> 
> After I started back up the three Solr Java processes, I can see that they
> are registered back in the Solr UI. However, each replica is in a down
> state permanently. There are no logs in either Solr or ZooKeeper that might
> indicate what the problem is -- neither exceptions nor warnings.
> 
> So is there any way to collect more diagnostics to figure out what's going
> on? Short of deleting and recreating the replicas, is there any way to fix
> this?
> 
> Thanks,
> Joe