Posted to solr-user@lucene.apache.org by "Johnston, Charlie" <ch...@blackrock.com> on 2019/01/08 19:12:40 UTC

SolrCloud 6.5.1 Stability/Recovery Issues

Hi,

We have been using Solr 6.5.1 leveraging SolrCloud backed by ZooKeeper for a multi-client, multi-node cluster for several months now and have been having a few stability/recovery issues we’d like to confirm if they are fixed in Solr 7 or not. We run 3 large Solr nodes in the cluster (each with 64 GBs of heap, 100’s of collections, and 9000+ cores). We have found that if anything happens (e.g. long garbage collection pauses) that puts replicas of the same shard on more than one node into the down/recovering state at the same time, then they usually fail to complete the recovery process and reach the active state again, even though all of the nodes in the cluster are up and running. The same thing happens if we shut down multiple nodes and try to start them up at the same time. It looks like each replica thinks the other is supposed to be the leader, which results in neither ever accepting the responsibility. Is this a common issue in SolrCloud, and to the best of the mailing list’s knowledge, has it been fixed in any Solr 7 release?
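
(For reference: when a shard’s replicas are all live but none of them takes leadership, the Collections API FORCELEADER action, available since Solr 5.4, can be used to ask Solr to elect a leader for that shard. Below is a minimal sketch, not a drop-in tool; the node address, collection name, and shard name are placeholders.)

    // Minimal sketch: issues the Collections API FORCELEADER action for one
    // shard. The node address, collection, and shard below are placeholders.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ForceLeader {
        public static void main(String[] args) throws Exception {
            String base = "http://solr-node-1:8983/solr";   // placeholder node
            String collection = "my_collection";            // placeholder name
            String shard = "shard1";                        // placeholder name

            URL url = new URL(base + "/admin/collections?action=FORCELEADER"
                    + "&collection=" + collection + "&shard=" + shard + "&wt=json");
            try (InputStream in = url.openStream()) {
                // Print the raw API response so the outcome can be inspected.
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }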

A second issue we have faced with the same cluster is that when the above situation arises, our recovery procedure is to stop all the impacted nodes (sometimes all 3 nodes in the cluster) and start them one at a time, waiting for all cores on each node to recover before starting the next one. During this process we have found two different problems. One is that recovering the nodes seems to take a long time, sometimes more than an hour for all the cores to move into the active state. Another, more pressing problem is that the order in which we start the servers back up sometimes seems to matter. We run into cases where we start a node and all but a few cores (fewer than 10) recover; those few never leave the down state, even after several hours and several restarts of the same node. To fix this we then have to stop the node and try starting another node until we find one that fully recovers, and only then return to the originally problematic node and try again. This has always worked in the end, but only after a lot of pain. Our hope is to get to a day where we can start all the nodes in the cluster at the same time and have it “just work”, i.e. converge on a fully active state 100% of the time, rather than managing this one-node-at-a-time process. Our expectation is that if all the nodes in the cluster are up and healthy, then all of the cores in the cluster would eventually reach the active state, but that seems not to be the case. Are there any fixes in Solr 7, or potentially some configuration we can add to Solr 6, that resolve either of these issues?
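
(For reference: the “wait for all cores on each node to recover before starting the next one” step can be watched programmatically by polling the Collections API CLUSTERSTATUS action until no replica reports a down or recovering state. A minimal sketch follows, assuming a placeholder node address and using a crude string scan of the JSON response rather than a proper parse.)

    // Minimal sketch: polls CLUSTERSTATUS and returns once no replica reports
    // a "down" or "recovering" state. The node address is a placeholder and
    // the JSON check is a crude string scan, not a real parse.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class WaitForActive {
        public static void main(String[] args) throws Exception {
            String statusUrl = "http://solr-node-1:8983/solr/admin/collections"
                    + "?action=CLUSTERSTATUS&wt=json";       // placeholder node

            while (true) {
                String body;
                try (InputStream in = new URL(statusUrl).openStream()) {
                    body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
                // Crude check: look for any quoted "down"/"recovering" state value.
                boolean unhealthy = body.contains("\"down\"")
                        || body.contains("\"recovering\"");
                if (!unhealthy) {
                    System.out.println("All replicas report active");
                    break;
                }
                System.out.println("Replicas still down/recovering, waiting...");
                Thread.sleep(10_000);   // poll every 10 seconds
            }
        }
    }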

Best,
Charlie Johnston



Re: SolrCloud 6.5.1 Stability/Recovery Issues

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/8/2019 12:12 PM, Johnston, Charlie wrote:
> We have been using Solr 6.5.1 leveraging SolrCloud backed by ZooKeeper for a multi-client, multi-node cluster for several months now and have been having a few stability/recovery issues we’d like to confirm if they are fixed in Solr 7 or not. We run 3 large Solr nodes in the cluster (each with 64 GBs of heap, 100’s of collections, and 9000+ cores).

Managing that many indexes is currently SolrCloud's Achilles' heel.

Splitting the cluster across more Solr nodes will help to some degree, 
but dealing with thousands of replicas in a single cluster is simply not 
going to scale.  In addition to splitting the indexes across more nodes, 
you may also need to create multiple clusters so that each cluster is 
managing a smaller number of shard replicas.

This is a known problem, and there is ongoing work to improve the
situation.  I need to repeat the experiments that I did
on SOLR-7191 on a much newer version so that I can have a better idea of 
whether the situation has improved in 7.x versions.

The discussion on SOLR-7191 is long and very dense, but it might be 
worth reading.  You have about twice as many cores as I was creating in 
my experiments, which means that Solr will be processing more messages 
for recovery operations:

https://issues.apache.org/jira/browse/SOLR-7191

I think that SOLR-10265 pinpoints the central problem that causes these 
issues.  Some of its sub-issues have been implemented, which MIGHT mean 
that 7.x is a lot better off:

https://issues.apache.org/jira/browse/SOLR-10265
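
(For reference: one way to check whether the Overseer is the bottleneck in a given cluster is to watch the Collections API OVERSEERSTATUS action, whose response includes Overseer queue sizes and per-operation stats. A minimal sketch, assuming a placeholder node address and a crude scan of the JSON response; a queue that keeps growing while many cores sit in down/recovering is one symptom of the problem discussed in SOLR-10265.)

    // Minimal sketch: fetches OVERSEERSTATUS and prints any fragment that
    // mentions a queue. The node address is a placeholder and the scan is a
    // crude string split, not a real JSON parse.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class OverseerQueueCheck {
        public static void main(String[] args) throws Exception {
            String url = "http://solr-node-1:8983/solr/admin/collections"
                    + "?action=OVERSEERSTATUS&wt=json";      // placeholder node
            try (InputStream in = new URL(url).openStream()) {
                String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                for (String fragment : body.split(",")) {
                    if (fragment.contains("queue")) {
                        System.out.println(fragment.trim());
                    }
                }
            }
        }
    }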

Thanks,
Shawn