Posted to dev@lucene.apache.org by St...@ext.cdiscount.com on 2016/02/01 18:14:17 UTC

SolrCloud issues

Hello,

We are currently running benchmarks on Solr 5.4.0 and have hit several issues related to SolrCloud that lead to recoveries and inconsistencies.
Based on our tests, this version seems less stable under pressure than the 4.10.4 version we had installed previously.
We were able to mitigate the effects by increasing numRecordsToKeep in the update log and limiting replication bandwidth (see the snippet below).
However, not all problems were resolved, and, more worryingly, it is now harder to bring the cluster back to a healthy state.
For example, we ended up in a state where, on one shard, the leader is down while all replicas are marked active.
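
For reference, this is the kind of solrconfig.xml tuning we mean. The values below are illustrative rather than our exact benchmark settings, and the placement of maxWriteMBPerSec should be double-checked against the reference guide:

  <!-- Keep more entries in the transaction log so that a replica that falls
       slightly behind can catch up via PeerSync instead of a full index copy -->
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <int name="numRecordsToKeep">500</int>
      <int name="maxNumLogsToKeep">20</int>
    </updateLog>
  </updateHandler>

  <!-- Throttle segment replication to limit I/O and network pressure
       while a replica performs a full recovery -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <str name="maxWriteMBPerSec">16</str>
  </requestHandler>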

We found a particular pattern that leads to a bad cluster state, described here: https://issues.apache.org/jira/browse/SOLR-8129?focusedCommentId=15119905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15119905

There are also a lot of issues, either still open or only resolved in 5.5, related to SolrCloud / ZooKeeper / replication.

Here is a (non-exhaustive) list I could gather from JIRA:

SOLR-8129 - HdfsChaosMonkeyNothingIsSafeTest failures <https://issues.apache.org/jira/browse/SOLR-8129>
SOLR-8461 - CloudSolrStream and ParallelStream can choose replicas that are not active <https://issues.apache.org/jira/browse/SOLR-8461>
SOLR-8619 - A new replica should not become leader when all current replicas are down as it leads to data loss <https://issues.apache.org/jira/browse/SOLR-8619>
SOLR-3274 - ZooKeeper related SolrCloud problems <https://issues.apache.org/jira/browse/SOLR-3274>
SOLR-6406 - ConcurrentUpdateSolrServer hang in blockUntilFinished <https://issues.apache.org/jira/browse/SOLR-6406>
SOLR-8173 - CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover, as well as lose updates that should have been recovered <https://issues.apache.org/jira/browse/SOLR-8173>
SOLR-8371 - Try and prevent too many recovery requests from stacking up and clean up some faulty logic <https://issues.apache.org/jira/browse/SOLR-8371>
SOLR-7121 - Solr nodes should go down based on configurable thresholds and not rely on resource exhaustion <https://issues.apache.org/jira/browse/SOLR-7121>
SOLR-8586 - Implement hash over all documents to check for shard synchronization <https://issues.apache.org/jira/browse/SOLR-8586>

I wonder whether all these issues could be addressed by a general refactoring of this code rather than by individual patches for each issue.
I know these issues are not easy to reproduce and debug, and I'm not aware of all the implications of this kind of work.
We are willing to contribute on these issues, although our knowledge of Solr internals may still be too limited for such an important part of the SolrCloud architecture.
We can provide logs and benchmarks that lead to inconsistencies and/or bad cluster states.
The cluster also appears to behave better with a 5-node ZooKeeper ensemble than with a 3-node one.
However, there is no sign of any problem on the ZooKeeper side when these errors occur in Solr.
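
For reference, this is the kind of quick check one can run against each ZooKeeper node from a shell (the hostname below is a placeholder; 2181 is the default client port):

  # ZooKeeper "four letter word" commands
  echo stat | nc zk-host 2181   # server mode (leader/follower), client connections, latency
  echo mntr | nc zk-host 2181   # detailed metrics: outstanding requests, synced followers, etc.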

Regards,
Stephan