Posted to issues@lucene.apache.org by "David Hunt (Jira)" <ji...@apache.org> on 2019/12/19 17:51:00 UTC

[jira] [Commented] (SOLR-14123) autoAddReplicas is not reliable when multiple nodes go down.

    [ https://issues.apache.org/jira/browse/SOLR-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000250#comment-17000250 ] 

David Hunt commented on SOLR-14123:
-----------------------------------

I noticed that registerLiveNodesListener in [https://github.com/apache/lucene-solr/blob/fa27e476f74bc4ba83e3fcdc39b421bc53a45d16/solr/core/src/java/org/apache/solr/cloud/ZkController.java] only looks at the top 3 nodes lost.  I don't entirely understand what the code is trying to do, but it looks suspicious, especially for larger clusters.

> autoAddReplicas is not reliable when multiple nodes go down.
> ------------------------------------------------------------
>
>                 Key: SOLR-14123
>                 URL: https://issues.apache.org/jira/browse/SOLR-14123
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: 8.3
>            Reporter: David Hunt
>            Priority: Major
>              Labels: autoscale
>
> I started noticing problems in our production environment with indexing being blocked because a minimum replication factor was not being met.  We have autoAddReplicas triggers in place to add replicas when nodes are lost, but they don't seem to add back all of the replicas that were lost. I’ve been able to reproduce this behavior consistently in a development environment.
> Repro:
>  # Set up a 10-node SolrCloud cluster.
>  # Create an autoAddReplicas trigger on nodeLost with waitFor set to 10 minutes.
>  # Create 15 collections with 2 shards and 4 replicas.
>  # Kill 3 Solr nodes.
>  # 15 minutes later kill 1 more Solr node.
> Results:
> Monitor your shards/replicas.  You’ll see some, but not all, of the lost replicas added back.  An hour later many shards are still missing replicas.
> Expected:
> All lost replicas should be added on the 6 remaining healthy nodes.
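
For anyone trying to reproduce this, step 2 of the repro could be set up roughly as sketched below. This is only an illustration, not the reporter's actual configuration: it assumes the trigger follows the `.auto_add_replicas` layout from the Solr autoscaling reference guide, and the helper name build_auto_add_replicas_trigger is hypothetical. The 10-minute waitFor is written as 600 seconds.

```python
import json


def build_auto_add_replicas_trigger(wait_for_seconds: int = 600) -> dict:
    """Build a set-trigger command for a nodeLost trigger that re-adds replicas.

    Assumption: the trigger name and action classes mirror the
    .auto_add_replicas example in the Solr autoscaling reference guide;
    the issue itself does not show the exact configuration used.
    """
    return {
        "set-trigger": {
            "name": ".auto_add_replicas",
            "event": "nodeLost",
            # waitFor from the repro steps: 10 minutes, expressed in seconds
            "waitFor": f"{wait_for_seconds}s",
            "enabled": True,
            "actions": [
                {
                    "name": "auto_add_replicas_plan",
                    "class": "solr.AutoAddReplicasPlanAction",
                },
                {
                    "name": "execute_plan",
                    "class": "solr.ExecutePlanAction",
                },
            ],
        }
    }


if __name__ == "__main__":
    # Print the JSON body so it can be inspected before sending it anywhere.
    print(json.dumps(build_auto_add_replicas_trigger(600), indent=2))
```

The printed JSON would then be POSTed to the cluster's autoscaling endpoint (in the 8.x reference guide this is the autoscaling write API); the exact URL depends on the deployment and is not given in the issue.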



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
