Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2014/02/06 08:46:09 UTC

[jira] [Resolved] (SOLR-5373) Can't become leader due infinite recovery loop

     [ https://issues.apache.org/jira/browse/SOLR-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-5373.
-----------------------------------------

    Resolution: Not A Problem

As noted by Mark, this is by design.

> Can't become leader due infinite recovery loop
> ----------------------------------------------
>
>                 Key: SOLR-5373
>                 URL: https://issues.apache.org/jira/browse/SOLR-5373
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.2
>         Environment: SolrCloud, 2 nodes, Fedora
>            Reporter: Javier Mendez
>            Assignee: Mark Miller
>            Priority: Minor
>              Labels: Recovery, SolrCloud
>             Fix For: 4.7
>
>         Attachments: SOLR-5373.patch, stack1, stack2, stack3, stack4, stack5, stack6, stack7
>
>
> We found an issue while performing stability tests on SolrCloud. Under certain circumstances, a node gets stuck in an endless loop trying to recover. I've seen this happen in a two-node setup by following these steps:
> 1) Node A started
> 2) Node B started
> 3) Node B stopped
> 4) Node B started, and immediately Node A stopped (normal graceful shutdown). 
> At this point node B will throw connection refused messages while trying to sync to node A. Sometimes (not always) this leads to a corrupt state where node B enters an infinite loop trying to recover from node A (it still thinks the cluster has two nodes). I think the leader election process started just fine, but since recovery runs asynchronously, at some point node B published its state as recovery_failed, which caused leader election to fail.
> ZooKeeper /live_nodes has only one entry.
> This shows in the logs:
>     10:57:18,960 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) Running the leader process.
>     10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) Checking if I should try and be the leader.
>     10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) My last published State was recovery_failed, I won't be the leader.
>     10:57:19,068 INFO INFO  [ShardLeaderElectionContext] (main-EventThread) There may be a better leader candidate than us - going back into recovery
>     10:57:19,118 INFO INFO  [DefaultSolrCoreState] (main-EventThread) Running recovery - first canceling any ongoing recovery
>     10:57:19,118 WARN WARN  [RecoveryStrategy] (main-EventThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while trying to recover. core=myCollection:org.apache.solr.common.SolrException: No registered leader was found, collection:myCollection slice:shard1
>             at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
>             at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
>             at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
>             at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)
>     
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - trying again... (0) core=myCollection
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - interrupted. core=myCollection
>     10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - I give up. core=myCollection
>     10:57:19,869 INFO INFO  [ZkController] (RecoveryThread) publishing core=myCollection state=recovery_failed
>     10:57:19,869 INFO INFO  [ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
>     10:57:19,902 WARN WARN  [RecoveryStrategy] (RecoveryThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
>     10:57:19,902 INFO INFO  [RecoveryStrategy] (RecoveryThread) Finished recovery process. core=myCollection
>     10:57:19,902 INFO INFO  [RecoveryStrategy] (RecoveryThread) Starting recovery process.  core=myCollection recoveringAfterStartup=false
> Solr Version: 4.2.1.2013.03.26.08.26.55
> Other references to the same issue:
>  - https://support.lucidworks.com/entries/23553611-Solr-cluster-not-able-to-recover 
>  - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3C1371473296754-4070983.post@n3.nabble.com%3E
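
The cycle described in the report can be illustrated with a small toy model. This is only a sketch of the behaviour as the reporter and the log describe it, not Solr's implementation: the class ToyReplica, the function simulate, and the state constants are hypothetical names invented for this example. It assumes a single surviving replica that declines leadership whenever its last published state is recovery_failed, and whose recovery fails whenever no leader is registered.

```python
from dataclasses import dataclass, field

# State names follow the log output quoted above; the constants are illustrative.
ACTIVE = "active"
RECOVERING = "recovering"
RECOVERY_FAILED = "recovery_failed"


@dataclass
class ToyReplica:
    """Hypothetical stand-in for node B; not one of Solr's classes."""
    last_published_state: str = RECOVERING
    log: list = field(default_factory=list)

    def try_become_leader(self) -> bool:
        # Assumption taken from the log: a replica whose last published
        # state is recovery_failed declines to become leader.
        if self.last_published_state == RECOVERY_FAILED:
            self.log.append("won't be the leader; going back into recovery")
            return False
        self.log.append("became leader")
        self.last_published_state = ACTIVE
        return True

    def recover(self, registered_leader) -> bool:
        # Assumption taken from the log: recovery cannot proceed without
        # a registered leader, so the replica publishes recovery_failed.
        if registered_leader is None:
            self.log.append("no registered leader; publishing recovery_failed")
            self.last_published_state = RECOVERY_FAILED
            return False
        self.last_published_state = ACTIVE
        return True


def simulate(replica: ToyReplica, max_rounds: int = 5):
    """Return the round in which the replica became leader, or None if it never did."""
    leader = None  # node A has shut down, so nobody is registered as leader
    for round_no in range(1, max_rounds + 1):
        if replica.try_become_leader():
            return round_no
        replica.recover(leader)
    return None
```

In this model the outcome depends only on ordering. If the election runs while the replica's last published state is still "recovering", simulate returns 1. If the asynchronous recovery publishes recovery_failed first, simulate returns None for any number of rounds, which matches the loop seen in the log.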

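To check whether a node is in this state, one option is to scan its log for the two messages quoted above. The following is a rough sketch under stated assumptions: the function name count_recovery_cycles is hypothetical, and the patterns match only the message wording shown in this report, which may differ in other Solr versions or logging configurations.

```python
import re

# Patterns copied from the log messages quoted in the report.
ELECTION_DECLINED = re.compile(r"My last published State was recovery_failed")
PUBLISHED_FAILED = re.compile(r"publishing core=(\S+) state=recovery_failed")


def count_recovery_cycles(lines):
    """Count, per core, how often a declined election is followed by
    another recovery_failed publish."""
    cycles = {}
    declined = False
    for line in lines:
        if ELECTION_DECLINED.search(line):
            declined = True
            continue
        match = PUBLISHED_FAILED.search(line)
        if match and declined:
            core = match.group(1)
            cycles[core] = cycles.get(core, 0) + 1
            declined = False
    return cycles
```

A count that keeps growing for the same core while /live_nodes lists a single node suggests the situation described in this issue. How many cycles should count as "stuck" is a judgment call that this sketch leaves to the caller.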