
Posted to dev@lucene.apache.org by "Cao Manh Dat (JIRA)" <ji...@apache.org> on 2018/04/10 12:27:00 UTC

[jira] [Commented] (SOLR-12187) Replica should watch clusterstate and unload itself if its entry is removed

    [ https://issues.apache.org/jira/browse/SOLR-12187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432180#comment-16432180 ] 

Cao Manh Dat commented on SOLR-12187:
-------------------------------------

Patch for this ticket:
 * Each replica registers a CollectionStateWatcher and unloads itself when its entry is removed from clusterstate.
 * Reverts the changes made by SOLR-12176; they are no longer needed, since a zombie leader cannot exist with this patch.
 * Adds test coverage.
 * Refactors ZkController.register() to handle race conditions better.
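
To illustrate the first bullet, here is a minimal, self-contained sketch of the watch-and-unload idea. It does not use any Solr classes and is not the code in the patch: StateWatcher, ClusterState and Replica are hypothetical stand-ins, and only the callback contract (return true to be deregistered) is modeled on Solr's CollectionStateWatcher.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReplicaUnloadSketch {

    /** Hypothetical stand-in for a collection state watcher. */
    interface StateWatcher {
        /** @return true if the watcher should be removed after this call */
        boolean onStateChanged(Set<String> replicaNames);
    }

    /** Hypothetical in-memory stand-in for the clusterstate of one collection. */
    static class ClusterState {
        private final Set<String> replicaNames = new HashSet<>();
        private final List<StateWatcher> watchers = new ArrayList<>();

        void addReplica(String name) { replicaNames.add(name); notifyWatchers(); }

        void removeReplica(String name) { replicaNames.remove(name); notifyWatchers(); }

        void registerWatcher(StateWatcher watcher) { watchers.add(watcher); }

        private void notifyWatchers() {
            // Hand each watcher a snapshot and drop the ones that report they are done.
            Set<String> snapshot = new HashSet<>(replicaNames);
            watchers.removeIf(w -> w.onStateChanged(snapshot));
        }
    }

    /** Hypothetical replica that unloads itself once its entry disappears. */
    static class Replica {
        final String name;
        boolean loaded = true;

        Replica(String name, ClusterState state) {
            this.name = name;
            state.registerWatcher(names -> {
                if (names.contains(name)) {
                    return false;   // entry still present: keep watching
                }
                loaded = false;     // entry removed: unload this replica
                return true;        // nothing left to watch
            });
        }
    }

    public static void main(String[] args) {
        ClusterState state = new ClusterState();
        state.addReplica("core_node1");
        Replica replica = new Replica("core_node1", state);

        state.addReplica("core_node2");      // unrelated change, replica stays loaded
        System.out.println("loaded after unrelated change: " + replica.loaded);

        state.removeReplica("core_node1");   // entry removed, replica unloads itself
        System.out.println("loaded after removal: " + replica.loaded);
    }
}
```

Because the replica reacts to every state change rather than checking once at startup, it does not matter whether the entry is removed before, during, or after the replica finishes loading.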

> Replica should watch clusterstate and unload itself if its entry is removed
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-12187
>                 URL: https://issues.apache.org/jira/browse/SOLR-12187
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>         Attachments: SOLR-12187.patch, SOLR-12187.patch, SOLR-12187.patch
>
>
> With the introduction of the autoscaling framework, we have seen an increase in the number of issues caused by race conditions between deleting a replica and other operations.
> Case 1: DeleteReplicaCmd fails to send an UNLOAD request to a replica and therefore forcefully removes its entry from clusterstate, but the replica still functions normally and is able to become leader -> SOLR-12176
> Case 2:
>  * DeleteReplicaCmd enqueues a DELETECOREOP (without sending a request to the replica, because the node is not live)
>  * The node starts and the replica gets loaded
>  * The DELETECOREOP has not been processed yet, hence the replica is still present in clusterstate --> it passes checkStateInZk
>  * The DELETECOREOP is executed and DeleteReplicaCmd finishes
>  ** result 1: the replica starts recovering, finishes, and publishes itself as ACTIVE --> the state of the replica is ACTIVE
>  ** result 2: the replica throws an exception (probably an NPE) --> the state of the replica is DOWN and it does not join leader election
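
The ordering in Case 2 is a check-then-act race, and it can be replayed deterministically in a few lines. The sketch below is only an illustration under simplifying assumptions: the set and queue are hypothetical stand-ins for clusterstate and the queued delete operation, and the one-time membership test stands in for checkStateInZk. No Solr code is involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class DeleteRaceSketch {

    /**
     * Replays the Case 2 ordering.
     * @return true if the replica ends up running without a clusterstate entry
     */
    static boolean replayCase2() {
        Set<String> clusterState = new HashSet<>();      // stand-in for clusterstate
        Deque<String> pendingDeletes = new ArrayDeque<>(); // stand-in for the queued delete
        clusterState.add("core_node1");

        // 1. The node is not live, so the delete is only enqueued.
        pendingDeletes.add("core_node1");

        // 2. The node starts; the replica loads and performs its one-time check,
        //    which passes because the queued delete has not been processed yet.
        boolean passedCheck = clusterState.contains("core_node1");

        // 3. The queued delete is processed and the entry disappears.
        while (!pendingDeletes.isEmpty()) {
            clusterState.remove(pendingDeletes.poll());
        }

        // 4. The replica keeps going based on the stale check result.
        boolean running = passedCheck;
        return running && !clusterState.contains("core_node1");
    }

    public static void main(String[] args) {
        System.out.println("running without clusterstate entry: " + replayCase2());
    }
}
```

A one-time check cannot close this window, because the entry can be removed at any point after the check has passed.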



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
