Posted to dev@lucene.apache.org by "Ishan Chattopadhyaya (JIRA)" <ji...@apache.org> on 2015/08/18 16:03:45 UTC

[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

     [ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-7569:
---------------------------------------
    Attachment: SOLR-7569.patch

I am trying to tackle the situation in which all replicas of a shard (including the leader) are somehow marked "down" (possibly due to bugs), so the shard has no leader and is entirely unavailable. To address it, I am adding a new collection API, "RECOVERSHARD".

In this patch, I am evaluating the following approach:

* Remove all leader-initiated recovery flags for this shard.
* Pick the next leader: If the leader election queue is not empty and the first replica in the queue is on a live node, choose that replica as the next leader. Otherwise, pick a random replica to become the next leader (TODO: we can let the user specify which replica should become the next leader).
* If the chosen leader is not at the head of the leader election queue, have it join the election at the head (similar to what REBALANCELEADERS tries to do). [TODO]
* Mark the next leader as "active". Mark the rest of the replicas (those on live nodes) as "recovering".
* Issue a core admin REQUESTRECOVERY command to all replicas except the next leader.
* Wait until recovery completes. [TODO]
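
To make the selection and state-marking steps above concrete, here is a minimal, language-neutral sketch in Python. It is not code from the patch: the function names (pick_next_leader, plan_states), the replica-to-node mapping, and the string states are all hypothetical, and restricting the random fallback to replicas on live nodes is an assumption on my part, since the list only says "pick a random replica".

```python
import random
from typing import Dict, Optional, Sequence, Set


def pick_next_leader(
    election_queue: Sequence[str],
    replica_nodes: Dict[str, str],
    live_nodes: Set[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the replica to promote, or None if no replica is on a live node.

    All names here are hypothetical; replica_nodes maps replica name -> node name.
    """
    if rng is None:
        rng = random.Random()
    # Prefer the head of the election queue when it is hosted on a live node.
    if election_queue:
        head = election_queue[0]
        if replica_nodes.get(head) in live_nodes:
            return head
    # Assumption: the random fallback only considers replicas on live nodes.
    candidates = sorted(r for r, n in replica_nodes.items() if n in live_nodes)
    return rng.choice(candidates) if candidates else None


def plan_states(
    leader: str, replica_nodes: Dict[str, str], live_nodes: Set[str]
) -> Dict[str, str]:
    """Map each replica on a live node to the state it should be marked with."""
    return {
        replica: ("active" if replica == leader else "recovering")
        for replica, node in replica_nodes.items()
        if node in live_nodes
    }
```

In this sketch, replicas on nodes that are not live are left out of the plan entirely, which mirrors the "(which are on live nodes)" qualifier in the list; the election-queue reordering, the REQUESTRECOVERY calls, and the wait for recovery are not modelled.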

Does the above approach sound reasonable? Does the patch seem reasonable?

> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
>                 Key: SOLR-7569
>                 URL: https://issues.apache.org/jira/browse/SOLR-7569
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-high
>         Attachments: SOLR-7569.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard, e.g. all replicas' last published state was recovery, or bugs caused a leader to be marked as 'down'. While the best solution is for a shard never to get into this state, we need a manual way to fix it when it does. Right now we can go through a series of steps that involve bouncing the node (since the recovery paths for bouncing and for REQUESTRECOVERY are different), but that is difficult when running a large cluster. Such a manual API may lead to some data loss, but in some cases it is the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force replicas into recovering a leader while avoiding data loss on a best-effort basis.



