Posted to dev@lucene.apache.org by "Ishan Chattopadhyaya (JIRA)" <ji...@apache.org> on 2015/09/02 18:03:46 UTC

[jira] [Commented] (SOLR-7569) Create an API to force a leader election between nodes

    [ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727549#comment-14727549 ] 

Ishan Chattopadhyaya commented on SOLR-7569:
--------------------------------------------


bq. 1. nit - RecoverShardTest has an unused notLeader1 variable
Thanks. I refactored the test, and the unused variable is gone now.

bq.   2.    Shouldn't the "Wait for a long time for a steady state" piece of code be before the proxies for the two replicas are reopened? The LIR state will surely be set at indexing time and only if the proxy is closed. Also if you move that wait before the proxy is reopened then you are sure to have the LIR state as 'down'.
This makes sense; I've made the change.

bq. 3. The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' should be done after force refreshing the cluster state of the cloudClient; otherwise spurious failures can happen
I didn't know about this force update of the cluster state; I've now added it.

bq.  4.    nit - Why is sendDoc overridden in RecoverShardTest? The minRf is same, just the max retries has been increased and wait between retries has been decreased
The tests were (and still are) taking too long, and reducing the wait from 30sec to 1sec was helpful.
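
For readers unfamiliar with the trade-off, here is a minimal sketch of a fixed-wait retry loop, where more retries with a shorter wait keep the overall test time down. RetrySketch and its retry() method are hypothetical names for illustration only; this is not the sendDoc code from the patch.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of Solr: retries an action with a fixed wait. */
final class RetrySketch {
    private RetrySketch() {}

    /**
     * Runs the attempt up to maxRetries times, sleeping waitMillis between
     * failed attempts. Returns the 1-based number of the attempt that
     * succeeded, or -1 if every attempt failed or the thread was interrupted.
     */
    static int retry(BooleanSupplier attempt, int maxRetries, long waitMillis) {
        for (int i = 1; i <= maxRetries; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
            if (i < maxRetries) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

With a 1 s wait instead of 30 s, a call such as RetrySketch.retry(action, 30, 1000L) spends at most about 29 s waiting in total, which is less than a single 30 s wait.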

bq. 5.    The OCMH.recoverShard() isn't unsetting the leader properly. It should be as simple as:
Thanks, I've cleaned this up.

bq.  6.    Can you please write a test to ensure that this API works with 'async' parameter?
TODO.

bq.    Leader is live but 'down' -> mark it 'active'
This works now. Added testLeaderDown() method.

bq.    Leader itself is in LIR -> delete the LIR node
This should work, since the API method first clears the LIR state. Couldn't add a test for this, since I couldn't simulate this state in a test.

bq.    Leader is not live:       Replicas are live but 'down' or 'recovering' -> mark them 'active'
This works now. Added testAllReplicasDownNoLeader() method.

bq.    Leader is not live:       Replicas are live but in LIR -> delete the LIR nodes
This works as in the last patch. The corresponding test is now at testReplicasInLIRNoLeader().
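
The four scenarios above can be summarized in a small decision sketch. Everything here (RecoverShardSketch, Replica, State, plan()) is a hypothetical model written for illustration; it is not the OCMH.recoverShard() implementation from the patch. It assumes, as described above, that the LIR state is cleared first, and it simplifies by treating any non-active state of a live leader as 'down'.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the scenarios above; these types do not exist in Solr. */
final class RecoverShardSketch {
    enum State { ACTIVE, DOWN, RECOVERING }

    static final class Replica {
        final String name;
        final boolean live;    // node is live
        final boolean inLir;   // replica has a leader-initiated-recovery (LIR) node
        final boolean leader;  // replica is the registered leader
        final State state;     // last published state

        Replica(String name, boolean live, boolean inLir, boolean leader, State state) {
            this.name = name;
            this.live = live;
            this.inLir = inLir;
            this.leader = leader;
            this.state = state;
        }
    }

    /** Returns the corrective actions, in order, as readable strings. */
    static List<String> plan(List<Replica> replicas) {
        Replica liveLeader = null;
        for (Replica r : replicas) {
            if (r.leader && r.live) {
                liveLeader = r;
            }
        }
        List<String> actions = new ArrayList<>();
        for (Replica r : replicas) {
            if (!r.live) {
                continue; // nothing can be done for a replica on a dead node
            }
            if (liveLeader != null && r != liveLeader) {
                continue; // a live leader exists, so only the leader is repaired
            }
            if (r.inLir) {
                actions.add("delete LIR node for " + r.name); // LIR is cleared first
            }
            if (r.state != State.ACTIVE) {
                actions.add("mark " + r.name + " active");
            }
        }
        return actions;
    }
}
```

For a shard whose leader is not live, with one live replica in LIR and 'down' and another live replica 'recovering', plan() yields a LIR deletion and a state change for the first replica, followed by a state change for the second.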

bq. Did you find out why/how that happened? If this is reproducible, can you please create an issue and post the test there?
I created SOLR-7989 for this and will look deeper soon.

> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
>                 Key: SOLR-7569
>                 URL: https://issues.apache.org/jira/browse/SOLR-7569
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-high
>         Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr will not elect a leader for a shard, e.g. all replicas' last published state was recovery, or bugs which cause a leader to be marked as 'down'. While the best solution is that replicas never get into this state, we need a manual way to fix it when they do. Right now we can do a dance involving bouncing the node (since the recovery paths for bouncing and REQUESTRECOVERY are different), but that is difficult when running a large cluster. Such a manual API may lead to some data loss, but in some cases it is the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force replicas into recovering a leader while avoiding data loss on a best effort basis.


