Posted to dev@lucene.apache.org by "Ishan Chattopadhyaya (JIRA)" <ji...@apache.org> on 2015/10/22 14:26:27 UTC

[jira] [Updated] (SOLR-7569) Create an API to force a leader election between nodes

     [ https://issues.apache.org/jira/browse/SOLR-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ishan Chattopadhyaya updated SOLR-7569:
---------------------------------------
    Attachment: SOLR-7569.patch

Thanks Shalin for looking into the patch and your review.

bq. ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't waitForRecoveriesToFinish() enough?
Fixed. This was left over from a previous patch. I think I intended to add waitForRecoveriesToFinish() but forgot to remove the 5 second sleep.

bq. Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady state to be reached? What is this steady state, is there a better way than waiting for an arbitrary amount of time? In general, Thread.sleep should be avoided as much as possible as a way to reach steady state.
In this case, waiting those 15 seconds results in one of the down replicas becoming the leader (while staying down). This is the situation I'm using FORCELEADER to recover from. Instead of sleeping for a fixed 15 seconds, I now poll for that state so the test can proceed as soon as it is reached, and I've increased the timeout from 15s to 25s.
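
For illustration, here is a minimal sketch of the poll-with-timeout idea. It is not the code from the patch: the PollingWait class, the waitFor method and its parameters are hypothetical names, and the condition is whatever check the test needs.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of the patch): polls a condition until it
// holds or the timeout expires, instead of sleeping for a fixed period.
final class PollingWait {
    private PollingWait() {}

    // Returns true as soon as the condition holds, false on timeout or interrupt.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollIntervalMs) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollIntervalMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

A test would pass its own check as the condition (for example, "the shard has a leader whose state is down") with a 25000 ms timeout, and fail if waitFor returns false.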


bq. Can you please add some javadocs on the various test methods describing the scenario that they are testing?
Sure, added.

bq. minor nit - can you use assertEquals when testing equality of state etc instead of assertTrue. The advantage with assertEquals is that it logs the mismatched values in the exception messages.
Used assertEquals() now.
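
To show the difference in failure messages, here is a small self-contained sketch. The assertTrue and assertEquals methods below are simplified stand-ins written for this example (they are not JUnit itself, although the message format imitates JUnit 4's), and the State enum is hypothetical.

```java
import java.util.Objects;

// Stand-in assertions for illustration only; real tests would use JUnit's.
final class AssertMessageDemo {
    enum State { ACTIVE, DOWN } // hypothetical replica states

    static void assertTrue(boolean condition) {
        if (!condition) {
            throw new AssertionError(); // no detail: the compared values are lost
        }
    }

    static void assertEquals(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected:<" + expected + "> but was:<" + actual + ">");
        }
    }

    // Runs a check and returns its failure message, or "passed" if it did not fail.
    static String failureMessage(Runnable check) {
        try {
            check.run();
            return "passed";
        } catch (AssertionError e) {
            return String.valueOf(e.getMessage());
        }
    }
}
```

With these stand-ins, a failing assertTrue(State.ACTIVE == actual) reports no message at all, while a failing assertEquals(State.ACTIVE, actual) names both the expected and the actual state.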

bq. In OverseerCollectionMessageHandler, lirPath can never be null. The lir path should probably be logged in debug rather than INFO.
Thanks for the pointer, I've removed the null check. I feel this should be logged at INFO rather than DEBUG, so that if a user reports that they issued FORCELEADER and it still did not help, their logs would tell us whether there was any LIR state that got cleared out. But please feel free to change it if this doesn't make sense.

bq. minor nit - you can compare enums directly using == instead of .equals
Fixed.
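
A short sketch of why == is the safer comparison for enums: it is null-safe, whereas calling .equals on a null reference throws. The ReplicaState enum and the method names here are hypothetical stand-ins, not the classes touched by the patch.

```java
// Hypothetical example; ReplicaState stands in for whichever enum is compared.
final class EnumCompareDemo {
    enum ReplicaState { ACTIVE, DOWN, RECOVERING }

    // == compares enum constants by identity and simply returns false for null.
    static boolean isActive(ReplicaState state) {
        return state == ReplicaState.ACTIVE;
    }

    // .equals gives the same answer for non-null values,
    // but throws NullPointerException when state is null.
    static boolean isActiveWithEquals(ReplicaState state) {
        return state.equals(ReplicaState.ACTIVE);
    }
}
```

In addition, == between two unrelated enum types is rejected at compile time, while .equals accepts any Object and would just return false.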

bq. Referring to the following, what is the thinking behind it? when can this happen? is there a test which specifically exercises this scenario? seems like this can interfere with the leader election if the leader election was taking some time? 

I modified the comment text to make it clearer. This is for the situation where all replicas are (somehow, perhaps due to a bug) down/recovering (but not in LIR) and there is no leader, even though many replicas are on live nodes. I don't know if this ever happens (the LIR case does happen, I know). The testAllReplicasDownNoLeader test exercises this scenario. This is more or less the scenario that you described (with the one difference that there is no leader either): {{Leader is not live: Replicas are live but 'down' or 'recovering' -> mark them 'active'}}.

As you point out, I think it can indeed interfere with an ongoing leader election. My thinking was that the FORCELEADER call is issued only because leader election isn't producing a stable leader, so force-marking the replica at the head of the queue as leader is okay. But I defer to your judgement on whether this is fine, and I can remove that code path from the patch (or feel free to remove it yourself) if you feel it is not right.

> Create an API to force a leader election between nodes
> ------------------------------------------------------
>
>                 Key: SOLR-7569
>                 URL: https://issues.apache.org/jira/browse/SOLR-7569
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>              Labels: difficulty-medium, impact-high
>         Attachments: SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569.patch, SOLR-7569_lir_down_state_test.patch
>
>
> There are many reasons why Solr may fail to elect a leader for a shard, e.g. all replicas' last published state was recovery, or bugs which cause a leader to be marked as 'down'. While the best solution is that shards never get into this state, we need a manual way to fix it when they do. Right now we can go through a sequence of node bounces (since the recovery paths for bouncing and REQUESTRECOVERY are different), but that is difficult when running a large cluster. Although such a manual API may lead to some data loss, in some cases it is the only possible option to restore availability.
> This issue proposes to build a new collection API which can be used to force replicas into recovering a leader while avoiding data loss on a best effort basis.


