Posted to issues@solr.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2021/07/03 04:06:00 UTC
[jira] [Commented] (SOLR-15029) More gracefully allow Shard Leader to give up leadership

    [ https://issues.apache.org/jira/browse/SOLR-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373892#comment-17373892 ] 

David Smiley commented on SOLR-15029:
-------------------------------------

LeaderTragicEventTest has been failing on 8x only since ~December 7th 2020 -- http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.LeaderTragicEventTest.testLeaderFailsOver 
The failures appear strongly correlated with this issue, both in timing and because the change for this issue touched that test.

BTW any possible future work/fixes must be under new JIRA issues; this one is closed for good (released).

> More gracefully allow Shard Leader to give up leadership
> --------------------------------------------------------
>
>                 Key: SOLR-15029
>                 URL: https://issues.apache.org/jira/browse/SOLR-15029
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Mike Drob
>            Assignee: Mike Drob
>            Priority: Major
>             Fix For: 8.8, main (9.0)
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently (via SOLR-12412), when a leader sees an index writing error during an update, it gives up leadership by deleting the replica and adding a new replica. One stated benefit of this was that, because we are using the overseer and a known code path, this is done asynchronously and very efficiently.
> I would argue that this approach is too heavy handed.
> In the case of a corrupt index exception, it makes some sense to completely delete the index dir and attempt to sync from a good peer. Even in this case, however, it might be better to allow fingerprinting and other index delta mechanisms to take over and allow for a more efficient data transfer.
> In an alternate case where the index error arises due to a disconnected file system (possible with shared file systems, e.g. S3, HDFS, some k8s systems) and the required solution is some kind of reconnect, this approach has several shortcomings: the core deletion and creation are going to fail, leaving dangling replicas. Further, the data is still present, so there is no need to make so many extra copies.
> I propose that we bring in a mechanism to give up leadership via the existing shard terms language. I believe we would be able to set all replicas currently equal to leader term T to T+1, and then trigger a new leader election. The current leader would know it is ineligible, while the other replicas that were current before the failed update would be eligible. This improvement would entail adding an additional possible operation to the terms state machine.
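
The term bump proposed in the description above can be sketched roughly as follows. This is a minimal illustration only: the class ShardTermsSketch and the methods giveUpLeadership and canBecomeLeader are hypothetical names invented for this example and are not Solr's actual shard terms API. It assumes terms are kept as a simple map from replica name to term, and that a replica is eligible for leadership only if it holds the highest term.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; NOT Solr's real shard terms implementation.
class ShardTermsSketch {

    // Returns a new term map in which every replica that was at the
    // leader's term T (other than the leader itself) is moved to T+1.
    // The leader stays at T, so it no longer holds the highest term.
    static Map<String, Long> giveUpLeadership(Map<String, Long> terms, String leader) {
        Long leaderTerm = terms.get(leader);
        if (leaderTerm == null) {
            throw new IllegalArgumentException("Unknown leader: " + leader);
        }
        Map<String, Long> next = new HashMap<>(terms);
        for (Map.Entry<String, Long> e : terms.entrySet()) {
            boolean isLeader = e.getKey().equals(leader);
            if (!isLeader && e.getValue().longValue() == leaderTerm.longValue()) {
                next.put(e.getKey(), leaderTerm + 1);
            }
        }
        return next;
    }

    // Assumed eligibility rule: only replicas holding the highest term
    // may take part in the next leader election.
    static boolean canBecomeLeader(Map<String, Long> terms, String replica) {
        Long term = terms.get(replica);
        if (term == null || terms.isEmpty()) {
            return false;
        }
        long max = Collections.max(terms.values());
        return term == max;
    }
}
```

With terms {leader=5, r1=5, r2=3}, calling giveUpLeadership(terms, "leader") leaves the leader at 5 and r2 at 3 while moving r1 to 6, so under the assumed rule only r1 is eligible in the following election. Note that if no other replica shares the leader's term, nothing is bumped and the leader remains the highest-term replica in this sketch.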



