Posted to issues@solr.apache.org by "Houston Putman (Jira)" <ji...@apache.org> on 2022/09/21 15:09:00 UTC

[jira] [Commented] (SOLR-16416) Leader Election not respecting joinAtHead during ZK Connection issues

    [ https://issues.apache.org/jira/browse/SOLR-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607805#comment-17607805 ] 

Houston Putman commented on SOLR-16416:
---------------------------------------

OK, after more digging, this does not seem to be the case. What actually happens is that, at the end of OverseerNodePrioritizer.prioritizeOverseerNodes(), the prioritizer sends a command to the prioritized leader to take the second spot in the leader election, then sends a second command to the node currently in the second spot to rejoin at the end.

The logging in the failed tests shows that the second command is received, but the first is never logged. After going through the HttpShardHandler, it seems that any error coming back is simply swallowed and never logged. As a first step, I'll add logging for errors returned by either command. Then we can actually start debugging these failures.
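
As a rough illustration of that first step, here is a minimal sketch of sending both commands and recording an error for each one that fails, instead of dropping it. Everything in it is hypothetical: the class name, the sendAll helper, the command labels and the Function-based sender are stand-ins for illustration only, not the actual OverseerNodePrioritizer or HttpShardHandler code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch; none of these names exist in Solr. */
final class CommandErrorLoggingSketch {

    /**
     * Sends each command in order and returns one message per failed command.
     * The sender returns an empty Optional on success, or the error it got back.
     */
    static List<String> sendAll(List<String> commands,
                                Function<String, Optional<Exception>> sender) {
        List<String> errors = new ArrayList<>();
        for (String command : commands) {
            Optional<Exception> failure = sender.apply(command);
            // Record the failure instead of silently discarding it.
            failure.ifPresent(e ->
                errors.add("Command '" + command + "' failed: " + e.getMessage()));
        }
        return errors;
    }

    public static void main(String[] args) {
        // Assumed labels for the two commands described above.
        List<String> commands = List.of("join-second-spot", "rejoin-at-end");
        // Fake sender: the first command fails, the second succeeds.
        Function<String, Optional<Exception>> sender = c ->
            c.equals("join-second-spot")
                ? Optional.of(new RuntimeException("no response"))
                : Optional.<Exception>empty();
        for (String line : sendAll(commands, sender)) {
            System.err.println(line);
        }
    }
}
```

In the real code the messages would presumably go through the existing logger rather than a returned list; the list just makes the sketch easy to check.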

> Leader Election not respecting joinAtHead during ZK Connection issues
> ---------------------------------------------------------------------
>
>                 Key: SOLR-16416
>                 URL: https://issues.apache.org/jira/browse/SOLR-16416
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Houston Putman
>            Priority: Major
>
> OverseerRolesTest.testDesignatedOverseerRestarts has been failing consistently (around 2.5% of the time). I think this is because LeaderElection.joinElection does not respect the joinAtHead flag if connection issues happen while setting the leader election nodes.
> LeaderElection does not use the automatic retryOnConnLoss flags when doing ZK operations. Instead, it waits for an error to come back and handles the retry itself. This is fine for the normal case, because it checks whether the node is represented in the leaderElection child nodes and, if so, ignores the connection loss. However, when doing joinAtHead, if the child node exists but isn't in the place it should be, then the manual retry should be exercised.
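
The check proposed in the description above could be sketched roughly as follows. This is only an illustration of the decision, not Solr's LeaderElection code: the class, the shouldRetry method, the node names and the maxHeadIndex parameter (the highest list index assumed to still count as "at the head") are all hypothetical.

```java
import java.util.List;

/** Hypothetical helper; not part of Solr's LeaderElection. */
final class ElectionRetrySketch {

    /**
     * Decides whether a join should be retried after a connection loss.
     *
     * @param sortedChildren election child nodes, already sorted in election order
     * @param myNode         name of the node this replica tried to create
     * @param joinAtHead     whether the join was requested with joinAtHead
     * @param maxHeadIndex   assumed highest index that still counts as "at the head"
     */
    static boolean shouldRetry(List<String> sortedChildren, String myNode,
                               boolean joinAtHead, int maxHeadIndex) {
        int index = sortedChildren.indexOf(myNode);
        if (index < 0) {
            return true; // the node was never created, so retry
        }
        if (!joinAtHead) {
            return false; // normal case: the node exists, ignore the connection loss
        }
        // joinAtHead case: the node exists, but it must also be near the front.
        return index > maxHeadIndex;
    }

    public static void main(String[] args) {
        List<String> children = List.of("n_0", "n_1", "n_2");
        System.out.println(shouldRetry(children, "n_2", false, 1));
        System.out.println(shouldRetry(children, "n_2", true, 1));
        System.out.println(shouldRetry(children, "n_1", true, 1));
    }
}
```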



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
