Posted to issues@solr.apache.org by "Houston Putman (Jira)" <ji...@apache.org> on 2022/10/26 16:57:00 UTC

[jira] [Reopened] (SOLR-16416) Fix silently failing Overseer Election joinAtHead during testDesignatedOverseerRestarts

     [ https://issues.apache.org/jira/browse/SOLR-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Houston Putman reopened SOLR-16416:
-----------------------------------

The test is still failing because overseer elections happen at the beginning of the test. These elections are triggered when the test removes the overseer role from each node in the order of the overseer election queue. Because the roles are removed in queue order, each removal causes the next node in the queue to be elected, since its role has not been removed yet.

The simple fix is to remove roles in reverse election order, so that the last node to have its role removed is the current overseer. That way, no overseer election takes place until the test wants one.
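
As a rough illustration of the ordering argument above, here is a minimal toy model in Java. It does not use Solr or ZooKeeper; the class name, the method name, and the rule "removing the role from the current overseer elects the next queued node that still holds the role" are assumptions made for the sketch, based only on the description in this comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model, not Solr code.
public class RoleRemovalOrderSketch {

  // Counts the elections triggered while removing the overseer role from
  // each node in removalOrder. The head of electionQueue starts as overseer,
  // and every node in the queue starts with the role.
  static int countElections(List<String> electionQueue, List<String> removalOrder) {
    Set<String> designated = new HashSet<>(electionQueue);
    String overseer = electionQueue.get(0);
    int elections = 0;
    for (String node : removalOrder) {
      designated.remove(node);
      // Assumed rule: an election only happens when the current overseer
      // loses the role while some other node still holds it.
      if (node.equals(overseer) && !designated.isEmpty()) {
        for (String candidate : electionQueue) {
          if (designated.contains(candidate)) {
            overseer = candidate;
            elections++;
            break;
          }
        }
      }
    }
    return elections;
  }

  public static void main(String[] args) {
    List<String> queue = List.of("node1", "node2", "node3");
    List<String> reversed = new ArrayList<>(queue);
    Collections.reverse(reversed);
    System.out.println("queue order:   " + countElections(queue, queue));
    System.out.println("reverse order: " + countElections(queue, reversed));
  }
}
```

Under these assumptions, removing roles in queue order triggers an election on every removal except the last, while reverse order triggers none, because the current overseer is the last node to lose its role.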

> Fix silently failing Overseer Election joinAtHead during testDesignatedOverseerRestarts
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-16416
>                 URL: https://issues.apache.org/jira/browse/SOLR-16416
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Houston Putman
>            Assignee: Houston Putman
>            Priority: Major
>             Fix For: 9.1, main (10.0)
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> OverseerRolesTest.testDesignatedOverseerRestarts has been failing consistently (around 2.5% of the time). I think this is because LeaderElection.joinElection does not respect the joinAtHead flag if connection issues happen while setting the leader election nodes.
> LeaderElection does not use the automatic retryOnConnLoss flags when doing ZK operations. Instead, it waits for an error to come back and handles the retry itself. This is fine for the normal case, because it checks whether the node is represented in the leader election child nodes and, if so, ignores the connection loss. However, when doing joinAtHead, if the child node exists but isn't in the place it should be, the manual retry should be exercised.
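
The retry decision described in the issue can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual LeaderElection code: the class name, the `shouldRetry` helper, and the `expectedIndex` parameter (standing in for "the place it should be") are all hypothetical.

```java
import java.util.List;

// Hypothetical sketch of the decision only; no ZooKeeper calls are made.
public class JoinAtHeadRetrySketch {

  // Decides whether to retry after a connection loss, given the current
  // (already sorted) election child nodes.
  static boolean shouldRetry(
      List<String> children, String myNode, boolean joinAtHead, int expectedIndex) {
    int index = children.indexOf(myNode);
    if (index < 0) {
      return true; // the node was never created, so the join must be retried
    }
    if (!joinAtHead) {
      return false; // normal case: the node exists, so ignore the connection loss
    }
    // joinAtHead case: the node exists, but it only counts as done if it
    // sits at the expected position (an assumption of this sketch).
    return index != expectedIndex;
  }

  public static void main(String[] args) {
    List<String> children = List.of("other1", "other2", "me");
    System.out.println(shouldRetry(children, "me", false, 0));
    System.out.println(shouldRetry(children, "me", true, 0));
  }
}
```

The point of the sketch is that checking for the node's existence alone is enough for a plain join, but a joinAtHead also has to check the node's position before treating the connection loss as harmless.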



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
