Posted to issues@solr.apache.org by "Joel Bernstein (Jira)" <ji...@apache.org> on 2022/03/02 14:31:00 UTC
[jira] [Commented] (SOLR-16013) Overseer gives up election node before closing - inflight commands can be processed twice

    [ https://issues.apache.org/jira/browse/SOLR-16013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500197#comment-17500197 ] 

Joel Bernstein commented on SOLR-16013:
---------------------------------------

I can provide a little context on this issue.

We have worked around the issue in our collections operator by explicitly assigning the coreName in the ADDREPLICA command. Thus, when two overseers execute the same ADDREPLICA command, whichever gets there first succeeds and the second fails because of the duplicate coreName. This was not an easy fix, because the collections operator had to follow the same coreName creation rules that Solr uses in order for SolrCloud to work properly. The docs don't even list coreName as a parameter, but it does work in the 8x branch.
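
To illustrate why a caller-chosen coreName makes the duplicated command harmless, here is a toy model. It is only a sketch and not Solr's API: the class, the method names and the core name used with it are all hypothetical, and the only behavior it models is "a second create with the same name is rejected".

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model (hypothetical, not Solr code): a cluster that rejects duplicate core names.
public class ExplicitCoreNameSketch {

    // Thread-safe set of the core names that already exist.
    private final Set<String> coreNames = ConcurrentHashMap.newKeySet();

    /** Returns true if the core was created, false if the name was already taken. */
    boolean addReplica(String coreName) {
        // Set.add is atomic here, so only one of two racing callers can win.
        return coreNames.add(coreName);
    }

    /** Number of cores that exist in this toy cluster. */
    int coreCount() {
        return coreNames.size();
    }
}
```

If two overseers both call addReplica with the same caller-chosen name, only the first call creates a core, so the collection still ends up with a single new replica.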
 
The reason this error came up frequently for us is that we have a test framework for the collections operator that does many things in parallel on a Solr cluster. It spins up collections, tears them down, scales them, etc., all in parallel. Once you do that, this bug jumps out very quickly. If you're not doing parallel operations, you won't hit this bug unless you have the misfortune of having the overseer leader die while performing an ADDREPLICA.

Lastly, the code in Solr that seems to be at issue is the following ZkController logic:

{code:java}
 customThreadPool.submit(() -> IOUtils.closeQuietly(overseerElector.getContext()));
 customThreadPool.submit(() -> IOUtils.closeQuietly(overseer));
{code}

This code was not always written this way. Originally it looked like this:


{code:java}
  IOUtils.closeQuietly(overseerElector.getContext());
  IOUtils.closeQuietly(overseer);
{code}

The thread pool was added as part of a larger ticket to make the tests run faster. I believe there is a decent chance that reverting to serial closing of the overseer will resolve the problem, but I haven't confirmed this.
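
The difference between the two versions can be shown with a small standalone sketch. It does not use any Solr classes: the two Closeable objects are hypothetical stand-ins for overseerElector.getContext() and the Overseer, and closeQuietly here is a local helper rather than the real IOUtils method. All it demonstrates is that two closes submitted to a pool can be in progress at the same time, while serial calls cannot overlap. Whether serial closing is enough to fix the bug is still unconfirmed.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseOrderSketch {

    /** Returns true if the second close() began while the first was still running. */
    static boolean closesOverlap(boolean usePool) throws InterruptedException {
        CountDownLatch secondStarted = new CountDownLatch(1);
        AtomicBoolean overlapped = new AtomicBoolean(false);

        // Hypothetical stand-in for overseerElector.getContext(): while closing,
        // it watches briefly to see whether the other close() has begun.
        Closeable electionContext = () -> {
            try {
                overlapped.set(secondStarted.await(300, TimeUnit.MILLISECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // Hypothetical stand-in for the Overseer: it only signals that its close() ran.
        Closeable overseer = secondStarted::countDown;

        if (usePool) {
            // Same shape as the current code: both closes are handed to a pool.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(() -> closeQuietly(electionContext));
            pool.submit(() -> closeQuietly(overseer));
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } else {
            // Same shape as the original code: the second close starts only
            // after the first one has returned.
            closeQuietly(electionContext);
            closeQuietly(overseer);
        }
        return overlapped.get();
    }

    // Local helper, not the real IOUtils.closeQuietly.
    static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // Swallowed on purpose, as a "quiet" close would do.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("pooled closes overlap: " + closesOverlap(true));
        System.out.println("serial closes overlap: " + closesOverlap(false));
    }
}
```

With the pool, neither close waits for the other, so one of them can finish while the other is still working through in-flight tasks. With serial calls, each close completes before the next begins.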


> Overseer gives up election node before closing - inflight commands can be processed twice
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-16013
>                 URL: https://issues.apache.org/jira/browse/SOLR-16013
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> {{ZkController}} shutdown currently has these two lines (in this order)...
> {code:java}
>     customThreadPool.submit(() -> IOUtils.closeQuietly(overseerElector.getContext()));
>     customThreadPool.submit(() -> IOUtils.closeQuietly(overseer));
> {code}
> AFAICT this means that overseer nodeX will give up its election node (via overseerElector), allowing some other nodeY to be elected as the new overseer, **BEFORE** nodeX shuts down its {{Overseer}} object, which waits for the {{OverseerThread}} to finish processing any tasks in progress.
> In practice, this seems to make it possible for a single command in the overseer queue to get processed twice.


