Posted to dev@lucene.apache.org by "Cao Manh Dat (JIRA)" <ji...@apache.org> on 2019/07/10 07:28:00 UTC

[jira] [Commented] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

    [ https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881790#comment-16881790 ] 

Cao Manh Dat commented on SOLR-13616:
-------------------------------------

Hi [~hossman], thanks a lot for taking a look at this failure. Your analysis made tracking things down much faster.

I think the problem lies in {{ZkStateReader}} (or in the way we use it in {{PrepRecoveryOp}}).
In {{PrepRecoveryOp}} (this call was added/refactored in SOLR-12801):
{code}
coreContainer.getZkController().getZkStateReader().waitForState(collectionName, conflictWaitMs, TimeUnit.MILLISECONDS, (n, c) -> {

        try (SolrCore core = coreContainer.getCore(cname)) {
          if (core == null) throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "core not found:" + cname);
...
{code}
The idea here is that we throw a {{SolrException}} whenever the core (leader) is not found, so that the method terminates and returns a response to the replica that is doing recovery.
But {{ZkStateReader}} never propagates an exception thrown inside a watcher ({{CollectionStatePredicate}}) to the upper level; it just logs the exception and continues.

https://github.com/apache/lucene-solr/blob/fb30ded6436a577af86e5db201eed01170102a97/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java#L2045

Furthermore, the watcher that threw the exception is never removed from the list of watchers, so whenever the cluster state changes, it is invoked again. That is why {{"Error on calling watcher"}} appeared multiple times.
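
To make the two effects concrete, here is a minimal, self-contained sketch of the pattern as I understand it. It does not use any Solr classes: {{ToyStateReader}}, {{notifyStateChanged}} and {{simulate}} are hypothetical names made up for illustration, and the real {{ZkStateReader}} is more involved. The sketch only assumes what is described above: the notifier catches and logs whatever a watcher throws, and a watcher is removed only when its predicate returns true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

class SwallowedWatcherDemo {

    /** Hypothetical stand-in for a state reader; NOT the real ZkStateReader. */
    static class ToyStateReader {
        private final List<Predicate<String>> watchers = new CopyOnWriteArrayList<>();
        final List<String> log = new ArrayList<>();

        void register(Predicate<String> watcher) {
            watchers.add(watcher);
        }

        int watcherCount() {
            return watchers.size();
        }

        /** Called on every simulated cluster state change. */
        void notifyStateChanged(String state) {
            for (Predicate<String> watcher : watchers) {
                try {
                    // A watcher is only removed once its predicate returns true.
                    if (watcher.test(state)) {
                        watchers.remove(watcher);
                    }
                } catch (RuntimeException e) {
                    // The exception is logged and swallowed: the caller never
                    // sees it, and the watcher stays in the list.
                    log.add("Error on calling watcher: " + e.getMessage());
                }
            }
        }
    }

    /** Registers a predicate that always throws, then fires the given number of state changes. */
    static ToyStateReader simulate(int changes) {
        ToyStateReader reader = new ToyStateReader();
        reader.register(state -> {
            throw new IllegalStateException("core not found");
        });
        for (int i = 0; i < changes; i++) {
            reader.notifyStateChanged("change-" + i);
        }
        return reader;
    }

    public static void main(String[] args) {
        ToyStateReader reader = simulate(3);
        System.out.println(reader.log.size() + " errors logged, "
                + reader.watcherCount() + " watcher(s) still registered");
    }
}
```

With three simulated state changes, the toy reader logs the error three times and the throwing watcher is still registered at the end, which matches the repeated log message seen in the failure.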

> Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13616
>                 URL: https://issues.apache.org/jira/browse/SOLR-13616
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13616.test-incomplete.patch, thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is a possible deadlock condition when attempting to delete a collection while recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but it does not appear to be a test-specific problem, and seems like it could potentially affect anyone unlucky enough to issue a poorly timed DELETE.
> Details to follow in comments...


