Posted to issues@lucene.apache.org by "Erick Erickson (Jira)" <ji...@apache.org> on 2019/09/29 17:04:00 UTC

[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?

    [ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940489#comment-16940489 ] 

Erick Erickson commented on SOLR-13709:
---------------------------------------

I'm finally back to looking at this. Progress will be slow, but I can add a few findings.
 * I'm 99% certain this has nothing to do with loading the core container.
 * The failures I'm seeing (NOTE: there may be more than one, but one thing at a time) are the Metrics-related NPE happening as the test finishes and the collection is being deleted.
 * What I see is a bunch of reloads (expected), then a _CoreContainer.unload()_, _then_ the NPE and the "unable to reload core" message.
 * I verified that the _unload_ is only happening when the collection is deleted at the end of the test.
 * So it seems like there is, indeed, a race condition, but it has to do with higher-level operations. Somehow the reloads are still going on when the collection is being deleted. I did note that the reload calls are async (IIRC).

This is looking more and more like a test artifact: we're programmatically deleting the collection too quickly, and it's not worth warping the Solr code to guard against that. For test purposes, something that told us all of the pending collection operations had completed before we go ahead and delete the collection would do the trick. My theory so far is that the reload isn't complete yet.
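
To make the test-side idea concrete, here is a minimal sketch of "wait for everything pending before deleting". It is plain Java and does not use any Solr API: the PendingOps class and its track/awaitAll methods are hypothetical names invented for illustration, and it assumes each async reload can be represented as a CompletableFuture.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical test helper; nothing here is an existing Solr class. */
class PendingOps {
    // Every async operation (e.g. a reload) the test has started.
    private final List<CompletableFuture<?>> pending = new CopyOnWriteArrayList<>();

    /** Remember an in-flight operation so it can be awaited later. */
    <T> CompletableFuture<T> track(CompletableFuture<T> op) {
        pending.add(op);
        return op;
    }

    /**
     * Block until every tracked operation has finished (successfully or not),
     * or until the timeout elapses. Returns true if nothing is still running.
     */
    boolean awaitAll(long timeout, TimeUnit unit) {
        CompletableFuture<Void> all =
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0]));
        try {
            all.get(timeout, unit);
            return true;
        } catch (ExecutionException e) {
            // allOf only completes once all operations are done, so a failure
            // here still means nothing is pending any more.
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A test would call track(...) for each reload it fires and only issue the collection delete once awaitAll(...) returns true. Whether the real async reload calls can be wrapped this way is an open question.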

I'm running some more tests that'll dump the stack of all the threads if I can get it to fail again, although I'm not sure that'll be conclusive.

I suppose the other approach would be to try to hang around in the delete collection code while anything is pending.

Thoughts? Especially [~hossman] and [~caomanhdat2]. Assuming we're in the Overseer when we delete the collection, waiting in the delete process would likely be a bad idea given how many messages we have to deal with. Is there a way to skip performing an operation right now and put it back on the queue? I'm thinking something like

 {code}
if (conflicting operations detected) {
    skip this command and put it back in the queue for the next go-round
}
 {code}
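
The pseudocode above could be fleshed out roughly as follows. This is only a sketch of the skip-and-requeue pattern in plain Java, not the Overseer's actual queue handling: RequeueSketch, Command, markBusy and runOnePass are hypothetical names, and "conflicting operation" is simplified to "a DELETE aimed at a collection that is still busy".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; none of these types exist in Solr. */
class RequeueSketch {
    /** A queued command: an action name plus the collection it targets. */
    static final class Command {
        final String action;
        final String collection;
        Command(String action, String collection) {
            this.action = action;
            this.collection = collection;
        }
    }

    private final Deque<Command> queue = new ArrayDeque<>();
    // Collections that still have an operation (e.g. a reload) in flight.
    private final Set<String> busy = new HashSet<>();
    // Commands that were actually run, in order, for inspection.
    final List<Command> executed = new ArrayList<>();

    void submit(Command c) { queue.addLast(c); }
    void markBusy(String collection) { busy.add(collection); }
    void markIdle(String collection) { busy.remove(collection); }

    /**
     * Process each command that was queued when the pass started.
     * A conflicting command is put back at the tail instead of blocking.
     * Returns how many commands were deferred to the next pass.
     */
    int runOnePass() {
        int deferred = 0;
        int size = queue.size(); // snapshot so requeued items wait a round
        for (int i = 0; i < size; i++) {
            Command c = queue.pollFirst();
            boolean conflict = "DELETE".equals(c.action) && busy.contains(c.collection);
            if (conflict) {
                queue.addLast(c); // skip now, retry on the next go-round
                deferred++;
            } else {
                executed.add(c);  // stand-in for really performing the command
            }
        }
        return deferred;
    }
}
```

The point of requeueing rather than waiting is that other messages keep flowing while the delete is deferred. A real version would presumably also need a retry limit or timeout so a command can't circulate forever.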

I frankly don't know if that's desirable even if possible. If it's just to fix some tests, it seems too intrusive. If it'd avoid real problems in non-test situations, then I'm all for it.

I'll add more when I have more clues.

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that there may be a race condition when attempting to re-load a SolrCore while the core is currently in the process of (re)loading that can leave the SolrCore in an unusable state.
> Details to follow...


