Posted to dev@lucene.apache.org by "Erick Erickson (Jira)" <ji...@apache.org> on 2019/08/23 21:05:00 UTC

[jira] [Comment Edited] (SOLR-13709) Race condition on core reload while core is still loading?

    [ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914633#comment-16914633 ] 

Erick Erickson edited comment on SOLR-13709 at 8/23/19 9:04 PM:
----------------------------------------------------------------

Deleted comment because it's bogus. I'll try again....


was (Author: erickerickson):
[~hossman] I'm hopping in the wayback machine. That said:

AFAICT, that comment in SolrCores.getCoreDescriptor is totally bogus and has been since at least 2015.

There are various lists that are maintained so multiple threads can open, close and reload cores etc. modifyLock is mostly used to coordinate multiple threads making changes to the lists, _not_ to deal with the underlying operations. So you're right, there is no blocking being done.
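To make the distinction concrete, here is a minimal sketch of a lock that guards only the list bookkeeping and not the slow operations. Everything in it is hypothetical: the class name, the String stand-ins for descriptors and cores, and the method names are assumptions for illustration, not Solr's actual SolrCores code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the bookkeeping described above; NOT Solr's SolrCores.
class CoreBookkeepingSketch {
    private final Object modifyLock = new Object();
    // Descriptors and loaded cores are simplified to Strings for illustration.
    private final Map<String, String> descriptors = new HashMap<>();
    private final Map<String, String> loadedCores = new HashMap<>();

    void addDescriptor(String name, String descriptor) {
        synchronized (modifyLock) {
            descriptors.put(name, descriptor);
        }
    }

    // Returns whatever is in the list right now; it never waits for a load to finish.
    String getDescriptor(String name) {
        synchronized (modifyLock) {
            return descriptors.get(name);
        }
    }

    boolean isLoaded(String name) {
        synchronized (modifyLock) {
            return loadedCores.containsKey(name);
        }
    }

    void loadCore(String name) {
        // The slow part runs outside the lock, so other threads are not blocked by it.
        String core = openCore(name);
        synchronized (modifyLock) {
            loadedCores.put(name, core);
        }
    }

    // Placeholder for the expensive work of opening a core.
    private String openCore(String name) {
        return "core:" + name;
    }
}
```

In this sketch a caller of getDescriptor gets an immediate answer based on the current list contents, which is the "no blocking" behavior described above.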

That said, getCoreDescriptor shouldn't be sensitive to whether the core is loaded or not; it should be solely about bookkeeping _descriptors_. But that's not how it behaves. Over in CoreContainer, after all the cores have been discovered, there's this code:

{code}
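        // Transient or lazily loaded cores: only the descriptor is registered here.
        if (cd.isTransient() || !cd.isLoadOnStartup()) {
          solrCores.addCoreDescriptor(cd);
        } else if (asyncSolrCoreLoad) {
          // Load-on-startup core with async loading: flagged as loading, descriptor not added yet.
          solrCores.markCoreAsLoading(cd);
        }
        if (cd.isLoadOnStartup()) {
          // The descriptor is added later, as part of core creation (excerpt ends here).
          futures.add(coreLoadExecutor.submit(() -> {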
{code}

Eventually, if isLoadOnStartup is true, the descriptor does get added to the core descriptor list as part of the core creation process. So some descriptors are available before and some after core discovery, and that may be where the race condition is coming from.

I'll play around a bit with what happens if we just add all the descriptors to the internal lists before any cores are loaded; that seems like the right thing to do. After all, the lazily-loaded cores peacefully exist with a descriptor but no loaded core, and "the right thing" happens when the core is referenced, so I believe it should be OK.
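A rough sketch of that idea, under stated assumptions: the class, the simplified Descriptor type, and the discover method below are all hypothetical and are not taken from CoreContainer. The only point is the ordering: every descriptor is registered first, and only then are load-on-startup cores loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "register all descriptors before loading any core".
class DescriptorsFirstSketch {
    // Simplified descriptor; the real CoreDescriptor carries far more state.
    static final class Descriptor {
        final String name;
        final boolean loadOnStartup;

        Descriptor(String name, boolean loadOnStartup) {
            this.name = name;
            this.loadOnStartup = loadOnStartup;
        }
    }

    private final Map<String, Descriptor> descriptors = new ConcurrentHashMap<>();
    private final Map<String, String> loadedCores = new ConcurrentHashMap<>();

    // Phase 1 registers every descriptor. The returned tasks are phase 2 (the
    // actual loads), which the caller may run later or hand to an executor.
    List<Runnable> discover(List<Descriptor> found) {
        for (Descriptor cd : found) {
            descriptors.put(cd.name, cd);
        }
        List<Runnable> loads = new ArrayList<>();
        for (Descriptor cd : found) {
            if (cd.loadOnStartup) {
                loads.add(() -> loadedCores.put(cd.name, "core:" + cd.name));
            }
        }
        return loads;
    }

    boolean hasDescriptor(String name) {
        return descriptors.containsKey(name);
    }

    boolean isLoaded(String name) {
        return loadedCores.containsKey(name);
    }
}
```

With this ordering, a descriptor lookup that arrives while a load-on-startup core is still loading sees the same thing it would see for a lazily loaded core: a descriptor with no loaded core behind it yet.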

Of course I have some fears that something else will pop out, but blocking on core load in getCoreDescriptor seems dangerous too: it risks long-to-infinite delays if someone happens to ask for a core that is simply not there and never will be. And any timeout we choose will be wrong.

I'll assign this to myself for the nonce. If this doesn't break anything (and I'll beast several tests a lot over the weekend) then maybe we can circle back next week to see if any proposed changes make sense.

How often do you see this failure? I'll put an e-mail filter in place to track how often "Unable to reload core" shows up and collect some history, so we can have some confidence it actually gets fixed if I can come up with some code.

Thanks for sleuthing this!



> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that there may be a race condition when attempting to re-load a SolrCore while the core is currently in the process of (re)loading that can leave the SolrCore in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
