You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Andreas Hubold <an...@coremedia.com> on 2020/10/27 08:59:05 UTC

Solr LockObtainFailedException and NPEs for CoreAdmin STATUS

Hi,

we're running tests on a stand-alone Solr instance, which create Solr 
cores from multiple applications using CoreAdmin (via SolrJ).

Lately, we upgraded from 8.4.1 to 8.6.3, and sometimes we now see a 
LockObtainFailedException for a lock held by the same JVM, after which 
Solr is broken and runs into NullPointerExceptions for simple CoreAdmin 
STATUS requests. We have to restart Solr then. I've never seen this with 
8.4.1 or previous releases.

This bug is quite severe for us because it breaks our system tests with 
Solr, and we fear that it may also happen in production. Is this a known 
bug?

Our applications use a CoreAdmin STATUS request to check whether a core 
exists, followed by a CREATE request, if the core does not exist. With 
multiple applications, and bad timing, two concurrent CREATE requests 
for the same core are of course still possible. Solr 8.4.1 rejected 
duplicate requests and logged ERRORs but kept working correctly [1]. I 
can still see the same log messages in 8.6.3 ("Core with name ... 
already exists" or "Error CREATEing SolrCore ... Could not create a new 
core in ... as another core is already defined there") - but sometimes 
also the following error, after which Solr is broken:

2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [   ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock

         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
         at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
...
Caused by: org.apache.solr.common.SolrException: Unable to create core [blueprint_acgqqafsogyc_comments]
         at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
         at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
         ... 47 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
         at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
         ... 48 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
         at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
         ... 50 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
         at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
         at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
         at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
         at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
         at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
         at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
         at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
         at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145)

2020-10-27 00:29:25.353 INFO  (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=500 QTime=0
2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [   ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
         at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
...
Caused by: java.lang.NullPointerException
         at org.apache.solr.core.SolrCore.getInstancePath(SolrCore.java:333)
         at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
         at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)


Any ideas? Were there any recent changes since 8.4.1 that could have 
caused this?

Thank you,
Andreas


[1] BTW, it would be nice if Solr would support an atomic request 
"create core if not exists" so that I wouldn't need separate 
STATUS/CREATE requests and could avoid ERRORs in the log. But this is 
not critical, and not the problem here. (I asked for this years ago: 
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201606.mbox/%3C5762B310.60600%40coremedia.com%3E 
)



Re: Solr LockObtainFailedException and NPEs for CoreAdmin STATUS

Posted by ahubold <an...@coremedia.com>.
I've created a JIRA ticket now:
https://issues.apache.org/jira/browse/SOLR-14969

I'd be really glad, if a Solr developer could help or comment on the issue. 

Thank you,
Andreas



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Solr LockObtainFailedException and NPEs for CoreAdmin STATUS

Posted by ahubold <an...@coremedia.com>.
Hi,

after reading some Solr source code, I might have found the cause:

There was indeed a change in Solr 8.6 that leads to the NullPointerException
for the CoreAdmin STATUS request in CoreAdminOperation#getCoreStatus. The
instancePath is not retrieved from the ResourceLoader anymore, but from the
registered CoreDescriptor. See commit [1]. 

SolrCore.getInstancePath(SolrCore.java:333) throws an NPE because the
CoreContainer does not have a CoreDescriptor for the name, even though a
SolrCore is available in the CoreContainer under that name (retrieved some
lines above). This inconsistency is persistent: All STATUS requests keep
failing until Solr is restarted.

IIUC, the underlying problem is that CoreContainer#create does not correctly
handle concurrent requests to create the same core. There's a race condition
(see TODO comment [2]), and CoreContainer#createFromDescriptor may be called
subsequently for the same core. The second call then fails to create an
IndexWriter (LockObtainFailedException), and this causes a call to
SolrCores#removeCoreDescriptor [3]. This mean, the second call removes the
CoreDescriptor for the SolrCore created with the first call. This is the
inconsistency that causes the NPE in CoreAdminOperation#getCoreStatus.

Does this sound reasonable?

I'll create a JIRA ticket tomorrow, if that's okay.

Thank you,
Andreas

[1]
https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324
[2]
https://github.com/apache/lucene-solr/blob/15241573d3c8da0db3dfd380d99e4efcfe500c2e/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1242
[3]
https://github.com/apache/lucene-solr/blob/15241573d3c8da0db3dfd380d99e4efcfe500c2e/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1407




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html