You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Andreas Hubold <an...@coremedia.com> on 2020/10/27 08:59:05 UTC
Solr LockObtainFailedException and NPEs for CoreAdmin STATUS
Hi,
we're running tests on a stand-alone Solr instance, which create Solr
cores from multiple applications using CoreAdmin (via SolrJ).
Lately, we upgraded from 8.4.1 to 8.6.3, and sometimes we now see a
LockObtainFailedException for a lock held by the same JVM, after which
Solr is broken and runs into NullPointerExceptions for simple CoreAdmin
STATUS requests. We have to restart Solr then. I've never seen this with
8.4.1 or previous releases.
This bug is quite severe for us because it breaks our system tests with
Solr, and we fear that it may also happen in production. Is this a known
bug?
Our applications use a CoreAdmin STATUS request to check whether a core
exists, followed by a CREATE request, if the core does not exist. With
multiple applications, and bad timing, two concurrent CREATE requests
for the same core are of course still possible. Solr 8.4.1 rejected
duplicate requests and logged ERRORs but kept working correctly [1]. I
can still see the same log messages in 8.6.3 ("Core with name ...
already exists" or "Error CREATEing SolrCore ... Could not create a new
core in ... as another core is already defined there") - but sometimes
also the following error, after which Solr is broken:
2020-10-27 00:29:25.350 ERROR (qtp2029754983-24) [ ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'blueprint_acgqqafsogyc_comments': Unable to create core [blueprint_acgqqafsogyc_comments] Caused by: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1312)
at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:95)
at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
...
Caused by: org.apache.solr.common.SolrException: Unable to create core [blueprint_acgqqafsogyc_comments]
at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1408)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1273)
... 47 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1071)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:906)
at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1387)
... 48 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2184)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2308)
at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1130)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1012)
... 50 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /var/solr/data/blueprint_acgqqafsogyc_comments/data/index/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:785)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:126)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:261)
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:135)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2145)
2020-10-27 00:29:25.353 INFO (qtp2029754983-19) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=blueprint_acgqqafsogyc_comments&action=STATUS&indexInfo=false&wt=javabin&version=2} status=500 QTime=0
2020-10-27 00:29:25.353 ERROR (qtp2029754983-19) [ ] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error handling 'STATUS' action
at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:372)
at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
...
Caused by: java.lang.NullPointerException
at org.apache.solr.core.SolrCore.getInstancePath(SolrCore.java:333)
at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:329)
at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:54)
at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
Any ideas? Were there any recent changes since 8.4.1 that could have
caused this?
Thank you,
Andreas
[1] BTW, it would be nice if Solr would support an atomic request
"create core if not exists" so that I wouldn't need separate
STATUS/CREATE requests and could avoid ERRORs in the log. But this is
not critical, and not the problem here. (I asked for this years ago:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201606.mbox/%3C5762B310.60600%40coremedia.com%3E
)
Re: Solr LockObtainFailedException and NPEs for CoreAdmin STATUS
Posted by ahubold <an...@coremedia.com>.
I've created a JIRA ticket now:
https://issues.apache.org/jira/browse/SOLR-14969
I'd be really glad, if a Solr developer could help or comment on the issue.
Thank you,
Andreas
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr LockObtainFailedException and NPEs for CoreAdmin STATUS
Posted by ahubold <an...@coremedia.com>.
Hi,
after reading some Solr source code, I might have found the cause:
There was indeed a change in Solr 8.6 that leads to the NullPointerException
for the CoreAdmin STATUS request in CoreAdminOperation#getCoreStatus. The
instancePath is not retrieved from the ResourceLoader anymore, but from the
registered CoreDescriptor. See commit [1].
SolrCore.getInstancePath(SolrCore.java:333) throws an NPE because the
CoreContainer does not have a CoreDescriptor for the name, even though a
SolrCore is available in the CoreContainer under that name (retrieved some
lines above). This inconsistency is persistent: All STATUS requests keep
failing until Solr is restarted.
IIUC, the underlying problem is that CoreContainer#create does not correctly
handle concurrent requests to create the same core. There's a race condition
(see TODO comment [2]), and CoreContainer#createFromDescriptor may be called
subsequently for the same core. The second call then fails to create an
IndexWriter (LockObtainFailedException), and this causes a call to
SolrCores#removeCoreDescriptor [3]. This mean, the second call removes the
CoreDescriptor for the SolrCore created with the first call. This is the
inconsistency that causes the NPE in CoreAdminOperation#getCoreStatus.
Does this sound reasonable?
I'll create a JIRA ticket tomorrow, if that's okay.
Thank you,
Andreas
[1]
https://github.com/apache/lucene-solr/commit/17ae79b0905b2bf8635c1b260b30807cae2f5463#diff-9652fe8353b7eff59cd6f128bb2699d88361e670b840ee5ca1018b1bc45584d1R324
[2]
https://github.com/apache/lucene-solr/blob/15241573d3c8da0db3dfd380d99e4efcfe500c2e/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1242
[3]
https://github.com/apache/lucene-solr/blob/15241573d3c8da0db3dfd380d99e4efcfe500c2e/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L1407
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html