Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2017/03/03 21:17:45 UTC

[jira] [Commented] (SOLR-9836) Add more graceful recovery steps when failing to create SolrCore

    [ https://issues.apache.org/jira/browse/SOLR-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895044#comment-15895044 ] 

Steve Rowe commented on SOLR-9836:
----------------------------------

{{MissingSegmentRecoveryTest.testLeaderRecovery()}} has been failing pretty regularly on Jenkins.  Something happened on or about February 10th, when the probability of failure went up considerably (and has since remained at this elevated level).

I got 3 failures beasting 100 iterations of the test suite using Miller's beasting script on my box.  However, for the past three weeks I've seen this several times a day on my Jenkins, and roughly once a day on either ASF or Policeman Jenkins.

Here's a recent failure [https://builds.apache.org/job/Lucene-Solr-Tests-master/1699/]:

{noformat}
  [junit4]   2> 599977 ERROR (coreLoadExecutor-3254-thread-1-processing-n:127.0.0.1:41308_solr) [n:127.0.0.1:41308_solr c:MissingSegmentRecoveryTest s:shard1 r:core_node1 x:MissingSegmentRecoveryTest_shard1_replica2] o.a.s.u.SolrIndexWriter Error closing IndexWriter
  [junit4]   2> java.nio.file.NoSuchFileException: /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468/write.lock
  [junit4]   2> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
  [junit4]   2> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
  [junit4]   2> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
  [junit4]   2> 	at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
  [junit4]   2> 	at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
  [junit4]   2> 	at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
  [junit4]   2> 	at java.nio.file.Files.readAttributes(Files.java:1737)
  [junit4]   2> 	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:177)
  [junit4]   2> 	at org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:67)
  [junit4]   2> 	at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4698)
  [junit4]   2> 	at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3093)
  [junit4]   2> 	at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3227)
  [junit4]   2> 	at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1136)
  [junit4]   2> 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1179)
  [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:291)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:728)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:911)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
  [junit4]   2> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
  [junit4]   2> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  [junit4]   2> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
  [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[...]
  [junit4]   2> 600005 ERROR (coreContainerWorkExecutor-3250-thread-1-processing-n:127.0.0.1:41308_solr) [n:127.0.0.1:41308_solr    ] o.a.s.c.CoreContainer Error waiting for SolrCore to be created
  [junit4]   2> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [MissingSegmentRecoveryTest_shard1_replica2]
  [junit4]   2> 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  [junit4]   2> 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.lambda$load$4(CoreContainer.java:600)
  [junit4]   2> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
  [junit4]   2> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  [junit4]   2> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  [junit4]   2> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
  [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  [junit4]   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  [junit4]   2> 	at java.lang.Thread.run(Thread.java:745)
  [junit4]   2> Caused by: org.apache.solr.common.SolrException: Unable to create core [MissingSegmentRecoveryTest_shard1_replica2]
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:952)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.lambda$load$3(CoreContainer.java:572)
  [junit4]   2> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
  [junit4]   2> 	... 5 more
  [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
  [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.processCoreCreateException(CoreContainer.java:1011)
  [junit4]   2> 	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:939)
  [junit4]   2> 	... 7 more
  [junit4]   2> 	Suppressed: org.apache.solr.common.SolrException: Error opening new searcher
  [junit4]   2> 		at org.apache.solr.core.SolrCore.<init>(SolrCore.java:964)
  [junit4]   2> 		at org.apache.solr.core.SolrCore.<init>(SolrCore.java:828)
  [junit4]   2> 		at org.apache.solr.core.CoreContainer.create(CoreContainer.java:937)
  [junit4]   2> 		... 7 more
  [junit4]   2> 	Caused by: org.apache.solr.common.SolrException: Error opening new searcher
  [junit4]   2> 		at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
  [junit4]   2> 		at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
  [junit4]   2> 		at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
  [junit4]   2> 		at org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
  [junit4]   2> 		... 9 more
  [junit4]   2> 	Caused by: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")))
  [junit4]   2> 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286)
  [junit4]   2> 		at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
  [junit4]   2> 		at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
  [junit4]   2> 		at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
  [junit4]   2> 		at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
  [junit4]   2> 		at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
  [junit4]   2> 		at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
  [junit4]   2> 		... 12 more
  [junit4]   2> 	Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index/segments_2")
  [junit4]   2> 		at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
  [junit4]   2> 		at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
  [junit4]   2> 		at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
  [junit4]   2> 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:296)
  [junit4]   2> 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
  [junit4]   2> 		... 18 more
  [junit4]   2> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
  [junit4]   2> 	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2005)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2125)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1053)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.<init>(SolrCore.java:937)
  [junit4]   2> 	... 10 more
  [junit4]   2> Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MMapDirectory@/x1/jenkins/jenkins-slave/workspace/Lucene-Solr-Tests-master/solr/build/solr-core/test/J2/temp/solr.cloud.MissingSegmentRecoveryTest_B800C15EC6F11C02-001/tempDir-001/node2/MissingSegmentRecoveryTest_shard1_replica2/data/index.20170228030909468 lockFactory=org.apache.lucene.store.NativeFSLockFactory@74782755): files: [write.lock]
  [junit4]   2> 	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:933)
  [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
  [junit4]   2> 	at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
  [junit4]   2> 	at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
  [junit4]   2> 	at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:114)
  [junit4]   2> 	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1966)
[...]
  [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=MissingSegmentRecoveryTest -Dtests.method=testLeaderRecovery -Dtests.seed=B800C15EC6F11C02 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=fi-FI -Dtests.timezone=Asia/Famagusta -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
  [junit4] FAILURE 94.6s J2 | MissingSegmentRecoveryTest.testLeaderRecovery <<<
  [junit4]    > Throwable #1: java.lang.AssertionError: Expected a collection with one shard and two replicas
  [junit4]    > null
  [junit4]    > Last available state: DocCollection(MissingSegmentRecoveryTest//collections/MissingSegmentRecoveryTest/state.json/6)={
  [junit4]    >   "replicationFactor":"2",
  [junit4]    >   "shards":{"shard1":{
  [junit4]    >       "range":"80000000-7fffffff",
  [junit4]    >       "state":"active",
  [junit4]    >       "replicas":{
  [junit4]    >         "core_node1":{
  [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica2",
  [junit4]    >           "base_url":"https://127.0.0.1:41308/solr",
  [junit4]    >           "node_name":"127.0.0.1:41308_solr",
  [junit4]    >           "state":"down"},
  [junit4]    >         "core_node2":{
  [junit4]    >           "core":"MissingSegmentRecoveryTest_shard1_replica1",
  [junit4]    >           "base_url":"https://127.0.0.1:60247/solr",
  [junit4]    >           "node_name":"127.0.0.1:60247_solr",
  [junit4]    >           "state":"active",
  [junit4]    >           "leader":"true"}}}},
  [junit4]    >   "router":{"name":"compositeId"},
  [junit4]    >   "maxShardsPerNode":"1",
  [junit4]    >   "autoAddReplicas":"false"}
  [junit4]    > 	at __randomizedtesting.SeedInfo.seed([B800C15EC6F11C02:E855595D9FD0AA1F]:0)
  [junit4]    > 	at org.apache.solr.cloud.SolrCloudTestCase.waitForState(SolrCloudTestCase.java:265)
  [junit4]    > 	at org.apache.solr.cloud.MissingSegmentRecoveryTest.testLeaderRecovery(MissingSegmentRecoveryTest.java:105)
[...]
  [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {_version_=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), id=FST50}, docValues:{}, maxPointsInLeafNode=1106, maxMBSortInHeap=6.191537660994534, sim=RandomSimilarity(queryNorm=true): {}, locale=fi-FI, timezone=Asia/Famagusta
  [junit4]   2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=4,threads=1,free=138683768,total=527433728
{noformat}


> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>
>                 Key: SOLR-9836
>                 URL: https://issues.apache.org/jira/browse/SOLR-9836
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>            Assignee: Mark Miller
>             Fix For: 6.5, master (7.0)
>
>         Attachments: SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch, SOLR-9836.patch
>
>
> I have seen several cases where there is a zero-length segments_n file. We haven't identified the root cause of these issues (possibly a poorly timed crash during replication?) but if there is another node available then Solr should be able to recover from this situation. Currently, we log and give up on loading that core, leaving the user to manually intervene.
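
The zero-length segments_n condition described above can be checked for with plain java.nio, as in the minimal sketch below. This is only an illustration, not the code from the SOLR-9836 patches: the class name EmptySegmentsCheck and the method findEmptySegmentsFiles are hypothetical, and the sketch assumes the commit files are regular files named segments_<generation> directly inside the index directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Solr or Lucene.
public class EmptySegmentsCheck {

    /**
     * Returns every file in indexDir whose name matches segments_* and whose
     * length is zero. Other files (for example write.lock) are ignored.
     */
    static List<Path> findEmptySegmentsFiles(Path indexDir) throws IOException {
        List<Path> empty = new ArrayList<>();
        // The glob is matched against the file name of each directory entry.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(indexDir, "segments_*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && Files.size(p) == 0L) {
                    empty.add(p);
                }
            }
        }
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics the reported situation.
        Path dir = Files.createTempDirectory("index");
        Files.createFile(dir.resolve("segments_2"));             // zero-length
        Files.write(dir.resolve("segments_1"), new byte[] {1});  // non-empty
        Files.createFile(dir.resolve("write.lock"));             // not a commit file
        System.out.println(findEmptySegmentsFiles(dir));
    }
}
```

A check like this only detects the symptom. Whether to then recover from another replica, as the issue proposes, is a separate decision that this sketch does not attempt.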


