Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2018/03/30 16:12:01 UTC
[jira] [Comment Edited] (SOLR-12066) Cleanup deleted core when node start
[ https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420650#comment-16420650 ]
Steve Rowe edited comment on SOLR-12066 at 3/30/18 4:11 PM:
------------------------------------------------------------
Reopening because {{DeleteInactiveReplicaTest.deleteInactiveReplicaTest()}} is now failing 100% of the time without a seed, and {{git bisect}} blames commit {{35bfe89}} on this issue.
E.g. from [https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/21725/]:
{noformat}
Checking out Revision 35bfe897901f1b51bce654b49aecd9560bfa797f (refs/remotes/origin/master)
[...]
[junit4] 2> 189142 ERROR (coreContainerWorkExecutor-659-thread-1-processing-n:127.0.0.1:46875_solr) [n:127.0.0.1:46875_solr ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on startup
[junit4] 2> org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName core_node8 does not exist in shard shard2, ignore the exception if the replica was deleted
[junit4] 2> at org.apache.solr.cloud.ZkController.checkStateInZk(ZkController.java:1739) ~[java/:?]
[junit4] 2> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1637) ~[java/:?]
[junit4] 2> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1044) ~[java/:?]
[junit4] 2> at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:647) ~[java/:?]
[junit4] 2> at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197) ~[metrics-core-3.2.2.jar:3.2.2]
[junit4] 2> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
[junit4] 2> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:192) [java/:?]
[junit4] 2> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
[junit4] 2> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
[junit4] 2> at java.lang.Thread.run(Thread.java:844) [?:?]
[junit4] 2> 189143 INFO (TEST-DeleteInactiveReplicaTest.deleteInactiveReplicaTest-seed#[27851F902A54F9D2]) [ ] o.a.s.SolrTestCaseJ4 ###Ending deleteInactiveReplicaTest
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=DeleteInactiveReplicaTest -Dtests.method=deleteInactiveReplicaTest -Dtests.seed=27851F902A54F9D2 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=fr-CH -Dtests.timezone=America/Panama -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] FAILURE 13.5s J1 | DeleteInactiveReplicaTest.deleteInactiveReplicaTest <<<
[junit4] > Throwable #1: java.lang.AssertionError: Deleted core was still loaded!
[junit4] > at __randomizedtesting.SeedInfo.seed([27851F902A54F9D2:EABB846365F58F30]:0)
[junit4] > at org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest(DeleteInactiveReplicaTest.java:86)
[junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit4] > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit4] > at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit4] > at java.base/java.lang.reflect.Method.invoke(Method.java:564)
[junit4] > at java.base/java.lang.Thread.run(Thread.java:844)
{noformat}
was (Author: steve_rowe):
Reopening because {{DeleteInactiveReplicaTest.deleteInactiveReplicaTest()}} is now failing 100% of the time without a seed, and {{git bisect}} blames commit {{35bfe89}} on this issue.
> Cleanup deleted core when node start
> ------------------------------------
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling, SolrCloud
> Reporter: Varun Thacker
> Assignee: Cao Manh Dat
> Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-12066.patch, SOLR-12066.patch
>
>
> Initially, when SOLR-12047 was created, it looked like waiting only 3 seconds for a state in ZK was the culprit for cores not loading.
>
> But it turns out to be something else. Here are the steps to reproduce the problem:
>
> - Create a 3 node cluster
> - Create a 1 shard x 2 replica collection on node1 and node2 ( [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true] )
> - Stop node2: ./bin/solr stop -p 7574
> - Solr will create a new replica on node3 after 30 seconds because of the ".auto_add_replicas" trigger
> - At this point state.json has info about replicas being on node1 and node3
> - Start node2. Bam!
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
> ...
> Caused by: org.apache.solr.common.SolrException:
> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
> ...
> Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
> ...{code}
>
> The practical effect of this is not big, since the move replica has already put the replica on another JVM. But to the user it's super confusing what's happening. They can never get rid of this error unless they manually clean up the data directory on node2 and restart.
>
> Please note: I chose autoAddReplicas=true to reproduce this, but a user could be using a node lost trigger and run into the same issue.
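
The failure described in the quoted report comes down to a startup check: a core left on disk names a coreNodeName that the shard in state.json no longer lists. Here is a minimal sketch of that check, assuming a simplified in-memory view of the cluster state. The class StaleCoreCheck, the method isStale, the map layout, and the replica names core_node3 and core_node5 are all hypothetical and are neither Solr's API nor the committed patch:

```java
import java.util.Map;
import java.util.Set;

public class StaleCoreCheck {
    // Hypothetical, simplified view of state.json:
    // shard name -> set of coreNodeNames the shard currently lists.
    static boolean isStale(Map<String, Set<String>> shardToReplicas,
                           String shard, String coreNodeName) {
        Set<String> replicas = shardToReplicas.get(shard);
        // A core is treated as stale when its shard is gone or no longer lists it.
        return replicas == null || !replicas.contains(coreNodeName);
    }

    public static void main(String[] args) {
        // Illustrative state after the replica was re-created on another node.
        Map<String, Set<String>> state =
                Map.of("shard1", Set.of("core_node3", "core_node5"));
        // The core left behind on the restarted node (core_node4 in the report).
        System.out.println(isStale(state, "shard1", "core_node4"));
        // A replica that the shard still lists.
        System.out.println(isStale(state, "shard1", "core_node3"));
    }
}
```

In the reproduction scenario, the leftover core on node2 would be flagged by such a check. That is the point where a cleanup could be triggered instead of logging the same load error on every restart.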