Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2019/07/13 14:58:00 UTC
[jira] [Updated] (SOLR-13627) newly created collection can see all
replicas go into recovery immediately on first document addition
[ https://issues.apache.org/jira/browse/SOLR-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated SOLR-13627:
----------------------------
Attachment: apache_Lucene-Solr-NightlyTests-8.x_143.log.txt
Status: Open (was: Open)
I'm attaching the full test logs as {{apache_Lucene-Solr-NightlyTests-8.x_143.log.txt}}, but the gist of the situation is that these lines of code...
{code}
// NOTE: legacyCloud == false
CollectionAdminRequest.setClusterProperty(ZkStateReader.LEGACY_CLOUD, legacyCloud).process(cluster.getSolrClient());
final String collectionName = "deleteFromClusterState_"+legacyCloud;
CollectionAdminRequest.createCollection(collectionName, "conf", 1, 3)
.process(cluster.getSolrClient());
cluster.waitForActiveCollection(collectionName, 1, 3);
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "1"));
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "2"));
cluster.getSolrClient().commit(collectionName);
{code}
Results in the following key bits of logging...
{noformat}
# set the cluster prop...
[junit4] 2> 4266069 INFO (qtp1645118155-93196) [n:127.0.0.1:38694_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :clusterprop with params val=false&name=legacyCloud&action=CLUSTERPROP&wt=javabin&version=2 and sendToOCPQueue=true
#
# Collection creation, w/leader election...
#
[junit4] 2> 4266070 INFO (qtp1645118155-93196) [n:127.0.0.1:38694_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={val=false&name=legacyCloud&action=CLUSTERPROP&wt=javabin&version=2} status=0 QTime=1
[junit4] 2> 4266071 INFO (qtp1645118155-93197) [n:127.0.0.1:38694_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params collection.configName=conf&name=deleteFromClusterState_false&nrtReplicas=3&action=CREATE&numShards=1&wt=javabin&version=2 and sendToOCPQueue=true
...
[junit4] 2> 4266757 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node5&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n2&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT
[junit4] 2> 4266814 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node6&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n3&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT
[junit4] 2> 4266832 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node4&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n1&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT
...
[junit4] 2> 4268946 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node5=0}, version=0}
...
[junit4] 2> 4269040 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
[junit4] 2> 4269040 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node4=0, core_node5=0}, version=1}
[junit4] 2> 4269040 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ShardLeaderElectionContextBase make sure parent is created /collections/deleteFromClusterState_false/leaders/shard1
...
[junit4] 2> 4269106 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node6=0, core_node4=0, core_node5=0}, version=2}
...
[junit4] 2> 4269487 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ShardLeaderElectionContext Enough replicas found to continue.
[junit4] 2> 4269487 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync
...
[junit4] 2> 4275795 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ShardLeaderElectionContext We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
...
[junit4] 2> 4275869 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkController I am the leader, no recovery necessary
[junit4] 2> 4275871 INFO (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&coreNodeName=core_node5&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n2&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT} status=0 QTime=9113
...
[junit4] 2> 4276828 INFO (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&coreNodeName=core_node4&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n1&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT} status=0 QTime=9996
[junit4] 2> 4276828 INFO (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&coreNodeName=core_node6&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n3&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT} status=0 QTime=10001
...
[junit4] 2> 4277031 INFO (qtp1645118155-93197) [n:127.0.0.1:38694_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={collection.configName=conf&name=deleteFromClusterState_false&nrtReplicas=3&action=CREATE&numShards=1&wt=javabin&version=2} status=0 QTime=10959
#
# Collection creation finished, now add some docs...
#
[junit4] 2> 4277068 INFO (qtp1544834100-93215) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node6=0, core_node4=0, core_node5=1}, version=3}
[junit4] 2> 4277068 INFO (qtp1544834100-93215) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.u.p.LogUpdateProcessorFactory [deleteFromClusterState_false_shard1_replica_n2] webapp=/solr path=/update params={wt=javabin&version=2}{add=[1 (1638426490843430912)]} 0 15
[junit4] 2> 4277190 INFO (zkCallback-14036-thread-2) [ ] o.a.s.c.RecoveringCoreTermWatcher Start recovery on core_node4 because core's term is less than leader's term
[junit4] 2> 4277190 INFO (qtp1544834100-93216) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.u.p.LogUpdateProcessorFactory [deleteFromClusterState_false_shard1_replica_n2] webapp=/solr path=/update params={wt=javabin&version=2}{add=[2 (1638426490986037248)]} 0 2
[junit4] 2> 4277190 INFO (zkCallback-14034-thread-2) [ ] o.a.s.c.RecoveringCoreTermWatcher Start recovery on core_node6 because core's term is less than leader's term
[junit4] 2> 4277191 INFO (qtp1544834100-93212) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.u.DirectUpdateHandler2 start commit{_version_=1638426490987085824,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
{noformat}
The leader's (core_node5) request-handling thread that processes the very first "add doc(1)" command also bumps the leader's own entry in ZkShardTerms to "1", and that seems to immediately cause zkCallback threads (on the other 2 nodes?) to decide that core_node4 and core_node6 need to start recovery.
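To make the trigger concrete, here's a minimal, self-contained sketch of the term comparison that appears to fire here. This is NOT Solr's actual RecoveringCoreTermWatcher/ZkShardTerms code; the class and method names are hypothetical, illustrating only the rule that a replica whose term is below the highest (i.e. the leader's) term in the shard's terms map is considered behind and kicked into recovery:

```java
import java.util.Map;

// Hypothetical sketch (not Solr's real code): a replica "needs recovery"
// whenever its term is below the maximum term in the shard's terms map.
class TermCheck {
    static boolean needsRecovery(Map<String, Long> terms, String coreNodeName) {
        long max = terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
        return terms.getOrDefault(coreNodeName, 0L) < max;
    }

    public static void main(String[] args) {
        // Terms state from the log right after the first add (version=3):
        // Terms{core_node6=0, core_node4=0, core_node5=1}
        Map<String, Long> terms =
            Map.of("core_node6", 0L, "core_node4", 0L, "core_node5", 1L);
        System.out.println(needsRecovery(terms, "core_node4")); // true  -> recovery fires
        System.out.println(needsRecovery(terms, "core_node6")); // true  -> recovery fires
        System.out.println(needsRecovery(terms, "core_node5")); // false -> leader is fine
    }
}
```

Under that rule, the single term bump to core_node5=1 is sufficient to flag both followers at term 0, which matches the two "Start recovery on core_node4/core_node6" log lines above.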
why??????
----
FWIW: this test ultimately fails because:
* the (random) replica it chooses to delete (via direct {{OverseerAction.DELETECORE}}) happens to be the leader
* this triggers the problems described in SOLR-13616: watchers set by PrepRecovery block forever because they can't find the leader SolrCore
* the test has a brittle assumption that once it confirms the core is unloaded, it should be able to {{waitFor(...)}} the size of the "watchers" list for the collection to be exactly 1 less than it was at the start of the test
** but this can never happen because of the (stalled) PrepRecovery watchers
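That brittle wait is roughly this shape (a paraphrase, not the test's literal code; the names and timings here are illustrative only): poll until the watcher count drops by exactly one, which can never succeed while a stalled PrepRecovery watcher keeps the count pinned:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Illustrative paraphrase of the brittle waitFor: poll until the supplied
// watcher count hits the expected value, or give up at the deadline.
class WaitForSketch {
    static boolean waitFor(long timeoutMs, int expected, IntSupplier watcherCount) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (watcherCount.getAsInt() == expected) return true;
            try { Thread.sleep(20); } catch (InterruptedException e) { return false; }
        }
        return false; // the test's assertion fails on this path
    }

    public static void main(String[] args) {
        // Say there were 3 watchers at the start; the test expects 2 after the
        // core is unloaded, but the stuck PrepRecovery watcher holds it at 3,
        // so the wait can only time out.
        System.out.println(waitFor(200, 2, () -> 3)); // false
    }
}
```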
...but these are all small things that can be fixed (or worked around) in the test ... a much more serious/severe concern I have is:
*Why do these replicas instantly go into recovery the first time a document is added to this brand new collection ????*
> newly created collection can see all replicas go into recovery immediately on first document addition
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-13627
> URL: https://issues.apache.org/jira/browse/SOLR-13627
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: apache_Lucene-Solr-NightlyTests-8.x_143.log.txt
>
>
> There's something very weird going on that popped up in a recent jenkins run of {{DeleteReplicaTest.deleteReplicaFromClusterState}}. While the test has some issues of its own, and ultimately failed due to a combination of SOLR-13616 + some sloppy assertions (which I will attempt to address independently of this Jira), a more alarming situation is what the logs show at the _beginning_ of the test, before any problems occurred.
> In a nutshell: *Just the act of creating a 1x3 collection and adding some docs to it caused both of the non-leader replicas to immediately decide they needed to go into recovery.*
> Details to follow in comments...
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org