Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2019/07/13 14:58:00 UTC

[jira] [Updated] (SOLR-13627) newly created collection can see all replicas go into recovery immediately on first document addition

     [ https://issues.apache.org/jira/browse/SOLR-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13627:
----------------------------
    Attachment: apache_Lucene-Solr-NightlyTests-8.x_143.log.txt
        Status: Open  (was: Open)


I'm attaching {{apache_Lucene-Solr-NightlyTests-8.x_143.log.txt}}, which contains the full test logs, but the gist of the situation is that these lines of code...

{code}
// NOTE: legacyCloud == false

// set the cluster property, then create a 1 shard x 3 NRT replica collection
CollectionAdminRequest.setClusterProperty(ZkStateReader.LEGACY_CLOUD, legacyCloud).process(cluster.getSolrClient());
final String collectionName = "deleteFromClusterState_"+legacyCloud;
CollectionAdminRequest.createCollection(collectionName, "conf", 1, 3)
    .process(cluster.getSolrClient());

// block until all 3 replicas are reported active
cluster.waitForActiveCollection(collectionName, 1, 3);

// add two docs and commit -- this is all it takes to trigger the recoveries
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "1"));
cluster.getSolrClient().add(collectionName, new SolrInputDocument("id", "2"));
cluster.getSolrClient().commit(collectionName);
{code}

...result in the following key bits of logging...

{noformat}

# set the cluster prop...

   [junit4]   2> 4266069 INFO  (qtp1645118155-93196) [n:127.0.0.1:38694_solr     ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :clusterprop with params val=false&name=legacyCloud&action=CLUSTERPROP&wt=javabin&version=2 and sendToOCPQueue=true

#
# Collection creation, w/leader election...
#

   [junit4]   2> 4266070 INFO  (qtp1645118155-93196) [n:127.0.0.1:38694_solr     ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={val=false&name=legacyCloud&action=CLUSTERPROP&wt=javabin&version=2} status=0 QTime=1
   [junit4]   2> 4266071 INFO  (qtp1645118155-93197) [n:127.0.0.1:38694_solr     ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params collection.configName=conf&name=deleteFromClusterState_false&nrtReplicas=3&action=CREATE&numShards=1&wt=javabin&version=2 and sendToOCPQueue=true

...

   [junit4]   2> 4266757 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr    x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node5&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n2&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT
   [junit4]   2> 4266814 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr    x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node6&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n3&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT
   [junit4]   2> 4266832 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr    x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&coreNodeName=core_node4&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n1&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT

...
   [junit4]   2> 4268946 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node5=0}, version=0}
...
   [junit4]   2> 4269040 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Failed to save terms, version is not a match, retrying
   [junit4]   2> 4269040 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node4=0, core_node5=0}, version=1}
   [junit4]   2> 4269040 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.c.ShardLeaderElectionContextBase make sure parent is created /collections/deleteFromClusterState_false/leaders/shard1
...
   [junit4]   2> 4269106 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node6=0, core_node4=0, core_node5=0}, version=2}
...

   [junit4]   2> 4269487 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ShardLeaderElectionContext Enough replicas found to continue.
   [junit4]   2> 4269487 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync

...

   [junit4]   2> 4275795 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ShardLeaderElectionContext We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway

...

   [junit4]   2> 4275869 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkController I am the leader, no recovery necessary
   [junit4]   2> 4275871 INFO  (qtp1544834100-93213) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&coreNodeName=core_node5&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n2&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT} status=0 QTime=9113
...
   [junit4]   2> 4276828 INFO  (qtp1645118155-93199) [n:127.0.0.1:38694_solr c:deleteFromClusterState_false s:shard1 r:core_node4 x:deleteFromClusterState_false_shard1_replica_n1 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&coreNodeName=core_node4&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n1&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT} status=0 QTime=9996
   [junit4]   2> 4276828 INFO  (qtp1240928045-93187) [n:127.0.0.1:44654_solr c:deleteFromClusterState_false s:shard1 r:core_node6 x:deleteFromClusterState_false_shard1_replica_n3 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&coreNodeName=core_node6&collection.configName=conf&newCollection=true&name=deleteFromClusterState_false_shard1_replica_n3&action=CREATE&numShards=1&collection=deleteFromClusterState_false&shard=shard1&wt=javabin&version=2&replicaType=NRT} status=0 QTime=10001

...

   [junit4]   2> 4277031 INFO  (qtp1645118155-93197) [n:127.0.0.1:38694_solr     ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={collection.configName=conf&name=deleteFromClusterState_false&nrtReplicas=3&action=CREATE&numShards=1&wt=javabin&version=2} status=0 QTime=10959

#
# Collection creation finished, now add some docs...
#

   [junit4]   2> 4277068 INFO  (qtp1544834100-93215) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/deleteFromClusterState_false/terms/shard1 to Terms{values={core_node6=0, core_node4=0, core_node5=1}, version=3}
   [junit4]   2> 4277068 INFO  (qtp1544834100-93215) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.u.p.LogUpdateProcessorFactory [deleteFromClusterState_false_shard1_replica_n2]  webapp=/solr path=/update params={wt=javabin&version=2}{add=[1 (1638426490843430912)]} 0 15
   [junit4]   2> 4277190 INFO  (zkCallback-14036-thread-2) [     ] o.a.s.c.RecoveringCoreTermWatcher Start recovery on core_node4 because core's term is less than leader's term
   [junit4]   2> 4277190 INFO  (qtp1544834100-93216) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.u.p.LogUpdateProcessorFactory [deleteFromClusterState_false_shard1_replica_n2]  webapp=/solr path=/update params={wt=javabin&version=2}{add=[2 (1638426490986037248)]} 0 2
   [junit4]   2> 4277190 INFO  (zkCallback-14034-thread-2) [     ] o.a.s.c.RecoveringCoreTermWatcher Start recovery on core_node6 because core's term is less than leader's term
   [junit4]   2> 4277191 INFO  (qtp1544834100-93212) [n:127.0.0.1:37510_solr c:deleteFromClusterState_false s:shard1 r:core_node5 x:deleteFromClusterState_false_shard1_replica_n2 ] o.a.s.u.DirectUpdateHandler2 start commit{_version_=1638426490987085824,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
{noformat}

The leader's (core_node5) request-handling thread that processes the very first "add doc(1)" command also bumps its own entry in the ZkShardTerms to "1", and that seems to immediately cause zkCallback threads (on the other 2 nodes?) to decide that core_node4 and core_node6 need to start recovery.
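
For context, the trigger itself appears to be a simple term comparison. Here's a minimal sketch of the check the {{RecoveringCoreTermWatcher}} log lines imply (an illustrative simplification, not the actual Solr code; the class, method, and variable names here are mine):

{code}
// Illustrative simplification of the check implied by the
// "core's term is less than leader's term" log lines above.
// NOT the actual RecoveringCoreTermWatcher code; names are hypothetical.
import java.util.Map;

class TermCheckSketch {
  /** terms as stored at /collections/<coll>/terms/shard1 */
  static boolean needsRecovery(Map<String, Long> terms, String me, String leader) {
    long myTerm = terms.getOrDefault(me, 0L);
    long leaderTerm = terms.getOrDefault(leader, 0L);
    return myTerm < leaderTerm; // true => "Start recovery on <me> ..."
  }

  public static void main(String[] args) {
    // state after the first add: leader bumped itself to 1, followers still at 0
    Map<String, Long> terms = Map.of("core_node4", 0L, "core_node5", 1L, "core_node6", 0L);
    System.out.println(needsRecovery(terms, "core_node4", "core_node5")); // true
    System.out.println(needsRecovery(terms, "core_node6", "core_node5")); // true
  }
}
{code}

If that reading is right, the watcher behavior is purely mechanical; the real puzzle is why the leader's term got bumped to 1 (leaving both followers at 0) for a plain, apparently successful add.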

why??????

----

FWIW: this test ultimately fails because:
* the (random) replica it chooses to delete (via direct {{OverseerAction.DELETECORE}}) happens to be the leader
* this triggers the problems described in SOLR-13616: watchers set by PrepRecovery block forever because they can't find the leader SolrCore
* the test has a brittle assumption that once it confirms the core is unloaded, it should be able to {{waitFor(...)}} the size of the "watchers" list for the collection to be exactly 1 less than it was at the start of the test (see the sketch after this list)
** but this can never happen because of the (stalled) PrepRecovery watchers
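
Concretely, the brittle part looks something like this (a paraphrased sketch of the pattern described above; the helper names are hypothetical, NOT the verbatim {{DeleteReplicaTest}} code):

{code}
// Paraphrased sketch of the assertion pattern described above; helper
// names are hypothetical, NOT the verbatim test code.
final int watchersBefore = getCollectionWatcherCount(collectionName); // recorded at test start

deleteReplicaDirectly(replica); // via OverseerAction.DELETECORE
waitForCoreUnloaded(replica);   // this part succeeds

// Brittle: assumes the watcher count drops by exactly one. The stalled
// PrepRecovery watchers (SOLR-13616) never go away, so this times out.
waitFor(() -> getCollectionWatcherCount(collectionName) == watchersBefore - 1);
{code}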


...but these are all small things that can be fixed (or worked around) in the test ... a much more serious/severe concern I have is:

*Why do these replicas instantly go into recovery the first time a document is added to this brand new collection ????*


> newly created collection can see all replicas go into recovery immediately on first document addition
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13627
>                 URL: https://issues.apache.org/jira/browse/SOLR-13627
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: apache_Lucene-Solr-NightlyTests-8.x_143.log.txt
>
>
> There's something very weird going on that popped up in a recent Jenkins run of {{DeleteReplicaTest.deleteReplicaFromClusterState}}.  While the test has some issues of its own, and ultimately failed due to a combination of SOLR-13616 + some sloppy assertions (which I will attempt to address independently of this Jira), a more alarming situation is what the logs show at the _beginning_ of the test, before any problems occurred.
> In a nutshell: *Just the act of creating a 1x3 collection and adding some docs to it caused both of the non-leader replicas to immediately decide they needed to go into recovery.*
> Details to follow in comments...


