You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2020/09/26 00:34:00 UTC

[jira] [Commented] (SOLR-14897) HttpSolrCall will forward a virtually unlimited number of times until ClusterState ZkWatcher is updated after collection delete

    [ https://issues.apache.org/jira/browse/SOLR-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202471#comment-17202471 ] 

Chris M. Hostetter commented on SOLR-14897:
-------------------------------------------

As an example, here is the output of running {{grep sigtest-c7cfa75ce_recs_aggr | grep -e INFO -e 'HttpChannel REQUEST' -e 'HttpChannel action COMPLETE'}} over a solr log where jetty level DEBUGing was turned on to help diagnose SOLR-14896 (with my commentary interspersed, but no lines removed)

It's useful to note while lookng at this log, that the 'sigtest-c7cfa75ce_recs_aggr' only ever had 2 shards, with 2 replicas each...
{noformat}
2020-09-25 06:02:59.175 DEBUG (qtp1800649922-102) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/admin/collections?&json.nl=map&indent=true&name=sigtest-c7cfa75ce_recs_aggr&action=DELETE&wt=json on HttpChannelOverHttp@dd53d71{s=HttpChannelState@56b5eeb8{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=3,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/admin/collections?&json.nl=map&indent=true&name=sigtest-c7cfa75ce_recs_aggr&action=DELETE&wt=json,age=0}
2020-09-25 06:02:59.176 INFO  (qtp1800649922-102) [   ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params json.nl=map&indent=true&name=sigtest-c7cfa75ce_recs_aggr&action=DELETE&wt=json and sendToOCPQueue=true
{noformat}
This node happens to be the one that receives the DELETE COLLECTION request for {{sigtest-c7cfa75ce_recs_aggr}}
{noformat}
2020-09-25 06:02:59.191 INFO  (qtp1800649922-15) [   x:sigtest-c7cfa75ce_recs_aggr_shard1_replica_n2] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.sigtest-c7cfa75ce_recs_aggr.shard1.replica_n2 tag=null
2020-09-25 06:02:59.191 INFO  (qtp1800649922-15) [   x:sigtest-c7cfa75ce_recs_aggr_shard1_replica_n2] o.a.s.m.r.SolrJmxReporter Closing reporter [org.apache.solr.metrics.reporters.SolrJmxReporter@17e0cfa2: rootName = null, domain = solr.core.sigtest-c7cfa75ce_recs_aggr.shard1.replica_n2, service url = null, agent id = null] for registry solr.core.sigtest-c7cfa75ce_recs_aggr.shard1.replica_n2/com.codahale.metrics.MetricRegistry@553013e1
2020-09-25 06:02:59.194 INFO  (qtp1800649922-15) [   ] o.a.s.c.SolrCore [sigtest-c7cfa75ce_recs_aggr_shard1_replica_n2]  CLOSING SolrCore org.apache.solr.core.SolrCore@3f6d08b8
2020-09-25 06:02:59.194 INFO  (qtp1800649922-15) [   ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.sigtest-c7cfa75ce_recs_aggr.shard1.replica_n2 tag=SolrCore@3f6d08b8
2020-09-25 06:02:59.194 INFO  (qtp1800649922-15) [   ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.collection.sigtest-c7cfa75ce_recs_aggr.shard1.leader tag=SolrCore@3f6d08b8
2020-09-25 06:02:59.199 INFO  (qtp1800649922-15) [   ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/sigtest-c7cfa75ce_recs_aggr/terms/shard1 to Terms{values={core_node3=0}, version=2}
2020-09-25 06:02:59.204 INFO  (qtp1800649922-15) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&deleteMetricsHistory=true&core=sigtest-c7cfa75ce_recs_aggr_shard1_replica_n2&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2} status=0 QTime=14
{noformat}
...it also hosts a replica of the collection, and within 30ms of the DELETE COLLECTION request has fully unloaded it's local core.

...BUT...

...we still haven't seen any ZkWatchers (locally) fire for this collection yet, and we're about to be forwarded a /select request by some other node (that seem to think our replica is alive) ...
{noformat}
2020-09-25 06:02:59.243 DEBUG (qtp1800649922-15) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=1 on HttpChannelOverHttp@349d7280{s=HttpChannelState@48bf7d7e{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=155,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=1,age=0}
{noformat}
Since we don't have a core for that collection, we (evidently) forward it to someone else using {{_forwardCount=2}} – who forwards it right back to us.

this repeats for a while, involving at least 2 other nodes (because we don't get "every other" numbered forward) ...
{noformat}
2020-09-25 06:02:59.250 DEBUG (qtp1800649922-19) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=3 on HttpChannelOverHttp@1e1c0a8{s=HttpChannelState@5050c26b{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=95,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=3,age=0}
2020-09-25 06:02:59.274 DEBUG (qtp1800649922-18) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=10 on HttpChannelOverHttp@73304bea{s=HttpChannelState@19accdca{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=105,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=10,age=0}
2020-09-25 06:02:59.280 DEBUG (qtp1800649922-205) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=12 on HttpChannelOverHttp@60d2652d{s=HttpChannelState@6702b72b{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=4,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=12,age=0}
2020-09-25 06:02:59.296 DEBUG (qtp1800649922-100) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=17 on HttpChannelOverHttp@3cf825c0{s=HttpChannelState@148e5cb0{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=22,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=17,age=0}
2020-09-25 06:02:59.304 DEBUG (qtp1800649922-24) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=19 on HttpChannelOverHttp@117eb506{s=HttpChannelState@4b93cdac{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=19,age=0}
2020-09-25 06:02:59.317 DEBUG (qtp1800649922-16) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=23 on HttpChannelOverHttp@2f4e148{s=HttpChannelState@5f1bd187{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=23,age=0}
{noformat}
We're up to 23 forwards in the space of on 74ms before the local ZkWatcher for the cluster state fires, and we finally know that the collection has been deleted....
{noformat}
2020-09-25 06:02:59.325 INFO  (zkCallback-12-thread-15) [   ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/sigtest-c7cfa75ce_recs_aggr/state.json] for collection [sigtest-c7cfa75ce_recs_aggr] has occurred - updating... (live nodes size: [3])
2020-09-25 06:02:59.325 INFO  (zkCallback-12-thread-13) [   ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/sigtest-c7cfa75ce_recs_aggr/state.json] for collection [sigtest-c7cfa75ce_recs_aggr] has occurred - updating... (live nodes size: [3])
{noformat}
One "final" forward comes to us (#25 for those keeping count at home), and we now know there is no point in forwarding it to anyone else, so we start responding back with 404 responses, and the request stack unwinds...
{noformat}
2020-09-25 06:02:59.325 DEBUG (qtp1800649922-99) [   ] o.e.j.s.HttpChannel REQUEST for //dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=25 on HttpChannelOverHttp@258bd7b7{s=HttpChannelState@dd4b25c{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=1,c=false/false,a=IDLE,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=25,age=0}
2020-09-25 06:02:59.377 DEBUG (qtp1800649922-99) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@258bd7b7{s=HttpChannelState@dd4b25c{s=HANDLING rs=COMPLETING os=COMPLETED is=READY awp=false se=false i=false al=1},r=1,c=false/true,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=25,age=52}
2020-09-25 06:02:59.378 DEBUG (qtp1800649922-16) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@2f4e148{s=HttpChannelState@5f1bd187{s=HANDLING rs=COMPLETING os=COMPLETED is=READY awp=false se=false i=false al=1},r=1,c=false/true,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=23,age=61}
2020-09-25 06:02:59.381 DEBUG (qtp1800649922-24) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@117eb506{s=HttpChannelState@4b93cdac{s=HANDLING rs=COMPLETING os=COMPLETED is=READY awp=false se=false i=false al=1},r=1,c=false/true,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=19,age=77}
2020-09-25 06:02:59.381 INFO  (qtp1800649922-102) [   ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={json.nl=map&indent=true&name=sigtest-c7cfa75ce_recs_aggr&action=DELETE&wt=json} status=0 QTime=205
2020-09-25 06:02:59.381 DEBUG (qtp1800649922-102) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@dd53d71{s=HttpChannelState@56b5eeb8{s=HANDLING rs=COMPLETING os=OPEN is=READY awp=false se=false i=false al=1},r=3,c=false/false,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/admin/collections?&json.nl=map&indent=true&name=sigtest-c7cfa75ce_recs_aggr&action=DELETE&wt=json,age=206}
2020-09-25 06:02:59.382 DEBUG (qtp1800649922-100) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@3cf825c0{s=HttpChannelState@148e5cb0{s=HANDLING rs=COMPLETING os=COMPLETED is=READY awp=false se=false i=false al=1},r=22,c=false/true,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=17,age=86}
2020-09-25 06:02:59.385 DEBUG (qtp1800649922-205) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@60d2652d{s=HttpChannelState@6702b72b{s=HANDLING rs=COMPLETING os=COMPLETED is=READY awp=false se=false i=false al=1},r=4,c=false/true,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=12,age=105}
2020-09-25 06:02:59.386 DEBUG (qtp1800649922-18) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@73304bea{s=HttpChannelState@19accdca{s=HANDLING rs=COMPLETING os=COMPLETED is=READY awp=false se=false i=false al=1},r=105,c=false/true,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=10,age=112}
2020-09-25 06:02:59.397 DEBUG (qtp1800649922-15) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@349d7280{s=HttpChannelState@48bf7d7e{s=HANDLING rs=COMPLETING os=OPEN is=READY awp=false se=false i=false al=1},r=155,c=false/false,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=1,age=154}
2020-09-25 06:02:59.398 DEBUG (qtp1800649922-19) [   ] o.e.j.s.HttpChannel action COMPLETE HttpChannelOverHttp@1e1c0a8{s=HttpChannelState@5050c26b{s=HANDLING rs=COMPLETING os=ABORTED is=READY awp=false se=false i=false al=1},r=95,c=false/false,a=HANDLING,uri=//dzmitry-solr-1.dzmitry-solr-headless:8983/solr/sigtest-c7cfa75ce_recs_aggr/select?_forwardedCount=3,age=148}
{noformat}
The kicker of all of this: prior to being deleted, the 'sigtest-c7cfa75ce_recs_aggr' only ever had 2 shards, with 2 replicas each – so even if it had taken 30minutes for the ZkWatcher to fire on _ANY_ node, as soon as a node saw {{_forwardedCount=5}} it should have known something was wrong and stop forwarding.

> HttpSolrCall will forward a virtually unlimited number of times until ClusterState ZkWatcher is updated after collection delete
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14897
>                 URL: https://issues.apache.org/jira/browse/SOLR-14897
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> While investigating the root cause of some SOLR-14896 related failures, I have seen evidence that if a collection is deleted, but a client makes a subequent request for that collection _before_ the local ClusterState has been updated to remove that DocCollection, HttpSolrCall will forward/proxy that request a (virtually) unbounded number of times in a very short time period - stopping only once the the "cached" local DocCollection is updated to indicate there are no active replicas.**
> While HttpSolrCall does track & increment a {{_forwardedCount}} param on every request it forwards, it doesn't consult that request unless/until it finds a situation where the (local) DocCollection says there are no active replicas.
> So if you have a collection XX with 4 total replicas on 4 diff nodes (A,B,C,D), and and you delete XX (triggering sequential core deletions on A,B,C,D that fire successive ZkWatchers on various nodes to update the collection state) a request for XX can bounce back and forth between nodes C & D 20+ times until the ClusterState watcher fires on both of those nodes so they finally realize that the {{_forwardedCount=20}} is more the the 0 active replicas...
> In the below code snippet from HttpSolrCall, the first call to {{getCoreUrl(...)}} is expected to return null if there are no active replicas - but it uses the local cached DocCollection, which may _think_ there is an active replica on another node, so it forwards the request to that node - where the replica may have been deleted, so that node runs hte same code and may forward the request right back to the original node....
> {code:java}
>     String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>         activeSlices, byCoreName, true);
>     // Avoid getting into a recursive loop of requests being forwarded by
>     // stopping forwarding and erroring out after (totalReplicas) forwards
>     if (coreUrl == null) {
>       if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
>         throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
>             "No active replicas found for collection: " + collectionName);
>       }
>       coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>           activeSlices, byCoreName, false);
>     }
> {code}
> ..the check that is suppose to prevent a "recursive loop" is only consulted once a situation arises where local ClusterState indicates there are no active replicas - which seems to defeat the point of the forward check?  (at which point if the total number of replicas hasn't been exceeded, the code is happy to forward the request to a coreUrl which the local ClusterState indicates is _not_ active (which also sems to defeat the point?)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org