Posted to issues@lucene.apache.org by "Munendra S N (Jira)" <ji...@apache.org> on 2020/09/28 13:49:00 UTC

[jira] [Updated] (SOLR-14897) HttpSolrCall will forward a virtually unlimited number of times until ClusterState ZkWatcher is updated after collection delete

     [ https://issues.apache.org/jira/browse/SOLR-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Munendra S N updated SOLR-14897:
--------------------------------
    Attachment: SOLR-14897.patch

> HttpSolrCall will forward a virtually unlimited number of times until ClusterState ZkWatcher is updated after collection delete
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-14897
>                 URL: https://issues.apache.org/jira/browse/SOLR-14897
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-14897.patch
>
>
> While investigating the root cause of some SOLR-14896 related failures, I have seen evidence that if a collection is deleted, but a client makes a subsequent request for that collection _before_ the local ClusterState has been updated to remove that DocCollection, HttpSolrCall will forward/proxy that request a (virtually) unbounded number of times in a very short time period - stopping only once the "cached" local DocCollection is updated to indicate there are no active replicas.
> While HttpSolrCall does track & increment a {{_forwardedCount}} param on every request it forwards, it doesn't consult that count unless/until it finds a situation where the (local) DocCollection says there are no active replicas.
> So if you have a collection XX with 4 total replicas on 4 different nodes (A,B,C,D), and you delete XX (triggering sequential core deletions on A,B,C,D that fire successive ZkWatchers on various nodes to update the collection state), a request for XX can bounce back and forth between nodes C & D 20+ times until the ClusterState watcher fires on both of those nodes so they finally realize that the {{_forwardedCount=20}} is more than the 0 active replicas...
> In the below code snippet from HttpSolrCall, the first call to {{getCoreUrl(...)}} is expected to return null if there are no active replicas - but it uses the local cached DocCollection, which may _think_ there is an active replica on another node, so it forwards the request to that node - where the replica may have been deleted, so that node runs the same code and may forward the request right back to the original node....
> {code:java}
>     String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>         activeSlices, byCoreName, true);
>     // Avoid getting into a recursive loop of requests being forwarded by
>     // stopping forwarding and erroring out after (totalReplicas) forwards
>     if (coreUrl == null) {
>       if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas){
>         throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
>             "No active replicas found for collection: " + collectionName);
>       }
>       coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>           activeSlices, byCoreName, false);
>     }
> {code}
> ..the check that is supposed to prevent a "recursive loop" is only consulted once a situation arises where the local ClusterState indicates there are no active replicas - which seems to defeat the point of the forward check?  At that point, if the total number of replicas hasn't been exceeded, the code is happy to forward the request to a coreUrl which the local ClusterState indicates is _not_ active (which also seems to defeat the point?)
>  
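To make the bouncing described above concrete, here is a minimal, self-contained sketch of the two orderings of the forward-count check. It is not HttpSolrCall code and not the attached patch: the class {{ForwardLoopSketch}}, the method {{simulate}}, and the parameter {{hopsUntilWatchersFire}} are hypothetical names invented for illustration. The model assumes two nodes whose cached state stays stale for a fixed number of hops, and it simplifies the "no active replicas" branch to "stop here".

```java
// Hypothetical model of the forwarding loop; none of these names exist in Solr.
final class ForwardLoopSketch {

    /**
     * Counts how many times one request is forwarded between two nodes whose
     * cached cluster state is stale for the first hopsUntilWatchersFire hops.
     *
     * checkCountFirst = false models the ordering in the quoted snippet: the
     * forward count only matters once the local view shows no active replicas.
     * checkCountFirst = true models consulting the count before any forward.
     */
    static int simulate(boolean checkCountFirst, int totalReplicas, int hopsUntilWatchersFire) {
        int forwardedCount = 0; // stands in for the _forwardedCount request param
        int hop = 0;            // hops handled so far, across both nodes

        while (true) {
            // While stale, the handling node still "thinks" the other node
            // hosts an active replica, so an active coreUrl is found.
            boolean localViewStale = hop < hopsUntilWatchersFire;

            if (checkCountFirst && forwardedCount > totalReplicas) {
                return forwardedCount; // error out before forwarding again
            }
            if (!localViewStale) {
                // Local state now shows no active replicas. Simplification:
                // the request ends here (the real code may still try an
                // inactive replica if the count has not been exceeded).
                return forwardedCount;
            }
            forwardedCount++; // forward to the other node
            hop++;
        }
    }

    public static void main(String[] args) {
        // 4 replicas, watchers fire only after 20 hops (the scenario above).
        System.out.println(simulate(false, 4, 20)); // prints 20
        System.out.println(simulate(true, 4, 20));  // prints 5
    }
}
```

Under these assumptions, the number of forwards in the first ordering is bounded only by how long the watchers take to fire, while in the second it is capped at {{totalReplicas + 1}} regardless of watcher timing.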


