Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2016/10/14 19:03:20 UTC

[jira] [Commented] (SOLR-9512) CloudSolrClient's cluster state cache can break direct updates to leaders

    [ https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576189#comment-15576189 ] 

Shalin Shekhar Mangar commented on SOLR-9512:
---------------------------------------------

Noble and I discussed this offline. Here is a summary of the problem and the solution:

There are six cases that we need to tackle. Assuming replica x is leader:
# Case 1: x is disconnected from zk, y becomes leader
** currently -- x throws an error on indexing; the client keeps sending requests to x and they keep failing. This continues until x re-connects and the client gets a stale state flag in the response.
# Case 2: x is dead, y becomes leader
** currently -- the client gets a ConnectException or NoHttpResponseException (for in-flight requests) and keeps retrying the request against x. This continues until x comes back online (see the sketch after this list).
# Case 3: x is disconnected from zk, no one is leader
** currently -- the client keeps sending requests to x, which fail because x is disconnected from zk. This continues until x re-connects and the client gets a stale state flag in the response.
# Case 4: x is dead, no one is leader yet
** currently -- the client gets a ConnectException or NoHttpResponseException (for in-flight requests) and keeps retrying the request against x. This continues until x comes back online.
# Case 5: x is alive but now y is leader
** currently -- the client gets a stale state flag from x and refreshes its cluster state to see y as the new leader. All further indexing requests are sent to y.
# Case 6: client is disconnected from zk
** currently -- the client keeps indexing to x. If it receives a stale state error, it tries to refresh the cluster state from zk, fails, continues sending requests to x, and stays stuck in that cycle.
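
To make cases 1-4 concrete, here is a minimal SolrJ sketch (illustrative only, not from any patch) of how these failures surface to a caller today; the exception types are the ones named in the list, and StaleLeaderFailureDemo/addAndClassify are made-up names:

{code:java}
import java.io.IOException;
import java.net.ConnectException;

import org.apache.http.NoHttpResponseException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

class StaleLeaderFailureDemo {
    /** Sends one document and classifies the failure per the cases above. */
    static void addAndClassify(CloudSolrClient client, SolrInputDocument doc) throws IOException {
        try {
            client.add(doc);
        } catch (SolrServerException e) {
            Throwable cause = e.getRootCause();
            if (cause instanceof ConnectException || cause instanceof NoHttpResponseException) {
                // Cases 2 and 4: the cached leader x is dead; the client just
                // retries the same stale address until x comes back online.
            }
        } catch (SolrException e) {
            // Cases 1 and 3: x is alive but disconnected from zk, so x itself
            // rejects the update; the client keeps re-sending to x regardless.
        }
    }
}
{code}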

Cases 1-5 are solved by a single solution -- on a ConnectException, a NoHttpResponseException, or a leader-disconnected-from-zk error, the client should fetch state from zk again. If the client fetches from zk and does not get a new version, this should be recorded in a flag, and subsequent retries should only happen after N seconds have elapsed or if we know for a fact that the version has changed since the last zk fetch. N could be as small as 2 seconds or so.
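
A minimal sketch of that throttled re-fetch idea, assuming a hypothetical fetchStateFromZk stand-in for the client's real zk read (only DocCollection.getZNodeVersion() is actual SolrJ API here; the flag and the 2-second N come straight from the description above):

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.solr.common.cloud.DocCollection;

/**
 * Sketch of the proposed fix: on a connection-level failure, re-read cluster
 * state from zk, but if the znode version has not advanced, wait at least N
 * seconds before asking zk again.
 */
abstract class ThrottledZkRefresh {
    private static final long N_NANOS = TimeUnit.SECONDS.toNanos(2); // "N seconds"

    private volatile DocCollection cached;
    private volatile int lastSeenVersion = -1;
    private volatile long lastZkFetchNanos;
    private volatile boolean sawNoNewVersion; // the flag described above

    /** Hypothetical stand-in for the client's real zk state read. */
    protected abstract DocCollection fetchStateFromZk(String collection);

    /** Called on ConnectException, NoHttpResponseException, or a leader-disconnected error. */
    DocCollection onConnectionError(String collection) {
        long now = System.nanoTime();
        if (sawNoNewVersion && now - lastZkFetchNanos < N_NANOS) {
            // The last fetch saw no new version: keep the cached state and
            // retry only after N seconds (or when a zk watch reports a change).
            return cached;
        }
        DocCollection fresh = fetchStateFromZk(collection);
        sawNoNewVersion = fresh.getZNodeVersion() == lastSeenVersion;
        lastSeenVersion = fresh.getZNodeVersion();
        lastZkFetchNanos = now;
        cached = fresh;
        return fresh;
    }
}
{code}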

Case 6 is more difficult. Either we keep failing the indexing requests, or we ask a random Solr instance to return the latest cluster state. The latter is kinda dangerous because it can open us up to bugs that are very difficult to debug, so I am inclined to punt on this for now.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -------------------------------------------------------------------------
>
>                 Key: SOLR-9512
>                 URL: https://issues.apache.org/jira/browse/SOLR-9512
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Alan Woodward
>         Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with directUpdatesToLeadersOnly set to true.  CSC caches its view of the DocCollection.  The leader then goes down and is reassigned.  The next time documents are added, CSC checks its cache again and gets the old view of the DocCollection.  It then tries to send the update directly to the old, now-down leader, and we get ConnectionRefused.
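
For reference, a minimal SolrJ sketch of the sequence described above; the zk host and collection name are placeholders, and it assumes the 6.x builder method sendDirectUpdatesToShardLeadersOnly(), which sets directUpdatesToLeadersOnly:

{code:java}
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectLeaderUpdateRepro {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("localhost:2181")              // placeholder zk ensemble
                .sendDirectUpdatesToShardLeadersOnly()     // directUpdatesToLeadersOnly = true
                .build()) {
            client.setDefaultCollection("collection1");    // placeholder collection

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            client.add(doc);   // CSC caches its view of the DocCollection here
            client.commit();

            // ... kill the shard leader and wait for a new leader to be elected ...

            doc.setField("id", "2");
            client.add(doc);   // sent to the old leader from the stale cache:
                               // ConnectionRefused, as described above
        }
    }
}
{code}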


