Posted to issues@kudu.apache.org by "Alexey Serbin (JIRA)" <ji...@apache.org> on 2017/07/07 17:58:00 UTC
[jira] [Commented] (KUDU-694) Re-visit C++ client scan retry logic

    [ https://issues.apache.org/jira/browse/KUDU-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078446#comment-16078446 ] 

Alexey Serbin commented on KUDU-694:
------------------------------------

An update to summarize the current state of affairs (as far as I could see):
* The first item still holds. Marking a server as failed is specific to a tablet, so querying some other tablet on the same server is not affected by the mark set for the prior one. However, it still affects scans with the LEADER_ONLY selector.
* Not failing over to another leader during the call is addressed: if there was an error from the server hosting the leader tablet (or any other tablet), {{LookupRpc::SendRpc()}} will not use the 'fast path' and will instead do server resolution again by calling {{MasterServerProxy::GetTableLocationsAsync()}}.
* The non-retried {{GetTabletServer()}} is retried from the upper level (i.e. in {{KuduScanner::Data::OpenTablet()}}), but a DNS resolution failure in the path of {{KuduClient::Data::GetTabletServer()}} will result in a non-retriable error returned to the top level from {{KuduScanner::Data::OpenTablet()}}. Also, I suspect there are other places like that, so an additional review is needed. Besides, we need to understand whether it makes sense to retry in such cases.
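
To make the last point concrete, here is a minimal sketch of what "retry until the scan deadline" could look like if errors such as a DNS resolution failure were classified as retriable. It is not the Kudu client code: {{OpStatus}}, {{RetryUntilDeadline}}, the error codes and the backoff constants are all hypothetical names and values chosen for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical status type for this sketch; this is NOT kudu::Status.
struct OpStatus {
  enum Code { kOk, kRetriable, kFatal, kTimedOut };
  Code code;
  std::string msg;
  bool ok() const { return code == kOk; }
};

// Calls `op` repeatedly until it succeeds, returns a non-retriable error,
// or the next backoff sleep would run past `deadline`.
// The backoff doubles after every retriable failure, capped at one second.
// If `attempts_out` is non-null, it receives the number of calls made.
OpStatus RetryUntilDeadline(const std::function<OpStatus()>& op,
                            std::chrono::steady_clock::time_point deadline,
                            std::chrono::milliseconds initial_backoff,
                            int* attempts_out = nullptr) {
  using Clock = std::chrono::steady_clock;
  const std::chrono::milliseconds max_backoff(1000);  // assumed cap
  std::chrono::milliseconds backoff = initial_backoff;
  int attempts = 0;
  while (true) {
    ++attempts;
    OpStatus last = op();
    if (attempts_out != nullptr) *attempts_out = attempts;
    // Success and fatal errors are returned to the caller as-is.
    if (last.code != OpStatus::kRetriable) return last;
    // Give up if sleeping would take us past the deadline.
    if (Clock::now() + backoff >= deadline) {
      return OpStatus{OpStatus::kTimedOut,
                      "deadline exceeded; last error: " + last.msg};
    }
    std::this_thread::sleep_for(backoff);
    backoff = std::min(backoff * 2, max_backoff);
  }
}
```

In this sketch the caller would wrap the tablet-opening step in a lambda that maps a transient failure (for example, a failed host name lookup) to {{kRetriable}} and everything else to {{kFatal}}. Whether a DNS failure should really be treated as transient is exactly the open question above.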

> Re-visit C++ client scan retry logic
> ------------------------------------
>
>                 Key: KUDU-694
>                 URL: https://issues.apache.org/jira/browse/KUDU-694
>             Project: Kudu
>          Issue Type: Bug
>          Components: client
>    Affects Versions: Private Beta
>            Reporter: Andrew Wang
>
> There are a number of remaining issues with scanner robustness, even after KUDU-597:
> * Once a node is marked as failed, it will not be used again in the call. This is more of an issue with longer timeouts (since the node is more likely to come back), or if the scan is LEADER_ONLY (since only one node being down leads to unavailability).
> * In the LEADER_ONLY case, since we don't refresh quorum information within the call, we won't recover when a failover happens.
> * The scanner code calls a number of other RPCs that are not retried on error, i.e. LookupTabletByKey or RefreshProxy's DNS resolution in GetTabletServer.


