Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/05/24 06:14:12 UTC
[jira] [Commented] (KUDU-1466) C++ client errors misreported as GetTableLocations timeouts
[ https://issues.apache.org/jira/browse/KUDU-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297746#comment-15297746 ]
Todd Lipcon commented on KUDU-1466:
-----------------------------------
Although this is not a correctness issue, I marked it "critical" because it makes failures very hard to debug: the real underlying tablet-server error is obscured.
> C++ client errors misreported as GetTableLocations timeouts
> -----------------------------------------------------------
>
> Key: KUDU-1466
> URL: https://issues.apache.org/jira/browse/KUDU-1466
> Project: Kudu
> Issue Type: Bug
> Components: client
> Affects Versions: 0.8.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> client-test is currently very flaky due to this issue:
> - we are injecting some kind of failure on the tablet server (e.g. DNS resolution failure)
> - when we fail to connect to the TS, we correctly re-trigger a lookup against the master
> - depending on how the backoffs and retries line up, we sometimes end up triggering the lookup retry when the remaining operation budget is very short (e.g. <10ms)
> -- this GetTableLocations RPC times out since the master is unable to respond within the ridiculously short timeout
> During the course of retrying some operation, we should probably not replace the 'last_error' with a master error, so long as we have had at least one successful master lookup (thus indicating that the master is not the problem).
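A rough sketch of the rule proposed above, to make the intent concrete. This is not the actual client code: RetryState, Error, ErrorSource and the method names are all hypothetical, and the choice to still record a master error when no earlier error exists is my own assumption so that something is always reported.

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration only; not the Kudu client's real classes.
enum class ErrorSource { kNone, kTabletServer, kMaster };

struct Error {
  ErrorSource source;
  std::string message;
};

// Tracks the error that will be reported if the whole operation fails.
class RetryState {
 public:
  // Call when a master lookup succeeds during this operation.
  void RecordMasterLookupSuccess() { had_successful_master_lookup_ = true; }

  // Call when an attempt (tablet server RPC or master lookup) fails.
  void RecordError(const Error& err) {
    // A master error after a successful lookup most likely means the
    // remaining budget was too short, not that the master is the problem,
    // so keep the earlier error instead of overwriting it.
    const bool keep_previous =
        err.source == ErrorSource::kMaster &&
        had_successful_master_lookup_ &&
        last_error_.source != ErrorSource::kNone;  // assumption: never drop the only error
    if (keep_previous) return;
    last_error_ = err;
  }

  const Error& last_error() const { return last_error_; }

 private:
  bool had_successful_master_lookup_ = false;
  Error last_error_{ErrorSource::kNone, ""};
};
```

With this, a sequence of "master lookup OK, TS connect fails, master lookup retry times out" would still surface the tablet server failure as the final error, which is what makes the test failures debuggable.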