You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/06/07 15:12:26 UTC

[kudu-CR] WIP: KUDU-1466: improve error message when writes fail at TS

Hello Adar Dembo,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/3326

to review the following change.

Change subject: WIP: KUDU-1466: improve error message when writes fail at TS
......................................................................

WIP: KUDU-1466: improve error message when writes fail at TS

Currently, when we hit certain types of tablet server errors,
we fall back to re-requesting locations from the master. If
the timing of the errors lines up right, the last request to the
master may have a very short time out, in which case we will
misreport the write as failing due to a timeout on the
GetTableLocations() RPC, rather than due to the actual error on the
tablet server.

Injecting a bit of latency into GetTableLocations() reproduces the
issue reliably in ClientTest.TestFailedDnsResolution which is
already quite flaky in TSAN builds due to this issue.

This is a WIP patch as one potential way to solve it -- have the
location picker keep track of the "best" error seen so far. But,
perhaps it's actually better for this to be done in the retriable RPC.
Posting in order to get some comments on the best approach.

Change-Id: I5f1de8159e515cbb5f52fdc440d71370437c1af2
---
M src/kudu/client/client-test.cc
M src/kudu/client/meta_cache.cc
M src/kudu/client/meta_cache.h
M src/kudu/rpc/retriable_rpc.h
4 files changed, 31 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/3326/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I5f1de8159e515cbb5f52fdc440d71370437c1af2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>

[kudu-CR] WIP: KUDU-1466: improve error message when writes fail at TS

Posted by "Kudu Jenkins (Code Review)" <ge...@cloudera.org>.
Kudu Jenkins has posted comments on this change.

Change subject: WIP: KUDU-1466: improve error message when writes fail at TS
......................................................................


Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/1758/

-- 
To view, visit http://gerrit.cloudera.org:8080/3326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f1de8159e515cbb5f52fdc440d71370437c1af2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No

[kudu-CR] WIP: KUDU-1466: improve error message when writes fail at TS

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: WIP: KUDU-1466: improve error message when writes fail at TS
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3326/1//COMMIT_MSG
Commit Message:

PS1, Line 22: But,
            : perhaps it's actually better for this to be done in the retriable RPC.
I'm supportive of the idea, but I'm not sure how it's best implemented.

Do we use a policy that prefers one kind of error over another (i.e. Status::TimedOut is always less interesting than other failures)? If we do something context-free like that, it should be sufficient to pass the "last error" around over the course of the operation, updating it whenever we see an error of higher priority.

Or, do we prefer one error over another based on the operational phase that it occurred (i.e. in tserver operations, lookup failures are always less interesting than actual tserver failures)? This approach suggests we find the appropriate "top-level" object for each operation (i.e. for scans, the KuduScanner itself), track the best error there, and make sure it's available to all phases of the operation to update if necessary.

For more context, we've already got "last" error tracking in KuduScanner::Data and RpcRetrier. If we're going to add it to a third location, let's choose deliberately and understand how all three work together.

The location picker isn't the worst place; it'll make the error available for scans and writes, the main culprits. But it'd be nice to make it available in administrative operations too, which are a little more ad-hoc at the moment.


-- 
To view, visit http://gerrit.cloudera.org:8080/3326
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f1de8159e515cbb5f52fdc440d71370437c1af2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes