You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2016/11/01 16:50:03 UTC

[kudu-CR] [java client] Implement RPC tracing, part 1

Adar Dembo has posted comments on this change.

Change subject: [java client] Implement RPC tracing, part 1
......................................................................

Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4781/3//COMMIT_MSG
Commit Message:

Line 30: RpcTraceObject{rpcMethod='Write', timestampMs=1477079973579, action=SEND_TO_SERVER, server=0353a6d97d6c49f9a727bc1ee6c3393e},
> Yeah it should follow it, except the trace I picked up didn't have it becau
Hmm, how is that possible? Did you ctrl-c the program or something? Or is it really possible for a trace to terminate here?

http://gerrit.cloudera.org:8080/#/c/4781/3/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

Line 679: .action(RpcTraceObject.Action.PICKED_REPLICA)
> It shows the flow, it went this way instead of going to the master. Else yo
But we emit PICKED_REPLICA right before a master RPC too (e.g. GetTableLocations, from the commit description), so it doesn't actually disambiguate anything.

I think this could be useful if it emitted some information about which replica was chosen, or why, but even that is somewhat unnecessary given that SEND_TO_SERVER provides the server's UUID.

Line 1001: rpc.setTimeoutMillis(parentRpc.deadlineTracker.getMillisBeforeDeadline());
> Yeah... I fixed up this up while writing this patch. There's not that many
Alright, then please update the commit message to reflect these changes.

http://gerrit.cloudera.org:8080/#/c/4781/3/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduScanner.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduScanner.java:

Line 789: public String toString() {
> All the RPCs are kind of following the same format already, changing it her
Sure, no biggy.

http://gerrit.cloudera.org:8080/#/c/4781/3/java/kudu-client/src/main/java/org/apache/kudu/client/KuduRpc.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduRpc.java:

Line 78: private KuduRpc<?> parentRpc;
> We can represent the flush as being the first trace for a Write batch RPC.
I disagree pretty strongly with your position.

I've made my position (that the top-level object tracking an operation should NOT be an RPC) known for a while now, in several different places:
- Alexey's "client timeout diagnosis" doc: https://docs.google.com/document/d/1rNhSY8kz89os0nl84-J_mUi6KRPEgb_mkpGRIDFRWjc/edit?disco=AAAAAyn-wlc
- An earlier attempt by Todd to improve a write failure error message: https://gerrit.cloudera.org/#/c/3326
- A JIRA somewhere that, sadly, I can't find now.

Let me see if I can convince you, point by point.

1. It's true that, by and large, RPCs are the only "context" objects we're tracking. However, for Flush(), we have Buffer and Batcher (C++ and Java respectively), which could serve as top-level anchors for user-facing operations. Ignoring those for a second, I accept that what I'm proposing implies a larger effort investment, but I remain convinced that it'll yield clearer information in the long run.

2. I think we should always use an operational (vs. an RPC-based) form of tracking, even for single-RPC operations, because that gives us the most flexibility. For example, it makes it much easier to convert that operation into a multi-RPC operation in the future.

3. The problem is that "following a single RPC and everything it spawns" isn't particularly useful unless you also know how Flush() fans out. That is, for it to be useful you need to combine it with internal knowledge of how Flush() is implemented, which you have. I do agree that we'd need to use a tree to properly manage a fan-out flow of RPCs anchored by a top-level Flush().

4. Even the per-RPC trace approach you have now needs to be associated in some way to the Flush(), doesn't it? Or to an individual operation? If there's no association at all, how will the user know which part of their application failed? And if there is an association, it must be an unnatural one, because RPCs aren't something the user interacts with directly.

Here are a couple more (somewhat abstract) thoughts.

Both clients have been implemented under a design pattern that I'll call "RPC as a first class citizen". In both cases, it's led to spaghetti-like code as RPC logic becomes intertwined with business logic. Per-RPC tracing reinforces that pattern; an operation-based approach would begin the process of chipping away at it.

Somewhat related to the above, we've already stretched the definition of "RPC" somewhat to fit how our thick clients use them. In my mind, an RPC is a remote procedure call initiated by one node on another. But, when an RPC fails, we talk about "retrying it" (instead of issuing a second RPC). Sometimes an RPC fails and we "retry it to another node" (instead of issuing a second RPC to another node). It gets especially weird when, in order to send a Write RPC, we first need to send a GetTableLocations RPC to a master. Now suppose we've got Kerberos-based security; do we model KDC authentication requests as RPCs as well? Our mental model would be simpler if the underlying basis wasn't an RPC, but something else. Why share this brain damage with our users?

Line 222: * Add the provided trace to this RPC's collection of traces. If this RPC has a parent RPC, it
> I think it's a most two? Do you have an optimization in mind?
No, I just wanted clarification on the possible shapes of the graph.

--
To view, visit http://gerrit.cloudera.org:8080/4781
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I69ef56acc071b9f80b34e38c1821df4096f54907
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes