Posted to dev@htrace.apache.org by Roberto Attias <ro...@yahoo.com.INVALID> on 2016/09/10 18:51:30 UTC

HTrace API comments

Hello,

I have some comments/concerns regarding the HTrace API, and was wondering whether extensions/changes would be considered. I'm listing the most important ones here; if there is interest, we can discuss them in more detail.

1) From the HTrace Developer Guide: 

TraceScope objects manage the lifespan of Span objects. When a TraceScope is created, it often comes with an associated Span object. When this scope is closed, the Span will be closed as well. “Closing” the scope means that the span is sent to a SpanReceiver for processing.


One implication of this model is that nested spans (for example, spans instrumenting nested function calls) are delivered to the receiver in reverse order, as the innermost function completes before the outermost. This may add complexity to the logic in the span receiver.

Also, because information about a span is not delivered until the span is closed, the model relies on the program not terminating abruptly. In Java this is not much of a problem, but in C, what happens if a series of nested function calls is instrumented with spans and the innermost function crashes? As far as I can tell, none of the spans are delivered. This makes the tracing API unreliable for bug discovery.

Would you consider a change where each API call produces at least one event sent to the SpanReceiver? 
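To make the ordering concern concrete, here is a minimal sketch using the htrace-core4 Java API (Tracer, TraceScope, a stdout span receiver); the configuration keys and the receiver class name are from memory and may differ between HTrace versions:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.htrace.core.HTraceConfiguration;
    import org.apache.htrace.core.TraceScope;
    import org.apache.htrace.core.Tracer;

    public class NestedSpanOrder {
      public static void main(String[] args) {
        // Assumed htrace-core4 configuration keys; names may differ across versions.
        Map<String, String> conf = new HashMap<>();
        conf.put("sampler.classes", "AlwaysSampler");
        conf.put("span.receiver.classes",
            "org.apache.htrace.core.StandardOutSpanReceiver");

        Tracer tracer = new Tracer.Builder("demo")
            .conf(HTraceConfiguration.fromMap(conf))
            .build();

        try (TraceScope outer = tracer.newScope("outer")) {
          try (TraceScope inner = tracer.newScope("inner")) {
            // ... instrumented work happens here ...
          } // "inner" closes first, so its span reaches the SpanReceiver first.
        }   // "outer" is delivered only now; if the process dies inside "inner",
            // neither span is ever delivered.
        tracer.close();
      }
    }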

2) HTrace has a concept of spans having one or more parents. This allows capturing, for example, the fact that a process makes an RPC call to another. However, there is no information about when within the span the caller calls the callee. A caller span may have two child spans, representing the fact that it made two RPC calls, but the order in which those calls were made is lost in the model (using the timestamps associated with the beginning of the callee spans is not feasible, as the RPC latencies may differ, or the clocks may simply not be aligned). Also, the only relation captured by the API is between whole spans.

I propose a more general API with a concept of spans and points (timestamped sets of annotations), and cause-effect relationships among points. An RPC call can be represented as a point in the caller span marked as the cause, and a (begin) point in the callee span marked as the effect. This is very flexible and allows capturing all sorts of relationships, not just parent-child. For example, a DMA operation may be initiated in one block and captured as a point, with its completion captured as a point in a distinct block in the same entity (an abstraction for a unit of concurrency).
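As a purely illustrative sketch (none of these types exist in HTrace), the proposed model could look roughly like this in Java:

    import java.util.Map;

    // A timestamped set of annotations recorded inside a span.
    final class Point {
        final String spanId;                    // span this point belongs to
        final long timestampMillis;             // when the point was recorded
        final Map<String, String> annotations;  // e.g. {"event": "rpc.call", "peer": "node-3"}

        Point(String spanId, long timestampMillis, Map<String, String> annotations) {
            this.spanId = spanId;
            this.timestampMillis = timestampMillis;
            this.annotations = annotations;
        }
    }

    // A cause-effect edge between two points, e.g. the point where a caller issues
    // an RPC (cause) and the begin point of the callee span (effect). The same
    // mechanism can link a DMA-start point to a DMA-completion point in another block.
    final class CausalLink {
        final Point cause;
        final Point effect;

        CausalLink(Point cause, Point effect) {
            this.cause = cause;
            this.effect = effect;
        }
    }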
3) There doesn't seem to be any provision in the HTrace API for considering clock domains. In a distributed system, there may be processes running on the same host, processes running in the same cluster, and processes running in different clusters. Different domains may have different degrees of clock misalignment. Providing this information in the API would allow the backend or the trace-building UI to make more accurate inferences about how concurrent entities line up.
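Again as an illustrative sketch (not an existing HTrace concept), a span or point could carry a reference to a clock domain with an estimated skew bound:

    // Hypothetical clock-domain descriptor that spans or points could reference.
    final class ClockDomain {
        final String id;           // e.g. "host:node12", "cluster:us-west"
        final long maxSkewMillis;  // assumed upper bound on clock offset within this domain

        ClockDomain(String id, long maxSkewMillis) {
            this.id = id;
            this.maxSkewMillis = maxSkewMillis;
        }
    }

A backend could then decline to order two points whose timestamps differ by less than the combined skew bound of their domains, instead of guessing.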
4) Does the API provide a mechanism for creating "delegated traces"? What I mean by this is that in some circumstances a thread may need to create traces on behalf of some other element which may not have that capability. For example, a mobile device may have some custom tracing mechanism and attach the information to a request for the server. The server would then need to create the HTrace trace from the existing data passed in the request (including timestamps).
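For illustration only (all names here are hypothetical, not HTrace APIs), the server-side reconstruction could look like this; the key requirement is an API that accepts explicit begin/end timestamps rather than sampling the server's own clock when the span is created:

    // Trace data recorded on the mobile device and shipped with the request.
    final class ClientTraceInfo {
        final String clientSpanId;  // id assigned by the device's custom tracer
        final String description;
        final long beginMillis;     // taken from the device's clock
        final long endMillis;

        ClientTraceInfo(String clientSpanId, String description,
                        long beginMillis, long endMillis) {
            this.clientSpanId = clientSpanId;
            this.description = description;
            this.beginMillis = beginMillis;
            this.endMillis = endMillis;
        }
    }

    // Server-side: materialize a span record on behalf of the client, reusing
    // the client's timestamps (ideally together with a clock-domain hint).
    final class DelegatedSpan {
        final String externalId;
        final String description;
        final long beginMillis;
        final long endMillis;

        static DelegatedSpan fromRequest(ClientTraceInfo info) {
            return new DelegatedSpan(info.clientSpanId, info.description,
                                     info.beginMillis, info.endMillis);
        }

        private DelegatedSpan(String externalId, String description,
                              long beginMillis, long endMillis) {
            this.externalId = externalId;
            this.description = description;
            this.beginMillis = beginMillis;
            this.endMillis = endMillis;
        }
    }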
Let me know if there is interest in discussing changes at this level.
Thanks,
Roberto