Posted to dev@storm.apache.org by "James Xu (JIRA)" <ji...@apache.org> on 2013/12/15 07:50:06 UTC

[jira] [Created] (STORM-154) Provide more information to spout "fail" method

James Xu created STORM-154:
------------------------------

             Summary: Provide more information to spout "fail" method
                 Key: STORM-154
                 URL: https://issues.apache.org/jira/browse/STORM-154
             Project: Apache Storm (Incubating)
          Issue Type: New Feature
            Reporter: James Xu


https://github.com/nathanmarz/storm/issues/39


It might be helpful to distinguish between unexpected errors (when they can be caught) and timeouts.

----------
conflagrator: +1 on this. I wrote a class extending OutputCollector with the following wrapper functions:

public class VerboseOutputCollector extends OutputCollector {
    public VerboseOutputCollector(IOutputCollector delegate) { super(delegate); }

    // Each overload logs the class and line number of the call site, plus the
    // message/exception when given, before delegating to the regular fail(Tuple).
    public void fail(Tuple tuple) { /* log call site, then super.fail(tuple) */ }
    public void fail(Tuple tuple, String message) { /* also log the message */ }
    public void fail(Tuple tuple, Exception e) { /* also log the exception */ }
    public void fail(Tuple tuple, Exception e, String message) { /* log both */ }
}
Each function logs the class and line number of the "fail" call, plus the message or Exception if provided. It's very handy for log analytics.
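
Not from the original comment, but roughly how one of those overloads can capture its call site; the stack-trace index and the System.err logging here are assumptions for illustration:

// Illustrative sketch of fail(Tuple, String) from the class above.
public void fail(Tuple tuple, String message) {
    // Index 2 normally points at the direct caller of this method; that offset is an assumption.
    StackTraceElement caller = Thread.currentThread().getStackTrace()[2];
    System.err.println("fail() called from " + caller.getClassName() + ":" + caller.getLineNumber()
            + (message == null ? "" : " - " + message));
    super.fail(tuple); // still hand the tuple to the underlying collector
}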

----------
dmoore247: +1

With 0.8.1 on a local cluster I've spent many hours tracking down failures: going through the executor.clj code, turning on full logging, adding TaskHooks, playing with the timeout parameters, adding exception handling, etc.
As an aside, the SpoutFail....latencyMs value was always null in my tests on the LocalCluster.

Still, all I know is that the message failed, but not why (timeout?).
Based on playing with the timeout parameters, I deduced that the failures were caused by timeouts.

Where in Storm is it decided that a timeout has been exceeded and the Tuple should be failed? If we knew that, we/I could at least add a debug message to Storm at that point.
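
If I read the code right, the timeout decision happens in the spout executor itself (executor.clj): pending tuples sit in a rotating map and whatever has expired when the timeout tick fires gets failed, so the spout never sees the reason. As a stopgap, here is a rough sketch (class and field names are made up, not Storm API) of a spout that records emit times per message id and compares them against topology.message.timeout.secs inside fail() to guess whether a timeout was the cause:

import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class TimeoutGuessingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<Object, Long> emitTimes = new HashMap<Object, Long>();
    private long timeoutMs;
    private long nextId = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // topology.message.timeout.secs as configured for this topology
        Number timeoutSecs = (Number) conf.get(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS);
        this.timeoutMs = timeoutSecs.longValue() * 1000L;
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100); // illustrative throttle only
        Object msgId = nextId++;
        emitTimes.put(msgId, System.currentTimeMillis());
        collector.emit(new Values("hello"), msgId);
    }

    @Override
    public void ack(Object msgId) {
        emitTimes.remove(msgId);
    }

    @Override
    public void fail(Object msgId) {
        Long emittedAt = emitTimes.remove(msgId);
        if (emittedAt == null) {
            System.err.println("Tuple " + msgId + " failed, emit time unknown");
        } else if (System.currentTimeMillis() - emittedAt >= timeoutMs) {
            System.err.println("Tuple " + msgId + " most likely timed out (in flight >= " + timeoutMs + " ms)");
        } else {
            System.err.println("Tuple " + msgId + " failed before the timeout, probably an explicit fail()");
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

It is only a guess, of course: a tuple failed explicitly just before the deadline looks the same as one that timed out, which is exactly why having the real reason passed to fail() would be better.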

Many thanks.

----------
ruleb: +1

Had the same situation: I searched a whole day to conclude that a Trident topology was regularly dropping complete batches of tuples because the timeout was reached while they were queued up at a busy bolt.
Having a small "tuple timeout reached" message in the logs at info level would save many developer days.
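
For reference (not part of the original report), these are the two settings that usually get tuned when chasing timeout drops; the class name and values below are only examples:

import backtype.storm.Config;

public class TimeoutTuning {
    public static Config tunedConfig() {
        Config conf = new Config();
        conf.setMessageTimeoutSecs(120); // topology.message.timeout.secs: time allowed before a tuple tree is failed
        conf.setMaxSpoutPending(500);    // topology.max.spout.pending: caps in-flight tuples so busy bolts queue up less
        return conf;
    }
}

Raising the timeout only hides the problem, which is why a log line stating the reason would still be the real fix.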

Many thanks.

----------
thecoop: This would be very helpful for determining why tuples are failing, rather than just seeing an arbitrary number in the UI - just putting something in the logs at info or warn level saying a tuple failed, with some information on why it failed.

----------
brianantonelli: +1

It would be great to get more information about what caused the spout to fail. I'm also seeing that the latency is always null.

----------
revans2: It is fairly simple to extend the spout to indicate whether a tuple failed because of a timeout or because of something else, but it is much harder to determine what that something else was. The fail API on all of the output collectors does not take anything that could be used to map the failure to a reason. We would have to extend the API and decide what the failure reason should look like. Perhaps a free-form string, but that is really horrible if you want to aggregate the failures in metrics. Also, we would want to limit the size of the string so as not to overwhelm the acker bolts.
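
To make that trade-off concrete, here is a purely hypothetical sketch of a bounded reason instead of a free-form string; none of these type or method names exist in Storm today:

// Hypothetical only, not Storm API: a small closed enum keeps the reason cheap to
// ship through the ackers and easy to aggregate in metrics, unlike an arbitrary string.
public interface IReasonAwareSpout {
    enum FailureReason {
        TIMEOUT,        // message timeout expired before the tuple tree was fully acked
        EXPLICIT_FAIL,  // some bolt called fail() on a tuple in the tree
        UNKNOWN         // anything the framework cannot classify
    }

    // Would be called instead of (or in addition to) the existing fail(Object msgId).
    void fail(Object msgId, FailureReason reason);
}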



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)