Posted to user@storm.apache.org by Yovav Waichman <yo...@jivesoftware.com> on 2017/10/15 09:10:31 UTC

Topology high number of failures

Hi,

We have been using Apache Storm for a couple of years, and everything was fine until now.
For our spout we are using “storm-kafka-0.9.4.jar”.

Lately, our “Failed” number of events has increased dramatically, and currently almost 20% of our total events are marked as Failed.

We tried investigating our Topology logs, but we came up empty handed.
Moreover, our spout complete latency is 25.996 ms.
We suspected that our db is under heavy load, so we increased our message timeout to 60 and even 300 seconds, but that had no effect on the number of failures.

Lowering our max spout pending value produced a negative result as well.
At some point, since we are not using anchoring, we thought about adding it, but we saw that the KafkaSpout already handles failures by replaying them, so we were not sure whether to add it or not.

It would be helpful if you could point us to where in the Storm logs we can find the reason for these failures (an uncaught exception, perhaps a timeout), since we are a bit blind at the moment.

We would appreciate any help with that.

Thanks in advance,
Yovav

Re: Topology high number of failures

Posted by Stig Rohde Døssing <sr...@apache.org>.
Barring bugs in Storm, there are only a few reasons a tuple will fail.
Either the tuple tree is manually failed from a bolt using
OutputCollector.fail(), or the tuple tree has timed out. It should be
fairly easy for you to rule out the first case, since tuples failed by
bolts cause Storm UI to show tuple failures for the relevant bolt. If
tuples fail due to timeout, it will look like they fail only on the spout.
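
For illustration, the first case looks something like this in a bolt (a
minimal sketch; DbWriterBolt and writeToDb are made-up names, not your
code):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class DbWriterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            writeToDb(input);      // stand-in for your actual processing
            collector.ack(input);  // success: counted as acked
        } catch (Exception e) {
            collector.fail(input); // explicit fail: visible on this bolt in Storm UI
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }

    private void writeToDb(Tuple input) {
        // hypothetical side effect, e.g. a database write
    }
}

If your failure counts only ever appear on the spout, timeouts are the
more likely culprit.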

Whether failed tuples are replayed is up to the spout implementation. Here
is the KafkaSpout method for handling failures
https://github.com/apache/storm/blob/v0.9.4/external/storm-kafka/src/jvm/storm/kafka/PartitionManager.java#L207.
Tuples are replayed until they succeed, or until the failing tuple falls
too far behind the latest emitted offset.
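
In other words, the decision is roughly the following (an illustrative
sketch with made-up names, not the actual PartitionManager code; the
limit is configurable, cf. SpoutConfig.maxOffsetBehind if I remember
the field name right):

import java.util.SortedSet;
import java.util.TreeSet;

class ReplayPolicySketch {
    private final long maxOffsetBehind;   // how far a failed offset may lag
    private long latestEmittedOffset;     // highest offset emitted so far
    private final SortedSet<Long> failedOffsets = new TreeSet<Long>();

    ReplayPolicySketch(long maxOffsetBehind) {
        this.maxOffsetBehind = maxOffsetBehind;
    }

    void onEmit(long offset) {
        latestEmittedOffset = Math.max(latestEmittedOffset, offset);
    }

    void onFail(long offset) {
        if (latestEmittedOffset - offset > maxOffsetBehind) {
            failedOffsets.remove(offset); // too far behind: give up on it
        } else {
            failedOffsets.add(offset);    // queue it for re-emission
        }
    }
}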

The log line I linked to earlier (
https://github.com/apache/storm/blob/v0.9.4/storm-core/src/clj/backtype/storm/daemon/executor.clj#L371)
contains the message id of the failed tuple, but for the KafkaSpout the
message id unfortunately doesn't override toString. You can instead look for
this log line to figure out if some events are looping
https://github.com/apache/storm/blob/v0.9.4/external/storm-kafka/src/jvm/storm/kafka/PartitionManager.java#L214.
The log line doesn't include the partition before 1.0.2, so in case you
need this you'll either need to upgrade or check out and build the spout
yourself.

The logs can't easily tell you why the tuple times out, because the
components in Storm don't know. Without getting into the details of how
Storm tracks tuples, which is explained here
https://storm.apache.org/releases/0.9.6/Guaranteeing-message-processing.html,
all the spout or acker knows when a tuple times out is that the tree was
not fully acked in time. This can be because a message was lost somewhere,
or because a bolt failed to ack due to an implementation error, or simply
because processing took too long.
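
To make that concrete, from the spout's point of view the whole story is
the pair of callbacks below. The message id you emitted with is all the
information Storm hands back (minimal sketch, made-up names):

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SketchSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long counter = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Passing a message id is what makes the tuple tracked at all.
        collector.emit(new Values("event-" + counter), counter);
        counter++;
    }

    @Override
    public void ack(Object msgId) {
        // The tree rooted at msgId was fully acked in time.
    }

    @Override
    public void fail(Object msgId) {
        // Timed out or explicitly failed somewhere downstream;
        // Storm does not say where or why.
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}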

Depending on how high your load is, you might look at enabling topology
debug logging
https://github.com/apache/storm/blob/v0.9.4/examples/storm-starter/src/jvm/storm/starter/WordCountTopology.java#L88.
It will log whenever a tuple is transferred/emitted/acked/failed, and
should let you track what happens to failing tuples. Be warned that it's
very verbose. I believe you can enable it only for the components you're
interested in debugging, which should probably be the bolts you're still
tracking the tuples in. For a specific bolt it would look something like
this:

builder.setBolt("split", new SplitSentence(), 8)
        .addConfiguration(Config.TOPOLOGY_DEBUG, true);


Re: Topology high number of failures

Posted by Yovav Waichman <yo...@jivesoftware.com>.
We’ve ruled out any code changes.
We have the same code base running in a different environment at a different scale, and it works fine.

The current environment has a higher load.
As you wrote, the DEBUG log didn’t shed any light on the reason for the failures.

I read somewhere that a failed tuple is replayed until it is acked.
Is it possible that we have some failing events that loop infinitely because they keep failing?
If so, is there an easy way to check that, short of adding a cache and metrics and implementing our own mechanism to count unique failures?

I’m surprised there is no straightforward way of knowing what is causing these failures.




Re: Topology high number of failures

Posted by Stig Rohde Døssing <sr...@apache.org>.
Thanks for elaborating. Since you say that this has started recently, you
might want to check if any recent code changes have added paths to your
bolts where tuples are not acked. You should check this for the bolts that
receive tuples that are still anchored to the tuple tree, i.e. all bolts up
until the point where you emit without anchors.
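
The typical bug to look for is an early return or exception path that
skips the ack. A sketch (isValid and process are placeholders, and
collector is the usual OutputCollector field):

// Problematic: the early return leaves the tuple neither acked nor
// failed, so it sits in the pending map until it times out on the spout.
public void execute(Tuple input) {
    if (!isValid(input)) {
        return;                  // BUG: no ack, no fail
    }
    process(input);
    collector.ack(input);
}

// Safer alternative: every path either acks or fails.
public void execute(Tuple input) {
    try {
        if (isValid(input)) {
            process(input);
        }
        collector.ack(input);
    } catch (Exception e) {
        collector.fail(input);   // shows up on this bolt in Storm UI
    }
}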

I'm curious why you have acking enabled if you don't anchor your tuples?

Regarding the log, I was mistaken. It doesn't look like there's much
logging happening when a tuple fails, beyond the line I linked earlier. If
your components throw exceptions they should go in the worker (topology)
log, at least in recent versions of Storm, I'm not certain about 0.9.4.

The most likely reason for these failures is hitting the tuple timeout.
Your complete latency is low, but I want to say that that number doesn't
account for tuples that fail due to timeout (again, not certain about
this). If there's some part of your code that's receiving anchored tuples
and not acking them, it would probably look like this.
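
For reference, these are the two knobs in play, settable on the topology
config (a sketch; the values are arbitrary examples, not recommendations):

import backtype.storm.Config;

public class ConfSketch {
    public static Config build() {
        Config conf = new Config();
        conf.setMessageTimeoutSecs(300); // topology.message.timeout.secs, which you already raised
        conf.setMaxSpoutPending(500);    // caps unacked tuples per spout task
        return conf;
    }
}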

Another potential explanation could be this issue
https://issues.apache.org/jira/browse/STORM-1750. If one of your executors
dies while the Zookeeper connection is unstable, the executor thread may
not recover. I haven't verified whether this issue occurs on 0.9.4, but it
looks like it might from looking at the executor code.


Re: Topology high number of failures

Posted by Yovav Waichman <yo...@jivesoftware.com>.
Thanks for replying.
Sorry for not being clear: acking is enabled, but we don’t use anchoring when emitting events from our bolts.

I’ve changed our log level to DEBUG.
Are there any error messages I should look for in particular?

Thanks again for your help



Re: Topology high number of failures

Posted by Stig Rohde Døssing <sr...@apache.org>.
Hi,

Could you elaborate on your configuration? You say you aren't using
anchoring, but if you're getting tuple failures (and a complete latency)
then acking must be enabled. Is acking enabled for the spout, and then you
don't use an anchor when emitting tuples from the bolts in the topology, or
what do you mean?
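
To be concrete about the distinction I mean, inside a bolt's
execute(Tuple input), with "word" standing in for whatever you emit:

// Anchored: the new tuple joins the spout tuple's tree, so a downstream
// failure or timeout causes a replay from the spout.
collector.emit(input, new Values(word));
collector.ack(input);

// Unanchored: the new tuple is not tracked; whatever happens to it
// downstream never reaches the spout.
collector.emit(new Values(word));
collector.ack(input);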

0.9.4 should be able to log when a tuple fails in the spout, here
https://github.com/apache/storm/blob/v0.9.4/storm-core/src/clj/backtype/storm/daemon/executor.clj#L371.
I believe you need to set the "backtype.storm.daemon.executor" logger level
to DEBUG in the logback config.
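
Something like this line in the logback configuration should do it
(logback/cluster.xml in the Storm directory, if I remember the 0.9.x
layout correctly):

<logger name="backtype.storm.daemon.executor" level="DEBUG"/>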
