Posted to user@storm.apache.org by Kevin Peek <kp...@salesforce.com> on 2016/07/27 19:22:14 UTC

Massive Number of Spout Failures

We have a topology that is experiencing massive amounts of spout failures
without corresponding bolt failures. We have been interpreting these as
tuple timeouts, but we seem to be getting more of these failures than we
understand to be possible with timeouts.

Our topology uses a Kafka spout and the topology is configured with:
topology.message.timeout.secs = 300
topology.max.spout.pending = 2500

Based on these settings, I would expect the topology to experience a
maximum of 2500 tuple timeouts per 300 seconds. But from the Storm UI, we
see that after running for about 10 minutes, the topology will show about
50K spout failures and zero bolt failures.

Am I misunderstanding something that would allow more tuples to time out,
or is there another source of spout failures?

Thanks in advance,
Kevin Peek

Re: Massive Number of Spout Failures

Posted by Kevin Peek <kp...@salesforce.com>.
Erik, we actually have 35 spout instances, so I think you've found the
issue. Thanks!

On Wed, Jul 27, 2016 at 4:44 PM, Erik Weathers <ew...@groupon.com>
wrote:

> How many spout tasks do you have?  The topology.max.spout.pending setting
> is *per* task.  Maybe you have 20?  20*2500 == 50K.

Re: Massive Number of Spout Failures

Posted by Erik Weathers <ew...@groupon.com>.
How many spout tasks do you have?  The topology.max.spout.pending setting
is *per* task.  Maybe you have 20?  20*2500 == 50K.
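
As a rough illustration of the arithmetic above, here is a minimal Python sketch. The function names are hypothetical, and the ceiling assumes the worst case in which every pending tuple times out and is replaced at once; actual Storm timeout timing is less exact than this.

```python
def in_flight_cap(spout_tasks, max_spout_pending):
    """Most tuples that can be pending at once across all spout tasks.

    topology.max.spout.pending applies per spout task, so the
    topology-wide cap scales with the number of tasks.
    """
    return spout_tasks * max_spout_pending


def timeout_ceiling(spout_tasks, max_spout_pending, timeout_secs, elapsed_secs):
    """Rough worst-case count of timed-out tuples after elapsed_secs.

    Assumption: every pending tuple times out, and each spout task
    refills to its cap as soon as the previous set has expired.
    """
    full_windows = elapsed_secs // timeout_secs
    return in_flight_cap(spout_tasks, max_spout_pending) * full_windows


if __name__ == "__main__":
    # Settings from the thread: 2500 pending per task, 300 s timeout.
    print(in_flight_cap(20, 2500))               # the guess of 20 tasks
    print(in_flight_cap(35, 2500))               # the 35 tasks reported later
    print(timeout_ceiling(35, 2500, 300, 600))   # about 10 minutes of runtime
```

With one spout task the ceiling after 10 minutes would be 5,000, which matches the original expectation; with 35 tasks it is far above the roughly 50K failures reported.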

On Wed, Jul 27, 2016 at 1:32 PM, Kevin Peek <kp...@salesforce.com> wrote:

> Thanks for the reply.
>
> In either of these cases, shouldn't Storm stop letting the spout emit
> tuples once max_spout_pending is reached? In that case, the tuples already
> in the topology (or dropped by accident, collected in a bolt, etc.) will
> take 5 minutes to time out, and the number of tuples failing in this way
> will be limited to max_spout_pending per 5 minutes. The issue is we are
> seeing a much higher level of spout failures.

Re: Massive Number of Spout Failures

Posted by Kevin Peek <kp...@salesforce.com>.
Thanks for the reply.

In either of these cases, shouldn't Storm stop letting the spout emit
tuples once max_spout_pending is reached? In that case, the tuples already
in the topology (or dropped by accident, collected in a bolt, etc.) will
take 5 minutes to time out, and the number of tuples failing in this way
will be limited to max_spout_pending per 5 minutes. The issue is we are
seeing a much higher level of spout failures.

On Wed, Jul 27, 2016 at 3:48 PM, Igor Kuzmenko <f1...@gmail.com> wrote:

> We have seen failures like this for two reasons:
>
> 1) The bolt doesn't ack each tuple immediately, but collects a batch and
> acks them all at some point. In that case, the batch can be bigger than
> max_spout_pending, and some tuples fail.
>
> 2) The bolt doesn't ack the tuple at all. Make sure the bolt acks or fails
> every tuple, without exception.

Re: Massive Number of Spout Failures

Posted by Igor Kuzmenko <f1...@gmail.com>.
We have seen failures like this for two reasons:

1) The bolt doesn't ack each tuple immediately, but collects a batch and
acks them all at some point. In that case, the batch can be bigger than
max_spout_pending, and some tuples fail.

2) The bolt doesn't ack the tuple at all. Make sure the bolt acks or fails
every tuple, without exception.
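
The first case can be illustrated with a toy simulation. This is a hypothetical model in plain Python with one spout task and one bolt, not Storm code: the names, the one-tuple-per-tick emit rate, and the tick-based timeout are all assumptions made for the sketch.

```python
def simulate(max_spout_pending, batch_size, ticks, timeout_ticks):
    """Toy model of a batching bolt; returns (acked, failed) counts.

    Each tick, the spout first fails tuples that have been pending for
    timeout_ticks, then emits one tuple if it is under its pending cap.
    The bolt buffers tuples and acks only when the buffer is full.
    """
    pending = {}   # tuple id -> tick at which the spout emitted it
    buffer = []    # tuple ids held by the bolt, not yet acked
    acked = failed = 0
    next_id = 0
    for tick in range(ticks):
        # Spout side: time out tuples that have waited too long.
        expired = [t for t, emitted in pending.items()
                   if tick - emitted >= timeout_ticks]
        for t in expired:
            del pending[t]
            failed += 1
        # Spout side: emit only while under the pending cap.
        if len(pending) < max_spout_pending:
            pending[next_id] = tick
            buffer.append(next_id)
            next_id += 1
        # Bolt side: ack the whole batch once it is full. Acks for
        # tuples that already timed out are ignored.
        if len(buffer) >= batch_size:
            for t in buffer:
                if t in pending:
                    del pending[t]
                    acked += 1
            buffer.clear()
    return acked, failed


if __name__ == "__main__":
    print(simulate(max_spout_pending=5, batch_size=5, ticks=100, timeout_ticks=20))
    print(simulate(max_spout_pending=5, batch_size=10, ticks=100, timeout_ticks=20))
```

In this model, a batch no larger than the pending cap never produces failures, while a batch larger than the cap stalls the spout until tuples time out, so failures show up at the spout even though the bolt never fails a tuple.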

On Wed, Jul 27, 2016 at 10:22 PM, Kevin Peek <kp...@salesforce.com> wrote:

> We have a topology that is experiencing massive amounts of spout failures
> without corresponding bolt failures. We have been interpreting these as
> tuple timeouts, but we seem to be getting more of these failures than we
> understand to be possible with timeouts.
>
> Our topology uses a Kafka spout and the topology is configured with:
> topology.message.timeout.secs = 300
> topology.max.spout.pending = 2500
>
> Based on these settings, I would expect the topology to experience a
> maximum of 2500 tuple timeouts per 300 seconds. But from the Storm UI, we
> see that after running for about 10 minutes, the topology will show about
> 50K spout failures and zero bolt failures.
>
> Am I misunderstanding something that would allow more tuples to time out,
> or is there another source of spout failures?
>
> Thanks in advance,
> Kevin Peek
>