Posted to issues@storm.apache.org by "Stig Rohde Døssing (JIRA)" <ji...@apache.org> on 2017/05/28 18:52:04 UTC

[jira] [Commented] (STORM-2359) Revising Message Timeouts

    [ https://issues.apache.org/jira/browse/STORM-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027906#comment-16027906 ] 

Stig Rohde Døssing commented on STORM-2359:
-------------------------------------------

I thought a bit more about this.

Here are the situations I could think of where tuples currently get expired even though they haven't actually been lost:
* The tuple tree is still making progress, and tuples are getting acked, but the time to process the entire tree is longer than the tuple timeout. This is very likely to happen if there's congestion somewhere in the topology.
* The tuple tree is not making progress, because a bolt is currently processing a tuple in the tree, and processing is taking longer than expected.
* The tuple tree is not making progress, because the tuple(s) are stuck in queues behind other slow tuples. This is also very likely to happen if there's congestion in the topology.

The situation where progress is still being made can be solved by resetting the tuple timeout whenever an ack is received. In order to reduce load on the spout, we should try to bundle these resets in the acker bolts before sending them to the spout. I think a decent way to do this bundling is to make each acker bolt keep track of which tuples it has received acks for since the last time timeouts were reset. When a configured interval expires, the acker empties its list of live tuples and sends timeout resets for all of them to the spout. The interval should probably be specified as a percentage of the tuple timeout.
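To make the bundling concrete, here's a minimal sketch. The class and method names (AckerTimeoutResetBundler, ResetSender, sendResetToSpout) are illustrative, not existing Storm internals, and I'm assuming the acker identifies tuple trees by their root id:

{code:java}
import java.util.HashSet;
import java.util.Set;

class AckerTimeoutResetBundler {
    // Root ids of tuple trees that have been acked since the last flush.
    private final Set<Long> pendingResets = new HashSet<>();
    private final long flushIntervalMs;
    private long lastFlushMs;

    AckerTimeoutResetBundler(long tupleTimeoutMs, double intervalFraction) {
        // The flush interval is a fraction of the tuple timeout, as proposed.
        this.flushIntervalMs = (long) (tupleTimeoutMs * intervalFraction);
        this.lastFlushMs = System.currentTimeMillis();
    }

    // Called whenever the acker receives an ack for a still-live tuple tree.
    void onAck(long rootId) {
        pendingResets.add(rootId);
    }

    // Called periodically; flushes all bundled resets to the spout at once.
    void maybeFlush(ResetSender sender) {
        long now = System.currentTimeMillis();
        if (now - lastFlushMs >= flushIntervalMs) {
            for (long rootId : pendingResets) {
                sender.sendResetToSpout(rootId);
            }
            pendingResets.clear();
            lastFlushMs = now;
        }
    }

    // Stand-in for whatever mechanism the acker uses to message the spout.
    interface ResetSender {
        void sendResetToSpout(long rootId);
    }
}
{code}

Driving maybeFlush from the acker's existing tick handling would avoid adding an extra thread per acker.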

If a bolt is taking longer to process a tuple than expected, this can be handled in the concrete bolt implementation by calling OutputCollector.resetTimeout at an appropriate interval (e.g. the tuple timeout minus a few seconds).
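For example, a bolt doing long-running work could look roughly like this. It's only a sketch assuming the resetTimeout call mentioned above; SlowProcessingBolt and doExpensiveStep are made-up names:

{code:java}
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class SlowProcessingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Illustrative long-running work split into steps; resetting the
        // timeout between steps keeps the tuple tree alive at the spout
        // while the bolt is still making progress.
        for (int step = 0; step < 10; step++) {
            doExpensiveStep(input, step);
            collector.resetTimeout(input);
        }
        collector.ack(input);
    }

    // Hypothetical placeholder for work that can take a significant
    // fraction of the tuple timeout per step.
    private void doExpensiveStep(Tuple input, int step) {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output streams in this sketch.
    }
}
{code}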

When tuples are stuck in queues behind other tuples, the topology can have a hard time recovering. This is because the expiration timer starts ticking for a tuple as soon as it's emitted, so if the bolt queues are congested, the bolts may be spending all their time processing tuples that belong to already-expired tuple trees. In order to solve this, we need to reset timeouts for queued tuples from time to time. It should be possible to add a thread that peeks at the available messages in the DisruptorQueue at some interval, and resets the timeout for any message that was also queued the last time the thread ran. Only sending a reset once a tuple has been queued for an entire interval should help decrease the number of unnecessary resets sent to the spout. We should be able to reuse the interval configuration added for the acker bolt.
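Roughly, the scanning thread could look like the following. QueueView.peekQueuedRootIds and ResetSender.sendResetToSpout are hypothetical stand-ins for whatever DisruptorQueue and the reset path would actually expose:

{code:java}
import java.util.HashSet;
import java.util.Set;

class QueuedTupleTimeoutResetter implements Runnable {
    // Root ids that were queued on the previous scan.
    private Set<Long> seenLastScan = new HashSet<>();
    private final QueueView queue;
    private final ResetSender sender;
    private final long scanIntervalMs;

    QueuedTupleTimeoutResetter(QueueView queue, ResetSender sender, long scanIntervalMs) {
        this.queue = queue;
        this.sender = sender;
        this.scanIntervalMs = scanIntervalMs;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            Set<Long> current = queue.peekQueuedRootIds();
            // Only reset tuples that were already queued a full interval ago;
            // tuples that pass through the queue quickly never trigger a reset.
            for (long rootId : current) {
                if (seenLastScan.contains(rootId)) {
                    sender.sendResetToSpout(rootId);
                }
            }
            seenLastScan = current;
            try {
                Thread.sleep(scanIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Hypothetical view of the queued messages' tuple-tree root ids.
    interface QueueView {
        Set<Long> peekQueuedRootIds();
    }

    // Stand-in for the path that delivers timeout resets to the spout.
    interface ResetSender {
        void sendResetToSpout(long rootId);
    }
}
{code}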

I'd welcome any feedback on these ideas :)

> Revising Message Timeouts
> -------------------------
>
>                 Key: STORM-2359
>                 URL: https://issues.apache.org/jira/browse/STORM-2359
>             Project: Apache Storm
>          Issue Type: Sub-task
>          Components: storm-core
>    Affects Versions: 2.0.0
>            Reporter: Roshan Naik
>
> A revised strategy for message timeouts is proposed here.
> Design Doc:
>  https://docs.google.com/document/d/1am1kO7Wmf17U_Vz5_uyBB2OuSsc4TZQWRvbRhX52n5w/edit?usp=sharing


