You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@metron.apache.org by "Matt Foley (JIRA)" <ji...@apache.org> on 2017/02/06 06:35:42 UTC

[jira] [Updated] (METRON-322) Global Batching and flushing

     [ https://issues.apache.org/jira/browse/METRON-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Foley updated METRON-322:
------------------------------
    Summary: Global Batching and flushing  (was: Global Batching & flushing)

> Global Batching and flushing
> ----------------------------
>
>                 Key: METRON-322
>                 URL: https://issues.apache.org/jira/browse/METRON-322
>             Project: Metron
>          Issue Type: Improvement
>            Reporter: Ajay Yadav
>            Assignee: Matt Foley
>
> All Writers and other bolts that maintain an internal "batch" queue, need to have a timeout flush, to prevent messages from low-volume telemetries from sitting in their queues indefinitely.  Storm has a timeout value (topology.message.timeout.secs) that prevents it from waiting for too long. If the Writer does not process the queue before the timeout, then Storm recycles the tuples through the topology. This has multiple undesirable consequences, including data duplication and waste of compute resources. We would like to be able to specify an interval after which the queues would flush, even if the batch size is not met.
> We will utilize the Storm Tick Tuple to trigger timeout flushing, following the recommendations of the article at 
> http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/#CONCLUSION
> Since every Writer processes its queue somewhat differently, every bolt that has a "batchSize" parameter will be given a "batchTimeout" parameter too.  It will default to 1/2 the value of "topology.message.timeout.secs", as recommended, and will ignore settings larger than the default, which could cause failure to flush in time.  In the Enrichment topology, where two Writers may be placed one after the other (enrichment and threat intel), the default timeout interval will be 1/4 the value of "topology.message.timeout.secs".  The default value of "topology.message.timeout.secs" in Storm is 30 seconds.
> In addition, Storm provides a limit on the number of pending messages that have not been acked. If more than "topology.max.spout.pending" messages are waiting in a topology, then Storm will recycle them through the topology. However, the default value of "topology.max.spout.pending" is null, and if set to non-null value, the user can manage the consequences by setting batchSize limits appropriately.  Having the timeout flush will also ameliorate this issue.  So we do not need to address "topology.max.spout.pending" directly in this task.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)