Posted to issues@storm.apache.org by "Arun Mahadevan (JIRA)" <ji...@apache.org> on 2017/05/02 11:44:04 UTC

[jira] [Comment Edited] (STORM-2489) Overlap and data loss on WindowedBolt based on Duration

    [ https://issues.apache.org/jira/browse/STORM-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992758#comment-15992758 ] 

Arun Mahadevan edited comment on STORM-2489 at 5/2/17 11:43 AM:
----------------------------------------------------------------

[~wangkui], the initial tuples expired because the trigger did not fire exactly at the end of the window interval but after a delay. When I tested in local mode with the spout emitting without any delay, the trigger fired after about 6s (for a 4s tumbling window). This may be because the system is overwhelmed with data and cannot schedule the trigger thread on time. In that case the initial tuples (0 - 2s) will not be considered in the first window.

Typically the window duration should be chosen so that all the tuples within a window can be processed before the next window trigger fires; otherwise the next trigger is delayed and leads to incorrect results. To handle such high data rates you should use a real cluster with multiple hosts/workers and split the data among those workers, as in the sketch below.
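
For reference, a minimal wiring sketch along those lines, assuming Storm 1.x; the spout and bolt class names, the parallelism hints and the worker count are illustrative placeholders, not taken from the attached TumblingWindowIssue.java:

```
// Sketch only: MyDataSpout and MyWindowedBolt are hypothetical placeholders.
import java.util.concurrent.TimeUnit;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt;

public class WindowedTopologySketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("data-spout", new MyDataSpout(), 2);

        // 4s tumbling window; several executors so each instance sees only a share
        // of the stream and can finish a window before the next trigger fires.
        builder.setBolt("window-bolt",
                new MyWindowedBolt().withTumblingWindow(
                        new BaseWindowedBolt.Duration(4, TimeUnit.SECONDS)),
                4)
               .shuffleGrouping("data-spout");

        Config conf = new Config();
        conf.setNumWorkers(2); // spread the executors across workers/hosts

        StormSubmitter.submitTopology("windowed-topology", conf, builder.createTopology());
    }
}
```

Also keep topology.message.timeout.secs comfortably larger than the window length (plus the sliding interval, for sliding windows) so that tuples buffered in a window do not time out and get replayed.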

Another option would be to use event time windows where each event contains a "timestamp" field and the window calculations are done based on the actual event time instead of system time.
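
A minimal sketch of how that looks with BaseWindowedBolt, assuming the tuples carry an epoch-millis "ts" field; the field name, the 1s lag and the bolt class are assumptions for illustration only:

```
import java.util.concurrent.TimeUnit;

import org.apache.storm.topology.IWindowedBolt;
import org.apache.storm.topology.base.BaseWindowedBolt.Duration;

public class EventTimeWindowSketch {
    // MyWindowedBolt and the "ts" field name are hypothetical placeholders.
    static IWindowedBolt eventTimeBolt() {
        return new MyWindowedBolt()
                // 4s tumbling window computed from the tuple's "ts" field (epoch millis)
                // instead of the processing-time clock
                .withTumblingWindow(new Duration(4, TimeUnit.SECONDS))
                .withTimestampField("ts")
                // tolerate tuples arriving up to 1s out of order relative to the watermark
                .withLag(new Duration(1, TimeUnit.SECONDS));
    }
}
```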


> Overlap and data loss on WindowedBolt based on Duration
> -------------------------------------------------------
>
>                 Key: STORM-2489
>                 URL: https://issues.apache.org/jira/browse/STORM-2489
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.2
>         Environment: windows 10, eclipse, jdk1.7
>            Reporter: wangkui
>            Assignee: Arun Mahadevan
>         Attachments: TumblingWindowIssue.java
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The attachment is my test script, one of my test results is:
> ```
> expired=1...55
> get=56...4024
> new=56...4024
> Recived=3969,RecivedTotal=3969
> expired=56...4020
> get=4021...8191
> new=4025...8191
> Recived=4171,RecivedTotal=8140
> SendTotal=12175
> expired=4021...8188
> get=8189...12175
> new=8192...12175
> Recived=3987,RecivedTotal=12127
> ```
> This test result shows that some tuples appear directly in the expired list; if we only use get() to read tuples, that data is lost. This is the first bug.
> The second: the tuples returned by get() overlap between windows, while getNew() seems alright.
> The problem does not happen every time; it may take several tries to reproduce.
> Actually, I'm a newbie with Storm, so I'm not sure whether this is really a bug or I'm using it the wrong way.
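
For context on the three lists in the quoted output: get() returns every tuple currently in the window, getNew() only the tuples added since the previous window, and getExpired() the tuples that fell out of it since then. A minimal sketch of a windowed bolt that prints their sizes (the class name is an illustrative placeholder, not the attached test):

```
import java.util.List;

import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.windowing.TupleWindow;

public class LoggingWindowedBolt extends BaseWindowedBolt {
    @Override
    public void execute(TupleWindow window) {
        List<Tuple> all = window.get();          // all tuples currently in the window
        List<Tuple> fresh = window.getNew();     // tuples added since the previous window
        List<Tuple> gone = window.getExpired();  // tuples dropped since the previous window
        System.out.printf("size=%d new=%d expired=%d%n",
                all.size(), fresh.size(), gone.size());
    }
}
```

In a tumbling window each tuple belongs to exactly one window, so successive get() results are not expected to overlap under normal operation.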



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)