Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2016/11/01 23:34:59 UTC

[jira] [Updated] (SPARK-18124) Implement watermarking for handling late data

     [ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-18124:
-------------------------------------
    Issue Type: New Feature  (was: Sub-task)
        Parent:     (was: SPARK-8360)

> Implement watermarking for handling late data
> ---------------------------------------------
>
>                 Key: SPARK-18124
>                 URL: https://issues.apache.org/jira/browse/SPARK-18124
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Tathagata Das
>            Assignee: Michael Armbrust
>
> Whenever we aggregate data by event time, we need to account for data that arrives late and out of order with respect to its event time. Because we keep aggregates keyed by time as state, that state grows without bound if we retain all old aggregates in an attempt to handle arbitrarily late data. Since the state is stored in memory, we have to prevent this unbounded build-up. Hence, we need a watermarking mechanism that marks data older than a threshold as “too late” and stops updating the aggregates with it. This would allow us to remove old aggregates that will never be updated again, thus bounding the size of the state.
> Here is the design doc: https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing
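
To make the idea in the description concrete, here is a minimal, self-contained sketch of watermark-based state eviction in plain Python. It is not Spark's implementation or API; the class name `WindowedCounter`, the parameters `window` and `threshold`, and the choice to compute the watermark as "largest event time seen so far minus the threshold" are all assumptions made for illustration. The actual semantics are whatever the linked design doc specifies.

```python
class WindowedCounter:
    """Hypothetical sketch: counts events per event-time window and uses a
    watermark to drop too-late data and evict windows that can no longer change."""

    def __init__(self, window=10, threshold=15):
        self.window = window          # window length (assumed unit: seconds)
        self.threshold = threshold    # allowed lateness (assumed unit: seconds)
        self.max_event_time = None    # largest event time observed so far
        self.state = {}               # window start -> running count
        self.dropped = 0              # number of events rejected as too late

    def watermark(self):
        """Assumed definition: max observed event time minus the threshold."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.threshold

    def process_batch(self, event_times):
        """Apply one batch of event times; return the windows evicted afterwards."""
        wm = self.watermark()  # the watermark is held fixed while the batch runs
        for t in event_times:
            start = t - t % self.window
            # An event is "too late" if its whole window lies at or below the watermark.
            if wm is not None and start + self.window <= wm:
                self.dropped += 1
                continue
            self.state[start] = self.state.get(start, 0) + 1
            if self.max_event_time is None or t > self.max_event_time:
                self.max_event_time = t

        # Advance the watermark, then evict windows that can never be updated again.
        wm = self.watermark()
        evicted = {}
        if wm is not None:
            for start in sorted(self.state):  # sorted() copies the keys, so pop is safe
                if start + self.window <= wm:
                    evicted[start] = self.state.pop(start)
        return evicted


counter = WindowedCounter(window=10, threshold=15)
counter.process_batch([1, 4, 12])    # no window is old enough to evict yet
counter.process_batch([31, 3])       # 3 is late but still accepted; evicts {0: 3}
counter.process_batch([5, 14, 40])   # 5 is now too late and dropped; evicts {10: 2}
```

In this sketch the state never holds more windows than fit between the watermark and the newest event, which is the bounding property the description asks for. The cost is that sufficiently late events, such as the event at time 5 in the last batch, are discarded.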


