You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Liwei Lin (JIRA)" <ji...@apache.org> on 2017/03/13 04:45:04 UTC

[jira] [Updated] (SPARK-19932) Also save event time into StateStore for certain cases

     [ https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwei Lin updated SPARK-19932:
------------------------------
    Description: 
{code}
spark
   .readStream                 // schema: (word, eventTime), like ("a", 10), ("a", 11), ("b", 12) ...
   ...
   .withWatermark("eventTime", "10 seconds")
   .dropDuplicates("word")     // note: "eventTime" is not part of the key columns
   ...
{code}

As shown above, right now if watermark is specified for a streaming dropDuplicates query, but not specified as the key columns, then we'll still get the correct answer, but the state just keeps growing and will never get cleaned up.

The reason is, the watermark attribute is not part of the key of the state store in this case. We're not saving event time information in the state store.

  was:
<code>
spark
   .readStream                 // schema: (word, eventTime), like ("a", 10), ("a", 11), ("b", 12) ...
   ...
   .withWatermark("eventTime", "10 seconds")
   .dropDuplicates("word")     // note: "eventTime" is not part of the key columns
   ...
<code>

As shown above, right now if watermark is specified for a streaming dropDuplicates query, but not specified as the key columns, then we'll still get the correct answer, but the state just keeps growing and will never get cleaned up.

The reason is, the watermark attribute is not part of the key of the state store in this case. We're not saving event time information in the state store.


> Also save event time into StateStore for certain cases
> ------------------------------------------------------
>
>                 Key: SPARK-19932
>                 URL: https://issues.apache.org/jira/browse/SPARK-19932
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Liwei Lin
>
> {code}
> spark
>    .readStream                 // schema: (word, eventTime), like ("a", 10), ("a", 11), ("b", 12) ...
>    ...
>    .withWatermark("eventTime", "10 seconds")
>    .dropDuplicates("word")     // note: "eventTime" is not part of the key columns
>    ...
> {code}
> As shown above, right now if watermark is specified for a streaming dropDuplicates query, but not specified as the key columns, then we'll still get the correct answer, but the state just keeps growing and will never get cleaned up.
> The reason is, the watermark attribute is not part of the key of the state store in this case. We're not saving event time information in the state store.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org