You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by lw-lin <gi...@git.apache.org> on 2017/03/12 15:52:16 UTC

[GitHub] spark pull request #17268: [WIP][SPARK][SS] Also save event time into StateS...

GitHub user lw-lin opened a pull request:

    https://github.com/apache/spark/pull/17268

    [WIP][SPARK][SS] Also save event time into StateStore for certain cases

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lw-lin/spark dedup-watermark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17268.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17268
    
----
commit 1cdf29af06418fe6dea6ed9d4b85be62c8b6edcd
Author: Liwei Lin <lw...@gmail.com>
Date:   2017-03-12T12:28:54Z

    Initial commit

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17268: [SPARK-19932][SS] Disallow a case that might caus...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin closed the pull request at:

    https://github.com/apache/spark/pull/17268


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Sorry, I wasn't suggestion we mandate this.  There may be use cases where users are okay deduping a short lived stream w/o a watermark.  I'm only saying the timestamp is mandatory for the watermark to kick in.,


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [WIP][SPARK][SS] Also save event time into StateStore fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74406/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74660/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    @marmbrus thanks for the comments.
    
    > In the worst case... it is possible that the result actually ends up with duplicates in it.
    
    Ah, could you elaborate?  I'm not sure why there might be duplicates, since this patch proposes to include the timestamp column implicitly in the state store values rather than in the state store keys.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Disallow a case that might cause OOM f...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Thanks for the comments! Closing this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [WIP][SPARK][SS] Also save event time into StateStore fo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    **[Test build #74406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74406/testReport)** for PR 17268 at commit [`1cdf29a`](https://github.com/apache/spark/commit/1cdf29af06418fe6dea6ed9d4b85be62c8b6edcd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Thank you @marmbrus for the detailed explanation!
    
    > For that reason, I think its safest to require the user to explicitly include the timestamp.
    
    Yea, let me update this in this direction.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    **[Test build #74660 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74660/testReport)** for PR 17268 at commit [`d76a138`](https://github.com/apache/spark/commit/d76a138ae734ded8d9c4d2a91c6c3dc9b216e349).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    @marmbrus @zsxwing would you take a look at this, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Disallow a case that might cause OOM f...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Sorry I'm still not sure if this is a good idea.
    
    Why disallow the following,
    
    ```scala
    spark
       .readStream             
       .withWatermark("eventTime", "10 seconds")
       .dropDuplicates("id")
    ```
    
    But not this query?
    
    ```scala
    spark
       .readStream             
       .withWatermark("eventTime", "10 seconds")
       .groupBy("id")
       .agg(first(struct($"*")))
    ```
    
    I think its an incredibly difficult problem to tell people if their query is scalable without seeing the data.  I'm not sure if we want to do it piecemeal as I'm concerned that is more confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    I'm mixed if we want this to happen implicitly.  Here's how I think about the tradeoffs for this change:  On the pro side, with this change we avoid the case where the user forgets to include the timestamp and their application crashes.
    
    On the con, side we are doing something implicit here that might result in confusing behavior.  In the worst case (where the timestamp is not deterministically correlated with the other deduplication columns), it is possible that the result actually ends up with duplicates in it.
    
    For that reason, I think its safest to require the user to explicitly include the timestamp.  I'd be curious what others think though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Disallow a case that might case OOM fo...

Posted by lw-lin <gi...@git.apache.org>.
Github user lw-lin commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Sure; sorry I didn't say it out but I meant the same thing :-)
    
    @marmbrus now that I've updated this as well as the JIRA, would you mind taking another look? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [WIP][SPARK][SS] Also save event time into StateStore fo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Say the eventtime column chosen is the time of delivery into something like Kafka.  Due to retries we end up with two events with different timestamps.  Consider the following stream with a watermark threshold of `5` and where a blank line delineates batch boundaries.
    
    ```
    [id=a, t=0]
    
    [id=b, t=6]
    
    [id=a, t=10]
    ```
    
    The first batch emits `a` the second batch emits `b` and drops `a` from the store.  The third batch emits a duplicate `a`.
    
    Since that result seems pretty confusing, I think we may want to require the user to explicitly include the timestamp (or get an explicit though admittedly suboptimal OOM).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [WIP][SPARK][SS] Also save event time into StateStore fo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    **[Test build #74406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74406/testReport)** for PR 17268 at commit [`1cdf29a`](https://github.com/apache/spark/commit/1cdf29af06418fe6dea6ed9d4b85be62c8b6edcd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17268: [SPARK-19932][SS] Also save event time into StateStore f...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17268
  
    **[Test build #74660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74660/testReport)** for PR 17268 at commit [`d76a138`](https://github.com/apache/spark/commit/d76a138ae734ded8d9c4d2a91c6c3dc9b216e349).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org