Posted to reviews@spark.apache.org by jeanlyn <gi...@git.apache.org> on 2016/03/01 04:29:21 UTC

[GitHub] spark pull request: [SPARK-13586]add config to skip generate down ...

GitHub user jeanlyn opened a pull request:

    https://github.com/apache/spark/pull/11440

    [SPARK-13586]add config to skip generate down time batch when restart StreamingContext

    ## What changes were proposed in this pull request?
    
    This patch adds a config, `spark.streaming.skipDownTimeBatch`, that controls whether the down time batches are generated when restarting a StreamingContext. It defaults to false.
    
    
    ## How was this patch tested?
    
    unit test: test("SPARK-13586: do no generate down time batch when recovering from checkpoint") 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark skipDownTime

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11440
    
----
commit 089d0af74317b767378c16673ac9d67f6dfd9972
Author: jeanlyn <je...@gmail.com>
Date:   2016-02-29T07:15:38Z

    add config to generate down time batch

commit 9068881aeb17bb77383b3e9eecf01c463f62113c
Author: jeanlyn <je...@gmail.com>
Date:   2016-03-01T03:21:23Z

    add jira num

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190570144
  
    For example, suppose your sliding duration is 1, window duration is 4, batch duration is 1, and the down time is 3. If you skip these 3 batches, IIUC the result will be wrong.
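
The arithmetic behind this example can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's DStream implementation: the object name `WindowSkipSketch`, the helper `windowSum`, the batch times, and the count of 10 records per batch are all hypothetical and chosen only for illustration.

```scala
object WindowSkipSketch {
  // Hypothetical input: 10 records belong to each batch time (batch duration = 1).
  val recordsPerBatch: Map[Int, Int] = (1 to 10).map(t => t -> 10).toMap

  // Sum over the window ending at `time`, counting only the batches
  // that were actually generated (sliding duration = 1).
  def windowSum(time: Int, windowDuration: Int, generated: Set[Int]): Int =
    ((time - windowDuration + 1) to time)
      .filter(generated.contains)
      .map(t => recordsPerBatch.getOrElse(t, 0))
      .sum

  def main(args: Array[String]): Unit = {
    val allBatches = (1 to 10).toSet
    val downTimes = Set(6, 7, 8) // a down time of 3 batches
    val restartTime = 9          // first batch after the restart

    val withDownTime = windowSum(restartTime, 4, allBatches)
    val skipped = windowSum(restartTime, 4, allBatches -- downTimes)

    println(s"window at t=$restartTime with down time batches: $withDownTime")
    println(s"window at t=$restartTime with them skipped: $skipped")
  }
}
```

In this model the window ending at t=9 covers batches 6 through 9, so it sums to 40 when the down time batches exist and to 10 when they are skipped.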




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jeanlyn <gi...@git.apache.org>.
Github user jeanlyn commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190608101
  
    @jerryshao Thanks for the explanation. I see what you mean. It only happens at the beginning, and if the stop time is much longer than the window time, I think it's acceptable to skip those down time batches.




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jeanlyn <gi...@git.apache.org>.
Github user jeanlyn commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190994425
  
    Thanks @jerryshao @srowen @zsxwing for the suggestions. I am closing this PR.




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11440#discussion_r54551767
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala ---
    @@ -221,8 +221,12 @@ class JobGenerator(jobScheduler: JobScheduler) extends Logging {
         logInfo("Batches pending processing (" + pendingTimes.size + " batches): " +
           pendingTimes.mkString(", "))
         // Reschedule jobs for these times
    -    val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < restartTime }
    -      .distinct.sorted(Time.ordering)
    +    val skipDownTime = conf.getBoolean("spark.streaming.skipDownTimeBatch", false)
    --- End diff --
    
    I'd prefer not to add yet another configuration to control this. It adds complexity. I don't think the name is descriptive here; what is a 'down time batch'? The current behavior is coherent, since the expected behavior is to pick up where it left off. It's not intended that you leave the job not running for a long time relative to the batch interval.
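
For readers following the diff, the rescheduling step under discussion can be sketched as below. This is a minimal sketch under assumptions, not the code from the PR: the diff above is truncated, so the branch on the flag is a guess based on the PR description and on the author's statement that `pendingTimes` is not skipped. The object name `RescheduleSketch`, the `skipDownTime` parameter, and the use of plain `Long` millisecond values in place of Spark's `Time` class are all hypothetical.

```scala
object RescheduleSketch {
  // Times are plain Longs (milliseconds) here; Spark uses its own Time class.
  def timesToReschedule(
      pendingTimes: Seq[Long],
      downTimes: Seq[Long],
      restartTime: Long,
      skipDownTime: Boolean): Seq[Long] = {
    // Assumption: with the flag set, only the down time batches are dropped;
    // pending batches recorded in the checkpoint are always kept.
    val candidates = if (skipDownTime) pendingTimes else pendingTimes ++ downTimes
    // Same shape as the original line: keep times before the restart,
    // remove duplicates, and sort in ascending order.
    candidates.filter(_ < restartTime).distinct.sorted
  }

  def main(args: Array[String]): Unit = {
    val pending = Seq(4000L, 5000L)
    val down = Seq(5000L, 6000L, 7000L, 8000L)
    println(timesToReschedule(pending, down, 9000L, skipDownTime = false))
    println(timesToReschedule(pending, down, 9000L, skipDownTime = true))
  }
}
```

With the flag off, the sketch reschedules every time from 4000 to 8000; with it on, only the pending times 4000 and 5000 remain, which is the gap the window concern in this thread is about.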




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190530543
  
    Jobs generated in the down time can be used for WAL replay. Did you test whether the behavior of WAL replay is still correct when these down time jobs are removed?




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jeanlyn <gi...@git.apache.org>.
Github user jeanlyn commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190613342
  
    My bad. I will try to figure out a way to fix the problem when window operations are used with the config set to true.




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190610110
  
    But how do you define "much longer": by the number of batches or by time? IMHO we cannot base a patch on assumptions. We should add some defensive code to make sure the logic is still consistent.




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jeanlyn <gi...@git.apache.org>.
Github user jeanlyn closed the pull request at:

    https://github.com/apache/spark/pull/11440




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11440#discussion_r54610922
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala ---
    @@ -221,8 +221,12 @@ class JobGenerator(jobScheduler: JobScheduler) extends Logging {
         logInfo("Batches pending processing (" + pendingTimes.size + " batches): " +
           pendingTimes.mkString(", "))
         // Reschedule jobs for these times
    -    val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < restartTime }
    -      .distinct.sorted(Time.ordering)
    +    val skipDownTime = conf.getBoolean("spark.streaming.skipDownTimeBatch", false)
    --- End diff --
    
    Agreed that there is no need to add this configuration. People can just remove the checkpoint instead.




[GitHub] spark pull request: [SPARK-13586]add config to skip generate down ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190522437
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jeanlyn <gi...@git.apache.org>.
Github user jeanlyn commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190568465
  
    Thanks @jerryshao for the suggestions!
    > Jobs generated in the down time can be used for WAL replay. Did you test whether the behavior of WAL replay is still correct when these down time jobs are removed?
    
    It seems that `pendingTimes` is what is used for WAL replay, and I do not skip these batches.
    
    > Also, for some windowing operations, I think removing the down time jobs may lead to inconsistent results in the windowing aggregation.
    
    Does an inconsistent result mean a wrong result?
    
    Also, I will run the unit tests locally with the config set to true by default.
    
    





[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/11440#issuecomment-190531231
  
    Also, for some windowing operations, I think removing the down time jobs may lead to inconsistent results in the windowing aggregation.

