Posted to reviews@spark.apache.org by lw-lin <gi...@git.apache.org> on 2016/05/03 04:26:42 UTC

[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

Github user lw-lin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12797#discussion_r61837469
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala ---
    @@ -65,8 +65,22 @@ case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock =
           s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} milliseconds")
       }
     
    -  /** Return the next multiple of intervalMs */
    +  /** Return the next multiple of intervalMs
    +   *
    +   * e.g. for intervalMs = 100
    +   * nextBatchTime(0) = 100
    +   * nextBatchTime(1) = 100
    +   * ...
    +   * nextBatchTime(99) = 100
    +   * nextBatchTime(100) = 200
    +   * nextBatchTime(101) = 200
    +   * ...
    +   * nextBatchTime(199) = 200
    +   * nextBatchTime(200) = 300
    +   *
    +   * Note, this way, we'll get nextBatchTime(nextBatchTime(0)) = 200, rather than = 100
    +   */
       def nextBatchTime(now: Long): Long = {
    -    (now - 1) / intervalMs * intervalMs + intervalMs
    +    now / intervalMs * intervalMs + intervalMs
    --- End diff --
    
    @zsxwing thanks for clarifying on this! :-)
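    
    For reference, here's a minimal standalone sketch of the two formulas side by side (hypothetical helper names, with `intervalMs = 100`; integer division truncates toward zero):
    
    ```scala
    object NextBatchTimeDemo extends App {
      val intervalMs = 100L
    
      // old formula: (now - 1) / intervalMs * intervalMs + intervalMs
      def nextBatchTimeOld(now: Long): Long = (now - 1) / intervalMs * intervalMs + intervalMs
    
      // new formula: now / intervalMs * intervalMs + intervalMs
      def nextBatchTimeNew(now: Long): Long = now / intervalMs * intervalMs + intervalMs
    
      // both agree strictly inside an interval...
      assert(nextBatchTimeOld(99) == 100L && nextBatchTimeNew(99) == 100L)
    
      // ...but differ on exact multiples of intervalMs:
      assert(nextBatchTimeOld(100) == 100L) // old: stays at 100, so the trigger can fire again at once
      assert(nextBatchTimeNew(100) == 200L) // new: advances to the next interval
    
      // hence nextBatchTime(nextBatchTime(0)) = 200 under the new formula
      assert(nextBatchTimeNew(nextBatchTimeNew(0)) == 200L)
    }
    ```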
    
    [1]
    The issue is triggered only when both `batchElapsedTimeMs == 0` and `batchEndTimeMs` being a multiple of `intervalMs` hold; e.g. `batchStartTimeMs == 50` and `batchEndTimeMs == 50` given `intervalMs == 100` won't trigger it. So we might have to do something like this:
    
    ```scala
    if (batchElapsedTimeMs == 0 && batchEndTimeMs % intervalMs == 0) {
      clock.waitTillTime(batchEndTimeMs + intervalMs)
    } else {
      clock.waitTillTime(nextBatchTime(batchEndTimeMs))
    }
    ```
    
    To me, this seems a little hard to interpret...
    
    [2]
    > ... deal with one case: If a batch takes exactly intervalMs, we should run the next batch at once instead of sleeping intervalMs
    
    This is a good point! I've done some calculations based on your comments, and it seems we would still run the next batch at once when the last job takes exactly `intervalMs`?
    
    prior to this patch:
    ```
    batch      | job
    -----------------------------------------
    [  0,  99] |
    [100, 199] | job x starts at 100, stops at 199, takes 100
    [200, 299] |
    ```
    after this patch, it's still the same:
    ```
    batch      | job
    -----------------------------------------
    [  0,  99] |
    [100, 199] | job y starts at 100, stops at 199, takes 100
    [200, 299] |
    ```
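    
    As a quick sanity check of the tables above (a hypothetical sketch; job y stops at 199, just before the boundary at 200):
    
    ```scala
    object TakesFullIntervalDemo extends App {
      val intervalMs = 100L
    
      // old and new formulas from the diff above
      def nextBatchTimeOld(now: Long): Long = (now - 1) / intervalMs * intervalMs + intervalMs
      def nextBatchTimeNew(now: Long): Long = now / intervalMs * intervalMs + intervalMs
    
      // job y starts at 100 and stops at 199
      val batchEndTimeMs = 199L
    
      // both formulas schedule the next batch at 200, i.e. immediately after
      // the job finishes, so the behavior is indeed the same before and after the patch
      assert(nextBatchTimeOld(batchEndTimeMs) == 200L)
      assert(nextBatchTimeNew(batchEndTimeMs) == 200L)
    }
    ```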
    --
    @zsxwing thoughts on the above [1] and [2]? Thanks! :-)
    
    


