You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/01 06:02:24 UTC

[GitHub] [spark] nyingping opened a new pull request, #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

nyingping opened a new pull request, #36737:
URL: https://github.com/apache/spark/pull/36737

   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Fix bug that Generate wrong time window when (timestamp-startTime) % slideDuration < 0
   
   The original time window generation rule
   ```
    lastStart <- timestamp - (timestamp - startTime + slideDuration) % slideDuration
      ```
   change like this
   ```
    remainder <-  (timestamp - startTime) % slideDuration
    lastStart <-
       if (remainder < 0) timestamp - remainder - slideDuration
       else timestamp - remainder
      
      ```
   
   reference: [https://github.com/apache/flink/pull/18982](https://github.com/apache/flink/pull/18982)
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Since the generation strategy of the sliding window in PR [#35362](https://github.com/apache/spark/pull/35362) is changed to the current one, and that leads to a new problem.
   
   A window generation error occurs when the time required to process the recorded data is negative and the modulo value between the time and window length is less than 0. In the current test cases, this bug does not thorw up.
   
   [ test("negative timestamps")](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala#L299)
   
   ```
   val df1 = Seq(
     ("1970-01-01 00:00:02", 1),
     ("1970-01-01 00:00:12", 2)).toDF("time", "value")
   val df2 = Seq(
     (LocalDateTime.parse("1970-01-01T00:00:02"), 1),
     (LocalDateTime.parse("1970-01-01T00:00:12"), 2)).toDF("time", "value")
   
   Seq(df1, df2).foreach { df =>
     checkAnswer(
       df.select(window($"time", "10 seconds", "10 seconds", "5 seconds"), $"value")
         .orderBy($"window.start".asc)
         .select($"window.start".cast(StringType), $"window.end".cast(StringType), $"value"),
       Seq(
         Row("1969-12-31 23:59:55", "1970-01-01 00:00:05", 1),
         Row("1970-01-01 00:00:05", "1970-01-01 00:00:15", 2))
     )
   } 
   ```
   The timestamp of the above test data is not negative, and the value modulo the window length is not negative, so it can be passes the test case.
   
   An exception occurs when the timestamp becomes something like this.
   
   ```
   val df3 = Seq(
         ("1969-12-31 00:00:02", 1),
         ("1969-12-31 00:00:12", 2)).toDF("time", "value")
   val df4 = Seq(
         (LocalDateTime.parse("1969-12-31T00:00:02"), 1),
         (LocalDateTime.parse("1969-12-31T00:00:12"), 2)).toDF("time", "value")    Seq(df3, df4).foreach { df =>
         checkAnswer(
           df.select(window($"time", "10 seconds", "10 seconds", "5 seconds"), $"value")
             .orderBy($"window.start".asc)
             .select($"window.start".cast(StringType), $"window.end".cast(StringType), $"value"),
           Seq(
             Row("1969-12-30 23:59:55", "1969-12-31 00:00:05", 1),
             Row("1969-12-31 00:00:05", "1969-12-31 00:00:15", 2))
         )
   } 
   ```
   run and get unexpected result:
   
   ```
   == Results ==
   !== Correct Answer - 2 ==                      == Spark Answer - 2 ==
   !struct<>                                      struct<CAST(window.start AS STRING):string,CAST(window.end AS STRING):string,value:int>
   ![1969-12-30 23:59:55,1969-12-31 00:00:05,1]   [1969-12-31 00:00:05,1969-12-31 00:00:15,1]
   ![1969-12-31 00:00:05,1969-12-31 00:00:15,2]   [1969-12-31 00:00:15,1969-12-31 00:00:25,2] 
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   Add new unit test.
   
   **benchmark result**
   
   oldlogic[#18364](https://github.com/apache/spark/pull/18364)  VS 【fix version】
   ```
   Running benchmark: tumbling windows
   Running case: old logic
   Stopped after 407 iterations, 10012 ms
   Running case: new logic
   Stopped after 615 iterations, 10007 ms
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Windows 10 10.0
   Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
   tumbling windows:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   old logic                                            17             25           9        580.1           1.7       1.0X
   new logic                                            15             16           2        680.8           1.5       1.2X
   
   Running benchmark: sliding windows
   Running case: old logic
   Stopped after 10 iterations, 10296 ms
   Running case: new logic
   Stopped after 15 iterations, 10391 ms
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Windows 10 10.0
   Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
   sliding windows:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   old logic                                          1000           1030          19         10.0         100.0       1.0X
   new logic                                           668            693          21         15.0          66.8       1.5X
   
   ```
   
   
   Fixed version than PR [#38069](https://github.com/apache/spark/pull/35362) lost a bit of the performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891847061


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration
+          val lastStart = CaseWhen(
+            Seq((LessThan(remainder, 0), timestamp - remainder - window.slideDuration))

Review Comment:
   Thank you for reviewing.
   
   For any event timestamp, the start time of the last window (whether tumbling or sliding) obtained by it should always be less than the current timestamp. When `(timestamp - window.starttime)% window.slideduration <0`, the obtained laststart will be greater than timestamp. At this time, it is necessary to shift the window to the right, which is the correct laststart value.
   
   For example
   
   code
   ```
       val timestamp = -13
       val offset = 0
       val windowSize = 7
   
       // old code
       val lastStartOld = timestamp - (timestamp - offset - windowSize) % windowSize
   
      // new code
       val remainder =  (timestamp - offset) % windowSize
   
       val lastStartNew =
         if (remainder < 0) {
           timestamp - remainder - windowSize
         } else {
           timestamp - remainder
         }
   
       println(s"lastStartOld = $lastStartOld   lastStartNew = $lastStartNew")
   ```
   result
   ```
   lastStartOld = -7 lastStartNew = -14
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1150658562

   General comment from what I see in review comments:
   
   I see you repeat the explanation of the code you changed; I don't think reviewers asked about the detailed explanation of the code changes. There is no "high-level" explanation why it is broken (I roughly see it's from the language spec of modulo operation), and also "high-level" explanation how you deal with it in this PR. Please look through the description of the reference Flink PR you linked - while it also mentioned about code snippet, it explained with high level first, and then introduced the code change it proposed.
   
   As long as you update the PR description with high-level explanation, I guess it should be straightforward to understand the code change, and you'd easily pass the reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891330435


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration

Review Comment:
   When does processing time become negative -- is this a more fundamental problem to fix?
   
   Also, does not seem to be a reason to use CaseWhen here? just do this in Scala code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891848795


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)

Review Comment:
   `before epoch` means that B.C or before '1970-01-01 00:00:00',I understanding that both of negative value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891842533


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration

Review Comment:
   Thank you for reviewing.
   My understanding is that whether the event time is negative or not is determined by the upstream data source, not the Spark platform itself. Even if the event time is negative, the platform should give the correct value.
   
   Does `in scala code` refer to the following code:
   ```
   val lastStart = if (remainder < 0) {
       timestamp - remainder - window.slideDuration
    } else {
       timestamp - remainder
   }
   ```
   In that case,The remainder is a expressions. Predicate, the remainder < 0 can't get a Boolean value.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jerrypeng commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
jerrypeng commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891557370


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)

Review Comment:
   If the timeColumn is before epoch, what would be the timestamp long val be? Negative?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1156392559

   I get it, and I'll try to update this part as much as possible,thanks a lot. @HeartSaVioR 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1272665859

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1143239527

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891842533


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration

Review Comment:
   Thank you for reviewing.
   My understanding is that whether the event time is negative or not is determined by the upstream data source, not the Spark platform itself. Even if the event time is negative, the platform should give the correct value.
   
   Does `in scala code` refer to the following code:
   ```
   val lastStart = if (remainder < 0) {
       timestamp - remainder - window.slideDuration
    } else {
       timestamp - remainder
   }
   ```
   In that case,The `remainder` is a `expressions. predicate`, the `remainder < 0` can't get a` boolean` value.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1143157052

   This bug was caused by my previous PR. I'm sorry.
   Could you have a look when you have time @HeartSaVioR  @viirya,Thanks in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891842533


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration

Review Comment:
   Thank you for reviewing.
   My understanding is that whether the event time is negative or not is determined by the upstream data source, not the Spark platform itself. Even if the event time is negative, the platform should give the correct value.
   
   Does `in scala code` refer to the following code:
   ```
   val lastStart = if (remainder < 0) {
       timestamp - remainder - window.slideDuration
    } else {
       timestamp - remainder
   }
   ```
   In that case,The `remainder` is a `expressions.predicate`, the `remainder < 0` can't get a` boolean` value.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AnywalkerGiser commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
AnywalkerGiser commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r911545492


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)

Review Comment:
   Whether has something to do with this PR[[SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows](https://github.com/apache/spark/pull/36566).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…
URL: https://github.com/apache/spark/pull/36737


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #36737:
URL: https://github.com/apache/spark/pull/36737#issuecomment-1148131018

   Sorry I'll find a time sooner. I'll also find someone able to review this in prior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891847061


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration
+          val lastStart = CaseWhen(
+            Seq((LessThan(remainder, 0), timestamp - remainder - window.slideDuration))

Review Comment:
   Thank you for reviewing.
   
   For any event timestamp, the start time of the last window (whether tumbling or sliding) obtained by it should always be less than the current timestamp. When `(timestamp - window.starttime)% window.slideduration <0`, the obtained laststart will be greater than timestamp. At this time, it is necessary to shift the window to the right, which is the correct laststart value.
   
   For example
   
   code
   ```
       val timestamp = -13
       val offset = 0
       val windowSize = 7
   
       // old code
       val lastStartOld = timestamp - (timestamp - offset - windowSize) % windowSize
   
      // new code
       val remainder =  (timestamp - offset) % windowSize
   
       val lastStartNew =
         if (remainder < 0) {
           timestamp - remainder - windowSize
         } else {
           timestamp - remainder - windowSize
         }
   
       println(s"lastStartOld = $lastStartOld   lastStartNew = $lastStartNew")
   ```
   result
   ```
   lastStartOld = -7 lastStartNew = -14
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891848795


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)

Review Comment:
   `before epoch` means that B.C or before '1970-01-01 00:00:00'? I understanding that both of negative value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891842533


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration

Review Comment:
   Thank you for reviewing.
   My understanding is that whether the event time is negative or not is determined by the upstream data source, not the Spark platform itself. Even if the event time is negative, the platform should give the correct value.
   
   Does `in scala code` refer to the following code:
   ```
   val lastStart = if (remainder < 0) {
       timestamp - remainder - window.slideDuration
    } else {
       timestamp - remainder
   }
   ```
   In that case,The `remainder` is a `expressions.predicate`, the `remainder < 0` can't get a` boolean` value directly.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891850083


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration
+          val lastStart = CaseWhen(
+            Seq((LessThan(remainder, 0), timestamp - remainder - window.slideDuration))

Review Comment:
   I think the question is how this arises in the first place. The code is self explanatory, not asking you to explain it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] nyingping commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
nyingping commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r893042194


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration
+          val lastStart = CaseWhen(
+            Seq((LessThan(remainder, 0), timestamp - remainder - window.slideDuration))

Review Comment:
   @srowen sure,As mentioned above, I think this problem is unavoidable because we can't control the behavior of users. All we can do is get the right results even if users enter illegal data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jerrypeng commented on a diff in pull request #36737: [SPARK-39347] [SS] Generate wrong time window when (timestamp-startTime) % slideDuration…

Posted by GitBox <gi...@apache.org>.
jerrypeng commented on code in PR #36737:
URL: https://github.com/apache/spark/pull/36737#discussion_r891532480


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3963,8 +3966,10 @@ object TimeWindowing extends Rule[LogicalPlan] {
 
         def getWindow(i: Int, dataType: DataType): Expression = {
           val timestamp = PreciseTimestampConversion(window.timeColumn, dataType, LongType)
-          val lastStart = timestamp - (timestamp - window.startTime
-            + window.slideDuration) % window.slideDuration
+          val remainder = (timestamp - window.startTime) % window.slideDuration
+          val lastStart = CaseWhen(
+            Seq((LessThan(remainder, 0), timestamp - remainder - window.slideDuration))

Review Comment:
   Can you explain why this is correct now?  Please give an example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org