Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/15 09:00:50 UTC

[GitHub] [spark] nyingping opened a new pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

nyingping opened a new pull request #35526:
URL: https://github.com/apache/spark/pull/35526


   
   ### What changes were proposed in this pull request?
    At present, the sliding window is implemented as expand + filter, but in some cases the filter is not necessary.
    
    Filtering is only required when the sliding window is irregular. When the window duration divided by the slide duration is an integer (which I believe is also the case for most real-world uses of sliding windows), no filtering is needed, which saves computation resources and improves performance (see the sketch below).
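    
    For illustration, a minimal sketch of the two cases (the input DataFrame `df` with a `time` column is assumed for the example and is not part of this PR):
    
    ```scala
    import org.apache.spark.sql.functions.{col, window}
    
    // Regular case: 15s window, 5s slide -> 15 % 5 == 0. Every row expands into exactly
    // 15 / 5 = 3 windows, all of which contain the event time, so no filter on
    // window.start / window.end is needed.
    val regular = df.select(window(col("time"), "15 seconds", "5 seconds"))
    
    // Irregular case: 15s window, 4s slide -> 15 % 4 != 0. The expansion over-produces
    // candidate windows, so rows must still be filtered against the window bounds.
    val irregular = df.select(window(col("time"), "15 seconds", "4 seconds"))
    ```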
   
   
   ### Why are the changes needed?
    It saves computation resources and improves performance.
   
   
   ### Does this PR introduce _any_ user-facing change?
    No.
   
   ### How was this patch tested?
    Unit tests and a benchmark.
    
    A simple benchmark is in this [commit](https://github.com/nyingping/spark/commit/cccc742f601cffca99ab602165c024b3523ebc72); thanks to [HeartSaVioR@d532b6f](https://github.com/HeartSaVioR/spark/commit/d532b6f6bcdd80cdaac520b21587ebb69ff2df8f).
   
   > spark.range(numOfRow)
   >       .selectExpr("CAST(id AS timestamp) AS time")
   >       .select(window(col("time"), "15 seconds", "3 seconds", "2 seconds"))
   >       .count()
   
   Result:
   
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_291-b10 on Windows 10 10.0
   AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
   sliding windows:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   old logic                                           799            866          70         12.5          79.9       1.0X
   new logic                                            58             68           9        171.2           5.8      13.7X
   ```




[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r807405322



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3927,8 +3927,12 @@ object TimeWindowing extends Rule[LogicalPlan] {
           val projections = windows.map(_ +: child.output)
 
           val filterExpr =
-            window.timeColumn >= windowAttr.getField(WINDOW_START) &&
-              window.timeColumn < windowAttr.getField(WINDOW_END)
+            if (window.windowDuration % window.slideDuration == 0) {

Review comment:
       Probably good to leave a code comment like the one below:
    
    > When the condition `windowDuration % slideDuration = 0` is fulfilled, the estimation of the number of windows becomes exact, which means all produced windows are valid.
    
    I'm not a native speaker, so the sentence may not be perfect, but it should be acceptable and understandable.
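    
    For reference, a rough sketch of how the conditional plus such a comment could look in `TimeWindowing`. The `else` branch mirrors the removed lines above; the `IsNotNull` expression in the then-branch is an assumption (it keeps dropping null timestamps, as the old comparison did) and is not necessarily what the PR finally merges:
    
    ```scala
    val filterExpr =
      if (window.windowDuration % window.slideDuration == 0) {
        // When windowDuration % slideDuration == 0, the estimated number of windows
        // becomes exact, so every produced window is valid; only null timestamps
        // still need to be dropped.
        IsNotNull(window.timeColumn)
      } else {
        window.timeColumn >= windowAttr.getField(WINDOW_START) &&
          window.timeColumn < windowAttr.getField(WINDOW_END)
      }
    ```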

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,46 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("SPARK-38214: No need to filter data when the sliding window length is not redundant") {

Review comment:
       It would be nice if we were clear about what we want to test.
    
    If we want to verify that the change still produces the right windows, I'd rather verify the boundaries of the time range explicitly, even if it is more verbose.
    
    If such a test already exists and we want to ensure we don't inject a comparison expression into the time-window calculation, we probably need to look into the logical plan (especially the Filter node) and verify the expression used in the Filter (see the sketch below).
    
    If such a test already exists and you feel uneasy verifying the expression in the Filter node, it's OK to skip adding the test, since it means the functionality is validated by existing tests.
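    
    A minimal sketch of that plan-inspection idea, reusing names from the test code above (illustrative only, not the final test):
    
    ```scala
    // Locate the Filter node in the optimized logical plan and check that no
    // window-boundary comparison (>= / <) was injected for this query.
    val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
    assert(filter.isDefined)
    val boundaryChecks = filter.get.constraints.filter { e =>
      e.toString.contains(">=") || e.toString.contains("<")
    }
    assert(boundaryChecks.isEmpty,
      "No need to filter windows when windowDuration is multiple of slideDuration")
    ```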






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808699368



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -521,7 +521,7 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
     Seq(df1, df2, df3, df4).foreach { df =>
       val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
       assert(filter.isDefined)
-      val exist = filter.get.constraints.filter(e =>
+      val exist = filter.get.constraints.filter( e =>

Review comment:
       nit: The previous version was correct per the style guide. Please refer to the guidance below:
    https://github.com/databricks/scala-style-guide#anonymous-methods
    
    When you use parentheses for an anonymous method, there is no space; when you use curly braces, there is a space.
    
    For multi-line anonymous methods we prefer curly braces, like:
   
   ```
   val exist = filter.get.constraints.filter { e =>
     e.toString.contains(">=") || e.toString.contains("<")
   }
   ```






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808602851



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>

Review comment:
       I got it.






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808624489



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,42 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter windows when windowDuration is multiple of slideDuration") {
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2, df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      assert(filter.isDefined)
+      val exist = filter.get.constraints.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter windows " +
+        "when windowDuration is multiple of slideDuration")
+    }
+  }
+

Review comment:
       nit: unnecessary empty line






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808588658



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       I agree that "`redundant`" is a poor description. How about "`No need to filter data when the windowDuration of a sliding window is an integer multiple of slideDuration`" instead? If this is accurate enough, I can use it as the name of the test case.






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808615040



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    Seq(df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    // check produces right windows

Review comment:
       I think the test case called "millisecond precision sliding windows" already covers this situation.






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808701877



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -521,7 +521,7 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
     Seq(df1, df2, df3, df4).foreach { df =>
       val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
       assert(filter.isDefined)
-      val exist = filter.get.constraints.filter(e =>
+      val exist = filter.get.constraints.filter( e =>

Review comment:
       It has been changed back. Thanks!






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808626843



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,42 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter windows when windowDuration is multiple of slideDuration") {
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2, df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      assert(filter.isDefined)
+      val exist = filter.get.constraints.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter windows " +
+        "when windowDuration is multiple of slideDuration")
+    }
+  }
+

Review comment:
       Got it. Sorry for my carelessness.






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r807907460



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>

Review comment:
       1. Please add `assert(filter.isDefined)` on the line above.
    2. You may be able to call `constraints.filter { e => ... }` directly instead of adding `.iterator.toStream`.
   

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>

Review comment:
       I guess we can test df1, df2, df3, df4 here instead of having two separate code blocks.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(

Review comment:
       We don't need to validate the results here since other tests also do this.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    Seq(df3, df4).foreach { df =>

Review comment:
       This can be removed if we test against df1 to df4 above.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    Seq(df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    // check produces right windows

Review comment:
       We don't need to validate the results here since other tests already do this. If you feel we don't have matching tests, please add a separate general test.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       Honestly, I don't get what "`when the sliding window length is not redundant`" means. Could you please elaborate on the meaning of `redundant` here, or could we use a mathematical expression like `multiple of` or `a factor of`?






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808594997



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    Seq(df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    // check produces right windows

Review comment:
       Most of the existing tests for sliding windows use the condition `windowDuration % slideDuration != 0`. The only exception is the test case called "`millisecond precision sliding windows`", but it focuses on millisecond precision. So I'm not sure whether I need to add a new case to verify that the change still produces the right windows.






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808588658



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       I agree that "`redundant`" is a poor description. How about "`No need to filter data when the windowDuration of a sliding window is an integer multiple of slideDuration`" instead? If this is accurate enough, I can use it as the name of the test case and the assert message.






[GitHub] [spark] nyingping commented on pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1040149788


   I'll check it later.




[GitHub] [spark] HeartSaVioR commented on pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1043700549


   OK, no feedback during working hours in the US timezone.
   
   Thanks! Merging to master.




[GitHub] [spark] nyingping commented on pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1041134095


   Sorry, it's my fault. I mixed the update history of the previous branch with the present one, which caused interference and misunderstanding.




[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808609616



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    Seq(df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    // check produces right windows

Review comment:
       Please add the case of `windowDuration % slideDuration == 0` to one of the existing tests if you feel we missed it. Let's focus the new test on verifying the actual change (an illustrative sketch follows below).
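    
    If it helps, an illustrative sketch of such a case (timestamps and durations are made up; it assumes the suite's existing imports and helpers):
    
    ```scala
    // windowDuration (10s) is a multiple of slideDuration (5s): every generated
    // window is valid, so the results can be checked without relying on the removed filter.
    val df = Seq(("2022-02-15 19:39:27", 4)).toDF("time", "value")
      .select(window($"time", "10 seconds", "5 seconds"), $"value")
    
    checkAnswer(
      df.select($"window.start".cast("string"), $"window.end".cast("string"), $"value"),
      Seq(
        Row("2022-02-15 19:39:20", "2022-02-15 19:39:30", 4),
        Row("2022-02-15 19:39:25", "2022-02-15 19:39:35", 4))
    )
    ```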






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808590579



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>

Review comment:
       I got it.






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808610496



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       I got it. Thanks!






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808610496



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       I got it. Thanks!






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808588658



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       I agree that "`redundant`" is a poor description. How about "`No need to filter data when the windowDuration of a sliding window is an integer multiple of slideDuration`" or "`No need to filter data when windowDuration is multiple of slideDuration for sliding windows`" instead? If either is accurate enough, I can use it as the new name of the test case.






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808629052



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,42 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter windows when windowDuration is multiple of slideDuration") {
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2, df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      assert(filter.isDefined)
+      val exist = filter.get.constraints.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter windows " +
+        "when windowDuration is multiple of slideDuration")
+    }
+  }
+

Review comment:
       No worries. A nit is really just a nit that anyone can miss.






[GitHub] [spark] HeartSaVioR closed pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #35526:
URL: https://github.com/apache/spark/pull/35526


   




[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r807504401



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3927,8 +3927,12 @@ object TimeWindowing extends Rule[LogicalPlan] {
           val projections = windows.map(_ +: child.output)
 
           val filterExpr =
-            window.timeColumn >= windowAttr.getField(WINDOW_START) &&
-              window.timeColumn < windowAttr.getField(WINDOW_END)
+            if (window.windowDuration % window.slideDuration == 0) {

Review comment:
       @HeartSaVioR  thanks!
   
   I've added code comments as you suggested. I think your suggestion is good enough.
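
   The hunk quoted above is truncated after the `if` line; a rough sketch of how the conditional filter could read — assuming the even-division branch only needs to drop rows with a null time column, which is an assumption here rather than a quote of the merged code — is:

   ```scala
   val filterExpr =
     if (window.windowDuration % window.slideDuration == 0) {
       // windowDuration divides evenly by slideDuration, so every generated window
       // already contains the event time; only null event times need to be dropped.
       IsNotNull(window.timeColumn)
     } else {
       // Irregular case: some generated windows do not contain the event time,
       // so keep the original range filter on the window boundaries.
       window.timeColumn >= windowAttr.getField(WINDOW_START) &&
         window.timeColumn < windowAttr.getField(WINDOW_END)
     }
   ```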






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808608809



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       `No need to filter windows when windowDuration is multiple of slideDuration` looks OK to me. A tumbling window means windowDuration = slideDuration, so we don't need to explicitly say `sliding windows`. Please change the title of the PR as well. Thanks!






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r807402650



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,46 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("SPARK-38214: No need to filter data when the sliding window length is not redundant") {

Review comment:
       It would be nice to be clear about what we want to test.
   
   If we want to verify that the change still produces the right windows, I'd rather verify the boundaries of the time range explicitly instead of just checking the value column, even if that is more verbose. The test would be a general one, so it doesn't need a JIRA ticket number, and its name should also be general.
   
   If we already have such a test and want to ensure we don't inject a comparison expression into the calculation of the time window, we probably need to look into the logical plan (especially the Filter) and verify the expression used in the Filter.
   
   If we already have such a test and you feel uneasy about verifying the expression in the Filter node, it's OK to skip adding the new test, since that means the functionality is validated by the existing tests.
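
   As an illustration of that kind of plan-level check, a sketch only — it pattern-matches the Filter's condition directly rather than stringifying constraints, and assumes a DataFrame `df` built with `window(...)`:

   ```scala
   import org.apache.spark.sql.catalyst.expressions.{GreaterThanOrEqual, LessThan}
   import org.apache.spark.sql.catalyst.plans.logical.Filter

   // Locate the Filter node in the optimized plan and verify its condition carries
   // no range comparison on the window boundaries.
   val filterNode = df.queryExecution.optimizedPlan.collectFirst { case f: Filter => f }
   assert(filterNode.isDefined)
   val rangePredicates = filterNode.get.condition.collect {
     case e @ (_: GreaterThanOrEqual | _: LessThan) => e
   }
   assert(rangePredicates.isEmpty,
     "windows should not be range-filtered when windowDuration is a multiple of slideDuration")
   ```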






[GitHub] [spark] HeartSaVioR commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r807402650



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,46 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("SPARK-38214: No need to filter data when the sliding window length is not redundant") {

Review comment:
       It would be nice to be clear about what we want to test.
   
   If we want to verify that the change still produces the right windows, I'd rather verify the boundaries of the time range explicitly instead of just checking the value column, even if that is more verbose.
   
   If we already have such a test and want to ensure we don't inject a comparison expression into the calculation of the time window, we probably need to look into the logical plan (especially the Filter) and verify the expression used in the Filter.
   
   If we already have such a test and you feel uneasy about verifying the expression in the Filter node, it's OK to skip adding the new test, since that means the functionality is validated by the existing tests.






[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808588658



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")

Review comment:
       I agree that "`redundant`" is a poor description. How about "`No need to filter data when the windowDuration of a sliding window is an integer multiple of slideDuration`" or "`No need to filter data when windowDuration is multiple of slideDuration for sliding windows`" instead? If either is accurate enough, I can use it as the new name of the test case.






[GitHub] [spark] nyingping commented on pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
nyingping commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1043704339


   @HeartSaVioR Thank you very much for the review!




[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r807503946



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,46 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("SPARK-38214: No need to filter data when the sliding window length is not redundant") {

Review comment:
       @HeartSaVioR thank you for the review!
   
   I have updated the test case.






[GitHub] [spark] nyingping commented on pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1040149788


   I'll check it later.




[GitHub] [spark] AmplabJenkins commented on pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1041132771


   Can one of the admins verify this patch?




[GitHub] [spark] nyingping commented on a change in pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
nyingping commented on a change in pull request #35526:
URL: https://github.com/apache/spark/pull/35526#discussion_r808594997



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameTimeWindowingSuite.scala
##########
@@ -490,4 +490,81 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession {
       assert(attributeReference.dataType == tuple._2)
     }
   }
+
+  test("No need to filter data when the sliding window length is not redundant") {
+    // check the value column
+    val df1 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df2 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "0 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    val df3 = Seq(
+      ("2022-02-15 19:39:34", 1, "a"),
+      ("2022-02-15 19:39:56", 2, "a"),
+      ("2022-02-15 19:39:27", 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "-2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+    val df4 = Seq(
+      (LocalDateTime.parse("2022-02-15T19:39:34"), 1, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:56"), 2, "a"),
+      (LocalDateTime.parse("2022-02-15T19:39:27"), 4, "b")).toDF("time", "value", "id")
+      .select(window($"time", "9 seconds", "3 seconds", "2 second"), $"value")
+      .orderBy($"window.start".asc, $"value".desc).select("value")
+
+    Seq(df1, df2).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    Seq(df3, df4).foreach { df =>
+      val filter = df.queryExecution.optimizedPlan.find(_.isInstanceOf[Filter])
+      val exist = filter.get.constraints.iterator.toStream.filter(e =>
+        e.toString.contains(">=") || e.toString.contains("<"))
+      assert(exist.isEmpty, "No need to filter data between " +
+        "window.start and window.end when the sliding window length is not redundant")
+
+      checkAnswer(
+        df,
+        Seq(Row(4), Row(4), Row(4), Row(1), Row(1), Row(1), Row(2), Row(2), Row(2))
+      )
+    }
+
+    // check produces right windows

Review comment:
       Most of the existing tests for sliding windows are under the condition `windowDuration % slideDuration != 0`. The only exception is the test case called "`millisecond precision sliding windows`", but it focuses on millisecond precision. So I'm not sure whether I need to add a new case to verify that the change still produces the right windows.
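
   To make the distinction concrete, an illustrative sketch (not code from the suite; `df` here is just an assumed input with a `time` column):

   ```scala
   // Irregular case: 10 % 3 != 0. Up to ceil(10 / 3) = 4 candidate windows are
   // generated per event, and those that do not contain the event must be filtered out.
   val irregular = df.select(window($"time", "10 seconds", "3 seconds"))

   // Regular case: 9 % 3 == 0. Exactly 9 / 3 = 3 windows are generated per event,
   // and every one of them contains the event, so no range filter is needed.
   val regular = df.select(window($"time", "9 seconds", "3 seconds"))
   ```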






[GitHub] [spark] HeartSaVioR commented on pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1042606841


   I'll leave this open for a day to give others a chance to review. I'll merge it tomorrow if there's no new feedback.




[GitHub] [spark] HeartSaVioR commented on pull request #35526: [SPARK-38214][SS]No need to filter windows when windowDuration is multiple of slideDuration

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1043702003


   Thanks @nyingping for the contribution! I merged this into master.




[GitHub] [spark] HeartSaVioR commented on pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1041009520


   Yeah, I meant an additional optimization along with the previous one. Sorry if I confused you.




[GitHub] [spark] viirya commented on pull request #35526: [SPARK-38214][SS]No need to filter data when the sliding window length is not redundant

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #35526:
URL: https://github.com/apache/spark/pull/35526#issuecomment-1041004770


   Is this a follow-up of https://github.com/apache/spark/pull/35362? It looks like a different one, but seems okay. I will re-check it later.

