You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/11/06 16:03:17 UTC

[GitHub] [spark] wangyum opened a new pull request #34504: [SPARK-37226][SQL] Filter push down through window

wangyum opened a new pull request #34504:
URL: https://github.com/apache/spark/pull/34504


   ### What changes were proposed in this pull request?
   
   This pr enhance `PushPredicateThroughNonJoin` to support filter push down through window if window partition is empty. For example:
   ```scala
   spark.sql("CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(1000)")
   spark.sql("SELECT * FROM (SELECT a, count(*) cnt, row_number() OVER (ORDER BY a desc) as rn FROM t1 GROUP BY a) WHERE rn <= 10").explain(true)
   ```
   After this pr:
   ```
   == Optimized Logical Plan ==
   Filter (rn#4 <= 10)
   +- Window [row_number() windowspecdefinition(a#7L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#4], [a#7L DESC NULLS LAST]
      +- GlobalLimit 10
         +- LocalLimit 10
            +- Sort [a#7L DESC NULLS LAST], true
               +- Aggregate [a#7L], [a#7L, count(1) AS cnt#3L]
                  +- Project [a#7L]
                     +- Relation default.t1[a#7L,b#8L] parquet
   ```
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962766378


   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967784061


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49650/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-983363748


   Hi, All. Thank you for making this PR and #34367 .
   - This PR got +1 already and @zhengruifeng mentioned that `This should be faster than SPARK-37099` before.
   - However, the last benchmark and @zhengruifeng comment sounds like   https://github.com/apache/spark/pull/34367 is faster and more general.
    
   I'm wondering what is the current status and next plan for this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962472670


   **[Test build #144955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144955/testReport)** for PR 34504 at commit [`96f1f47`](https://github.com/apache/spark/commit/96f1f47d69753e2889b692b6934336d5cf5e4a11).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962765805


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49444/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967781922


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49650/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755612866



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child))))

Review comment:
       Yes. `orderSpec` is always specified for `RowNumber`:
   ```
   spark-sql> SELECT a, b, row_number() OVER (PARTITION BY a) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
   Error in query: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table
   spark-sql>
   ```
   But some window function do not need `orderSpec`. For example:
   ```sql
   spark-sql> SELECT a, b, last(b) OVER (PARTITION BY a) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
   A2	3	3
   A1	2	1
   A1	1	1
   A1	1	1
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962755804


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49444/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967770239


   **[Test build #145181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145181/testReport)** for PR 34504 at commit [`a3a7648`](https://github.com/apache/spark/commit/a3a764888454c8f0778734ead1309c717e5e6407).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962486478


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49426/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962508570


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144955/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967757751


   @tanelk I did a benchmark and this implementation is faster than SPARK-37099.
   ```scala
   import org.apache.spark.benchmark.Benchmark
   val numRows = 1024 * 1024 * 20
   spark.sql(s"CREATE TABLE t1 using parquet AS SELECT id as a, id as b FROM range(${numRows}L)")
   val benchmark = new Benchmark("Benchmark filter push down through window", numRows, minNumIters = 5)
   
   Seq(1, 1000).foreach { threshold =>
     val name = s"Filter push down through window ${if (threshold > 1) "(Enabled)" else "(Disabled)"}"
     benchmark.addCase(name) { _ =>
       withSQLConf("spark.sql.execution.topKSortFallbackThreshold" -> s"$threshold") {
         spark.sql("SELECT * FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM t1) t WHERE rn > 100 and rn <= 200").write.format("noop").mode("Overwrite").save()
       }
     }
   }
   
   benchmark.addCase("[SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation") { _ =>
     withSQLConf(
       "spark.sql.rankLimit.enabled" -> "true",
       "spark.sql.execution.topKSortFallbackThreshold" -> "0") {
       spark.sql("SELECT * FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM t1) t WHERE rn > 100 and rn <= 200").write.format("noop").mode("Overwrite").save()
     }
   }
   
   benchmark.run()
   ```
   
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   Benchmark filter push down through window:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------------------------
   Filter push down through window (Disabled)                                         11289          17709        1128          1.9         538.3       1.0X
   Filter push down through window (Enabled)                                           1252           1345         114         16.8          59.7       9.0X
   [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation           2542           2666         143          8.3         121.2       4.4X
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967802246


   **[Test build #145181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145181/testReport)** for PR 34504 at commit [`a3a7648`](https://github.com/apache/spark/commit/a3a764888454c8f0778734ead1309c717e5e6407).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977533569


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50030/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977457010


   **[Test build #145558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145558/testReport)** for PR 34504 at commit [`a9c4ebd`](https://github.com/apache/spark/commit/a9c4ebda972b024901bf53fa1f3c90c1a034e037).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r744969365



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,29 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case EqualTo(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1

Review comment:
       `GreaterThan` and `GreaterThanOrEqual` would also work with the literal on the left. Also for `EqualTo` it can be either way. Better yet, you can use `Equality`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755892984



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>

Review comment:
       I am afraid we can not relax restrictions here, suppose this case:
   
   ```
   val df1 = spark.range(0, 10000, 1, 9).select(when('id < 2500, 2499).when('id >= 7500, 10000).otherwise('id).as("key1"), 'id as "value1").withColumn("hash1", abs(hash(col("key1"))).mod(100))
   
   val df2 = df1.withColumn("rank", row_number().over(Window.orderBy("value1"))).withColumn("lag3", lag("value1", 3).over(Window.orderBy("value1"))).withColumn("count", count("value1").over(Window.orderBy("value1").rowsBetween(-10, 10))).where(col("rank") <= 1)
   
   println(df2.queryExecution.optimizedPlan)
   
   
   Filter (rank#176 <= 1)
   +- Window [row_number() windowspecdefinition(value1#168L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#176, lag(value1#168L, -3, null) windowspecdefinition(value1#168L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -3, -3)) AS lag3#181L, count(1) windowspecdefinition(value1#168L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -10, 10)) AS count#189L], [value1#168L ASC NULLS FIRST]
      +- Project [CASE WHEN (id#165L < 2500) THEN 2499 WHEN (id#165L >= 7500) THEN 10000 ELSE id#165L END AS key1#167L, id#165L AS value1#168L, (abs(hash(CASE WHEN (id#165L < 2500) THEN 2499 WHEN (id#165L >= 7500) THEN 10000 ELSE id#165L END, 42), false) % 100) AS hash1#171]
         +- Range (0, 10000, step=1, splits=Some(9))
   
   
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962508357


   **[Test build #144955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144955/testReport)** for PR 34504 at commit [`96f1f47`](https://github.com/apache/spark/commit/96f1f47d69753e2889b692b6934336d5cf5e4a11).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962838914


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144973/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962478190


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49426/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967770239


   **[Test build #145181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145181/testReport)** for PR 34504 at commit [`a3a7648`](https://github.com/apache/spark/commit/a3a764888454c8f0778734ead1309c717e5e6407).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r744969365



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,29 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case EqualTo(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1

Review comment:
       `GreaterThan` and `GreaterThanOrEqual` would also work with the literal on the left




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-973825582


   This should be faster than SPARK-37099, since quick-selection is used internally instead of sorting. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977620798


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145558/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977484067


   @wangyum  @cloud-fan  @tanelk 
   
   I had just updated the `RankLimitExec` in https://github.com/apache/spark/pull/34367/files to apply `Utils.takeOrdered` instead of sorting when partitionSpec is empty and row_number is used (always true for now)
   
   and I re-run the benchmark in my laptop, and the result is:
   
   ```
   [info] 11:45:28.060 WARN org.apache.spark.sql.execution.window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
   [info]   Stopped after 5 iterations, 9604 ms
   [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_301-b09 on Linux 5.11.0-40-generic
   [info] Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
   [info] Benchmark filter push down through window:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ---------------------------------------------------------------------------------------------------------------------------------------------------------
   [info] Filter push down through window (Disabled)                                         11283          11670         240          1.9         538.0       1.0X
   [info] Filter push down through window (Enabled)                                           1871           1903          19         11.2          89.2       6.0X
   [info] [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation           1853           1921          58         11.3          88.4       6.1X
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977610874


   **[Test build #145558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145558/testReport)** for PR 34504 at commit [`a9c4ebd`](https://github.com/apache/spark/commit/a9c4ebda972b024901bf53fa1f3c90c1a034e037).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962721742


   **[Test build #144973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144973/testReport)** for PR 34504 at commit [`d2e1027`](https://github.com/apache/spark/commit/d2e10279e77f48de681f78d431a61f41d1afc6d2).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962721742


   **[Test build #144973 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144973/testReport)** for PR 34504 at commit [`d2e1027`](https://github.com/apache/spark/commit/d2e10279e77f48de681f78d431a61f41d1afc6d2).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967803107


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145181/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977525153


   @zhengruifeng can you highlight the differences between your PR and this one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755210674



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>

Review comment:
       I think this optimization should also apply if there are more than one window expressions. The algorithm should simply be
   ```
   val limits = w.windowExpressions.map {
     case row_number_func => calcuate the limit
     case _ => none
   }
   val limit = limits.min
   add limit + sort
   ``` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977522744


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50030/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962765805


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49444/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962486232


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49426/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962472670


   **[Test build #144955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144955/testReport)** for PR 34504 at commit [`96f1f47`](https://github.com/apache/spark/commit/96f1f47d69753e2889b692b6934336d5cf5e4a11).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962508570


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144955/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755213627



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child))))

Review comment:
       One worry is what if other optimizer rules put/move operators between Limit and Sort? Then we can't use the `TakeOrderedAndProjectExec` and introduce a big overhead by this global sort.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977474971


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50030/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755656586



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>

Review comment:
       OK




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967803107


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145181/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967774937


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49650/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755895427



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>

Review comment:
       in the case, the inferred limit value should be 11 other than 1.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977648570


   > @zhengruifeng can you highlight the differences between your PR and this one?
   
   IMHO, there are two main differences:
   
   1, a new node `RankLimit` is introduced, and it supports both the empty partitionSpec cases and non-empty partitionSpec cases. It could support `rank` and `dense_rank` as the rank function in the future;
   
   2, Normally,`TakeOrderedAndProjectExec` performs the  top-K filtering in both mappers and the reducers, while `RankLimitExec` only filters rows in mappers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977457010


   **[Test build #145558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145558/testReport)** for PR 34504 at commit [`a9c4ebd`](https://github.com/apache/spark/commit/a9c4ebda972b024901bf53fa1f3c90c1a034e037).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962740227


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49444/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962837719


   **[Test build #144973 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144973/testReport)** for PR 34504 at commit [`d2e1027`](https://github.com/apache/spark/commit/d2e10279e77f48de681f78d431a61f41d1afc6d2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-983419947


   > > @zhengruifeng can you highlight the differences between your PR and this one?
   > 
   > IMHO, there are two main differences:
   > 
   > 1, a new node `RankLimit` is introduced, and it supports both the empty partitionSpec cases and non-empty partitionSpec cases. It could support `rank` and `dense_rank` as the rank function in the future;
   > 
   > 2, Normally,`TakeOrderedAndProjectExec` performs the top-K filtering in both mappers and the reducers, while `RankLimitExec` only filters rows in mappers.
   
   update on https://github.com/apache/spark/pull/34367/commits/877558e439663d1028028e9a332a5e4e6a18ad6c
   
   1, `RankLimit` now supports row_number/rank/dense_rank, empty and non-empty partitionSpec;
   2, two `RankLimitExec` nodes are inserted now, one on the map side and one on the reduce side; if there is no shuffle between the two `RankLimitExec` nodes, the filtering in the second node will be skiped;


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-967784061


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49650/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755214307



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
       }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child))))

Review comment:
       BTW, is `orderSpec` always specified?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977620798


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145558/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-977533569


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50030/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962486478


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49426/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-962838914


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144973/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #34504: [SPARK-37226][SQL] Filter push down through window

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #34504:
URL: https://github.com/apache/spark/pull/34504#issuecomment-963546664


   Linking an approach, that also handles non-empty partition spec: #34367


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org