You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/22 10:51:15 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

zhengruifeng opened a new pull request #34367:
URL: https://github.com/apache/spark/pull/34367


   ### What changes were proposed in this pull request?
   add a new node `RankLimit` to filter out uncessary data in the map side.
   
   
   ### Why are the changes needed?
   1, reduce the shuffle amount;
   2, solve skewed-window problem, a practice case is optimized from 2.5h to 26min
   
   
   ### Does this PR introduce _any_ user-facing change?
   a new config is added
   
   
   ### How was this patch tested?
   added testsuits


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950039381


   cc @opensky142857


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950481576


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49041/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979054207


   **[Test build #145625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145625/testReport)** for PR 34367 at commit [`16242b0`](https://github.com/apache/spark/commit/16242b04a97414b0a936cdbfc5ed54cd93706f67).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983328709


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50263/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977958055






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng edited a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978884601


   @wangyum 
   
   this PR was updated to support `rank` and `dense_rank`
   
   ```
   scala> spark.conf.set("spark.sql.rankLimit.enabled", "true")
   
   scala> spark.sql("""SELECT a, b, rank() OVER (PARTITION BY a ORDER BY b) as rk FROM VALUES ('A1', 1), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);""").where("rk = 1").queryExecution.optimizedPlan
   res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
   Filter (rk#0 = 1)
   +- Window [rank(b#4) windowspecdefinition(a#3, b#4 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#0], [a#3], [b#4 ASC NULLS FIRST]
      +- RankLimit [a#3], [b#4 ASC NULLS FIRST], rank(b#4), 1
         +- LocalRelation [a#3, b#4]
   
   scala> spark.sql("""SELECT a, b, rank() OVER (PARTITION BY a ORDER BY b) as rk FROM VALUES ('A1', 1), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);""").where("rk = 1").show
   +---+---+---+                                                                   
   |  a|  b| rk|
   +---+---+---+
   | A1|  1|  1|
   | A1|  1|  1|
   | A1|  1|  1|
   | A2|  3|  1|
   +---+---+---+
   
   ```
   
   ![image](https://user-images.githubusercontent.com/7322292/143392842-c046c52d-a31d-4af9-aed9-ef16714ebb45.png)
   
   ```
   == Physical Plan ==
   AdaptiveSparkPlan (17)
   +- == Final Plan ==
      * Project (10)
      +- * Filter (9)
         +- Window (8)
            +- * Sort (7)
               +- AQEShuffleRead (6)
                  +- ShuffleQueryStage (5)
                     +- Exchange (4)
                        +- RankLimit (3)
                           +- * Sort (2)
                              +- * LocalTableScan (1)
   +- == Initial Plan ==
      Project (16)
      +- Filter (15)
         +- Window (14)
            +- Sort (13)
               +- Exchange (12)
                  +- RankLimit (11)
                     +- Sort (2)
                        +- LocalTableScan (1)
   
   
   (1) LocalTableScan [codegen id : 1]
   Output [2]: [a#17, b#18]
   Arguments: [a#17, b#18]
   
   (2) Sort [codegen id : 1]
   Input [2]: [a#17, b#18]
   Arguments: [a#17 ASC NULLS FIRST, b#18 ASC NULLS FIRST], false, 0
   
   (3) RankLimit
   Input [2]: [a#17, b#18]
   Arguments: [a#17], [b#18 ASC NULLS FIRST], rank(b#18), 1
   
   (4) Exchange
   Input [2]: [a#17, b#18]
   Arguments: hashpartitioning(a#17, 200), ENSURE_REQUIREMENTS, [id=#37]
   
   (5) ShuffleQueryStage
   Output [2]: [a#17, b#18]
   Arguments: 0
   
   (6) AQEShuffleRead
   Input [2]: [a#17, b#18]
   Arguments: coalesced
   
   (7) Sort [codegen id : 2]
   Input [2]: [a#17, b#18]
   Arguments: [a#17 ASC NULLS FIRST, b#18 ASC NULLS FIRST], false, 0
   
   (8) Window
   Input [2]: [a#17, b#18]
   Arguments: [rank(b#18) windowspecdefinition(a#17, b#18 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#14], [a#17], [b#18 ASC NULLS FIRST]
   
   (9) Filter [codegen id : 3]
   Input [3]: [a#17, b#18, rk#14]
   Condition : (rk#14 = 1)
   
   (10) Project [codegen id : 3]
   Output [3]: [a#17, cast(b#18 as string) AS b#32, cast(rk#14 as string) AS rk#33]
   Input [3]: [a#17, b#18, rk#14]
   
   (11) RankLimit
   Input [2]: [a#17, b#18]
   Arguments: [a#17], [b#18 ASC NULLS FIRST], rank(b#18), 1
   
   (12) Exchange
   Input [2]: [a#17, b#18]
   Arguments: hashpartitioning(a#17, 200), ENSURE_REQUIREMENTS, [id=#23]
   
   (13) Sort
   Input [2]: [a#17, b#18]
   Arguments: [a#17 ASC NULLS FIRST, b#18 ASC NULLS FIRST], false, 0
   
   (14) Window
   Input [2]: [a#17, b#18]
   Arguments: [rank(b#18) windowspecdefinition(a#17, b#18 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#14], [a#17], [b#18 ASC NULLS FIRST]
   
   (15) Filter
   Input [3]: [a#17, b#18, rk#14]
   Condition : (rk#14 = 1)
   
   (16) Project
   Output [3]: [a#17, cast(b#18 as string) AS b#32, cast(rk#14 as string) AS rk#33]
   Input [3]: [a#17, b#18, rk#14]
   
   (17) AdaptiveSparkPlan
   Output [3]: [a#17, b#32, rk#33]
   Arguments: isFinalPlan=true
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977809636


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979195663


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979427976


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145632/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989823509


   **[Test build #146030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146030/testReport)** for PR 34367 at commit [`6d2efab`](https://github.com/apache/spark/commit/6d2efab0549c8283f582bf8fe019b2c223f0681d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949514645


   **[Test build #144536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144536/testReport)** for PR 34367 at commit [`134067f`](https://github.com/apache/spark/commit/134067f5e34ef4f6d177a93f38ed7a65592ad026).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950758625


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144576/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950529376


   **[Test build #144576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144576/testReport)** for PR 34367 at commit [`3c51865`](https://github.com/apache/spark/commit/3c518652d74e50bcb85024dd5e5e75dc63121ce7).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949600168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49007/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950443311


   **[Test build #144570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144570/testReport)** for PR 34367 at commit [`134067f`](https://github.com/apache/spark/commit/134067f5e34ef4f6d177a93f38ed7a65592ad026).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r767467761



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen._
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.execution.metric.SQLMetrics
+import org.apache.spark.util.collection.Utils
+
+
+sealed trait RankLimitMode
+
+case object Partial extends RankLimitMode
+
+case object Final extends RankLimitMode
+
+
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ * @param rankFunction The function to compute row rank, should be RowNumber/Rank/DenseRank.
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    rankFunction: Expression,
+    limit: Int,
+    mode: RankLimitMode,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  private val shouldPass = child match {
+    case r: RankLimitExec =>
+      partitionSpec.size == r.partitionSpec.size &&
+        partitionSpec.zip(r.partitionSpec).forall(p => p._1.semanticEquals(p._2)) &&
+        orderSpec.size == r.orderSpec.size &&
+        orderSpec.zip(r.orderSpec).forall(o => o._1.semanticEquals(o._2)) &&
+        rankFunction.semanticEquals(r.rankFunction) &&
+        mode == Final && r.mode == Partial && limit == r.limit
+    case _ => false
+  }
+
+  private val shouldApplyTakeOrdered: Boolean = {
+    rankFunction match {
+      case _: RowNumber => limit < conf.topKSortFallbackThreshold
+      case _: Rank => false
+      case _: DenseRank => false
+      case f => throw new IllegalArgumentException(s"Unsupported rank function: $f")
+    }
+  }
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (shouldApplyTakeOrdered) {
+      Seq(partitionSpec.map(SortOrder(_, Ascending)))
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec
+  }
+
+  override def requiredChildDistribution: Seq[Distribution] = mode match {
+    case Partial => super.requiredChildDistribution
+    case Final =>
+      // Should be the same as [[WindowExec#requiredChildDistribution]]
+      if (partitionSpec.isEmpty) {
+        AllTuples :: Nil
+      } else ClusteredDistribution(partitionSpec) :: Nil
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override lazy val metrics = Map(
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+
+  private lazy val ordering = GenerateOrdering.generate(orderSpec, output)
+
+  private lazy val limitFunction = rankFunction match {
+    case _: RowNumber if shouldApplyTakeOrdered =>
+      (stream: Iterator[InternalRow]) =>
+        Utils.takeOrdered(stream.map(_.copy()), limit)(ordering)

Review comment:
         ```
   val TOP_K_SORT_FALLBACK_THRESHOLD =
       buildConf("spark.sql.execution.topKSortFallbackThreshold")
         .doc("In SQL queries with a SORT followed by a LIMIT like " +
             "'SELECT x FROM t ORDER BY y LIMIT m', if m is under this threshold, do a top-K sort" +
             " in memory, otherwise do a global sort which spills to disk if necessary.")
         .version("2.4.0")
         .intConf
         .createWithDefault(ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH)
   ```
   
   I think we can still share the same threshold. The problem is that its default value is too high, which should be set to a relative smaller number in practice, such as 10000.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979933220


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50138/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981477450


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978884601


   @wangyum 
   
   this PR was updated to support `rank` and `dense_rank`
   
   ```
   scala> spark.conf.set("spark.sql.rankLimit.enabled", "true")
   
   scala> spark.sql("""SELECT a, b, rank() OVER (PARTITION BY a ORDER BY b) as rk FROM VALUES ('A1', 1), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);""").where("rk = 1").queryExecution.optimizedPlan
   res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
   Filter (rk#0 = 1)
   +- Window [rank(b#4) windowspecdefinition(a#3, b#4 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#0], [a#3], [b#4 ASC NULLS FIRST]
      +- RankLimit [a#3], [b#4 ASC NULLS FIRST], rank(b#4), 1
         +- LocalRelation [a#3, b#4]
   
   scala> spark.sql("""SELECT a, b, rank() OVER (PARTITION BY a ORDER BY b) as rk FROM VALUES ('A1', 1), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);""").where("rk = 1").show
   +---+---+---+                                                                   
   |  a|  b| rk|
   +---+---+---+
   | A1|  1|  1|
   | A1|  1|  1|
   | A1|  1|  1|
   | A2|  3|  1|
   +---+---+---+
   
   ```
   
   ![image](https://user-images.githubusercontent.com/7322292/143392842-c046c52d-a31d-4af9-aed9-ef16714ebb45.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977721203


   **[Test build #145574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145574/testReport)** for PR 34367 at commit [`562fbb1`](https://github.com/apache/spark/commit/562fbb1d8dabf701472a1f76b8b6de42cbe1a706).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977958055






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979089982


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50097/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979328359


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145629/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979328359


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145629/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977667783


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145562/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977548067


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50034/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950636191


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144573/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949719335


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144536/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949719335


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144536/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950459065


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49041/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950481576


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49041/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950587579


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989635974


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50506/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989670320


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50506/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989827534


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146030/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978101415


   **[Test build #145584 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145584/testReport)** for PR 34367 at commit [`7c1e6c6`](https://github.com/apache/spark/commit/7c1e6c600c3999bb9e1fe1637c18c26046f651c8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979285386


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145625/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979657376


   **[Test build #145644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145644/testReport)** for PR 34367 at commit [`b1006ee`](https://github.com/apache/spark/commit/b1006eecef9c2bc18422abaab27a932de6be5c47).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756544659



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       within each maptask:
   1, sort the partion by `(a,b)`;
   2, group the iterator by `a`:  `GroupedIterator.apply(stream, partitionSpec, output)`
   3, within each group, group again by `b`, compute the `rk`, and skip if `rk` > 1.
   
   the `limitFunc` for `rank` and `dense_rank` may be like:
   
   ```
         case _: Rank =>
           (partitionedStream: Iterator[InternalRow]) =>
             new Iterator[InternalRow]() {
               private val internalIter =
                 GroupedIterator.apply(partitionedStream, orderSpec.map(_.child), output)
   
               private var cnt = 0
               private var rank = 0
   
               private var currentGroup: Iterator[InternalRow] = null
   
               fetchNextGroup()
   
               private def fetchNextGroup(): Unit = {
                 if (internalIter.hasNext && rank < limit) {
                   currentGroup = internalIter.next()._2
                   rank = cnt + 1
                 } else {
                   currentGroup = null
                 }
               }
   
               override def hasNext: Boolean = currentGroup != null && currentGroup.hasNext
   
               override def next(): InternalRow = {
                 val row = currentGroup.next()
                 if (currentGroup.isEmpty) {
                   fetchNextGroup()
                 }
                 cnt += 1
                 row
               }
             }
   
         case _: DenseRank =>
           (partitionedStream: Iterator[InternalRow]) =>
             GroupedIterator
               .apply(partitionedStream, orderSpec.map(_.child), output)
               .take(limit)
               .flatMap(_._2)
   ```
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979243636


   **[Test build #145632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145632/testReport)** for PR 34367 at commit [`2a47f39`](https://github.com/apache/spark/commit/2a47f39063454cc7274c0dfe8e5a34d086f6cb91).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979278292


   **[Test build #145625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145625/testReport)** for PR 34367 at commit [`16242b0`](https://github.com/apache/spark/commit/16242b04a97414b0a936cdbfc5ed54cd93706f67).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981573488


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50180/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981741230


   **[Test build #145710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145710/testReport)** for PR 34367 at commit [`354b445`](https://github.com/apache/spark/commit/354b445a7fe645c95bddca0030ad3b56135a0106).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950527904


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng edited a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949516811


   a simple example:
   ```
   
   import org.apache.spark.sql.expressions.Window
   
   val df1 = spark.range(0, 100000000, 1, 9).select(when('id < 90000000, 123).otherwise('id).as("key1"), 'id as "value1").withColumn("hash1", abs(hash(col("key1"))).mod(1000))
   
   df1.withColumn("rank", row_number().over(Window.partitionBy("hash1").orderBy("value1"))).where(col("rank") <= 1).write.mode("overwrite").parquet("/tmp/tmp1")
   
   
   spark.conf.set("spark.sql.rankLimit.enabled", "true")
   
   df1.withColumn("rank", row_number().over(Window.partitionBy("hash1").orderBy("value1"))).where(col("rank") <= 1).write.mode("overwrite").parquet("/tmp/tmp2")
   ```
   
   ![image](https://user-images.githubusercontent.com/7322292/138442467-05114fc2-46a3-48f3-aafe-3ce929d9cb24.png)
   
   existing plan took 33 sec, while the new plan with `RankLimit` took only 8 sec.
   
   ![image](https://user-images.githubusercontent.com/7322292/138442603-ca117469-9de4-4220-8b01-0b816fba7fa6.png)
   
   and the shuffle write was reduced from 544.9 MiB to 26.7 KiB
   
   
   ```
   == Physical Plan ==
   Execute InsertIntoHadoopFsRelationCommand (18)
   +- AdaptiveSparkPlan (17)
      +- == Final Plan ==
         * Filter (11)
         +- Window (10)
            +- * Sort (9)
               +- AQEShuffleRead (8)
                  +- ShuffleQueryStage (7)
                     +- Exchange (6)
                        +- RankLimit (5)
                           +- * Sort (4)
                              +- * Project (3)
                                 +- * Project (2)
                                    +- * Range (1)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950580099


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950758625


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144576/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950500260


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49044/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977667783


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145562/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756486047



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       How can we support `rank` and `dense_rank`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977530622


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50034/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978982024


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977940598


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50057/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978893149


   **[Test build #145621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145621/testReport)** for PR 34367 at commit [`873083f`](https://github.com/apache/spark/commit/873083fe8e55c32ba7310e621430950c0a8328a3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756752125



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       The correctness exists in: 
   for all three rank functions (row_number, rank, dense_rank), `global rank` >= `partial rank`, so we can safely remove rows with `partial rank > limit`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979427252


   **[Test build #145632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145632/testReport)** for PR 34367 at commit [`2a47f39`](https://github.com/apache/spark/commit/2a47f39063454cc7274c0dfe8e5a34d086f6cb91).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979098755


   **[Test build #145629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145629/testReport)** for PR 34367 at commit [`d17fcf8`](https://github.com/apache/spark/commit/d17fcf88585f7b58f0709cbbf4952aa050fa9017).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979137002


   **[Test build #145621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145621/testReport)** for PR 34367 at commit [`873083f`](https://github.com/apache/spark/commit/873083fe8e55c32ba7310e621430950c0a8328a3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981743370


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145710/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979143823


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979195663


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982677202


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50242/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983449556


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145790/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982900233


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145770/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982711759


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50242/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
bersprockets commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r766913169



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen._
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.execution.metric.SQLMetrics
+import org.apache.spark.util.collection.Utils
+
+
+sealed trait RankLimitMode
+
+case object Partial extends RankLimitMode
+
+case object Final extends RankLimitMode
+
+
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ * @param rankFunction The function to compute row rank, should be RowNumber/Rank/DenseRank.
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    rankFunction: Expression,
+    limit: Int,
+    mode: RankLimitMode,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  private val shouldPass = child match {
+    case r: RankLimitExec =>
+      partitionSpec.size == r.partitionSpec.size &&
+        partitionSpec.zip(r.partitionSpec).forall(p => p._1.semanticEquals(p._2)) &&
+        orderSpec.size == r.orderSpec.size &&
+        orderSpec.zip(r.orderSpec).forall(o => o._1.semanticEquals(o._2)) &&
+        rankFunction.semanticEquals(r.rankFunction) &&
+        mode == Final && r.mode == Partial && limit == r.limit
+    case _ => false
+  }
+
+  private val shouldApplyTakeOrdered: Boolean = {
+    rankFunction match {
+      case _: RowNumber => limit < conf.topKSortFallbackThreshold
+      case _: Rank => false
+      case _: DenseRank => false
+      case f => throw new IllegalArgumentException(s"Unsupported rank function: $f")
+    }
+  }
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (shouldApplyTakeOrdered) {
+      Seq(partitionSpec.map(SortOrder(_, Ascending)))
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec
+  }
+
+  override def requiredChildDistribution: Seq[Distribution] = mode match {
+    case Partial => super.requiredChildDistribution
+    case Final =>
+      // Should be the same as [[WindowExec#requiredChildDistribution]]
+      if (partitionSpec.isEmpty) {
+        AllTuples :: Nil
+      } else ClusteredDistribution(partitionSpec) :: Nil
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override lazy val metrics = Map(
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+
+  private lazy val ordering = GenerateOrdering.generate(orderSpec, output)
+
+  private lazy val limitFunction = rankFunction match {
+    case _: RowNumber if shouldApplyTakeOrdered =>
+      (stream: Iterator[InternalRow]) =>
+        Utils.takeOrdered(stream.map(_.copy()), limit)(ordering)

Review comment:
       By default, `spark.sql.execution.topKSortFallbackThreshold` is set to a pretty big number (Integer.MAX_VALUE - 15). Therefore, by default anyway, this line of code will attempt to do a sort with guava regardless of the size of the rank limit.
   
   For example, using your example [here](https://github.com/apache/spark/pull/34367#issuecomment-949516811), but changing the where clause to `col("rank") <= 1000000`, I get:
   
   ```
   java.lang.OutOfMemoryError: GC overhead limit exceeded
   ```
   
   Whereas I don't get that with `spark.sql.rankLimit.enabled=false`.
   
   If I set `spark.sql.execution.topKSortFallbackThreshold=10000`, it succeeds with `spark.sql.rankLimit.enabled=true`.
   
   Maybe it needs its own threshold? Or something to make users more aware (since it is not always super clear from the stack traces what was causing the OOMs).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950482209


   **[Test build #144573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144573/testReport)** for PR 34367 at commit [`3c51865`](https://github.com/apache/spark/commit/3c518652d74e50bcb85024dd5e5e75dc63121ce7).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950587579


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950529376


   **[Test build #144576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144576/testReport)** for PR 34367 at commit [`3c51865`](https://github.com/apache/spark/commit/3c518652d74e50bcb85024dd5e5e75dc63121ce7).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979146991






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979316433


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50103/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979933220


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50138/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979933187


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50138/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756528430



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       For example:
   ```
   0: jdbc:hive2://10.211.174.27:10000/access_vi> SELECT a, b, rank() OVER (PARTITION BY a ORDER BY b) as rk FROM VALUES ('A1', 1), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
   +-----+----+-----+--+
   |  a  | b  | rk  |
   +-----+----+-----+--+
   | A1  | 1  | 1   |
   | A1  | 1  | 1   |
   | A1  | 1  | 1   |
   | A2  | 3  | 1   |
   +-----+----+-----+--+
   4 rows selected (8.259 seconds)
   ```
   If we add a filter condition: `rk = 1`. How can we ensure the result is correct if we use `RankLimitExec`?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979283960


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50103/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977856204


   **[Test build #145584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145584/testReport)** for PR 34367 at commit [`7c1e6c6`](https://github.com/apache/spark/commit/7c1e6c600c3999bb9e1fe1637c18c26046f651c8).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979657176


   test benchmark results on my laptop:
   
   ```
   
   [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_301-b09 on Linux 5.11.0-40-generic
   [info] Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
   [info] Benchmark Top-K:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] ROW_NUMBER WITHOUT PARTITION                       2368           2584         163          8.9         112.9       1.0X
   [info] ROW_NUMBER WITHOUT PARTITION (RANKLIMIT)           1019           1047          21         20.6          48.6       2.3X
   [info] RANK WITHOUT PARTITION                             2934           3087         142          7.1         139.9       0.8X
   [info] RANK WITHOUT PARTITION (RANKLIMIT)                   26             36          13        800.7           1.2      90.4X
   [info] DENSE_RANK WITHOUT PARTITION                       2762           2873         106          7.6         131.7       0.9X
   [info] DENSE_RANK WITHOUT PARTITION (RANKLIMIT)             22             28           4        934.7           1.1     105.5X
   [info] ROW_NUMBER WITH PARTITION                         19493          19758         319          1.1         929.5       0.1X
   [info] ROW_NUMBER WITH PARTITION (RANKLIMIT)              6373           6576         228          3.3         303.9       0.4X
   [info] RANK WITH PARTITION                               20696          21254         274          1.0         986.8       0.1X
   [info] RANK WITH PARTITION (RANKLIMIT)                    6172           6472         213          3.4         294.3       0.4X
   [info] DENSE_RANK WITH PARTITION                         19070          19346         284          1.1         909.3       0.1X
   [info] DENSE_RANK WITH PARTITION (RANKLIMIT)              6249           6439         131          3.4         298.0       0.4X
   
   
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977664510


   **[Test build #145562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145562/testReport)** for PR 34367 at commit [`728d485`](https://github.com/apache/spark/commit/728d485058d0e0147e45e1588bd7b5cf0b838c95).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979146992






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979890157


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50138/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950476042


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49041/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981299164


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50162/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989689730


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50506/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989603356


   **[Test build #146030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146030/testReport)** for PR 34367 at commit [`6d2efab`](https://github.com/apache/spark/commit/6d2efab0549c8283f582bf8fe019b2c223f0681d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
bersprockets commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r766913169



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen._
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution._
+import org.apache.spark.sql.execution.metric.SQLMetrics
+import org.apache.spark.util.collection.Utils
+
+
+sealed trait RankLimitMode
+
+case object Partial extends RankLimitMode
+
+case object Final extends RankLimitMode
+
+
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ * @param rankFunction The function to compute row rank, should be RowNumber/Rank/DenseRank.
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    rankFunction: Expression,
+    limit: Int,
+    mode: RankLimitMode,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  private val shouldPass = child match {
+    case r: RankLimitExec =>
+      partitionSpec.size == r.partitionSpec.size &&
+        partitionSpec.zip(r.partitionSpec).forall(p => p._1.semanticEquals(p._2)) &&
+        orderSpec.size == r.orderSpec.size &&
+        orderSpec.zip(r.orderSpec).forall(o => o._1.semanticEquals(o._2)) &&
+        rankFunction.semanticEquals(r.rankFunction) &&
+        mode == Final && r.mode == Partial && limit == r.limit
+    case _ => false
+  }
+
+  private val shouldApplyTakeOrdered: Boolean = {
+    rankFunction match {
+      case _: RowNumber => limit < conf.topKSortFallbackThreshold
+      case _: Rank => false
+      case _: DenseRank => false
+      case f => throw new IllegalArgumentException(s"Unsupported rank function: $f")
+    }
+  }
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (shouldApplyTakeOrdered) {
+      Seq(partitionSpec.map(SortOrder(_, Ascending)))
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec
+  }
+
+  override def requiredChildDistribution: Seq[Distribution] = mode match {
+    case Partial => super.requiredChildDistribution
+    case Final =>
+      // Should be the same as [[WindowExec#requiredChildDistribution]]
+      if (partitionSpec.isEmpty) {
+        AllTuples :: Nil
+      } else ClusteredDistribution(partitionSpec) :: Nil
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override lazy val metrics = Map(
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+
+  private lazy val ordering = GenerateOrdering.generate(orderSpec, output)
+
+  private lazy val limitFunction = rankFunction match {
+    case _: RowNumber if shouldApplyTakeOrdered =>
+      (stream: Iterator[InternalRow]) =>
+        Utils.takeOrdered(stream.map(_.copy()), limit)(ordering)

Review comment:
       By default, `spark.sql.execution.topKSortFallbackThreshold` is set to a pretty big number (Integer.MAX_VALUE - 15). Therefore, by default anyway, this line of code will attempt to do a sort with guava regardless of the size of the rank limit.
   
   For example, using your example [here](https://github.com/apache/spark/pull/34367#issuecomment-949516811), but changing the where clause to `col("rank") <= 1000000`, I get:
   
   ```
   java.lang.OutOfMemoryError: GC overhead limit exceeded
   ```
   
   Whereas I don't get that with `spark.sql.rankLimit.enabled=false`.
   
   If I set `spark.sql.execution.topKSortFallbackThreshold=10000`, it succeeds with `spark.sql.rankLimit.enabled=true`.
   
   Maybe it needs its own threshold? Or something to make users more aware (since it is not always super clear from the stack traces what was causing the OOMs).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981538947


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50180/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982890565


   **[Test build #145770 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145770/testReport)** for PR 34367 at commit [`101f6c5`](https://github.com/apache/spark/commit/101f6c5f74f5a5d875dc43095c8bc3c8f57c6f6b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983256901


   **[Test build #145790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145790/testReport)** for PR 34367 at commit [`877558e`](https://github.com/apache/spark/commit/877558e439663d1028028e9a332a5e4e6a18ad6c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950441852


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950527923


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49044/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949548326


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49007/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949717978


   **[Test build #144536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144536/testReport)** for PR 34367 at commit [`134067f`](https://github.com/apache/spark/commit/134067f5e34ef4f6d177a93f38ed7a65592ad026).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `case class RankLimit(`
     * `case class RankLimitExec(`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977497924


   **[Test build #145562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145562/testReport)** for PR 34367 at commit [`728d485`](https://github.com/apache/spark/commit/728d485058d0e0147e45e1588bd7b5cf0b838c95).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977809636


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977856204


   **[Test build #145584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145584/testReport)** for PR 34367 at commit [`7c1e6c6`](https://github.com/apache/spark/commit/7c1e6c600c3999bb9e1fe1637c18c26046f651c8).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978982024


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979427976


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145632/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981299850


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50162/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981356532


   **[Test build #145692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145692/testReport)** for PR 34367 at commit [`354b445`](https://github.com/apache/spark/commit/354b445a7fe645c95bddca0030ad3b56135a0106).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981263291


   **[Test build #145692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145692/testReport)** for PR 34367 at commit [`354b445`](https://github.com/apache/spark/commit/354b445a7fe645c95bddca0030ad3b56135a0106).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950511716


   **[Test build #144570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144570/testReport)** for PR 34367 at commit [`134067f`](https://github.com/apache/spark/commit/134067f5e34ef4f6d177a93f38ed7a65592ad026).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `case class RankLimit(`
     * `case class RankLimitExec(`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949514645


   **[Test build #144536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144536/testReport)** for PR 34367 at commit [`134067f`](https://github.com/apache/spark/commit/134067f5e34ef4f6d177a93f38ed7a65592ad026).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977497924


   **[Test build #145562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145562/testReport)** for PR 34367 at commit [`728d485`](https://github.com/apache/spark/commit/728d485058d0e0147e45e1588bd7b5cf0b838c95).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978918945


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979054207


   **[Test build #145625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145625/testReport)** for PR 34367 at commit [`16242b0`](https://github.com/apache/spark/commit/16242b04a97414b0a936cdbfc5ed54cd93706f67).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977721203


   **[Test build #145574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145574/testReport)** for PR 34367 at commit [`562fbb1`](https://github.com/apache/spark/commit/562fbb1d8dabf701472a1f76b8b6de42cbe1a706).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977809612


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-980046005


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145668/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983328709


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50263/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982633822


   **[Test build #145770 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145770/testReport)** for PR 34367 at commit [`101f6c5`](https://github.com/apache/spark/commit/101f6c5f74f5a5d875dc43095c8bc3c8f57c6f6b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983313959


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50263/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983449556


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145790/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979846280


   **[Test build #145668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145668/testReport)** for PR 34367 at commit [`d23c431`](https://github.com/apache/spark/commit/d23c4319bd46e30f3d7971039f9f5ceda1d76072).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981582558


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50180/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981496871


   **[Test build #145710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145710/testReport)** for PR 34367 at commit [`354b445`](https://github.com/apache/spark/commit/354b445a7fe645c95bddca0030ad3b56135a0106).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982633822


   **[Test build #145770 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145770/testReport)** for PR 34367 at commit [`101f6c5`](https://github.com/apache/spark/commit/101f6c5f74f5a5d875dc43095c8bc3c8f57c6f6b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979769184


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145644/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981496871


   **[Test build #145710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145710/testReport)** for PR 34367 at commit [`354b445`](https://github.com/apache/spark/commit/354b445a7fe645c95bddca0030ad3b56135a0106).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978952004


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50093/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979098755


   **[Test build #145629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145629/testReport)** for PR 34367 at commit [`d17fcf8`](https://github.com/apache/spark/commit/d17fcf88585f7b58f0709cbbf4952aa050fa9017).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979326802


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50103/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979657376


   **[Test build #145644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145644/testReport)** for PR 34367 at commit [`b1006ee`](https://github.com/apache/spark/commit/b1006eecef9c2bc18422abaab27a932de6be5c47).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756518072



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       1, for `rank` and `dense_rank`, no matter whether partitionSpec is empty,  `requiredChildOrdering = Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)`;
   2, then in the map side, we can compute the `rank` and `dense_rank` for each row, and filter out unnecessary rows.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-980038095


   **[Test build #145668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145668/testReport)** for PR 34367 at commit [`d23c431`](https://github.com/apache/spark/commit/d23c4319bd46e30f3d7971039f9f5ceda1d76072).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977557630


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50034/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979682381


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50116/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979657176


   test benchmark results on my laptop:
   
   ```
   
   [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_301-b09 on Linux 5.11.0-40-generic
   [info] Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
   [info] Benchmark Top-K:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------------------------------
   [info] ROW_NUMBER WITHOUT PARTITION                       2368           2584         163          8.9         112.9       1.0X
   [info] ROW_NUMBER WITHOUT PARTITION (RANKLIMIT)           1019           1047          21         20.6          48.6       2.3X
   [info] RANK WITHOUT PARTITION                             2934           3087         142          7.1         139.9       0.8X
   [info] RANK WITHOUT PARTITION (RANKLIMIT)                   26             36          13        800.7           1.2      90.4X
   [info] DENSE_RANK WITHOUT PARTITION                       2762           2873         106          7.6         131.7       0.9X
   [info] DENSE_RANK WITHOUT PARTITION (RANKLIMIT)             22             28           4        934.7           1.1     105.5X
   [info] ROW_NUMBER WITH PARTITION                         19493          19758         319          1.1         929.5       0.1X
   [info] ROW_NUMBER WITH PARTITION (RANKLIMIT)              6373           6576         228          3.3         303.9       0.4X
   [info] RANK WITH PARTITION                               20696          21254         274          1.0         986.8       0.1X
   [info] RANK WITH PARTITION (RANKLIMIT)                    6172           6472         213          3.4         294.3       0.4X
   [info] DENSE_RANK WITH PARTITION                         19070          19346         284          1.1         909.3       0.1X
   [info] DENSE_RANK WITH PARTITION (RANKLIMIT)              6249           6439         131          3.4         298.0       0.4X
   
   
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978893149


   **[Test build #145621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145621/testReport)** for PR 34367 at commit [`873083f`](https://github.com/apache/spark/commit/873083fe8e55c32ba7310e621430950c0a8328a3).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979142008


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50097/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979243636


   **[Test build #145632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145632/testReport)** for PR 34367 at commit [`2a47f39`](https://github.com/apache/spark/commit/2a47f39063454cc7274c0dfe8e5a34d086f6cb91).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989603356


   **[Test build #146030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/146030/testReport)** for PR 34367 at commit [`6d2efab`](https://github.com/apache/spark/commit/6d2efab0549c8283f582bf8fe019b2c223f0681d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950482209


   **[Test build #144573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144573/testReport)** for PR 34367 at commit [`3c51865`](https://github.com/apache/spark/commit/3c518652d74e50bcb85024dd5e5e75dc63121ce7).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950443311


   **[Test build #144570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144570/testReport)** for PR 34367 at commit [`134067f`](https://github.com/apache/spark/commit/134067f5e34ef4f6d177a93f38ed7a65592ad026).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950512011


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144570/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950517863


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49044/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-1002525886


   @cloud-fan @wangyum @dongjoon-hyun  Could you please help take a look when you have time? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981282705


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50162/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981357027


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145692/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982738859


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50242/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng edited a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949516811


   a simple skewed window example:
   ```
   
   import org.apache.spark.sql.expressions.Window
   
   val df1 = spark.range(0, 100000000, 1, 9).select(when('id < 90000000, 123).otherwise('id).as("key1"), 'id as "value1").withColumn("hash1", abs(hash(col("key1"))).mod(1000))
   
   df1.withColumn("rank", row_number().over(Window.partitionBy("hash1").orderBy("value1"))).where(col("rank") <= 1).write.mode("overwrite").parquet("/tmp/tmp1")
   
   
   spark.conf.set("spark.sql.rankLimit.enabled", "true")
   
   df1.withColumn("rank", row_number().over(Window.partitionBy("hash1").orderBy("value1"))).where(col("rank") <= 1).write.mode("overwrite").parquet("/tmp/tmp2")
   ```
   
   ![image](https://user-images.githubusercontent.com/7322292/144168284-6134b831-87d0-4d72-94e6-479c498984c6.png)
   
   existing plan took 60 sec, while the new plan with `RankLimit` took only 8 sec.
   
   
   ![image](https://user-images.githubusercontent.com/7322292/144168318-51548b64-5e33-4147-a2a8-86fc9d6a0071.png)
   
   
   and the shuffle write was reduced from 544.9 MiB to 26.7 KiB
   
   
   ```
   == Physical Plan ==
   Execute InsertIntoHadoopFsRelationCommand (20)
   +- AdaptiveSparkPlan (19)
      +- == Final Plan ==
         * Filter (12)
         +- Window (11)
            +- RankLimit (10)
               +- * Sort (9)
                  +- AQEShuffleRead (8)
                     +- ShuffleQueryStage (7)
                        +- Exchange (6)
                           +- RankLimit (5)
                              +- * Sort (4)
                                 +- * Project (3)
                                    +- * Project (2)
                                       +- * Range (1)
      +- == Initial Plan ==
         Filter (18)
         +- Window (17)
            +- RankLimit (16)
               +- Sort (15)
                  +- Exchange (14)
                     +- RankLimit (13)
                        +- Sort (4)
                           +- Project (3)
                              +- Project (2)
                                 +- Range (1)
   
   
   (1) Range [codegen id : 1]
   Output [1]: [id#0L]
   Arguments: Range (0, 100000000, step=1, splits=Some(9))
   
   (2) Project [codegen id : 1]
   Output [2]: [CASE WHEN (id#0L < 90000000) THEN 123 ELSE id#0L END AS key1#2L, id#0L AS value1#3L]
   Input [1]: [id#0L]
   
   (3) Project [codegen id : 1]
   Output [3]: [key1#2L, value1#3L, (abs(hash(key1#2L, 42), false) % 1000) AS hash1#6]
   Input [2]: [key1#2L, value1#3L]
   
   (4) Sort [codegen id : 1]
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6 ASC NULLS FIRST], false, 0
   
   (5) RankLimit
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6], [value1#3L ASC NULLS FIRST], row_number(), 1, Partial
   
   (6) Exchange
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: hashpartitioning(hash1#6, 200), ENSURE_REQUIREMENTS, [id=#119]
   
   (7) ShuffleQueryStage
   Output [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: 0
   
   (8) AQEShuffleRead
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: coalesced
   
   (9) Sort [codegen id : 2]
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6 ASC NULLS FIRST], false, 0
   
   (10) RankLimit
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6], [value1#3L ASC NULLS FIRST], row_number(), 1, Final
   
   (11) Window
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [row_number() windowspecdefinition(hash1#6, value1#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#21], [hash1#6], [value1#3L ASC NULLS FIRST]
   
   (12) Filter [codegen id : 3]
   Input [4]: [key1#2L, value1#3L, hash1#6, rank#21]
   Condition : (rank#21 <= 1)
   
   (13) RankLimit
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6], [value1#3L ASC NULLS FIRST], row_number(), 1, Partial
   
   (14) Exchange
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: hashpartitioning(hash1#6, 200), ENSURE_REQUIREMENTS, [id=#100]
   
   (15) Sort
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6 ASC NULLS FIRST], false, 0
   
   (16) RankLimit
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [hash1#6], [value1#3L ASC NULLS FIRST], row_number(), 1, Final
   
   (17) Window
   Input [3]: [key1#2L, value1#3L, hash1#6]
   Arguments: [row_number() windowspecdefinition(hash1#6, value1#3L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#21], [hash1#6], [value1#3L ASC NULLS FIRST]
   
   (18) Filter
   Input [4]: [key1#2L, value1#3L, hash1#6, rank#21]
   Condition : (rank#21 <= 1)
   
   (19) AdaptiveSparkPlan
   Output [4]: [key1#2L, value1#3L, hash1#6, rank#21]
   Arguments: isFinalPlan=true
   
   (20) Execute InsertIntoHadoopFsRelationCommand
   Input [4]: [key1#2L, value1#3L, hash1#6, rank#21]
   Arguments: file:/tmp/tmp2, false, Parquet, [path=/tmp/tmp2], Overwrite, [key1, value1, hash1, rank]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981299850


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50162/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979285386


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145625/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979326802


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50103/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756544659



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       within each maptask:
   1, sort the partion by `(a,b)`;
   2, group the iterator by `a`:  `GroupedIterator.apply(stream, partitionSpec, output)`
   3, within each group, group again by `b`, compute the `rk`, and skip if `rk` > 1.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979699595


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50116/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979699595


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50116/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977901467


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50057/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977954219


   **[Test build #145574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145574/testReport)** for PR 34367 at commit [`562fbb1`](https://github.com/apache/spark/commit/562fbb1d8dabf701472a1f76b8b6de42cbe1a706).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977557630


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50034/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979767907


   **[Test build #145644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145644/testReport)** for PR 34367 at commit [`b1006ee`](https://github.com/apache/spark/commit/b1006eecef9c2bc18422abaab27a932de6be5c47).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979769184


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145644/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981357027


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145692/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981743370


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145710/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979846280


   **[Test build #145668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145668/testReport)** for PR 34367 at commit [`d23c431`](https://github.com/apache/spark/commit/d23c4319bd46e30f3d7971039f9f5ceda1d76072).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981263291


   **[Test build #145692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145692/testReport)** for PR 34367 at commit [`354b445`](https://github.com/apache/spark/commit/354b445a7fe645c95bddca0030ad3b56135a0106).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983419906


   **[Test build #145790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145790/testReport)** for PR 34367 at commit [`877558e`](https://github.com/apache/spark/commit/877558e439663d1028028e9a332a5e4e6a18ad6c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989689730


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50506/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950527923


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49044/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950756038


   **[Test build #144576 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144576/testReport)** for PR 34367 at commit [`3c51865`](https://github.com/apache/spark/commit/3c518652d74e50bcb85024dd5e5e75dc63121ce7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-962989084


   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-989827534


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/146030/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-981582558


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50180/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-980046005


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145668/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #34367:
URL: https://github.com/apache/spark/pull/34367#discussion_r756752125



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/window/RankLimitExec.scala
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.window
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering
+import org.apache.spark.sql.catalyst.plans.physical._
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
+import org.apache.spark.util.collection.Utils
+
+/**
+ * This operator is designed to filter out unnecessary rows before WindowExec,
+ * for top-k computation.
+ * @param partitionSpec Should be the same as [[WindowExec#partitionSpec]]
+ * @param orderSpec Should be the same as [[WindowExec#orderSpec]]
+ */
+case class RankLimitExec(
+    partitionSpec: Seq[Expression],
+    orderSpec: Seq[SortOrder],
+    limit: Int,
+    child: SparkPlan) extends UnaryExecNode {
+  assert(orderSpec.nonEmpty && limit > 0)
+
+  // apply Utils.takeOrdered when partitionSpec is empty and row_number is used.
+  private def applyTakeOrdered: Boolean = partitionSpec.isEmpty
+
+  override def output: Seq[Attribute] = child.output
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] = {
+    if (applyTakeOrdered) {
+      super.requiredChildOrdering
+    } else {
+      // Should be the same as [[WindowExec#requiredChildOrdering]]
+      Seq(partitionSpec.map(SortOrder(_, Ascending)) ++ orderSpec)
+    }
+  }
+
+  override def outputOrdering: Seq[SortOrder] = {
+    if (applyTakeOrdered) {
+      orderSpec
+    } else {
+      child.outputOrdering
+    }
+  }
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  // TODO: support rank and dense_rank

Review comment:
       The correctness exists in: 
   for all three rank functions (row_number, rank, dense_rank), `global rank` <= `partial rank`, so we can safely remove rows with `partial rank > limit`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979327079


   **[Test build #145629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145629/testReport)** for PR 34367 at commit [`d17fcf8`](https://github.com/apache/spark/commit/d17fcf88585f7b58f0709cbbf4952aa050fa9017).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-977763552


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978106053


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145584/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-978106053


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145584/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979189685


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50100/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-979695211


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50116/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950633872


   **[Test build #144573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144573/testReport)** for PR 34367 at commit [`3c51865`](https://github.com/apache/spark/commit/3c518652d74e50bcb85024dd5e5e75dc63121ce7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950636191


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144573/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949516811


   a simple example:
   ```
   
   import org.apache.spark.sql.expressions.Window
   
   val df1 = spark.range(0, 100000000, 1, 9).select(when('id < 90000000, 123).otherwise('id).as("key1"), 'id as "value1").withColumn("hash1", abs(hash(col("key1"))).mod(1000))
   
   df1.withColumn("rank", row_number().over(Window.partitionBy("hash1").orderBy("value1"))).where(col("rank") <= 1).write.mode("overwrite").parquet("/tmp/tmp1")
   
   
   spark.conf.set("spark.sql.replaceWindowTop.enabled", "true")
   
   df1.withColumn("rank", row_number().over(Window.partitionBy("hash1").orderBy("value1"))).where(col("rank") <= 1).write.mode("overwrite").parquet("/tmp/tmp2")
   ```
   
   ![image](https://user-images.githubusercontent.com/7322292/138442467-05114fc2-46a3-48f3-aafe-3ce929d9cb24.png)
   
   existing plan took 33 sec, while the new plan with `RankLimit` took only 8 sec.
   
   ![image](https://user-images.githubusercontent.com/7322292/138442603-ca117469-9de4-4220-8b01-0b816fba7fa6.png)
   
   and the shuffle write was reduced from 544.9 MiB to 26.7 KiB
   
   
   ```
   == Physical Plan ==
   Execute InsertIntoHadoopFsRelationCommand (18)
   +- AdaptiveSparkPlan (17)
      +- == Final Plan ==
         * Filter (11)
         +- Window (10)
            +- * Sort (9)
               +- AQEShuffleRead (8)
                  +- ShuffleQueryStage (7)
                     +- Exchange (6)
                        +- RankLimit (5)
                           +- * Sort (4)
                              +- * Project (3)
                                 +- * Project (2)
                                    +- * Range (1)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949578066


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49007/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950549267


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49047/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-949600168


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49007/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950512011


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144570/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983256901


   **[Test build #145790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145790/testReport)** for PR 34367 at commit [`877558e`](https://github.com/apache/spark/commit/877558e439663d1028028e9a332a5e4e6a18ad6c).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982900233


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145770/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-982738859


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50242/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-983295626


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50263/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #34367: [SPARK-37099][SQL] Introduce a rank-based filter to optimize top-k computation

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-1082701022


   retest this please


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org