Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/04 07:58:59 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

zhengruifeng opened a new pull request #31468:
URL: https://github.com/apache/spark/pull/31468


   ### What changes were proposed in this pull request?
   If the child RDD has only one partition, skip the shuffle.
   
   
   ### Why are the changes needed?
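   A shuffle into a single partition is unnecessary when the child RDD already has exactly one partition, so skip it when possible.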
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing test suites.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786996767


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135523/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784722250


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39971/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776744754


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135097/
   




[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786634694


   retest this please



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775741112


   **[Test build #135069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135069/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775791224


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39651/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784725646


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135391/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775791224


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39651/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776150085


   **[Test build #135075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135075/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784671414


   **[Test build #135391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135391/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573392812



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +52,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {

Review comment:
       nit: could you leave a comment explaining why we cannot strip the shuffle if `childRDD.getNumPartitions == 0`?





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776084473


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39657/
   



[GitHub] [spark] zhengruifeng edited a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416


   ```
   scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT COUNT(*) FROM t").head
   res1: org.apache.spark.sql.Row = [0]
   
   scala> val t = spark.table("t")
   t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
   
   scala> t.count
   res2: Long = 0
   
   scala> t.rdd.getNumPartitions
   res3: Int = 0
   
   scala> 
   
   scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
     at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
     at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
     at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
     at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
     at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
     at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
     ... 47 elided
   
   scala> spark.sql("SELECT COUNT(*) FROM v1").head
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
     ... 47 elided
   
   scala> val v1 = spark.table("v1")
   v1: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
   
   scala> v1.count
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
     at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
     ... 47 elided
   
   scala> v1.rdd.getNumPartitions
   res7: Int = 0
   
   scala> v1.repartition(3).count
   res8: Long = 0
   ```
   
   `childRDD` may have no partitions, in which case the output also has zero partitions.
   The existing implementation does not hit this issue because its `ShuffledRowRDD` always produces a `SinglePartition` output.
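
A toy model of why the partition count matters here, using plain Scala collections rather than Spark. It assumes (my reading, not stated in this thread) that a global `COUNT(*)` emits one row per partition, so zero partitions yield zero rows, which would match `Dataset.count` failing on `head` in the trace above. The names `ZeroPartitionSketch`, `Partitions`, and `globalCount` are hypothetical.

```scala
object ZeroPartitionSketch {
  // Toy stand-in for an RDD: a sequence of partitions, each holding some rows.
  type Partitions = Seq[Seq[String]]

  // Assumption: a global count runs once per partition and emits exactly
  // one row (the count) for each partition, even when that partition is empty.
  def globalCount(parts: Partitions): Seq[Long] =
    parts.map(partition => partition.size.toLong)

  def main(args: Array[String]): Unit = {
    val zeroPartitions: Partitions = Seq.empty        // like getNumPartitions == 0
    val oneEmptyPartition: Partitions = Seq(Seq.empty) // like SinglePartition with no rows

    // No partitions: no count row at all, so calling .head would throw.
    println(globalCount(zeroPartitions).headOption)

    // One empty partition: a single row holding the count 0.
    println(globalCount(oneEmptyPartition).headOption)
  }
}
```

With one empty partition the count comes back as `0`; with no partitions there is nothing to call `head` on.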
   
   
   
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786939347


   **[Test build #135523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135523/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776476510


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39674/
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573438154



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,28 +210,32 @@ case class TakeOrderedAndProjectExec(
   protected override def doExecute(): RDD[InternalRow] = {
     val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
     val childRDD = child.execute()
-    val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
-      val localTopK = childRDD.mapPartitions { iter =>
-        org.apache.spark.util.collection.Utils.takeOrdered(iter.map(_.copy()), limit)(ord)
-      }
-      new ShuffledRowRDD(
-        ShuffleExchangeExec.prepareShuffleDependency(
-          localTopK,
-          child.output,
-          SinglePartition,
-          serializer,
-          writeMetrics),
-        readMetrics)
+    if (childRDD.getNumPartitions == 0) {
+      sparkContext.parallelize(Array.empty[InternalRow], 1)

Review comment:
       ok, I will update it.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773374303


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39460/
   





[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773112265


   > We may have more operators that adding shuffle in the doExecute method instead of the planner
   
   @cloud-fan A shuffle is added directly in `doExecute` only by `TakeOrderedAndProjectExec`, `CollectLimitExec`, and `ShuffleExchangeExec`.



[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578914581



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       Oh I see. I am not sure if `CollectLimitExec` must be single partition, but this looks minor. I'm okay with `ParallelCollectionRDD`. Thanks.
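
For reference, the three cases discussed in this thread (zero, one, or many input partitions) can be sketched with plain Scala collections standing in for RDD partitions. This is a toy model, not the code in `limit.scala`: `SinglePartitionSketch`, `Partitions`, and `collectLimit` are hypothetical names, and flattening stands in for the shuffle to `SinglePartition`.

```scala
object SinglePartitionSketch {
  // Toy stand-in for an RDD: a vector of partitions, each holding some rows.
  type Partitions[T] = Vector[Vector[T]]

  // Always returns exactly one partition holding at most `limit` rows.
  def collectLimit[T](child: Partitions[T], limit: Int): Partitions[T] =
    child.size match {
      // No input partitions: substitute a single empty partition.
      case 0 => Vector(Vector.empty[T])
      // Already a single partition: apply the limit locally, no shuffle.
      case 1 => child.map(_.take(limit))
      // Several partitions: limit each one, gather everything into one
      // partition (the "shuffle" in this model), then limit again.
      // A real shuffle gives no ordering guarantee; flatten is deterministic.
      case _ =>
        val locallyLimited = child.map(_.take(limit))
        Vector(locallyLimited.flatten.take(limit))
    }

  def main(args: Array[String]): Unit = {
    println(collectLimit(Vector.empty[Vector[Int]], 10).size)
    println(collectLimit(Vector(Vector(1, 2, 3)), 2))
    println(collectLimit(Vector(Vector(1, 2, 3), Vector(4, 5)), 2))
  }
}
```

In every branch the result has exactly one partition, which is the property the zero-partition case otherwise loses.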





[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786996767


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135523/
   



[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773112265







[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776643186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39679/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786931035


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135510/
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578906516



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       @viirya @maropu We cannot set the number of partitions of an `EmptyRDD`; it always has zero partitions.





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775823647


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39652/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776440924


   **[Test build #135092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135092/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776574374


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135092/
   



[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573435246



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,28 +210,32 @@ case class TakeOrderedAndProjectExec(
   protected override def doExecute(): RDD[InternalRow] = {
     val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
     val childRDD = child.execute()
-    val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
-      val localTopK = childRDD.mapPartitions { iter =>
-        org.apache.spark.util.collection.Utils.takeOrdered(iter.map(_.copy()), limit)(ord)
-      }
-      new ShuffledRowRDD(
-        ShuffleExchangeExec.prepareShuffleDependency(
-          localTopK,
-          child.output,
-          SinglePartition,
-          serializer,
-          writeMetrics),
-        readMetrics)
+    if (childRDD.getNumPartitions == 0) {
+      sparkContext.parallelize(Array.empty[InternalRow], 1)

Review comment:
       nit: how about using `ParallelCollectionRDD` directly?





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776453331


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39674/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776151186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135075/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776451204


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39672/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222391







[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773148878


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39447/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775988407


   **[Test build #135075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135075/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786965907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40104/
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578940920



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       @viirya The `outputPartitioning` of `CollectLimitExec` is `SinglePartition`, so the output RDD should have a single partition.
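
The reasoning in this comment can be illustrated with a small, Spark-free sketch. It is only a model, not the code from the pull request: `Partitions`, `collectLimit`, and `CollectLimitSketch` are hypothetical names, an RDD is stood in for by a plain `Seq` of partitions, and the shuffle is approximated by concatenating partitions. The zero-partition branch mirrors the diff above; the one-partition and multi-partition branches are assumptions based on the PR title. Under those assumptions, every branch yields exactly one output partition, which is what a `SinglePartition` output partitioning requires.

```scala
object CollectLimitSketch {
  // Hypothetical stand-in for an RDD: the outer Seq holds partitions, each inner Seq holds rows.
  type Partitions[A] = Seq[Seq[A]]

  def collectLimit[A](child: Partitions[A], limit: Int): Partitions[A] =
    if (child.isEmpty) {
      // Zero input partitions: emit one empty partition instead of none.
      Seq(Seq.empty[A])
    } else if (child.size == 1) {
      // One input partition: no gathering step is needed, only the limit.
      Seq(child.head.take(limit))
    } else {
      // Several input partitions: limit locally, gather into one partition
      // (a stand-in for the shuffle), then limit again.
      val locallyLimited = child.map(_.take(limit))
      Seq(locallyLimited.flatten.take(limit))
    }

  def main(args: Array[String]): Unit = {
    val inputs: Seq[Partitions[Int]] = Seq(
      Seq.empty,
      Seq(Seq(1, 2, 3, 4)),
      Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    )
    inputs.foreach { in =>
      val out = collectLimit(in, limit = 2)
      println(s"${in.size} input partition(s) -> ${out.size} output partition(s), rows = ${out.head}")
    }
  }
}
```

Running `main` prints one line per input shape; in each case the output has a single partition holding at most `limit` rows.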





[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784661503







[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776744754


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135097/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775802401


   **[Test build #135070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135070/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776574374


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135092/
   



[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-787031497


   Thanks! Merged to master.



[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773688646


   The fix itself looks fine.



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775802401


   **[Test build #135070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135070/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).



[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578951088



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       > @viirya The `outputPartitioning` of `CollectLimitExec` is `SinglePartition`, so the output RDD should have a single partition.
   
   Oh, my previous comment was confusing. I meant that I am not sure whether its `outputPartitioning` must be `SinglePartition`.





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775861344


   **[Test build #135070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135070/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773457981


   **[Test build #134874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134874/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773306844


   **[Test build #134874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134874/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786662467


   **[Test build #135510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135510/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776626070


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39679/
   



[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776712671


   Looks fine. Could you update the PR title and description, too? This PR is not only for a single partition case now.



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775988407


   **[Test build #135075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135075/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776572321


   **[Test build #135097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135097/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786922815


   **[Test build #135510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135510/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
    * This patch **fails from timeout after a configured wait of `500m`**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416


   ```
   scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT COUNT(*) FROM t").head
   res1: org.apache.spark.sql.Row = [0]
   
   scala> val t = spark.table("t")
   t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
   
   scala> t.count
   res2: Long = 0
   
   scala> t.rdd.getNumPartitions
   res3: Int = 0
   
   scala> 
   
   scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
     at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
     at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
     at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
     at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
     at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
     at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
     ... 47 elided
   
   scala> spark.sql("SELECT COUNT(*) FROM v1").head
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
     ... 47 elided
   
   scala> val v1 = spark.table("v1")
   v1: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
   
   scala> v1.count
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
     at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
     ... 47 elided
   
   scala> v1.rdd.getNumPartitions
   res7: Int = 0
   
   scala> v1.repartition(3).count
   res8: Long = 0
   ```
   
   It seems that the `count` method does not work when a **View** has zero partitions.
   
   `childRDD` may have no partitions, in which case the number of output partitions is also zero.
   In the existing implementation, the `ShuffledRowRDD` always produces a `SinglePartition` output, so this issue is not triggered.
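   
   To make the failure mode concrete, here is a minimal sketch in plain Scala. It is a simplified model and not Spark's implementation: a sequence of partitions stands in for an RDD, and it assumes a global count emits one row per partition that actually runs. The names `ZeroPartitionCount` and `countPerPartition` are hypothetical.
   
   ```scala
   // Simplified model, not Spark code: an "RDD" is just a sequence of partitions.
   object ZeroPartitionCount {
     type Partition = Seq[Long]
   
     // Assumption: every partition that exists emits exactly one count row,
     // even when that partition holds no data.
     def countPerPartition(partitions: Seq[Partition]): Seq[Long] =
       partitions.map(_.size.toLong)
   
     def main(args: Array[String]): Unit = {
       val onePartition: Seq[Partition] = Seq(Seq.empty[Long])
       val zeroPartitions: Seq[Partition] = Seq.empty
   
       // One empty partition still yields a single row holding 0.
       println(countPerPartition(onePartition).head)         // prints 0
       // Zero partitions yield no rows at all, so calling .head would throw.
       println(countPerPartition(zeroPartitions).headOption) // prints None
     }
   }
   ```
   
   Under that assumption, the collected result is empty when there are zero partitions, which is consistent with the `NoSuchElementException` raised from `head` in the traces above.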
   
   
   
   



[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578951440



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       > or we can use `EmptyRDDWithPartitions` defined in `CoalesceExec`?
   > maybe we can make `EmptyRDD` support number of partition, so we can use it in both `CollectLimitExec` and `CoalesceExec`. But I am not sure whether we should do this.
   
   I think that sounds like over-engineering. At least for now, I don't think we need it.





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776434334


   **[Test build #135090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135090/testReport)** for PR 31468 at commit [`c9a5a81`](https://github.com/apache/spark/commit/c9a5a81d8a6e612bd2c8b7e82c00587edcd70cbc).



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776151186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135075/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776643186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39679/
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573432645



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,7 +205,7 @@ case class TakeOrderedAndProjectExec(
   protected override def doExecute(): RDD[InternalRow] = {
     val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
     val childRDD = child.execute()
-    val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
+    val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {

Review comment:
       There is no separate test suite for `CollectLimitExec`, so I added the cases `childRDD.getNumPartitions == 0` and `childRDD.getNumPartitions == 1` to `TakeOrderedAndProjectSuite`.





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773207218


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39447/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776102503


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39657/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786717940


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40091/
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578910718



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       `new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)` is an empty RDD with a single partition.
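   
   The branching that this change is aiming for can be sketched roughly as follows. This is a simplified model in plain Scala and not the code in `limit.scala`: a sequence of partitions stands in for an RDD, a plain merge stands in for the shuffle, and the names `CollectLimitModel`, `Partitions`, and `collectLimit` are hypothetical.
   
   ```scala
   // Simplified model only; these names are illustrative, not Spark APIs.
   object CollectLimitModel {
     type Partitions[T] = Seq[Seq[T]]
   
     def collectLimit[T](child: Partitions[T], limit: Int): Partitions[T] =
       child.size match {
         // No input partitions: still return one (empty) partition.
         case 0 => Seq(Seq.empty[T])
         // One input partition: apply the limit in place, no shuffle needed.
         case 1 => Seq(child.head.take(limit))
         // Several partitions: limit each one locally, merge everything into a
         // single partition (standing in for the shuffle), then limit again.
         case _ => Seq(child.flatMap(_.take(limit)).take(limit))
       }
   
     def main(args: Array[String]): Unit = {
       println(collectLimit(Seq.empty[Seq[Int]], 2))          // List(List())
       println(collectLimit(Seq(Seq(1, 2, 3)), 2))            // List(List(1, 2))
       println(collectLimit(Seq(Seq(1, 2, 3), Seq(4, 5)), 2)) // List(List(1, 2))
     }
   }
   ```
   
   In every branch the result has exactly one partition, which is the property the zero-partition case needs to preserve.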





[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573989674



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       EmptyRDD?





[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775867647







[GitHub] [spark] maropu closed pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

maropu closed pull request #31468:
URL: https://github.com/apache/spark/pull/31468


   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222393








[GitHub] [spark] HyukjinKwon commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775943050


   retest this please



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776468501


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39674/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773118983


   **[Test build #134860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134860/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775859628


   **[Test build #135069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135069/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776572862


   **[Test build #135092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135092/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773460675


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134874/
   



[GitHub] [spark] HyukjinKwon commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776552571


   retest this please



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786931035


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135510/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775741112


   **[Test build #135069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135069/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).



[GitHub] [spark] cloud-fan commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773128349


   cc @viirya @maropu 



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786965907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40104/
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573433676



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +52,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {

Review comment:
       Following [CoalesceExec.doExecute](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L715), I updated the fix to directly return an empty RDD with a single partition if `childRDD.getNumPartitions == 0`.
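
The branching described in this comment can be illustrated with a small Spark-free sketch. This is not the code from the PR: it models an RDD as a plain `Vector` of partitions, and the names `CollectLimitSketch`, `Result`, and `collectLimit` are hypothetical, introduced only for illustration. It assumes the intended behavior is: zero input partitions yield one empty partition, a single input partition is limited in place, and only multiple partitions go through the shuffle.

```scala
// Spark-free sketch: an "RDD" is modeled as a Vector of partitions.
// All names here (CollectLimitSketch, Result, collectLimit) are hypothetical.
object CollectLimitSketch {
  type Row = Int
  type Partitions = Vector[Vector[Row]]

  // `shuffled` records whether the stand-in shuffle step ran.
  final case class Result(partitions: Partitions, shuffled: Boolean)

  def collectLimit(child: Partitions, limit: Int): Result =
    if (child.isEmpty) {
      // Zero input partitions: return one empty partition, no shuffle.
      Result(Vector(Vector.empty[Row]), shuffled = false)
    } else if (child.size == 1) {
      // Single input partition: apply the limit in place, no shuffle.
      Result(child.map(_.take(limit)), shuffled = false)
    } else {
      // Several partitions: limit each one locally, merge them into one
      // partition (stand-in for the shuffle), then apply the limit again.
      val locallyLimited = child.map(_.take(limit))
      Result(Vector(locallyLimited.flatten.take(limit)), shuffled = true)
    }
}
```

In every branch the result has exactly one partition, which is the property that `SinglePartition` provides in the original shuffle-based code shown in the diff above.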





[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776440924


   **[Test build #135092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135092/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786994403


   **[Test build #135523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135523/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776572321


   **[Test build #135097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135097/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786662467


   **[Test build #135510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135510/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776020678


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39657/
   



[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773112511


   related to https://github.com/apache/spark/pull/31409



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776716166


   **[Test build #135097 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135097/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222393







[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773374303


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39460/
   



[GitHub] [spark] viirya commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773508769


   `org.apache.spark.sql.CachedTableSuite.SPARK-34269: cache lookup with ORDER BY / LIMIT clause` failed more than once, but it seems to pass without this change?




[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773199615


   **[Test build #134860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134860/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773306844


   **[Test build #134874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134874/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).



[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578910294



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       Doesn't the `childRDD` have zero partitions here?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776536491


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135090/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776605631


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39679/
   



[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578070009



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       kindly ping @zhenglaizhang 





[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-787586530


   Thank you so much!



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773460675


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134874/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776102503


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39657/
   



[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786939347


   **[Test build #135523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135523/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).



[GitHub] [spark] cloud-fan commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773292371


   retest this please



[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573392389



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,7 +205,7 @@ case class TakeOrderedAndProjectExec(
   protected override def doExecute(): RDD[InternalRow] = {
     val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
     val childRDD = child.execute()
-    val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
+    val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {

Review comment:
       We don't have any test for the case `childRDD.getNumPartitions == 0`?





[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776434334


   **[Test build #135090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135090/testReport)** for PR 31468 at commit [`c9a5a81`](https://github.com/apache/spark/commit/c9a5a81d8a6e612bd2c8b7e82c00587edcd70cbc).



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776476510


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39674/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776536491


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135090/
   



[GitHub] [spark] viirya commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773508769


   `org.apache.spark.sql.CachedTableSuite.SPARK-34269: cache lookup with ORDER BY / LIMIT clause` has failed more than once. But it seems to pass without this change?



[GitHub] [spark] zhengruifeng edited a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng edited a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416


   ```
   scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("SELECT COUNT(*) FROM t").head
   res1: org.apache.spark.sql.Row = [0]
   
   scala> val t = spark.table("t")
   t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
   
   scala> t.count
   res2: Long = 0
   
   scala> t.rdd.getNumPartitions
   res3: Int = 0
   
   scala> 
   
   scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
     at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
     at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
     at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
     at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
     at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
     at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
     ... 47 elided
   
   scala> spark.sql("SELECT COUNT(*) FROM v1").head
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
     ... 47 elided
   
   scala> val v1 = spark.table("v1")
   v1: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
   
   scala> v1.count
   java.util.NoSuchElementException: next on empty iterator
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
     at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
     at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
     at scala.collection.IterableLike.head(IterableLike.scala:109)
     at scala.collection.IterableLike.head$(IterableLike.scala:108)
     at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
     at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
     at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
     at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
     at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
     at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
     at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
     ... 47 elided
   
   scala> v1.rdd.getNumPartitions
   res7: Int = 0
   
   scala> v1.repartition(3).count
   res8: Long = 0
   ```
   
   Commit 6d199ff did not guarantee that the output has a single partition: `childRDD` may have no partitions, in which case the output also has zero partitions.
   In master, by contrast, the `ShuffledRowRDD` makes sure the output is `SinglePartition`.
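
The partition-count argument above can be illustrated with a toy model in plain Scala that does not use Spark at all. This is only a sketch under stated assumptions: `CollectLimitModel`, `Partitions`, `naiveCollectLimit` and `collectLimit` are hypothetical names, an RDD is modelled as a sequence of partitions, the shuffle to `SinglePartition` is modelled by flattening, and the "naive" variant is a guess at the behaviour the comment attributes to commit 6d199ff (shuffle only when there is more than one partition).

```scala
object CollectLimitModel {
  // Toy stand-in for an RDD: the outer Seq holds partitions, each inner Seq holds rows.
  type Partitions[A] = Seq[Seq[A]]

  // Models the shuffle to a single partition followed by the final take(limit).
  private def shuffleToSingle[A](child: Partitions[A], limit: Int): Partitions[A] =
    Seq(child.map(_.take(limit)).flatten.take(limit))

  // Assumed shape of the first attempt: skip the shuffle unless there are
  // several partitions. With zero input partitions the output has zero too.
  def naiveCollectLimit[A](child: Partitions[A], limit: Int): Partitions[A] =
    if (child.size > 1) shuffleToSingle(child, limit)
    else child.map(_.take(limit))

  // Handles the zero-partition case explicitly, so the output always has
  // exactly one partition.
  def collectLimit[A](child: Partitions[A], limit: Int): Partitions[A] =
    child.size match {
      case 0 => Seq(Seq.empty[A])           // emit one empty partition
      case 1 => Seq(child.head.take(limit)) // limit in place, no shuffle
      case _ => shuffleToSingle(child, limit)
    }
}
```

In this model `naiveCollectLimit` returns no partitions at all for an empty input, which corresponds to the `getNumPartitions` result of 0 shown in the REPL session, while `collectLimit` always returns exactly one partition.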
   
   
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784725646


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135391/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776447930


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39672/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784722250


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39971/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775843480


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39652/
   



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776446396


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39672/
   



[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776451204


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39672/
   



[GitHub] [spark] srowen commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786931812


   Jenkins retest this please



[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578943394



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
     SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
   override lazy val metrics = readMetrics ++ writeMetrics
   protected override def doExecute(): RDD[InternalRow] = {
-    val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
-    val shuffled = new ShuffledRowRDD(
-      ShuffleExchangeExec.prepareShuffleDependency(
-        locallyLimited,
-        child.output,
-        SinglePartition,
-        serializer,
-        writeMetrics),
-      readMetrics)
-    shuffled.mapPartitionsInternal(_.take(limit))
+    val childRDD = child.execute()
+    if (childRDD.getNumPartitions == 0) {
+      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)

Review comment:
       Or we could use the `EmptyRDDWithPartitions` defined in `CoalesceExec`?
   Maybe we could make `EmptyRDD` support a configurable number of partitions, so that it can be used in both `CollectLimitExec` and `CoalesceExec`. But I am not sure whether we should do this.
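
For illustration, an empty RDD with a configurable number of partitions could look roughly like the sketch below. It is not the code from this PR or from `CoalesceExec`: the names `EmptyRDDWithNumPartitions` and `EmptyPartition` are hypothetical, and it only relies on the `RDD` extension points `getPartitions` and `compute`. It needs Spark on the classpath.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition marker: carries only its index.
final case class EmptyPartition(index: Int) extends Partition

// Hypothetical RDD with no rows but a fixed number of (empty) partitions.
class EmptyRDDWithNumPartitions[T: ClassTag](
    @transient private val sc: SparkContext,
    numPartitions: Int) extends RDD[T](sc, Nil) {

  // One EmptyPartition per requested partition.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numPartitions)(i => EmptyPartition(i))

  // Every partition yields no rows.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    Iterator.empty
}
```

With `numPartitions = 1`, such an RDD would report a single partition and contain no rows, which is the property the zero-partition branch of `CollectLimitExec` needs.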





[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773118983


   **[Test build #134860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134860/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).



[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776535783


   **[Test build #135090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135090/testReport)** for PR 31468 at commit [`c9a5a81`](https://github.com/apache/spark/commit/c9a5a81d8a6e612bd2c8b7e82c00587edcd70cbc).
    * This patch **fails PySpark pip packaging tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786717940


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40091/
   



[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775732036


   retest this please

