Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/04 07:58:59 UTC
[GitHub] [spark] zhengruifeng opened a new pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
zhengruifeng opened a new pull request #31468:
URL: https://github.com/apache/spark/pull/31468
### What changes were proposed in this pull request?
If the child RDD has only one partition, skip the shuffle.
### Why are the changes needed?
The shuffle to `SinglePartition` is unnecessary when the input already has a single partition, so it can be skipped.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786996767
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135523/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784722250
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39971/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776744754
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135097/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222391
[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786634694
retest this please
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775741112
**[Test build #135069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135069/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775791224
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39651/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784725646
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135391/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775791224
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39651/
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776150085
**[Test build #135075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135075/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784671414
**[Test build #135391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135391/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573392812
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +52,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {
Review comment:
nit: could you leave some comments about why we cannot strip the shuffle if `childRDD.getNumPartitions == 0`?
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776084473
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39657/
[GitHub] [spark] zhengruifeng edited a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416
```
scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("SELECT COUNT(*) FROM t").head
res1: org.apache.spark.sql.Row = [0]
scala> val t = spark.table("t")
t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
scala> t.count
res2: Long = 0
scala> t.rdd.getNumPartitions
res3: Int = 0
scala>
scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
... 47 elided
scala> spark.sql("SELECT COUNT(*) FROM v1").head
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
... 47 elided
scala> val v1 = spark.table("v1")
v1: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
scala> v1.count
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
... 47 elided
scala> v1.rdd.getNumPartitions
res7: Int = 0
scala> v1.repartition(3).count
res8: Long = 0
```
`childRDD` may have no partitions, in which case the number of output partitions is also zero.
In the existing implementation, `ShuffledRowRDD` always produces a `SinglePartition` output, so this issue is not triggered.
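To illustrate the explanation above, here is a minimal Spark-free sketch of why zero partitions lead to the `NoSuchElementException` in the trace. It is a simplified model, not Spark code: `Partitions` and `countRows` are hypothetical names, and the assumption is that the final count runs once per input partition, so an input with no partitions yields no count row for `.head` to read.

```scala
object ZeroPartitionCountSketch {
  // Hypothetical stand-in for an RDD: a vector of partitions, each a vector of rows.
  type Partitions[A] = Vector[Vector[A]]

  // Models an aggregate that runs one task per input partition and emits
  // one count row per task. With zero partitions no task runs, so no row is emitted.
  def countRows[A](input: Partitions[A]): Vector[Long] =
    input.map(_.size.toLong)

  def main(args: Array[String]): Unit = {
    val noPartitions: Partitions[Int] = Vector.empty[Vector[Int]]
    val oneEmptyPartition: Partitions[Int] = Vector(Vector.empty[Int])

    // One empty partition still produces a single count row.
    println(countRows(oneEmptyPartition).headOption)
    // No partitions produce no row, so calling .head here would throw.
    println(countRows(noPartitions).headOption)
  }
}
```

Under this model, guaranteeing at least one (possibly empty) output partition is what keeps a downstream `count` from seeing an empty result.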
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786939347
**[Test build #135523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135523/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776476510
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39674/
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573438154
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,28 +210,32 @@ case class TakeOrderedAndProjectExec(
protected override def doExecute(): RDD[InternalRow] = {
val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
val childRDD = child.execute()
- val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
- val localTopK = childRDD.mapPartitions { iter =>
- org.apache.spark.util.collection.Utils.takeOrdered(iter.map(_.copy()), limit)(ord)
- }
- new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- localTopK,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
+ if (childRDD.getNumPartitions == 0) {
+ sparkContext.parallelize(Array.empty[InternalRow], 1)
Review comment:
ok, I will update it.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773374303
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39460/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775867646
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773118983
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773112265
> We may have more operators that adding shuffle in the doExecute method instead of the planner
@cloud-fan A shuffle is added directly in the `doExecute` method only by `TakeOrderedAndProjectExec`, `CollectLimitExec`, and `ShuffleExchangeExec`.
[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578914581
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
Oh I see. I am not sure if `CollectLimitExec` must be single partition, but this looks minor. I'm okay with `ParallelCollectionRDD`. Thanks.
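For reference, the three cases discussed in this thread (zero, one, or many input partitions) can be modelled without Spark. This is a rough sketch under stated assumptions, not the code from the PR: `Partitions`, `collectLimit`, and the in-memory "shuffle" (a plain concatenation) are hypothetical stand-ins for the RDD, the operator, and `ShuffledRowRDD`.

```scala
object CollectLimitSketch {
  // Hypothetical stand-in for an RDD: a vector of partitions, each a vector of rows.
  type Partitions[A] = Vector[Vector[A]]

  // Always returns exactly one output partition holding at most `limit` rows.
  def collectLimit[A](child: Partitions[A], limit: Int): Partitions[A] =
    child.length match {
      // No input partitions: emit a single empty partition instead of none.
      case 0 => Vector(Vector.empty[A])
      // One input partition: apply the limit in place, no shuffle needed.
      case 1 => Vector(child.head.take(limit))
      // Many input partitions: limit locally, gather into one partition
      // (the stand-in for the shuffle), then limit again.
      case _ =>
        val locallyLimited = child.map(_.take(limit))
        Vector(locallyLimited.flatten.take(limit))
    }

  def main(args: Array[String]): Unit = {
    println(collectLimit(Vector.empty[Vector[Int]], 10))
    println(collectLimit(Vector(Vector(1, 2, 3)), 2))
    println(collectLimit(Vector(Vector(1, 2), Vector(3, 4)), 3))
  }
}
```

The point of the sketch is the invariant: every branch yields a single output partition, which matches the single-partition behaviour the existing shuffle-based implementation provides.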
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786996767
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135523/
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773112265
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776643186
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39679/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786931035
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135510/
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578906516
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
@viirya @maropu we cannot set the number of partitions of an `EmptyRDD`; it always has zero partitions.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775823647
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39652/
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776440924
**[Test build #135092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135092/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776574374
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135092/
[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573435246
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,28 +210,32 @@ case class TakeOrderedAndProjectExec(
protected override def doExecute(): RDD[InternalRow] = {
val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
val childRDD = child.execute()
- val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
- val localTopK = childRDD.mapPartitions { iter =>
- org.apache.spark.util.collection.Utils.takeOrdered(iter.map(_.copy()), limit)(ord)
- }
- new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- localTopK,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
+ if (childRDD.getNumPartitions == 0) {
+ sparkContext.parallelize(Array.empty[InternalRow], 1)
Review comment:
nit: how about using `ParallelCollectionRDD` directly?
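For readers following the diff above, the `TakeOrderedAndProjectExec` branching can be sketched in plain Scala. This is a rough model under stated assumptions, not the code from the PR: `TopKSketch`, `takeOrdered`, and the use of `Seq[Seq[T]]` as a stand-in for an RDD's partitions are all hypothetical, and the "shuffle" is modelled as a simple concatenation.

```scala
object TopKSketch {
  // Hypothetical model: each inner Seq stands in for one RDD partition.
  def takeOrdered[T](partitions: Seq[Seq[T]], limit: Int)(
      implicit ord: Ordering[T]): Seq[Seq[T]] =
    if (partitions.isEmpty) {
      // Zero input partitions: emit one empty partition, no shuffle.
      Seq(Seq.empty[T])
    } else if (partitions.size == 1) {
      // One input partition: take the top K in place, no shuffle.
      Seq(partitions.head.sorted.take(limit))
    } else {
      // Several partitions: local top K per partition, then merge the
      // survivors into a single partition (stand-in for the shuffle)
      // and take the global top K.
      val localTopK = partitions.map(_.sorted.take(limit))
      Seq(localTopK.flatten.sorted.take(limit))
    }
}
```

In every branch the result has exactly one partition, which is the property the zero-partition case is meant to preserve.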
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776453331
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39674/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776151186
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135075/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776451204
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39672/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222391
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773148878
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39447/
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775988407
**[Test build #135075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135075/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786965907
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40104/
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578940920
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
@viirya the `outputPartitioning` of `CollectLimitExec` is `SinglePartition`, so the output RDD should have a single partition.
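A minimal plain-Scala sketch of the branching discussed here, assuming the 0/1/many-partition split described in the PR title. It is an illustration only: `CollectLimitSketch`, `collectLimit`, and the `Partitions` alias are hypothetical names, a `Seq[Seq[T]]` stands in for an RDD, and the shuffle is modelled as concatenation into one partition.

```scala
object CollectLimitSketch {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  type Partitions[T] = Seq[Seq[T]]

  def collectLimit[T](child: Partitions[T], limit: Int): Partitions[T] =
    if (child.isEmpty) {
      // Zero partitions: return one empty partition so the result still
      // matches a SinglePartition output, without any shuffle.
      Seq(Seq.empty[T])
    } else if (child.size == 1) {
      // One partition: apply the limit in place, no shuffle needed.
      Seq(child.head.take(limit))
    } else {
      // Several partitions: limit locally, move everything into a single
      // partition (stand-in for the shuffle), then limit again.
      val locallyLimited = child.map(_.take(limit))
      Seq(locallyLimited.flatten.take(limit))
    }
}
```

The zero-partition branch is the one under review: returning the child unchanged would yield zero partitions, which would not agree with a declared `SinglePartition` output.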
[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784661503
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776744754
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135097/
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775802401
**[Test build #135070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135070/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776574374
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135092/
[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-787031497
Thanks! Merged to master.
[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773688646
The fix itself looks fine.
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775802401
**[Test build #135070 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135070/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578951088
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
> @viirya the `outputPartitioning` of `CollectLimitExec` is `SinglePartition`, so the output rdd should have single partition.
Oh, my previous comment was confusing. I meant that I am not sure whether its `outputPartitioning` must be `SinglePartition`.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775861344
**[Test build #135070 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135070/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773457981
**[Test build #134874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134874/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773306844
**[Test build #134874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134874/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786662467
**[Test build #135510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135510/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776626070
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39679/
[GitHub] [spark] maropu commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776712671
Looks fine. Could you update the PR title and description, too? This PR is not only for a single partition case now.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775988407
**[Test build #135075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135075/testReport)** for PR 31468 at commit [`0bfb465`](https://github.com/apache/spark/commit/0bfb465c831ba1034d5e747308415c64f2f21306).
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776572321
**[Test build #135097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135097/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786922815
**[Test build #135510 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135510/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
* This patch **fails from timeout after a configured wait of `500m`**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416
```
scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("SELECT COUNT(*) FROM t").head
res1: org.apache.spark.sql.Row = [0]
scala> val t = spark.table("t")
t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
scala> t.count
res2: Long = 0
scala> t.rdd.getNumPartitions
res3: Int = 0
scala>
scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
... 47 elided
scala> spark.sql("SELECT COUNT(*) FROM v1").head
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
... 47 elided
scala> val v1 = spark.table("v1")
v1: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
scala> v1.count
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
... 47 elided
scala> v1.rdd.getNumPartitions
res7: Int = 0
scala> v1.repartition(3).count
res8: Long = 0
```
It seems that the `count` method does not work when a **View** has zero partitions.
With this change, `childRDD` may have no partitions, in which case the output also has zero partitions;
the existing implementation does not trigger this issue because the `ShuffledRowRDD` built with `SinglePartition` always produces exactly one output partition.
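
The failure mode described above can be sketched without Spark. This is a minimal model and not Spark code: each inner `Seq` stands in for one RDD partition, and `ZeroPartitionCount` and `countPerPartition` are hypothetical names introduced only for illustration. It assumes a global count emits one row per partition it runs on, so zero partitions yield zero rows and a subsequent `head` has nothing to return.

```scala
object ZeroPartitionCount {
  // Hypothetical model: each inner Seq stands in for one RDD partition.
  type Partitions[T] = Seq[Seq[T]]

  // A global count emits one row per partition it actually runs on.
  def countPerPartition[T](parts: Partitions[T]): Seq[Long] =
    parts.map(_.size.toLong)

  def main(args: Array[String]): Unit = {
    val oneEmptyPartition: Partitions[Int] = Seq(Seq.empty)
    val noPartitions: Partitions[Int] = Seq.empty

    // One empty partition still yields a single row holding 0.
    println(countPerPartition(oneEmptyPartition).headOption) // prints Some(0)

    // Zero partitions yield no rows at all, so calling `head` here would throw.
    println(countPerPartition(noPartitions).headOption) // prints None
  }
}
```

In this model, `v1.repartition(3).count` succeeding in the session above is consistent with the repartitioned data having at least one partition for the count to run on.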
[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578951440
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
> or we can use `EmptyRDDWithPartitions` defined in `CoalesceExec`?
> maybe we can make `EmptyRDD` support number of partition, so we can use it in both `CollectLimitExec` and `CoalesceExec`. But I am not sure whether we should do this.
I think that sounds like over-engineering. At least for now, I don't think we need it.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776434334
**[Test build #135090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135090/testReport)** for PR 31468 at commit [`c9a5a81`](https://github.com/apache/spark/commit/c9a5a81d8a6e612bd2c8b7e82c00587edcd70cbc).
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776151186
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135075/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776643186
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39679/
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573432645
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,7 +205,7 @@ case class TakeOrderedAndProjectExec(
protected override def doExecute(): RDD[InternalRow] = {
val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
val childRDD = child.execute()
- val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
+ val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {
Review comment:
There is no separate test suite for `CollectLimitExec`, so I added cases for `childRDD.getNumPartitions == 0` and `childRDD.getNumPartitions == 1` to `TakeOrderedAndProjectSuite`.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773207218
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39447/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776102503
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39657/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786717940
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40091/
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578910718
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
`new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)` is an empty RDD with a single partition.
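
The branching in the diff above can be modelled with plain Scala collections. This is only a sketch under stated assumptions, not the PR's implementation: `CollectLimitModel` and `collectLimit` are hypothetical names, each inner `Seq` stands in for one partition, flattening stands in for the shuffle to a single partition, and the one-partition branch is inferred from the PR title because the quoted hunk is cut off after the zero-partition case.

```scala
object CollectLimitModel {
  // Hypothetical model: each inner Seq stands in for one RDD partition.
  type Partitions[T] = Seq[Seq[T]]

  def collectLimit[T](child: Partitions[T], limit: Int): Partitions[T] =
    child.size match {
      // No input partitions: return one empty partition instead of none.
      case 0 => Seq(Seq.empty[T])
      // Already a single partition: apply the limit in place, no shuffle.
      case 1 => Seq(child.head.take(limit))
      // Otherwise: limit each partition, merge into one, then limit again.
      case _ =>
        val locallyLimited = child.map(_.take(limit))
        Seq(locallyLimited.flatten.take(limit)) // flatten stands in for the shuffle
    }

  def main(args: Array[String]): Unit = {
    println(collectLimit(Seq.empty[Seq[Int]], 10).size)     // prints 1
    println(collectLimit(Seq(Seq(1, 2, 3)), 2))             // prints List(List(1, 2))
    println(collectLimit(Seq(Seq(1, 2, 3), Seq(4, 5)), 4))  // prints List(List(1, 2, 3, 4))
  }
}
```

Whatever the input, the result in this model always has exactly one partition, which matches what the operator returned before this change.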
[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573989674
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
EmptyRDD?
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775867647
[GitHub] [spark] maropu closed pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
maropu closed pull request #31468:
URL: https://github.com/apache/spark/pull/31468
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222393
[GitHub] [spark] HyukjinKwon commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775943050
retest this please
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776468501
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39674/
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773118983
**[Test build #134860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134860/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775859628
**[Test build #135069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135069/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776572862
**[Test build #135092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135092/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773460675
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134874/
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776552571
retest this please
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786931035
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135510/
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775741112
**[Test build #135069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135069/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773128349
cc @viirya @maropu
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786965907
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40104/
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573433676
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +52,21 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {
Review comment:
Following [CoalesceExec.doExecute](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L715), I updated the fix to directly return an empty RDD with a single partition if `childRDD.getNumPartitions == 0`.
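As a rough illustration of the three cases discussed in this comment (zero, one, or several input partitions), here is a minimal plain-Scala sketch. It is a toy model, not Spark code: an "RDD" is represented as a `Vector` of partitions, and the names `CollectLimitSketch` and `collectLimit` are hypothetical. The shuffle is simulated by concatenating partitions, and the single-partition branch reflects the PR title rather than the exact merged code.

```scala
// Toy model: an "RDD" is just a vector of partitions. Not Spark code.
object CollectLimitSketch {
  type Partitions[A] = Vector[Vector[A]]

  // Returns the single output partition plus a flag saying whether a
  // (simulated) shuffle was needed to produce it.
  def collectLimit[A](child: Partitions[A], limit: Int): (Partitions[A], Boolean) =
    child.size match {
      case 0 =>
        // No input partitions: hand back one empty partition, no shuffle.
        (Vector(Vector.empty[A]), false)
      case 1 =>
        // Already a single partition: apply the limit in place, no shuffle.
        (Vector(child.head.take(limit)), false)
      case _ =>
        // Limit each partition locally, merge everything into one
        // partition (the simulated shuffle), then limit again.
        val locallyLimited = child.map(_.take(limit))
        (Vector(locallyLimited.flatten.take(limit)), true)
    }
}
```

In this model only the last branch sets the shuffle flag, which is the saving the PR is after for inputs with zero or one partition.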
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776440924
**[Test build #135092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135092/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786994403
**[Test build #135523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135523/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776572321
**[Test build #135097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135097/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786662467
**[Test build #135510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135510/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776020678
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39657/
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773112511
related to https://github.com/apache/spark/pull/31409
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776716166
**[Test build #135097 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135097/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773222393
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773374303
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39460/
----------------------------------------------------------------
[GitHub] [spark] viirya commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773508769
`org.apache.spark.sql.CachedTableSuite.SPARK-34269: cache lookup with ORDER BY / LIMIT clause` failed more than once, but it seems to pass without this change?
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773199615
**[Test build #134860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134860/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773306844
**[Test build #134874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134874/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
----------------------------------------------------------------
[GitHub] [spark] viirya commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578910294
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
Doesn't `childRDD` have zero partitions here?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776536491
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135090/
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776605631
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39679/
----------------------------------------------------------------
[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578070009
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
kindly ping @zhenglaizhang
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-787586530
Thank you so much!
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773460675
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134874/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776102503
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39657/
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786939347
**[Test build #135523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135523/testReport)** for PR 31468 at commit [`75c4d92`](https://github.com/apache/spark/commit/75c4d92dc7d24b6698b2eb7e5d103e95e925247a).
----------------------------------------------------------------
[GitHub] [spark] cloud-fan commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773292371
retest this please
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r573392389
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -200,7 +205,7 @@ case class TakeOrderedAndProjectExec(
protected override def doExecute(): RDD[InternalRow] = {
val ord = new LazilyGeneratedOrdering(sortOrder, child.output)
val childRDD = child.execute()
- val singlePartitionRDD = if (childRDD.getNumPartitions > 1) {
+ val singlePartitionRDD = if (childRDD.getNumPartitions != 1) {
Review comment:
We don't have any test for the case `childRDD.getNumPartitions == 0`?
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776434334
**[Test build #135090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135090/testReport)** for PR 31468 at commit [`c9a5a81`](https://github.com/apache/spark/commit/c9a5a81d8a6e612bd2c8b7e82c00587edcd70cbc).
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776476510
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39674/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776536491
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135090/
[GitHub] [spark] viirya commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773508769
`org.apache.spark.sql.CachedTableSuite.SPARK-34269: cache lookup with ORDER BY / LIMIT clause` failed more than once, but it seems to pass without this change?
[GitHub] [spark] zhengruifeng edited a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng edited a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416
```
scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.sql("SELECT COUNT(*) FROM t").head
res1: org.apache.spark.sql.Row = [0]
scala> val t = spark.table("t")
t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
scala> t.count
res2: Long = 0
scala> t.rdd.getNumPartitions
res3: Int = 0
scala>
scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
... 47 elided
scala> spark.sql("SELECT COUNT(*) FROM v1").head
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
... 47 elided
scala> val v1 = spark.table("v1")
v1: org.apache.spark.sql.DataFrame = [key: bigint, value: string]
scala> v1.count
java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike.head(IterableLike.scala:109)
at scala.collection.IterableLike.head$(IterableLike.scala:108)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
... 47 elided
scala> v1.rdd.getNumPartitions
res7: Int = 0
scala> v1.repartition(3).count
res8: Long = 0
```
Commit 6d199ff did not guarantee that the output has a single partition: `childRDD` may have no partitions, in which case the number of output partitions is also zero.
In master, by contrast, the `ShuffledRowRDD` ensures the output is `SinglePartition`.
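The zero-partition problem can be illustrated with a toy model in plain Scala. This is only a sketch, not Spark code: `Partitions`, `countPerPartition` and `toSinglePartition` are hypothetical names, and the model assumes that the downstream count emits one row per input partition without a further shuffle, which would explain why `.head` finds no row when there are zero partitions.

```scala
// Toy model (not Spark): an "RDD" is a sequence of partitions, each a sequence of rows.
object ZeroPartitionSketch {
  type Partitions[T] = Vector[Vector[T]]

  // A global count that trusts its input to be a single partition:
  // it emits one result row per partition and performs no shuffle.
  def countPerPartition[T](input: Partitions[T]): Vector[Long] =
    input.map(_.size.toLong)

  // Hypothetical guard: always return exactly one partition,
  // substituting one empty partition when the input has none.
  def toSinglePartition[T](input: Partitions[T]): Partitions[T] =
    if (input.isEmpty) Vector(Vector.empty[T]) else Vector(input.flatten)

  def main(args: Array[String]): Unit = {
    val noPartitions: Partitions[Int] = Vector.empty

    // Zero partitions produce zero result rows, so there is no head row to read.
    println(countPerPartition(noPartitions).headOption) // prints None

    // One empty partition produces the expected single row with count 0.
    println(countPerPartition(toSinglePartition(noPartitions)).headOption) // prints Some(0)
  }
}
```

In this model, calling `.head` instead of `.headOption` on the first result would throw `NoSuchElementException`, the same exception type shown in the session above.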
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784725646
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135391/
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776447930
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39672/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-784722250
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39971/
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775843480
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39652/
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776446396
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39672/
[GitHub] [spark] AmplabJenkins commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776451204
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39672/
[GitHub] [spark] srowen commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786931812
Jenkins retest this please
[GitHub] [spark] zhengruifeng commented on a change in pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #31468:
URL: https://github.com/apache/spark/pull/31468#discussion_r578943394
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
##########
@@ -52,16 +53,25 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends LimitExec {
SQLShuffleReadMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics
protected override def doExecute(): RDD[InternalRow] = {
- val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
- val shuffled = new ShuffledRowRDD(
- ShuffleExchangeExec.prepareShuffleDependency(
- locallyLimited,
- child.output,
- SinglePartition,
- serializer,
- writeMetrics),
- readMetrics)
- shuffled.mapPartitionsInternal(_.take(limit))
+ val childRDD = child.execute()
+ if (childRDD.getNumPartitions == 0) {
+ new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
Review comment:
Alternatively, we could use the `EmptyRDDWithPartitions` defined in `CoalesceExec`.
Maybe we could make `EmptyRDD` support a configurable number of partitions, so that it can be used in both `CollectLimitExec` and `CoalesceExec`, but I am not sure whether we should do this.
[GitHub] [spark] SparkQA removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-773118983
**[Test build #134860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134860/testReport)** for PR 31468 at commit [`6d199ff`](https://github.com/apache/spark/commit/6d199ff4ddcc2adca3141d70019354625e298372).
[GitHub] [spark] SparkQA commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-776535783
**[Test build #135090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135090/testReport)** for PR 31468 at commit [`c9a5a81`](https://github.com/apache/spark/commit/c9a5a81d8a6e612bd2c8b7e82c00587edcd70cbc).
* This patch **fails PySpark pip packaging tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-786717940
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40091/
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31468:
URL: https://github.com/apache/spark/pull/31468#issuecomment-775732036
retest this please