Posted to commits@spark.apache.org by rx...@apache.org on 2017/11/02 13:19:26 UTC
spark git commit: [SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

Repository: spark
Updated Branches:
  refs/heads/master 849b465bb -> 277b1924b


[SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

## What changes were proposed in this pull request?

Adding a global limit on top of the distinct values, before sorting and collecting, reduces the overall work when the pivot column has more distinct values than maxValues. The result is also fetched eagerly with collect() rather than take(), because at most (maxValues + 1) rows can remain after the limit.
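
The reasoning above can be modeled with plain Scala collections. This is only a sketch, not Spark code: `PivotLimitSketch`, `sortThenTake`, and `limitThenSort` are hypothetical names, a `Seq[String]` stands in for the pivot column, and `Seq.take` stands in for Spark's `limit` (which may keep any subset of rows, not necessarily the first ones). It suggests why moving the cap ahead of the sort should be safe: within the cap the limit removes nothing, and over the cap only the number of values matters to the length check that follows.

```scala
object PivotLimitSketch {
  // Old ordering: sort every distinct value, then keep at most maxValues + 1.
  def sortThenTake(rows: Seq[String], maxValues: Int): Seq[String] =
    rows.distinct.sorted.take(maxValues + 1)

  // New ordering: cap the distinct values first, then sort the small remainder.
  def limitThenSort(rows: Seq[String], maxValues: Int): Seq[String] =
    rows.distinct.take(maxValues + 1).sorted

  def main(args: Array[String]): Unit = {
    // Within the cap: the limit drops nothing, so both orderings agree.
    val small = Seq("b", "a", "c", "a", "b")
    assert(sortThenTake(small, 3) == limitThenSort(small, 3))

    // Over the cap: the surviving values may differ between the two orderings,
    // but both return maxValues + 1 values, which is enough to detect overflow.
    val large = Seq("c", "b", "a")
    assert(sortThenTake(large, 1).length == 2)
    assert(limitThenSort(large, 1).length == 2)
  }
}
```

In this model the second ordering sorts at most (maxValues + 1) values instead of every distinct value, which is the saving the change is after.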

## How was this patch tested?

Existing tests cover the sorted order.

Author: Patrick Woody <pw...@palantir.com>

Closes #19629 from pwoody/SPARK-22408.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/277b1924
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/277b1924
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/277b1924

Branch: refs/heads/master
Commit: 277b1924b46a70ab25414f5670eb784906dbbfdf
Parents: 849b465
Author: Patrick Woody <pw...@palantir.com>
Authored: Thu Nov 2 14:19:21 2017 +0100
Committer: Reynold Xin <rx...@databricks.com>
Committed: Thu Nov 2 14:19:21 2017 +0100

----------------------------------------------------------------------
 .../scala/org/apache/spark/sql/RelationalGroupedDataset.scala    | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/277b1924/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
index 21e94fa..3e4edd4 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
@@ -321,10 +321,10 @@ class RelationalGroupedDataset protected[sql](
     // Get the distinct values of the column and sort them so its consistent
     val values = df.select(pivotColumn)
       .distinct()
+      .limit(maxValues + 1)
       .sort(pivotColumn)  // ensure that the output columns are in a consistent logical order
-      .rdd
+      .collect()
       .map(_.get(0))
-      .take(maxValues + 1)
       .toSeq
 
     if (values.length > maxValues) {

