Posted to reviews@spark.apache.org by "sunchao (via GitHub)" <gi...@apache.org> on 2023/09/06 20:09:12 UTC

[GitHub] [spark] sunchao opened a new pull request, #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

sunchao opened a new pull request, #42839:
URL: https://github.com/apache/spark/pull/42839

### What changes were proposed in this pull request?

This PR makes sure that the grouped partitions returned by `DataSourceV2ScanExec#groupPartitions` are sorted according to their partition values. Previously, in #42757, we assumed that Scala would preserve the input ordering, but apparently that is not the case.
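
A minimal sketch of the group-then-sort pattern, using plain Scala collections rather than Spark classes. `Grouped` and `GroupThenSort` are hypothetical stand-ins (not Spark APIs), and integer keys stand in for partition values. `groupBy` returns a `Map`, whose iteration order is not guaranteed to follow the input, so the sketch sorts once more after grouping.

```scala
// Hypothetical stand-in for a grouped partition: a partition value plus its input splits.
final case class Grouped(value: Int, parts: Seq[String])

object GroupThenSort {
  // Group (partition value, split) pairs by value, then sort the groups by value.
  def groupPartitions(results: Seq[(Int, String)]): Seq[Grouped] = {
    val ordering: Ordering[Int] = Ordering.Int
    results
      .sorted(ordering.on((t: (Int, String)) => t._1)) // sort the raw pairs first
      .groupBy(_._1)                                   // a Map: iteration order is unspecified
      .toSeq
      .map { case (key, s) => Grouped(key, s.map(_._2)) }
      .sorted(ordering.on((g: Grouped) => g.value))    // restore the order explicitly
  }
}
```

With this shape, the result order no longer depends on how the intermediate `Map` happens to iterate.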

### Why are the changes needed?

See https://github.com/apache/spark/pull/42757#discussion_r1316926504 for diagnosis. The partition ordering is a fundamental property for SPJ and thus must be guaranteed.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

We have tests in `KeyGroupedPartitioningSuite` to cover this.

### Was this patch authored or co-authored using generative AI tooling?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a diff in pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #42839:
URL: https://github.com/apache/spark/pull/42839#discussion_r1317860693


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Review Comment:
   Oh okay, missed it. It is part of `KeyGroupedPartitionInfo`.




[GitHub] [spark] LuciferYang commented on pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on PR #42839:
URL: https://github.com/apache/spark/pull/42839#issuecomment-1709895599

   Merged into master. Thanks @sunchao @viirya @dongjoon-hyun @Hisoka-X 



[GitHub] [spark] viirya commented on a diff in pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #42839:
URL: https://github.com/apache/spark/pull/42839#discussion_r1317829652


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Review Comment:
   Then I think we don't need this sorting, if you are going to sort anyway after `groupBy`?




[GitHub] [spark] sunchao commented on a diff in pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "sunchao (via GitHub)" <gi...@apache.org>.

sunchao commented on code in PR #42839:
URL: https://github.com/apache/spark/pull/42839#discussion_r1317833964


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Review Comment:
   Hmm, this is the partition value -> input split mapping before the `groupBy`, though. It also needs to be returned.




[GitHub] [spark] LuciferYang commented on a diff in pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #42839:
URL: https://github.com/apache/spark/pull/42839#discussion_r1318042923


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))
+        val sortedGroupedPartitions = sortedKeyToPartitions
             .map(t => (InternalRowComparableWrapper(t._1, expressions), t._2))
             .groupBy(_._1)
             .toSeq
             .map { case (key, s) => KeyGroupedPartition(key.row, s.map(_._2)) }
+            .sorted(rowOrdering.on(_.value))

Review Comment:
   ```suggestion
               .sorted(rowOrdering.on((k: KeyGroupedPartition) => k.value))
   ```



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Review Comment:
   To fix the Scala 2.13 build:
   
   ```
   [error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:147:67: missing parameter type for expanded function ((<x$7: error>) => x$7._1)
   [error]         val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))
   ```



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Review Comment:
   ```suggestion
           val sortedKeyToPartitions = results.sorted(rowOrdering.on((t: (InternalRow, _)) => t._1))
   ```
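
The annotated form suggested above can be tried outside Spark with a small sketch. Everything here is a hypothetical stand-in: an `Ordering[Int]` plays the role of `rowOrdering`, and `(Int, String)` pairs play the role of the `(InternalRow, InputPartition)` tuples. The `sortBy` variant is an alternative shown for comparison, not something proposed in this thread.

```scala
object OrderingOnExample {
  // Stand-ins for `rowOrdering` and `results` from the diff (assumptions for illustration).
  val rowOrdering: Ordering[Int] = Ordering.Int
  val results: Seq[(Int, String)] = Seq((2, "b"), (1, "a"))

  // The thread reports that `results.sorted(rowOrdering.on(_._1))` fails on the
  // Scala 2.13 build with "missing parameter type". Annotating the lambda
  // parameter gives the compiler the argument type directly.
  val annotated: Seq[(Int, String)] =
    results.sorted(rowOrdering.on((t: (Int, String)) => t._1))

  // Alternative: `sortBy` takes the key function and the ordering separately,
  // so the lambda is typed from the element type of `results`.
  val viaSortBy: Seq[(Int, String)] = results.sortBy(_._1)(rowOrdering)
}
```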



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))
+        val sortedGroupedPartitions = sortedKeyToPartitions
             .map(t => (InternalRowComparableWrapper(t._1, expressions), t._2))
             .groupBy(_._1)
             .toSeq
             .map { case (key, s) => KeyGroupedPartition(key.row, s.map(_._2)) }
+            .sorted(rowOrdering.on(_.value))

Review Comment:
   ditto




[GitHub] [spark] sunchao commented on a diff in pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "sunchao (via GitHub)" <gi...@apache.org>.

sunchao commented on code in PR #42839:
URL: https://github.com/apache/spark/pull/42839#discussion_r1318064440


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Review Comment:
   Oops, I didn't realize it doesn't compile this way.




[GitHub] [spark] dongjoon-hyun commented on pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.

dongjoon-hyun commented on PR #42839:
URL: https://github.com/apache/spark/pull/42839#issuecomment-1710600124

   Thank you, all!



[GitHub] [spark] LuciferYang closed pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang closed pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values
URL: https://github.com/apache/spark/pull/42839



[GitHub] [spark] sunchao commented on a diff in pull request #42839: [SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values

Posted by "sunchao (via GitHub)" <gi...@apache.org>.

sunchao commented on code in PR #42839:
URL: https://github.com/apache/spark/pull/42839#discussion_r1318064553


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:
##########
@@ -143,17 +143,16 @@ trait DataSourceV2ScanExecBase extends LeafExecNode {
         // also sort the input partitions according to their partition key order. This ensures
         // a canonical order from both sides of a bucketed join, for example.
         val partitionDataTypes = expressions.map(_.dataType)
-        val partitionOrdering: Ordering[(InternalRow, InputPartition)] = {
-          RowOrdering.createNaturalAscendingOrdering(partitionDataTypes).on(_._1)
-        }
-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))
+        val sortedGroupedPartitions = sortedKeyToPartitions
             .map(t => (InternalRowComparableWrapper(t._1, expressions), t._2))
             .groupBy(_._1)
             .toSeq
             .map { case (key, s) => KeyGroupedPartition(key.row, s.map(_._2)) }
+            .sorted(rowOrdering.on(_.value))

Review Comment:
   thanks!


