You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/19 16:26:24 UTC

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3930: Spark 3.2: Use hash distribution by default in copy-on-write DELETE

aokolnychyi commented on a change in pull request #3930:
URL: https://github.com/apache/iceberg/pull/3930#discussion_r787925408



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/TestSparkDistributionAndOrderingUtil.java
##########
@@ -299,14 +300,70 @@ public void testRangeWritePartitionedSortedTable() {
     checkWriteDistributionAndOrdering(table, expectedDistribution, expectedOrdering);
   }
 
+  // =============================================================
+  // Distribution and ordering for copy-on-write DELETE operations
+  // =============================================================
+  //
+  // UNPARTITIONED UNORDERED
+  // -------------------------------------------------------------------------
+  // delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos
+  // delete mode is NONE -> unspecified distribution + empty ordering
+  // delete mode is HASH -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos
+  // delete mode is RANGE -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos
+  //
+  // UNPARTITIONED ORDERED BY id, data
+  // -------------------------------------------------------------------------
+  // delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos

Review comment:
       I am debating a bit what should be the correct ordering in this case given that our table has a defined sort order. On one hand, sorting by `_file` and `_pos` may lead to unsorted files on S3 if the object storage paths are enabled. On the other hand, sorting by the defined sort columns may rebalance files (could be a good thing?).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org