You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/06/24 17:11:55 UTC

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7897: Spark 3.4: Multiple shuffle partitions per file in compaction

aokolnychyi commented on code in PR #7897:
URL: https://github.com/apache/iceberg/pull/7897#discussion_r1240903803


##########
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/actions/SparkShufflingDataRewriter.java:
##########
@@ -59,7 +61,24 @@ abstract class SparkShufflingDataRewriter extends SparkSizeBasedDataRewriter {
 
   public static final double COMPRESSION_FACTOR_DEFAULT = 1.0;
 
+  /**

Review Comment:
   I tested the current implementation on a table with 1 TB of data and a cluster with 16 GB executors 7 cores each. The target file size is 1 GB (zstd Parquet data). Sort-based optimizations without this option were spilling and failed, I lost all executors one by one. I tried using 8 shuffle partitions per file and the operation succeeded without any failures and produced properly sized files.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org