Posted to issues@iceberg.apache.org by "bharos (via GitHub)" <gi...@apache.org> on 2023/04/27 23:31:39 UTC

[GitHub] [iceberg] bharos commented on issue #7406: Question about iceberg partition table

bharos commented on issue #7406:
URL: https://github.com/apache/iceberg/issues/7406#issuecomment-1526757304

   I was exploring a similar issue and found a few things that might be helpful.
   
   The partition skew is mostly happening because of the `write.distribution-mode` config.
   
   By default, it is set to `hash` for partitioned tables written from Spark. This means that Spark does a shuffle (which is expensive) and routes all rows for a given table partition to a single task, which is what you observe in your example.
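
To make the skew concrete, here is a toy simulation in plain Python (not Spark). It assumes rows are routed to tasks by a hash of the partition key; `hash_distribute`, the CRC32 hash, and the sample data are all made up for illustration, and Spark uses its own hash function.

```python
from collections import defaultdict
import zlib


def hash_distribute(rows, num_tasks):
    """Route each (partition_key, value) row to a task by hashing its partition key."""
    tasks = defaultdict(list)
    for key, value in rows:
        # Stable stand-in hash; the real engine uses a different function.
        task_id = zlib.crc32(key.encode("utf-8")) % num_tasks
        tasks[task_id].append((key, value))
    return tasks


# Hypothetical data: three date partitions, one far larger than the others.
rows = (
    [("2023-04-25", i) for i in range(1000)]
    + [("2023-04-26", i) for i in range(10)]
    + [("2023-04-27", i) for i in range(10)]
)
tasks = hash_distribute(rows, num_tasks=200)

# Record which tasks received rows for each partition key.
tasks_per_partition = defaultdict(set)
for task_id, task_rows in tasks.items():
    for key, _ in task_rows:
        tasks_per_partition[key].add(task_id)

busy_tasks = len(tasks)  # never more than the number of distinct partition keys
largest_task = max(len(task_rows) for task_rows in tasks.values())
```

Even with 200 tasks available, at most three of them do any work here, and the one that receives the large partition writes all of its rows alone.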
   
   The benefit of this is that you get fewer, larger files, which keeps the metadata files smaller and avoids the small files problem.
   
   One config you can try to speed up the writes is to set `write.distribution-mode` to `none`. This means there is no shuffle, and any task can write to any partition. The downside is that it can produce a large number of files, which brings back the small files problem and slows down reads.
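
As a rough sketch, the table property can be changed with an `ALTER TABLE ... SET TBLPROPERTIES` statement. The helper below only builds that statement as a string; `set_distribution_mode_sql` and the table name `db.events` are hypothetical, and running it still needs a SparkSession with the Iceberg catalog configured.

```python
VALID_MODES = {"none", "hash", "range"}


def set_distribution_mode_sql(table, mode):
    """Build the Spark SQL statement that sets write.distribution-mode on a table."""
    # Normalize to the lowercase spelling used in the Iceberg documentation.
    mode = mode.lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported distribution mode: {mode!r}")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('write.distribution-mode' = '{mode}')"
    )


# `db.events` is a placeholder table name.
statement = set_distribution_mode_sql("db.events", "none")
# With a SparkSession available: spark.sql(statement)
```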
   
   I also think there can be an intermediate option: set `write.distribution-mode` to `none` and then repartition the data yourself before the Spark write. You can use `repartition(numPartitions)` or, if you only need to reduce the number of tasks, `coalesce(numPartitions)`, so that you get more parallelism and still end up with reasonably large files. Note that `repartition` does shuffle, but it spreads rows evenly across tasks instead of grouping them by partition key.
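
A minimal PySpark sketch of that intermediate option could look like the following. It assumes `df` is a Spark DataFrame, that the target table already has `write.distribution-mode` set to `none`, and that the Spark version provides `DataFrame.writeTo`; `append_with_parallelism` is a hypothetical helper name.

```python
def append_with_parallelism(df, table, num_partitions):
    """Spread rows over num_partitions tasks, then append to the Iceberg table.

    Assumes the table's write.distribution-mode is none, so Iceberg does not
    add its own shuffle by partition key on top of this one.
    """
    # repartition(n) shuffles rows evenly into n Spark partitions, so every
    # task gets a similar share of the data regardless of partition skew.
    (
        df.repartition(num_partitions)
        .writeTo(table)
        .append()
    )
```

Each of the `num_partitions` tasks may write to every table partition, so the number of files per partition can grow up to `num_partitions`. Choosing that value is the trade-off between write parallelism and file size.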


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

