Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/09/09 01:45:19 UTC

[GitHub] [iceberg] robert8138 commented on pull request #3461: Spark: Request distribution and ordering for writes

robert8138 commented on PR #3461:
URL: https://github.com/apache/iceberg/pull/3461#issuecomment-1241406799

   Seeing Ryan's comment, it seems that the `ALTER TABLE ... WRITE ORDER BY ...` SQL extension actually does not work for `INSERT` before Spark 3.2. I've tried it myself (we are on Spark 3.1) and it didn't work. Has there been any recent development on this extension?
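   For context, this is the kind of DDL we attempted (table and column names are just examples; it requires the Iceberg SQL extensions to be registered in the Spark session, and the extension syntax is `WRITE ORDERED BY`):

   ```sql
   -- Enable via: spark.sql.extensions=
   --   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
   ALTER TABLE db.events WRITE ORDERED BY ds, category;
   ```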
   
   Additionally, we've tried a few other suggestions Ryan gave in the other threads:
   
   * `fan out writer` - This didn't work for us, possibly because we have a DS-partitioned table that goes back as far as 2008, so we were opening thousands of files at once and the overhead was too big.
   * `sorting` - we tried global sort (`ORDER BY`), local sort (`SORT BY`), and `CLUSTER BY`, and they all worked, but with varying cost and performance!
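   Roughly what we ran for each sorting variant (table names are illustrative):

   ```sql
   -- Global sort: range-partitions across tasks; most expensive shuffle, best clustering
   INSERT INTO db.events SELECT * FROM staging_events ORDER BY ds;

   -- Local sort: sorts within each task, no extra shuffle
   INSERT INTO db.events SELECT * FROM staging_events SORT BY ds;

   -- Cluster by: hash-distributes rows by ds, then sorts within each task
   INSERT INTO db.events SELECT * FROM staging_events CLUSTER BY ds;
   ```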
   
   I heard that starting in Spark 3.3 this will be handled transparently, so users will no longer have to explicitly sort their data before an `INSERT` into a partitioned table. Is that correct?
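   If it helps to confirm: my understanding is that on versions where Spark honors Iceberg's requested distribution and ordering, the behavior is steered by the `write.distribution-mode` table property, e.g. (table name illustrative):

   ```sql
   -- Ask writers to hash-distribute rows by partition before writing
   ALTER TABLE db.events SET TBLPROPERTIES ('write.distribution-mode' = 'hash');
   ```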
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org