Posted to issues@iceberg.apache.org by "RussellSpitzer (via GitHub)" <gi...@apache.org> on 2023/01/27 16:03:38 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue, #6679: Change Default Write Distribution Mode

RussellSpitzer opened a new issue, #6679:
URL: https://github.com/apache/iceberg/issues/6679

   ### Feature Request / Improvement
   
   Merge writes, as well as some inserts, end up generating many files with our default write distribution mode of None. While this is the cheapest mode and is our long-standing default behavior, we now have several reasons to default to Range (or Hash).
   
   1. Spark AQE now has both skew handling and adaptive coalesce
   2. With Merge operations None is never the correct mode to request since we are always shuffling anyway
   3. More users are coming to Iceberg and don't understand how Spark Partitioning works (required to get good perf with default None)
   
   
   I suggest we change the default distribution mode to Range and add some documentation about configuring AQE to the Spark docs. I think this will be better behavior for most first-time users, and power users can still manually configure a different mode for their specific requirements.
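   
   As a rough sketch of what this proposal looks like when applied by hand, the statements below set the table-level distribution mode and turn on the AQE features listed above. `db_name.tbl_name` is a placeholder table name, and the choice of AQE settings is an assumption about what the docs might recommend, not a tuned configuration.
   
   ```sql
   -- Request a range distribution for writes to this table (placeholder table name).
   ALTER TABLE db_name.tbl_name SET TBLPROPERTIES ('write.distribution-mode'='range');
   
   -- Session-level Spark settings: adaptive execution, partition coalescing, skew join handling.
   SET spark.sql.adaptive.enabled=true;
   SET spark.sql.adaptive.coalescePartitions.enabled=true;
   SET spark.sql.adaptive.skewJoin.enabled=true;
   ```
   
   Because the property is stored on the table, every writer picks it up, while the `SET` statements only affect the current Spark session.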
   
   ### Query engine
   
   Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #6679: Change Default Write Distribution Mode

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1410960780

   We have examples in `TestSparkDistributionAndOrderingUtil` that should become a section in the docs.




[GitHub] [iceberg] aokolnychyi commented on issue #6679: Change Default Write Distribution Mode

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1410945521

   I would be careful with `range` as it may cause performance regressions, especially for MERGE. The range distribution requires sampling, which leads to scanning the data twice and re-evaluating particular nodes in the plan. This would cause the same issue we have today, where the default performs poorly.
   
   The upcoming Spark 3.4 has support for rebalancing partitions via AQE for hash distributions requested by v2 writes. That means we can request a hash distribution without worrying about having too much data per task and OOM. I'd rather switch to `hash` as the default and let users configure a different mode if it fails. I don't know a single use case where the range distribution performs well in MERGE at any reasonable scale.
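   
   For illustration, a hash distribution for MERGE can already be requested per table through the merge-specific property. This is a minimal sketch: `db_name.tbl_name` is a placeholder, and the advisory partition size shown (128 MB in bytes) is an assumed example value, on the assumption that this Spark setting is what guides how AQE sizes the rebalanced partitions.
   
   ```sql
   -- Request a hash distribution for MERGE writes only (placeholder table name).
   ALTER TABLE db_name.tbl_name SET TBLPROPERTIES ('write.merge.distribution-mode'='hash');
   
   -- AQE must be enabled for Spark to rebalance partitions at runtime.
   SET spark.sql.adaptive.enabled=true;
   
   -- Assumed example value: target roughly 128 MB per shuffle partition.
   SET spark.sql.adaptive.advisoryPartitionSizeInBytes=134217728;
   ```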




[GitHub] [iceberg] RussellSpitzer commented on issue #6679: Change Default Write Distribution Mode

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1416494895

   @dramaticlly Did you want to write up another issue for specifying write distribution mode as a Spark SqlConf option?




[GitHub] [iceberg] jackye1995 commented on issue #6679: Change Default Write Distribution Mode

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1407074738

   +1 for using range as default. Overall we probably need a dedicated doc section about how to configure those parameters in the Iceberg Spark documentation for people to make informed decisions.




[GitHub] [iceberg] aokolnychyi closed issue #6679: Change Default Write Distribution Mode

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi closed issue #6679: Change Default Write Distribution Mode
URL: https://github.com/apache/iceberg/issues/6679




[GitHub] [iceberg] RussellSpitzer commented on issue #6679: Change Default Write Distribution Mode

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1406702289

   @aokolnychyi + @danielcweeks + @rdblue + @jackye1995 + @szehon-ho 
   
   Please ping anyone else as well who would have strong opinions about this change




[GitHub] [iceberg] RussellSpitzer commented on issue #6679: Change Default Write Distribution Mode

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1406926423

   The "none" mode in GDPR cases still only helps when the data is already aligned with the partitioning of the table. This is rarely the case in my experience.




[GitHub] [iceberg] dramaticlly commented on issue #6679: Change Default Write Distribution Mode

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1406924909

   Thank you @RussellSpitzer. I understand where this change is coming from, but some GDPR-like deletions on V1 tables benefit from the `none` write distribution mode (to avoid a shuffle where possible). I am aware that we can currently configure it via table properties such as `write.delete.distribution-mode` or `write.update.distribution-mode`, but I am wondering if there is any way to configure it at the Spark job level (deletes are also done via SQL only, which makes it harder).




[GitHub] [iceberg] rdblue commented on issue #6679: Change Default Write Distribution Mode

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1407477437

   +1 for range as default.




[GitHub] [iceberg] JunchengMa commented on issue #6679: Change Default Write Distribution Mode

Posted by "JunchengMa (via GitHub)" <gi...@apache.org>.
JunchengMa commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1416599385

   +1 on @dramaticlly's comment. Changing the write distribution mode affects Spark job performance (it causes a heavy shuffle) when using Spark SQL statements like
   ```
   DELETE FROM db_name.tbl_name WHERE date < '20220801'
   ```
   or
   ```
   UPDATE db_name.tbl_name SET col_a = NULL WHERE date <= '20220801'
   ```
   Setting `write.delete.distribution-mode` and `write.update.distribution-mode` to `none` in the table properties would reduce the shuffle, but it could affect other jobs writing to the same table.
   So having an option to specify the write distribution mode per job would be ideal.
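   
   One way to approximate a per-job override using only the table properties mentioned above is to set the property just before the maintenance statement and remove it afterwards. This is only a sketch: the sequence is not atomic, and any other job that writes to the table in between picks up the same setting, which is exactly the drawback described above.
   
   ```sql
   -- Temporarily skip the shuffle for deletes on this table.
   ALTER TABLE db_name.tbl_name SET TBLPROPERTIES ('write.delete.distribution-mode'='none');
   
   DELETE FROM db_name.tbl_name WHERE date < '20220801';
   
   -- Remove the override so later deletes fall back to the default mode.
   ALTER TABLE db_name.tbl_name UNSET TBLPROPERTIES ('write.delete.distribution-mode');
   ```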




[GitHub] [iceberg] aokolnychyi commented on issue #6679: Change Default Write Distribution Mode

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1423000390

   I will submit a PR to change the default distribution modes for insert and merge. I'll also be happy to review a PR for #6741.




[GitHub] [iceberg] dramaticlly commented on issue #6679: Change Default Write Distribution Mode

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1407234861

   > Side note: I also see that `write.merge.distribution-mode` and `write.update.distribution-mode` are missing from the table properties in the docs as well: https://iceberg.apache.org/docs/latest/configuration/
   
   Yeah, I noticed that before and attempted to fix it in https://github.com/apache/iceberg/pull/5280, but I need some help with the `merge` case to provide a better narrative.




[GitHub] [iceberg] singhpk234 commented on issue #6679: Change Default Write Distribution Mode

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on issue #6679:
URL: https://github.com/apache/iceberg/issues/6679#issuecomment-1407201274

   +1 on changing the default from none and having a dedicated doc section for configuring these. Happy to contribute to this if possible.
   
   --- 
   
   Side note: I also see that `write.merge.distribution-mode` and `write.update.distribution-mode` are missing from the table properties in the docs as well: https://iceberg.apache.org/docs/latest/configuration/

