Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/05 22:00:13 UTC

[GitHub] [iceberg] szehon-ho opened a new pull request #2220: Change default write.target-file-size-bytes to 512 MB.

szehon-ho opened a new pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220


   Hello, we are thinking of changing the default "write.target-file-size-bytes" to 512 MB
   
   -- The current default (unbound file sizes) will never take advantage of any predicate push down
   -- This number corresponds well with the Parquet default row-group size (4 row groups/file)
   -- This will have no impact on ORC file (BaseTaskWriter#shouldRollToNewFile() makes an exception for ORC files)
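
A minimal sketch of the arithmetic and the roll-over behavior described in the points above. It assumes a 128 MB Parquet row-group size (inferred from the "4 row groups/file" figure), and `should_roll_to_new_file` is a hypothetical stand-in written for illustration, not the actual `BaseTaskWriter#shouldRollToNewFile()` implementation:

```python
# Illustrative only: names and logic are assumptions, not Iceberg's API.
MB = 1024 * 1024

TARGET_FILE_SIZE_BYTES = 512 * MB    # proposed default for write.target-file-size-bytes
PARQUET_ROW_GROUP_BYTES = 128 * MB   # assumed Parquet row-group size

# How many full row groups fit into one file of the target size.
row_groups_per_file = TARGET_FILE_SIZE_BYTES // PARQUET_ROW_GROUP_BYTES


def should_roll_to_new_file(file_format: str, bytes_written: int,
                            target_bytes: int = TARGET_FILE_SIZE_BYTES) -> bool:
    """Hypothetical roll-over check: close the current file once it reaches the target."""
    if file_format.lower() == "orc":
        # Per the description above, ORC files are exempt from rolling.
        return False
    return bytes_written >= target_bytes
```

With an effectively unbounded target (for example `2**63 - 1`), the check never returns true, so a writer would keep appending to a single file.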
   
   This will change write behavior, so it would be good to see what the community thinks, cc @aokolnychyi
   Thank you


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220#issuecomment-776935238


   I think this is a reasonable change. I'll commit it. We should note this in 0.12.0 release notes.




[GitHub] [iceberg] rdblue commented on pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220#issuecomment-776935583


   Thanks for updating this, @szehon-ho!




[GitHub] [iceberg] aokolnychyi commented on pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220#issuecomment-774389870


   I think we would want to use this property more after we refine our data compaction. Having this value as Long.MaxValue does not make much sense to me if we treat it as the target file size (not the max file size, like it is now). Internally, we have changed the default value quite some time ago. 




[GitHub] [iceberg] rdblue merged pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220


   




[GitHub] [iceberg] szehon-ho commented on pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220#issuecomment-774513998


   Hi @rdblue, thanks, nice to see you too. I hope I can ramp up and contribute more to the Iceberg project; it is really growing.
   
   You are right, I definitely phrased it wrong; "reduce" is what I meant to say. Having this does not affect pruning at the partition or row-group level.




[GitHub] [iceberg] rdblue commented on pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220#issuecomment-774378949


   > The current default (unbound file sizes) will never take advantage of any predicate push down
   
   I'm not sure I understand what you're saying here. Why would this prevent predicate pushdown? Large files with unordered data may have larger and larger ranges, but that happens quickly in even a single row group. To get effective file pruning, you need to cluster data by filter columns. If you're doing that, then I would say that larger files _diminish_ the benefit of pushdown, but don't preclude it. And parallelism concerns typically force people to create small files because writes are faster that way.
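
The point about clustering and file size can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: each "file" is a run of consecutive rows, pruning uses only a per-file min/max on one filter column, and the helper names are made up for the example rather than taken from Iceberg:

```python
import random


def file_ranges(values, rows_per_file):
    """Split values into consecutive 'files'; return (min, max, row_count) per file."""
    chunks = (values[i:i + rows_per_file]
              for i in range(0, len(values), rows_per_file))
    return [(min(chunk), max(chunk), len(chunk)) for chunk in chunks]


def rows_scanned(values, rows_per_file, wanted):
    """Count rows in files whose [min, max] range could contain `wanted`."""
    return sum(n for lo, hi, n in file_ranges(values, rows_per_file)
               if lo <= wanted <= hi)


clustered = list(range(10_000))          # data sorted by the filter column
unordered = clustered[:]
random.Random(0).shuffle(unordered)      # same data, no clustering

small_clustered = rows_scanned(clustered, 100, wanted=4242)
large_clustered = rows_scanned(clustered, 2_500, wanted=4242)
large_unordered = rows_scanned(unordered, 2_500, wanted=4242)
```

On clustered data, the larger files still allow pruning but force more rows to be read for the same predicate (one 2,500-row file instead of one 100-row file). On unordered data, each file's range tends to span nearly the whole domain, so min/max pruning skips little or nothing regardless of file size.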
   
   I'm not sure this is needed, although I don't really have a problem with adding it. I'd also like to hear what Anton thinks.
   
   @szehon-ho, good to see you in this community!




[GitHub] [iceberg] aokolnychyi commented on pull request #2220: Change default write.target-file-size-bytes to 512 MB.

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2220:
URL: https://github.com/apache/iceberg/pull/2220#issuecomment-774390316


   I'd support changing the default value but I admit my interpretation may be different from other folks in the community.

