Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/01/14 13:36:41 UTC

[GitHub] aokolnychyi opened a new issue #77: Include the cost to open a file during split planning

URL: https://github.com/apache/incubator-iceberg/issues/77
 
 
   We need to take into account the cost of opening a file to avoid straggler tasks. As an example, see how `spark.sql.files.openCostInBytes` is handled in Spark.
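   A minimal sketch of the idea (not Iceberg's actual planning code): weight each file by its size plus a configurable open cost before bin-packing files into tasks, similar in spirit to how `spark.sql.files.openCostInBytes` influences Spark's file source. The class name, the constants, and the plain `long[]` representation of file sizes are illustrative assumptions.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Illustrative bin-packing sketch: each file is charged a fixed "open cost" in
   // addition to its size, so many tiny files no longer collapse into a single task.
   public class OpenCostBinPacking {
   
     public static List<List<Long>> pack(long[] fileSizes, long targetSplitSize, long openCostInBytes) {
       List<List<Long>> bins = new ArrayList<>();
       List<Long> current = new ArrayList<>();
       long currentWeight = 0;
   
       for (long size : fileSizes) {
         // The effective weight of a file is its size plus the cost of opening it.
         long weight = size + openCostInBytes;
         if (!current.isEmpty() && currentWeight + weight > targetSplitSize) {
           bins.add(current);           // close the current bin and start a new one
           current = new ArrayList<>();
           currentWeight = 0;
         }
         current.add(size);
         currentWeight += weight;
       }
       if (!current.isEmpty()) {
         bins.add(current);
       }
       return bins;
     }
   }
   ```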
   
   
   Let's consider a case where we have 500 files with 10 records each and 50 files that contain 1,000,000 records each. Today, Iceberg will group all of the small files into one task. Consequently, opening those files will become a bottleneck. Locally, Iceberg was 2x slower than the Spark file source (w/o vectorized execution). The test was executed with 4 executors. Iceberg grouped the files into 2 tasks, and the task with many small files took most of the time. In contrast, Spark grouped the files into 20 bins.
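   For illustration only, here is a rough calculation with the sketch above under assumed settings (128 MB target split size, 4 MB open cost, ~1 KB per tiny file): without an open cost, all 500 tiny files fit into a single bin, but with the open cost each file weighs ~4 MB, so the same files spread across roughly 500 × 4 MB / 128 MB ≈ 16–17 bins.
   
   ```java
   long[] smallFiles = new long[500];
   java.util.Arrays.fill(smallFiles, 1024L); // 500 files of ~1 KB each
   
   List<List<Long>> withoutCost = OpenCostBinPacking.pack(smallFiles, 128L << 20, 0L);
   List<List<Long>> withCost = OpenCostBinPacking.pack(smallFiles, 128L << 20, 4L << 20);
   
   System.out.println(withoutCost.size()); // 1 bin: all tiny files end up in one task
   System.out.println(withCost.size());    // 17 bins with these assumed numbers
   ```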
