Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/02 04:30:05 UTC

[GitHub] [iceberg] JingsongLi opened a new issue #1159: Avoid rewriting big files in RewriteDataFilesAction

JingsongLi opened a new issue #1159:
URL: https://github.com/apache/iceberg/issues/1159


   I am trying to use `RewriteDataFilesAction` to reduce the number of small files.
   However, every run rewrites all files, even when only a few of them are small. This is very expensive and slow.
   
   So:
   - Could `RewriteDataFilesAction` accept a `Predicate<FileScanTask>`, similar to what `RewriteManifestsAction` offers, as a way to avoid rewriting big files?
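
To illustrate the request, here is a minimal, self-contained sketch in plain Python (not the Iceberg API) of a caller-supplied predicate deciding which files are eligible for a rewrite. `DataFile`, `select_for_rewrite`, and the 32 MB threshold are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a file scan task: just a path and a size."""
    path: str
    size_bytes: int


def select_for_rewrite(files: List[DataFile],
                       predicate: Callable[[DataFile], bool]) -> List[DataFile]:
    """Return only the files the caller's predicate marks as worth rewriting."""
    return [f for f in files if predicate(f)]


files = [
    DataFile("a.parquet", 120 * MB),
    DataFile("b.parquet", 3 * MB),
    DataFile("c.parquet", 1 * MB),
]

# Assumed threshold: anything under 32 MB counts as "small".
small_only = select_for_rewrite(files, lambda f: f.size_bytes < 32 * MB)
print([f.path for f in small_only])
```

With a hook like this, the big file `a.parquet` is never handed to the rewrite, so only the two small files are read and written again.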


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JingsongLi commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
JingsongLi commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-657389247


   > I think you can specify filter expressions to rewrite only the files you want. For example, you can rewrite only the data under a newly added partition rather than all of the data. I'm not sure if that is enough for you.
   
   Thanks for the reply. Yes, using a partition filter so that each partition is rewritten only once is OK.
   I think we can optimize further. If the table is partitioned by "day_time + type", some types have enough records to fill their files, while others have fewer records and end up with small files. So I would rewrite all newly added partitions to reduce the number of files, but skip the big files in some of those partitions.



[GitHub] [iceberg] rdblue commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-664054683


   We do this a bit differently: instead of rewriting everything that matches a filter, we configure the rewrite to primarily look for small files. Any file already near the target size is ignored, and then we bin pack the rest of the files in each partition and rewrite them. That ensures that we don't rewrite large amounts of data that are already reasonably sized. We only care about the small files; files that are too large are not a concern because they can be split.
   
   We also have an option to keep the files in order by file name for systems like Spark that sort the data. This is a hacky way to avoid ruining file pruning that takes advantage of clustered/sorted data.
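
The selection strategy described above can be sketched roughly as follows. This is an illustrative model in plain Python, not the code of the service mentioned in the comment and not the Iceberg API: `plan_rewrite`, the `(partition, name, size)` tuple layout, and the 0.75 "near target" ratio are all assumptions. Files are packed sequentially in file-name order so that each output bin covers a contiguous range of names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MB = 1024 * 1024

# (partition, file name, size in bytes): a hypothetical, simplified file record.
FileInfo = Tuple[str, str, int]


def plan_rewrite(files: List[FileInfo],
                 target_size: int,
                 near_target_ratio: float = 0.75) -> Dict[str, List[List[str]]]:
    """Group small files per partition into bins of at most target_size bytes.

    Files at or above near_target_ratio * target_size are left alone.
    The remaining files are packed in file-name order; a bin is closed as
    soon as the next file would push it past target_size.
    """
    threshold = near_target_ratio * target_size
    by_partition: Dict[str, List[FileInfo]] = defaultdict(list)
    for f in files:
        if f[2] < threshold:
            by_partition[f[0]].append(f)

    plan: Dict[str, List[List[str]]] = {}
    for partition, small in by_partition.items():
        bins: List[List[str]] = []
        current: List[str] = []
        current_size = 0
        for _, name, size in sorted(small, key=lambda f: f[1]):
            if current and current_size + size > target_size:
                bins.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        bins.append(current)
        # Rewriting a bin with a single file gains nothing, so drop those.
        useful = [b for b in bins if len(b) > 1]
        if useful:
            plan[partition] = useful
    return plan


files = [
    ("day=1", "f1", 120 * MB),  # already near the 128 MB target: ignored
    ("day=1", "f2", 10 * MB),
    ("day=1", "f3", 20 * MB),
    ("day=1", "f4", 90 * MB),
    ("day=1", "f5", 30 * MB),   # ends up alone in its bin: not rewritten
    ("day=2", "g1", 5 * MB),    # only small file in its partition: not rewritten
]
print(plan_rewrite(files, target_size=128 * MB))
```

In this example only `f2`, `f3`, and `f4` are rewritten (120 MB in total); the 120 MB file `f1` is never touched.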



[GitHub] [iceberg] aokolnychyi edited a comment on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-664115838


   I am working on a doc that will propose multiple compaction strategies depending on the use case. I want to map [SQL extensions](https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8/edit) into that. We have been running those for a while, and they have proved to work even on large tables and already cover what @rdblue mentioned. I'll share the doc by the end of the week.



[GitHub] [iceberg] davseitsev commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
davseitsev commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-668660110


   @rdblue the logic that you described doesn't work for me. I have a Spark Structured Streaming job that writes to a daily partitioned table and produces files of different sizes, primarily small ones. I ran `RewriteDataFilesAction` a few times on a single partition.
   The first run produced 86 files with sizes up to 128 MB, as I configured. The second run took all of these files and compacted them again, merging them with very small files. The third run again took all files produced by the previous run and compacted them again.
   
   I understand that eventually the compacted files will become so close to `targetSize` that they will be ignored, but until that happens we need to rewrite gigabytes of data again and again. It also doesn't work for me because the compaction process is relevant only for the current day. At the beginning of a new day we run a major compaction with deduplication, sorting, etc. and effectively "close" the partition for the previous day; it will not be modified anymore. Intermediate minor compaction is necessary only to prevent clients from reading thousands of small files.
   
   Applying a filter by row timestamp can improve the situation, but as we have many tables of completely different sizes, we need to choose the right compaction period for each table to get output files of a reasonable size. That is really difficult to manage because table sizes can vary, new tables can be added, etc. We also have late records, which would not be considered for compaction if there are no fresh records in the same file.
   
   It would be really nice to have a separate configuration option to limit file size, or a predicate like @JingsongLi suggested.
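
The cost difference described above can be made concrete with a small simulation. This is a deliberately simplified model, not the behavior of `RewriteDataFilesAction` itself: before each run 100 new 1 MB files arrive, and a compaction pass merges everything it selects into files of at most the target size. The function names, the 64 MB cutoff, and all other numbers are made up for illustration.

```python
from typing import List, Optional, Tuple

MB = 1024 * 1024


def compact(files: List[int], target: int,
            max_input: Optional[int]) -> Tuple[List[int], int]:
    """Run one compaction pass over a list of file sizes.

    When max_input is set, files of max_input bytes or more are left untouched.
    Returns (new file sizes, bytes rewritten in this pass).
    """
    if max_input is None:
        keep, selected = [], list(files)
    else:
        keep = [s for s in files if s >= max_input]
        selected = [s for s in files if s < max_input]
    total = sum(selected)
    full, rest = divmod(total, target)
    merged = [target] * full + ([rest] if rest else [])
    return keep + merged, total


def simulate(runs: int, max_input: Optional[int]) -> int:
    """Total bytes rewritten over several runs of a streaming-style workload."""
    target = 128 * MB
    files: List[int] = []
    rewritten = 0
    for _ in range(runs):
        files += [1 * MB] * 100          # new small files since the last run
        files, cost = compact(files, target, max_input)
        rewritten += cost
    return rewritten


print(simulate(3, None) // MB)       # 600: each run rewrites all existing data
print(simulate(3, 64 * MB) // MB)    # 300: files of 64 MB or more are skipped
```

Without a size limit the work grows with every run (100 MB, then 200 MB, then 300 MB), while with the limit each run rewrites only the 100 MB of newly arrived small files.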
   



[GitHub] [iceberg] rdblue commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-668713957


   What I was saying is that the service we built to do this chooses what to rewrite differently. The behavior of the rewrite action is as you describe: it will rewrite more than is necessary so we should add more configuration to provide more control.



[GitHub] [iceberg] jerryshao commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
jerryshao commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-656544944


   I think you can specify filter expressions to rewrite only the files you want. For example, you can rewrite only the data under a newly added partition rather than all of the data. I'm not sure if that is enough for you.



[GitHub] [iceberg] aokolnychyi commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-668813917


   The doc for data compaction is almost ready. It should cover what @davseitsev experiences.



[GitHub] [iceberg] aokolnychyi commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-664115838


   I am working on a doc that will propose multiple compaction strategies depending on the use case. I want to map [SQL extensions](https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKth1woEo6TN8/edit) into that. We have been running those for a while, and they have proved to work even on large tables and already cover what @rdblue mentioned.



[GitHub] [iceberg] aokolnychyi commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-668933615


   The proposal is [here](https://docs.google.com/document/d/1aXo1VzuXxSuqcTzMLSQdnivMVgtLExgDWUFMvWeXRxc/edit#). I'll keep editing it in the following days.

