Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/03 15:17:37 UTC
[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2196: Core: Data loss after compaction #2195
RussellSpitzer edited a comment on pull request #2196:
URL: https://github.com/apache/iceberg/pull/2196#issuecomment-772584764
My quick notes on this issue:
```
Previously, when computing the rewrite tasks for RewriteDataFiles, the code
would ignore scan tasks which referred to a single file. This is an issue because
large files could potentially be split into multiple read tasks. If one
slice of a large file was combined with a slice from another file, that
section would be rewritten with the other file, but the other slices would be ignored.
For example, given 2 files:
File A - 100 bytes
File B - 10 bytes
If the target split size was 60 bytes, we would end up with 3 tasks:
A : 1 - 60
A : 61 - 100
B : 1 - 10
Which would be combined into:
(A : 1 - 60)
(A : 61 - 100, B : 1 - 10)
The first task would be discarded since it only referred to one file. The
second task would be rewritten, which would end with deleting files A and B.
The data in A : 1 - 60 was never rewritten, so it is lost.
I believe the original intent was to ignore single-file scan tasks, as it was assumed these would
be unchanged files. But if a single-file scan task only contains a partial scan of a file, it must
be rewritten, since it represents a new, smaller file.
```
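The example in the notes above can be reproduced with a small simulation. This is only a sketch under stated assumptions, not the Iceberg implementation: `Slice`, `split`, `combine`, `keep_old`, `keep_fixed`, and `lost_bytes` are hypothetical names invented for illustration, offsets are zero-based, and the packing is a simple greedy pass that happens to produce the same grouping as the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    file: str
    start: int   # zero-based byte offset where the slice begins
    length: int  # number of bytes in the slice


def split(name, size, target):
    """Cut one file into slices of at most `target` bytes."""
    return [Slice(name, off, min(target, size - off))
            for off in range(0, size, target)]


def combine(slices, target):
    """Greedily pack consecutive slices into tasks of at most `target` bytes."""
    tasks, current, used = [], [], 0
    for s in slices:
        if current and used + s.length > target:
            tasks.append(current)
            current, used = [], 0
        current.append(s)
        used += s.length
    if current:
        tasks.append(current)
    return tasks


def keep_old(task, sizes):
    """Old rule (as described above): rewrite only tasks touching more than one file."""
    return len({s.file for s in task}) > 1


def keep_fixed(task, sizes):
    """Assumed fix: skip a task only if it is exactly one whole, unsplit file."""
    if len(task) > 1:
        return True
    return task[0].length != sizes[task[0].file]


def lost_bytes(tasks, sizes):
    """For every file a kept task touches (and so deletes), count bytes never rewritten."""
    rewritten = {}
    for task in tasks:
        for s in task:
            rewritten[s.file] = rewritten.get(s.file, 0) + s.length
    return {f: sizes[f] - n for f, n in rewritten.items() if n != sizes[f]}


sizes = {"A": 100, "B": 10}
slices = [s for name, size in sizes.items() for s in split(name, size, 60)]
tasks = combine(slices, 60)  # [[A 0-60], [A 60-100, B 0-10]]

old = [t for t in tasks if keep_old(t, sizes)]
fixed = [t for t in tasks if keep_fixed(t, sizes)]

print(lost_bytes(old, sizes))    # {'A': 60}
print(lost_bytes(fixed, sizes))  # {}
```

Under the old rule, the 60 bytes of the first slice of A are never rewritten even though A is deleted. Under the assumed fix, a partial single-file task is kept, while a task covering one whole file is still skipped as unchanged.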