Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/02/03 15:17:37 UTC

[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2196: Core: Data loss after compaction #2195

RussellSpitzer edited a comment on pull request #2196:
URL: https://github.com/apache/iceberg/pull/2196#issuecomment-772584764


   My quick notes on this issue:
   
   ```
   Previously, when computing the rewrite tasks for RewriteDataFiles, the code
   would ignore scan tasks which referred to a single file. This is an issue because
   large files could potentially be split into multiple read tasks. If one
   slice of a large file was combined with a slice from another file, that
   section would be rewritten with the other file, but the other slices would be ignored.
   
   For example, given 2 files:
   File A - 100 Bytes
   File B - 10 Bytes
   
   If the target split size was 60 bytes, we would end up with 3 tasks:
   A : 1 - 60
   A : 61 - 100
   B : 1 - 10
   
   Which would be combined into:
   
   (A : 1 - 60)
   (A : 61 - 100, B : 1 - 10)
   
   The first task would be discarded since it only referred to one file. The
   second task would be rewritten, which would end with deleting files A and B,
   even though the first slice of file A was never rewritten, so that data is lost.
   
   I believe the original intent was to ignore single-file scan tasks, as it was assumed these would
   be unchanged files. But if a single-file scan task only contains a partial scan of a file, it must
   be rewritten, since that slice will be written out as a new, smaller file.
   ```
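
The scenario in the notes above can be reproduced with a small, self-contained simulation. This is only a sketch of the behavior as described in the comment, not Iceberg's actual code: `Slice`, `plan_slices`, `combine`, `old_filter`, `fixed_filter`, and `lost_bytes` are hypothetical names, the packing is a simple greedy in-order pass, and offsets are 0-based and half-open rather than the ranges written in the notes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slice:
    """A contiguous byte range of one data file (hypothetical model)."""
    file: str
    start: int      # inclusive byte offset
    end: int        # exclusive byte offset
    file_size: int  # total size of the file this slice came from

    @property
    def length(self) -> int:
        return self.end - self.start

    def is_partial(self) -> bool:
        # True when the slice covers less than the whole file.
        return self.length < self.file_size


def plan_slices(files: Dict[str, int], split_size: int) -> List[Slice]:
    """Split every file into slices of at most split_size bytes."""
    return [
        Slice(name, start, min(start + split_size, size), size)
        for name, size in files.items()
        for start in range(0, size, split_size)
    ]


def combine(slices: List[Slice], target_size: int) -> List[List[Slice]]:
    """Greedy in-order packing: flush the current task when the next slice would overflow it."""
    tasks, current, current_size = [], [], 0
    for s in slices:
        if current and current_size + s.length > target_size:
            tasks.append(current)
            current, current_size = [], 0
        current.append(s)
        current_size += s.length
    if current:
        tasks.append(current)
    return tasks


def old_filter(tasks):
    # Behavior described in the notes: drop every task that refers to a single file slice.
    return [t for t in tasks if len(t) > 1]


def fixed_filter(tasks):
    # Assumed fix: also keep single-slice tasks when that slice is only part of a file.
    return [t for t in tasks if len(t) > 1 or t[0].is_partial()]


def lost_bytes(files: Dict[str, int], kept_tasks) -> Dict[str, int]:
    """Bytes per file that are deleted with the file but never rewritten."""
    rewritten: Dict[str, int] = {}
    for task in kept_tasks:
        for s in task:
            rewritten[s.file] = rewritten.get(s.file, 0) + s.length
    # Every file touched by a kept task is deleted after the rewrite.
    return {f: files[f] - n for f, n in rewritten.items() if files[f] != n}


files = {"A": 100, "B": 10}
tasks = combine(plan_slices(files, split_size=60), target_size=60)
loss_before = lost_bytes(files, old_filter(tasks))
loss_after = lost_bytes(files, fixed_filter(tasks))
print(loss_before, loss_after)
```

With the old filter, the first 60 bytes of file A are never rewritten even though A is deleted, so `loss_before` reports them as lost; with the partial-slice check, `loss_after` is empty.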

