Posted to issues@iceberg.apache.org by "JonasJ-ap (via GitHub)" <gi...@apache.org> on 2023/02/10 04:16:40 UTC

[GitHub] [iceberg] JonasJ-ap commented on issue #6781: Fix migration of Delta table that has performed VACUUM

JonasJ-ap commented on issue #6781:
URL: https://github.com/apache/iceberg/issues/6781#issuecomment-1425153055

   Some context and my thoughts here:
   
   Reference: Delta Lake's [doc](https://docs.delta.io/latest/delta-utility.html):
   1. `VACUUM` deletes only data files, not log files
   2. `VACUUM` is not triggered automatically; it can only be called manually
   
   Point `1` causes an `IOException` during migration when the data files of a snapshot that can still be reconstructed from the log have already been cleaned up.
   Point `2` means the time of the operation is not tracked since, based on my understanding, Delta Lake does not record `VACUUM` operations in its log.
   There are two ways to configure which files `VACUUM` may delete:
   1. table property: `delta.deletedFileRetentionDuration`, which defaults to 7 days
   2. manually specifying the retention period: `VACUUM ... RETAIN <num> HOURS`
   
   Based on these properties of `VACUUM`, it seems that whoever calls `VACUUM` has to keep track of the earliest version that can still be time traveled to after each execution. This leads to my first proposed solution: a new feature that is worth adding to the conversion logic. Currently, the conversion logic starts migrating from the earliest possible log version. We can add a property that lets users set the start version, so that they can skip the snapshots whose data files have been deleted.
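
To make the first proposal concrete, here is a minimal sketch of how a user-supplied start version could bound the range of log versions to convert. The class and method names (`StartVersionSketch`, `versionsToMigrate`) are hypothetical and are not part of the Iceberg or Delta Lake API. The sketch assumes that the earliest and latest available log versions are already known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper for illustration only; not an existing Iceberg or Delta Lake class.
final class StartVersionSketch {
  private StartVersionSketch() {}

  /**
   * Returns the Delta log versions to migrate, in ascending order.
   * Without a user-supplied start version, this falls back to the earliest
   * available log version, which mirrors the current behavior described above.
   */
  static List<Long> versionsToMigrate(
      long earliestVersion, long latestVersion, OptionalLong userStartVersion) {
    long start = userStartVersion.orElse(earliestVersion);
    if (start < earliestVersion || start > latestVersion) {
      throw new IllegalArgumentException(
          "Start version " + start + " is outside ["
              + earliestVersion + ", " + latestVersion + "]");
    }

    List<Long> versions = new ArrayList<>();
    for (long version = start; version <= latestVersion; version++) {
      versions.add(version);
    }
    return versions;
  }
}
```

With this shape, a user who knows that `VACUUM` removed the data files of versions 0 and 1 would pass `OptionalLong.of(2)` and the conversion would never touch the earlier snapshots.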
   
   The second solution is to catch the `IOException` when trying to build the `DataFile` and skip the whole snapshot if any Parquet file cannot be found. Specifically, we should only catch the exception while no version has been migrated yet. If some earlier snapshot has already been migrated successfully, then the `IOException` must be caused by something else and we should not skip the version, since Delta logs are consecutive. My concern is that this feature may cause an inconsistency between the user's setting and the actual result: e.g. users may set the starting point to version A, but the actual starting point is moved to version B, and they will not easily notice. My thought is to add the actual starting version to the `Action.Result` report, or to add another property with a name like `autoDetectStartingPoint`, so that we still throw the exception as usual unless users set it to `true`.
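
The skip rule in the second proposal can be sketched roughly as follows. This is only an illustration under assumptions: `VersionConverter`, `SkipVacuumedVersionsSketch`, and `migrate` are hypothetical names, and the per-version conversion is reduced to a callback that throws `IOException` when a data file is missing. The returned value stands in for the actual starting version that could be surfaced in the `Action.Result` report.

```java
import java.io.IOException;
import java.util.OptionalLong;

// Hypothetical callback: converts one Delta log version, throwing if a data file is missing.
interface VersionConverter {
  void convert(long version) throws IOException;
}

// Hypothetical sketch for illustration only; not an existing Iceberg class.
final class SkipVacuumedVersionsSketch {
  private SkipVacuumedVersionsSketch() {}

  /**
   * Migrates versions startVersion..latestVersion in order and returns the first
   * version that was actually migrated (empty if none was).
   */
  static OptionalLong migrate(
      long startVersion,
      long latestVersion,
      boolean autoDetectStartingPoint,
      VersionConverter converter)
      throws IOException {
    OptionalLong actualStart = OptionalLong.empty();
    for (long version = startVersion; version <= latestVersion; version++) {
      try {
        converter.convert(version);
        if (!actualStart.isPresent()) {
          actualStart = OptionalLong.of(version);
        }
      } catch (IOException e) {
        // Skip only if the user opted in and nothing has been migrated yet.
        // Once a version has succeeded, a later failure must have another cause,
        // because Delta log versions are consecutive.
        if (!autoDetectStartingPoint || actualStart.isPresent()) {
          throw e;
        }
      }
    }
    return actualStart;
  }
}
```

Returning the actual starting version lets the caller compare it with the requested one and report any difference, which addresses the concern about users not noticing that the starting point moved.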
   
   In fact, we can do both. The first proposal is a good feature to add anyway, and the second makes the conversion logic more robust.
   
   I want to receive some feedback on these proposals before I start to implement them. If you have some comments or a better solution, please let me know. Thank you in advance for your help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org