Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/08 00:58:24 UTC

[GitHub] [iceberg] dmgcodevil opened a new issue #2793: Does 'expireSnapshots' also remove data files ?

dmgcodevil opened a new issue #2793:
URL: https://github.com/apache/iceberg/issues/2793


   ```
   Snapshots accumulate until they are expired by the expireSnapshots operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small.
   ```
   
   So, I executed `expireSnapshots`. Did it also automatically mark my data files as deleted?
   After `expireSnapshots` I executed `removeOrphanFiles`, and some folders in S3 became empty.
   I thought that `expireSnapshots` only removes *.metadata.json and the corresponding `manifest-list` files, but not the actual data.
   
   Please help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876381171


   You can never expire the current snapshot, so there is never a way that expire snapshots can remove files that are needed to read the current state of the table. It can only remove your ability to query previous versions of the tables.
   
   If a file was left behind, either expire snapshots hit an error during deletes or that file was never part of the table. Your example could not happen as described.
   
   RemoveOrphanFiles works similarly but doesn't change snapshots at all, so it would not be able to remove the current snapshot either. If it detected that the table owned no files, it would delete everything. If you are now receiving query errors, it is due to the string matching error I discussed previously, where the paths stored in metadata do not string match the actual fs.list paths. If you don't see any errors when querying the table now, then those files were never part of the table.
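
The string-matching problem described in this thread can be illustrated with a small, self-contained Java sketch. This is not Iceberg code: `PathMatchModel`, `orphansByString`, `pathOnly`, and the example locations are hypothetical names chosen for illustration. It only shows how a path stored without an authority and a listed path with one can point at the same file and still fail an exact string comparison.

```java
import java.net.URI;
import java.util.Set;
import java.util.TreeSet;

final class PathMatchModel {
    /** Treats every listed path that is not string-equal to a referenced path as an orphan. */
    static Set<String> orphansByString(Set<String> listed, Set<String> referenced) {
        Set<String> orphans = new TreeSet<>(listed);
        orphans.removeAll(referenced);
        return orphans;
    }

    /** Hypothetical normalization for illustration: keep only the path component. */
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        // Path as stored in metadata (assumed: no scheme or authority).
        String stored = "/warehouse/tbl/data/B.parquet";
        // Path as returned by a directory listing (assumed: with scheme and authority).
        String listed = "hdfs://namenode:8020/warehouse/tbl/data/B.parquet";

        // Exact string comparison flags the live file as an orphan.
        System.out.println(orphansByString(Set.of(listed), Set.of(stored)));
        // Comparing only the path components shows both name the same file.
        System.out.println(pathOnly(stored).equals(pathOnly(listed)));
    }
}
```

In this model the first line prints the listed location as an orphan even though the table still needs it, which is the failure mode a dry run is meant to catch before anything is deleted.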


[GitHub] [iceberg] dmgcodevil commented on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876100246


   That's exactly what I did. First I combined small files using `RewriteDataFilesAction`,
   then I executed `ExpireSnapshotsAction` using a timestamp of a snapshot before the one created by `RewriteDataFilesAction` (i.e. the latest).
   I noticed that only a small number of files got removed.
   
   Then I found the following statement in the docs:
   
   ```
   in some cases normal snapshot expiration may not be able to determine a file is no longer needed and delete it.
   ```
   
   And decided to run `RemoveOrphanFilesAction`. Unfortunately, it deleted a lot of files that weren't combined.
   
   Also:
   `It will delete a superset of the files deleted by expire snapshot.` could you please explain ?


[GitHub] [iceberg] dmgcodevil commented on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876831212


   @RussellSpitzer thanks for the help and detailed explanation.


[GitHub] [iceberg] RussellSpitzer commented on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876111064


   ExpireSnapshotActions will always only remove files that were previously part of the Iceberg Table.
   RemoveOrphanFiles will remove those files and any other files that are not explicitly part of the table.
   
   Let's take an example
   
   I have a Directory /myTable/
   
   I have performed several commits:
   
   ```
   1. Add File A, B. - Table is A, B
   2. Remove File A, Add File C. - Table is B, C
   3. Remove File C, Add File D - Table is B, D
   ```
   
   But let's say we had a bit of an error and we also wrote BrokenFile in a run that got canceled; it was never added to the table, but the file was created.
   
   The directory now has five files from Iceberg (and metadata files)
   
   ```
   A, B , C, D and BrokenFile
   ```
   
   If I run expire snapshots and remove snapshot 1 then it checks for all files that were referred to by 
   snapshot 1 (A, B)  that were not referred to by Snapshot 2 (B, C) or Snapshot 3 (B, D). 
   This is a single file, 
   
   ```
   (A, B) but not (B, C, D) = (A). 
   ```
   
   So only File A should be deleted (along with the metadata for Snapshot 1).
   
   If I run expire snapshots and remove snapshot 2 and 1 we do a similar thing.
   
   ```
   (A, B, C) but not (B, D) = (A, C)
   ```
   
   So in this case expire snapshots removes two data files, A and C.
   
   Neither of these operations was able to remove "BrokenFile": it was never listed in a snapshot, so it can never be picked up by this operation. Now let's say that we expired Snapshots 2 and 1, but the delete operation failed, so the files were never removed.
   
   Now the table just looks like
   
   ```
   3. Remove C, Add D - Table is B, D
   ```
   
   But our directory still has A, B, C, D and Broken File
   
   Remove Orphan Files can clean this up for us because it does not use the snapshots which are removed to determine which files to remove. Instead it lists all the files in the table location (A, B, C, D, BrokenFile) and
   then deletes all files which are not referenced by the table. Currently the table only has 1 snapshot, snapshot 3 (B,D).
   ```
   (A, B, C, D, Broken File) but not (B, D) = (A, C, BrokenFile)
   ```
   So RemoveOrphanFiles will remove 3 files: A, C and BrokenFile.
   
   
   So this is why I consider Remove Orphan Files to be a superset of what ExpireSnapshots removes. I guess it is more correct to say RemoveOrphanFiles will remove all files that should have been removed after ExpireSnapshots, even if ExpireSnapshots fails to delete those files for some reason.
   
   Expire snapshots removes the history and then all the files which were only reachable by that history. Remove OrphanFiles looks at all of the current history and compares it to a raw directory listing. Remove orphan files can clean up a failed ExpireSnapshots, but not the reverse. Remove orphan files is more dangerous because there is a possibility that your table location has files from other projects or the paths have changed in some subtle way that matches on resolution but does not string match. For example if you store paths without authority and change authorities you may have files which are the same, but do not string match correctly.
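
The set arithmetic in the example above can be written out as a short Java sketch. It is a model of the reasoning only, not the Iceberg implementation: `CleanupModel`, `expireSnapshots`, and `removeOrphanFiles` are hypothetical helpers, and each snapshot is represented simply as the set of file names it references.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class CleanupModel {
    /** Files referenced by the expired snapshots but by none of the retained ones. */
    static Set<String> expireSnapshots(List<Set<String>> expired, List<Set<String>> retained) {
        Set<String> candidates = new TreeSet<>();
        expired.forEach(candidates::addAll);
        retained.forEach(candidates::removeAll);
        return candidates;
    }

    /** Files found by listing the table location that no retained snapshot references. */
    static Set<String> removeOrphanFiles(Set<String> listing, List<Set<String>> retained) {
        Set<String> orphans = new TreeSet<>(listing);
        retained.forEach(orphans::removeAll);
        return orphans;
    }

    public static void main(String[] args) {
        Set<String> s1 = Set.of("A", "B");
        Set<String> s2 = Set.of("B", "C");
        Set<String> s3 = Set.of("B", "D");
        Set<String> directory = Set.of("A", "B", "C", "D", "BrokenFile");

        System.out.println(expireSnapshots(List.of(s1), List.of(s2, s3))); // [A]
        System.out.println(expireSnapshots(List.of(s1, s2), List.of(s3))); // [A, C]
        System.out.println(removeOrphanFiles(directory, List.of(s3)));     // [A, BrokenFile, C]
    }
}
```

The two functions differ only in their starting set: expiration starts from what the removed snapshots referenced, while orphan removal starts from a raw directory listing, which is why only the latter can pick up BrokenFile.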
   
   -----
   
   TLDR; 
   
   Expire Snapshots will *never* remove a file that Iceberg will need to read the table with a small caveat for tables which are created as Snapshots of other tables. See GC_ENABLED in table properties.
   
   RemoveOrphanFiles *should never* remove a file that Iceberg will need to read the table, but since it uses string matching to determine which files to remove there is a chance it can remove necessary files which is why a dry-run flag was introduced. It also will remove any other files in the table location even if they were never part of the Iceberg table.


[GitHub] [iceberg] RussellSpitzer edited a comment on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876060678


   The ExpireSnapshots action removes all data files and manifests which are no longer reachable once the expired snapshots have been removed.
   
   RemoveOrphanFiles compares the current reachable set of files with all the files in the table's location and removes any that are not referenced by the Iceberg table.
   
   So expire snapshots should never remove any files unless they were once part of an Iceberg table and have become unreachable. This will never delete other files.
   
   RemoveOrphans will remove any files in the table location, regardless of whether they were once part of the table or just happen to be in that location.
   
   In both cases these actions will physically delete the files they believe are no longer needed.


[GitHub] [iceberg] dmgcodevil edited a comment on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
dmgcodevil edited a comment on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876064807


   @RussellSpitzer what should I run after combining small files:
   
   ```
       Actions.forTable(table).rewriteDataFiles()
         .filter(....)
         .targetSizeInBytes(targetSizeMB * 1024 * 1024)
         .execute()
   ```
   
   RemoveOrphans  ?
   
   
   @RussellSpitzer  I'd like to delete small files that are no longer needed


[GitHub] [iceberg] RussellSpitzer commented on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876096048


   ExpireSnapshots is what you want to run. If you want to combine small files and then remove the old small files physically you would perform a Rewrite followed by an ExpireSnapshots expiring all snapshots except for the most recent one you created.
   
   Remove orphan files is more for when you have failed commits that have left files in the directory. It is much more expensive because it requires doing directory listings and more dangerous since it can delete files that were not previously part of the table. It will delete a superset of the files deleted by expire snapshot.
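
As a rough illustration of "expiring all snapshots except for the most recent one", the following Java sketch models snapshot selection by timestamp. It relies on the statement elsewhere in this thread that the current snapshot can never be expired; `ExpireByTimestampModel` and `snapshotsToExpire` are hypothetical names, and the real action may apply further retention rules that this model ignores.

```java
import java.util.ArrayList;
import java.util.List;

final class ExpireByTimestampModel {
    /**
     * snapshotTimestamps is ordered oldest to newest; the last entry is the current snapshot.
     * Returns the timestamps of the snapshots that would be expired.
     */
    static List<Long> snapshotsToExpire(List<Long> snapshotTimestamps, long olderThan) {
        List<Long> expired = new ArrayList<>();
        // Stop before the last entry: the current snapshot is always retained.
        for (int i = 0; i < snapshotTimestamps.size() - 1; i++) {
            long ts = snapshotTimestamps.get(i);
            if (ts < olderThan) {
                expired.add(ts);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        List<Long> history = List.of(1L, 2L, 3L);
        System.out.println(snapshotsToExpire(history, 3L));             // [1, 2]
        System.out.println(snapshotsToExpire(history, Long.MAX_VALUE)); // [1, 2]
    }
}
```

Under this model, even an accidentally huge cutoff leaves the latest snapshot in place, so the files needed to read the current table state stay referenced.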


[GitHub] [iceberg] dmgcodevil edited a comment on issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
dmgcodevil edited a comment on issue #2793:
URL: https://github.com/apache/iceberg/issues/2793#issuecomment-876127259


   Understood, let's say we have the following snapshots:
   
   snapshot_1 (ts=1) contains files A,B
   snapshot_2 (ts=2) contains files C,D
   
   ts - timestamp
   
   If I expire snapshot_1, would I still be able to query data from files A and B? Based on your explanation, I should, because snapshot_2's manifest list includes A and B. Thus only snapshot_1's metadata can be removed (.metadata.json, snap-*.avro), but not the data files A and B.
   
   What will happen if I expire snapshots with a timestamp less than 3? Will Expire Snapshots delete A, B, C, and D?
   
   I.e., if I've made a mistake and somehow specified a very large timestamp, will it expire all my snapshots and potentially delete all data files? I think that `RemoveOrphanFiles` will definitely delete files.
   
   Let me explain my case and the outcome. 
   
   I had a table like the one below:
   
   
   snapshot_1 A, B (2021-07-05)
   snapshot_2 C, D (2021-07-06)
   
   table: A,B,C,D
   
   my data is partitioned by day
   
   2021-07-05 contains: A,B,
   2021-07-06 contains: C,D
   I wanted to combine files from 2021-07-05
   
   ```scala
   Actions.forTable(table).rewriteDataFiles()
         .filter(Expressions.greaterThanOrEqual(field, startDate * 1000))
         .filter(Expressions.lessThan(field, endDate * 1000))
         .targetSizeInBytes(targetSizeMB * 1024 * 1024)
         .execute()
   ```
   
   snapshot_1 (ts=1) A, B 
   snapshot_2 (ts=2) C, D  
   snapshot_3 (ts=3) F - added , A-deleted, B-deleted
   
   ts - timestamp
   
   table: C,D,F
   
   2021-07-05 contains: A,B,F
   2021-07-06 contains: C,D
   
   I executed Expire Snapshots where ts < 3
   
   After this operation, I noticed that some files got deleted from the `metadata` folder, but A and B were still in the data folder: 2021-07-05
   
   Then I executed `RemoveOrphanFiles` and noticed that a lot of files (90%) were removed from the metadata folder, and some files got deleted from `2021-07-06` and other days (which I didn't expect). I have about 4 months of data, and I noticed files were deleted from different days, months, etc.
   
   the list looks like this:
   
   ```
   2020-11-17
   2020-11-18
   2020-11-19
   2020-11-20
   2020-11-21
   2020-11-22
   2020-11-23
   2020-11-24
   2020-11-25
   2020-11-26
   2020-11-27
   2020-11-28
   2020-11-29
   2020-11-30
   2020-12-01
   2020-12-02
   2020-12-03
   2020-12-04
   2020-12-05
   2020-12-06
   2020-12-07
   2020-12-08
   2020-12-09
   2020-12-10
   2020-12-11
   2020-12-12
   2020-12-13
   2020-12-14
   2020-12-15
   2020-12-16
   2020-12-17
   2020-12-18
   2020-12-19
   2020-12-20
   2020-12-21
   2020-12-22
   2020-12-23
   2020-12-24
   2020-12-25
   2020-12-26
   2020-12-27
   2020-12-28
   2020-12-29
   2020-12-30
   2020-12-31
   2021-01-15
   2021-01-16
   2021-01-17
   2021-01-18
   2021-01-19
   2021-01-20
   2021-01-21
   2021-01-22
   2021-01-23
   2021-01-24
   2021-01-25
   2021-01-26
   2021-01-27
   2021-01-28
   2021-01-29
   2021-01-30
   2021-01-31
   2021-02-01
   2021-03-23
   2021-03-24
   2021-03-25
   2021-03-31
   2021-04-24
   2021-04-28
   2021-04-29
   2021-05-05
   2021-05-07
   2021-06-02
   ```
   
   
   So, if I accidentally expired all snapshots, then I don't understand why `RemoveOrphanFiles` did not delete all the files.
   Maybe those files were never in the table, because I know that the Spark job was failing periodically.
   
   
   
   


[GitHub] [iceberg] dmgcodevil closed issue #2793: Does 'expireSnapshots' also remove data files ?

Posted by GitBox <gi...@apache.org>.
dmgcodevil closed issue #2793:
URL: https://github.com/apache/iceberg/issues/2793


   

