Posted to issues@iceberg.apache.org by "cgpoh (via GitHub)" <gi...@apache.org> on 2023/04/20 11:12:26 UTC

[GitHub] [iceberg] cgpoh opened a new issue, #7383: After running ExpireSnapshots, metadata json and avro files still not deleted

cgpoh opened a new issue, #7383:
URL: https://github.com/apache/iceberg/issues/7383

   ### Query engine
   
   _No response_
   
   ### Question
   
   I'm using a Spring Boot app to run the expire snapshots action:
   
   `table.expireSnapshots().expireOlderThan(old).cleanExpiredFiles(true).commit()`
   
   After the job runs successfully, I can still see the expired metadata and Avro files in storage, as shown below:
   
   [2023-04-19 11:07:44 +08]  30MiB 35864-09058134-56ca-4478-9d9c-8bd80ec0bda9.metadata.json
   [2023-04-19 11:09:13 +08]  30MiB 35865-484e2177-8d8e-4f9e-abfc-dccb1838e2c8.metadata.json
   [2023-04-19 11:10:43 +08]  30MiB 35866-a0830f21-7644-4689-a469-480494b81587.metadata.json
   [2023-04-19 11:12:10 +08]  30MiB 35867-6b47be94-2147-46ea-a095-3c618e5fc4ea.metadata.json
   [2023-04-19 11:13:42 +08]  30MiB 35868-07d7cebd-90e6-49eb-a7d6-04fd6d4d32bc.metadata.json
   [2023-04-19 11:15:11 +08]  30MiB 35869-e3d933ed-42be-4249-a6f8-58a64cfc442f.metadata.json
   [2023-04-19 11:16:42 +08]  30MiB 35870-c8d8daf4-ed33-4720-b14d-c72bcc0e9e5f.metadata.json
   [2023-04-19 11:18:11 +08]  30MiB 35871-5aa288a9-64b3-411d-8386-ed595365d20c.metadata.json
   [2023-04-19 11:19:42 +08]  30MiB 35872-304f5200-db21-4ae4-a442-ab562165c320.metadata.json
   [2023-04-19 11:21:12 +08]  30MiB 35873-0864eb66-9590-4c0f-9727-323a302b8061.metadata.json
   [2023-04-19 11:22:44 +08]  30MiB 35874-54db0d17-a0d0-43c4-818f-8c56e0b6fc4e.metadata.json
   [2023-04-19 11:24:13 +08]  30MiB 35875-921c4c53-0e1e-4250-8d74-4cf33a145260.metadata.json
   [2023-04-19 11:25:41 +08]  30MiB 35876-4e3d5d0f-bc35-4a46-a63e-ac0d80077d31.metadata.json
   [2023-04-19 11:27:12 +08]  30MiB 35877-b776dc69-d598-4a0f-9e33-1be34e2d9374.metadata.json
   [2023-04-19 11:28:41 +08]  30MiB 35878-b31e97d1-efb2-4496-a633-9f409ec5691c.metadata.json
   [2023-04-19 11:30:12 +08]  30MiB 35879-f40e2145-808b-4919-86cc-697157c118d2.metadata.json
   [2023-04-19 11:31:42 +08]  30MiB 35880-31bcb55b-6c46-4f8c-a291-d0338de9a962.metadata.json
   [2023-04-19 11:33:10 +08]  30MiB 35881-c0008820-97ac-45aa-ba2d-e8d0c869c3e3.metadata.json
   [2023-04-19 11:34:41 +08]  30MiB 35882-740d2406-504b-4bb8-beba-3dc706fb9f9c.metadata.json
   [2023-04-20 18:52:36 +08] 971KiB 37136-695469c1-d89b-4abf-ada2-abd818274fae.metadata.json
   [2023-04-20 18:54:06 +08] 972KiB 37137-72f97f42-3832-4304-8330-f78598ead5a4.metadata.json
   [2023-04-20 18:55:36 +08] 973KiB 37138-eeb9392a-b7d1-44aa-b524-8282c5765d3f.metadata.json
   [2023-04-20 18:57:06 +08] 974KiB 37139-6a906d2a-6766-4dca-a67d-d6dc17ab0458.metadata.json
   [2023-04-20 18:58:36 +08] 975KiB 37140-73fe07a9-0434-490a-9fd8-a3d65d4b3b9a.metadata.json
   [2023-04-20 19:00:06 +08] 976KiB 37141-05283c49-e7fa-4a8d-a7be-506c5de7cbd9.metadata.json
   
   I can see that the metadata JSON files are updated (the file size is reduced) to reflect the latest snapshots, but why are the expired metadata and Avro files (2023-04-19 and older) not being deleted?
   
   I applied the following properties:
   
   `write.metadata.delete-after-commit.enabled=true`
   `write.metadata.previous-versions-max=5`
   
   to my table on 2023-04-19 12:00:00. Is this why the expired files are not deleted?
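
One possible reading of the listing above, sketched as a toy model rather than Iceberg's actual code: assume a metadata file is deleted only at the moment it drops out of the list of previous versions tracked by the current metadata file, and only if `write.metadata.delete-after-commit.enabled` is true at that commit. Under that assumption, files that fell out of the tracked list before the properties were applied would stay in storage. The class `MetadataLogModel`, the file names `v1` to `v9`, and the caps 3 and 2 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of metadata.json retention; an assumption-based sketch, not Iceberg code. */
final class MetadataLogModel {
    final List<String> storage = new ArrayList<>(); // every metadata file still in storage
    final List<String> log = new ArrayList<>();     // previous versions tracked by the current file
    String current;                                 // the current metadata file

    /** Commits a new metadata file and trims the tracked log to previousVersionsMax entries. */
    void commit(String newFile, int previousVersionsMax, boolean deleteAfterCommit) {
        if (current != null) {
            log.add(current);
        }
        current = newFile;
        storage.add(newFile);
        while (log.size() > previousVersionsMax) {
            String dropped = log.remove(0);
            if (deleteAfterCommit) {
                storage.remove(dropped); // deleted only while it is still tracked
            }
            // Otherwise the file stays in storage but is no longer tracked.
        }
    }

    /** Six commits with cleanup disabled, then three commits with cleanup enabled. */
    static MetadataLogModel simulate() {
        MetadataLogModel model = new MetadataLogModel();
        for (int v = 1; v <= 6; v++) {
            model.commit("v" + v, 3, false);
        }
        for (int v = 7; v <= 9; v++) {
            model.commit("v" + v, 2, true);
        }
        return model;
    }

    public static void main(String[] args) {
        System.out.println("files left in storage: " + simulate().storage);
    }
}
```

In this model the two oldest files survive because they were already untracked when cleanup was switched on, which resembles the gap between the 2023-04-19 and 2023-04-20 files in the listing.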


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cgpoh commented on issue #7383: After running ExpireSnapshots, metadata json and avro files still not deleted

Posted by "cgpoh (via GitHub)" <gi...@apache.org>.
cgpoh commented on issue #7383:
URL: https://github.com/apache/iceberg/issues/7383#issuecomment-1516526142

   > Manifest files are removed when there are no longer any snapshots referring to them, not when they are too old. For example, a manifest file might be 10 days old, but the current Snapshot may still refer to that file.
   > 
   > The JSON files are different: they are not actually tracked or removed by expire snapshots, at least if I remember correctly. They could be removed by remove orphan files though, which would remove all metadata.json files not listed in the current metadata.json.
   
   Thanks for the reply! Can you help me understand: every update to the table creates a new snapshot. Let’s say I’m committing to the table every 2 minutes. To keep the number of manifest files small, should I expire snapshots older than the current snapshot time minus 4 minutes?
   
   Another question: the delete orphan files action can only be run with Spark, correct? I can’t find any Flink or Table API that provides it.
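
A small sketch of how the cutoff in the first question could be computed. `ExpiryCutoff` and `cutoffMillis` are hypothetical names; the only Iceberg call referenced is `expireOlderThan` from the original issue, which takes a timestamp in epoch milliseconds. That call is left as a comment so the snippet runs without a table.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for computing a snapshot expiration cutoff. */
final class ExpiryCutoff {
    /** Returns the epoch-millisecond timestamp that lies the given retention before now. */
    static long cutoffMillis(Instant now, Duration retention) {
        return now.minus(retention).toEpochMilli();
    }

    public static void main(String[] args) {
        long cutoff = cutoffMillis(Instant.now(), Duration.ofMinutes(4));
        // With a real org.apache.iceberg.Table, the value would go into the call from the issue:
        // table.expireSnapshots().expireOlderThan(cutoff).cleanExpiredFiles(true).commit();
        System.out.println("expire snapshots older than epoch ms " + cutoff);
    }
}
```

Whether a 4-minute window is sensible depends on how long readers and in-flight jobs still need older snapshots, which this sketch does not address.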


[GitHub] [iceberg] RussellSpitzer commented on issue #7383: After running ExpireSnapshots, metadata json and avro files still not deleted

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #7383:
URL: https://github.com/apache/iceberg/issues/7383#issuecomment-1516418538

   Manifest files are removed when there are no longer any snapshots referring to them, not when they are too old. For example, a manifest file might be 10 days old, but the current Snapshot may still refer to that file.
   
   The JSON files are different: they are not actually tracked or removed by expire snapshots, at least if I remember correctly. They could be removed by remove orphan files though, which would remove all metadata.json files not listed in the current metadata.json.
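
The rule in the first paragraph can be illustrated with a toy model (not Iceberg code): a manifest becomes removable only when no retained snapshot references it, regardless of its age. `ManifestReachability`, the snapshot ids, and the manifest names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of reference-based manifest cleanup; not Iceberg code. */
final class ManifestReachability {
    /**
     * @param manifestsBySnapshot snapshot id mapped to the manifests that snapshot references
     * @param retained            ids of the snapshots that survive expiration
     * @return manifests that no retained snapshot references, in sorted order
     */
    static Set<String> unreferenced(Map<Long, List<String>> manifestsBySnapshot, Set<Long> retained) {
        Set<String> live = new HashSet<>();
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<Long, List<String>> entry : manifestsBySnapshot.entrySet()) {
            candidates.addAll(entry.getValue());
            if (retained.contains(entry.getKey())) {
                live.addAll(entry.getValue());
            }
        }
        candidates.removeAll(live); // anything still referenced must be kept
        return candidates;
    }

    public static void main(String[] args) {
        Map<Long, List<String>> table = Map.of(
            1L, List.of("m1"),
            2L, List.of("m1", "m2"),
            3L, List.of("m2", "m3"));
        // Only snapshot 3 is retained; m2 predates it but is still referenced by it.
        System.out.println("removable manifests: " + unreferenced(table, Set.of(3L)));
    }
}
```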


[GitHub] [iceberg] RussellSpitzer commented on issue #7383: After running ExpireSnapshots, metadata json and avro files still not deleted

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #7383:
URL: https://github.com/apache/iceberg/issues/7383#issuecomment-1523583052

   Basically every "fastAppend" will create at least 1 new manifest file. Spark uses this when doing streaming writes.
   
   https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L545
   
   To actually get rid of the manifests, you also need to run the rewrite manifests action before doing the expiration. Otherwise you would keep all of them even if you expire the snapshots.
   
   Remove orphan files, I believe, only has a Spark implementation. If you don't see one in other places, it doesn't exist.
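
A toy simulation of the point about fast appends, under simplifying assumptions that are not taken from Iceberg's code: every fast append carries forward all existing manifests and adds exactly one, and a rewrite produces a single merged manifest. `FastAppendModel` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of manifest growth under fast appends; not Iceberg code. */
final class FastAppendModel {
    // Each element is one snapshot's manifest list, oldest snapshot first.
    private final List<List<String>> snapshots = new ArrayList<>();
    private int nextId = 1;

    /** Assumed fast-append behavior: keep every existing manifest and add one new one. */
    void fastAppend() {
        List<String> manifests = new ArrayList<>(latest());
        manifests.add("m" + nextId++);
        snapshots.add(manifests);
    }

    /** Assumed rewrite behavior: the new snapshot references a single merged manifest. */
    void rewriteManifests() {
        snapshots.add(List.of("merged" + nextId++));
    }

    /** Expires all snapshots but the latest; returns how many manifests lost their last reference. */
    int expireAllButLatest() {
        if (snapshots.isEmpty()) {
            return 0;
        }
        List<String> keep = latest();
        Set<String> freed = new HashSet<>();
        for (List<String> older : snapshots.subList(0, snapshots.size() - 1)) {
            freed.addAll(older);
        }
        freed.removeAll(keep); // manifests the latest snapshot still references stay
        snapshots.clear();
        snapshots.add(keep);
        return freed.size();
    }

    private List<String> latest() {
        return snapshots.isEmpty() ? List.of() : snapshots.get(snapshots.size() - 1);
    }

    static int freedWithoutRewrite(int commits) {
        FastAppendModel table = new FastAppendModel();
        for (int i = 0; i < commits; i++) {
            table.fastAppend();
        }
        return table.expireAllButLatest();
    }

    static int freedWithRewrite(int commits) {
        FastAppendModel table = new FastAppendModel();
        for (int i = 0; i < commits; i++) {
            table.fastAppend();
        }
        table.rewriteManifests();
        return table.expireAllButLatest();
    }

    public static void main(String[] args) {
        System.out.println("freed without rewrite: " + freedWithoutRewrite(5));
        System.out.println("freed with rewrite: " + freedWithRewrite(5));
    }
}
```

In this model, five appends followed by expiration free nothing, because the latest snapshot still references every manifest; running the rewrite first frees all five.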


[GitHub] [iceberg] cgpoh commented on issue #7383: After running ExpireSnapshots, metadata json and avro files still not deleted

Posted by "cgpoh (via GitHub)" <gi...@apache.org>.
cgpoh commented on issue #7383:
URL: https://github.com/apache/iceberg/issues/7383#issuecomment-1524134661

   @RussellSpitzer thanks for the reply! Will try out the rewrite manifest action.


[GitHub] [iceberg] cgpoh closed issue #7383: After running ExpireSnapshots, metadata json and avro files still not deleted

Posted by "cgpoh (via GitHub)" <gi...@apache.org>.
cgpoh closed issue #7383: After running ExpireSnapshots, metadata json and avro files still not deleted
URL: https://github.com/apache/iceberg/issues/7383

