Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/18 06:07:35 UTC

[GitHub] [iceberg] aokolnychyi opened a new issue #4159: Define behavior of gc.enabled and location ownership

aokolnychyi opened a new issue #4159:
URL: https://github.com/apache/iceberg/issues/4159


   The question about location ownership and file removal comes up in a lot of discussions. See [here](https://github.com/apache/iceberg/pull/3056#discussion_r804826159) for an example.
   
   Right now, our interpretation of `gc.enabled` is not consistent.
   - `HadoopTables` ignores it while purging files.
   - `CatalogUtil` keeps data files if `gc.enabled` is false while purging, but always removes metadata files.
   - `DeleteReachableFiles` and other actions throw an exception if `gc.enabled` is false.
   
   I think we should make the behavior consistent. My initial thoughts are below.
   
   1. One way to interpret disabled garbage collection is to disallow removal of data and delete files. For instance, it should not be possible to expire snapshots.
   2. We should add a list of location prefixes that are owned by the table to our metadata. Until that is done, we can use `gc.enabled` to prohibit dangerous actions (e.g. with a guard like the sketch below). For example, `DeleteOrphanFiles` should throw an exception if garbage collection is disabled. Once we know which locations are owned by the table, we can reconsider that check.
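   
   A minimal sketch of what a shared guard could look like, assuming the existing `TableProperties`, `PropertyUtil`, and `ValidationException` helpers (the `GcGuard` class itself is hypothetical):
   
   ```java
   import java.util.Map;
   import org.apache.iceberg.TableProperties;
   import org.apache.iceberg.exceptions.ValidationException;
   import org.apache.iceberg.util.PropertyUtil;
   
   // Hypothetical shared guard that every purging code path could call before
   // physically removing data or delete files.
   class GcGuard {
     private GcGuard() {
     }
   
     static void checkGcEnabled(Map<String, String> properties, String operation) {
       boolean gcEnabled = PropertyUtil.propertyAsBoolean(
           properties, TableProperties.GC_ENABLED, TableProperties.GC_ENABLED_DEFAULT);
       ValidationException.check(gcEnabled,
           "Cannot %s: GC is disabled (removing files may corrupt other tables)", operation);
     }
   }
   ```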
   
   Here is a list of places that may physically remove files.
   
   **Expire snapshots**
   
   It should not be possible to expire snapshots if `gc.enabled` is false.
   
   **Delete orphan files**
   
   For now, we should continue to throw an exception if `gc.enabled` is false. Once we know what prefixes are owned by the table, we can allow removal of orphan files in locations that are owned by the table.
   
   **Delete reachable files**
   
    It shouldn't be possible to delete data and delete files if garbage collection is disabled. However, we may consider allowing removal of metadata when `gc.enabled` is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.
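   
   A rough sketch of what the configurable action could look like; the `deleteDataFiles`/`deleteMetadataFiles` options below are hypothetical and do not exist on `DeleteReachableFiles` today (`table` and `metadataFileLocation` are placeholders):
   
   ```java
   // Hypothetical options on the existing DeleteReachableFiles Spark action,
   // shown only to illustrate the proposed configurability.
   DeleteReachableFiles.Result result = SparkActions.get()
       .deleteReachableFiles(metadataFileLocation)  // existing entry point
       .io(table.io())                              // existing option
       .deleteDataFiles(false)                      // hypothetical: keep data and delete files
       .deleteMetadataFiles(true)                   // hypothetical: metadata is always table-owned
       .execute();
   ```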
   
   **Drop and purge tables**
   
   I think it should match the removal of reachable files and be consistent across all APIs. Once we know the locations owned by the table, we may drop those locations too.
   
   cc @jackye1995 @rdblue @pvary @RussellSpitzer @flyrain @szehon-ho @danielcweeks @karuppayya 




[GitHub] [iceberg] aokolnychyi commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1055296007


   Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality?






[GitHub] [iceberg] aokolnychyi commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1045432279


   cc @openinx @stevenzwu 




[GitHub] [iceberg] aokolnychyi commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1048370419


   Sorry, was distracted. Let me catch up.






[GitHub] [iceberg] amogh-jahagirdar edited a comment on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar edited a comment on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1046159290


   For expire snapshots, I think it's possible that we do not delete any manifest or data files, in the case that those files are still referenced by later snapshots. If we don't allow expiring snapshots when `gc.enabled` is false, we would prevent users from performing the metadata update to remove snapshots, which may be limiting.
   
   Should we differentiate between performing the metadata operation to expire snapshots and performing the subsequent deletion of unreachable files due to the expiration? Tbh I can't really think of a solid use case where someone would want this differentiation for expiring snapshots, but just bringing it up while we're having this discussion. Please let me know if I misunderstood something!
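   
   If I remember correctly, the core `ExpireSnapshots` API already exposes something close to this split via `cleanExpiredFiles`; a small sketch, where `table` and `olderThanMillis` are placeholders:
   
   ```java
   // Expire snapshots as a pure metadata operation: drop the snapshot entries
   // but skip the physical deletion of files that became unreachable.
   table.expireSnapshots()
       .expireOlderThan(olderThanMillis)
       .cleanExpiredFiles(false)
       .commit();
   ```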




[GitHub] [iceberg] jackye1995 edited a comment on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
jackye1995 edited a comment on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1046172956


   Thanks for raising this thread! I was also planning to raise the same discussion but had some thoughts to finalize. I have been thinking about this question quite a lot recently, here are my current thoughts:
   
   ## 1. gc.enabled
   
   I think as of today almost all the Iceberg users who tune this parameter interpret `gc.enabled=false` as "disallow removal of data and delete files", as @aokolnychyi suggested, so it only makes sense for us to throw an exception in every place that deletes data of a table if `gc.enabled` is false.
   
   I think there are only 2 cases we need to consider:
   1. remove reachable files, which includes
     1. purge table
     2. expire snapshot
     3. any file deletion based on the Iceberg metadata tree
   2. remove orphan files
   
   For 1, we should follow the definition of `gc.enabled` and throw in all cases.
   For 2, see section 3; I think this is not really a table-related operation.
   
   ## 2. Remove metadata files by config
   
   The current behavior in `CatalogUtil` of keeping data while removing metadata feels quite odd to me. If metadata is removed, maybe the resulting data is still useful and can be reconstructed as a Hive table, but when the object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files. @aokolnychyi you said:
   
   > However, we may consider allowing removal of metadata when gc.enabled is false. One may argue that metadata files are always owned by the table. We should also make our action configurable so that it can delete only data or metadata files.
   
   Could you provide some use cases where this is useful in addition to the recovery as a Hive table?
   
   ## 3. Table location ownership
   
   At first glance, defining a table prefix location is the best way to proceed, but the more I think about it, the more I realize that defining ownership for a set of location prefixes in a table is not really needed for the use cases we want to achieve. Here are the 2 big use cases I have considered so far:
   
   ### remove orphan files
   
   The idea that remove orphan files needs a root location definition seems to be a circular argument. We expose the action `remove_orphan_files(table)` with the assumption that table files are under the same root prefix, which works for Hive-like Iceberg tables. But after all, orphan file removal is not a table operation but a storage operation. Once a file is orphan, it no longer belongs to a table. We remove orphan files to save storage cost, not to make any aspect of the table better. We just remove by table using an assumption that holds for Hive-style tables, and now we try to make the Iceberg spec work with it.
   
   I think the correct way to run remove orphan files is to do it for the entire warehouse. I talked about my idea a bit in https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645203709220099. Most storage services can provide a full listing of files, and it can be delivered much more efficiently than a ListFiles API, e.g. the S3 inventory list. And Iceberg provides an efficient way to query Iceberg file metadata through system tables. That means we can perform an efficient distributed join and find the orphan files of the entire warehouse. I think that's all a data warehouse admin needs, if we provide the feature.
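   
   A sketch of that join in Spark, assuming an inventory export and the existing `all_files` metadata tables (the paths, table names, and the `file_path` column of the inventory are placeholders):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   SparkSession spark = SparkSession.builder().getOrCreate();
   
   // Full listing of the warehouse, e.g. loaded from an S3 Inventory export,
   // assumed to be normalized to full file paths in a `file_path` column.
   Dataset<Row> inventory = spark.read().parquet("s3://inventory-bucket/warehouse-listing/")
       .selectExpr("file_path");
   
   // Data and delete files referenced by Iceberg metadata, one UNION branch per
   // known table (manifests and metadata files would need all_manifests and the
   // metadata log as well).
   Dataset<Row> referenced = spark.sql(
       "SELECT file_path FROM db.tbl_a.all_files UNION ALL "
           + "SELECT file_path FROM db.tbl_b.all_files");
   
   // Files that exist in storage but are not referenced by any known table.
   Dataset<Row> orphanCandidates = inventory.join(
       referenced, inventory.col("file_path").equalTo(referenced.col("file_path")), "left_anti");
   ```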
   
   This is basically the `VACUUM` command without referring to a table. I think that's also what most managed data warehouse products on the market offer for storage cleanup. `VACUUM table` only makes sense for things like snapshot expiration and index cleanup, which do not rely on table root location mutual exclusion. If the warehouse spans multiple storages, you just do it for each one. At least we will start by proposing and providing such a feature for S3; for other storages, I think it's also not hard to provide an implementation once the interface is solidified.
   
   ### table access permission
   
   Another use case of a table root location definition I can think of is table access control. Admins might configure user access based on table locations in storage. However, using file path access control to achieve that is just a storage-specific implementation, which does not necessarily need to be true to support Iceberg table access management. For example, in S3, people can tag all the files a table writes and control access to them using an S3 bucket policy. This allows multiple Iceberg tables to store files under the same bucket, while access control is still intact.
   
   Therefore, from a security perspective, table-root-path-based file ownership should just be one type of ownership mode.
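   
   For illustration, a rough sketch of what the tag-based mode could look like with `S3FileIO`; the `s3.write.tags.*` property name here is from memory, so please check it against `AwsProperties` before relying on it, and the tag key/value are arbitrary examples:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Catalog properties: write a fixed tag on every object S3FileIO creates.
   // An S3 bucket policy can then grant or deny access by tag value,
   // independent of where the objects live under a shared prefix.
   Map<String, String> catalogProperties = new HashMap<>();
   catalogProperties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
   catalogProperties.put("s3.write.tags.iceberg-table", "db.tbl_a");
   ```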
   
   ### The problem of declaring table location ownership
   
   Based on usage reports from customers of the S3 object storage mode, most people still want to store data files in the same shared root bucket to minimize throttling, as long as the 2 issues I describe above can be solved.
   
   If an exclusive root location has to be declared for each table, then the files of the table have to share a certain part of the S3 prefix, and the throttling issue comes back: a table not accessed for a year is guaranteed to sit in a cold S3 partition and get throttled heavily when new traffic comes.
   
   Maybe S3 can solve the issue in the future (which is unlikely, because that's how S3 throttling and billing work no matter how the underlying architecture changes), but my biggest concern is that people will start to build tools that only work for tables with declared root locations (which we already see in Trino), which creates differences in table behavior at the spec level that optimize for a certain storage layout. I don't know what the future of storage is; maybe a new product will come to the market with a completely different access pattern. Currently, Iceberg's minimal storage feature requirements make it very resilient to integrating with any storage; adding the concept of prefix ownership is basically preferring a file-system-like storage.
   
   There are also some areas where I have not finalized my thoughts yet, such as when data is replicated, should the replicated locations, quick-access-layer locations, and archival locations also be declared as owned in Iceberg, and how should we provide tools to continuously track those aspects. This feels to me like too many storage implementation details to handle at the table spec layer.
   
   I think it would be better to delegate those issues to FileIOs, as that is the actual vendor integration layer provided by Iceberg, where people choose which storage to use and what fits their bill, and storage frameworks and products can compete under the same set of rules. At least all the features I described for S3 are planned to be added, and I think there are GCS and HDFS equivalents that people can implement if they feel the need to reach feature parity.
   
   One valuable thing to add to the Iceberg spec is the list (or set?) of all the table locations used. I think that could be used by a specific storage to do whatever is needed based on that information, such as removing all data in all directories. But in general we should be cautious about saying that all the locations are owned by a table.
   




[GitHub] [iceberg] aokolnychyi commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1048440808


   > The current behavior in CatalogUtil of keeping data while removing metadata feels quite odd to me. If metadata is removed maybe the result data is still useful and can be reconstructed as a Hive table, but when object storage mode is enabled, it's basically not possible to track down the file locations, making everything just orphan files.
   
   That's generally true, but there is a valid use case for keeping data files: you may use the SNAPSHOT command in Spark, which creates an Iceberg table pointing to non-Iceberg files. Those referenced data files may belong to other prod Hive tables. We allow creating Iceberg metadata to play with that data, but we can't remove the imported data files; it may corrupt the original tables.
   
   > I think the correct way to run remove orphan files is to do it for the entire warehouse.
   
   I hear your point and I think both use cases are valid. You are right: if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is by getting a list of all files under that prefix and querying the system tables of all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, that you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would also be quite common. I can be convinced otherwise.
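   
   For the per-table flavour, a small sketch using the existing `location` option on the orphan file action (`table` and the prefix below are just placeholders for a table and an owned location):
   
   ```java
   import java.util.concurrent.TimeUnit;
   import org.apache.iceberg.spark.actions.SparkActions;
   
   // Scan only a prefix the table is known to own and delete files that are
   // not referenced by the table metadata and are older than three days.
   SparkActions.get()
       .deleteOrphanFiles(table)
       .location("s3://bucket/warehouse/db/tbl_a/data/")
       .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
       .execute();
   ```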
   
   I'd be interested to hear more from other folks too.






[GitHub] [iceberg] ConeyLiu commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1048551526


   As @aokolnychyi suggested in [3056](https://github.com/apache/iceberg/pull/3056), we can use `DeleteReachableFiles` to purge table data, which could provide much better scalability and performance. However, there are still some drawbacks to consider:
   
   1. Different catalogs have different implementations of drop table. For example, `HadoopCatalog`/`HadoopTables` delete the whole table directory directly and ignore the purge argument. In this case, we could not use `DeleteReachableFiles`.
   2. User-defined catalogs may have customized features, such as sending events/metrics when purging data. With `DeleteReachableFiles` we would bypass those operations.
   
   > I think it should match the removal of reachable files and be consistent in all APIs. Once we know locations owned by the table, we may drop them too.
   
   I think this is necessary. We should unify the built-in catalogs' behavior for drop table [purge], and maybe define an interface to support parallel operations (by leveraging a distributed engine such as Spark or Flink).
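   
   For reference, a small sketch of the distributed purge path mentioned above, assuming access to the table's current metadata file location via `HasTableOperations` (`table` is a placeholder):
   
   ```java
   import org.apache.iceberg.HasTableOperations;
   import org.apache.iceberg.actions.DeleteReachableFiles;
   import org.apache.iceberg.spark.actions.SparkActions;
   
   // Delete everything reachable from the table's last metadata file using a
   // distributed Spark action instead of the catalog's own drop-with-purge path.
   String metadataLocation =
       ((HasTableOperations) table).operations().current().metadataFileLocation();
   
   DeleteReachableFiles.Result result = SparkActions.get()
       .deleteReachableFiles(metadataLocation)
       .io(table.io())
       .execute();
   ```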
   




[GitHub] [iceberg] anuragmantri edited a comment on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
anuragmantri edited a comment on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1081241359


   Thanks for this thread. 
   
   > One valuable thing to add in the Iceberg spec is the list (or set?) of all the table locations used.
   
   This makes sense but we should define what this list means. There are some questions that come to my mind. 
   - Do these locations mean a table can have multiple root locations simultaneously? 
   - How does this work when the object store layout is used, since users may share the same bucket across multiple tables?
   - In case of DR, the table is replicated to a different location. Are all of these replicated locations part of this list? 
   
   Incidentally, most of these questions also came up in the [dev discussion ](https://lists.apache.org/list?dev@iceberg.apache.org:2021-8:Iceberg%20disaster%20recovery%20and%20relative%20path%20sync-up) on relative paths/DR strategy. Single location prefix support will be added as part of `Phase 1`. If you agree, maybe I can decouple location prefix support from relative paths and create a PR just for that change first.
   
   Supporting multiple roots is mentioned in `Phase 2` of [this design doc](https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit#heading=h.last3sm2iqbr). The implementation is yet to be discussed. It may be a good idea to discuss it here.
   
   Coming back to the problem, assuming we have such a list of owned table locations, how would we make sure all the files in these locations are indeed owned by the Iceberg table? In other words, can we safely delete `all` files in these locations?
   
   
   
   
   
   
   
   




[GitHub] [iceberg] szehon-ho commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1059409098


   > I think the correct way to run remove orphan files is to do it for the entire warehouse.
   
   > I hear your point and I think both use cases are valid. You are right, if a single prefix is shared by many tables and object store locations are enabled, the only way to remove orphans is by getting a list of all files under that prefix and querying systems tables for all Iceberg tables. However, I am not sure that's always possible. You have to assume no extra jobs use that location, you know all metastores/catalogs, etc. I'd say having a short per-table prefix or a set of prefixes would be also quite common. I can be convinced otherwise.
   
   This is an interesting idea I also considered at one point: have the option to take the whole listing of an S3-based storage bucket, filter it down to the locations owned by known tables, and run remove orphan files on those locations. If you know ahead of time that your bucket is mostly Iceberg table files, this might be less expensive overall (though it sounds like a single monster job if we are bottlenecked on the speed of physically deleting the files). Another consideration is that a system like S3 Inventory will be stale on the order of hours or days.
   
   Though it still seems to come back to the main question: how do we define the "owned locations" of a table? (Sorry, please let me know if there is already a discussion on this that I missed.) That would be great to have. One thing I struggled with in the past is "alter table location"; if the previous locations could somehow be saved in a growing list, it would allow running more complete orphan removal jobs.












[GitHub] [iceberg] rdblue commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1045428982


   FYI @nastra and @rymurr also.




[GitHub] [iceberg] aokolnychyi commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1045441889


   > If a file is in an owned location, then the table should always be able to delete it
   
   I agree with that, but I'd phrase it slightly differently. I see it more as a consequence. I'd say `gc.enabled` should control whether we can delete data and delete files (i.e. expire snapshots). Metadata files are considered always owned and can be deleted during table purge. A list of owned locations indicates which locations can be scanned for orphan files and which locations can be deleted during table purge or removal of reachable files.
   
   That being said, if `gc.enabled` is true and a data file is in a shared location not owned by the table, I think we should still allow its removal.




[GitHub] [iceberg] rdblue commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1045431113


   I agree with the approach of defining gc.enabled behavior more clearly and adding owned locations. If a file is in an owned location, then the table should always be able to delete it (though it may not choose to based on catalog policy).




[GitHub] [iceberg] ConeyLiu commented on issue #4159: Define behavior of gc.enabled and location ownership

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on issue #4159:
URL: https://github.com/apache/iceberg/issues/4159#issuecomment-1059889465


   > Well, I'd also love to see consistent behavior. However, path-based tables are quite special: there is no table pointer in the catalog. Any thoughts on what drop without purge may mean for them? Maybe we should only allow drop with purge = true for path-based tables? That way, we will have consistent behavior across catalogs but path-based tables won't support some functionality?
   
   I think drop and purge should mean the same for path-based tables as for other tables. In a traditional database, drop moves the table into the recycle bin and the table can be restored with some command, whereas purge cleans all data and the table cannot be restored. Maybe we should refer to such a design?

