Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/13 19:02:07 UTC

[GitHub] [iceberg] rdblue commented on pull request #3056: Support purge for Spark 3.2

rdblue commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038350548


   > 100% agree. However, we have now opened the door by adding `registerTable` to the catalog, which maps to the external table concept perfectly. I have already received a few feature requests from people asking for this to map to `CREATE EXTERNAL TABLE`. People can now register an Iceberg table with an old metadata file location and write to it, creating two diverged metadata histories of the same table. This is a very dangerous action because two Iceberg tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table. There should be nothing that suggests registration has anything to do with `EXTERNAL` and no reason for people to think that tables that are added to a catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from the API.
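
The hazard raised in the quoted comment can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyCatalog`, its methods, and the file names are hypothetical and are not Iceberg's actual `Catalog` API. A snapshot is modeled as a frozen set of file paths, and every table writes to the same shared storage.

```python
class ToyCatalog:
    """Hypothetical in-memory catalog; not Iceberg's actual Catalog API."""

    def __init__(self, storage):
        self.storage = storage   # set of file paths, shared by every table
        self.tables = {}         # name -> list of snapshots (frozensets of paths)

    def register_table(self, name, history):
        # After registration the catalog treats the table like any other.
        self.tables[name] = list(history)

    def append(self, name, path):
        # Write a new file and commit a snapshot that includes it.
        self.storage.add(path)
        self.tables[name].append(self.tables[name][-1] | {path})

    def delete_file(self, name, path):
        # The table assumes it owns every file it references.
        self.storage.discard(path)
        self.tables[name].append(self.tables[name][-1] - {path})


storage = {"a.parquet"}
catalog = ToyCatalog(storage)
catalog.register_table("orders", [frozenset({"a.parquet"})])
catalog.append("orders", "b.parquet")

# Register a second table from the *old* metadata (the first snapshot only).
catalog.register_table("orders_old", catalog.tables["orders"][:1])
catalog.append("orders_old", "c.parquet")
catalog.delete_file("orders_old", "a.parquet")

# The two histories no longer agree, and "orders" references a missing file.
diverged = catalog.tables["orders"][-1] != catalog.tables["orders_old"][-1]
dangling = catalog.tables["orders"][-1] - storage
```

In this sketch, `dangling` ends up containing `a.parquet`: the table registered from old metadata removed a file that the current table still references, which is the kind of mutual corruption the quoted comment warns about.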
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model (maybe call it a SHARED model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there are times when the table is not shared and we would get into similar trouble. But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals. I wouldn't want to confuse table modification with data file ownership, although modification does have implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that drop table on managed tables would leave data around, which is also an issue.
   
   This is why Iceberg ignores `EXTERNAL`. Ideally, the platform should be making these decisions: users interact with logical tables, and physical concerns are for the platform. If you don't have a platform-level plan for dropping table data, then I think the `PURGE` approach is okay because a user presumably makes the choice at the right time (rather than months, if not years, before the drop).
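
As a rough illustration of making the choice at drop time, here is a minimal sketch. It assumes a toy catalog represented as a dict from table name to a set of file paths; the `drop_table` helper and its `purge` flag are hypothetical names for this example, not the Spark or Iceberg implementation.

```python
def drop_table(tables, storage, name, purge=False):
    """Remove `name` from the toy catalog; delete its files only when purge=True."""
    files = tables.pop(name)
    if purge:
        # The caller decides about the data at drop time, not at creation time.
        storage.difference_update(files)
    return files


storage = {"t1/a.parquet", "t2/b.parquet"}
tables = {"t1": {"t1/a.parquet"}, "t2": {"t2/b.parquet"}}

drop_table(tables, storage, "t1")              # catalog entry removed, files kept
drop_table(tables, storage, "t2", purge=True)  # catalog entry and files removed
remaining = sorted(storage)
```

Both tables disappear from the catalog, but only the purged table's files are deleted. Nothing about how either table was created affects the outcome, which is the point of deciding at drop time.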
   
   My general recommendation is to tell users that they're logically dropping tables and data. Maybe you can have a platform-supported way to un-delete, but when you drop a table you should generally expect that you did something destructive!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org