You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/11 23:29:35 UTC

[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.2

rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805069054



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       @jackye1995's summary of external is helpful. Thanks for helping set the context!
   
   The confusion here often lies with users that have been taught to expect certain outcomes from external tables, like that external tables don't remove data when they're dropped. Those guarantees are inconsistent in the first place: you can still overwrite (remove files) in an external table in Hive. In addition, why should a user set whether a table's data can be deleted when it is dropped at the time the table is created?
   
   How to clean up when dropping a table depends on the choices in the data platform. At Netflix, we removed references only and deleted data files only after a waiting period (to allow restoring a table). Users had no say in how the data was removed, only that the table was logically dropped. Both `EXTERNAL` and `PURGE` makes little sense with policies like that.
   
   > I think an extension of registerTable is needed for Iceberg to complete the notion of external table. And it should be the catalog's responsibility to manage this semantics, maybe set a flag to distinguish external table from normal tables to make it clear that it's a read-only copy of a managed table.
   
   I really don't think that the concept of `EXTERNAL` has a place in Iceberg and I am very skeptical that we should add it. `EXTERNAL` essentially means that the table is owned by some other system. But any engine can use an Iceberg catalog and the library to update a table. What is the utility of arbitrarily restricting that?
   
   As Jack said, we can think of the relationship differently, removing the engine requirement. But if we still have `EXTERNAL` in this world, then it is really just to support secondary references. I don't know why you would want secondary references that look exactly like primary references but are restricted from making changes. Just connect to the catalog. And if there's a need, we can add a catalog flag that makes it read-only.
   
   That still leaves an open question of what to do for `gc.enabled`. There are times when a table doesn't "own" the resources within it and `gc.enabled` is a quick way to account for that -- to basically turn off operations that may delete files that aren't owned. I think that includes PURGE, but I do see the distinction and agree that it is an ugly choice to need to make. Short term, I think we should disable PURGE and throw an exception if `gc.enabled` is true.
   
   Longer term, I think we need to refine the idea of file ownership. The best way that I can think of is to do this by prefix, as we discussed on a thread in the Trino community a while back and in the Iceberg sync. In this model, a table gets its own prefix and can delete anything in that prefix (or more likely, a small set of prefixes). Imported data files are unowned and are not deleted by expire snapshots or purge if they are outside of owned prefixes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org