You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/01 09:40:17 UTC

[GitHub] [iceberg] ConeyLiu opened a new pull request #3056: Support purge for Spark 3.1.x

ConeyLiu opened a new pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-918942052


   gentle ping @rdblue @aokolnychyi, could you help to review this also? Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r711279835



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
##########
@@ -287,4 +302,13 @@ private T getSessionCatalog() {
         "Please make sure your are replacing Spark's default catalog, named 'spark_catalog'.");
     return sessionCatalog;
   }
+
+  private boolean callPurgeTable(Object catalog, Identifier ident) {

Review comment:
       I don't think it's a good idea to use reflection here. I'd probably wait to merge this until after the Spark refactor so that we can implement new APIs from 3.1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r711277474



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -230,16 +230,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
   }
 
   @Override
-  public boolean dropTable(Identifier ident) {
+  protected boolean dropTableWithPurge(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :
+          icebergCatalog.dropTable(buildIdentifier(ident), purge);
     } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
       return false;
     }
   }
 
+  @Override
+  public boolean dropTable(Identifier ident) {
+    return dropTableWithPurge(ident, false);
+  }
+
+  public boolean purgeTable(Identifier ident) throws UnsupportedOperationException {

Review comment:
       This doesn't throw `UnsupportedOperationException`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809456911



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
##########
@@ -237,6 +237,13 @@ public boolean dropTable(Identifier ident) {
     return icebergCatalog.dropTable(ident) || getSessionCatalog().dropTable(ident);
   }
 
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    // no need to check table existence to determine which catalog to use. if a table doesn't exist then both are
+    // required to return false.
+    return icebergCatalog.purgeTable(ident) || getSessionCatalog().purgeTable(ident);

Review comment:
       Quick check here, we are safe in the GC_ENABLED situation because iceberg.purge will throw a runtime exception correct? Just want to make sure we don't call purge in that case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r804879622



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       My understanding is that `gc.enabled` should only impact the snapshot expiration (including remove orphan files) aspect, at least that is what my definition of “garbage collection” is. The DROP TABLE behavior of Iceberg external vs managed table are different discussions. We should not try to work around the managed table issue using `gc.enabled`.
   
   The industry has been using this notion of Managed vs External tables for a long time now. Most people simply associate it with the delete table behavior, but it really is describing the ownership of a table with respect to an engine with a data catalog. Just take Hive as an example, Managed table (Hive ACID) means table is in Hive metastore, and also the data is fully manged by the Hive data warehouse compute system. Changing data using external systems will likely cause trouble, so they are supposed to only read data, which means to treat a table as External. This has been how most of the vertical integration products define their table concepts, including Snowflake and AWS Redshift.
   
   In the Iceberg multi-engine model where data can be updated by any compute engine or process through the Iceberg library, this definition of managed that bundles catalog and engine is confusing. But we can still make it consistent by changing the definition. For Iceberg, the ownership of the table is bundled with the catalog service plus the Iceberg spec itself. Any compute engine can claim the responsibility to manage a table, as long as the Iceberg spec is respected and the catalog service is identified for the table. In this definition, the issue you describe is true, if two table identifiers point to the same underlying table metadata in storage, that means the process operating against the tables are both managing the underlying files. I think this should be made extra clear for users who would like to use the `registerTable` feature we just introduced, that the table registered is being fully managed, not just an External table.
   
   I think an extension of `registerTable` is needed for Iceberg to complete the notion of external table. And it should be the catalog's responsibility to manage this semantics, maybe set a flag to distinguish external table from normal tables to make it clear that it's a read-only copy of a managed table. The `registerTable` interface should be updated to something like:
   
   ```
   default Table registerTable(TableIdentifier identifier, String metadataFileLocation, boolean isManaged) {
       throw new UnsupportedOperationException("Registering tables is not supported");
     }
   ```
   
   If not managed, some sort of flag should be set at catalog level, so that if user later tries to do a DELETE TABLE PURGE, the catalog should reject the operation. I think this will fully solve the issue you describe about "a user creating a Snapshot of another table, then dropping it with purge"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1057698463


   Great work, @ConeyLiu! This has been an open issue for quite some time. We still have some remaining work that we will continue to discuss in #4159.
   
   Thanks everyone for reviews!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue removed a comment on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue removed a comment on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038350548


   > 100% Agree. However we now opened the door by adding registerTable in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table. There should be nothing that suggests registration has anything to do with `EXTERNAL` and no reason for people to think that tables that are added to a catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from the API.
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model (maybe call it a SHARED model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there are times when the table is not shared and we would get into similar trouble. But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals. I wouldn't want to confuse table modification with data file ownership, although modification does have implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that drop table on managed tables would leave data around, which is also an issue.
   
   This is why Iceberg ignores `EXTERNAL`. The platform should be making these decisions, ideally. Users interact with logical tables, physical concerns are for the platform. If you don't have a platform-level plan for dropping table data, then I think the `PURGE` approach is okay because a user presumably makes the choice at the right time (rather than months if not years before the drop).
   
   My general recommendation is to tell users that they're logically dropping tables and data. Maybe you can have a platform-supported way to un-delete, but when you drop a table you generally have no expectation that you didn't do anything destructive!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809665743



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +248,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
+      if (purge) {
+        Table table = load(ident).first();
+        ValidationException.check(
+            PropertyUtil.propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
+            "Cannot purge table: GC is disabled (deleting files may corrupt other tables)");
+      }
       return isPathIdentifier(ident) ?

Review comment:
       I know we used a ternary operator here before but the whole statement seems too complicated. I'd consider refactoring this into a separate utility method like `dropTableWithoutPurging` in the example above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r801210882



##########
File path: spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -246,8 +246,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
   public boolean dropTable(Identifier ident) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), false) :
+          icebergCatalog.dropTable(buildIdentifier(ident), false);
+    } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
+      return false;
+    }

Review comment:
       done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.1 & 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1026343368


   OK, will do it.
   
   Get Outlook for Android<https://aka.ms/AAb9ysg>
   ________________________________
   From: Ryan Blue ***@***.***>
   Sent: Tuesday, February 1, 2022 12:09:40 AM
   To: apache/iceberg ***@***.***>
   Cc: Xianyang Liu ***@***.***>; Mention ***@***.***>
   Subject: Re: [apache/iceberg] Support purge for Spark 3.1 & 3.2 (#3056)
   
   
   @ajantha-bhat<https://github.com/ajantha-bhat>, thanks for bringing this back up. Now that the refactor is done, this should be much easier to support in 3.1 and 3.2. @ConeyLiu<https://github.com/ConeyLiu> would you like to update this?
   
   —
   Reply to this email directly, view it on GitHub<https://github.com/apache/iceberg/pull/3056#issuecomment-1025945558>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ADBEWSA65FIVO65ICGJ4HGLUY2X4JANCNFSM5DD7MGTQ>.
   Triage notifications on the go with GitHub Mobile for iOS<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
   You are receiving this because you were mentioned.Message ID: ***@***.***>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038358431


   @ConeyLiu, I think we should simplify the way that the tests are happening. This should just validate that the data and metadata files for the table are deleted after a purge or not deleted when not purged. I don't think there is a need for custom catalogs or anything that is specific to the catalog that is used.
   
   In addition, I think we need to fail `PURGE` if the table has `gc.enabled=true` in table properties. The thread discussing this is long and covers a few other topics, but I think there's agreement about handling `gc.enabled`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1055243235


   @ConeyLiu, looks like we would need to adapt `TestRemoveOrphanFilesProcedure` since we added support for the purge flag. I guess we can simply modify `removeTable` (call PURGE) and `testRemoveOrphanFilesGCDisabled` (unset GC flag at the end).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1055152962


   Looks good to me. Thanks for the work, @ConeyLiu! Let's wait for the tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809623565



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
##########
@@ -237,6 +237,13 @@ public boolean dropTable(Identifier ident) {
     return icebergCatalog.dropTable(ident) || getSessionCatalog().dropTable(ident);
   }
 
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    // no need to check table existence to determine which catalog to use. if a table doesn't exist then both are
+    // required to return false.
+    return icebergCatalog.purgeTable(ident) || getSessionCatalog().purgeTable(ident);

Review comment:
       That's right. The `icebergCatalog.purgeTable` will throw a RuntimeException if it is an iceberg table. Otherwise, it depends on the `sessionCatalog` implementation because is out of the scope of iceberg.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r711909952



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDropTable.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.iceberg.spark.extensions;
+
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkExtensionsTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql("SELECT * FROM %s", tableName));
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+  }
+
+  @Test
+  public void testPurgeTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql("SELECT * FROM %s", tableName));
+
+    sql("DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));

Review comment:
       Updated the UT.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r713636737



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
##########
@@ -287,4 +302,13 @@ private T getSessionCatalog() {
         "Please make sure your are replacing Spark's default catalog, named 'spark_catalog'.");
     return sessionCatalog;
   }
+
+  private boolean callPurgeTable(Object catalog, Identifier ident) {

Review comment:
       +1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-922672260


   Thanks @rdblue  for the review. I am sorry for the later response.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-960422526


   @rdblue @jackye1995 @wypoon the code has been rebased, please help to review when you have time. Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
wypoon commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-951201985


   Support for Spark 3.2 has been added. I think the `purgeTable` support can be added to `SparkCatalog` in v3.2. Two versions for the new test can be added, one in v3.0 and another in v3.2, with different expected behavior.
   If we add v3.1, then the `purgeTable` support can be added to `SparkCatalog` in v3.1 as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #3056: Support purge for Spark 3.1 & 3.2

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1025825624


   @ConeyLiu, @rdblue, @jackye1995:   Should we support this with spark3.2 now ? I see some of the users in slack asking for this feature. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805458009



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       > I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table.
   
   probably too strong to say "perfectly". I am saying this not because I think it that way, but because multiple data engineers have asked me if they could use CREATE EXTERNAL TABLE to execute that `registerTable` feature. So from anecdotal evidence I think people will think in that way. Maybe we should call it something like `reviveTable` or expose it as a Spark procedure instead of making it a part of the catalog interface.
   
   > I'm not sure that I would want a SHARED keyword 
   
   I am not trying to promote a new keyword, I think we need a new concept to explain the difference from the traditional definition of external/managed tables.
   
   > Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals
   
   I think it's hard to not mention them together because owning a table has implicit relationship with owning the file paths and files of the table. But we can discuss this further, I will update in a new devlist thread.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r713845085



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
##########
@@ -287,4 +302,13 @@ private T getSessionCatalog() {
         "Please make sure your are replacing Spark's default catalog, named 'spark_catalog'.");
     return sessionCatalog;
   }
+
+  private boolean callPurgeTable(Object catalog, Identifier ident) {

Review comment:
       OK, waiting for the Spark refactor.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r713637296



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDropTable.java
##########
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.iceberg.spark.extensions;

Review comment:
       why is the test added in the extension package instead of spark3 directly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805396170



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       At Pinterest, we were also running into this issue where hive table users unexpectedly drop data when they create an iceberg table as `EXTERNAL` table type on existing data (more details on https://github.com/apache/iceberg/pull/4018). I think this would be a common accident among orgs trying out Iceberg. I can see how Netflix did not run into this issue, based on @rdblue comment, but I don't think all data platforms using hive tables drop only the references for `drop table` statements.
   
   I think if we change the behavior of `drop table` to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that `drop table` on managed tables would leave data around, which is also an issue. Spark's documentation also says that data is dropped as part of `drop table` operation when it is not an external table, https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-drop-table.html. So, this behavior of `drop table` not dropping data for managed tables is also misleading.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805296518



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       Thank you Ryan very much for the insights!
   
   For the discussion related to GC, I am fine with disallowing purge when `gc.enabled` is set to false as a short term solution if we would like to define garbage collection that way. We should do it within the catalog implementations to ensure this behavior is consistent across engines.
   
   > I really don't think that the concept of EXTERNAL has a place in Iceberg and I am very skeptical that we should add it.
   
   100% Agree. However we now opened the door by adding `registerTable` in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   As the first step, we should make it clear in the javadoc of `registerTable` that it's only used to recover a table. Creating 2 references in catalog of the same table metadata and do write operations on both to create diverged metadata history is not recommended and will have unintended side effects.
   
   When I was suggesting to add the `EXTERNAL` concept to `registerTable`, to be honest I was really trying to make peace with people who wants to stick with this definition. At least we can have a read only solution and only encourage registering a normal table for recovery. But the more I think of this topic, the more I feel we should start to promote the right way to operate against Iceberg tables.
   
   Just from correctness perspective, this is the wrong thing to promote. Even if people just want to read a historical metadata file, information like table properties are stale. It is always better to do time travel against the latest metadata, or run query against a tagged snapshot recorded in the latest metadata, and have the metadata location automatically updated with new commits.
   
   As you said, `EXTERNAL` is also just arbitrarily limiting the ability for people to use Iceberg. I would say the only value is at business level. As compute vendors or data platforms that have such traditional definition of external and managed tables, it is much easier to provide external table support for an alien product comparing to full table operation support that is usually offered to managed tables only. My team underwent such debate for a long time before we decided to go with full support, and we tried really hard to explain the entire picture, but I doubt if other people could do the same. We can already see some vendors now offer Iceberg support under the name of "external table".
   
   In the long term, we should start to promote a new table ownership model (maybe call it a `SHARED` model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805401777



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       > 100% Agree. However we now opened the door by adding registerTable in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table. There should be nothing that suggests registration has anything to do with `EXTERNAL` and no reason for people to think that tables that are added to a catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from the API.
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model (maybe call it a SHARED model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there are times when the table is not shared and we would get into similar trouble. But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals. I wouldn't want to confuse table modification with data file ownership, although modification does have implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that drop table on managed tables would leave data around, which is also an issue.
   
   This is why Iceberg ignores `EXTERNAL`. The platform should be making these decisions, ideally. Users interact with logical tables, physical concerns are for the platform. If you don't have a platform-level plan for dropping table data, then I think the `PURGE` approach is okay because a user presumably makes the choice at the right time (rather than months if not years before the drop).
   
   My general recommendation is to tell users that they're logically dropping tables and data. Maybe you can have a platform-supported way to un-delete, but when you drop a table you generally have no expectation that you didn't do anything destructive!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805401552



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestDropTable extends SparkCatalogTestBase {
+  private Map<String, String> config = null;
+  private String implementation = null;
+  private SparkSession session = null;
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][]{
+        {"testhive", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default"
+            )},
+        {"testhadoop", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hadoop"
+            )},
+        {"spark_catalog", CustomSparkSessionCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default",
+                "parquet-enabled", "true",
+                "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
+            )}
+    };
+  }
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+    this.config = config;
+    this.implementation = implementation;
+  }
+
+  @Before
+  public void createTable() {
+    // Spark CatalogManager cached the loaded catalog, here we use new SparkSession to force it load the catalog again

Review comment:
       This doesn't sound correct to me. You should not need to alter the catalog to check whether a table is purged. Can't you get the location from the table itself? And this appears to make the tests specific to the Hadoop catalog.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3056: Support purge for Spark 3.1 & 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1025945558


   @ajantha-bhat, thanks for bringing this back up. Now that the refactor is done, this should be much easier to support in 3.1 and 3.2. @ConeyLiu would you like to update this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r711279230



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDropTable.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.iceberg.spark.extensions;
+
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkExtensionsTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql("SELECT * FROM %s", tableName));
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+  }
+
+  @Test
+  public void testPurgeTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql("SELECT * FROM %s", tableName));
+
+    sql("DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));

Review comment:
       This isn't sufficient to test purge. I think you should create a mock catalog to test that the argument is passed through to `dropTable` correctly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1043403790


   Let me also take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1048498955


   @aokolnychyi, I'm sorry for the late reply due to the busy work recently. The code has been updated based on your comments. Please take a look when you have time. Thanks a lot. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.1 & 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r800228953



##########
File path: spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -246,8 +246,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
   public boolean dropTable(Identifier ident) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), false) :
+          icebergCatalog.dropTable(buildIdentifier(ident), false);
+    } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
+      return false;
+    }

Review comment:
       Rather than maintaining this in 2 places, can you use a private `dropTableInternal` method that accepts a `boolean purge` argument? That will cut down on duplication.
   
   Also, can you either fix 3.1 or 3.2 in this PR? It's easier to fix one version and port the changes once they are approved than to try to keep multiple versions in sync in a a single PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SinghAsDev commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
SinghAsDev commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805396170



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       At Pinterest, we were also running into this issue where hive table users unexpectedly drop data when they create an iceberg table as `EXTERNAL` table type on existing data (more details on https://github.com/apache/iceberg/pull/4018). I think this would be a common accident among orgs trying out Iceberg. I can see how Netflix did not run into this issue, based on @rdblue comment, but I don't think all data platforms using hive tables drop only the references for `drop table` statements.
   
   I think if we change the behavior of `drop table` to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that `drop table` on managed tables would leave data around, which is also an issue. Spark's documentation also says that data is dropped as part of `drop table` operation when it is not an external table, https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-drop-table.html. So, this behavior of `drop table` not dropping data for managed tables is also misleading.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-951893333


   I will resolve conflicts recently. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1055243235


   @ConeyLiu, looks like we would need to adapt `TestRemoveOrphanFilesProcedure` since we added support for the purge flag. I think you would need to modify `removeTable` and `testRemoveOrphanFilesGCDisabled` (unset GC flag at the end).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1059888376


   Thanks @aokolnychyi for merging this and also thanks to everyone for the review. I will submit a pr to spark 3.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue removed a comment on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue removed a comment on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038350548


   > 100% Agree. However we now opened the door by adding registerTable in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table. There should be nothing that suggests registration has anything to do with `EXTERNAL` and no reason for people to think that tables that are added to a catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from the API.
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model (maybe call it a SHARED model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there are times when the table is not shared and we would get into similar trouble. But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals. I wouldn't want to confuse table modification with data file ownership, although modification does have implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that drop table on managed tables would leave data around, which is also an issue.
   
   This is why Iceberg ignores `EXTERNAL`. The platform should be making these decisions, ideally. Users interact with logical tables, physical concerns are for the platform. If you don't have a platform-level plan for dropping table data, then I think the `PURGE` approach is okay because a user presumably makes the choice at the right time (rather than months if not years before the drop).
   
   My general recommendation is to tell users that they're logically dropping tables and data. Maybe you can have a platform-supported way to un-delete, but when you drop a table you generally have no expectation that you didn't do anything destructive!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r810200188



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +248,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {

Review comment:
       I think we should collect inconsistencies and fix them. I've started #4159 for that.
   Could you post your findings there, please? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r804879622



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       My understanding is that `gc.enabled` should only impact the snapshot expiration (including remove orphan files) aspect, at least that is what my definition of “garbage collection” is. The DROP TABLE behavior of Iceberg external vs managed table are different discussions. We should not try to work around the managed table issue using `gc.enabled`.
   
   The industry has been using this notion of Managed vs External tables for a long time now. Most people simply associate it with the delete table behavior, but it really is describing the ownership of a table with respect to an engine with a data catalog. Just take Hive as an example, Managed table (Hive ACID) means table is in Hive metastore, and also the data is fully manged by the Hive data warehouse. Changing data using external systems will likely cause trouble, so they are supposed to only read data, which means to treat a table as External. This has been how most of the vertical integration products define their table concepts, including Snowflake and AWS Redshift.
   
   In the Iceberg multi-engine model where data can be updated by any compute engine or process through the Iceberg library, this definition of managed that bundles catalog and engine is confusing. But we can still make it consistent by changing the definition. For Iceberg, the ownership of the table is bundled with the catalog service plus the Iceberg spec itself. Any compute engine can claim the responsibility to manage a table, as long as the Iceberg spec is respected and the catalog service is identified for the table. In this definition, the issue you describe is true, if two table identifiers point to the same underlying table metadata in storage, that means the process operating against the tables are both managing the underlying files. I think this should be made extra clear for users who would like to use the `registerTable` feature we just introduced, that the table registered is being fully managed, not just an External table.
   
   I think an extension of `registerTable` is needed for Iceberg to complete the notion of external table. And it should be the catalog's responsibility to manage this semantics, maybe set a flag to distinguish external table from normal tables to make it clear that it's a read-only copy of a managed table. The `registerTable` interface should be updated to something like:
   
   ```
   default Table registerTable(TableIdentifier identifier, String metadataFileLocation, boolean isManaged) {
       throw new UnsupportedOperationException("Registering tables is not supported");
     }
   ```
   
   If not managed, some sort of flag should be set at catalog level, so that if user later tries to do a DELETE TABLE PURGE, the catalog should reject the operation. I think this will fully solve the issue you describe about "a user creating a Snapshot of another table, then dropping it with purge"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r804826159



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       If Purge is True and GC_Enabled is false should we allow a purge? Or is should that throw an error?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038350548


   > 100% Agree. However we now opened the door by adding registerTable in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table. There should be nothing that suggests registration has anything to do with `EXTERNAL` and no reason for people to think that tables that are added to a catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from the API.
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model (maybe call it a SHARED model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there are times when the table is not shared and we would get into similar trouble. But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals. I wouldn't want to confuse table modification with data file ownership, although modification does have implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that drop table on managed tables would leave data around, which is also an issue.
   
   This is why Iceberg ignores `EXTERNAL`. The platform should be making these decisions, ideally. Users interact with logical tables, physical concerns are for the platform. If you don't have a platform-level plan for dropping table data, then I think the `PURGE` approach is okay because a user presumably makes the choice at the right time (rather than months if not years before the drop).
   
   My general recommendation is to tell users that they're logically dropping tables and data. Maybe you can have a platform-supported way to un-delete, but when you drop a table you generally have no expectation that you didn't do anything destructive!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038750863


   Thanks, everyone for the detailed and meaningful discussion.
   
   > In addition, I think we need to fail PURGE if the table has gc.enabled=true in table properties. The thread discussing this is long and covers a few other topics, but I think there's agreement about handling gc.enabled.
   
   @rdblue, the UTs have been updated. It should be `gc.enabled=false`, right? And the UTs have covered the case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r812659758



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +248,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {

Review comment:
       Thanks for opening the discussion, and I have left the findings.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r801213224



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestDropTable extends SparkCatalogTestBase {
+  private Map<String, String> config = null;
+  private String implementation = null;
+  private SparkSession session = null;
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][]{
+        {"testhive", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default"
+            )},
+        {"testhadoop", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hadoop"
+            )},
+        {"spark_catalog", CustomSparkSessionCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default",
+                "parquet-enabled", "true",
+                "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
+            )}
+    };
+  }
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+    this.config = config;
+    this.implementation = implementation;
+  }
+
+  @Before
+  public void createTable() {
+    // Spark CatalogManager cached the loaded catalog, here we use new SparkSession to force it load the catalog again

Review comment:
       @rdblue please see the comments. We have customed the `catalog` to verify whether `drop` or `purge` method is called. So we need to load a new catalog for each test. That's why we define a new `drop` method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805564085



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestDropTable extends SparkCatalogTestBase {
+  private Map<String, String> config = null;
+  private String implementation = null;
+  private SparkSession session = null;
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][]{
+        {"testhive", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default"
+            )},
+        {"testhadoop", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hadoop"
+            )},
+        {"spark_catalog", CustomSparkSessionCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default",
+                "parquet-enabled", "true",
+                "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
+            )}
+    };
+  }
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+    this.config = config;
+    this.implementation = implementation;
+  }
+
+  @Before
+  public void createTable() {
+    // Spark CatalogManager cached the loaded catalog, here we use new SparkSession to force it load the catalog again

Review comment:
       This adopted the previous comments, maybe it is too old:
   
   > This isn't sufficient to test purge. I think you should create a mock catalog to test that the argument is passed through to dropTable correctly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r804830153



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       To elaborate on this, we have a table property GC_ENABLED which we use to signify that a table identifier may not have full ownership of all the data files it references. When this flag is enabled we disable Iceberg operations which can phyiscally delete files like EXPIRE_SNAPSHOTS and REMOVE_ORPHAN_FILES. I feel like DROP TABLE PURGE is probably very similar and it should be blocked.
   
   An example would be a user creating a Snapshot of another table, then dropping it with purge. We wouldn't want to delete files owned by the original table which has been snapshotted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809656510



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+
+    File baseLocation = new File(getTableBaseLocation());
+    if (baseLocation.exists()) {
+      FileUtils.deleteDirectory(baseLocation);
+    }
+  }
+
+  @Test
+  public void testDropTable() {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> previousDataFiles = getDataOrMetadataFiles(true);
+    Assert.assertTrue("The number of data files should be larger than zero", previousDataFiles.size() > 0);
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertEquals("The number of data files should be zero", 0, getDataOrMetadataFiles(true).size());
+      Assert.assertEquals("The number of metadata files should be zero", 0, getDataOrMetadataFiles(false).size());
+    } else {
+      Assert.assertEquals("The data files should not change", previousDataFiles, getDataOrMetadataFiles(true));
+    }
+  }
+
+  @Test
+  public void testPurgeTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql("SELECT * FROM %s", tableName));
+
+    List<String> previousDataFiles = getDataOrMetadataFiles(true);
+    Assert.assertTrue("The number of data files should be larger than zero", previousDataFiles.size() > 0);
+
+    sql("DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+    Assert.assertEquals("The number of data files should be zero", 0, getDataOrMetadataFiles(true).size());
+    Assert.assertEquals("The number of metadata files should be zero", 0, getDataOrMetadataFiles(false).size());
+  }
+
+  @Test
+  public void testPurgeTableGCDisabled() {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    List<String> previousDataFiles = getDataOrMetadataFiles(true);
+    List<String> previousMetadataFiles = getDataOrMetadataFiles(false);
+    Assert.assertTrue("The number of data files should be larger than zero", previousDataFiles.size() > 0);
+    Assert.assertTrue("The number of metadata files should be larger than zero", previousMetadataFiles.size() > 0);
+
+    AssertHelpers.assertThrows("Purge table is not allowed when GC is disabled", ValidationException.class,
+        "Cannot purge table: GC is disabled (deleting files may corrupt other tables",
+        () -> sql("DROP TABLE %s PURGE", tableName));
+
+    Assert.assertTrue("Table should not been dropped", validationCatalog.tableExists(tableIdent));
+    Assert.assertEquals("The data files should not change", previousDataFiles, getDataOrMetadataFiles(true));
+    Assert.assertEquals("The metadata files should not change", previousMetadataFiles, getDataOrMetadataFiles(false));
+  }
+
+  private List<String> getDataOrMetadataFiles(boolean dataFiles) {
+    String baseLocation = getTableBaseLocation();

Review comment:
       Changed to verify the files of `MANIFESTS` and `FILES`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809664272



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +248,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {

Review comment:
       I think we should use `DeleteReachableFiles` that was added recently for this use case. I doubt `CatalogUtil` can perform well on a reasonably sized table. The action will scale much better as it uses distributed metadata tables under the hood.
   
   Here is what we use internally (kind of).
   
   ```
   @Override
   public boolean dropTable(Identifier ident) {
     return dropTableWithoutPurging(ident);
   }
   
   @Override
   public boolean purgeTable(Identifier ident) {
     try {
       Table table = load(ident).first();
       ValidationException.check(
           PropertyUtil.propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
           "Cannot purge table: GC is disabled (deleting files may corrupt other tables)");
   
       String metadataFileLocation = ((HasTableOperations) table).operations().current().metadataFileLocation();
   
       boolean dropped = dropTableWithoutPurging(ident);
   
       if (dropped) {
         SparkActions actions = SparkActions.get();
   
         actions.deleteReachableFiles(metadataFileLocation)
             .io(table.io())
             .execute();
       }
   
       return dropped;
   
     } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
       return false;
     }
   }
   
   private boolean dropTableWithoutPurging(Identifier ident) {
     if (isPathIdentifier(ident)) {
       return tables.dropTable(((PathIdentifier) ident).location(), false);
     } else {
       return icebergCatalog.dropTable(buildIdentifier(ident), false);
     }
   }
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809664272



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +248,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {

Review comment:
       I think we should use `DeleteReachableFiles` that was added recently for this use case. I doubt `CatalogUtil` can perform well on a reasonably sized table. The action will scale much better as it uses distributed metadata tables under the hood.
   
   Here is what we use internally.
   
   ```
   @Override
   public boolean dropTable(Identifier ident) {
     return dropTableWithoutPurging(ident);
   }
   
   @Override
   public boolean purgeTable(Identifier ident) {
     try {
       Table table = load(ident).first();
       ValidationException.check(
           PropertyUtil.propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
           "Cannot purge table: GC is disabled (deleting files may corrupt other tables)");
   
       String metadataFileLocation = ((HasTableOperations) table).operations().current().metadataFileLocation();
   
       boolean dropped = dropTableWithoutPurging(ident);
   
       if (dropped) {
         SparkActions actions = SparkActions.get();
   
         actions.deleteReachableFiles(metadataFileLocation)
             .io(table.io())
             .execute();
       }
   
       return dropped;
   
     } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
       return false;
     }
   }
   
   private boolean dropTableWithoutPurging(Identifier ident) {
     if (isPathIdentifier(ident)) {
       return tables.dropTable(((PathIdentifier) ident).location(), false);
     } else {
       return icebergCatalog.dropTable(buildIdentifier(ident), false);
     }
   }
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1048354837


   @ConeyLiu, I think we should go ahead and use the action for purging. Once we reach consensus on #4159, we can adapt the behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu closed pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu closed pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu closed pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu closed pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805069054



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       @jackye1995's summary of external is helpful. Thanks for helping set the context!
   
   The confusion here often lies with users that have been taught to expect certain outcomes from external tables, like that external tables don't remove data when they're dropped. Those guarantees are inconsistent in the first place: you can still overwrite (remove files) in an external table in Hive. In addition, why should a user set whether a table's data can be deleted when it is dropped at the time the table is created?
   
   How to clean up when dropping a table depends on the choices in the data platform. At Netflix, we removed references only and deleted data files only after a waiting period (to allow restoring a table). Users had no say in how the data was removed, only that the table was logically dropped. Both `EXTERNAL` and `PURGE` makes little sense with policies like that.
   
   > I think an extension of registerTable is needed for Iceberg to complete the notion of external table. And it should be the catalog's responsibility to manage this semantics, maybe set a flag to distinguish external table from normal tables to make it clear that it's a read-only copy of a managed table.
   
   I really don't think that the concept of `EXTERNAL` has a place in Iceberg and I am very skeptical that we should add it. `EXTERNAL` essentially means that the table is owned by some other system. But any engine can use an Iceberg catalog and the library to update a table. What is the utility of arbitrarily restricting that?
   
   As Jack said, we can think of the relationship differently, removing the engine requirement. But if we still have `EXTERNAL` in this world, then it is really just to support secondary references. I don't know why you would want secondary references that look exactly like primary references but are restricted from making changes. Just connect to the catalog. And if there's a need, we can add a catalog flag that makes it read-only.
   
   That still leaves an open question of what to do for `gc.enabled`. There are times when a table doesn't "own" the resources within it and `gc.enabled` is a quick way to account for that -- to basically turn off operations that may delete files that aren't owned. I think that includes PURGE, but I do see the distinction and agree that it is an ugly choice to need to make. Short term, I think we should disable PURGE and throw an exception if `gc.enabled` is true.
   
   Longer term, I think we need to refine the idea of file ownership. The best way that I can think of is to do this by prefix, as we discussed on a thread in the Trino community a while back and in the Iceberg sync. In this model, a table gets its own prefix and can delete anything in that prefix (or more likely, a small set of prefixes). Imported data files are unowned and are not deleted by expire snapshots or purge if they are outside of owned prefixes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805401552



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestDropTable extends SparkCatalogTestBase {
+  private Map<String, String> config = null;
+  private String implementation = null;
+  private SparkSession session = null;
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][]{
+        {"testhive", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default"
+            )},
+        {"testhadoop", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hadoop"
+            )},
+        {"spark_catalog", CustomSparkSessionCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default",
+                "parquet-enabled", "true",
+                "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
+            )}
+    };
+  }
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+    this.config = config;
+    this.implementation = implementation;
+  }
+
+  @Before
+  public void createTable() {
+    // Spark CatalogManager cached the loaded catalog, here we use new SparkSession to force it load the catalog again

Review comment:
       This doesn't sound correct to me. You should not need to alter the catalog to check whether a table is purged. Can't you get the location from the table itself? And this appears to make the tests specific to the Hadoop catalog.

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       > 100% Agree. However we now opened the door by adding registerTable in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table. There should be nothing that suggests registration has anything to do with `EXTERNAL` and no reason for people to think that tables that are added to a catalog through `registerTable` should behave any differently.
   
   If this confusion persists, I would support removing `registerTable` from the API.
   
   > Just from correctness perspective, this is the wrong thing to promote.
   
   Agreed!
   
   > In the long term, we should start to promote a new table ownership model (maybe call it a SHARED model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.
   
   I'm not sure that I would want a `SHARED` keyword -- that just implies there are times when the table is not shared and we would get into similar trouble. But I think your idea to address this in a design doc is good.
   
   Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals. I wouldn't want to confuse table modification with data file ownership, although modification does have implications for file ownership.
   
   > I think if we change the behavior of drop table to not drop any data that alleviates our concern on accidental drops on external tables. However, it also means that drop table on managed tables would leave data around, which is also an issue.
   
   This is why Iceberg ignores `EXTERNAL`. The platform should be making these decisions, ideally. Users interact with logical tables, physical concerns are for the platform. If you don't have a platform-level plan for dropping table data, then I think the `PURGE` approach is okay because a user presumably makes the choice at the right time (rather than months if not years before the drop).
   
   My general recommendation is to tell users that they're logically dropping tables and data. Maybe you can have a platform-supported way to un-delete, but when you drop a table you generally have no expectation that you didn't do anything destructive!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r813231019



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.MetadataTableType;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Streams;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() throws IOException {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertTrue("All files should be deleted", checkFilesExist(manifestAndFiles, false));
+    } else {
+      Assert.assertTrue("All files should not be deleted", checkFilesExist(manifestAndFiles, true));
+    }
+  }
+
+  @Test
+  public void testPurgeTable() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));

Review comment:
       Nit: "should be existed" => "should exist"

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.MetadataTableType;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Streams;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() throws IOException {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());

Review comment:
       Nit, "There totally should have" can be "there should be"

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.MetadataTableType;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Streams;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() throws IOException {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertTrue("All files should be deleted", checkFilesExist(manifestAndFiles, false));
+    } else {
+      Assert.assertTrue("All files should not be deleted", checkFilesExist(manifestAndFiles, true));
+    }
+  }
+
+  @Test
+  public void testPurgeTable() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    sql("DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+    Assert.assertTrue("All files should be deleted", checkFilesExist(manifestAndFiles, false));
+  }
+
+  @Test
+  public void testPurgeTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));

Review comment:
       same as above

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,15 +253,51 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableWithoutPurging(ident);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
     try {
-      return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+      Table table = load(ident).first();
+      ValidationException.check(
+          PropertyUtil.propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
+          "Cannot purge table: GC is disabled (deleting files may corrupt other tables)");
+      String metadataFileLocation = ((HasTableOperations) table).operations().current().metadataFileLocation();
+
+      boolean dropped = dropTableWithoutPurging(ident);
+
+      if (dropped) {
+        try {
+          // We should check whether the metadata file is existed. Because the HadoopCatalog/HadoopTables will drop the

Review comment:
       Nit "is existed" => "exists"

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.MetadataTableType;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Streams;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() throws IOException {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertTrue("All files should be deleted", checkFilesExist(manifestAndFiles, false));
+    } else {
+      Assert.assertTrue("All files should not be deleted", checkFilesExist(manifestAndFiles, true));
+    }
+  }
+
+  @Test
+  public void testPurgeTable() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    sql("DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+    Assert.assertTrue("All files should be deleted", checkFilesExist(manifestAndFiles, false));
+  }
+
+  @Test
+  public void testPurgeTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    AssertHelpers.assertThrows("Purge table is not allowed when GC is disabled", ValidationException.class,
+        "Cannot purge table: GC is disabled (deleting files may corrupt other tables",
+        () -> sql("DROP TABLE %s PURGE", tableName));
+
+    Assert.assertTrue("Table should not been dropped", validationCatalog.tableExists(tableIdent));
+    Assert.assertTrue("All files should not be deleted", checkFilesExist(manifestAndFiles, true));
+  }
+
+  private List<String> getManifestsAndFiles() {

Review comment:
       nit: in Iceberg code style, can remove get from method name: https://iceberg.apache.org/contribute/.  (in "Method Naming".  (I have received a few reviews on this myself :))

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,15 +253,51 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableWithoutPurging(ident);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
     try {
-      return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+      Table table = load(ident).first();
+      ValidationException.check(
+          PropertyUtil.propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
+          "Cannot purge table: GC is disabled (deleting files may corrupt other tables)");
+      String metadataFileLocation = ((HasTableOperations) table).operations().current().metadataFileLocation();
+
+      boolean dropped = dropTableWithoutPurging(ident);
+
+      if (dropped) {
+        try {
+          // We should check whether the metadata file is existed. Because the HadoopCatalog/HadoopTables will drop the
+          // warehouse directly and ignore the `purge` argument.
+          table.io().newInputFile(metadataFileLocation).newStream().close();

Review comment:
       Cant we use InputFile.exists() instead of newStream and check for exception?

##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.MetadataTableType;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Streams;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+  }
+
+  @Test
+  public void testDropTable() throws IOException {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() throws IOException {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());
+    Assert.assertTrue("All files should be existed", checkFilesExist(manifestAndFiles, true));
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertTrue("All files should be deleted", checkFilesExist(manifestAndFiles, false));
+    } else {
+      Assert.assertTrue("All files should not be deleted", checkFilesExist(manifestAndFiles, true));
+    }
+  }
+
+  @Test
+  public void testPurgeTable() throws IOException {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> manifestAndFiles = getManifestsAndFiles();
+    Assert.assertEquals("There totally should have 2 files for manifests and files", 2, manifestAndFiles.size());

Review comment:
       same as above

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,15 +253,51 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableWithoutPurging(ident);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
     try {
-      return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+      Table table = load(ident).first();
+      ValidationException.check(
+          PropertyUtil.propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
+          "Cannot purge table: GC is disabled (deleting files may corrupt other tables)");
+      String metadataFileLocation = ((HasTableOperations) table).operations().current().metadataFileLocation();
+
+      boolean dropped = dropTableWithoutPurging(ident);
+
+      if (dropped) {
+        try {
+          // We should check whether the metadata file is existed. Because the HadoopCatalog/HadoopTables will drop the
+          // warehouse directly and ignore the `purge` argument.
+          table.io().newInputFile(metadataFileLocation).newStream().close();
+        } catch (NotFoundException e) {
+          return true;
+        } catch (IOException e) {
+          throw new UncheckedIOException(e);
+        }
+
+        SparkActions.get()
+            .deleteReachableFiles(metadataFileLocation)
+            .io(table.io())
+            .execute();
+      }
+
+      return dropped;
     } catch (org.apache.iceberg.exceptions.NoSuchTableException e) {
       return false;
     }
   }
 
+  private boolean dropTableWithoutPurging(Identifier ident) {
+    if (isPathIdentifier(ident)) {
+      return tables.dropTable(((PathIdentifier) ident).location(), false);

Review comment:
       Nit: there seems to also be guidelines about making comment when passing booleans in  https://iceberg.apache.org/contribute/.   in "Boolean Arguments" , about this very case it seems!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1049431643


   Thanks @szehon-ho for the review. Comments have been addressed. Thanks again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1056367896


   Thanks @aokolnychyi, the code has been updated. And it passed at locally testing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809461459



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestDropTable extends SparkCatalogTestBase {
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Before
+  public void createTable() {
+    sql("CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql("INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql("DROP TABLE IF EXISTS %s", tableName);
+
+    File baseLocation = new File(getTableBaseLocation());
+    if (baseLocation.exists()) {
+      FileUtils.deleteDirectory(baseLocation);
+    }
+  }
+
+  @Test
+  public void testDropTable() {
+    dropTableInternal();
+  }
+
+  @Test
+  public void testDropTableGCDisabled() {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    dropTableInternal();
+  }
+
+  private void dropTableInternal() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql("SELECT * FROM %s", tableName));
+
+    List<String> previousDataFiles = getDataOrMetadataFiles(true);
+    Assert.assertTrue("The number of data files should be larger than zero", previousDataFiles.size() > 0);
+
+    sql("DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertEquals("The number of data files should be zero", 0, getDataOrMetadataFiles(true).size());
+      Assert.assertEquals("The number of metadata files should be zero", 0, getDataOrMetadataFiles(false).size());
+    } else {
+      Assert.assertEquals("The data files should not change", previousDataFiles, getDataOrMetadataFiles(true));
+    }
+  }
+
+  @Test
+  public void testPurgeTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql("SELECT * FROM %s", tableName));
+
+    List<String> previousDataFiles = getDataOrMetadataFiles(true);
+    Assert.assertTrue("The number of data files should be larger than zero", previousDataFiles.size() > 0);
+
+    sql("DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+    Assert.assertEquals("The number of data files should be zero", 0, getDataOrMetadataFiles(true).size());
+    Assert.assertEquals("The number of metadata files should be zero", 0, getDataOrMetadataFiles(false).size());
+  }
+
+  @Test
+  public void testPurgeTableGCDisabled() {
+    sql("ALTER TABLE %s SET TBLPROPERTIES (gc.enabled = false)", tableName);
+    List<String> previousDataFiles = getDataOrMetadataFiles(true);
+    List<String> previousMetadataFiles = getDataOrMetadataFiles(false);
+    Assert.assertTrue("The number of data files should be larger than zero", previousDataFiles.size() > 0);
+    Assert.assertTrue("The number of metadata files should be larger than zero", previousMetadataFiles.size() > 0);
+
+    AssertHelpers.assertThrows("Purge table is not allowed when GC is disabled", ValidationException.class,
+        "Cannot purge table: GC is disabled (deleting files may corrupt other tables",
+        () -> sql("DROP TABLE %s PURGE", tableName));
+
+    Assert.assertTrue("Table should not been dropped", validationCatalog.tableExists(tableIdent));
+    Assert.assertEquals("The data files should not change", previousDataFiles, getDataOrMetadataFiles(true));
+    Assert.assertEquals("The metadata files should not change", previousMetadataFiles, getDataOrMetadataFiles(false));
+  }
+
+  private List<String> getDataOrMetadataFiles(boolean dataFiles) {
+    String baseLocation = getTableBaseLocation();

Review comment:
       one little thing you could do here to avoid relying on the actual location of the table and all that, is just read the path column out of the MANIFESTS and FILES metadata tables. That would remove some fo the logic in getTableBaseLocation.
   
   See
   
   https://iceberg.apache.org/docs/latest/spark-queries/#files
   and
   https://iceberg.apache.org/docs/latest/spark-queries/#manifests
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r809701190



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +248,25 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {

Review comment:
       This should be much more scalable. Just one question, different catalog seems to have different purge implementations(such as `HadoopCataglog` remove warehouse directly). Whether we need to consider this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.1 & 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r800229332



##########
File path: spark/v3.1/spark/src/test/java/org/apache/iceberg/spark/sql/TestDropTable.java
##########
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.sql;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.commons.io.FileUtils;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.spark.SparkCatalog;
+import org.apache.iceberg.spark.SparkCatalogTestBase;
+import org.apache.iceberg.spark.SparkSessionCatalog;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestDropTable extends SparkCatalogTestBase {
+  private Map<String, String> config = null;
+  private String implementation = null;
+  private SparkSession session = null;
+
+  @Parameterized.Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
+  public static Object[][] parameters() {
+    return new Object[][]{
+        {"testhive", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default"
+            )},
+        {"testhadoop", CustomSparkCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hadoop"
+            )},
+        {"spark_catalog", CustomSparkSessionCatalog.class.getName(),
+            ImmutableMap.of(
+                "type", "hive",
+                "default-namespace", "default",
+                "parquet-enabled", "true",
+                "cache-enabled", "false" // Spark will delete tables using v1, leaving the cache out of sync
+            )}
+    };
+  }
+
+  public TestDropTable(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+    this.config = config;
+    this.implementation = implementation;
+  }
+
+  @Before
+  public void createTable() {
+    // Spark CatalogManager cached the loaded catalog, here we use new SparkSession to force it load the catalog again
+    session = spark.newSession();
+    session.conf().set("spark.sql.catalog." + catalogName, this.implementation);
+    config.forEach((key, value) -> session.conf().set("spark.sql.catalog." + catalogName + "." + key, value));
+
+    if (config.get("type").equalsIgnoreCase("hadoop")) {
+      session.conf().set("spark.sql.catalog." + catalogName + ".warehouse", "file:" + warehouse);
+    }
+    sql(session, "CREATE NAMESPACE IF NOT EXISTS default");
+    sql(session, "CREATE TABLE %s (id INT, name STRING) USING iceberg", tableName);
+    sql(session, "INSERT INTO %s VALUES (1, 'test')", tableName);
+  }
+
+  @After
+  public void removeTable() throws IOException {
+    sql(session, "DROP TABLE IF EXISTS %s PURGE", tableName);
+
+    File baseLocation = new File(getTableBaseLocation());
+    if (baseLocation.exists()) {
+      FileUtils.deleteDirectory(baseLocation);
+    }
+    session = null;
+  }
+
+  @Test
+  public void testDropTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")), sql(session, "SELECT * FROM %s", tableName));
+
+    int fileSize = dataFiles().size();
+    Assert.assertTrue("The number of data files should be larger than zero", fileSize > 0);
+
+    sql(session, "DROP TABLE %s", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+
+    if (catalogName.equals("testhadoop")) {
+      // HadoopCatalog drop table without purge will delete the base table location.
+      Assert.assertEquals("The number of data files should be zero", 0, dataFiles().size());
+    } else {
+      Assert.assertEquals("The number of data files should not change", fileSize, dataFiles().size());
+    }
+
+    Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name());
+    if (!"spark_catalog".equals(catalogName)) {
+      CustomSparkCatalog catalog = (CustomSparkCatalog) session.sessionState().catalogManager().catalog(catalogName);
+      Assert.assertTrue("dropTable should be called", catalog.isDropTableCalled());
+      Assert.assertFalse("purgeTable should not be called", catalog.isPurgeTableCalled());
+      Assert.assertEquals(identifier, catalog.droppedIdentifier());
+    } else {
+      CustomSparkSessionCatalog catalog =
+          (CustomSparkSessionCatalog) session.sessionState().catalogManager().catalog(catalogName);
+      Assert.assertTrue("dropTable should be called", catalog.isDropTableCalled());
+      Assert.assertFalse("purgeTable should not be called", catalog.isPurgeTableCalled());
+      Assert.assertEquals(identifier, catalog.droppedIdentifier());
+    }
+  }
+
+  @Test
+  public void testPurgeTable() {
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1, "test")),
+        sql(session, "SELECT * FROM %s", tableName));
+
+    int fileSize = dataFiles().size();
+    Assert.assertTrue("The number of data files should be larger than zero", fileSize > 0);
+
+    sql(session, "DROP TABLE %s PURGE", tableName);
+    Assert.assertFalse("Table should not exist", validationCatalog.tableExists(tableIdent));
+    Assert.assertEquals("The number of data files should be zero", 0, dataFiles().size());
+
+    Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name());
+    if (!"spark_catalog".equals(catalogName)) {
+      CustomSparkCatalog catalog = (CustomSparkCatalog) session.sessionState().catalogManager().catalog(catalogName);
+      Assert.assertTrue("purgeTable should be called", catalog.isPurgeTableCalled());
+      Assert.assertFalse("dropTable should not be called", catalog.isDropTableCalled());
+      Assert.assertEquals(identifier, catalog.droppedIdentifier());
+    } else {
+      CustomSparkSessionCatalog catalog =
+          (CustomSparkSessionCatalog) session.sessionState().catalogManager().catalog(catalogName);
+      Assert.assertTrue("purgeTable should be called", catalog.isPurgeTableCalled());
+      Assert.assertFalse("dropTable should not be called", catalog.isDropTableCalled());
+      Assert.assertEquals(identifier, catalog.droppedIdentifier());
+    }
+  }
+
+  private List<Object[]> sql(SparkSession spark, String query, Object... args) {

Review comment:
       Why is this here? Isn't there already a `sql` method available?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1043969088


   @jackye1995 @RussellSpitzer @rdblue, the question about ownership and file removal comes up in a lot of discussions. I've created issue #4159 so that we can continue the discussion that happened on this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1038350548






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r805296518



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       Thank you Ryan very much for the insights!
   
   For the discussion related to GC, I am fine with disallowing purge when `gc.enabled` is set to false as a short term solution if we would like to define garbage collection that way. We should do it within the catalog implementations to ensure this behavior is consistent across engines.
   
   > I really don't think that the concept of EXTERNAL has a place in Iceberg and I am very skeptical that we should add it.
   
   100% Agree. However we now opened the door by adding `registerTable` in catalog, which maps to the external table concept perfectly. I already received a few feature requests of people asking for this to map to CREATE EXTERNAL TABLE. People can now register an Iceberg table with an old metadata file location and do writes against it to create basically 2 diverged metadata history of the same table. This is very dangerous action because 2 Iceberg tables can now own the same set of files and corrupt each other.
   
   As the first step, we should make it clear in the javadoc of `registerTable` that it's only used to recover a table. Creating 2 references in catalog of the same table metadata and do write operations on both to create diverged metadata history is not recommended and will have unintended side effects.
   
   When I was suggesting to add the `EXTERNAL` concept to `registerTable`, to be honest I was really trying to make peace with people who wants to stick with this definition. At least we can have a read only solution and only encourage registering a normal table for recovery. But the more I think of this topic, the more I feel we should start to promote the right way to operate against Iceberg tables.
   
   Just from correctness perspective, this is the wrong thing to promote. Even if people just want to read a historical metadata file, information like table properties are stale. It is always better to do time travel against the latest metadata, or run query against a tagged snapshot recorded in the latest metadata, and have the metadata location automatically updated with new commits.
   
   As you said, `EXTERNAL` is also just arbitrarily limiting the ability for people to use Iceberg. I would say the only value is at business level. As compute vendors or data platforms that have such traditional definition of external and managed tables, it is much easier to provide external table support for an alien product comparing to full table operation support that is usually offered to managed tables only. My team underwent such debate for a long time before we decided to go with full support, and we tried really hard to explain the entire picture, but I doubt if other people could do the same. We can already see some vendors now offer Iceberg support under the name of "external table".
   
   In the long term, we should start to promote a new table ownership model (maybe call it a `SHARED` model) and start to bring people up to date with how Iceberg tables are operated. Let me draft a doc for that to have a formal discussion, and also include concepts like table root location ownership in that doc so we can have full clarity in the domain of table ownership.

##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -244,10 +244,19 @@ public SparkTable alterTable(Identifier ident, TableChange... changes) throws No
 
   @Override
   public boolean dropTable(Identifier ident) {
+    return dropTableInternal(ident, false);
+  }
+
+  @Override
+  public boolean purgeTable(Identifier ident) {
+    return dropTableInternal(ident, true);
+  }
+
+  private boolean dropTableInternal(Identifier ident, boolean purge) {
     try {
       return isPathIdentifier(ident) ?
-          tables.dropTable(((PathIdentifier) ident).location()) :
-          icebergCatalog.dropTable(buildIdentifier(ident));
+          tables.dropTable(((PathIdentifier) ident).location(), purge) :

Review comment:
       > I'm not sure I agree that it maps perfectly. This is a way to register a table with a catalog, after which the catalog owns it like any other table.
   
   probably too strong to say "perfectly". I am saying this not because I think it that way, but because multiple data engineers have asked me if they could use CREATE EXTERNAL TABLE to execute that `registerTable` feature. So from anecdotal evidence I think people will think in that way. Maybe we should call it something like `reviveTable` or expose it as a Spark procedure instead of making it a part of the catalog interface.
   
   > I'm not sure that I would want a SHARED keyword 
   
   I am not trying to promote a new keyword, I think we need a new concept to explain the difference from the traditional definition of external/managed tables.
   
   > Also, I consider the data/file ownership a separate problem, so you may want to keep them separate in design docs or proposals
   
   I think it's hard to not mention them together because owning a table has implicit relationship with owning the file paths and files of the table. But we can discuss this further, I will update in a new devlist thread.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1055243235


   @ConeyLiu, looks like we would need to adapt `TestRemoveOrphanFilesProcedure` since we added support for the purge flag. I think you would need to modify `removeTable` (call PURGE) and `testRemoveOrphanFilesGCDisabled` (unset GC flag at the end).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #3056: Support purge for Spark 3.1 & 3.2

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1026892885


   @ajantha-bhat @rdblue Please take a look, thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3056: Support purge for Spark 3.2

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#issuecomment-1057663216


   Thanks, @ConeyLiu. Let's wait for test results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r713844120



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDropTable.java
##########
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+
+package org.apache.iceberg.spark.extensions;

Review comment:
       Thanks @jackye1995 for the review, moved to spark3.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056#discussion_r711278788



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/BaseCatalog.java
##########
@@ -45,4 +45,6 @@ public Procedure loadProcedure(Identifier ident) throws NoSuchProcedureException
 
     throw new NoSuchProcedureException(ident);
   }
+
+  protected abstract boolean dropTableWithPurge(Identifier ident, boolean purge);

Review comment:
       This doesn't need to be an abstract method on this class. Both implementations of `BaseCatalog` implement `dropTableWithPurge` and it should just be private in both of those.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu closed pull request #3056: Support purge for Spark 3.1.x

Posted by GitBox <gi...@apache.org>.
ConeyLiu closed pull request #3056:
URL: https://github.com/apache/iceberg/pull/3056


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org