You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/24 05:01:55 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #2367: Core: Don't delete data files on DROP if GC is disabled

aokolnychyi opened a new pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367


   This PR prohibits removal of data files if GC is disabled. Fixes #2366.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805498760


   cc @RussellSpitzer @rdblue @pvary @rymurr @shardulm94


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805770620


   From a Hive user perspective, I would expect my data files to be dropped purely on the basis of whether `external.table.purge=TRUE` is set, and not be silently overridden by the Iceberg-specific `gc.enabled` property. So there might be surprises for Hive users if `purge=TRUE` but `gc.enabled=false` and therefore the data files end up not actually removed.
   
   I'm wondering if we should keep the two props in sync somehow via the `HiveTableOperations`, or at least provide better documentation on the possible Hive DROP TABLE behaviours?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805880565


   I think for snapshots you only want to delete files that are made in commits after the original import, that should be possible but complicated. Basically just delete everything added after the first iceberg snapshot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-806687999


   @aokolnychyi Thanks for your questions, Anton.
   
   > How does Hive populate deleteData in commitDropTable?
   
   `deleteData` is set to true by default here: [Hive.java](https://github.com/apache/hive/blob/branch-3.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L965). This ensures data is automatically deleted for managed tables.
   However, data is not deleted for external tables pre-Hive4: [HiveMetaStore.java](https://github.com/apache/hive/blob/branch-3.1/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2543), 
   Introduced in Hive4, we can now force the deletion even for external tables with the `external.table.purge=TRUE` property: [Docs](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables)
   
   > Is it correct that external.table.purge is not specific to Iceberg, it used in other tables too?
   
   Correct, it's used for Hive tables as well.
   
   So at the moment, all Hive Iceberg tables are external, therefore their data is not dropped automatically by Hive2 and Hive3 (Hive4 does so with the purge flag). Currently our solution for that is to check for the `external.table.purge` property in the `HiveIcebergMetaHook`, and if set, call `Catalogs.dropTable/CatalogUtil.dropTableData` to delete the data ourselves in the postHook.
   
    A Hive user should be able to decide whether or not data deletion is desirable for a table by changing the table property `external.table.purge`. Going forward, some of the options I see to avoid surprises:
   
   - Synchronizing `external.table.purge` with `gc.enabled`. When one changes, the other is changed accordingly. This could already work when changing the `gc.enabled` (and then altering the purge prop in HMS accordingly), but not the other way around (since `ALTER TABLE SET TBLPROPERTIES` is not implemented yet to push down HMS prop changes to Iceberg - although this is on our roadmap already). 
   - Adding the `gc.enabled` prop to the HMS table props as well, make it control the data deletion for iceberg tables via the MetaHook (+ update docs to let users know of this) and let users change this prop via ALTER TABLE as desired. I don't really like this one, because it introduces an inconsistency between Iceberg and Hive tables.
   - Leaving things as is, but updating the Hive docs so that users are aware that DROP TABLE for Iceberg tables might behave differently depending on Iceberg configuration, and improve logging. I'm no big fan of this either.
   
   At first glance, I think I'd prefer the first option. Any thoughts on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805994813


   > I think for snapshots you only want to delete files that are made in commits after the original import, that should be possible but complicated. Basically just delete everything added after the first iceberg snapshot.
   
   Yes and no. That's the case for snapshot but what if someone creates an Iceberg table with GC disabled and uses it to just promote files written not through Iceberg? Then deleting data beyond the first snapshot not safe too. Maybe, we can take `snapshot` table property into account.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805542440


   Ugh, good question. We could delete files only under the data subfolder but that won't be sufficient in a generic case. Suppose someone creates a table, sets `gc.enabled` to `false` and calls `add_files`. That's why it does not work.
   
   Let me sleep on this issue. I don't have a good solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-806046326


   I think that this fix is a good starting point. For files that are added to a snapshot table after creation, it is better to leak those than to delete files from the original table. I think we should move forward with this PR and then try to catch more data files later.
   
   As for the Hive property, I'm not sure what to do but it is a good thing to consider separately. What respects that property? I think Iceberg uses it somewhere, but that may just be in the Hive code. I think having a clear understanding of what it does and where it is currently used is the right place to start.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805502435


   How does a user delete files that are added to the snapshot but not part of the original table?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-806023117


   A few questions, @marton-bod @pvary.
   
   - How does Hive populate `deleteData` in `commitDropTable`?
   - Is it correct that `external.table.purge` is not specific to Iceberg, it used in other tables too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] pvary commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805686470






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#discussion_r600738227



##########
File path: core/src/main/java/org/apache/iceberg/CatalogUtil.java
##########
@@ -78,7 +82,12 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
 
     // run all of the deletes
 
-    deleteFiles(io, manifestsToDelete);
+    boolean gcEnabled = PropertyUtil.propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT);
+
+    if (gcEnabled) {
+      // delete data files only if GC is enabled

Review comment:
       This comment could be better to explain what the GC property is doing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#discussion_r600738688



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestSnapshotTableProcedure.java
##########
@@ -118,6 +118,26 @@ public void testSnapshotWithAlternateLocation() throws IOException {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void testDropTable() throws IOException {
+    String location = temp.newFolder().toString();
+    sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet LOCATION '%s'", sourceName, location);
+    sql("INSERT INTO TABLE %s VALUES (1, 'a')", sourceName);
+
+    Object result = scalarSql("CALL %s.system.snapshot('%s', '%s')", catalogName, sourceName, tableName);
+    Assert.assertEquals("Should have added one file", 1L, result);
+
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1L, "a")),
+        sql("SELECT * FROM %s", tableName));
+
+    sql("DROP TABLE %s", tableName);

Review comment:
       Do we want to assert that the metadata files for the snapshot table are removed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#discussion_r600784671



##########
File path: core/src/main/java/org/apache/iceberg/CatalogUtil.java
##########
@@ -78,7 +82,12 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
 
     // run all of the deletes
 
-    deleteFiles(io, manifestsToDelete);
+    boolean gcEnabled = PropertyUtil.propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT);
+
+    if (gcEnabled) {
+      // delete data files only if GC is enabled

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#discussion_r600782904



##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestSnapshotTableProcedure.java
##########
@@ -118,6 +118,26 @@ public void testSnapshotWithAlternateLocation() throws IOException {
         sql("SELECT * FROM %s ORDER BY id", tableName));
   }
 
+  @Test
+  public void testDropTable() throws IOException {
+    String location = temp.newFolder().toString();
+    sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet LOCATION '%s'", sourceName, location);
+    sql("INSERT INTO TABLE %s VALUES (1, 'a')", sourceName);
+
+    Object result = scalarSql("CALL %s.system.snapshot('%s', '%s')", catalogName, sourceName, tableName);
+    Assert.assertEquals("Should have added one file", 1L, result);
+
+    assertEquals("Should have expected rows",
+        ImmutableList.of(row(1L, "a")),
+        sql("SELECT * FROM %s", tableName));
+
+    sql("DROP TABLE %s", tableName);

Review comment:
       It would fail for the session catalog due to #2374.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-806269627


   Thanks, everyone! I think we should continue discussing the purge flag in Hive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805542440


   Ugh, good question. We could delete files only under the data subfolder but that won't be sufficient in a generic case. Suppose someone creates a table, sets `gc.enabled` to `false` and calls `add_files`. We need another approach.
   
   Let me sleep on this issue. I don't have a good solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2367: Core: Don't delete data files on DROP if GC is disabled

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2367:
URL: https://github.com/apache/iceberg/pull/2367#issuecomment-805995031


   Let me see what we do in the Hive codebase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org