Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/11 21:21:21 UTC

[GitHub] [iceberg] kbendick commented on a change in pull request #3719: Spark: support delete_reachable_files procedure

kbendick commented on a change in pull request #3719:
URL: https://github.com/apache/iceberg/pull/3719#discussion_r767194131



##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDeleteReachableFilesProcedure.java
##########
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.extensions;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import static org.apache.iceberg.TableProperties.GC_ENABLED;
+
+public class TestDeleteReachableFilesProcedure extends SparkExtensionsTestBase {
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public TestDeleteReachableFilesProcedure(String catalogName, String implementation, Map<String, String> config) {
+    super(catalogName, implementation, config);
+  }
+
+  @Test
+  public void testDeleteReachableFilesInEmptyTable() throws IOException {
+    if (!createTable()) {
+      return;
+    }
+    Table table = catalog.loadTable(tableIdent);
+    String metadataFileLocation = metadataFileLocation(table);
+
+    catalog.dropTable(tableIdent, false);
+
+    List<Object[]> output = sql(
+        "CALL %s.system.delete_reachable_files('%s')",
+        catalogName, metadataFileLocation);
+    assertEquals("Should delete only other files (metadata json and version_hint file)",
+        ImmutableList.of(row(0L, 0L, 0L, 2L)), output);
+  }
+
+  @Test
+  public void testDeleteReachableFilesInDataFolder() throws IOException {
+    if (!createTable()) {
+      return;
+    }
+
+    sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
+    sql("INSERT INTO TABLE %s VALUES (2, 'b')", tableName);
+
+    Table table = catalog.loadTable(tableIdent);
+    String metadataFileLocation = metadataFileLocation(table);
+
+    catalog.dropTable(tableIdent, false);
+
+    List<Object[]> output = sql(
+        "CALL %s.system.delete_reachable_files('%s')",
+        catalogName, metadataFileLocation);
+    assertEquals("Should delete expected number of files",
+        ImmutableList.of(row(2L, 2L, 2L, 4L)), output);
+  }
+
+  @Test
+  public void testDeleteReachableFilesGCDisabled() throws IOException {
+    if (!createTable()) {
+      return;
+    }
+    sql("ALTER TABLE %s SET TBLPROPERTIES ('%s' 'false')", tableName, GC_ENABLED);
+
+    Table table = catalog.loadTable(tableIdent);
+    String metadataFileLocation = metadataFileLocation(table);
+
+    catalog.dropTable(tableIdent, false);
+    AssertHelpers.assertThrows("Should reject call",
+        ValidationException.class, "Cannot remove files: GC is disabled",
+        () -> sql("CALL %s.system.delete_reachable_files('%s')", catalogName, metadataFileLocation));
+  }
+
+  private String metadataFileLocation(Table tbl) {
+    return ((HasTableOperations) tbl).operations().current().metadataFileLocation();
+  }
+
+  private boolean createTable() throws IOException {
+    if (catalogName.equals("testhadoop") || catalogName.equals("testhive")) {
+      // This procedure cannot work for hadoop catalog,
+      // as after drop table metadata file will be deleted (even with purge=false)
+      // For hive catalog, drop table with purge = true is not cleaning the table in metastore.

Review comment:
       An example of setting the `isHadoopCatalog` variable in that same file: https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java#L43-L49
   
   Notice that it is the same check, but the test reads much more cleanly when you use `Assume.assumeFalse(isHadoopCatalog)`.
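
   For reference, the suggested pattern might look roughly like this (a minimal sketch: the class name and the exact name-based check are my assumptions, not code from this PR):

```java
import java.util.Locale;
import org.junit.Assume;
import org.junit.Test;

// Sketch only: the class name and the name-based check are hypothetical.
public class CatalogTypeAssumptionSketch {

  private final boolean isHadoopCatalog;

  public CatalogTypeAssumptionSketch(String catalogName) {
    // Decide the catalog type once, in the constructor, instead of
    // re-checking the raw catalog name inside every test method.
    this.isHadoopCatalog = isHadoopCatalog(catalogName);
  }

  // Kept in one place so the condition can evolve with the tests.
  static boolean isHadoopCatalog(String catalogName) {
    return "testhadoop".equals(catalogName.toLowerCase(Locale.ROOT));
  }

  @Test
  public void testDeleteReachableFiles() {
    // Reported as skipped (not vacuously passed) for unsupported catalogs.
    Assume.assumeFalse("Hadoop catalog deletes metadata files on drop", isHadoopCatalog);
    // ... rest of the test body ...
  }
}
```

   With `Assume.assumeFalse`, an unsupported catalog shows up as a skipped test in the report rather than as a test that silently passed without running.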

##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDeleteReachableFilesProcedure.java
##########

Review comment:
       Two things:
   
   1) I would use `Assume.assumeFalse("The reason is.... ", isHadoopCatalog);`, and put the Hive check on its own line with its own explanation. Determine in the test constructor whether the catalog is a Hadoop catalog or a Hive catalog, rather than relying on the catalog names inside the tests. Even if the constructor still uses the names to determine the catalog type, a simple boolean field is cleaner, and its condition can be changed as the tests evolve.
   
   I'd also place the assumption at the start of each test, rather than in a nested function, so it's clear why a test is skipped.
   
   Here's an example: https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/sql/TestNamespaceSQL.java#L66-L68
   
   2) But more importantly, see if you can use `temp` to forcefully drop the data in an `@After` method (after each test) so the tests can always run, or otherwise refactor the tests so that the `LOCATION` is different for each test. You might have to save the location returned by `temp.newFolder()` in a private class-level variable so it can be referenced for dropping. I'm not sure. That would ideally remove this precondition; the tests should be able to run for at least the Hive catalog.
   
   And then open an issue for the purge behavior if necessary.
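
   A rough sketch of the per-test `LOCATION` plus `@After` cleanup idea (hypothetical names throughout; note that JUnit's `TemporaryFolder` rule already removes its own root after each test, so the explicit drop is only for clearing table data that would otherwise block a later step):

```java
import java.io.File;
import org.junit.After;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

// Sketch only: class, field, and method names here are hypothetical.
public class PerTestLocationSketch {

  @Rule
  public TemporaryFolder temp = new TemporaryFolder();

  // Saved at creation time so @After knows what to drop.
  private File tableLocation;

  @Test
  public void testWithIsolatedLocation() throws Exception {
    // Each test gets a fresh, distinct LOCATION under temp.
    tableLocation = temp.newFolder();
    // e.g. sql("CREATE TABLE %s (id bigint, data string) USING iceberg LOCATION '%s'",
    //          tableName, tableLocation.getAbsolutePath());
  }

  @After
  public void dropTableData() {
    // Forcefully remove whatever the test wrote, regardless of purge behavior.
    if (tableLocation != null) {
      deleteRecursively(tableLocation);
      tableLocation = null;
    }
  }

  // Plain-Java recursive delete; File.delete only removes empty directories.
  static void deleteRecursively(File file) {
    File[] children = file.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    file.delete();
  }
}
```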

##########
File path: spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDeleteReachableFilesProcedure.java
##########

Review comment:
       But more importantly (and sorry for the long-winded comments, but since you're asking, these are patterns we use a lot in the repo):
   
   I'd suggest getting the temp folder into the correct state in either a `@Before` or an `@After` method, so this can run with as many catalogs as support it. It's usually only `HadoopCatalog` or `spark_catalog` that I see this assumption made for.
   
   So the HiveCatalog assumption gives me a bit of pause, and it makes me wonder whether we can work around it within the test and fix the behavior fully in a later patch (please open a ticket if incorrect behavior is taking place).
   
   Also, have you merged in latest master? Somebody submitted a fix for properly purging data files, which might help. But this seems to be about NOT wanting a purge and having one happen anyway (if my understanding is correct), so that looks like a new issue that needs to be investigated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


