Posted to issues@iceberg.apache.org by "amogh-jahagirdar (via GitHub)" <gi...@apache.org> on 2023/02/22 21:12:17 UTC

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

amogh-jahagirdar commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114953978


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java:
##########
@@ -292,4 +292,8 @@ public static List<Expression> partitionMapToExpression(
   public static String toColumnName(NamedReference ref) {
     return DOT.join(ref.fieldNames());
   }
+
+  public static boolean caseSensitive(SparkSession spark) {
+    return Boolean.parseBoolean(spark.conf().get("spark.sql.caseSensitive"));
+  }

Review Comment:
     We don't need to take it in this PR, but I see a hardcoded "spark.sql.caseSensitive" in a few places in the Spark code. Is there a good place to define this constant? I know it's a general Spark SQL option and not Iceberg-specific, so `SparkSQLProperties` is probably not the right place.
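
     For illustration, one option that avoids defining the constant in Iceberg at all is to go through Spark's own `SQLConf`, which already owns the "spark.sql.caseSensitive" key. This is just a sketch: it assumes calling `SparkSession.sessionState()` (an API Spark marks as unstable) is acceptable here:

     ```java
     import org.apache.spark.sql.SparkSession;

     public static boolean caseSensitive(SparkSession spark) {
       // SQLConf owns the "spark.sql.caseSensitive" definition upstream,
       // so the key string does not need to be hardcoded in Iceberg.
       return spark.sessionState().conf().caseSensitiveAnalysis();
     }
     ```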



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##########
@@ -91,6 +97,35 @@ public void removeTables() {
     sql("DROP TABLE IF EXISTS deleted_dep");
   }
 
+  @Test
+  public void testDeleteWithoutScanningTable() throws Exception {
+    createAndInitPartitionedTable();
+
+    append(new Employee(1, "hr"), new Employee(3, "hr"));
+    append(new Employee(1, "hardware"), new Employee(2, "hardware"));
+
+    Table table = validationCatalog.loadTable(tableIdent);
+
+    List<String> manifestLocations =
+        table.currentSnapshot().allManifests(table.io()).stream()
+            .map(ManifestFile::path)
+            .collect(Collectors.toList());
+
+    withUnavailableLocations(
+        manifestLocations,
+        () -> {
+          LogicalPlan parsed = parsePlan("DELETE FROM %s WHERE dep = 'hr'", tableName);
+
+          DeleteFromIcebergTable analyzed =
+              (DeleteFromIcebergTable) spark.sessionState().analyzer().execute(parsed);
+          Assert.assertTrue("Should have rewrite plan", analyzed.rewritePlan().isDefined());
+
+          DeleteFromIcebergTable optimized =
+              (DeleteFromIcebergTable) OptimizeMetadataOnlyDeleteFromIcebergTable.apply(analyzed);
+          Assert.assertTrue("Should discard rewrite plan", optimized.rewritePlan().isEmpty());
+        });
+  }

Review Comment:
   Asserting on the plan is good and gets at the crux of the optimization, but should we still actually perform the delete and read the table back, with an assertion on the expected records, as a sanity check?
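
   For illustration, the sanity check could look roughly like this after the `withUnavailableLocations` block, once the manifest locations are readable again. This is a sketch that assumes the `sql`, `row`, and `assertEquals` helpers from the test base are in scope; the expected rows follow from the two `append` calls above:

   ```java
   // Actually perform the delete now that the manifests are available again.
   sql("DELETE FROM %s WHERE dep = 'hr'", tableName);

   // Only the 'hardware' rows should survive the partition-level delete.
   assertEquals(
       "Should have expected rows",
       ImmutableList.of(row(1, "hardware"), row(2, "hardware")),
       sql("SELECT * FROM %s ORDER BY id", tableName));
   ```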



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org