Posted to issues@iceberg.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/02/22 00:17:59 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request, #6899: Spark 3.3: Optimize metadata-only DELETEs

aokolnychyi opened a new pull request, #6899:
URL: https://github.com/apache/iceberg/pull/6899

   This PR avoids planning a scan to check whether a metadata-only DELETE is possible if the delete condition always matches entire partitions.
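
   As a hedged illustration (the table name, column names, and catalog setup below are assumptions, not part of the PR): when the table is partitioned by the column used in the DELETE predicate, every matching partition is removed wholesale, so the metadata-only path can be chosen without planning a scan.

```java
import org.apache.spark.sql.SparkSession;

// Illustrative sketch only: assumes a Spark session with an Iceberg catalog already
// configured. The predicate below references only the partition column, so it matches
// entire partitions and the DELETE can be committed as a metadata-only operation.
public class MetadataOnlyDeleteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("metadata-only-delete").getOrCreate();

    spark.sql("CREATE TABLE db.employees (id INT, dep STRING) USING iceberg PARTITIONED BY (dep)");
    spark.sql("INSERT INTO db.employees VALUES (1, 'hr'), (3, 'hr'), (2, 'hardware')");

    // Matches the whole dep = 'hr' partition: no data files need to be scanned or rewritten.
    spark.sql("DELETE FROM db.employees WHERE dep = 'hr'");
  }
}
```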




[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114953978


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java:
##########
@@ -292,4 +292,8 @@ public static List<Expression> partitionMapToExpression(
   public static String toColumnName(NamedReference ref) {
     return DOT.join(ref.fieldNames());
   }
+
+  public static boolean caseSensitive(SparkSession spark) {
+    return Boolean.parseBoolean(spark.conf().get("spark.sql.caseSensitive"));

Review Comment:
     We don't need to address it in this PR, but I see "spark.sql.caseSensitive" hardcoded in a few places in the Spark code. Is there a good place to define this constant? I know it's a general Spark SQL option and not Iceberg-specific, so `SparkSQLProperties` is probably not the right place.
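
   (For illustration only, not from the PR: one way the key could be pulled into a named constant next to the helper; the constant name and the explicit default are assumptions. Spark also exposes the same setting programmatically via `SQLConf`, which would avoid the raw string entirely.)

```java
import org.apache.spark.sql.SparkSession;

// Sketch: keep the config key in one private constant beside the helper so callers
// never repeat the raw string. The names and the "false" default are illustrative.
public class SparkUtil {
  private static final String SPARK_SQL_CASE_SENSITIVE = "spark.sql.caseSensitive";

  private SparkUtil() {}

  public static boolean caseSensitive(SparkSession spark) {
    return Boolean.parseBoolean(spark.conf().get(SPARK_SQL_CASE_SENSITIVE, "false"));
  }
}
```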



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##########
@@ -91,6 +97,35 @@ public void removeTables() {
     sql("DROP TABLE IF EXISTS deleted_dep");
   }
 
+  @Test
+  public void testDeleteWithoutScanningTable() throws Exception {
+    createAndInitPartitionedTable();
+
+    append(new Employee(1, "hr"), new Employee(3, "hr"));
+    append(new Employee(1, "hardware"), new Employee(2, "hardware"));
+
+    Table table = validationCatalog.loadTable(tableIdent);
+
+    List<String> manifestLocations =
+        table.currentSnapshot().allManifests(table.io()).stream()
+            .map(ManifestFile::path)
+            .collect(Collectors.toList());
+
+    withUnavailableLocations(
+        manifestLocations,
+        () -> {
+          LogicalPlan parsed = parsePlan("DELETE FROM %s WHERE dep = 'hr'", tableName);
+
+          DeleteFromIcebergTable analyzed =
+              (DeleteFromIcebergTable) spark.sessionState().analyzer().execute(parsed);
+          Assert.assertTrue("Should have rewrite plan", analyzed.rewritePlan().isDefined());
+
+          DeleteFromIcebergTable optimized =
+              (DeleteFromIcebergTable) OptimizeMetadataOnlyDeleteFromIcebergTable.apply(analyzed);
+          Assert.assertTrue("Should discard rewrite plan", optimized.rewritePlan().isEmpty());

Review Comment:
   Asserting on the plan is good and gets at the crux of the optimization, but should we still actually perform the delete and read the table back, asserting on the expected records as a sanity check?
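
   (A minimal sketch of the suggested sanity check, assuming the surrounding test class and its `sql`/`assertEquals`/`row` helpers; the expected rows follow from the two appends above, and the exact placement within the test is left open.)

```java
// Run the delete for real and verify the surviving rows; rows with dep = 'hr' should be gone.
sql("DELETE FROM %s WHERE dep = 'hr'", tableName);
assertEquals(
    "Should have expected rows after the metadata-only delete",
    ImmutableList.of(row(1, "hardware"), row(2, "hardware")),
    sql("SELECT * FROM %s ORDER BY id", tableName));
```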





[GitHub] [iceberg] aokolnychyi commented on pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#issuecomment-1440899010

   Thanks for reviewing, @flyrain @singhpk234 @amogh-jahagirdar!




[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114947065


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -314,6 +318,10 @@ private boolean canDeleteUsingMetadata(Expression deleteExpr) {
     }
   }
 
+  private boolean isCaseSensitive() {
+    return Boolean.parseBoolean(sparkSession().conf().get("spark.sql.caseSensitive"));

Review Comment:
   Changed this bit back to have a local variable.





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114966979


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##########
@@ -91,6 +97,35 @@ public void removeTables() {
     sql("DROP TABLE IF EXISTS deleted_dep");
   }
 
+  @Test
+  public void testDeleteWithoutScanningTable() throws Exception {
+    createAndInitPartitionedTable();
+
+    append(new Employee(1, "hr"), new Employee(3, "hr"));
+    append(new Employee(1, "hardware"), new Employee(2, "hardware"));
+
+    Table table = validationCatalog.loadTable(tableIdent);
+
+    List<String> manifestLocations =
+        table.currentSnapshot().allManifests(table.io()).stream()
+            .map(ManifestFile::path)
+            .collect(Collectors.toList());
+
+    withUnavailableLocations(
+        manifestLocations,
+        () -> {
+          LogicalPlan parsed = parsePlan("DELETE FROM %s WHERE dep = 'hr'", tableName);
+
+          DeleteFromIcebergTable analyzed =
+              (DeleteFromIcebergTable) spark.sessionState().analyzer().execute(parsed);
+          Assert.assertTrue("Should have rewrite plan", analyzed.rewritePlan().isDefined());
+
+          DeleteFromIcebergTable optimized =
+              (DeleteFromIcebergTable) OptimizeMetadataOnlyDeleteFromIcebergTable.apply(analyzed);
+          Assert.assertTrue("Should discard rewrite plan", optimized.rewritePlan().isEmpty());

Review Comment:
   That's a good idea, I added a call below.





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114872889


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -274,18 +275,21 @@ public boolean canDeleteWhere(Filter[] filters) {
       }
     }
 
-    return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
+    return selectsPartitions(deleteExpr) || canDeleteUsingMetadata(deleteExpr);

Review Comment:
   I'd keep the simpler return statement in this case, as the `selectsPartitions` call should be fairly cheap for a true literal. Previously, we wanted to avoid planning in some obvious cases; the new check should be near-instant.





[GitHub] [iceberg] aokolnychyi merged pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi merged PR #6899:
URL: https://github.com/apache/iceberg/pull/6899




[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114995708


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java:
##########
@@ -292,4 +292,8 @@ public static List<Expression> partitionMapToExpression(
   public static String toColumnName(NamedReference ref) {
     return DOT.join(ref.fieldNames());
   }
+
+  public static boolean caseSensitive(SparkSession spark) {
+    return Boolean.parseBoolean(spark.conf().get("spark.sql.caseSensitive"));

Review Comment:
   Ah, I see, I missed that this is public; then that should be good!





[GitHub] [iceberg] singhpk234 commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1113747203


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -274,18 +275,21 @@ public boolean canDeleteWhere(Filter[] filters) {
       }
     }
 
-    return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
+    return selectsPartitions(deleteExpr) || canDeleteUsingMetadata(deleteExpr);

Review Comment:
   [minor] Should we still keep `deleteExpr == Expressions.alwaysTrue()`? It would let us skip the `selectsPartitions` call as well.
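
   (For reference, the variant being discussed; per the author's reply elsewhere in this thread, the merged code keeps the two-clause form shown in the diff because `selectsPartitions` is near-instant for a true literal.)

```java
// Variant with the extra short-circuit kept (discussion only, not the merged change):
return deleteExpr == Expressions.alwaysTrue()
    || selectsPartitions(deleteExpr)
    || canDeleteUsingMetadata(deleteExpr);
```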





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114905581


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -314,6 +318,10 @@ private boolean canDeleteUsingMetadata(Expression deleteExpr) {
     }
   }
 
+  private boolean isCaseSensitive() {
+    return Boolean.parseBoolean(sparkSession().conf().get("spark.sql.caseSensitive"));

Review Comment:
   I also thought about that, but it would mean initializing the `SparkSession` variable in the constructor. I don't recall why we decided to avoid that initially, but I'd probably keep the current logic to prevent surprises.





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114946716


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -274,18 +275,21 @@ public boolean canDeleteWhere(Filter[] filters) {
       }
     }
 
-    return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
+    return selectsPartitions(deleteExpr) || canDeleteUsingMetadata(deleteExpr);
+  }
+
+  private boolean selectsPartitions(Expression expr) {

Review Comment:
   Added a utility that checks whether an expression selects entire partitions in all specs.
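
   (A rough sketch of what such a check might look like, not the merged utility: it relies on Iceberg's public `Projections` API and `Expression.isEquivalentTo`; the class and method names are placeholders.)

```java
import java.util.Map;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Projections;

// Sketch: an expression selects entire partitions under a spec when its inclusive and
// strict projections onto that spec agree. Checking every spec covers data files that
// were written under older partition specs of the same table.
final class PartitionSelectionCheck {
  private PartitionSelectionCheck() {}

  static boolean selectsPartitions(Table table, Expression expr, boolean caseSensitive) {
    for (Map.Entry<Integer, PartitionSpec> entry : table.specs().entrySet()) {
      PartitionSpec spec = entry.getValue();
      Expression inclusive = Projections.inclusive(spec, caseSensitive).project(expr);
      Expression strict = Projections.strict(spec, caseSensitive).project(expr);
      // isEquivalentTo is a best-effort structural comparison; if the projections differ,
      // the expression may only partially match some partitions under this spec.
      if (!inclusive.isEquivalentTo(strict)) {
        return false;
      }
    }
    return true;
  }
}
```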





[GitHub] [iceberg] flyrain commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1113726942


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##########
@@ -91,6 +97,35 @@ public void removeTables() {
     sql("DROP TABLE IF EXISTS deleted_dep");
   }
 
+  @Test
+  public void testDeleteWithoutScanningTable() throws Exception {
+    createAndInitPartitionedTable();
+
+    append(new Employee(1, "hr"), new Employee(3, "hr"));
+    append(new Employee(1, "hardware"), new Employee(2, "hardware"));
+
+    Table table = validationCatalog.loadTable(tableIdent);
+
+    List<String> manifestLocations =
+        table.currentSnapshot().allManifests(table.io()).stream()
+            .map(ManifestFile::path)
+            .collect(Collectors.toList());
+
+    withUnavailableLocations(
+        manifestLocations,
+        () -> {
+          LogicalPlan parsed = parsePlan("DELETE FROM %s WHERE dep = 'hr'", tableName);
+
+          DeleteFromIcebergTable analyzed =
+              (DeleteFromIcebergTable) spark.sessionState().analyzer().execute(parsed);
+          Assert.assertTrue("Should have rewrite plan", analyzed.rewritePlan().isDefined());
+
+          DeleteFromIcebergTable optimized =
+              (DeleteFromIcebergTable) OptimizeMetadataOnlyDeleteFromIcebergTable.apply(analyzed);
+          Assert.assertTrue("Should discard rewrite plan", optimized.rewritePlan().isEmpty());

Review Comment:
   +1





[GitHub] [iceberg] singhpk234 commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1113735659


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -314,6 +318,10 @@ private boolean canDeleteUsingMetadata(Expression deleteExpr) {
     }
   }
 
+  private boolean isCaseSensitive() {
+    return Boolean.parseBoolean(sparkSession().conf().get("spark.sql.caseSensitive"));

Review Comment:
   [minor] Should we make this an instance variable?



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -274,18 +275,21 @@ public boolean canDeleteWhere(Filter[] filters) {
       }
     }
 
-    return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
+    return selectsPartitions(deleteExpr) || canDeleteUsingMetadata(deleteExpr);
+  }
+
+  private boolean selectsPartitions(Expression expr) {

Review Comment:
   +1. Since it's a disjunction, `canDeleteUsingMetadata` will not be called when `selectsPartitions` returns true.
   
   [question] Could we move `selectsPartitions` into a utility, for example `SparkUtil`, so that `SparkScanBuilder` can use the same logic? Here we check that all expressions select partitions, while `SparkScanBuilder` checks whether at least one does not.





[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1113699927


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##########
@@ -274,18 +275,21 @@ public boolean canDeleteWhere(Filter[] filters) {
       }
     }
 
-    return deleteExpr == Expressions.alwaysTrue() || canDeleteUsingMetadata(deleteExpr);
+    return selectsPartitions(deleteExpr) || canDeleteUsingMetadata(deleteExpr);
+  }
+
+  private boolean selectsPartitions(Expression expr) {

Review Comment:
   It is a similar trick to what we do in `SparkScanBuilder`.





[GitHub] [iceberg] aokolnychyi commented on pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#issuecomment-1439260724

   cc @amogh-jahagirdar @singhpk234 @rdblue @karuppayya @RussellSpitzer @flyrain @szehon-ho 




[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #6899: Spark 3.3: Optimize metadata-only DELETEs

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #6899:
URL: https://github.com/apache/iceberg/pull/6899#discussion_r1114967611


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java:
##########
@@ -292,4 +292,8 @@ public static List<Expression> partitionMapToExpression(
   public static String toColumnName(NamedReference ref) {
     return DOT.join(ref.fieldNames());
   }
+
+  public static boolean caseSensitive(SparkSession spark) {
+    return Boolean.parseBoolean(spark.conf().get("spark.sql.caseSensitive"));

Review Comment:
   I tried to partially address that by introducing this utility call and migrating all other places to use it.


