You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "zhongyujiang (via GitHub)" <gi...@apache.org> on 2023/03/17 15:58:15 UTC

[GitHub] [iceberg] zhongyujiang opened a new pull request, #7132: Core, Spark: Fix delete with filter on nested columns

zhongyujiang opened a new pull request, #7132:
URL: https://github.com/apache/iceberg/pull/7132

   This fixes Spark delete data when using a filter on nested columns. Now such operations will fail because Spark calls `canDeleteUsingMetadata` which uses `StrictMetricsEvaluator` to evaluate whether a file should be completely deleted, however `StrictMetricsEvaluator` doesn't support evaluate on nested columns now, and a NPE will be thrown out, see #7065.
   
   This updates `StrictMetricsEvaluator` to support evaluation on nested columns(only for columns nested in a chain of Struct fileds, will return `ROWS_MIGHT_NOT_MATCH`  if columns are nested in Map or List fields), which solve this problem.
   Fixes #7065.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Core, Spark: Fix delete with filter on nested columns [iceberg]

Posted by "bluzy (via GitHub)" <gi...@apache.org>.
bluzy commented on PR #7132:
URL: https://github.com/apache/iceberg/pull/7132#issuecomment-1870762007

   PTAL @rdblue @RussellSpitzer @aokolnychyi @szehon-ho 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7132: Core, Spark: Fix delete with filter on nested columns

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7132:
URL: https://github.com/apache/iceberg/pull/7132#discussion_r1146647169


##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -489,5 +500,32 @@ private boolean containsNaNsOnly(Integer id) {
           && valueCounts != null
           && nanCounts.get(id).equals(valueCounts.get(id));
     }
+
+    private boolean supportsEvaluate(int fieldId) {
+      Boolean evaluable = canEvaluate.get(fieldId);
+      if (evaluable != null) {
+        return evaluable;
+      }
+
+      evaluable = true;
+      // Cannot evaluate on complex types or repeated primitive types.
+      if (!schema.findType(fieldId).isPrimitiveType()) {
+        evaluable = false;
+      } else {
+        Integer parent = idToParent.get(fieldId);
+        while (parent != null) {
+          Type type = schema.findType(parent);
+          if (type.isListType() || type.isMapType()) {
+            evaluable = false;

Review Comment:
   So you are skipping if list/map type, but allowing struct, if I understand?  I think it makes sense to me, as I feel we have nested column stats.  but definitely like @rdblue @RussellSpitzer @aokolnychyi  to have a sanity check here on the overall direction. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7132: Core, Spark: Fix delete with filter on nested columns

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7132:
URL: https://github.com/apache/iceberg/pull/7132#discussion_r1146647169


##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -489,5 +500,32 @@ private boolean containsNaNsOnly(Integer id) {
           && valueCounts != null
           && nanCounts.get(id).equals(valueCounts.get(id));
     }
+
+    private boolean supportsEvaluate(int fieldId) {
+      Boolean evaluable = canEvaluate.get(fieldId);
+      if (evaluable != null) {
+        return evaluable;
+      }
+
+      evaluable = true;
+      // Cannot evaluate on complex types or repeated primitive types.
+      if (!schema.findType(fieldId).isPrimitiveType()) {
+        evaluable = false;
+      } else {
+        Integer parent = idToParent.get(fieldId);
+        while (parent != null) {
+          Type type = schema.findType(parent);
+          if (type.isListType() || type.isMapType()) {
+            evaluable = false;

Review Comment:
   So you are skipping if list/map type, if I understand?  I think it makes sense to me, as I feel we have nested column stats.  but definitely like @rdblue @RussellSpitzer @aokolnychyi  to have a sanity check here on the overall direction. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #7132: Core, Spark: Fix delete with filter on nested columns

Posted by "zhongyujiang (via GitHub)" <gi...@apache.org>.
zhongyujiang commented on code in PR #7132:
URL: https://github.com/apache/iceberg/pull/7132#discussion_r1147037896


##########
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java:
##########
@@ -489,5 +500,32 @@ private boolean containsNaNsOnly(Integer id) {
           && valueCounts != null
           && nanCounts.get(id).equals(valueCounts.get(id));
     }
+
+    private boolean supportsEvaluate(int fieldId) {
+      Boolean evaluable = canEvaluate.get(fieldId);
+      if (evaluable != null) {
+        return evaluable;
+      }
+
+      evaluable = true;
+      // Cannot evaluate on complex types or repeated primitive types.
+      if (!schema.findType(fieldId).isPrimitiveType()) {
+        evaluable = false;
+      } else {
+        Integer parent = idToParent.get(fieldId);
+        while (parent != null) {
+          Type type = schema.findType(parent);
+          if (type.isListType() || type.isMapType()) {
+            evaluable = false;

Review Comment:
   Yes, your understanding is correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #7132: Core, Spark: Fix delete with filter on nested columns

Posted by "zhongyujiang (via GitHub)" <gi...@apache.org>.
zhongyujiang commented on code in PR #7132:
URL: https://github.com/apache/iceberg/pull/7132#discussion_r1140422212


##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java:
##########
@@ -510,6 +510,22 @@ public void testDeleteWithConditionOnNestedColumn() {
         "Should have expected rows", ImmutableList.of(), sql("SELECT id FROM %s", tableName));
   }
 
+  @Test
+  public void testDeleteWithFilterOnNestedColumn() {
+    createAndInitNestedColumnsTable();
+
+    sql("INSERT INTO TABLE %s VALUES (1, named_struct(\"c1\", 3, \"c2\", \"v1\"))", tableName);
+    sql("INSERT INTO TABLE %s VALUES (2, named_struct(\"c1\", 2, \"c2\", \"v2\"))", tableName);
+
+    sql("DELETE FROM %s WHERE complex.c1 = 3", tableName);

Review Comment:
   Delete conditions in `testDeleteWithConditionOnNestedColumn` can not be push down, so added this UT.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on pull request #7132: Core, Spark: Fix delete with filter on nested columns

Posted by "zhongyujiang (via GitHub)" <gi...@apache.org>.
zhongyujiang commented on PR #7132:
URL: https://github.com/apache/iceberg/pull/7132#issuecomment-1474061266

   @aokolnychyi @rdblue can you help review this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org