Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/25 23:10:30 UTC

[GitHub] [iceberg] huaxingao opened a new pull request, #5638: Bind overwrite filters

huaxingao opened a new pull request, #5638:
URL: https://github.com/apache/iceberg/pull/5638

   In `overwrite(Filter[] filters)`, I think we need to bind the filters to verify that the columns they reference are valid, and throw an exception if they are not.
   
   I think we probably don't need a similar check in [`deleteWhere(Filter[] filters)`](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L310), because before that method is called, [`canDeleteWhere(Filter[] filters)`](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L254) has already been called. In `canDeleteWhere(Filter[] filters)`, bind is called from [`canDeleteUsingMetadata`](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L269).
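
   The call ordering described above can be sketched with hypothetical stand-in types. These are not the actual Spark or Iceberg classes — just an illustration of why a second bind in `deleteWhere` would be redundant when Spark always calls `canDeleteWhere` first:

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical stand-in for Spark's Filter: references a single column.
class FilterStub {
  final String column;

  FilterStub(String column) {
    this.column = column;
  }
}

class DeleteFlowSketch {
  private final Set<String> validColumns;

  DeleteFlowSketch(Set<String> validColumns) {
    this.validColumns = validColumns;
  }

  // Mirrors canDeleteWhere: binding happens here, so filters referencing
  // unknown columns are rejected before any delete is attempted.
  boolean canDeleteWhere(FilterStub[] filters) {
    return Arrays.stream(filters).allMatch(f -> validColumns.contains(f.column));
  }

  // Mirrors deleteWhere: by the DSv2 contract, Spark only calls this after
  // canDeleteWhere returned true, so the filters are already known to bind
  // and no re-binding is needed here.
  void deleteWhere(FilterStub[] filters) {
    // perform the delete using already-validated filters
  }
}
```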


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5638: Bind overwrite filters

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5638:
URL: https://github.com/apache/iceberg/pull/5638#discussion_r956790144


##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java:
##########
@@ -47,14 +55,39 @@ public void removeTable() {
 
   @Test
   public void testTableEquality() throws NoSuchTableException {
-    CatalogManager catalogManager = spark.sessionState().catalogManager();
-    TableCatalog catalog = (TableCatalog) catalogManager.catalog(catalogName);
-    Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name());
-    SparkTable table1 = (SparkTable) catalog.loadTable(identifier);
-    SparkTable table2 = (SparkTable) catalog.loadTable(identifier);
-
+    SparkTable table1 = loadTable();
+    SparkTable table2 = loadTable();
     // different instances pointing to the same table must be equivalent
     Assert.assertNotSame("References must be different", table1, table2);
     Assert.assertEquals("Tables must be equivalent", table1, table2);
   }
+
+  @Test
+  public void testOverwriteFilterConversions() throws NoSuchTableException {

Review Comment:
   When does this happen in practice? Can we have a test that mirrors a real-world situation a bit more closely?




[GitHub] [iceberg] huaxingao commented on pull request #5638: Bind overwrite filters

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #5638:
URL: https://github.com/apache/iceberg/pull/5638#issuecomment-1228794550

   cc @aokolnychyi @szehon-ho Could you please take a look when you have time? Thanks a lot!



[GitHub] [iceberg] zinking commented on pull request #5638: Bind overwrite filters

Posted by GitBox <gi...@apache.org>.
zinking commented on PR #5638:
URL: https://github.com/apache/iceberg/pull/5638#issuecomment-1229652115

   But why would these columns be invalid in the first place? Didn't they already pass SQL validation?



[GitHub] [iceberg] huaxingao commented on a diff in pull request #5638: Bind overwrite filters

Posted by GitBox <gi...@apache.org>.
huaxingao commented on code in PR #5638:
URL: https://github.com/apache/iceberg/pull/5638#discussion_r958586790


##########
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java:
##########
@@ -47,14 +55,39 @@ public void removeTable() {
 
   @Test
   public void testTableEquality() throws NoSuchTableException {
-    CatalogManager catalogManager = spark.sessionState().catalogManager();
-    TableCatalog catalog = (TableCatalog) catalogManager.catalog(catalogName);
-    Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name());
-    SparkTable table1 = (SparkTable) catalog.loadTable(identifier);
-    SparkTable table2 = (SparkTable) catalog.loadTable(identifier);
-
+    SparkTable table1 = loadTable();
+    SparkTable table2 = loadTable();
     // different instances pointing to the same table must be equivalent
     Assert.assertNotSame("References must be different", table1, table2);
     Assert.assertEquals("Tables must be equivalent", table1, table2);
   }
+
+  @Test
+  public void testOverwriteFilterConversions() throws NoSuchTableException {

Review Comment:
   Thanks a lot for taking a look at this PR!
   
   I looked at the real-world usage (`INSERT OVERWRITE` or `DataFrameWriterV2.overwrite`) and realized that Spark will throw `AnalysisException` if the overwrite filters reference invalid columns. So there is no need to bind the filters, and I will close this PR.
   
   The reason for this PR was to address this [comment](https://github.com/apache/iceberg/pull/5302#discussion_r950580132). Since there is no longer a need to bind the filters in `SparkFilters.convert(Filter[] filters)`, I will add back `SparkV2Filters.convert(Predicate[] predicates)`.
   
   I am also wondering whether this [bind](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L122) is needed. If a filter expression references invalid columns, Spark throws `AnalysisException` before it reaches this point. Shall I remove this bind?
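
   For context, the bind in question validates each filter's column references against the table schema, honoring case sensitivity. A toy sketch of that resolution step — using hypothetical helper names, not the actual Iceberg binding implementation:

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

class BindSketch {
  // Resolves a referenced column name against the schema's column names,
  // case-sensitively or not — the essence of what binding validates.
  // An empty result is what would make the bind (and the query) fail.
  static Optional<String> resolve(String ref, List<String> schemaColumns, boolean caseSensitive) {
    for (String col : schemaColumns) {
      boolean match = caseSensitive
          ? col.equals(ref)
          : col.toLowerCase(Locale.ROOT).equals(ref.toLowerCase(Locale.ROOT));
      if (match) {
        return Optional.of(col);
      }
    }
    return Optional.empty();
  }
}
```

   If Spark's analyzer has already performed this resolution, repeating it in the data source only guards against callers that invoke the DSv2 API directly.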




[GitHub] [iceberg] rdblue commented on a diff in pull request #5638: Bind overwrite filters

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5638:
URL: https://github.com/apache/iceberg/pull/5638#discussion_r956790093


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java:
##########
@@ -119,7 +121,24 @@ public WriteBuilder overwrite(Filter[] filters) {
         !overwriteFiles, "Cannot overwrite individual files and using filters");
     Preconditions.checkState(rewrittenFileSetId == null, "Cannot overwrite and rewrite");
 
-    this.overwriteExpr = SparkFilters.convert(filters);
+    boolean caseSensitive = Boolean.parseBoolean(spark.conf().get("spark.sql.caseSensitive"));
+    overwriteExpr = Expressions.alwaysTrue();
+    for (Filter filter : filters) {

Review Comment:
   Can you update the `convert(Filter[])` method rather than moving so much of the logic here?
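
   One way to follow that suggestion — keeping the per-filter conversion and validation inside an overloaded `convert` that takes a case-sensitivity flag, instead of open-coding the loop in `SparkWriteBuilder` — could look roughly like the following. The types here are simplified stand-ins, not the real Spark/Iceberg classes:

```java
import java.util.Locale;
import java.util.Map;

// Simplified stand-in for Spark's Filter: one column reference plus the
// boolean value its converted expression would evaluate to.
record FilterSketch(String column, boolean value) {}

class SparkFiltersSketch {
  // Mirrors the proposed convert(Filter[], caseSensitive): validates each
  // referenced column against the schema (the "bind" step), then ANDs the
  // converted expressions together, starting from alwaysTrue.
  static boolean convert(FilterSketch[] filters, Map<String, String> schema, boolean caseSensitive) {
    boolean expr = true; // stands in for Expressions.alwaysTrue()
    for (FilterSketch f : filters) {
      String col = caseSensitive ? f.column() : f.column().toLowerCase(Locale.ROOT);
      if (!schema.containsKey(col)) {
        throw new IllegalArgumentException("Cannot find column: " + f.column());
      }
      expr = expr && f.value(); // stands in for Expressions.and(expr, converted)
    }
    return expr;
  }
}
```

   The caller would then stay a one-liner, and the validation logic would live next to the conversion it guards.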


