You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/07/04 12:13:35 UTC

[GitHub] [iceberg] shidayang opened a new pull request, #5195: Make predicates of delete only initiate once

shidayang opened a new pull request, #5195:
URL: https://github.com/apache/iceberg/pull/5195

   DeleteFilter initialize 'delete rows' multiple times  when call DeleteFilter#filter method multiple times,  This cause performance of MOR is very low in trino.
   In my case, It spend 8 minutes when I query a table, after optimize it only spend 20 seconds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r912961202


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -192,13 +196,10 @@ protected void markRowDeleted(T item) {
   }
 
   public Predicate<T> eqDeletedRowFilter() {
-    if (eqDeleteRows == null) {

Review Comment:
   Do we need to change this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918501331


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -192,13 +196,10 @@ protected void markRowDeleted(T item) {
   }
 
   public Predicate<T> eqDeletedRowFilter() {
-    if (eqDeleteRows == null) {

Review Comment:
    I get you



##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -70,7 +70,7 @@
   private final int isDeletedColumnPosition;
 
   private PositionDeleteIndex deleteRowPositions = null;
-  private Predicate<T> eqDeleteRows = null;

Review Comment:
   ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918506795


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -132,8 +132,12 @@ public CloseableIterable<T> filter(CloseableIterable<T> records) {
   }
 
   private List<Predicate<T>> applyEqDeletes() {
+    if (eqDeletePredicates != null) {
+      return eqDeletePredicates;
+    }
     List<Predicate<T>> isInDeleteSets = Lists.newArrayList();
     if (eqDeletes.isEmpty()) {
+      this.eqDeletePredicates = isInDeleteSets;

Review Comment:
   Bacause if encounter some exception when loading delete file, it is difficult to judge weather had loaded delete file complete when next retry



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918506795


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -132,8 +132,12 @@ public CloseableIterable<T> filter(CloseableIterable<T> records) {
   }
 
   private List<Predicate<T>> applyEqDeletes() {
+    if (eqDeletePredicates != null) {
+      return eqDeletePredicates;
+    }
     List<Predicate<T>> isInDeleteSets = Lists.newArrayList();
     if (eqDeletes.isEmpty()) {
+      this.eqDeletePredicates = isInDeleteSets;

Review Comment:
   Bacause if encounter some exception when loading delete file, it is difficult to judge weather had loaded delete file complete when next retry



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1173786593

   The DeleteFilter#filter is called once for each scan task, it is created as a new object. How do you call it multiple times with the same DeleteFilter object? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r917884758


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -192,13 +196,10 @@ protected void markRowDeleted(T item) {
   }
 
   public Predicate<T> eqDeletedRowFilter() {
-    if (eqDeleteRows == null) {

Review Comment:
   I just found out that we don't use this function right now. How about directly using the original function like this: 
   ``` java
   private CloseableIterable<T> applyEqDeletes(CloseableIterable<T> records) {
     Filter<T> remainingRowsFilter = new Filter<T>() {
       @Override
       protected boolean shouldKeep(T item) {
         return eqDeletedRowFilter().test(item);
       }
     };
   
     return remainingRowsFilter.filter(records);
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918491734


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -192,13 +196,10 @@ protected void markRowDeleted(T item) {
   }
 
   public Predicate<T> eqDeletedRowFilter() {
-    if (eqDeleteRows == null) {
-      eqDeleteRows = applyEqDeletes().stream()
-          .map(Predicate::negate)
-          .reduce(Predicate::and)
-          .orElse(t -> true);
-    }
-    return eqDeleteRows;
+    return applyEqDeletes().stream()
+            .map(Predicate::negate)
+            .reduce(Predicate::and)
+            .orElse(t -> true);

Review Comment:
   Please fix indentation. Continuation indents are 4 spaces. Rolling back this change should fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1181899491

   Thanks, @shidayang!
   
   I think this is a good idea for cases where the delete filter is reused. In Spark, this shouldn't affect how long deletes are held in memory because both the filter iterator and the filter are held by the reader. Since there's no negative effect, but there's a very positive effect if this is reused, I think it's a good idea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] lhofhansl commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
lhofhansl commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1180472518

   Note that the same can be done in Trino without any API change in Iceberg. See https://github.com/trinodb/trino/pull/13112.
   (But I do not care where we fix it.)
   
   Always caching delete filters might be detrimental to Spark, which does not require the filters to be cached.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1182681881

   Thanks for your review, @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #5195:
URL: https://github.com/apache/iceberg/pull/5195


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918492578


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -132,8 +132,12 @@ public CloseableIterable<T> filter(CloseableIterable<T> records) {
   }
 
   private List<Predicate<T>> applyEqDeletes() {
+    if (eqDeletePredicates != null) {
+      return eqDeletePredicates;
+    }
     List<Predicate<T>> isInDeleteSets = Lists.newArrayList();
     if (eqDeletes.isEmpty()) {
+      this.eqDeletePredicates = isInDeleteSets;

Review Comment:
   I think a better solution is to make `isInDeleteSets` a private field in place of `eqDeletePredicates`. You should just rename the new `eqDeletePredicates` field to `isInDeleteSets` and set it on the line above instead of creating a local variable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918491636


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -192,13 +196,10 @@ protected void markRowDeleted(T item) {
   }
 
   public Predicate<T> eqDeletedRowFilter() {
-    if (eqDeleteRows == null) {

Review Comment:
   I don't think this needs to change. The unified predicate and the list of predicates can coexist.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1181206541

   @lhofhansl For my understanding, could you elaborate on the cases where caching delete filters might be detrimental to Spark? Do you mean in terms of retaining the filters in memory? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918506795


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -132,8 +132,12 @@ public CloseableIterable<T> filter(CloseableIterable<T> records) {
   }
 
   private List<Predicate<T>> applyEqDeletes() {
+    if (eqDeletePredicates != null) {
+      return eqDeletePredicates;
+    }
     List<Predicate<T>> isInDeleteSets = Lists.newArrayList();
     if (eqDeletes.isEmpty()) {
+      this.eqDeletePredicates = isInDeleteSets;

Review Comment:
   Bacause if encounter some exception when loading delete file, it is difficult to judge weather loaded delete file complete when next retry



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r918490842


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -70,7 +70,7 @@
   private final int isDeletedColumnPosition;
 
   private PositionDeleteIndex deleteRowPositions = null;
-  private Predicate<T> eqDeleteRows = null;

Review Comment:
   I don't see a need to remove this. Can you add it back?



##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -132,8 +132,12 @@ public CloseableIterable<T> filter(CloseableIterable<T> records) {
   }
 
   private List<Predicate<T>> applyEqDeletes() {
+    if (eqDeletePredicates != null) {
+      return eqDeletePredicates;
+    }

Review Comment:
   Please add an empty newline after this block to follow style guidelines.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1182010739

   Merged. Thanks for catching this, @shidayang!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] lhofhansl commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
lhofhansl commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1181469377

   > could you elaborate on the cases where caching delete filters might be detrimental to Spark? Do you mean in terms of retaining the filters in memory?
   
   Yep. We'd be losing out on the opportunity to not materialize the complete filter as a set in memory.
   (see: https://github.com/apache/iceberg/blob/6cc4a198c56c05ff103e6ecdf75fe50004af19da/data/src/main/java/org/apache/iceberg/data/DeleteFilter.java#L231)
   
   An alternative is to add the same API I added to TrinoDeleteFilter (which extends DeleteFilter<TrinoRow>) in the Trino PR, namely a `filterCached` method (could call it cacheAndFilter as well). That way we can leave the Spark code in place and Trino can use the caching method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#issuecomment-1175742091

   @chenjunjiedada In trino query engine every "page"  will call DeleteFilter#filter, The "The DeleteFilter#filter is called once for each scan task" is not guaranteed。


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shidayang commented on a diff in pull request #5195: Make predicates of delete only initialize once

Posted by GitBox <gi...@apache.org>.
shidayang commented on code in PR #5195:
URL: https://github.com/apache/iceberg/pull/5195#discussion_r914441340


##########
data/src/main/java/org/apache/iceberg/data/DeleteFilter.java:
##########
@@ -192,13 +196,10 @@ protected void markRowDeleted(T item) {
   }
 
   public Predicate<T> eqDeletedRowFilter() {
-    if (eqDeleteRows == null) {

Review Comment:
   Yes,The function to avoid initialize 'delete row' of eqDeleteRows is the same as eqDeletePredicates.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org