Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/10 07:00:47 UTC

[GitHub] [iceberg] chenjunjiedada opened a new pull request, #4738: Spark: add delete info to rewrite job description

chenjunjiedada opened a new pull request, #4738:
URL: https://github.com/apache/iceberg/pull/4738

   This adds delete info to the Spark job description so that people can see the rewrite effect more conveniently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#issuecomment-1123078157

   @RussellSpitzer, is this change OK for you?



[GitHub] [iceberg] chenjunjiedada commented on pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#issuecomment-1142345112

   @RussellSpitzer How about putting the delete metrics inside the braces? Would that make them look more like metrics instead of results?



[GitHub] [iceberg] singhpk234 commented on a diff in pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#discussion_r868936483


##########
core/src/main/java/org/apache/iceberg/actions/RewriteFileGroup.java:
##########
@@ -60,6 +61,18 @@ public Set<DataFile> rewrittenFiles() {
     return fileScans().stream().map(FileScanTask::file).collect(Collectors.toSet());
   }
 
+  public int rewrittenEqDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.EQUALITY_DELETES))
+        .count();
+  }
+
+  public int rewrittenPosDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.POSITION_DELETES))
+        .count();
+  }

Review Comment:
   [question] A single delete file can be associated with multiple data files; should we do a distinct and then count? Your thoughts?
   For example:
   1. FSTask1 -> DataFile1, PosDeleteFile1
   2. FSTask2 -> DataFile2, PosDeleteFile1
   
   Here we would return 2 as processed deletes, but we actually processed only 1 position delete file, i.e. PosDeleteFile1.
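   To make the double-counting concrete, here is a minimal, self-contained sketch. The types below are hypothetical stand-ins for Iceberg's FileScanTask and DeleteFile, not the real API; it only demonstrates how `count()` over all references differs from `distinct().count()` when two tasks share one delete file.

```java
import java.util.List;

// Simplified stand-ins for FileScanTask and its delete files; these names
// and shapes are hypothetical, not the actual Iceberg API.
public class DistinctDeleteCount {
    record DeleteFile(String path) {}
    record ScanTask(String dataFile, List<DeleteFile> deletes) {}

    // One position delete file referenced by two scan tasks, as in the
    // FSTask1/FSTask2 example above.
    static final DeleteFile POS_DELETE_1 = new DeleteFile("PosDeleteFile1");
    static final List<ScanTask> TASKS = List.of(
        new ScanTask("DataFile1", List.of(POS_DELETE_1)),
        new ScanTask("DataFile2", List.of(POS_DELETE_1)));

    // Counts every reference, double-counting the shared delete file.
    static long referenceCount() {
        return TASKS.stream().flatMap(t -> t.deletes().stream()).count();
    }

    // distinct() collapses the shared file so it is counted once.
    static long distinctCount() {
        return TASKS.stream().flatMap(t -> t.deletes().stream()).distinct().count();
    }

    public static void main(String[] args) {
        System.out.println(referenceCount() + " vs " + distinctCount()); // prints "2 vs 1"
    }
}
```

   Note that in real code the result of `distinct()` depends on the delete-file type's equals/hashCode; here the record delegates to the path.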




[GitHub] [iceberg] chenjunjiedada commented on pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#issuecomment-1142243542

   @RussellSpitzer, I added the `distinct` and renamed the function; could you please take another look?



[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#discussion_r882202951


##########
core/src/main/java/org/apache/iceberg/actions/RewriteFileGroup.java:
##########
@@ -60,6 +61,18 @@ public Set<DataFile> rewrittenFiles() {
     return fileScans().stream().map(FileScanTask::file).collect(Collectors.toSet());
   }
 
+  public int rewrittenEqDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.EQUALITY_DELETES))
+        .count();
+  }
+
+  public int rewrittenPosDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.POSITION_DELETES))
+        .count();
+  }

Review Comment:
   I think @singhpk234 is right here. The equivalent thing we do in the rewrittenFiles code is to collect into a set and then call size on the set.
   
   Unless we are counting just the pure number of references to delete files. In that case I'm not sure what the "rewritten" part of the API name would mean. The other confusing thing for me is that these files are not actually rewritten or deleted.
   
   At the moment I think these would be
   "referencedPositionalDeletes" or something like that
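   The set-based pattern described above could be sketched as follows. The types and the method name `referencedPosDeletes` are hypothetical (borrowing the naming suggestion from this comment), not the real Iceberg API; it just mirrors the rewrittenFiles() approach of collecting into a set and taking its size.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical simplified model; the real Iceberg classes are
// FileScanTask and DeleteFile with a FileContent enum.
public class SetBasedCount {
    record DeleteFile(String path, String content) {}
    record ScanTask(List<DeleteFile> deletes) {}

    // Mirrors the rewrittenFiles() pattern: collect into a Set, which
    // deduplicates delete files shared across tasks, then take its size.
    static int referencedPosDeletes(List<ScanTask> tasks) {
        Set<DeleteFile> posDeletes = tasks.stream()
            .flatMap(t -> t.deletes().stream())
            .filter(d -> d.content().equals("POSITION_DELETES"))
            .collect(Collectors.toSet());
        return posDeletes.size();
    }

    public static void main(String[] args) {
        DeleteFile shared = new DeleteFile("PosDeleteFile1", "POSITION_DELETES");
        // Two tasks referencing the same delete file still count as one.
        List<ScanTask> tasks = List.of(new ScanTask(List.of(shared)),
                                       new ScanTask(List.of(shared)));
        System.out.println(referencedPosDeletes(tasks)); // prints 1
    }
}
```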




[GitHub] [iceberg] chenjunjiedada commented on a diff in pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on code in PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#discussion_r869790958


##########
core/src/main/java/org/apache/iceberg/actions/RewriteFileGroup.java:
##########
@@ -60,6 +61,18 @@ public Set<DataFile> rewrittenFiles() {
     return fileScans().stream().map(FileScanTask::file).collect(Collectors.toSet());
   }
 
+  public int rewrittenEqDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.EQUALITY_DELETES))
+        .count();
+  }
+
+  public int rewrittenPosDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.POSITION_DELETES))
+        .count();
+  }

Review Comment:
   The intention here is to show how many files and deletes are loaded for each job. We can already get the exact commit result from the snapshot summary, so I wouldn't change this.




[GitHub] [iceberg] singhpk234 commented on a diff in pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#discussion_r868939809


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java:
##########
@@ -414,14 +414,14 @@ void validateAndInitOptions() {
   private String jobDesc(RewriteFileGroup group, RewriteExecutionContext ctx) {
     StructLike partition = group.info().partition();
     if (partition.size() > 0) {
-      return String.format("Rewriting %d files (%s, file group %d/%d, %s (%d/%d)) in %s",
-          group.rewrittenFiles().size(),
+      return String.format("Rewriting %d files %d eq deletes %d pos deletes (%s, file group %d/%d, %s (%d/%d)) in %s",
+          group.rewrittenFiles().size(), group.rewrittenEqDeletes(), group.rewrittenEqDeletes(),

Review Comment:
   [minor] Should we add commas in between?
   ```suggestion
         return String.format("Rewriting %d files, %d eq deletes, %d pos deletes (%s, file group %d/%d, %s (%d/%d)) in %s",
             group.rewrittenFiles().size(), group.rewrittenEqDeletes(), group.rewrittenEqDeletes(),
   ```
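   For illustration, here is how the comma-separated string renders with made-up values. This is a standalone sketch, not the PR's code; it also assumes the third count comes from rewrittenPosDeletes(), since the quoted argument list passes rewrittenEqDeletes() twice, which looks like a copy-paste slip.

```java
public class JobDescExample {
    // Renders the suggested format string with made-up counts and partition
    // info; in the real method these come from the RewriteFileGroup and the
    // RewriteExecutionContext.
    static String jobDesc(int files, int eqDeletes, int posDeletes) {
        return String.format(
            "Rewriting %d files, %d eq deletes, %d pos deletes (%s, file group %d/%d, %s (%d/%d)) in %s",
            files, eqDeletes, posDeletes,
            "BINPACK", 1, 4, "partition", 1, 4, "db.table");
    }

    public static void main(String[] args) {
        System.out.println(jobDesc(10, 2, 3));
        // prints "Rewriting 10 files, 2 eq deletes, 3 pos deletes (BINPACK, file group 1/4, partition (1/4)) in db.table"
    }
}
```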



##########
core/src/main/java/org/apache/iceberg/actions/RewriteFileGroup.java:
##########
@@ -60,6 +61,18 @@ public Set<DataFile> rewrittenFiles() {
     return fileScans().stream().map(FileScanTask::file).collect(Collectors.toSet());
   }
 
+  public int rewrittenEqDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.EQUALITY_DELETES))
+        .count();
+  }
+
+  public int rewrittenPosDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.POSITION_DELETES))
+        .count();
+  }

Review Comment:
   [question] A single delete file can be associated with multiple data files; should we do a distinct and then count? Your thoughts?
   For example:
   1. FSTask1 -> DataFile1, PosDeleteFile1
   2. FSTask2 -> DataFile2, PosDeleteFile1
   
   Here we would return 2 as rewritten deletes, but we actually rewrote only 1 position delete file, i.e. PosDeleteFile1.




[GitHub] [iceberg] chenjunjiedada closed pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
chenjunjiedada closed pull request #4738: Spark: add delete info to rewrite job description
URL: https://github.com/apache/iceberg/pull/4738



[GitHub] [iceberg] RussellSpitzer commented on pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#issuecomment-1142321172

   @chenjunjiedada, I'm still wondering a bit whether this should be part of the result, since it's really more of a metric. We don't do anything to these delete files, so is it appropriate to return the count?



[GitHub] [iceberg] chenjunjiedada commented on a diff in pull request #4738: Spark: add delete info to rewrite job description

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on code in PR #4738:
URL: https://github.com/apache/iceberg/pull/4738#discussion_r882208646


##########
core/src/main/java/org/apache/iceberg/actions/RewriteFileGroup.java:
##########
@@ -60,6 +61,18 @@ public Set<DataFile> rewrittenFiles() {
     return fileScans().stream().map(FileScanTask::file).collect(Collectors.toSet());
   }
 
+  public int rewrittenEqDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.EQUALITY_DELETES))
+        .count();
+  }
+
+  public int rewrittenPosDeletes() {
+    return (int) fileScans().stream().flatMap(f -> f.deletes().stream())
+        .filter(d -> d.content().equals(FileContent.POSITION_DELETES))
+        .count();
+  }

Review Comment:
   I reviewed this again and agree that adding `distinct` is needed.



