You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/25 08:27:51 UTC

[GitHub] [iceberg] zhangjun0x01 opened a new pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile > targetFileSize

zhangjun0x01 opened a new pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508


   I found some bugs in RewriteDataFilesAction when the size of datafile is larger than targetFileSize
   1. If there is only one datafile which size> targetFileSize in the partition, this datafile will be ignored and will not be rewrite.
   2. The test case in [pr 2196](https://github.com/apache/iceberg/pull/2196) has some logical issues.I fix it and add some comments in it.   
   3. the deletedDataFiles of the RewriteDataFilesActionResult should be deduplicated
   
   @rdblue @RussellSpitzer @openinx could you help me review it ? thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635692446



##########
File path: spark/src/test/java/org/apache/iceberg/actions/TestRewriteDataFilesAction.java
##########
@@ -347,15 +348,19 @@ public void testRewriteDataFilesForLargeFile() throws AnalysisException {
 
     Actions actions = Actions.forTable(table);
 
-    long targetSizeInBytes = maxSizeFile.fileSizeInBytes() - 10;
+    long targetSizeInBytes = maxSizeFile.fileSizeInBytes() / 2;
     RewriteDataFilesActionResult result = actions
         .rewriteDataFiles()
         .targetSizeInBytes(targetSizeInBytes)
         .splitOpenFileCost(1)
         .execute();
 
-    Assert.assertEquals("Action should delete 4 data files", 4, result.deletedDataFiles().size());
-    Assert.assertEquals("Action should add 2 data files", 2, result.addedDataFiles().size());
+    // 1 big datafile and 2 small datafiles
+    Assert.assertEquals("Action should delete 3 data files", 3, result.deletedDataFiles().size());
+
+    // The 2 small datafiles will be rewrote to 1 datafile, the big datafile will be rewrite to 2 datafiles,

Review comment:
       will be rewrote -> will be rewritten
   will be rewrite -> will be rewritten




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635692719



##########
File path: spark/src/test/java/org/apache/iceberg/actions/TestRewriteDataFilesAction.java
##########
@@ -347,15 +348,19 @@ public void testRewriteDataFilesForLargeFile() throws AnalysisException {
 
     Actions actions = Actions.forTable(table);
 
-    long targetSizeInBytes = maxSizeFile.fileSizeInBytes() - 10;
+    long targetSizeInBytes = maxSizeFile.fileSizeInBytes() / 2;
     RewriteDataFilesActionResult result = actions
         .rewriteDataFiles()
         .targetSizeInBytes(targetSizeInBytes)
         .splitOpenFileCost(1)
         .execute();
 
-    Assert.assertEquals("Action should delete 4 data files", 4, result.deletedDataFiles().size());
-    Assert.assertEquals("Action should add 2 data files", 2, result.addedDataFiles().size());
+    // 1 big datafile and 2 small datafiles
+    Assert.assertEquals("Action should delete 3 data files", 3, result.deletedDataFiles().size());

Review comment:
       I actually noticed this when I did the new RewriteDatafile api because this test did fail because the output here isn't quiet correct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangjun0x01 commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
zhangjun0x01 commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635861443



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -277,6 +279,15 @@ private void replaceDataFiles(Iterable<DataFile> deletedDataFiles, Iterable<Data
     }
   }
 
+  private boolean isSingleBigFile(Map.Entry<StructLikeWrapper, Collection<FileScanTask>> kv) {

Review comment:
       Yes,I udpate it




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangjun0x01 commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
zhangjun0x01 commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635824619



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -238,12 +239,13 @@ public RewriteDataFilesActionResult execute() {
     }
 
     List<DataFile> addedDataFiles = rewriteDataForTasks(combinedScanTasks);
-    List<DataFile> currentDataFiles = combinedScanTasks.stream()
+    Set<DataFile> currentDataFiles = combinedScanTasks.stream()

Review comment:
       If we split a large file into multiple small files, the original large file will appear multiple times in `currentDataFiles`, so I use the `set` to remove duplicates




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhangjun0x01 commented on pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
zhangjun0x01 commented on pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#issuecomment-828118230


   the issues of test case in [PR 2196](https://github.com/apache/iceberg/pull/2196):
   
   We set the targetSizeInBytes to filesize of max filesize - 10, but there is no data in the last 10 bytes of this largest file, so in this case the largest file is not  split, and the newly generated file will be the same size as the original file. So I set targetSizeInBytes to half of the maximum file size, so this largest file will be split into two files, and finally three datafiles will be generated
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635689835



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -238,12 +239,13 @@ public RewriteDataFilesActionResult execute() {
     }
 
     List<DataFile> addedDataFiles = rewriteDataForTasks(combinedScanTasks);
-    List<DataFile> currentDataFiles = combinedScanTasks.stream()
+    Set<DataFile> currentDataFiles = combinedScanTasks.stream()

Review comment:
       Why is there a change in this bit of code? I see we are creating a set but why would we have duplicates?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635690206



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -238,12 +239,13 @@ public RewriteDataFilesActionResult execute() {
     }
 
     List<DataFile> addedDataFiles = rewriteDataForTasks(combinedScanTasks);
-    List<DataFile> currentDataFiles = combinedScanTasks.stream()
+    Set<DataFile> currentDataFiles = combinedScanTasks.stream()
         .flatMap(tasks -> tasks.files().stream().map(FileScanTask::file))
-        .collect(Collectors.toList());
-    replaceDataFiles(currentDataFiles, addedDataFiles);
+        .collect(Collectors.toSet());
+    List<DataFile> deletedDataFiles = Lists.newArrayList(currentDataFiles.iterator());

Review comment:
       Do we need .iterator? I thought Lists.newArrayList took iterable as an arg, so set should be fine all by itself, although I'm not sure why we are doing the "set" operation




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635692719



##########
File path: spark/src/test/java/org/apache/iceberg/actions/TestRewriteDataFilesAction.java
##########
@@ -347,15 +348,19 @@ public void testRewriteDataFilesForLargeFile() throws AnalysisException {
 
     Actions actions = Actions.forTable(table);
 
-    long targetSizeInBytes = maxSizeFile.fileSizeInBytes() - 10;
+    long targetSizeInBytes = maxSizeFile.fileSizeInBytes() / 2;
     RewriteDataFilesActionResult result = actions
         .rewriteDataFiles()
         .targetSizeInBytes(targetSizeInBytes)
         .splitOpenFileCost(1)
         .execute();
 
-    Assert.assertEquals("Action should delete 4 data files", 4, result.deletedDataFiles().size());
-    Assert.assertEquals("Action should add 2 data files", 2, result.addedDataFiles().size());
+    // 1 big datafile and 2 small datafiles
+    Assert.assertEquals("Action should delete 3 data files", 3, result.deletedDataFiles().size());

Review comment:
       I actually noticed this when I did the new RewriteDatafile api because this test did fail because the output here isn't quite correct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2508: Core : fix bugs in RewriteDataFilesAction when datafile size greater than targetFileSize

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2508:
URL: https://github.com/apache/iceberg/pull/2508#discussion_r635690823



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -277,6 +279,15 @@ private void replaceDataFiles(Iterable<DataFile> deletedDataFiles, Iterable<Data
     }
   }
 
+  private boolean isSingleBigFile(Map.Entry<StructLikeWrapper, Collection<FileScanTask>> kv) {

Review comment:
       This should be fine just taking a single FileScanTask as an arg right? Since we only call this method on a single FileScan task to see if it is splittable?
   
   I would probably also say 
   isSplittable(FileScanTask file)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org