Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/13 07:06:55 UTC

[GitHub] [iceberg] xloya opened a new pull request #3724: [Spark][Core]: Expired data files that read 0 pieces of data during rewriting

xloya opened a new pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724


   When rewriting, the number of rows read from a data file by `Spark3BinPackStrategy` through `SparkSession.read()` may be 0 after the delete files are applied. In that case, `FileRewriteCoordinator.stageRewrite()` throws an `IllegalArgumentException` ("Cannot stage null or empty file set") during the rewrite.
   For such data files, if the split_offsets list is missing or contains only one element, the file can be expired directly, which speeds up subsequent queries.
   cc @RussellSpitzer @rdblue @kbendick @jackye1995 
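
   A minimal sketch of the idea described above (the helper name and placement are illustrative assumptions, not the actual patch; `DataFile.splitOffsets()` is the field referred to as split_offsets):
   ```java
   import java.util.List;

   import org.apache.iceberg.DataFile;

   class RewriteSkipCheck {
     // Hypothetical helper: a file with no split offsets (or a single offset) is read
     // as one task, so if that read returns zero rows after applying the delete files,
     // the data file can simply be expired instead of being rewritten.
     static boolean canExpireDirectly(DataFile file, long rowsReadAfterDeletes) {
       List<Long> offsets = file.splitOffsets();
       boolean singleSplit = offsets == null || offsets.size() <= 1;
       return singleSplit && rowsReadAfterDeletes == 0;
     }
   }
   ```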


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993081828


   > If I understand correctly, the goal here is to fix RewriteData so that files which have been made completely empty because of deletes being applied should be removed. Currently this is an issue because the core RewriteOperation doesn't allow addedFiles to be empty.
   
   Yeah, I found that when delete files in a v2 table delete all the records of a data file, the rewrite fails.




[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r767765801



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -1055,12 +1265,11 @@ protected Table createTable(int files) {
     return table;
   }
 
-  protected Table createTablePartitioned(int partitions, int files) {
+  protected Table createTablePartitioned(int partitions, int files, Map<String, String> options) {

Review comment:
       Instead of adding a parameter and changing all the usages, it may be cleaner to add another method
   ```java
   protected Table createTablePartitioned(int partitions, int files, Map<String, String> options) {
     // ... existing implementation body ...
   }
   ```
   
   And keep a version 
   ```java
    protected Table createTablePartitioned(int partitions, int files) {
       return createTablePartitioned(partitions, files, Maps.newHashMap());
    }
   ```






[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993105333


   @RussellSpitzer @jackye1995 Hey, I will modify the implementation first to remove the precondition check in `FileRewriteCoordinator.stageRewrite()`, but I'm not sure whether there are other situations that would cause no new files to be generated. Do you know of any other use cases that would produce this situation?




[GitHub] [iceberg] xloya removed a comment on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya removed a comment on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993106658


   > @RussellSpitzer @jackye1995 Hey, I will modify the implementation first to remove the precondition check in `FileRewriteCoordinator.stageRewrite()`, but I'm not sure whether there are other situations that would cause no new files to be generated. Do you know of any other use cases that would produce this situation?
   
   Or do we need to distinguish between v1 and v2 tables here?




[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r767773607



##########
File path: core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitManager.java
##########
@@ -79,11 +83,34 @@ public void commitFileGroups(Set<RewriteFileGroup> fileGroups) {
     }
 
     RewriteFiles rewrite = table.newRewrite().validateFromSnapshot(startingSnapshotId);
-    if (useStartingSequenceNumber) {
-      long sequenceNumber = table.snapshot(startingSnapshotId).sequenceNumber();
-      rewrite.rewriteFiles(rewrittenDataFiles, addedDataFiles, sequenceNumber);
+
+    if (addedDataFiles.size() > 0) {

Review comment:
    I'm not sure why we are making the changes here; shouldn't we just remove the precondition check at
   https://github.com/apache/iceberg/blob/f5a753791f4dc6aca78569a14f731feda9edf462/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/FileRewriteCoordinator.java#L58 ?
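
   For reference, the check in question is roughly of this shape (a paraphrase based on the error message quoted in the PR description, not a verbatim copy of the linked source):
   ```java
   // Sketch of FileRewriteCoordinator.stageRewrite() with the precondition the
   // reviewer suggests removing (signature and body paraphrased, an assumption):
   public void stageRewrite(Table table, String fileSetID, Set<DataFile> newDataFiles) {
     Preconditions.checkArgument(newDataFiles != null && newDataFiles.size() > 0,
         "Cannot stage null or empty file set");
     // ... remember the new files for this file set ID so the commit can pick them up ...
   }
   ```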






[GitHub] [iceberg] RussellSpitzer commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-1016650821


   @xloya Could you please open a PR backporting these changes into the 3.1 and 3.0 modules? You can do both in a single PR.




[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-1016021655


   > LGTM, I just have a few questions on the tests
   
   Thanks @RussellSpitzer for picking this patch up again from the corner of the PR list! I have made some changes to the unit tests based on your comments; please review again when you have time.




[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768258460



##########
File path: core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitManager.java
##########
@@ -79,11 +83,34 @@ public void commitFileGroups(Set<RewriteFileGroup> fileGroups) {
     }
 
     RewriteFiles rewrite = table.newRewrite().validateFromSnapshot(startingSnapshotId);
-    if (useStartingSequenceNumber) {
-      long sequenceNumber = table.snapshot(startingSnapshotId).sequenceNumber();
-      rewrite.rewriteFiles(rewrittenDataFiles, addedDataFiles, sequenceNumber);
+
+    if (addedDataFiles.size() > 0) {

Review comment:
    My initial idea was to rewrite these data files into an empty file, or to remove the restriction here, but I'm not sure whether that affects other logic. If the check can be removed, I think that is the simplest approach.
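
   If the restriction were lifted, committing a group whose rewrite produced no new files could look like the sketch below (an illustration only, reusing the calls already visible in the diff above; the empty set comes from the JDK's `Collections`):
   ```java
   RewriteFiles rewrite = table.newRewrite().validateFromSnapshot(startingSnapshotId);
   // All records of the rewritten files were eliminated by deletes, so nothing is added back.
   rewrite.rewriteFiles(rewrittenDataFiles, java.util.Collections.emptySet(),
       table.snapshot(startingSnapshotId).sequenceNumber());
   rewrite.commit();
   ```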






[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768262329



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java
##########
@@ -67,6 +68,11 @@ public Table table() {
           .option(SparkReadOptions.FILE_OPEN_COST, "0")
           .load(table.name());
 
+      // If got 0 pieces of data, no need write new data file, should return to expire the data file
+      if (scanDF.count() == 0) {

Review comment:
    After reading this, I realize I did not consider the cost of `count()` in Spark; maybe another way should be considered.
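
   One alternative direction (an assumption, not what the PR ultimately does) would be to decide from metadata instead of re-reading the data, for example by comparing the file's record count with the number of position-delete rows targeting it:
   ```java
   // Sketch only: positionDeleteRowsForFile is a hypothetical value obtained during
   // planning, not an existing Iceberg call; DataFile.recordCount() is part of the API.
   static boolean likelyEmptyAfterDeletes(DataFile file, long positionDeleteRowsForFile) {
     // If distinct position deletes cover at least every row of the file,
     // nothing can survive the read.
     return positionDeleteRowsForFile >= file.recordCount();
   }
   ```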






[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r767770445



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -24,6 +24,7 @@
 import java.io.UncheckedIOException;
 import java.util.Collections;
 import java.util.Comparator;
+import java.util.HashMap;

Review comment:
    We use `Maps.newHashMap` when creating maps, so usually we don't need to import the class directly.
   






[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768258870



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -1055,12 +1265,11 @@ protected Table createTable(int files) {
     return table;
   }
 
-  protected Table createTablePartitioned(int partitions, int files) {
+  protected Table createTablePartitioned(int partitions, int files, Map<String, String> options) {

Review comment:
       Sounds great!






[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r787007034



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -234,6 +234,139 @@ public void testBinPackWithDeletes() throws Exception {
     Assert.assertEquals("7 rows are removed", total - 7, actualRecords.size());
   }
 
+  @Test
+  public void testBinPackWithDeleteAllData() {
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+    List<Object[]> expectedRecords = currentData();
+    Result result = actions().rewriteDataFiles(table)
+            .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+            .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+            .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")

Review comment:
    Are the options other than DELETE_FILE_THRESHOLD required? It seems like that alone should be sufficient to trigger the compaction.
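
   A minimal invocation along those lines would be (a sketch; whether the defaults are sufficient for this particular test is an assumption):
   ```java
   Result result = actions().rewriteDataFiles(table)
       .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
       .execute();
   ```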






[GitHub] [iceberg] xloya edited a comment on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya edited a comment on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-1017049903








[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r767766468



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -234,12 +236,220 @@ public void testBinPackWithDeletes() throws Exception {
     Assert.assertEquals("7 rows are removed", total - 7, actualRecords.size());
   }
 
+  @Test
+  public void testBinPackWithDeleteAllDataAndDataFileHasGroupOffsets() {
+    Map<String, String> options = new HashMap<>();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    options.put(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "1024");
+    options.put(TableProperties.PARQUET_PAGE_SIZE_BYTES, "256");
+    options.put(TableProperties.PARQUET_DICT_SIZE_BYTES, "512");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+
+    AssertHelpers.assertThrows("Expected an exception",
+        RuntimeException.class,
+        () -> actions().rewriteDataFiles(table)
+            .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+            .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+            .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
+            .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "false")
+            .execute());
+  }
+
+  @Test
+  public void testBinPackWithDeleteAllDataAndDataFileHasGroupOffsetsWithSeqNum() {
+    Map<String, String> options = new HashMap<>();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    options.put(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "1024");
+    options.put(TableProperties.PARQUET_PAGE_SIZE_BYTES, "256");
+    options.put(TableProperties.PARQUET_DICT_SIZE_BYTES, "512");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+
+    AssertHelpers.assertThrows("Expected an exception",
+        RuntimeException.class,

Review comment:
    Shouldn't we be fixing this so that it doesn't throw an error?






[GitHub] [iceberg] RussellSpitzer commented on pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-992488034


   If I understand correctly, the goal here is to fix RewriteData so that files which have been made completely empty because of deletes being applied should be removed. Currently this is an issue because the core RewriteOperation doesn't allow addedFiles to be empty.




[GitHub] [iceberg] xloya edited a comment on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya edited a comment on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993105333


   @RussellSpitzer @jackye1995 Hey, I will modify the implementation first to remove the precondition check in `FileRewriteCoordinator.stageRewrite()`, but I'm not sure whether there are other situations that would cause no new files to be generated. Do you know of any other use cases that would produce this situation? And I have a question: can we ensure that a data file in a rewrite group is always read as a complete file when we rewrite it, rather than as several row groups of a Parquet file?




[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768259735



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -234,12 +236,220 @@ public void testBinPackWithDeletes() throws Exception {
     Assert.assertEquals("7 rows are removed", total - 7, actualRecords.size());
   }
 
+  @Test
+  public void testBinPackWithDeleteAllDataAndDataFileHasGroupOffsets() {
+    Map<String, String> options = new HashMap<>();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    options.put(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "1024");
+    options.put(TableProperties.PARQUET_PAGE_SIZE_BYTES, "256");
+    options.put(TableProperties.PARQUET_DICT_SIZE_BYTES, "512");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+
+    AssertHelpers.assertThrows("Expected an exception",
+        RuntimeException.class,
+        () -> actions().rewriteDataFiles(table)
+            .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+            .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+            .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
+            .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "false")
+            .execute());
+  }
+
+  @Test
+  public void testBinPackWithDeleteAllDataAndDataFileHasGroupOffsetsWithSeqNum() {
+    Map<String, String> options = new HashMap<>();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    options.put(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "1024");
+    options.put(TableProperties.PARQUET_PAGE_SIZE_BYTES, "256");
+    options.put(TableProperties.PARQUET_DICT_SIZE_BYTES, "512");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+
+    AssertHelpers.assertThrows("Expected an exception",
+        RuntimeException.class,

Review comment:
    With my current implementation, I actually cannot handle the situation where multiple row groups of a Parquet data file are split into multiple tasks when reading, so I would prefer to keep throwing an exception for now.






[GitHub] [iceberg] jackye1995 commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768102963



##########
File path: core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitManager.java
##########
@@ -79,11 +83,34 @@ public void commitFileGroups(Set<RewriteFileGroup> fileGroups) {
     }
 
     RewriteFiles rewrite = table.newRewrite().validateFromSnapshot(startingSnapshotId);
-    if (useStartingSequenceNumber) {
-      long sequenceNumber = table.snapshot(startingSnapshotId).sequenceNumber();
-      rewrite.rewriteFiles(rewrittenDataFiles, addedDataFiles, sequenceNumber);
+
+    if (addedDataFiles.size() > 0) {

Review comment:
    Yeah, I was going to ask the same question. The original logic was implemented before v2 deletes, so I think it makes sense now to just remove that restriction.






[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r787015510



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -234,6 +234,139 @@ public void testBinPackWithDeletes() throws Exception {
     Assert.assertEquals("7 rows are removed", total - 7, actualRecords.size());
   }
 
+  @Test
+  public void testBinPackWithDeleteAllData() {
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+    List<Object[]> expectedRecords = currentData();
+    Result result = actions().rewriteDataFiles(table)
+            .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+            .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+            .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
+            .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "false")
+            .execute();
+    Assert.assertEquals("Action should rewrite 1 data files", 1, result.rewrittenDataFilesCount());
+
+    List<Object[]> actualRecords = currentData();
+    assertEquals("Rows must match", expectedRecords, actualRecords);
+    Assert.assertEquals(
+        "Data manifest should not have existing data file",
+        0,
+        (long) table.currentSnapshot().dataManifests().get(0).existingFilesCount());
+    Assert.assertEquals("Data manifest should have 1 delete data file",
+        1L,
+        (long) table.currentSnapshot().dataManifests().get(0).deletedFilesCount());
+    Assert.assertEquals(
+        "Delete manifest added row count should equal total count",
+        total,
+        (long) table.currentSnapshot().deleteManifests().get(0).addedRowsCount());
+  }
+
+  @Test
+  public void testBinPackWithDeleteAllDataAndSeqNum() {
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+    List<Object[]> expectedRecords = currentData();
+    Result result = actions().rewriteDataFiles(table)
+          .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+          .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+          .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+          .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
+          .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "true")
+          .execute();
+    Assert.assertEquals("Action should rewrite 1 data files", 1, result.rewrittenDataFilesCount());
+
+    List<Object[]> actualRecords = currentData();
+    assertEquals("Rows must match", expectedRecords, actualRecords);
+    Assert.assertEquals(
+        "Data manifest should not have existing data file",
+        0,
+        (long) table.currentSnapshot().dataManifests().get(0).existingFilesCount());
+    Assert.assertEquals("Data manifest should have 1 delete data file",
+        1L,
+        (long) table.currentSnapshot().dataManifests().get(0).deletedFilesCount());
+    Assert.assertEquals(
+        "Delete manifest added row count should equal total count",
+        total,
+        (long) table.currentSnapshot().deleteManifests().get(0).addedRowsCount());
+  }
+
+  @Test
+  public void testBinPackPartialCommitWithDeleteAllDataAndSeqNum() {
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    Table table = createTablePartitioned(1, 1, options);

Review comment:
       Since this is only creating a single file, shouldn't this be basically the same as the other tests with a single file being completely deleted?






[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r767769861



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -159,7 +160,7 @@ public void testBinPackUnpartitionedTable() {
 
   @Test
   public void testBinPackPartitionedTable() {
-    Table table = createTablePartitioned(4, 2);
+    Table table = createTablePartitioned(4, 2, new HashMap<>());

Review comment:
    Adding the additional method signature will remove the need to change these lines.






[GitHub] [iceberg] RussellSpitzer commented on pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-992497012


   I think we definitely should fix this use case but I'm not sure about the approach here or the requirement for dealing with files without split offsets.
   
   Can we also change the title to something like ```Support RewriteDataFiles when Files are Completely Eliminated by Deletes```?




[GitHub] [iceberg] xloya edited a comment on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya edited a comment on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993240956


   @RussellSpitzer @jackye1995 I have modified the PR to remove the precondition checks; can you please review it again when you have time? Thanks!




[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993240956


   @RussellSpitzer @jackye1995 I have modified the PR to remove the precondition check; can you please review it again when you have time? Thanks!




[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993085506


   > I think we definitely should fix this use case but I'm not sure about the approach here or the requirement for dealing with files without split offsets.
   > 
   > Can we also change the title to something like `Support RewriteDataFiles when Files are Completely Eliminated by Deletes`?
   
   I think so, I will revise the title. As mentioned above, if the precondition check can be removed, I think that is a relatively simple way to fix this.




[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768260561



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -159,7 +160,7 @@ public void testBinPackUnpartitionedTable() {
 
   @Test
   public void testBinPackPartitionedTable() {
-    Table table = createTablePartitioned(4, 2);
+    Table table = createTablePartitioned(4, 2, new HashMap<>());

Review comment:
       ok.






[GitHub] [iceberg] RussellSpitzer commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-1016052956


   Looks good to me! I'll merge tomorrow morning after tests pass since it's bedtime for me now.




[GitHub] [iceberg] RussellSpitzer merged pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
RussellSpitzer merged pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724


   




[GitHub] [iceberg] xloya edited a comment on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya edited a comment on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993106658


   > @RussellSpitzer @jackye1995 Hey, I will modify the implementation first to remove the precondition check in `FileRewriteCoordinator.stageRewrite()`, but I'm not sure whether there are other situations that would cause no new files to be generated. Do you know of any other use cases that would produce this situation?
   
   Or do we need to distinguish between v1 and v2 tables here?




[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r767769476



##########
File path: spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java
##########
@@ -67,6 +68,11 @@ public Table table() {
           .option(SparkReadOptions.FILE_OPEN_COST, "0")
           .load(table.name());
 
+      // If got 0 pieces of data, no need write new data file, should return to expire the data file
+      if (scanDF.count() == 0) {

Review comment:
    This will be extremely expensive for all operations where the scanDF isn't empty, forcing a double read of all such data sets. I think we should look for another solution.






[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768259735



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -234,12 +236,220 @@ public void testBinPackWithDeletes() throws Exception {
     Assert.assertEquals("7 rows are removed", total - 7, actualRecords.size());
   }
 
+  @Test
+  public void testBinPackWithDeleteAllDataAndDataFileHasGroupOffsets() {
+    Map<String, String> options = new HashMap<>();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    options.put(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "1024");
+    options.put(TableProperties.PARQUET_PAGE_SIZE_BYTES, "256");
+    options.put(TableProperties.PARQUET_DICT_SIZE_BYTES, "512");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+
+    AssertHelpers.assertThrows("Expected an exception",
+        RuntimeException.class,
+        () -> actions().rewriteDataFiles(table)
+            .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+            .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+            .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
+            .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "false")
+            .execute());
+  }
+
+  @Test
+  public void testBinPackWithDeleteAllDataAndDataFileHasGroupOffsetsWithSeqNum() {
+    Map<String, String> options = new HashMap<>();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    options.put(TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, "1024");
+    options.put(TableProperties.PARQUET_PAGE_SIZE_BYTES, "256");
+    options.put(TableProperties.PARQUET_DICT_SIZE_BYTES, "512");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+
+    AssertHelpers.assertThrows("Expected an exception",
+        RuntimeException.class,

Review comment:
    With my current implementation, I actually cannot handle the situation where multiple row groups of a Parquet data file are split into multiple tasks when reading, so I would prefer to keep throwing an exception for now. If we can be sure that a data file in a single rewrite group is always read as a complete file when rewriting, then I think we can remove it directly without using `split_offsets` to judge.






[GitHub] [iceberg] xloya commented on a change in pull request #3724: [Spark][Core]: Expire data files that read 0 pieces of data during rewriting

Posted by GitBox <gi...@apache.org>.
xloya commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r768258609



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -24,6 +24,7 @@
 import java.io.UncheckedIOException;
 import java.util.Collections;
 import java.util.Comparator;
+import java.util.HashMap;

Review comment:
       Got that.






[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-993106658


   > @RussellSpitzer @jackye1995 Hey, I will modify the implementation first to remove the precondition check in `FileRewriteCoordinator.stageRewrite()`, but I'm not sure whether there are other situations that would cause no new files to be generated. Do you know of any other use cases that would produce this situation?
   
   Or do we need to distinguish between v1 and v2 tables here?




[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#discussion_r787013624



##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -234,6 +234,139 @@ public void testBinPackWithDeletes() throws Exception {
     Assert.assertEquals("7 rows are removed", total - 7, actualRecords.size());
   }
 
+  @Test
+  public void testBinPackWithDeleteAllData() {
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, "2");
+    Table table = createTablePartitioned(1, 1, options);
+    shouldHaveFiles(table, 1);
+    table.refresh();
+
+    CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
+    List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+    int total = (int) dataFiles.stream().mapToLong(ContentFile::recordCount).sum();
+
+    RowDelta rowDelta = table.newRowDelta();
+    // remove all data
+    writePosDeletesToFile(table, dataFiles.get(0), total)
+        .forEach(rowDelta::addDeletes);
+
+    rowDelta.commit();
+    table.refresh();
+    List<Object[]> expectedRecords = currentData();
+    Result result = actions().rewriteDataFiles(table)
+            .option(BinPackStrategy.MIN_FILE_SIZE_BYTES, "0")
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+            .option(BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE))
+            .option(BinPackStrategy.DELETE_FILE_THRESHOLD, "1")
+            .option(RewriteDataFiles.USE_STARTING_SEQUENCE_NUMBER, "false")
+            .execute();
+    Assert.assertEquals("Action should rewrite 1 data files", 1, result.rewrittenDataFilesCount());
+
+    List<Object[]> actualRecords = currentData();
+    assertEquals("Rows must match", expectedRecords, actualRecords);
+    Assert.assertEquals(
+        "Data manifest should not have existing data file",
+        0,
+        (long) table.currentSnapshot().dataManifests().get(0).existingFilesCount());
+    Assert.assertEquals("Data manifest should have 1 delete data file",
+        1L,
+        (long) table.currentSnapshot().dataManifests().get(0).deletedFilesCount());
+    Assert.assertEquals(
+        "Delete manifest added row count should equal total count",
+        total,
+        (long) table.currentSnapshot().deleteManifests().get(0).addedRowsCount());
+  }
+
+  @Test
+  public void testBinPackWithDeleteAllDataAndSeqNum() {
+    Map<String, String> options = Maps.newHashMap();

Review comment:
    This test shouldn't be any different, right? The starting sequence number shouldn't matter for a rewrite where all the data is deleted?






[GitHub] [iceberg] xloya commented on pull request #3724: [Spark][Core]: Support RewriteDataFiles when Files are Completely Eliminated by Deletes

Posted by GitBox <gi...@apache.org>.
xloya commented on pull request #3724:
URL: https://github.com/apache/iceberg/pull/3724#issuecomment-1017049903


   @RussellSpitzer Sure! I will do it later today.

