You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/08 13:30:54 UTC

[GitHub] [iceberg] openinx opened a new pull request #2303: Core: Remove all delete files in RewriteFiles action.

openinx opened a new pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303


   This PR is building on top of https://github.com/apache/iceberg/pull/2294,  resolving the issue https://github.com/apache/iceberg/issues/1028


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huadongliu commented on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
huadongliu commented on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-822819955


   Thanks for PR. I was excited to try it out. Before the rewriteDataFiles action in my test, the last snapshot has three manifest files after two table writes and one row-level deletion. After the rewriteDataFiles action, the latest snapshot has four manifest files. The first three basically change manifest entries in the previous manifest files from ADDED to DELETED. The last manifest file points to newly created partition data files.
   
   Do the DELETED manifest entries need to be kept in the current snapshot (i.e. manifest files 1, 2 and 3)? After expireSnapshots, those data files are orphaned anyway.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-844589947


   Sorry for the delay. I'm back from parental leave now.
   
   I agree with @RussellSpitzer's comments on this. I don't think that we can remove delete files just because data files were rewritten. We need to ensure that there are no data files that are still referenced by the delete files. This is probably going to require some work and may require reading the delete file. We can use the `_file` stats in some cases but we need to be careful.
   
   For now, I'd recommend letting the sequence numbers handle this. Because rewriting files will move them to a newer sequence number, the delete file won't be added to the new file's scan task when reading. It would still be considered during job planning, but I think that is okay and we don't need to aggressively drop them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-1086583227


   As I've cleaned the old forked github repo, so I cannot update this PR now.   Let's just address those comments in a new PR. Closing this now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#discussion_r594102116



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -192,7 +194,7 @@ protected boolean caseSensitive() {
 
   @Override
   public RewriteDataFilesActionResult execute() {
-    CloseableIterable<FileScanTask> fileScanTasks = null;
+    CloseableIterable<FileScanTask> fileScanTasks = CloseableIterable.empty();

Review comment:
       I'm not sure whether this name `RewriteDataFilesActionResult` should be renamed to `RewriteFilesActionResult` because the rewrite action is removing all the deletions from files set, it also rewrite the delete files actually.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-815498217


   @aokolnychyi ,   looks like @rdblue  is absent for the community in recent days,  would you mind to take a look this PR if you have a chance ?  Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#discussion_r609763716



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -262,31 +269,64 @@ public RewriteDataFilesActionResult execute() {
     return tasksGroupedByPartition.asMap();
   }
 
-  private void replaceDataFiles(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedDataFiles) {
+  private void replaceDataFiles(RewriteResult result) {
     try {
       RewriteFiles rewriteFiles = table.newRewrite();
-      rewriteFiles.rewriteFiles(Sets.newHashSet(deletedDataFiles), Sets.newHashSet(addedDataFiles));
+
+      rewriteFiles.rewriteFiles(
+          Sets.newHashSet(result.dataFilesToDelete()),
+          Sets.newHashSet(result.deleteFilesToDelete()),
+          Sets.newHashSet(result.dataFilesToAdd()),
+          Sets.newHashSet(result.deleteFilesToAdd())
+      );
+
       commit(rewriteFiles);
     } catch (Exception e) {
-      Tasks.foreach(Iterables.transform(addedDataFiles, f -> f.path().toString()))
+      // Remove all the newly produced files if possible.
+      Iterable<ContentFile<?>> addedFiles = Iterables.concat(
+          Arrays.asList(result.dataFilesToAdd()),
+          Arrays.asList(result.deleteFilesToAdd())
+      );
+
+      Tasks.foreach(Iterables.transform(addedFiles, f -> f.path().toString()))
           .noRetry()
           .suppressFailureWhenFinished()
           .onFailure((location, exc) -> LOG.warn("Failed to delete: {}", location, exc))
           .run(fileIO::deleteFile);
+
       throw e;
     }
   }
 
-  private boolean isPartialFileScan(CombinedScanTask task) {
+  private boolean doPartitionNeedRewrite(Collection<FileScanTask> partitionTasks) {
+    int files = 0;
+    for (FileScanTask scanTask : partitionTasks) {

Review comment:
       I believe this would break splitting large files, since a file scan task with only a single file will never be marked for rewrite.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huadongliu edited a comment on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
huadongliu edited a comment on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-822819955


   Thanks for PR. I was excited to try it out. Before the rewriteDataFiles action in my test, the last snapshot has three manifest files after two table writes and one row-level deletion. After the rewriteDataFiles action, the latest snapshot has four manifest files. The first three basically change manifest entries in the previous manifest files from ADDED to DELETED. The last manifest file points to newly created partition data files.
   
   Do the DELETED manifest entries need to be kept in the current snapshot (i.e. manifest files 1, 2 and 3)? After expiring the previous snapshot, the referenced data files will be deleted and the manifest entries in current manifest files 1, 2 and 3 will be orphaned.
   
   Maybe they are left to rewriteManifests to clean up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx closed pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
openinx closed pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#discussion_r602999421



##########
File path: core/src/main/java/org/apache/iceberg/BaseFile.java
##########
@@ -434,7 +433,7 @@ public Integer sortOrderId() {
     if (map != null) {
       Map<K, V> copy = Maps.newHashMapWithExpectedSize(map.size());
       copy.putAll(map);
-      return Collections.unmodifiableMap(copy);
+      return copy;

Review comment:
       There is a patch to address the kryo serialization issue: https://github.com/apache/iceberg/pull/2343




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huadongliu removed a comment on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
huadongliu removed a comment on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-822819955


   Thanks for PR. I was excited to try it out. Before the rewriteDataFiles action in my test, the last snapshot has three manifest files after two table writes and one row-level deletion. After the rewriteDataFiles action, the latest snapshot has four manifest files. The first three basically change manifest entries in the previous manifest files from ADDED to DELETED. The last manifest file points to newly created partition data files.
   
   Do the DELETED manifest entries need to be kept in the current snapshot (i.e. manifest files 1, 2 and 3)? After expiring the previous snapshot, the referenced data files will be deleted and the manifest entries in current manifest files 1, 2 and 3 will be orphaned.
   
   Maybe they are left to rewriteManifests to clean up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] huadongliu edited a comment on pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
huadongliu edited a comment on pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#issuecomment-822819955


   Thanks for PR. I was excited to try it out. Before the rewriteDataFiles action in my test, the last snapshot has three manifest files after two table writes and one row-level deletion. After the rewriteDataFiles action, the latest snapshot has four manifest files. The first three basically change manifest entries in the previous manifest files from ADDED to DELETED. The last manifest file points to newly created partition data files.
   
   Do the DELETED manifest entries need to be kept in the current snapshot (i.e. manifest files 1, 2 and 3)? After expiring the previous snapshot, the referenced data files will be deleted and the manifest entries in current manifest files 1, 2 and 3 will be orphaned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2303: Core: Remove all delete files in RewriteFiles action.

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2303:
URL: https://github.com/apache/iceberg/pull/2303#discussion_r609772679



##########
File path: flink/src/main/java/org/apache/iceberg/flink/source/RowDataRewriter.java
##########
@@ -124,16 +129,28 @@ public void open(Configuration parameters) {
     }
 
     @Override
-    public List<DataFile> map(CombinedScanTask task) throws Exception {
+    public RewriteResult map(CombinedScanTask task) throws Exception {
+      // Initialize the builder of ResultWriter.
+      RewriteResult.Builder resultBuilder = RewriteResult.builder();
+      for (FileScanTask scanTask : task.files()) {
+        resultBuilder.addDataFilesToDelete(scanTask.file());
+        resultBuilder.addDeleteFilesToDelete(scanTask.deletes());

Review comment:
       I believe the spec allows for a delete file to reference multiple data files, which I think means that just because a delete file is associated with a data file that is being rewritten, it doesn't mean that the file can removed. Instead you would need to check that there are no longer any live data files which are referenced by the delete. This probably requires getting a hold of a reversed delete file index.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org