You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "ajantha-bhat (via GitHub)" <gi...@apache.org> on 2023/05/22 05:08:31 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request, #7676: Spark-3.3: Harmonize RewriteDataFilesSparkAction

ajantha-bhat opened a new pull request, #7676:
URL: https://github.com/apache/iceberg/pull/7676

   Simple backport of https://github.com/apache/iceberg/pull/7630/files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho merged pull request #7676: Spark 3.3: Harmonize RewriteDataFilesSparkAction

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho merged PR #7676:
URL: https://github.com/apache/iceberg/pull/7676


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #7676: Spark 3.3: Harmonize RewriteDataFilesSparkAction

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #7676:
URL: https://github.com/apache/iceberg/pull/7676#issuecomment-1559922385

   Merged, thanks @ajantha-bhat 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #7676: Spark-3.3: Harmonize RewriteDataFilesSparkAction

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #7676:
URL: https://github.com/apache/iceberg/pull/7676#issuecomment-1556564056

   Spark-3.2 and older versions have a lot of gaps with spark-3.3 and spark-3.4 (like using FileRewriter instead of RewriteStrategy), hence only backported to spark-3.3
   
   cc: @aokolnychyi, @szehon-ho  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7676: Spark 3.3: Harmonize RewriteDataFilesSparkAction

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7676:
URL: https://github.com/apache/iceberg/pull/7676#discussion_r1202804908


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##########
@@ -366,45 +365,29 @@ private Result doExecuteWithPartialProgress(
   }
 
   Stream<RewriteFileGroup> toGroupStream(
-      RewriteExecutionContext ctx,
-      Map<StructLike, List<List<FileScanTask>>> fileGroupsByPartition) {
-    Stream<RewriteFileGroup> rewriteFileGroupStream =
-        fileGroupsByPartition.entrySet().stream()
-            .flatMap(
-                e -> {
-                  StructLike partition = e.getKey();
-                  List<List<FileScanTask>> fileGroups = e.getValue();
-                  return fileGroups.stream()
-                      .map(
-                          tasks -> {
-                            int globalIndex = ctx.currentGlobalIndex();
-                            int partitionIndex = ctx.currentPartitionIndex(partition);
-                            FileGroupInfo info =
-                                ImmutableRewriteDataFiles.FileGroupInfo.builder()
-                                    .globalIndex(globalIndex)
-                                    .partitionIndex(partitionIndex)
-                                    .partition(partition)
-                                    .build();
-                            return new RewriteFileGroup(info, tasks);
-                          });
-                });
-
-    return rewriteFileGroupStream.sorted(rewriteGroupComparator());
+      RewriteExecutionContext ctx, Map<StructLike, List<List<FileScanTask>>> groupsByPartition) {

Review Comment:
   (Not related to this pr).  This is my fault, but I think this should be 'StructLikeMap' all across to be safer, I guess we can do it later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7676: Spark-3.3: Harmonize RewriteDataFilesSparkAction

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7676:
URL: https://github.com/apache/iceberg/pull/7676#discussion_r1201361740


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##########
@@ -331,24 +331,31 @@ private Result doExecuteWithPartialProgress(
       RewriteDataFilesCommitManager commitManager) {
     ExecutorService rewriteService = rewriteService();
 
-    // Start Commit Service
+    // start commit service
     int groupsPerCommit = IntMath.divide(ctx.totalGroupCount(), maxCommits, RoundingMode.CEILING);
     RewriteDataFilesCommitManager.CommitService commitService =
         commitManager.service(groupsPerCommit);
     commitService.start();
 
-    // Start rewrite tasks
+    Collection<FileGroupFailureResult> rewriteFailures = new ConcurrentLinkedQueue<>();

Review Comment:
   Looks like we are also backporting https://github.com/apache/iceberg/pull/7361 to Spark 3.3?  No problem with me, but we may want to note that in pr description in case some issues arise because of this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7676: Spark-3.3: Harmonize RewriteDataFilesSparkAction

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #7676:
URL: https://github.com/apache/iceberg/pull/7676#discussion_r1201409911


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##########
@@ -331,24 +331,31 @@ private Result doExecuteWithPartialProgress(
       RewriteDataFilesCommitManager commitManager) {
     ExecutorService rewriteService = rewriteService();
 
-    // Start Commit Service
+    // start commit service
     int groupsPerCommit = IntMath.divide(ctx.totalGroupCount(), maxCommits, RoundingMode.CEILING);
     RewriteDataFilesCommitManager.CommitService commitService =
         commitManager.service(groupsPerCommit);
     commitService.start();
 
-    // Start rewrite tasks
+    Collection<FileGroupFailureResult> rewriteFailures = new ConcurrentLinkedQueue<>();

Review Comment:
   you are right. It looks confusing to handle from this PR.
   
   I have reverted it now. It can be done in another PR along with testcase and procedure changes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org