Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/16 17:36:58 UTC

[GitHub] [iceberg] karuppayya commented on a change in pull request #2452: Dedup files list generated in BaseSparkAction

karuppayya commented on a change in pull request #2452:
URL: https://github.com/apache/iceberg/pull/2452#discussion_r615018277



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java
##########
@@ -159,7 +159,7 @@ protected Table newStaticTable(TableMetadata metadata, FileIO io) {
         .repartition(spark.sessionState().conf().numShufflePartitions()) // avoid adaptive execution combining tasks
         .as(Encoders.bean(ManifestFileBean.class));
 
-    return allManifests.flatMap(new ReadManifest(ioBroadcast), Encoders.STRING()).toDF("file_path");
+    return allManifests.flatMap(new ReadManifest(ioBroadcast), Encoders.STRING()).toDF("file_path").distinct();

Review comment:
       @aokolnychyi When building the data files, we already call `dropDuplicates`, which currently takes care of the deduplication.
   The duplicates come from the following method:
   ```
     protected Dataset<Row> buildManifestFileDF(Table table) {
       return loadMetadataTable(table, ALL_MANIFESTS).selectExpr("path as file_path");
     }
   ```
   I think different snapshots reference the same manifest files, which is why we get duplicate manifest entries.
   We could call `dropDuplicates`/`distinct` while collecting manifest files in `BaseSparkAction`.
   
   But as @RussellSpitzer suggested, this would add an extra shuffle to every action. We could instead leave it to the caller to decide the behaviour.
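
To make the trade-off concrete, here is a minimal sketch in plain Java rather than Spark. It assumes a hypothetical map from snapshot id to the manifest paths that snapshot references, standing in for the `ALL_MANIFESTS` metadata table. The class name `ManifestPathsSketch`, the method `manifestPaths`, the `dedup` flag, and the sample paths are all illustrative and are not part of `BaseSparkAction`. It shows how manifests shared across snapshots produce repeated paths, and how deduplication could be left to the caller.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the all-manifests listing: snapshot id -> manifest paths.
class ManifestPathsSketch {

  // Emits one entry per (snapshot, manifest) pair, so a manifest shared by
  // several snapshots appears several times unless the caller asks for dedup.
  static List<String> manifestPaths(Map<Long, List<String>> manifestsBySnapshot, boolean dedup) {
    Stream<String> paths = manifestsBySnapshot.values().stream().flatMap(List::stream);
    // Dedup is opt-in: in Spark the equivalent distinct()/dropDuplicates()
    // would add a shuffle, which not every caller needs to pay for.
    return (dedup ? paths.distinct() : paths).collect(Collectors.toList());
  }

  // Illustrative data: each newer snapshot still references the older manifests.
  static Map<Long, List<String>> sampleSnapshots() {
    Map<Long, List<String>> snapshots = new LinkedHashMap<>();
    snapshots.put(1L, Arrays.asList("m1.avro"));
    snapshots.put(2L, Arrays.asList("m1.avro", "m2.avro"));
    snapshots.put(3L, Arrays.asList("m1.avro", "m2.avro", "m3.avro"));
    return snapshots;
  }

  public static void main(String[] args) {
    Map<Long, List<String>> snapshots = sampleSnapshots();
    System.out.println(manifestPaths(snapshots, false)); // six entries, with repeats
    System.out.println(manifestPaths(snapshots, true));  // three unique paths
  }
}
```

In the Spark code the same choice would presumably be a `distinct()` applied by the caller on the returned `Dataset<Row>` rather than inside the shared helper, so that only the actions that need unique paths incur the shuffle.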



