Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/16 06:34:12 UTC

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2452: Dedup files list generated in BaseSparkAction

aokolnychyi commented on a change in pull request #2452:
URL: https://github.com/apache/iceberg/pull/2452#discussion_r614593663



##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java
##########
@@ -173,7 +173,7 @@ protected Table newStaticTable(TableMetadata metadata, FileIO io) {
 
   protected Dataset<Row> buildOtherMetadataFileDF(TableOperations ops) {
     List<String> otherMetadataFiles = getOtherMetadataFilePaths(ops);
-    return spark.createDataset(otherMetadataFiles, Encoders.STRING()).toDF("file_path");
+    return spark.createDataset(otherMetadataFiles, Encoders.STRING()).toDF("file_path").distinct();

Review comment:
       Instead of doing a shuffle here, I think we should refine how we build the list of JSON files. What happens now, I believe, is that every version file contributes its previous 100 version files to the list, even though each new version file adds only one new entry. Will a table with 2000 snapshots and 100 previous metadata files produce a list with 200,000 elements?
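
       As a rough illustration of deduplicating while the list is built (rather than shuffling with `distinct()`), something along these lines could collect the paths into a Set on the driver. This is only a sketch: `VersionFileReader`, `previousFilesOf`, and `distinctOtherMetadataFilePaths` are hypothetical names standing in for however the previous metadata entries are actually read, not Iceberg APIs.

       import java.util.ArrayList;
       import java.util.LinkedHashSet;
       import java.util.List;
       import java.util.Set;

       // Sketch only: dedup on the driver while collecting, so the Dataset needs no distinct().
       public class MetadataFilePaths {

         public interface VersionFileReader {
           // returns the previous metadata file locations recorded in one version file
           List<String> previousFilesOf(String versionFile);
         }

         public static List<String> distinctOtherMetadataFilePaths(List<String> versionFiles,
                                                                   VersionFileReader reader) {
           Set<String> unique = new LinkedHashSet<>();
           for (String versionFile : versionFiles) {
             unique.add(versionFile);
             // each version file may repeat up to 100 earlier entries; the Set absorbs the overlap
             unique.addAll(reader.previousFilesOf(versionFile));
           }
           return new ArrayList<>(unique);
         }
       }

       With a list built that way, buildOtherMetadataFileDF could keep the original createDataset(otherMetadataFiles, Encoders.STRING()).toDF("file_path") call unchanged and still avoid the shuffle.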






