Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/14 22:46:19 UTC

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5981: Core: Parallelize the determining of reachable manifests during file cleanup

amogh-jahagirdar commented on code in PR #5981:
URL: https://github.com/apache/iceberg/pull/5981#discussion_r996201966


##########
core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java:
##########
@@ -85,19 +79,60 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
   }
 
   private Set<ManifestFile> readManifests(Set<Snapshot> snapshots) {
-    Set<ManifestFile> manifestFiles = Sets.newHashSet();
-    for (Snapshot snapshot : snapshots) {
-      try (CloseableIterable<ManifestFile> manifestFilesForSnapshot = readManifestFiles(snapshot)) {
-        for (ManifestFile manifestFile : manifestFilesForSnapshot) {
-          manifestFiles.add(manifestFile.copy());
-        }
-      } catch (IOException e) {
-        throw new RuntimeIOException(
-            e, "Failed to close manifest list: %s", snapshot.manifestListLocation());
-      }
-    }
+    Set<ManifestFile> manifests = ConcurrentHashMap.newKeySet();
+    Tasks.foreach(snapshots)
+        .retry(3)
+        .stopOnFailure()
+        .throwFailureWhenFinished()
+        .executeWith(planExecutorService)
+        .onFailure(
+            (snapshot, exc) ->
+                LOG.warn(
+                    "Failed to determine manifests for snapshot {}", snapshot.snapshotId(), exc))
+        .run(
+            snapshot -> {
+              try (CloseableIterable<ManifestFile> manifestFilesForSnapshot =
+                  readManifestFiles(snapshot)) {
+                for (ManifestFile manifestFile : manifestFilesForSnapshot) {
+                  manifests.add(manifestFile.copy());
+                }
+              } catch (IOException e) {
+                throw new RuntimeIOException(
+                    e, "Failed to close manifest list: %s", snapshot.manifestListLocation());
+              }
+            });
+
+    return manifests;
+  }
+
+  private Set<ManifestFile> manifestFilesToDelete(
+      Set<ManifestFile> currentManifests, Set<Snapshot> expiredSnapshots) {

Review Comment:
   Yeah, I was thinking that in the worst case we need to read the current manifests for the data files anyway, and structuring the code this way lets the set of current manifests be reused during the reachable data file analysis. But it's more important to consider the average/common case, and I think I can structure the code in a readable way that still reuses the determined set of current manifests. Will update.
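
   A minimal sketch of the restructuring described above, not the code in this PR. It assumes manifests and data files can be stood in for by plain strings; `ManifestReuseSketch`, `manifestsToDelete`, `dataFilesToDelete`, and the map-based "reads" are hypothetical names introduced only for illustration. The set of current manifests is determined once, used to compute the manifests to delete, and then reused for the reachable data file analysis, which is skipped entirely in the common case where nothing needs deleting.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical sketch: strings stand in for ManifestFile and data file paths.
final class ManifestReuseSketch {

  // Collects the manifests listed by the given snapshots in parallel.
  // manifestLists is an assumed stand-in for reading each snapshot's manifest list.
  static Set<String> readManifests(
      Map<Long, List<String>> manifestLists, Set<Long> snapshotIds, ExecutorService pool) {
    Set<String> manifests = ConcurrentHashMap.newKeySet();
    List<Future<?>> pending = new ArrayList<>();
    for (Long id : snapshotIds) {
      pending.add(
          pool.submit(
              () -> {
                manifests.addAll(manifestLists.getOrDefault(id, List.of()));
              }));
    }
    for (Future<?> task : pending) {
      try {
        task.get(); // wait for completion and surface any failure
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException("Failed to determine manifests", e);
      }
    }
    return manifests;
  }

  // Manifests reachable from expired snapshots that no current snapshot references.
  static Set<String> manifestsToDelete(Set<String> currentManifests, Set<String> expiredManifests) {
    Set<String> toDelete = new HashSet<>(expiredManifests);
    toDelete.removeAll(currentManifests);
    return toDelete;
  }

  // Reuses the already-determined current manifests to find unreachable data files.
  // manifestContents is an assumed stand-in for reading the entries of a manifest.
  static Set<String> dataFilesToDelete(
      Map<String, List<String>> manifestContents,
      Set<String> manifestsToDelete,
      Set<String> currentManifests) {
    Set<String> candidates = new HashSet<>();
    for (String manifest : manifestsToDelete) {
      candidates.addAll(manifestContents.getOrDefault(manifest, List.of()));
    }
    if (candidates.isEmpty()) {
      return candidates; // common case: no need to scan the current manifests
    }
    for (String manifest : currentManifests) {
      candidates.removeAll(manifestContents.getOrDefault(manifest, List.of()));
    }
    return candidates;
  }
}
```

   Passing `currentManifests` into both steps avoids reading the current manifest lists a second time, and the early return keeps the common case cheap.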



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

