You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/01 07:28:29 UTC

[GitHub] [iceberg] ajantha-bhat opened a new pull request, #6090: Core: Handle statistics file clean up from expireSnapshots

ajantha-bhat opened a new pull request, #6090:
URL: https://github.com/apache/iceberg/pull/6090

   Currently, Statistics files are safeguarded against orphan_files cleanup. But they are never cleaned up from table metadata and from the storage once the snapshots are expired/deleted.
   
   Hence, this PR adds a change to handle the Statistics file cleanup during expire_snapshot.
   
   Note that this is just for API level clean up (`table#expireSnapshots`) 
   
   Clean-up from expired snapshots spark action/procedure will be built on top of it in a follow-up PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053475952


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -260,10 +260,18 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
         findFilesToDelete(manifestsToScan, manifestsToRevert, validIds, afterExpiration);
 
     deleteFiles(filesToDelete, "data");
-    LOG.warn("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
-    LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
+    LOG.debug("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
+    LOG.debug("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      LOG.debug(
+          "Statistics files to delete: {}", Joiner.on(", ").join(expiredStatisticsFilesLocations));

Review Comment:
   What is the value of this debug message? This isn't present in the reachable cleanup code, so I'm guessing it isn't important? If not, please remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1044053881


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,33 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> validIds, Set<Long> expiredIds) {

Review Comment:
   ok. I will update it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
findepi commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1013923155


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());

Review Comment:
   - `statisticsFile.snapshotId()` is "ID of the Iceberg table's snapshot the statistics were computed from",  it's not "ID of the snapshot the file is attached to"
   - The spec allows one Puffin file to be attached to multiple snapshots. For example, rewrite_files / optimize will produce a new snapshot, but could carry forward some/all table-level stats.
     - So, the fact that the snapshot is being forgotten doesn't imply the stats file can be safely deleted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1014286079


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());

Review Comment:
   @findepi: Thinking more about this, As the `TableMetadata` has just the list of  `StatisticsFile`. And you have mentioned,  statisticsFile.snapshotId() is "ID of the Iceberg table's snapshot the statistics were computed from"
   
   So, how will the query knows which statistics file to use for the current snapshot (Incase of rewrite data files, the current snapshot id may not be present in that list of statistics file?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092361210


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);

Review Comment:
   There's no need to use transactions for a single operation. Can you remove this and update the table directly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1410923876

   Thanks, @ajantha-bhat! I made some comments in tests to fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1043910826


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,33 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> validIds, Set<Long> expiredIds) {
+    Set<String> validStatisticsFilesLocation = Sets.newHashSet();
+    for (long snapshotId : validIds) {
+      List<StatisticsFile> statisticsFiles =
+          beforeExpiration.statisticsFilesForSnapshot(snapshotId);
+      if (statisticsFiles != null) {
+        statisticsFiles.forEach(file -> validStatisticsFilesLocation.add(file.path()));
+      }
+    }
+
+    // Statistics files can be reused in case of rewrite data files.
+    // Hence, filter out the reachable/valid statistics files.
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (long snapshotId : expiredIds) {
+      List<StatisticsFile> statisticsFiles =
+          beforeExpiration.statisticsFilesForSnapshot(snapshotId);
+      if (statisticsFiles != null) {
+        statisticsFiles.forEach(
+            file -> {
+              if (!validStatisticsFilesLocation.contains(file.path())) {
+                expiredStatisticsFilesLocations.add(file.path());
+              }
+            });
+      }
+    }
+    return expiredStatisticsFilesLocations;

Review Comment:
   Style: missing whitespace.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053497562


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1245,74 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws URISyntaxException, IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    File statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 = writeStatsFileForCurrentSnapshot(table, statsFileLocation1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    File statsFileLocation2 = statsFileLocation(table);
+    StatisticsFile statisticsFile2 = writeStatsFileForCurrentSnapshot(table, statsFileLocation2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    table.updateProperties().set(TableProperties.MAX_SNAPSHOT_AGE_MS, "1").commit();

Review Comment:
   Expiring based on assumptions about time possibly introduces flakiness to this test.
   
   The right way to test this is to use `long afterFirstSnapshot = waitUntilAfter(table.currentSnapshot().timestampMillis()` and `table.expireSnapshots().expireOlderThan(afterFirstSnapshot).commit()`
   
   This should not use the table property, since we can exercise the API directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue merged PR #6090:
URL: https://github.com/apache/iceberg/pull/6090


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1440556094

   Thanks, @ajantha-bhat. Good to have this in.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1361292197

   @rdblue :  Thanks for the awesome review. I will keep those points in mind. 
   
   I have addressed all the comments. Please take a look at it again. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053475952


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -260,10 +260,18 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
         findFilesToDelete(manifestsToScan, manifestsToRevert, validIds, afterExpiration);
 
     deleteFiles(filesToDelete, "data");
-    LOG.warn("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
-    LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
+    LOG.debug("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
+    LOG.debug("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      LOG.debug(
+          "Statistics files to delete: {}", Joiner.on(", ").join(expiredStatisticsFilesLocations));

Review Comment:
   What is the value of this debug message?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053477411


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1211,6 +1211,7 @@ public Builder removeSnapshots(Collection<Long> idsToRemove) {
         if (idsToRemove.contains(snapshotId)) {
           snapshotsById.remove(snapshotId);
           changes.add(new MetadataUpdate.RemoveSnapshot(snapshotId));
+          removeStatistics(snapshotId);

Review Comment:
   Can you explain why this PR is changing the behavior of `TableMetadata`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
findepi commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1305652072

   I think we should change the label of the `snapshot-id` entry in https://iceberg.apache.org/spec/#table-statistics (to level, not blob level)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1101922436


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1599,47 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFile(
+      long snapshotId, long snapshotSequenceNumber, String statsLocation, FileIO fileIO)
+      throws IOException {
+    try (PuffinWriter puffinWriter = Puffin.write(fileIO.newOutputFile(statsLocation)).build()) {
+      puffinWriter.add(
+          new Blob(
+              "some-blob-type",
+              ImmutableList.of(1),
+              snapshotId,
+              snapshotSequenceNumber,
+              ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
+      puffinWriter.finish();
+
+      return new GenericStatisticsFile(
+          snapshotId,
+          statsLocation,
+          puffinWriter.fileSize(),
+          puffinWriter.footerSize(),
+          puffinWriter.writtenBlobsMetadata().stream()
+              .map(GenericBlobMetadata::from)
+              .collect(ImmutableList.toImmutableList()));
+    }
+  }
+
+  private StatisticsFile reuseStatsForCurrentSnapshot(

Review Comment:
   This is not for the "current" snapshot because the snapshot ID is being passed in.
   
   When there are problems that need to be fixed in multiple places, I might just mention it once to avoid unnecessary repetition. So to keep PRs moving faster, you should always look for similar cases that also need to be fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092833429


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+
+    Assertions.assertThat(new File(statsFileLocation1).exists()).isFalse();
+    Assertions.assertThat(new File(statsFileLocation2).exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: RewriteDataFiles can reuse statistics files across operations.
+    // This test reuses stats for append just to mimic this scenario without having to run
+    // RewriteDataFiles.

Review Comment:
   @findepi has mentioned about reusing the stats file. I think we should not allow it because concurrent operations can add extra stats during rewrite operation.   
   
   We don't have any engine integration with stats in this repo. So, I mentioned "can reuse" instead of "will resue" 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092842407


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();

Review Comment:
   added a comment below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092842882


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1613,51 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(

Review Comment:
   ACK. Missed renaming during refactoring. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1043912545


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -204,7 +204,11 @@ private TableMetadata internalApply() {
     base.snapshots().stream()
         .map(Snapshot::snapshotId)
         .filter(snapshot -> !idsToRetain.contains(snapshot))
-        .forEach(idsToRemove::add);
+        .forEach(
+            snapshotId -> {
+              updatedMetaBuilder.removeStatistics(snapshotId);

Review Comment:
   This should be handled inside of `removeSnapshots` because we `TableMetadata` that to maintain consistency.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1047584271


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,25 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, TableMetadata afterExpiration) {
+    Set<String> statisticsFilesLocationBeforeExpiry = allStatisticsFilesLocation(beforeExpiration);
+    Set<String> statisticsFilesLocationAfterExpiry = allStatisticsFilesLocation(afterExpiration);
+    statisticsFilesLocationBeforeExpiry.removeAll(statisticsFilesLocationAfterExpiry);
+    return statisticsFilesLocationBeforeExpiry;
+  }
+
+  private Set<String> allStatisticsFilesLocation(TableMetadata tableMetadata) {
+    Set<String> allStatisticsFilesLocation;
+    if (tableMetadata.statisticsFiles() != null) {
+      allStatisticsFilesLocation =
+          tableMetadata.statisticsFiles().stream()
+              .map(StatisticsFile::path)
+              .collect(Collectors.toSet());
+    } else {
+      allStatisticsFilesLocation = Sets.newHashSet();
+    }
+    return allStatisticsFilesLocation;

Review Comment:
   Nit: I think we could just initialize `allStatisticsFile = Sets.newHashSet()`and remove the else block. Also newline after the block/before the return.



##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -264,6 +264,14 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
     LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      LOG.warn(
+          "Statistics files to delete: {}", Joiner.on(", ").join(expiredStatisticsFilesLocations));

Review Comment:
   I know it was warn level for the other cases in incremental file cleanup, but it seems like something we shouldn't keep following IMO because it's not really a warning case. We know we want to delete these files and it's intentional. 
   
   I think this could be debug level or we can simply remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053497562


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1245,74 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws URISyntaxException, IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    File statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 = writeStatsFileForCurrentSnapshot(table, statsFileLocation1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    File statsFileLocation2 = statsFileLocation(table);
+    StatisticsFile statisticsFile2 = writeStatsFileForCurrentSnapshot(table, statsFileLocation2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    table.updateProperties().set(TableProperties.MAX_SNAPSHOT_AGE_MS, "1").commit();

Review Comment:
   Expiring based on assumptions about time possibly introduces flakiness to this test.
   
   The right way to test this is to use `long afterFirstSnapshot = waitUntilAfter(table.currentSnapshot().timestampMillis()` and `table.expireSnapshots().expireOlderThan(afterFirstSnapshot).commit()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053480162


##########
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##########
@@ -1211,6 +1211,7 @@ public Builder removeSnapshots(Collection<Long> idsToRemove) {
         if (idsToRemove.contains(snapshotId)) {
           snapshotsById.remove(snapshotId);
           changes.add(new MetadataUpdate.RemoveSnapshot(snapshotId));
+          removeStatistics(snapshotId);

Review Comment:
   expire snapshots used to remove the expired snapshot entires from `TableMetadata` (which is changing contents of `TableMetadata`).
   So, now that those snapshots are expired. We need to clean up the statistics files entires present for those expired snapshots. Hence, modifying the  `TableMetadata`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1434588412

   @jackye1995: Can you please consider this for the 1.2.0 release?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1375041628

   ping @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1441282799

   @rdblue: Thanks for the review and merge.   
   
   Now, I have rebased and reworked #6091  based on the learnings of this PR. 
   So, it is ready for review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1044055824


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,33 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> validIds, Set<Long> expiredIds) {
+    Set<String> validStatisticsFilesLocation = Sets.newHashSet();
+    for (long snapshotId : validIds) {
+      List<StatisticsFile> statisticsFiles =
+          beforeExpiration.statisticsFilesForSnapshot(snapshotId);
+      if (statisticsFiles != null) {
+        statisticsFiles.forEach(file -> validStatisticsFilesLocation.add(file.path()));
+      }
+    }
+
+    // Statistics files can be reused in case of rewrite data files.

Review Comment:
   I can remove/update the comment. 
   
   But I just wanted to mention stats files can be reused (rewrite data files is one of the scenarios). So, we can't blindly remove it without checking the reachability graph. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1044055212


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -204,7 +204,11 @@ private TableMetadata internalApply() {
     base.snapshots().stream()
         .map(Snapshot::snapshotId)
         .filter(snapshot -> !idsToRetain.contains(snapshot))
-        .forEach(idsToRemove::add);
+        .forEach(
+            snapshotId -> {
+              updatedMetaBuilder.removeStatistics(snapshotId);

Review Comment:
   Ah, true. I will move it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053471444


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,25 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, TableMetadata afterExpiration) {
+    Set<String> statisticsFilesLocationBeforeExpiry = allStatisticsFilesLocation(beforeExpiration);

Review Comment:
   The term ["expiry" should be "expiration"](https://books.google.com/ngrams/graph?content=epiry%2Cexpiration&year_start=1800&year_end=2019&corpus=26&smoothing=3).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1054227750


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1594,63 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)
+      throws IOException {
+    StatisticsFile statisticsFile;
+    try (PuffinWriter puffinWriter = Puffin.write(Files.localOutput(statsLocation)).build()) {
+      long snapshotId = table.currentSnapshot().snapshotId();
+      long snapshotSequenceNumber = table.currentSnapshot().sequenceNumber();
+      puffinWriter.add(
+          new Blob(
+              "some-blob-type",
+              ImmutableList.of(1),
+              snapshotId,
+              snapshotSequenceNumber,
+              ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
+      puffinWriter.finish();
+      statisticsFile =
+          new GenericStatisticsFile(
+              snapshotId,
+              statsLocation.toString(),
+              puffinWriter.fileSize(),
+              puffinWriter.footerSize(),
+              puffinWriter.writtenBlobsMetadata().stream()
+                  .map(GenericBlobMetadata::from)
+                  .collect(ImmutableList.toImmutableList()));
+    }
+
+    commitStats(table, statisticsFile);
+    return statisticsFile;
+  }
+
+  private StatisticsFile reuseStatsForCurrentSnapshot(Table table, StatisticsFile statisticsFile) {
+    StatisticsFile newStatisticsFile =
+        new GenericStatisticsFile(
+            table.currentSnapshot().snapshotId(),
+            statisticsFile.path(),
+            statisticsFile.fileSizeInBytes(),
+            statisticsFile.fileFooterSizeInBytes(),
+            statisticsFile.blobMetadata());
+    commitStats(table, newStatisticsFile);
+    return newStatisticsFile;
+  }
+
+  private void commitStats(Table table, StatisticsFile statisticsFile) {
+    Transaction transaction = table.newTransaction();
+    transaction
+        .updateStatistics()
+        .setStatistics(statisticsFile.snapshotId(), statisticsFile)
+        .commit();
+    transaction.commitTransaction();
+  }
+
+  private File statsFileLocation(Table table) throws URISyntaxException {

Review Comment:
   This was taken from existing code.
   https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java#L910-L915
   
   I will modify the current testcase. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1305146273

   > @findepi: Thinking more about this, As the TableMetadata has just the list of StatisticsFile. And you have mentioned, statisticsFile.snapshotId() is "ID of the Iceberg table's snapshot the statistics were computed from"
   So, how will the query knows which statistics file to use for the current snapshot (Incase of rewrite data files, the current snapshot id may not be present in that list of statistics file?)
   
   @rdblue, @findepi: Please help in clearing my above doubt. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092842616


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(

Review Comment:
   removed in all the places



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092364464


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(

Review Comment:
   What is the value of this assertion? This code created the stats file and passed in the snapshot ID.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1013964398


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());

Review Comment:
   > The spec allows one Puffin file to be attached to multiple snapshots. For example, rewrite_files / optimize will produce a new snapshot, but could carry forward some/all table-level stats.
   So, the fact that the snapshot is being forgotten doesn't imply the stats file can be safely deleted.
   
   makes sense. I will compute live statistics file set and then check whether it can be safely deleted along with the current changes. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1047957805


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -264,6 +264,14 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
     LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      LOG.warn(
+          "Statistics files to delete: {}", Joiner.on(", ").join(expiredStatisticsFilesLocations));

Review Comment:
   I made it to debug and other places too I have changed to debug log. 



##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,25 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, TableMetadata afterExpiration) {
+    Set<String> statisticsFilesLocationBeforeExpiry = allStatisticsFilesLocation(beforeExpiration);
+    Set<String> statisticsFilesLocationAfterExpiry = allStatisticsFilesLocation(afterExpiration);
+    statisticsFilesLocationBeforeExpiry.removeAll(statisticsFilesLocationAfterExpiry);
+    return statisticsFilesLocationBeforeExpiry;
+  }
+
+  private Set<String> allStatisticsFilesLocation(TableMetadata tableMetadata) {
+    Set<String> allStatisticsFilesLocation;
+    if (tableMetadata.statisticsFiles() != null) {
+      allStatisticsFilesLocation =
+          tableMetadata.statisticsFiles().stream()
+              .map(StatisticsFile::path)
+              .collect(Collectors.toSet());
+    } else {
+      allStatisticsFilesLocation = Sets.newHashSet();
+    }
+    return allStatisticsFilesLocation;

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053474166


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,25 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, TableMetadata afterExpiration) {
+    Set<String> statisticsFilesLocationBeforeExpiry = allStatisticsFilesLocation(beforeExpiration);
+    Set<String> statisticsFilesLocationAfterExpiry = allStatisticsFilesLocation(afterExpiration);
+    statisticsFilesLocationBeforeExpiry.removeAll(statisticsFilesLocationAfterExpiry);
+    return statisticsFilesLocationBeforeExpiry;

Review Comment:
   I think it is a best practice to avoid mutating sets or returning mutable sets. Can you use Guava's `Sets.difference` instead of `removeAll`? It would also be better if the set returned by `allStatisticsFilesLocations` were immutable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053500145


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1245,74 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws URISyntaxException, IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    File statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 = writeStatsFileForCurrentSnapshot(table, statsFileLocation1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    File statsFileLocation2 = statsFileLocation(table);
+    StatisticsFile statisticsFile2 = writeStatsFileForCurrentSnapshot(table, statsFileLocation2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    table.updateProperties().set(TableProperties.MAX_SNAPSHOT_AGE_MS, "1").commit();

Review Comment:
   The testcase in this file has the same pattern. I will modify this one first and then have a follow up to fix the existing testcases. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1054260595


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1594,63 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)

Review Comment:
   This was taken from existing code.
   https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java#L917
   
   I will modify the current testcase.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092363213


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1613,51 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(

Review Comment:
   This name doesn't make sense. The snapshot ID and sequence number are passed in. I think it should be `writeStatsFile`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1418186286

   If the changes are ok, please merge this PR. So that I can rebase #6091 and make it ready for review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053468991


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -260,10 +260,18 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
         findFilesToDelete(manifestsToScan, manifestsToRevert, validIds, afterExpiration);
 
     deleteFiles(filesToDelete, "data");
-    LOG.warn("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
-    LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
+    LOG.debug("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

Review Comment:
   Joining should not be performed prior to passing the argument to debug, or else it will always create a string regardless of whether it is used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053482462


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1594,63 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)
+      throws IOException {
+    StatisticsFile statisticsFile;
+    try (PuffinWriter puffinWriter = Puffin.write(Files.localOutput(statsLocation)).build()) {
+      long snapshotId = table.currentSnapshot().snapshotId();
+      long snapshotSequenceNumber = table.currentSnapshot().sequenceNumber();
+      puffinWriter.add(
+          new Blob(
+              "some-blob-type",
+              ImmutableList.of(1),
+              snapshotId,
+              snapshotSequenceNumber,
+              ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
+      puffinWriter.finish();
+      statisticsFile =
+          new GenericStatisticsFile(
+              snapshotId,
+              statsLocation.toString(),
+              puffinWriter.fileSize(),
+              puffinWriter.footerSize(),
+              puffinWriter.writtenBlobsMetadata().stream()
+                  .map(GenericBlobMetadata::from)
+                  .collect(ImmutableList.toImmutableList()));
+    }
+
+    commitStats(table, statisticsFile);

Review Comment:
   Functions should not have unexpected side-effects, like committing to a table when the expectation is to only write a file. Can you remove this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053487339


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -260,10 +260,18 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
         findFilesToDelete(manifestsToScan, manifestsToRevert, validIds, afterExpiration);
 
     deleteFiles(filesToDelete, "data");
-    LOG.warn("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
-    LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
+    LOG.debug("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
+    LOG.debug("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      LOG.debug(
+          "Statistics files to delete: {}", Joiner.on(", ").join(expiredStatisticsFilesLocations));

Review Comment:
   Yes. It is only present in incremental flow (not reachable flow). I just followed the same pattern which is present for manifest, manifest list files. Shall I remove them too? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1102266077


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1599,47 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFile(
+      long snapshotId, long snapshotSequenceNumber, String statsLocation, FileIO fileIO)
+      throws IOException {
+    try (PuffinWriter puffinWriter = Puffin.write(fileIO.newOutputFile(statsLocation)).build()) {
+      puffinWriter.add(
+          new Blob(
+              "some-blob-type",
+              ImmutableList.of(1),
+              snapshotId,
+              snapshotSequenceNumber,
+              ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
+      puffinWriter.finish();
+
+      return new GenericStatisticsFile(
+          snapshotId,
+          statsLocation,
+          puffinWriter.fileSize(),
+          puffinWriter.footerSize(),
+          puffinWriter.writtenBlobsMetadata().stream()
+              .map(GenericBlobMetadata::from)
+              .collect(ImmutableList.toImmutableList()));
+    }
+  }
+
+  private StatisticsFile reuseStatsForCurrentSnapshot(

Review Comment:
   ACK. 
   
   Apologies for the back and forth. This was induced during refactoring.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1101920517


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+
+    Assertions.assertThat(new File(statsFileLocation1).exists()).isFalse();
+    Assertions.assertThat(new File(statsFileLocation2).exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: RewriteDataFiles can reuse statistics files across operations.
+    // This test reuses stats for append just to mimic this scenario without having to run
+    // RewriteDataFiles.

Review Comment:
   I think this is inaccurate then. It should be enough to state that in the even that a snapshot file is for some reason reused, we want to detect that it is still referenced and not delete it from the file system.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1012472518


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());
+      }
+    }
+    return expiredStatisticsFilesLocations;
+  }

Review Comment:
   How do we support time travel if we reuse the same statistics file for all the snapshots?
   Because of this, I assumed that each snapshot will create its own statistics file.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1306046322

   > I think we should change the label of the snapshot-id entry in https://iceberg.apache.org/spec/#table-statistics (to level, not blob level)
   
   Sorry, I still didn't get how the query engine will figure out the statistics file for the current snapshot (when the snapshot is reused).
   Instead of the suggested change, can we change `statisticsFile.snapshotId()` to the snapshot id of the referring snapshot? This way `TableMetadata` will have entries for each snapshot id (even for the resue case). Snapshot file path can be reused. 
   
   @rdblue: What do you think about this? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
findepi commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1013924386


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());
+      }
+    }
+    return expiredStatisticsFilesLocations;
+  }

Review Comment:
   @amogh-jahagirdar good point
   sorry, didn't read this comment before posting https://github.com/apache/iceberg/pull/6090/files#r1013923155
   @ajantha-bhat see there for an example where you want to reuse a stats file 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1039527780


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());
+      }
+    }
+    return expiredStatisticsFilesLocations;
+  }

Review Comment:
   @amogh-jahagirdar, @findepi  
   As #6267 clarifies the spec. I have handled the reuse scenario and added the testcase now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1348577920

   @rdblue, @findepi, @amogh-jahagirdar: Handled the comments. Please take a look at it again.    
   Also, #6267 is ready. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1298128853

   cc: @findepi, @rdblue, @szehon-ho   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053474955


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,25 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, TableMetadata afterExpiration) {
+    Set<String> statisticsFilesLocationBeforeExpiry = allStatisticsFilesLocation(beforeExpiration);
+    Set<String> statisticsFilesLocationAfterExpiry = allStatisticsFilesLocation(afterExpiration);
+    statisticsFilesLocationBeforeExpiry.removeAll(statisticsFilesLocationAfterExpiry);
+    return statisticsFilesLocationBeforeExpiry;
+  }
+
+  private Set<String> allStatisticsFilesLocation(TableMetadata tableMetadata) {

Review Comment:
   How about `statsFileLocations` instead? Since it returns a collection, the method name should be plural. And we also avoid needless modifiers, like `all` (which is assumed).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1058714858


##########
core/src/main/java/org/apache/iceberg/IncrementalFileCleanup.java:
##########
@@ -260,10 +259,14 @@ public void cleanFiles(TableMetadata beforeExpiration, TableMetadata afterExpira
         findFilesToDelete(manifestsToScan, manifestsToRevert, validIds, afterExpiration);
 
     deleteFiles(filesToDelete, "data");
-    LOG.warn("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
-    LOG.warn("Manifests Lists to delete: {}", Joiner.on(", ").join(manifestListsToDelete));
     deleteFiles(manifestsToDelete, "manifest");
     deleteFiles(manifestListsToDelete, "manifest list");
+
+    if (!beforeExpiration.statisticsFiles().isEmpty()) {
+      Set<String> expiredStatisticsFilesLocations =
+          expiredStatisticsFilesLocations(beforeExpiration, afterExpiration);
+      deleteFiles(expiredStatisticsFilesLocations, "statistics files");
+    }

Review Comment:
   Nit: I think we could just have expireStatisticsFilesLocation short circuit before computing  in case before expiration is empty. Rather than have the if(!beforeExpiration.statisticsFiles().isEmpty()) in both cleanup implementations



##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table);
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+
+    Assertions.assertThat(new File(statsFileLocation1).exists()).isFalse();
+    Assertions.assertThat(new File(statsFileLocation2).exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: In the real scenario, appendFile will always create a new statistics file.
+    // Only rewrite_data_files scenario can reuse the statistics file with new snapshot id.
+    // To avoid simulating rewrite, this test is reusing the stats for append operation itself.

Review Comment:
   Just a nit, I think this inline comment could be a bit more concise, maybe something like:
   
   "RewriteDataFiles can reuse statistics files across operations. This test reuses stats for append just to mimic this scenario without having to run RewriteDataFiles." 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053487409


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1594,63 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)

Review Comment:
   I don't think there's a reason to pass table to this method. I think this should accept a String location, a FileIO, and a snapshot ID.
   
   This should also not use `File` for writing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1383414321

   Just another reminder. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1411577821

   > Thanks, @ajantha-bhat! I made some comments in tests to fix.
   
   Thanks for the review. I have addressed the comments. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1102265083


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+
+    Assertions.assertThat(new File(statsFileLocation1).exists()).isFalse();
+    Assertions.assertThat(new File(statsFileLocation2).exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: RewriteDataFiles can reuse statistics files across operations.
+    // This test reuses stats for append just to mimic this scenario without having to run
+    // RewriteDataFiles.

Review Comment:
   ok. rephrased it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "ajantha-bhat (via GitHub)" <gi...@apache.org>.
ajantha-bhat commented on PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#issuecomment-1425164429

   > @ajantha-bhat, looks like there are just two more things to fix. Thanks!
   
   Done. Thanks for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092367108


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+
+    Assertions.assertThat(new File(statsFileLocation1).exists()).isFalse();
+    Assertions.assertThat(new File(statsFileLocation2).exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // Note: RewriteDataFiles can reuse statistics files across operations.
+    // This test reuses stats for append just to mimic this scenario without having to run
+    // RewriteDataFiles.

Review Comment:
   Does this actually happen in `RewriteDataFiles`? I don't think that the same stats file should be added more than once. It's a good idea to make sure it doesn't, but that should not be the behavior of built-in operations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1092365363


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1243,95 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    String statsFileLocation1 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile1 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation1,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    String statsFileLocation2 = statsFileLocation(table.location());
+    StatisticsFile statisticsFile2 =
+        writeStatsFileForCurrentSnapshot(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            statsFileLocation2,
+            table.io());
+    commitStats(table.newTransaction(), statisticsFile2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    long tAfterCommits = waitUntilAfter(table.currentSnapshot().timestampMillis());
+    removeSnapshots(table).expireOlderThan(tAfterCommits).commit();

Review Comment:
   Please add a comment about why the second snapshot and file aren't removed (they are the current table state).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1043911077


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,33 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> validIds, Set<Long> expiredIds) {
+    Set<String> validStatisticsFilesLocation = Sets.newHashSet();
+    for (long snapshotId : validIds) {
+      List<StatisticsFile> statisticsFiles =
+          beforeExpiration.statisticsFilesForSnapshot(snapshotId);
+      if (statisticsFiles != null) {
+        statisticsFiles.forEach(file -> validStatisticsFilesLocation.add(file.path()));
+      }
+    }
+
+    // Statistics files can be reused in case of rewrite data files.

Review Comment:
   Why would rewrite data files be involved?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1043913680


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,33 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> validIds, Set<Long> expiredIds) {

Review Comment:
   Rather than guessing which stats files were removed during expiration, I think that this should just diff the two lists of stats files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1044149920


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -83,4 +85,33 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> validIds, Set<Long> expiredIds) {

Review Comment:
   > just diff the two lists of stats files.
   
   Two StatisticsFiles objects can use the same stats file location in case of reuse.  (for example during rewrite data files). I have simplified the logic now. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053500032


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1234,6 +1245,74 @@ public void testMultipleRefsAndCleanExpiredFilesFailsForIncrementalCleanup() {
                 .commit());
   }
 
+  @Test
+  public void testExpireWithStatisticsFiles() throws URISyntaxException, IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    File statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 = writeStatsFileForCurrentSnapshot(table, statsFileLocation1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    File statsFileLocation2 = statsFileLocation(table);
+    StatisticsFile statisticsFile2 = writeStatsFileForCurrentSnapshot(table, statsFileLocation2);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile2.snapshotId());
+
+    Assert.assertEquals("Should have 2 statistics file", 2, table.statisticsFiles().size());
+
+    table.updateProperties().set(TableProperties.MAX_SNAPSHOT_AGE_MS, "1").commit();
+
+    removeSnapshots(table).commit();
+
+    Assert.assertEquals("Should keep 1 snapshot", 1, Iterables.size(table.snapshots()));
+    Assertions.assertThat(table.statisticsFiles())
+        .hasSize(1)
+        .extracting(StatisticsFile::snapshotId)
+        .as("Should contain only the statistics file of snapshot2")
+        .isEqualTo(Lists.newArrayList(statisticsFile2.snapshotId()));
+    Assertions.assertThat(statsFileLocation1.exists()).isFalse();
+    Assertions.assertThat(statsFileLocation2.exists()).isTrue();
+  }
+
+  @Test
+  public void testExpireWithStatisticsFilesWithReuse() throws URISyntaxException, IOException {
+    table.newAppend().appendFile(FILE_A).commit();
+    File statsFileLocation1 = statsFileLocation(table);
+    StatisticsFile statisticsFile1 = writeStatsFileForCurrentSnapshot(table, statsFileLocation1);
+    Assert.assertEquals(
+        "Must match the latest snapshot",
+        table.currentSnapshot().snapshotId(),
+        statisticsFile1.snapshotId());
+
+    table.newAppend().appendFile(FILE_B).commit();
+    // reuse the existing stats file with the current snapshot
+    StatisticsFile statisticsFile2 = reuseStatsForCurrentSnapshot(table, statisticsFile1);

Review Comment:
   I can understand why you'd want to test that reused files are not removed. However, this metadata is invalid so I think this test should have a warning comment at the top about why it is testing this case and that it is not expected in real tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1053490603


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1594,63 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFileForCurrentSnapshot(Table table, File statsLocation)
+      throws IOException {
+    StatisticsFile statisticsFile;
+    try (PuffinWriter puffinWriter = Puffin.write(Files.localOutput(statsLocation)).build()) {
+      long snapshotId = table.currentSnapshot().snapshotId();
+      long snapshotSequenceNumber = table.currentSnapshot().sequenceNumber();
+      puffinWriter.add(
+          new Blob(
+              "some-blob-type",
+              ImmutableList.of(1),
+              snapshotId,
+              snapshotSequenceNumber,
+              ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
+      puffinWriter.finish();
+      statisticsFile =
+          new GenericStatisticsFile(
+              snapshotId,
+              statsLocation.toString(),
+              puffinWriter.fileSize(),
+              puffinWriter.footerSize(),
+              puffinWriter.writtenBlobsMetadata().stream()
+                  .map(GenericBlobMetadata::from)
+                  .collect(ImmutableList.toImmutableList()));
+    }
+
+    commitStats(table, statisticsFile);
+    return statisticsFile;
+  }
+
+  private StatisticsFile reuseStatsForCurrentSnapshot(Table table, StatisticsFile statisticsFile) {
+    StatisticsFile newStatisticsFile =
+        new GenericStatisticsFile(
+            table.currentSnapshot().snapshotId(),
+            statisticsFile.path(),
+            statisticsFile.fileSizeInBytes(),
+            statisticsFile.fileFooterSizeInBytes(),
+            statisticsFile.blobMetadata());
+    commitStats(table, newStatisticsFile);
+    return newStatisticsFile;
+  }
+
+  private void commitStats(Table table, StatisticsFile statisticsFile) {
+    Transaction transaction = table.newTransaction();
+    transaction
+        .updateStatistics()
+        .setStatistics(statisticsFile.snapshotId(), statisticsFile)
+        .commit();
+    transaction.commitTransaction();
+  }
+
+  private File statsFileLocation(Table table) throws URISyntaxException {

Review Comment:
   This should not use `File` or `URI`. Please use `table.ops().metadataFileLocation(...)` instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1012465971


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());
+      }
+    }
+    return expiredStatisticsFilesLocations;
+  }

Review Comment:
   Curious, is it guaranteed that StatisticsFIles produced at a given snapshot cannot be reused in subsequent snapshots?
   
    
   The comment on StatisticsFile#snapshotId says 
   
   `  /** ID of the Iceberg table's snapshot the statistics were computed from. */
     long snapshotId();
   `
   For example, Is it possible that a StatisticsFile "file1" is computed at snapshot 1, but then still considered a valid reachable statistics file at Snapshot 2? and then we wouldn't want to do a removal of that file.
   
   If it is possible it seems that we need to capture the reachability set rather than just removing all the statistics file before expiration. let me know if i misunderstood ! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1012465971


##########
core/src/main/java/org/apache/iceberg/FileCleanupStrategy.java:
##########
@@ -79,4 +80,15 @@ protected void deleteFiles(Set<String> pathsToDelete, String fileType) {
             (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
         .run(deleteFunc::accept);
   }
+
+  protected Set<String> expiredStatisticsFilesLocations(
+      TableMetadata beforeExpiration, Set<Long> expiredIds) {
+    Set<String> expiredStatisticsFilesLocations = Sets.newHashSet();
+    for (StatisticsFile statisticsFile : beforeExpiration.statisticsFiles()) {
+      if (expiredIds.contains(statisticsFile.snapshotId())) {
+        expiredStatisticsFilesLocations.add(statisticsFile.path());
+      }
+    }
+    return expiredStatisticsFilesLocations;
+  }

Review Comment:
   Curious, is it guaranteed that StatisticsFIles cannot be reused across multiple snapshots? 
   The comment on StatisticsFile#snapshotId says 
   
   `  /** ID of the Iceberg table's snapshot the statistics were computed from. */
     long snapshotId();
   `
   Is it possible that a StatisticsFile "file1" is computed at snapshot 1, but then still considered a valid reachable statistics file at Snapshot 2? and then we wouldn't want to do a removal of that file.
   
   If it is possible it seems that we need to capture the reachability set rather than just removing all the statistics file before expiration. let me know if i misunderstood ! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6090: Core: Handle statistics file clean up from expireSnapshots

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #6090:
URL: https://github.com/apache/iceberg/pull/6090#discussion_r1114760916


##########
core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java:
##########
@@ -1515,4 +1599,46 @@ private RemoveSnapshots removeSnapshots(Table table) {
     RemoveSnapshots removeSnapshots = (RemoveSnapshots) table.expireSnapshots();
     return (RemoveSnapshots) removeSnapshots.withIncrementalCleanup(incrementalCleanup);
   }
+
+  private StatisticsFile writeStatsFile(
+      long snapshotId, long snapshotSequenceNumber, String statsLocation, FileIO fileIO)
+      throws IOException {
+    try (PuffinWriter puffinWriter = Puffin.write(fileIO.newOutputFile(statsLocation)).build()) {
+      puffinWriter.add(
+          new Blob(
+              "some-blob-type",
+              ImmutableList.of(1),
+              snapshotId,
+              snapshotSequenceNumber,
+              ByteBuffer.wrap("blob content".getBytes(StandardCharsets.UTF_8))));
+      puffinWriter.finish();
+
+      return new GenericStatisticsFile(
+          snapshotId,
+          statsLocation,
+          puffinWriter.fileSize(),
+          puffinWriter.footerSize(),
+          puffinWriter.writtenBlobsMetadata().stream()
+              .map(GenericBlobMetadata::from)
+              .collect(ImmutableList.toImmutableList()));
+    }
+  }
+
+  private StatisticsFile reuseStatsFile(long snapshotId, StatisticsFile statisticsFile) {
+    return new GenericStatisticsFile(
+        snapshotId,
+        statisticsFile.path(),
+        statisticsFile.fileSizeInBytes(),
+        statisticsFile.fileFooterSizeInBytes(),
+        statisticsFile.blobMetadata());
+  }
+
+  private void commitStats(Table table, StatisticsFile statisticsFile) {
+    table.updateStatistics().setStatistics(statisticsFile.snapshotId(), statisticsFile).commit();

Review Comment:
   I find it odd that a single line like this is in a separate method. Seems like this could be inlined and would make the tests more readable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org