You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/24 22:04:40 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #1244: Allow ExpireSnapshots to run without Deleting Files

RussellSpitzer opened a new pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244


   Previously ExpireSnapshots would always follow the expiration of Snapshots
   with a deletion of local DataFiles and Manifests which were no longer relevant
   to the Table. This patch introduces an API to skip this deletion phase, which
   was always run locally, and instead just expire the snapshots. This allows for
   future implementations which can read and remove unused files in parallel or
   on distributed frameworks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] moulimukherjee commented on a change in pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
moulimukherjee commented on a change in pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#discussion_r460327522



##########
File path: core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java
##########
@@ -472,6 +472,55 @@ public void dataFilesCleanup() throws IOException {
     Assert.assertTrue("FILE_B should be deleted", deletedFiles.contains(FILE_B.path().toString()));
   }
 
+  @Test
+  public void noDataFileCleanup() throws IOException {
+    table.newFastAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newFastAppend()
+        .appendFile(FILE_B)
+        .commit();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_B), ImmutableSet.of(FILE_D))
+        .commit();
+    long thirdSnapshotId = table.currentSnapshot().snapshotId();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_C))
+        .commit();
+    long fourthSnapshotId = table.currentSnapshot().snapshotId();
+
+    long t4 = System.currentTimeMillis();
+    while (t4 <= table.currentSnapshot().timestampMillis()) {
+      t4 = System.currentTimeMillis();
+    }
+
+    List<ManifestFile> manifests = table.currentSnapshot().dataManifests();
+
+    ManifestFile newManifest = writeManifest(
+        "manifest-file-1.avro",
+        manifestEntry(Status.EXISTING, thirdSnapshotId, FILE_C),
+        manifestEntry(Status.EXISTING, fourthSnapshotId, FILE_D));
+
+    RewriteManifests rewriteManifests = table.rewriteManifests();
+    manifests.forEach(rewriteManifests::deleteManifest);
+    rewriteManifests.addManifest(newManifest);
+    rewriteManifests.commit();
+
+    Set<String> deletedFiles = Sets.newHashSet();
+
+    table.expireSnapshots()
+        .deleteExpiredFiles(false)
+        .expireOlderThan(t4)
+        .deleteWith(deletedFiles::add)

Review comment:
       Ack. (To clarify, was not commenting on the test per se, I should have made it a general comment rather than on this particular line)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#issuecomment-664811179


   I find this useful (as it is, even without the possibility of extension), as I realized expiring snapshots take very short period on actual expiration of snapshots (with committing), and bunch of time is taken from removing stale data and manifest files (I'm on 0.9.0 so still single threaded, but even with the recent patch it's limited on single process) and it looks to be shown as stuck, though it is actually doing something.
   
   If I understand correctly, as once the snapshots are expired, stale data and manifest can be also removed via RemoveOrphanFilesAction with Spark computation power and resource. It's read-only in point of table's view so no contention on optimistic locking.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#issuecomment-663760212


   @aokolnychyi just the Delete flag


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] moulimukherjee commented on a change in pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
moulimukherjee commented on a change in pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#discussion_r460322942



##########
File path: core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java
##########
@@ -472,6 +472,55 @@ public void dataFilesCleanup() throws IOException {
     Assert.assertTrue("FILE_B should be deleted", deletedFiles.contains(FILE_B.path().toString()));
   }
 
+  @Test
+  public void noDataFileCleanup() throws IOException {
+    table.newFastAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newFastAppend()
+        .appendFile(FILE_B)
+        .commit();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_B), ImmutableSet.of(FILE_D))
+        .commit();
+    long thirdSnapshotId = table.currentSnapshot().snapshotId();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_C))
+        .commit();
+    long fourthSnapshotId = table.currentSnapshot().snapshotId();
+
+    long t4 = System.currentTimeMillis();
+    while (t4 <= table.currentSnapshot().timestampMillis()) {
+      t4 = System.currentTimeMillis();
+    }
+
+    List<ManifestFile> manifests = table.currentSnapshot().dataManifests();
+
+    ManifestFile newManifest = writeManifest(
+        "manifest-file-1.avro",
+        manifestEntry(Status.EXISTING, thirdSnapshotId, FILE_C),
+        manifestEntry(Status.EXISTING, fourthSnapshotId, FILE_D));
+
+    RewriteManifests rewriteManifests = table.rewriteManifests();
+    manifests.forEach(rewriteManifests::deleteManifest);
+    rewriteManifests.addManifest(newManifest);
+    rewriteManifests.commit();
+
+    Set<String> deletedFiles = Sets.newHashSet();
+
+    table.expireSnapshots()
+        .deleteExpiredFiles(false)
+        .expireOlderThan(t4)
+        .deleteWith(deletedFiles::add)

Review comment:
       It seems kind of counterintuitive to pass a `deleteWith` but still expect that the files won't be deleted? 
   
   What about defining a default method instead which does nothing eg: `doNothing`, and make it available as a handy reference which people could use instead? So it would look something like `table.expireSnapshots().deleteWith(doNothing)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#discussion_r461721527



##########
File path: api/src/main/java/org/apache/iceberg/ExpireSnapshots.java
##########
@@ -97,4 +97,16 @@
    * @return this for method chaining
    */
   ExpireSnapshots executeWith(ExecutorService executorService);
+
+  /**
+   * Allows expiration of snapshots without any cleanup of underlying manifest or data files.
+   * <p>
+   * Allows control in removing data and manifest files which may be more efficiently removed using
+   * a distributed framework through the actions API.
+   * <p>

Review comment:
       nit: is this deliberately?

##########
File path: api/src/main/java/org/apache/iceberg/ExpireSnapshots.java
##########
@@ -97,4 +97,16 @@
    * @return this for method chaining
    */
   ExpireSnapshots executeWith(ExecutorService executorService);
+
+  /**
+   * Allows expiration of snapshots without any cleanup of underlying manifest or data files.
+   * <p>
+   * Allows control in removing data and manifest files which may be more efficiently removed using
+   * a distributed framework through the actions API.
+   * <p>

Review comment:
       nit: is this added deliberately?

##########
File path: api/src/main/java/org/apache/iceberg/ExpireSnapshots.java
##########
@@ -97,4 +97,16 @@
    * @return this for method chaining
    */
   ExpireSnapshots executeWith(ExecutorService executorService);
+
+  /**
+   * Allows expiration of snapshots without any cleanup of underlying manifest or data files.
+   * <p>
+   * Allows control in removing data and manifest files which may be more efficiently removed using
+   * a distributed framework through the actions API.
+   * <p>

Review comment:
       I mean an extra `<p>` at the end.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#issuecomment-663788526


   Ah, yes the reason we don't want to do that is we would still be scanning
   all the manifests in that case. Our goal is to skip all operations after
   snapshot expiration.
   
   On Fri, Jul 24, 2020, 6:22 PM moulimukherjee <no...@github.com>
   wrote:
   
   > *@moulimukherjee* commented on this pull request.
   > ------------------------------
   >
   > In core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java
   > <https://github.com/apache/iceberg/pull/1244#discussion_r460327522>:
   >
   > > +    ManifestFile newManifest = writeManifest(
   > +        "manifest-file-1.avro",
   > +        manifestEntry(Status.EXISTING, thirdSnapshotId, FILE_C),
   > +        manifestEntry(Status.EXISTING, fourthSnapshotId, FILE_D));
   > +
   > +    RewriteManifests rewriteManifests = table.rewriteManifests();
   > +    manifests.forEach(rewriteManifests::deleteManifest);
   > +    rewriteManifests.addManifest(newManifest);
   > +    rewriteManifests.commit();
   > +
   > +    Set<String> deletedFiles = Sets.newHashSet();
   > +
   > +    table.expireSnapshots()
   > +        .deleteExpiredFiles(false)
   > +        .expireOlderThan(t4)
   > +        .deleteWith(deletedFiles::add)
   >
   > Ack. (To clarify, was not commenting on the test per se, I should have
   > made it a general comment rather than on this particular line)
   >
   > —
   > You are receiving this because you authored the thread.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/iceberg/pull/1244#discussion_r460327522>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AADE2YLJJA7XQABSN4KQLR3R5IJTNANCNFSM4PHDMMFA>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#issuecomment-665139201






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#issuecomment-665135343






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#discussion_r461723985



##########
File path: api/src/main/java/org/apache/iceberg/ExpireSnapshots.java
##########
@@ -97,4 +97,16 @@
    * @return this for method chaining
    */
   ExpireSnapshots executeWith(ExecutorService executorService);
+
+  /**
+   * Allows expiration of snapshots without any cleanup of underlying manifest or data files.
+   * <p>
+   * Allows control in removing data and manifest files which may be more efficiently removed using
+   * a distributed framework through the actions API.
+   * <p>

Review comment:
       What is this?

##########
File path: api/src/main/java/org/apache/iceberg/ExpireSnapshots.java
##########
@@ -97,4 +97,16 @@
    * @return this for method chaining
    */
   ExpireSnapshots executeWith(ExecutorService executorService);
+
+  /**
+   * Allows expiration of snapshots without any cleanup of underlying manifest or data files.
+   * <p>
+   * Allows control in removing data and manifest files which may be more efficiently removed using
+   * a distributed framework through the actions API.
+   * <p>

Review comment:
       Ah I can remove that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#issuecomment-665237459






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#discussion_r460326689



##########
File path: core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java
##########
@@ -472,6 +472,55 @@ public void dataFilesCleanup() throws IOException {
     Assert.assertTrue("FILE_B should be deleted", deletedFiles.contains(FILE_B.path().toString()));
   }
 
+  @Test
+  public void noDataFileCleanup() throws IOException {
+    table.newFastAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newFastAppend()
+        .appendFile(FILE_B)
+        .commit();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_B), ImmutableSet.of(FILE_D))
+        .commit();
+    long thirdSnapshotId = table.currentSnapshot().snapshotId();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_C))
+        .commit();
+    long fourthSnapshotId = table.currentSnapshot().snapshotId();
+
+    long t4 = System.currentTimeMillis();
+    while (t4 <= table.currentSnapshot().timestampMillis()) {
+      t4 = System.currentTimeMillis();
+    }
+
+    List<ManifestFile> manifests = table.currentSnapshot().dataManifests();
+
+    ManifestFile newManifest = writeManifest(
+        "manifest-file-1.avro",
+        manifestEntry(Status.EXISTING, thirdSnapshotId, FILE_C),
+        manifestEntry(Status.EXISTING, fourthSnapshotId, FILE_D));
+
+    RewriteManifests rewriteManifests = table.rewriteManifests();
+    manifests.forEach(rewriteManifests::deleteManifest);
+    rewriteManifests.addManifest(newManifest);
+    rewriteManifests.commit();
+
+    Set<String> deletedFiles = Sets.newHashSet();
+
+    table.expireSnapshots()
+        .deleteExpiredFiles(false)
+        .expireOlderThan(t4)
+        .deleteWith(deletedFiles::add)

Review comment:
       Yeah, this is how all the tests in this class function. They use the "add" to set delete action to see whether the delete action gets triggered and on what files.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] moulimukherjee commented on a change in pull request #1244: Allow ExpireSnapshots to run without Deleting Files

Posted by GitBox <gi...@apache.org>.
moulimukherjee commented on a change in pull request #1244:
URL: https://github.com/apache/iceberg/pull/1244#discussion_r460324521



##########
File path: core/src/test/java/org/apache/iceberg/TestRemoveSnapshots.java
##########
@@ -472,6 +472,55 @@ public void dataFilesCleanup() throws IOException {
     Assert.assertTrue("FILE_B should be deleted", deletedFiles.contains(FILE_B.path().toString()));
   }
 
+  @Test
+  public void noDataFileCleanup() throws IOException {
+    table.newFastAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newFastAppend()
+        .appendFile(FILE_B)
+        .commit();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_B), ImmutableSet.of(FILE_D))
+        .commit();
+    long thirdSnapshotId = table.currentSnapshot().snapshotId();
+
+    table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_C))
+        .commit();
+    long fourthSnapshotId = table.currentSnapshot().snapshotId();
+
+    long t4 = System.currentTimeMillis();
+    while (t4 <= table.currentSnapshot().timestampMillis()) {
+      t4 = System.currentTimeMillis();
+    }
+
+    List<ManifestFile> manifests = table.currentSnapshot().dataManifests();
+
+    ManifestFile newManifest = writeManifest(
+        "manifest-file-1.avro",
+        manifestEntry(Status.EXISTING, thirdSnapshotId, FILE_C),
+        manifestEntry(Status.EXISTING, fourthSnapshotId, FILE_D));
+
+    RewriteManifests rewriteManifests = table.rewriteManifests();
+    manifests.forEach(rewriteManifests::deleteManifest);
+    rewriteManifests.addManifest(newManifest);
+    rewriteManifests.commit();
+
+    Set<String> deletedFiles = Sets.newHashSet();
+
+    table.expireSnapshots()
+        .deleteExpiredFiles(false)
+        .expireOlderThan(t4)
+        .deleteWith(deletedFiles::add)

Review comment:
       Oh, is it to prevent iteration through callback for each deleted file? Actually that makes sense. If so, please ignore ^.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org