You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/27 19:32:18 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #1395: Changes default collect behavior of ExpireSnapshotActions

RussellSpitzer opened a new pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395


   Previously ExpireSnapshotAction would always use toLocalIterator which
   ends up costing significantly more time on smaller data sets. Since even
   the largest lists of files are expected to fit in memory we are changing
   the default to Collect. Collect will bring back the results more quickly
   at the cost of additional memory. An option to streamDeleteResults will still
   be available for extremely large expire operations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478662813



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -90,6 +91,19 @@ protected Table table() {
     return table;
   }
 
+  /**
+   * Whether or not to use stream the expired file list to the driver. The default (false) will use
+   * collect to bring back all results to the driver at once which may be an issue with very long file lists.

Review comment:
       What about this? `By default, all files to delete are brought to the driver at once which may be an issue with very long file lists. Set this to true to use {@link Dataset#toLocalIterator()} if you are running into memory issues when collecting the list of files to be deleted.` 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478664879



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -90,6 +91,19 @@ protected Table table() {
     return table;
   }
 
+  /**
+   * Whether or not to use stream the expired file list to the driver. The default (false) will use
+   * collect to bring back all results to the driver at once which may be an issue with very long file lists.

Review comment:
       Sgtm




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478678646



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -92,11 +92,10 @@ protected Table table() {
   }
 
   /**
-   * Whether or not to use stream the expired file list to the driver. The default (false) will use
-   * collect to bring back all results to the driver at once which may be an issue with very long file lists.
-   * Set this to true to use toLocalIterator if you are running into memory issues when collecting the list of files
-   * to be deleted.
-   * @param stream whether to use toLocalIterator to stream results instead of collect.
+   * By default, all files to delete are brought to the driver at once which may be an issue with very long file lists.
+   * Set this to true to use {@link Dataset#toLocalIterator()} if you are running into memory issues when collecting
+   * the list of files to be deleted.
+   * @param stream whether to use {@link Dataset#toLocalIterator} to stream results instead of {@link Dataset#collect}.

Review comment:
       These are broken links because we don't have Spark's API in our Javadoc, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478664931



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -90,6 +91,19 @@ protected Table table() {
     return table;
   }
 
+  /**
+   * Whether or not to use stream the expired file list to the driver. The default (false) will use
+   * collect to bring back all results to the driver at once which may be an issue with very long file lists.

Review comment:
       Sounds good to me, except that it creates a dead link because `Dataset#toLocalIterator` is not in Iceberg Javadoc.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#issuecomment-682149429


   @aokolnychyi  + @rdblue The change we discussed previously for the "collect" operation in ExpireSnapshotActions
   
   And yes, that is the easiest way to determine how many jobs have run in the context without starting a new job state store and attaching it to the listener bus :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#issuecomment-682236077


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478660756



##########
File path: spark/src/test/java/org/apache/iceberg/actions/TestExpireSnapshotsAction.java
##########
@@ -1008,4 +1008,37 @@ public void testExpireAction() {
     Assert.assertSame("Multiple calls to expire should return the same deleted files",
         pendingDeletes, action.expire());
   }
+
+  @Test
+  public void testUseLocalIterator() {
+    table.newFastAppend()
+            .appendFile(FILE_A)

Review comment:
       nit: I think it should be 4 spaces for continued indentation. Applies to the whole test.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#issuecomment-682162915


   I think it makes sense to collect the result to the driver by default to avoid the overhead of spinning a job for every Spark partition.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478661130



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -90,6 +91,19 @@ protected Table table() {
     return table;
   }
 
+  /**
+   * Whether or not to use stream the expired file list to the driver. The default (false) will use

Review comment:
       nit: `to use stream` -> `to stream`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478681895



##########
File path: spark/src/main/java/org/apache/iceberg/actions/ExpireSnapshotsAction.java
##########
@@ -92,11 +92,10 @@ protected Table table() {
   }
 
   /**
-   * Whether or not to use stream the expired file list to the driver. The default (false) will use
-   * collect to bring back all results to the driver at once which may be an issue with very long file lists.
-   * Set this to true to use toLocalIterator if you are running into memory issues when collecting the list of files
-   * to be deleted.
-   * @param stream whether to use toLocalIterator to stream results instead of collect.
+   * By default, all files to delete are brought to the driver at once which may be an issue with very long file lists.
+   * Set this to true to use {@link Dataset#toLocalIterator()} if you are running into memory issues when collecting
+   * the list of files to be deleted.
+   * @param stream whether to use {@link Dataset#toLocalIterator} to stream results instead of {@link Dataset#collect}.

Review comment:
       You are right, my bad.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395#discussion_r478664788



##########
File path: spark/src/test/java/org/apache/iceberg/actions/TestExpireSnapshotsAction.java
##########
@@ -1008,4 +1008,37 @@ public void testExpireAction() {
     Assert.assertSame("Multiple calls to expire should return the same deleted files",
         pendingDeletes, action.expire());
   }
+
+  @Test
+  public void testUseLocalIterator() {
+    table.newFastAppend()
+            .appendFile(FILE_A)

Review comment:
       Got it!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer closed pull request #1395: Changes default collect behavior of ExpireSnapshotActions

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed pull request #1395:
URL: https://github.com/apache/iceberg/pull/1395


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org