You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/12 23:18:48 UTC

[GitHub] [iceberg] aokolnychyi opened a new issue #1597: Spark SQL Extensions: Expire snapshots

aokolnychyi opened a new issue #1597:
URL: https://github.com/apache/iceberg/issues/1597


   Until we have a proper VACUUM command, we can expose a stored procedure for expiring snapshots. 
   
   ```
   CALL catalog.schema.expire_snapshots(
     namespace  => 'namespace_name'
     table => 'table_name',
     older_than => timestamp,
     retain_last => number
   )
   ```
   
   The output from this command may include the number of deleted data, manifest, manifest list files.
   
   I think we should make the older_than and retain_last args optional and require at least one to be present. It seems Presto recently added support for optional args in procedures. 
   
   Then we can expire snapshots that are older than a given interval:
   
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     older_than => CURRENT_TIMESTAMP() - INTERVAL '7' DAYS
   )
   ```
   
   We can also provide a minimum number of snapshots to keep:
   
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     older_than => CURRENT_TIMESTAMP() - INTERVAL '7' DAYS,
     retain_last => 100
   )
   ```
   
   Or just give a number of snapshots to keep:
   
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     retain_last => 100
   )
   ```
   
   Or give a specific point in time:
   
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     older_than => TIMESTAMP '2020-01-19 03:14:07',
     retain_last => 100
   )
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-708114465


   > Can we just use `System.currentTimeMillis`? Spark returns us `java.sql.Timestamp` that is time zone agnostic, I guess.
   
   The `system.currenttimestamp` is the absolute time which does‘t need to consider the time zone.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707533790


   please assign the issue to me!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707497461


   I'd make both `older_than` and `retain_last` optional and require at least one to be present. If only `retain_last` is present, I'd default `timestamp` to the current timestamp. 
   
   The following would keep only 100 last snapshots, for example.
   
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     retain_last => 100
   )
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-708114465


   > Can we just use `System.currentTimeMillis`? Spark returns us `java.sql.Timestamp` that is time zone agnostic, I guess.
   
   The `system.currenttimestamp` is the absolute time which does need to consider the time zone.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707839400


   Can we just use `System.currentTimeMillis`? Spark returns us `java.sql.Timestamp` that is time zone agnostic, I guess. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707533790


   please assign this issue to me!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707507924


   > I'd make both `older_than` and `retain_last` optional and require at least one to be present. If only `retain_last` is present, I'd default `timestamp` to the current timestamp.
   
   I agree with the default timestamp (current time) for `older_than`, but what's the default time zone?
   Just the default local system time zone or a specific default time zone?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707455716


   @aokolnychyi 
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     retain_last => 100
   )
   ```
   In the `expire snapshot` API, if we only give the `retain last num` for a table, the commit action is invalid.
   
   the apply code is:
   ```
     private TableMetadata internalApply() {
       this.base = ops.refresh();
   
       TableMetadata updateMeta = base.removeSnapshotsIf(snapshot ->
           idsToRemove.contains(snapshot.snapshotId()) ||
           (expireOlderThan != null && snapshot.timestampMillis() < expireOlderThan &&
               !idsToRetain.contains(snapshot.snapshotId())));
       List<Snapshot> updateSnapshots = updateMeta.snapshots();
       List<Snapshot> baseSnapshots = base.snapshots();
       return updateSnapshots.size() != baseSnapshots.size() ? updateMeta : base;
     }
   ```
   if the `expireOlderThan` is not set(`expireOlderThan` is null), the `retain last num` or `idsToRetain` is invalid.
   
   So I think the  `expireOlderThan` or `older_than ` is required, or default value for the `older_than` parameter.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707392683


   @liukun4515 has expressed interest in working on this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707507924


   > I'd make both `older_than` and `retain_last` optional and require at least one to be present. If only `retain_last` is present, I'd default `timestamp` to the current timestamp.
   
   I agree with the default timestamp (current time) for `older_than`, but what's the default time zone?
   Just the default local system time zone or a specific default time zone?
   
   @aokolnychyi 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707455716


   @aokolnychyi 
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     retain_last => 100
   )
   ```
   In the `expire snapshot` API, if we only give the `retain last num` for a table, the commit action is invalid.
   
   the apply code is:
   ```
     private TableMetadata internalApply() {
       this.base = ops.refresh();
   
       TableMetadata updateMeta = base.removeSnapshotsIf(snapshot ->
           idsToRemove.contains(snapshot.snapshotId()) ||
           (expireOlderThan != null && snapshot.timestampMillis() < expireOlderThan &&
               !idsToRetain.contains(snapshot.snapshotId())));
       List<Snapshot> updateSnapshots = updateMeta.snapshots();
       List<Snapshot> baseSnapshots = base.snapshots();
       return updateSnapshots.size() != baseSnapshots.size() ? updateMeta : base;
     }
   ```
   if the `expireOlderThan` is not set(`expireOlderThan` is null), the `retain last num` or `idsToRetain` is invalid.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707584762


   > I'd make both `older_than` and `retain_last` optional and require at least one to be present. If only `retain_last` is present, I'd default `timestamp` to the current timestamp.
   
   In the procedure SQL:
   ```
   CALL catalog.schema.expire_snapshot(
       namespace => 'namespace_name',
       table => 'table_name'
   )
   ```
   if the `expire snapshot sql` lacks the two parameters `retain_last` and `older_than`, how to tell the user where the error occurred?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #1597:
URL: https://github.com/apache/iceberg/issues/1597


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 edited a comment on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 edited a comment on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707584762


   > I'd make both `older_than` and `retain_last` optional and require at least one to be present. If only `retain_last` is present, I'd default `timestamp` to the current timestamp.
   
   In the procedure SQL:
   ```
   CALL catalog.schema.expire_snapshot(
       namespace => 'namespace_name',
       table => 'table_name'
   )
   ```
   if the `expire snapshot sql` lacks the two parameters `retain_last` and `older_than`, how to tell the user where the error occurred?
   Just throw a RE exception or error?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707507924


   > I'd make both `older_than` and `retain_last` optional and require at least one to be present. If only `retain_last` is present, I'd default `timestamp` to the current timestamp.
   
   I agree with the default timestamp (current time) for `older_than`, but what's the default time zone?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] liukun4515 commented on issue #1597: Spark SQL Extensions: Expire snapshots

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #1597:
URL: https://github.com/apache/iceberg/issues/1597#issuecomment-707455716


   @aokolnychyi 
   ```
   CALL catalog.schema.expire_snapshots(
     namespace => 'namespace_name',
     table => 'table_name',
     retain_last => 100
   )
   ```
   In the `expire snapshot` API, if we only give the `retain last num` for a table, the commit action is invalid.
   
   the apply code is:
   ```
     private TableMetadata internalApply() {
       this.base = ops.refresh();
   
       TableMetadata updateMeta = base.removeSnapshotsIf(snapshot ->
           idsToRemove.contains(snapshot.snapshotId()) ||
           (expireOlderThan != null && snapshot.timestampMillis() < expireOlderThan &&
               !idsToRetain.contains(snapshot.snapshotId())));
       List<Snapshot> updateSnapshots = updateMeta.snapshots();
       List<Snapshot> baseSnapshots = base.snapshots();
       return updateSnapshots.size() != baseSnapshots.size() ? updateMeta : base;
     }
   ```
   if the `expireOlderThan` is not set, the `retain last num` or `idsToRetain` is invalid.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org