Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/04 18:57:12 UTC

[GitHub] [iceberg] smallx opened a new pull request #3844: Core: Add truncate table API and support fast truncate table

smallx opened a new pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844


   With the truncate table API, we can truncate a table without Spark or Flink.
   
   This also supports fast truncate, which commits an empty snapshot to the table instead of deleting all files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1005283192


   Thanks, @smallx! This looks good and I can certainly see the desire for a quick truncate operation. The main problem is that Iceberg normally writes a delete entry for each data file that is deleted, so that snapshot expiration can clean them up. The snapshot expiration action would still work without those entries because it compares all of the reachable files in the metadata tree, but `table.expireSnapshots().commit()` would leak files if you used this. That's probably why the "fast" truncate is optional, right?
   
   As for the implementation, what about making this part of the `DeleteFiles` API? You could add a `truncate` method there and detect when deleting with a `true` filter. Then this wouldn't need to add to the `Table` interface, which is already quite large.
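
To make the leak concern concrete, here is a minimal toy model in Java. It is only a sketch of one reading of the comment above: the class `ExpireModel`, its `Snapshot` type, and both method names are hypothetical and are not Iceberg classes. One strategy trusts the delete entries written by retained snapshots; the other compares the files reachable from expired and retained snapshots.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: these types are hypothetical and are not Iceberg classes.
final class ExpireModel {

    // A snapshot with the data files it references and the delete entries it wrote.
    static final class Snapshot {
        final Set<String> liveFiles;
        final Set<String> deleteEntries;

        Snapshot(Set<String> liveFiles, Set<String> deleteEntries) {
            this.liveFiles = liveFiles;
            this.deleteEntries = deleteEntries;
        }
    }

    // Small helper to build a mutable set of file names.
    static Set<String> files(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Strategy 1: only files named in delete entries of retained snapshots
    // are candidates for removal.
    static Set<String> removableByDeleteEntries(List<Snapshot> retained) {
        Set<String> removable = new HashSet<>();
        for (Snapshot s : retained) {
            removable.addAll(s.deleteEntries);
        }
        for (Snapshot s : retained) {
            removable.removeAll(s.liveFiles);
        }
        return removable;
    }

    // Strategy 2: anything reachable from expired snapshots but not from
    // retained snapshots can be removed.
    static Set<String> removableByReachability(List<Snapshot> expired, List<Snapshot> retained) {
        Set<String> removable = new HashSet<>();
        for (Snapshot s : expired) {
            removable.addAll(s.liveFiles);
        }
        for (Snapshot s : retained) {
            removable.removeAll(s.liveFiles);
        }
        return removable;
    }

    public static void main(String[] args) {
        Snapshot b = new Snapshot(files("f1", "f2"), files());
        Snapshot fastTruncate = new Snapshot(files(), files());
        Snapshot regularTruncate = new Snapshot(files(), files("f1", "f2"));

        // Fast truncate wrote no delete entries, so strategy 1 finds nothing: f1 and f2 leak.
        System.out.println(removableByDeleteEntries(Arrays.asList(fastTruncate)));
        // A regular truncate recorded both files, so strategy 1 finds them.
        System.out.println(removableByDeleteEntries(Arrays.asList(regularTruncate)));
        // Strategy 2 finds both files even after a fast truncate.
        System.out.println(removableByReachability(Arrays.asList(b), Arrays.asList(fastTruncate)));
    }
}
```

In this model the empty snapshot gives the first strategy nothing to act on, which is the kind of leak described above, while the reachability comparison still finds both files.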




[GitHub] [iceberg] rdblue commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1015602165


   > I'm trying to fix possible file leaks in table.expireSnapshots().commit().
   
   Any update? Maybe this could add a snapshot summary property that indicates that it was a truncate so that expire snapshots can use that information?




[GitHub] [iceberg] smallx commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
smallx commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1007617994


   I'm trying to fix possible file leaks in `table.expireSnapshots().commit()`.




[GitHub] [iceberg] smallx commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
smallx commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1023976437


   Good idea. I added `truncate-snapshot-id` to `SnapshotSummary` to record the snapshot that preceded the truncate.
   
   **Case 1**: We will delete ALL files in snapshots B and D.
   ```
   A -- B -truncate- C (empty snapshot, truncate-snapshot-id=B) -- D -truncate- E (empty snapshot, truncate-snapshot-id=D)
   ```
   
   **Case 2**: We will ignore truncate snapshot C.
   ```
   A -- B -- D -- E (current snapshot)
        |
        `truncate- C (empty snapshot, truncate-snapshot-id=B)
   ```
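
One way to read the two cases is that a truncate snapshot only counts when it is an ancestor of the current snapshot. The following Java sketch models that reading; it is an assumption about the intent, not the code in this PR. The class `TruncateLineage`, the `Snap` type, and `honoredTruncates` are hypothetical names, and only the `truncate-snapshot-id` property comes from the comment above.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical types, not Iceberg classes.
final class TruncateLineage {

    static final class Snap {
        final String id;
        final String parentId;            // null for the first snapshot
        final String truncateSnapshotId;  // models truncate-snapshot-id, null if not a truncate

        Snap(String id, String parentId, String truncateSnapshotId) {
            this.id = id;
            this.parentId = parentId;
            this.truncateSnapshotId = truncateSnapshotId;
        }
    }

    // Index snapshots by id.
    static Map<String, Snap> history(Snap... snaps) {
        Map<String, Snap> byId = new HashMap<>();
        for (Snap s : snaps) {
            byId.put(s.id, s);
        }
        return byId;
    }

    // Walk from the current snapshot back through its parents and collect the
    // snapshots that were truncated along that lineage. Truncate snapshots on
    // other branches are never visited, so they are ignored.
    static Set<String> honoredTruncates(Map<String, Snap> snapshots, String currentId) {
        Set<String> truncated = new LinkedHashSet<>();
        String cursor = currentId;
        while (cursor != null) {
            Snap s = snapshots.get(cursor);
            if (s == null) {
                break;
            }
            if (s.truncateSnapshotId != null) {
                truncated.add(s.truncateSnapshotId);
            }
            cursor = s.parentId;
        }
        return truncated;
    }

    // Case 1: A -- B -truncate- C -- D -truncate- E
    static Map<String, Snap> case1() {
        return history(
            new Snap("A", null, null),
            new Snap("B", "A", null),
            new Snap("C", "B", "B"),
            new Snap("D", "C", null),
            new Snap("E", "D", "D"));
    }

    // Case 2: A -- B -- D -- E, with truncate snapshot C branching off B.
    static Map<String, Snap> case2() {
        return history(
            new Snap("A", null, null),
            new Snap("B", "A", null),
            new Snap("C", "B", "B"),
            new Snap("D", "B", null),
            new Snap("E", "D", null));
    }

    public static void main(String[] args) {
        System.out.println(honoredTruncates(case1(), "E")); // contains B and D
        System.out.println(honoredTruncates(case2(), "E")); // empty: C is not in E's lineage
    }
}
```

With E as the current snapshot, Case 1 yields B and D, and Case 2 yields nothing because C is not reachable from E.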




[GitHub] [iceberg] smallx commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
smallx commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1031728382


   Ready to review, cc @rdblue.




[GitHub] [iceberg] szehon-ho commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1069766220


   I'm a bit conflicted. A fast truncate makes sense, but I think the presence of DELETED entries in manifests is also used elsewhere to check whether data has been deleted, for example:
   
    1. Checking serializable isolation of concurrent operations (they must fail if data they use is deleted)
    2. The CDC design (to mark a row as deleted)
   
   If we truncate this way from Spark/Flink, then any system relying on those entries won't work. Is that a concern? Or is it more like a drop-table operation, where we no longer care about the table? cc @aokolnychyi
   
   The other thought is that we can achieve the same result with `DeleteFiles.deleteFromRowFilter(Expressions.alwaysTrue())`. It is a bit slower because it has to read each manifest file, but still faster than having to read data files. Not sure what others think.




[GitHub] [iceberg] smallx commented on pull request #3844: Core: Add truncate table API and support fast truncate table

Posted by GitBox <gi...@apache.org>.
smallx commented on pull request #3844:
URL: https://github.com/apache/iceberg/pull/3844#issuecomment-1005739482


   Thanks for reviewing, @rdblue! I updated the code. I use `fastMode` (false by default) to stay consistent with the behavior of Spark SQL `TRUNCATE TABLE ...`. I hadn't noticed the possible file-leak problem with `table.expireSnapshots().commit()` before.

