Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/15 03:44:10 UTC

[GitHub] [iceberg] qphien opened a new issue #2093: How to get table/partition create time/update time from Iceberg

qphien opened a new issue #2093:
URL: https://github.com/apache/iceberg/issues/2093


   When building a data warehouse on Iceberg, it is necessary to get the create time and update time of tables and partitions.
   Based on the current Iceberg implementation, we can get the table create time from the `timestamp-ms` of the earliest snapshot in `version.metadata.json`, and the update time from the `timestamp-ms` of the latest snapshot. However, there is no way to get the partition create time or update time.
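
A minimal sketch of the table-level lookup described above, assuming the metadata file has already been loaded as JSON and contains a `snapshots` list whose entries carry `timestamp-ms`. The helper name `table_times` and the sample values are hypothetical. The earliest retained snapshot is only a proxy for the creation time: once old snapshots are expired, that value moves forward.

```python
import json
from datetime import datetime, timezone


def table_times(metadata: dict):
    """Return (approx_create_time, update_time) from a table metadata dict.

    Both values come from snapshot `timestamp-ms` fields, so the "create"
    time is only as old as the oldest snapshot still retained.
    Returns (None, None) when the table has no snapshots.
    """
    stamps = [s["timestamp-ms"] for s in metadata.get("snapshots", [])]
    if not stamps:
        return None, None

    def to_dt(ms):
        # Snapshot timestamps are milliseconds since the Unix epoch.
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    return to_dt(min(stamps)), to_dt(max(stamps))


# Hypothetical, heavily trimmed metadata content for illustration only.
sample = json.loads("""
{
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1610000000000},
    {"snapshot-id": 2, "timestamp-ms": 1610680000000}
  ]
}
""")

created, updated = table_times(sample)
print(created.isoformat(), updated.isoformat())
```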


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762056780


   You might be able to do this via the various metadata tables, though it would be somewhat complex.




[GitHub] [iceberg] qphien commented on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
qphien commented on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-767300672


   Sorry for my late reply.
   
   In my use case, the partition creation time is used to remove expired partitions. Several jobs may write to the same partition at different times. With the partition update time, I can track which jobs updated a given table partition.
   
   I searched related issues on GitHub and the dev mailing list:
   https://github.com/apache/iceberg/issues/1597: remove expired snapshots using the snapshot `timestamp-ms`
   https://github.com/apache/iceberg/issues/1599: remove expired files using the filesystem file modification time
   
   However, it seems there is currently no way to remove expired partitions.
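
To make the use case concrete, here is a rough sketch of how expired partitions could be selected if a partition creation time were available. The `created_at` mapping, the partition key format, and the 14-day retention are all hypothetical; as noted above, Iceberg does not expose this timestamp directly.

```python
from datetime import datetime, timedelta, timezone


def expired_partitions(created_at: dict, retention: timedelta, now: datetime):
    """Return the partition keys whose creation time is older than `retention`."""
    cutoff = now - retention
    return sorted(p for p, ts in created_at.items() if ts < cutoff)


# Hypothetical partition creation times, for illustration only.
created_at = {
    "dt=2021-01-01": datetime(2021, 1, 1, tzinfo=timezone.utc),
    "dt=2021-01-10": datetime(2021, 1, 10, tzinfo=timezone.utc),
    "dt=2021-01-20": datetime(2021, 1, 20, tzinfo=timezone.utc),
}
now = datetime(2021, 1, 25, tzinfo=timezone.utc)

to_drop = expired_partitions(created_at, timedelta(days=14), now)
print(to_drop)
```

The sketch only selects candidates; actually deleting the data in those partitions would be a separate step and is not shown.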




[GitHub] [iceberg] qphien commented on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
qphien commented on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762191244


   > You might be able to do this via the various metadata tables, though it would be somewhat complex. https://iceberg.apache.org/spark/#inspecting-tables
   > 
   > It looks like you could achieve this by joining the table's `manifest` metadata table, which has a `partitions` column indicating what partition columns have been affected, with the table's `snapshots` and `history` metadata tables.
   > 
   > There are some examples of joining the two, but essentially you'd want to explode the table's snapshot metadata table on the `manifest_list` column so that you get one row in the expanded snapshots table for each updated / created manifest. That manifest path can be joined with the `path` column in the `manifest` metadata table to then get all of the partitions that are involved in that snapshot. You can find when exactly that snapshot was made current by joining on the `made_current_at` field from the metadata `history` table.
   
   Thanks @kbendick for your reply. Yes, we can join `manifest` with `snapshot` and `history` to get the partition create/update time, but this join query is inefficient when there is a large number of snapshots, because we have to scan all snapshots and manifests.
   
   Could we add an additional `create-time` field to `manifest.data_file`? In that case, only the latest snapshot and its related manifests would need to be scanned.
   




[GitHub] [iceberg] kbendick edited a comment on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762056780


   You might be able to do this via the various metadata tables, though it would be somewhat complex. https://iceberg.apache.org/spark/#inspecting-tables
   
   It looks like you could achieve this by joining the table's `manifest` metadata table, which has a `partitions` column indicating what partition columns have been affected, with the table's `snapshots` and `history` metadata tables.
   
   There are some examples of joining the two, but essentially you'd want to explode the table's snapshot metadata table on the `manifest_list` column so that you get one row in the expanded snapshots table for each updated / created manifest. That manifest path can be joined with the `path` column in the `manifest` metadata table to then get all of the partitions that are involved in that snapshot. You can find when exactly that snapshot was made current by joining on the `made_current_at` field from the metadata `history` table.
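
The join described above can be sketched in plain Python over rows that have already been read from the metadata tables. The row shapes are simplified assumptions for illustration: one `(snapshot_id, partition)` pair per manifest row, and one `made_current_at` value per snapshot from `history`. The real metadata tables have more columns and may name them differently, and the function name `partition_times` is hypothetical.

```python
from collections import defaultdict

# Hypothetical `history` rows: snapshot_id -> made_current_at (epoch millis).
history = {101: 1610000000000, 102: 1610300000000, 103: 1610680000000}

# Hypothetical rows derived from the `manifest` table: which partitions
# were touched by the manifests each snapshot added.
manifest_rows = [
    (101, "dt=2021-01-07"),
    (102, "dt=2021-01-07"),
    (102, "dt=2021-01-10"),
    (103, "dt=2021-01-10"),
]


def partition_times(manifest_rows, history):
    """Join manifest rows to history on snapshot id.

    Returns {partition: (first_seen_ms, last_seen_ms)}, i.e. an approximate
    create time and update time per partition.
    """
    seen = defaultdict(list)
    for snapshot_id, partition in manifest_rows:
        if snapshot_id in history:  # skip snapshots missing from history
            seen[partition].append(history[snapshot_id])
    return {p: (min(ts), max(ts)) for p, ts in seen.items()}


times = partition_times(manifest_rows, history)
for partition, (created_ms, updated_ms) in sorted(times.items()):
    print(partition, created_ms, updated_ms)
```

Every manifest row of every retained snapshot is visited, so the cost of this approach grows with the length of the table's history.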




[GitHub] [iceberg] qphien closed issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
qphien closed issue #2093:
URL: https://github.com/apache/iceberg/issues/2093


   




[GitHub] [iceberg] kbendick commented on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762060899


   Alternatively, it's possible that the Nessie integration is what you're looking for. Nessie provides a git-like experience for managing your data lake and is integrated with Iceberg: https://projectnessie.org/tables/iceberg/




[GitHub] [iceberg] kbendick edited a comment on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
kbendick edited a comment on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762056780


   You might be able to do this via the various metadata tables, though it would be somewhat complex. https://iceberg.apache.org/spark/#inspecting-tables
   
   It looks like you could achieve this by joining the table's `manifest` metadata table, which has a `partitions` column indicating what partition columns have been affected, with the table's `snapshots` and `history` metadata tables.
   
   There are some examples of joining the two, but essentially you'd want to explode the snapshot's `manifest_list` so that you get one row in the expanded snapshots table for each updated / created manifest. That manifest path can be joined with the `path` column in the `manifest` metadata table to then get all of the partitions that are involved in that snapshot.




[GitHub] [iceberg] kbendick commented on issue #2093: How to get table/partition create time/update time from Iceberg

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2093:
URL: https://github.com/apache/iceberg/issues/2093#issuecomment-762566233


   I agree that such a query with several joins is likely not practical to run regularly.
   
   Can you give more detail on how you intend to use this partition creation / update time info? Is it possibly a trigger for a batch job? For example, a common pattern is to write continuously to a table partitioned by hour from a streaming job (say, using Apache Flink as the query engine), and then trigger a batch job to process each new hour's partition once it is committed.
   
   I’m not opposed to adding such metadata, though there would be better people than me to ask and it’s likely something that should be brought up on the dev mailing list.
   
   However, if you detail your use case (as well as possibly the intended writing situation / query engine), it’s very possible that other users are already handling this use case with some existing pattern that is already supported. Or it’s possible that your use case is new / not currently handled and then adding this metadata might be needed.

