Posted to dev@iceberg.apache.org by Mayur Srivastava <Ma...@twosigma.com> on 2022/02/23 16:49:27 UTC

Getting last modified timestamp/other stats per partition

Hi,

In Iceberg, is there a way to get the last modified timestamp and other stats (e.g. num rows, uncompressed size, compressed size) of the data per partition?

Thanks,
Mayur


RE: Getting last modified timestamp/other stats per partition

Posted by Mayur Srivastava <Ma...@twosigma.com>.
Thanks for the responses. We can try filtering out the snapshots that were added by compaction. Tagging should help reduce the number of snapshots that we have to maintain. But how will tagging help with partition-level last modified times without maintaining the history of snapshots (assuming each partition has a different commit time)? Will we be able to store additional stats, e.g. commit times, per data file or partition in the tagged snapshot?
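
For reference, the kind of filtering we have in mind is something like the following (just a rough sketch, assuming the snapshots metadata table exposes an 'operation' column and that compaction commits show up as 'replace' rather than 'append'; $my_table is a placeholder):

SELECT snapshot_id, committed_at, operation FROM $my_table.snapshots WHERE operation = 'append'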

From: Szehon Ho <sz...@gmail.com>
Sent: Monday, March 7, 2022 1:40 PM
To: Iceberg Dev List <de...@iceberg.apache.org>
Subject: Re: Getting last modified timestamp/other stats per partition

2. How can we distinguish between snapshots where new data was added vs. snapshots where compaction was done?

Yea, to answer the second question, I forgot to mention there is a field on the Manifest Entries table called 'status' that you can filter on.  It might not be documented, as it's a bit more advanced/internal, but the values are listed here:
https://github.com/apache/iceberg/blob/master/core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FManifestEntry.java#L30

So you would want to add the filter (e.status = 1), which is ADDED, to the query if you only care about when the file was first added.
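
Putting it together with the earlier snapshots/entries query, something like this should give a per-partition last-added time (just a sketch; $my_table is a placeholder for your table):

SELECT e.data_file.partition, MAX(s.committed_at) FROM $my_table.snapshots s JOIN $my_table.entries e ON s.snapshot_id = e.snapshot_id WHERE e.status = 1 GROUP BY e.data_file.partition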

Thanks
Szehon

On Mon, Mar 7, 2022 at 9:33 AM Ryan Blue <bl...@tabular.io> wrote:
Mayur,

This is one of the reasons why we want to introduce tagging in the format. That will allow you to tag snapshots that you want to keep and expire intermediate versions.

In general, there is some cost to keeping thousands of snapshots. Those are held in the metadata file that gets written on each commit, so you end up writing a fairly large file. If your commits are infrequent it doesn't generally make a difference, but if you have commits every minute or so it can get in the way.

Tagging will reduce the problem, and moving to change-based commits with the REST catalog should also help in the long term.
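
If you decide you don't need the full history, snapshot count can be bounded today with the expire_snapshots Spark procedure. A rough sketch (the catalog and table names here are placeholders):

CALL my_catalog.system.expire_snapshots(table => 'db.my_table', older_than => TIMESTAMP '2022-01-01 00:00:00', retain_last => 100)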

Ryan

On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava <Ma...@twosigma.com> wrote:
A few follow-up questions for getting last modified time for each partition:


1. If we want to use snapshots, does this mean we will have to maintain the full history of snapshots? E.g., if we partition by day and write once a day for a few years, we will end up maintaining thousands of snapshots. How does a long history of snapshots affect metadata size, commit performance, etc.? We intend to experiment with this, but I'm curious whether there's already a recommendation on how much snapshot history to keep.

2. How can we distinguish between snapshots where new data was added vs. snapshots where compaction was done?

Thanks,
Mayur

From: Mayur Srivastava <Ma...@twosigma.com>
Sent: Thursday, February 24, 2022 7:27 AM
To: dev@iceberg.apache.org
Subject: RE: Getting last modified timestamp/other stats per partition

Thanks Szehon. I’ll give this a try.

From: Szehon Ho <sz...@gmail.com>
Sent: Wednesday, February 23, 2022 1:38 PM
To: Iceberg Dev List <de...@iceberg.apache.org>
Subject: Re: Getting last modified timestamp/other stats per partition

Hi

Probably the metadata tables can help with this.

For the size/num_rows of partitions, you can query the files table, https://iceberg.apache.org/docs/latest/spark-queries/#files. (Because Iceberg keeps stats for files, and not necessarily for partitions.)

SELECT f.partition, sum(file_size_in_bytes), sum(record_count) FROM $my_table.files f GROUP BY f.partition

This will be the compressed size (again, Iceberg keeps file-level stats, so I'm not sure there are any stats for uncompressed sizes).

For the last modified time, it will be slightly harder. The file's physical modified time is not good enough because it's not exactly when the file is 'committed' into Iceberg. You may have to try a more advanced query on the snapshots table and the entries table: https://iceberg.apache.org/docs/latest/spark-queries/#snapshots

SELECT e.data_file.partition, MAX(s.committed_at) FROM $my_table.snapshots s JOIN $my_table.entries e ON s.snapshot_id = e.snapshot_id GROUP BY e.data_file.partition

Hope that helps,
Szehon

On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <Ma...@twosigma.com> wrote:
Hi,

In Iceberg, is there a way to get the last modified timestamp and other stats (e.g. num rows, uncompressed size, compressed size) of the data per partition?

Thanks,
Mayur



--
Ryan Blue
Tabular
