Posted to dev@iceberg.apache.org by Suraj Chandran <ch...@gmail.com> on 2021/06/13 16:47:26 UTC

Keeping infinite snapshots

Hi there,

(Had asked on Slack, trying here as well)

The documentation says that "regularly expiring snapshots is recommended
to delete data files that are no longer needed, and to keep the size of
table metadata small".
I had a few questions around that:
1) Are there people/use cases keeping snapshots for a long period of
time, like decades? This would help people manage/find "back-dated
corrections" in data.
2) Are snapshots even meant for keeping history for such long periods of
time?
3) Would regular rewriteDataFiles help in such cases (and by how much)?
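The expiration the docs recommend can be driven from Iceberg's Java API.
A minimal sketch, assuming `table` has already been loaded from a catalog
(the 90-day / 100-snapshot thresholds are arbitrary placeholders):

```java
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;

class ExpireExample {
    static void expireOldSnapshots(Table table) {
        // expire snapshots older than 90 days, but always retain
        // at least the last 100 regardless of age
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(90);
        table.expireSnapshots()
            .expireOlderThan(cutoff)
            .retainLast(100)
            .commit();
    }
}
```

Spark users can get the same effect with the `expire_snapshots` procedure
in recent releases (`CALL catalog.system.expire_snapshots(...)`).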

Thanks,
Suraj

Re: Keeping infinite snapshots

Posted by Ryan Blue <bl...@apache.org>.
Keeping snapshots will add some metadata, but it isn't a ton, and you can
probably drop some summary metadata to make it smaller (the Spark app ID,
for example).

Since compaction creates new snapshots, it wouldn't really help. What would
help is keeping track of "versions" as branches; then you can compact the
branches. But that's not really keeping track of all snapshots forever,
that's choosing which ones you want to keep.
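The "versions" idea described here later landed in Iceberg as branches and
tags. A minimal sketch using the `ManageSnapshots` API from later Iceberg
releases, assuming `table` was loaded from a catalog; the tag name is a
placeholder:

```java
import org.apache.iceberg.Table;

class TagExample {
    static void tagCurrentSnapshot(Table table) {
        // tags are named references to snapshots; a tagged snapshot
        // can be retained even as untagged history is expired
        long snapshotId = table.currentSnapshot().snapshotId();
        table.manageSnapshots()
            .createTag("model-training-2021-06", snapshotId)
            .commit();
    }
}
```

Branches work similarly (`createBranch`) and, unlike tags, can receive new
commits and be compacted independently.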

-- 
Ryan Blue

Re: Keeping infinite snapshots

Posted by Suraj Chandran <ch...@gmail.com>.
Thanks.
So our use case is to keep all the snapshots since the beginning of time.
How is that going to impact performance, since there will be quite a lot
of metadata files?
Also, would it reduce opportunities for data compaction?
One idea I had around this was to create a solution in Iceberg that can
isolate multiple snapshots completely, "so they don't share metadata
among them". It would increase data size, but the new snapshot could be
completely independent and hence could be compacted independently of
older snapshots' metadata, improving performance. Does that make any
sense at all?

Re: Keeping infinite snapshots

Posted by Ryan Blue <bl...@apache.org>.
Hi Suraj,

I just answered on Slack, but I'll copy the replies here for everyone
who's subscribed to the dev list:

1) Yes, there are use cases around this. To assist, we're planning on
adding named snapshots so you don't have to keep the complete history;
instead, you can keep a selection of snapshots.
2) It is fine to keep snapshots for a long period of time. Part of the
purpose is to allow you to time travel, and we've known about the use
case of keeping a labelled version around (e.g. what you trained a model
with) for a long time.
3) RewriteDataFiles will rewrite the files from one snapshot and produce
another. If you're keeping old snapshots around, it wouldn't change them,
although you could probably go rewrite those snapshots if you wanted to.
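The time travel mentioned in point 2 can be done from Spark by reading a
specific snapshot. A minimal sketch, assuming `spark` is an active
SparkSession with an Iceberg catalog configured; the snapshot id and
table name are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class TimeTravelExample {
    static Dataset<Row> readAsOfSnapshot(SparkSession spark) {
        // read the table exactly as it was at the given snapshot;
        // "as-of-timestamp" works the same way for a point in time
        return spark.read()
            .format("iceberg")
            .option("snapshot-id", 1234567890L)
            .load("db.table");
    }
}
```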

I hope that helps!

Ryan

-- 
Ryan Blue