Posted to dev@iceberg.apache.org by Pucheng Yang <py...@pinterest.com.INVALID> on 2022/09/29 21:25:20 UTC

Reverting a commit in the table history?

Hi all,

Has there been any discussion about reverting a single commit in the table
history?

My clients have this use case: they write some data into a partition and
later want to revert that write. But because newer snapshots have been
generated since then, they cannot use snapshot rollback without also
discarding those later changes.

Any comments are welcome! Thanks!

Best,
Pucheng

Re: Reverting a commit in the table history?

Posted by Pucheng Yang <py...@pinterest.com.INVALID>.
Thanks Ryan, this is helpful! I will keep what you said in mind when I
explore it.

On Fri, Sep 30, 2022 at 10:42 AM Ryan Blue <bl...@tabular.io> wrote:

> It depends on what you want the semantics of the revert to be. Here’s an
> example overwrite:
>
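> import org.apache.spark.sql.functions.expr
> // Overwrite condition is a Spark SQL expression; current_date() is the built-in for "today"
> df.writeTo("db.table")
>     .overwrite(expr("ts >= current_date() and ts <= date_add(current_date(), 1)"))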
>
> The overwrite expression removes any files written today and replaces them
> with the contents of the DataFrame. Let’s say that replaces
> today/file_A.parquet and today/file_B.parquet with today/file_C.parquet.
> When there are no further changes, it’s easy to revert by replacing C with
> A and B. That means at a minimum that C still needs to exist in the table
> to revert.
>
> But what happens when a new delete is later applied to C? Reverting would
> also drop the position delete written against C, and if the deleted row was
> also in A or B, restoring those files would bring the deleted row back.
>
> For this, we probably also need to know the original filter so that we can
> check for certain conflicts. Right now, that’s not stored anywhere. But we
> could start adding it to Snapshot metadata.
>
> Ryan

Re: Reverting a commit in the table history?

Posted by Ryan Blue <bl...@tabular.io>.
It depends on what you want the semantics of the revert to be. Here’s an
example overwrite:

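import org.apache.spark.sql.functions.expr
// Overwrite condition is a Spark SQL expression; current_date() is the built-in for "today"
df.writeTo("db.table")
    .overwrite(expr("ts >= current_date() and ts <= date_add(current_date(), 1)"))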

The overwrite expression removes any files written today and replaces them
with the contents of the DataFrame. Let’s say that replaces
today/file_A.parquet and today/file_B.parquet with today/file_C.parquet.
When there are no further changes, it’s easy to revert by replacing C with
A and B. That means at a minimum that C still needs to exist in the table
to revert.
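
A minimal sketch of this simple case, written as plain Python over sets of
file paths rather than against the Iceberg API. The revert_overwrite helper
and the yesterday/file_X.parquet path are hypothetical and exist only for
illustration; the check at the top mirrors the requirement that C must still
be in the table.

```python
def revert_overwrite(live, added, removed):
    """Undo an overwrite that replaced `removed` with `added`.

    Hypothetical helper: `live` is the set of data files in the current
    table state. The revert is refused unless every file the overwrite
    added is still live.
    """
    if not added <= live:
        missing = sorted(added - live)
        raise ValueError(f"cannot revert, files no longer in table: {missing}")
    # Drop what the overwrite added and restore what it removed.
    return (live - added) | removed


live = {"today/file_C.parquet", "yesterday/file_X.parquet"}
reverted = revert_overwrite(
    live,
    added={"today/file_C.parquet"},
    removed={"today/file_A.parquet", "today/file_B.parquet"},
)
print(sorted(reverted))
# ['today/file_A.parquet', 'today/file_B.parquet', 'yesterday/file_X.parquet']
```

If C had already been rewritten by a later commit, the same call would raise
instead of guessing how to map the newer file back to A and B.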

But what happens when a new delete is later applied to C? Reverting would
also drop the position delete written against C, and if the deleted row was
also in A or B, restoring those files would bring the deleted row back.

For this, we probably also need to know the original filter so that we can
check for certain conflicts. Right now, that’s not stored anywhere. But we
could start adding it to Snapshot metadata.
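
One way such a conflict check might look, as a toy Python model and not
Iceberg code. Everything here is an assumption made for illustration: the
Commit class, the "overwrite-filter" summary key (nothing like it is stored
today, as noted above), and the simplification of the filter to a half-open
range of day values.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Commit:
    snapshot_id: int
    operation: str
    added_files: dict = field(default_factory=dict)    # path -> day value
    delete_targets: set = field(default_factory=set)   # data files hit by new position deletes
    summary: dict = field(default_factory=dict)


def find_revert_conflicts(target, later_commits):
    """List reasons why reverting `target` might not be semantically safe."""
    lo, hi = target.summary["overwrite-filter"]  # hypothetical key
    conflicts = []
    for c in later_commits:
        # A later position delete against a file the overwrite added.
        hit = c.delete_targets & set(target.added_files)
        if hit:
            conflicts.append(
                (c.snapshot_id, "position delete against " + ", ".join(sorted(hit)))
            )
        # Later data written inside the range the overwrite replaced.
        late = sorted(p for p, d in c.added_files.items() if lo <= d < hi)
        if late:
            conflicts.append(
                (c.snapshot_id, "new data inside the overwritten range: " + ", ".join(late))
            )
    return conflicts


target = Commit(
    10, "overwrite",
    added_files={"today/file_C.parquet": date(2022, 9, 30)},
    summary={"overwrite-filter": (date(2022, 9, 30), date(2022, 10, 1))},
)
later = [
    Commit(11, "delete", delete_targets={"today/file_C.parquet"}),
    Commit(12, "append", added_files={"yesterday/file_Y.parquet": date(2022, 9, 29)}),
]
print(find_revert_conflicts(target, later))
# [(11, 'position delete against today/file_C.parquet')]
```

In this model, a revert would only proceed when the returned list is empty.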

Ryan

On Fri, Sep 30, 2022 at 9:41 AM Pucheng Yang <py...@pinterest.com.invalid>
wrote:

> Thanks Ryan, how about an overwrite commit (insert overwrite)? What should
> I be aware of? Thanks.

-- 
Ryan Blue
Tabular

Re: Reverting a commit in the table history?

Posted by Pucheng Yang <py...@pinterest.com.INVALID>.
Thanks Ryan, how about an overwrite commit (insert overwrite)? What should
I be aware of? Thanks.

On Fri, Sep 30, 2022 at 9:26 AM Ryan Blue <bl...@tabular.io> wrote:

> Pucheng,
>
> I think you'd want to add a new option to the SnapshotManager to revert a
> commit by ID. That would need to get the changes from the commit and
> reverse them. We'd want to start small because reverting the file-level
> changes isn't always the same thing as reverting the semantic changes. But
> for simple cases like an append commit, it would work just fine.
>
> Ryan

Re: Reverting a commit in the table history?

Posted by Ryan Blue <bl...@tabular.io>.
Pucheng,

I think you'd want to add a new option to the SnapshotManager to revert a
commit by ID. That would need to get the changes from the commit and
reverse them. We'd want to start small because reverting the file-level
changes isn't always the same thing as reverting the semantic changes. But
for simple cases like an append commit, it would work just fine.
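
To make the append case concrete, here is a small sketch in plain Python. It
is a toy model of a table history, not the SnapshotManager API: the Snapshot
class, live_files, and revert_append are hypothetical names introduced only
for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str                      # "append", "delete", ...
    added: frozenset = frozenset()      # data files added by this commit
    removed: frozenset = frozenset()    # data files removed by this commit


def live_files(history):
    """Replay the history to get the data files in the current table state."""
    files = set()
    for snap in history:
        files -= snap.removed
        files |= snap.added
    return files


def revert_append(history, snapshot_id):
    """Build a new commit that removes the files added by one append commit."""
    target = next(s for s in history if s.snapshot_id == snapshot_id)
    if target.operation != "append":
        raise ValueError("only append commits are handled in this sketch")
    missing = target.added - live_files(history)
    if missing:
        raise ValueError(f"files already removed by a later commit: {sorted(missing)}")
    new_id = max(s.snapshot_id for s in history) + 1
    return Snapshot(new_id, "delete", removed=target.added)


history = [
    Snapshot(1, "append", added=frozenset({"p=1/a.parquet"})),
    Snapshot(2, "append", added=frozenset({"p=2/b.parquet"})),  # commit to revert
    Snapshot(3, "append", added=frozenset({"p=1/c.parquet"})),  # later, unrelated commit
]
history.append(revert_append(history, 2))
print(sorted(live_files(history)))
# ['p=1/a.parquet', 'p=1/c.parquet']
```

Unlike a rollback to snapshot 1, the later commit (snapshot 3) survives; the
revert is recorded as a new commit on top of the existing history.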

Ryan

On Thu, Sep 29, 2022 at 3:13 PM Pucheng Yang <py...@pinterest.com.invalid>
wrote:

> Thank you, I will take a look.

-- 
Ryan Blue
Tabular

Re: Reverting a commit in the table history?

Posted by Pucheng Yang <py...@pinterest.com.INVALID>.
Thank you, I will take a look.

On Thu, Sep 29, 2022 at 2:40 PM Ye, Jack <yz...@amazon.com.invalid>
wrote:

> Hi,
>
> There is a PR published just today for something similar that you might be
> able to reference: https://github.com/apache/iceberg/pull/5888. It rolls
> back a compaction commit on conflict and then reapplies the changes. The
> logic seems similar to what you want: roll back to that specific snapshot
> and try to reapply the changes you still want.
>
> Best,
>
> Jack Ye

Re: Reverting a commit in the table history?

Posted by "Ye, Jack" <yz...@amazon.com.INVALID>.
Hi,

There is a PR published just today for something similar that you might be able to reference: https://github.com/apache/iceberg/pull/5888. It rolls back a compaction commit on conflict and then reapplies the changes. The logic seems similar to what you want: roll back to that specific snapshot and try to reapply the changes you still want.
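
The roll-back-and-reapply idea can be sketched roughly as follows. This is
not the code from the PR; it is a toy Python model in which a history is a
list of (snapshot_id, added_files) pairs, and rollback_and_replay is a
hypothetical helper name.

```python
def rollback_and_replay(history, unwanted_id):
    """Roll back to the parent of `unwanted_id`, then re-commit later changes."""
    idx = next(i for i, (sid, _) in enumerate(history) if sid == unwanted_id)
    kept = list(history[:idx])                    # state at the parent snapshot
    next_id = max(sid for sid, _ in history) + 1
    for _, added in history[idx + 1:]:            # reapply the commits we still want
        kept.append((next_id, added))
        next_id += 1
    return kept


history = [
    (1, frozenset({"a.parquet"})),
    (2, frozenset({"b.parquet"})),   # the commit to drop
    (3, frozenset({"c.parquet"})),
]
print(rollback_and_replay(history, 2))
# [(1, frozenset({'a.parquet'})), (4, frozenset({'c.parquet'}))]
```

Note that the reapplied change gets a new snapshot ID, so this approach
rewrites the later history rather than adding a single inverse commit.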

Best,
Jack Ye
