Posted to dev@hudi.apache.org by Prashant Wason <pw...@uber.com.INVALID> on 2020/03/18 18:22:41 UTC

Query regarding restoring HUDI tables to older commits

Hi Team,

I noticed that when a table is restored to a previous commit (
HoodieWriteClient::restoreToInstant
<https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L735>),
only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
their corresponding files are deleted from the timeline. If there are some
CLEAN instants, they are left over.

Is there a reason why CLEAN instants are not removed? Won't they be referring to
files which are no longer present, and hence not be useful?

Thanks
Prashant
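
A minimal sketch of the behaviour described in the question above, using a toy in-memory timeline rather than Hudi's actual classes. The `Instant` type, the `restore_to_instant` function, and the sample instant times are assumptions made for illustration; only the action names come from the message.

```python
from dataclasses import dataclass

# Actions that the restore described above rolls back.
ROLLED_BACK_ACTIONS = {"COMMIT", "DELTA_COMMIT", "COMPACTION"}


@dataclass(frozen=True)
class Instant:
    timestamp: str  # sortable, zero-padded instant time, e.g. "005"
    action: str     # "COMMIT", "DELTA_COMMIT", "COMPACTION" or "CLEAN"


def restore_to_instant(timeline, target):
    """Toy model: drop rolled-back actions newer than `target`, keep the rest."""
    return [
        i for i in timeline
        if not (i.timestamp > target and i.action in ROLLED_BACK_ACTIONS)
    ]


timeline = [
    Instant("001", "COMMIT"),
    Instant("002", "COMMIT"),
    Instant("003", "CLEAN"),
    Instant("004", "COMMIT"),
    Instant("005", "CLEAN"),
]

restored = restore_to_instant(timeline, "002")

# Instants newer than the restore point that are still on the timeline.
leftover = [i for i in restored if i.timestamp > "002"]
print(leftover)
```

In this model the two CLEAN instants (003 and 005) survive a restore to 002, which is the situation the question is asking about.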

Re: Query regarding restoring HUDI tables to older commits

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
 Vinoth,
Yes, I agree. Reverting completed operations when writers are stopped is safe. 
Balaji.V
    On Saturday, March 21, 2020, 08:04:10 PM PDT, Vinoth Chandar <vi...@apache.org> wrote:  
 
 Hi all,

Good discussion. Let me try to tease this apart.

Rollback: Should only be used for rolling back an inflight write, nothing
else IMO. This is where we guarantee that there will be no impact
to readers/query engines.

Restore: It's an invasive maintenance operation that will be disruptive
to queries that are currently running.

To Prashant's point, I think it would be cleaner for restore to leave the
timeline with no actions > the restored instant time. Note that with MOR, we
may have logged data blocks belonging to multiple instants into the same log
file, and we may have to log additional rollback blocks?

@balaji, if we mandate that the ingest job be stopped/bounced during restore
anyway, I think it should be safe, right? We have a clean, log-based design
where the cleaner will just work off what's in the timeline and reach the
same state again (well, not exactly the same, but equivalent, since the input
could be larger/different).

If you all agree, can we maybe talk about the gaps in our implementation
around restores today?

Thanks
Vinoth

Re: Query regarding restoring HUDI tables to older commits

Posted by Vinoth Chandar <vi...@apache.org>.
Hi all,

Good discussion. Let me try to tease this apart.

Rollback: Should only be used for rolling back an inflight write, nothing
else IMO. This is where we guarantee that there will be no impact
to readers/query engines.

Restore: It's an invasive maintenance operation that will be disruptive
to queries that are currently running.

To Prashant's point, I think it would be cleaner for restore to leave the
timeline with no actions > the restored instant time. Note that with MOR, we
may have logged data blocks belonging to multiple instants into the same log
file, and we may have to log additional rollback blocks?

@balaji, if we mandate that the ingest job be stopped/bounced during restore
anyway, I think it should be safe, right? We have a clean, log-based design
where the cleaner will just work off what's in the timeline and reach the
same state again (well, not exactly the same, but equivalent, since the input
could be larger/different).

If you all agree, can we maybe talk about the gaps in our implementation
around restores today?

Thanks
Vinoth
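
The proposal above (no actions later than the restored instant remain on the timeline) can be sketched with a toy model. This is an illustration only, not Hudi's implementation: the tuple-based timeline, `strict_restore`, `has_future_actions`, and the instant times are hypothetical, and the sketch ignores the MOR log-block question raised above.

```python
def strict_restore(timeline, target):
    """Keep only instants at or before `target`, whatever their action."""
    return [(ts, action) for ts, action in timeline if ts <= target]


def has_future_actions(timeline, target):
    """True if any instant is newer than the restored instant time."""
    return any(ts > target for ts, _ in timeline)


# (instant time, action) pairs; instant times are zero-padded strings.
timeline = [
    ("003", "COMMIT"),
    ("004", "CLEAN"),
    ("005", "COMMIT"),
    ("006", "CLEAN"),
    ("007", "DELTA_COMMIT"),
]

restored = strict_restore(timeline, "005")
print(restored)
print(has_future_actions(restored, "005"))
```

Under this reading, a CLEAN at 004 stays because it happened before the restore point, while the CLEAN at 006 is dropped along with everything else after 005.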

On Wed, Mar 18, 2020 at 12:21 PM Balajee Nagasubramaniam
<ba...@uber.com.invalid> wrote:

> Hi Prashant,
>
> Regarding clean vs rollback/restoreToInstant, if you think of all the
> commits/datafiles in the active timeline as a queue of items,
> rollback/restoreToInstant would be working on the head of the queue whereas
> clean would be working on the tail of the queue. They should be treated as
> two independent operations on the queue. At datafile/file-slice level, if
> cleaner is configured to maintain 3 versions of the file, then you can
> rollback at most 2 recent versions. Hope this helps.
>
> Thanks,
> Balajee

Re: Query regarding restoring HUDI tables to older commits

Posted by Balajee Nagasubramaniam <ba...@uber.com.INVALID>.
Hi Prashant,

Regarding clean vs rollback/restoreToInstant: if you think of all the
commits/datafiles in the active timeline as a queue of items,
rollback/restoreToInstant works on the head of the queue, whereas
clean works on the tail of the queue. They should be treated as
two independent operations on the queue. At the datafile/file-slice level, if
the cleaner is configured to maintain 3 versions of a file, then you can
roll back at most the 2 most recent versions. Hope this helps.

Thanks,
Balajee
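
The queue picture can be made concrete with a small sketch. It models the versions of a single file as a double-ended queue: the cleaner removes from the oldest end and rollback removes from the newest end. The function names are illustrative assumptions, and the retained-version count of 3 is taken from the example above rather than from any Hudi configuration.

```python
from collections import deque


def clean(versions, retain):
    """Drop the oldest versions until only `retain` are left."""
    while len(versions) > retain:
        versions.popleft()


def max_rollbacks(versions):
    """Versions that can be rolled back while still leaving one to read."""
    return max(len(versions) - 1, 0)


# Versions of one file, oldest on the left, newest on the right.
versions = deque(["v1", "v2", "v3", "v4", "v5"])

clean(versions, retain=3)        # cleaner works on the oldest end
print(list(versions))            # ['v3', 'v4', 'v5']
print(max_rollbacks(versions))   # 2

versions.pop()                   # rollback works on the newest end
print(list(versions))            # ['v3', 'v4']
```

The two operations never touch the same end of the queue, which is the sense in which they are independent.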

On Wed, Mar 18, 2020 at 11:54 AM Prashant Wason <pw...@uber.com.invalid>
wrote:

> Thanks for the info Vinoth / Balaji.
>
> To me it feels a split between easier-to-understand design and
> current-implementation. I feel it is simpler to reason (based on how file
> systems work in general) that restoreToInstant is a complete point-in-time
> shift to the past (like restoring a file system from a snapshot/backup).
>
> If I have restored the Table to commitTime=005, then having any instants
> with commitTime > 005 are confusing as it implies that even though my table
> is at an older time, some future operations will be applied onto it at some
> point.
>
> I will have to read more about incremental timeline syncing and timeline
> server to understand how it uses the clean instants. BTW, the comment on
> the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
> requires all writers (ingest and compact) to a table to be stopped before
> proceeding". So probably the embedded timeline server can recreate the view
> next time it comes back up?
>
> Thanks
> Prashant

Re: Query regarding restoring HUDI tables to older commits

Posted by "vbalaji@apache.org" <vb...@apache.org>.
Prashant,
My concern was that we should not lose the metadata about clean operations.

But there is a way: as long as we faithfully copy the clean metadata that tracks the files which got cleaned and store it in the restore metadata, we should be able to keep the metadata in sync.
Balaji.V
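
One way to read this suggestion, as a rough sketch: when restore removes a CLEAN instant, it copies that instant's list of cleaned files into the restore's own metadata, so the information is not lost. The dictionary layout, the field names (`cleaned_files`, `removed_instants`, `restored_to`), and the file names are invented for illustration and do not reflect Hudi's actual metadata schema.

```python
def restore_with_clean_metadata(timeline, target):
    """Remove every instant after `target`; keep cleaned-file lists in the restore record."""
    kept = [i for i in timeline if i["ts"] <= target]
    removed = [i for i in timeline if i["ts"] > target]

    restore_metadata = {
        "restored_to": target,
        "removed_instants": [i["ts"] for i in removed],
        # Copy the clean metadata as-is, keyed by the clean instant time.
        "cleaned_files": {
            i["ts"]: list(i["cleaned_files"])
            for i in removed
            if i["action"] == "CLEAN"
        },
    }
    return kept, restore_metadata


timeline = [
    {"ts": "001", "action": "COMMIT"},
    {"ts": "002", "action": "COMMIT"},
    {"ts": "003", "action": "CLEAN", "cleaned_files": ["p1/f1_001.parquet"]},
    {"ts": "004", "action": "COMMIT"},
]

kept, meta = restore_with_clean_metadata(timeline, "002")
print([i["ts"] for i in kept])
print(meta["cleaned_files"])
```

A reader of the restore record (for example, something syncing a cached file-system view) could then still learn which files the removed clean had deleted.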



    On Wednesday, March 18, 2020, 11:54:11 AM PDT, Prashant Wason <pw...@uber.com.invalid> wrote:  
 
 Thanks for the info Vinoth / Balaji.

To me, this feels like a split between an easier-to-understand design and
the current implementation. I feel it is simpler to reason (based on how file
systems work in general) about restoreToInstant as a complete point-in-time
shift to the past (like restoring a file system from a snapshot/backup).

If I have restored the table to commitTime=005, then having any instants
with commitTime > 005 is confusing, as it implies that even though my table
is at an older time, some future operations will be applied to it at some
point.

I will have to read more about incremental timeline syncing and timeline
server to understand how it uses the clean instants. BTW, the comment on
the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
requires all writers (ingest and compact) to a table to be stopped before
proceeding". So probably the embedded timeline server can recreate the view
next time it comes back up?

Thanks
Prashant

Re: Query regarding restoring HUDI tables to older commits

Posted by Prashant Wason <pw...@uber.com.INVALID>.
Thanks for the info Vinoth / Balaji.

To me, this feels like a split between an easier-to-understand design and
the current implementation. I feel it is simpler to reason (based on how file
systems work in general) about restoreToInstant as a complete point-in-time
shift to the past (like restoring a file system from a snapshot/backup).

If I have restored the table to commitTime=005, then having any instants
with commitTime > 005 is confusing, as it implies that even though my table
is at an older time, some future operations will be applied to it at some
point.

I will have to read more about incremental timeline syncing and timeline
server to understand how it uses the clean instants. BTW, the comment on
the function HoodieWriteClient::restoreToInstant reads "NOTE : This action
requires all writers (ingest and compact) to a table to be stopped before
proceeding". So probably the embedded timeline server can recreate the view
next time it comes back up?

Thanks
Prashant


On Wed, Mar 18, 2020 at 11:37 AM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>  Prashanth,
> I think we should not be reverting clean operations here. Cleans are done
> on the oldest file slices and a restore/rollback is not completely undoing
> the work of clean that happened before it.
> For incremental timeline syncing, embedded timeline server needs to read
> these clean metadata to sync its cached file-system view.
> Let me know your thoughts.
> Balaji.V

Re: Query regarding restoring HUDI tables to older commits

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Prashant,
I think we should not be reverting clean operations here. Cleans are done on the oldest file slices, and a restore/rollback does not completely undo the work of a clean that happened before it.
For incremental timeline syncing, the embedded timeline server needs to read this clean metadata to sync its cached file-system view.
Let me know your thoughts.
Balaji.V
    On Wednesday, March 18, 2020, 11:23:09 AM PDT, Prashant Wason <pw...@uber.com.invalid> wrote:  
 
 HI Team,

I noticed that when a table is restored to a previous commit (
HoodieWriteClient::restoreToInstant
<https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L735>),
only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
their corresponding files are deleted from the timeline. If there are some
CLEAN instants, they are left over.

Is there a reason why CLEAN are not removed? Won't they be referring to
files  which are no longer present and hence not useful?

Thanks
Prashant
  

Re: Query regarding restoring HUDI tables to older commits

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Prashant,

Not sure if there is a specific reason. Mostly, it's because, until recently,
the clean metadata was not actually used.
Currently, incremental cleaning will use it, but even then, it only relies
on the partition paths being touched there. So it should be fine.

+100 though on consistently cleaning all of this up. Some of these
inconsistencies actually exist to ensure that the old timelines for old users
(e.g. Uber) continue to work.
So I would like to actually have a conversation on streamlining all this,
so that the system implementation is as simple and as close to the design as
possible.
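
To illustrate why stale file references in leftover clean metadata might be harmless for incremental cleaning, here is a sketch in which only the partition paths are read from the clean metadata. The record layout, the `files_by_partition` field, and the `partitions_touched` helper are hypothetical and are not Hudi's API.

```python
def partitions_touched(clean_metadata):
    """Collect partition paths from clean metadata, ignoring the file names."""
    partitions = set()
    for record in clean_metadata:
        partitions.update(record["files_by_partition"].keys())
    return partitions


clean_metadata = [
    {"ts": "003", "files_by_partition": {"2020/03/17": ["f1_001.parquet"]}},
    {"ts": "006", "files_by_partition": {"2020/03/18": ["f2_004.parquet"]}},
]

# Even if f2_004.parquet no longer exists after a restore, only the partition
# path is consulted in this model, so the stale file name is never looked up.
print(sorted(partitions_touched(clean_metadata)))
```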

On Wed, Mar 18, 2020 at 11:23 AM Prashant Wason <pw...@uber.com.invalid>
wrote:

> HI Team,
>
> I noticed that when a table is restored to a previous commit (
> HoodieWriteClient::restoreToInstant
> <
> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L735
> >),
> only the COMMIT, DELTA_COMMIT and COMPACTION instants are rolled back and
> their corresponding files are deleted from the timeline. If there are some
> CLEAN instants, they are left over.
>
> Is there a reason why CLEAN are not removed? Won't they be referring to
> files  which are no longer present and hence not useful?
>
> Thanks
> Prashant
>