Posted to dev@hudi.apache.org by Sivaprakash <si...@gmail.com> on 2020/08/13 07:13:23 UTC

Incremental query on partition column

Hi,


What design can be used when we re-ingest data without affecting
incremental queries?

   - Is it possible to maintain a delta dataset across partitions (
   hoodie.datasource.write.partitionpath.field)? In my case it is a date.
   - Can I run a snapshot query across all partitions and on specific partitions?
   - Or is it possible to control Hudi's commit time?


Thanks

Re: Incremental query on partition column

Posted by Vinoth Chandar <vi...@apache.org>.
> if you could see a way of deleting the user's data for COW as well as I
> understood that COW is the most commonly used

For S3 (or anything that guarantees an atomic overwrite), we could build
an out-of-band overwrite. For something like HDFS, we cannot overwrite
the file, since we risk data loss if we fail midway.

Can any of the AWS folks confirm my understanding about S3 above?


RE: Incremental query on partition column

Posted by David Rosalia <da...@hotmail.com>.
Wim,

Yes, it is more than cumbersome.  I got the same feedback from Siva.   I also thought it was contrary to RTBF but I wanted to check with both of you first.

MOR = merge on read
COW = copy on write

Kind Regards,
David


From: Wim Van Leuven<ma...@kbc.be.INVALID>
Sent: Monday, 24 August 2020 10:42
To: dev@hudi.apache.org<ma...@hudi.apache.org>
Subject: Re: Incremental query on partition column

Hey David,

The 1st solution they propose is actually not GDPR compliant because the data is still there on disk ... so it is not just cumbersome.

BTW, what's COW and MOR?
-w
________________________________
From: David Rosalia <da...@hotmail.com>
Sent: Saturday, 22 August 2020 10:09
To: dev@hudi.apache.org <de...@hudi.apache.org>
Subject: Re: Incremental query on partition column

Good morning Balaji, Vinoth,

Thank you both for your replies.  I agree that this is a topic that should come up more often and I am surprised that so little is said about this.

Option B in your mail (also writing the delete marker to the historical records) sounds like a good option, but that would only work for MOR and not COW.  I was wondering if you could see a way of deleting the user's data for COW as well, as I understood that COW is the most commonly used, correct?

I think that option A would indeed be cumbersome.

You mentioned that you didn't quite understand how our replay scenario works.  Siva knows the details better than I do, but the idea is to re-read the whole timeline, and then to re-write the data into folders which are named based on the date of the original write, but not to write the records for the RTBF persons. But in this scenario Hudi would not recognize redundant data between folders and would not save on storage.

Still, I like option B and I would like to speak to my colleagues about this and get their input.

Kind Regards,
David Rosalia


________________________________
From: Vinoth Chandar <vi...@apache.org>
Sent: Saturday, August 22, 2020 8:41:00 AM
To: dev@hudi.apache.org <de...@hudi.apache.org>
Subject: Re: Incremental query on partition column

Hi David,

Thanks for the detailed email, and apologies for the sudden break in
communication.

>We wanted to manipulate the commit times to rebuild the history.
Yes, best not to try and change the commit times/history.

>- replay the data omitting the data of the persons who have requested to
>be forgotten, but writing to a date-based partition folder using the
>"partitionpath" parameter.
I don't follow this fully :(

The tricky thing here is the combination of RTBF + incremental queries,
which honestly should have come up a lot more often :)

A couple of ideas that came to mind:

A) Do it at the application level: store the users who want to be
forgotten in a separate rtbf_users table and filter out records belonging to
these users in each snapshot/incremental query with a join. Most query
engines will turn this into a cheap map join if the rtbf_users table is
small. Of course, this is onerous, and every query needs to remember to do it.
But since the user record can be deleted in the latest snapshot, only the
ETLs that use incremental queries will need to do this, hopefully a more
tractable subset.

B) Do it inside Hudi: we could log a delete block to all file slices within
the file group, and not just the latest one. It's kind of weird that there
is a write to an older file slice. But functionally, for MOR tables, this
can achieve what you want. Still mulling whether this is indeed the right
approach.



On Fri, Aug 21, 2020 at 1:02 PM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>  Thanks for the detailed email, David. We discussed this in last week's
> community meeting, and Vinoth had ideas on how to implement this. This is
> something that can be supported by the timeline layout that Hudi has. It
> would be a new feature (a new write operation) that basically appends the
> delete marker to all versions of the data instead of just the latest.
> Opened a Jira: https://issues.apache.org/jira/browse/HUDI-1212
> Balaji.V
>
>
>
>     On Friday, August 14, 2020, 06:12:26 AM PDT, David Rosalia <
> davidrosalia@hotmail.com> wrote:
>
>  Hello,
>
> I am Siva's colleague and I am working on the problem below as well.
>
> I would like to describe what we are trying to achieve with Hudi, as well
> as our current way of working and our GDPR and "Right To Be Forgotten"
> (RTBF) compliance policies.
>
> Our requirements :
> - We wish to apply a strict interpretation of the RTBF.  In other words,
> when we remove a person's data, it should be throughout the historical data
> and not just the latest snapshot.
> - We wish to use Hudi to reduce our storage requirements using upserts and
> don't want to have duplicates between commits.
> - We wish to retain history for persons who have not requested to be
> forgotten and therefore we do not want to delete commit files from the
> history as some have proposed.
>
> We have tried a couple of solutions, but so far without success :
> - replay the data omitting the data of the persons who have requested to
> be forgotten.  We wanted to manipulate the commit times to rebuild the
> history.
> We found that we couldn't manipulate the commit times and retain the
> history.
>
> - replay the data omitting the data of the persons who have requested to
> be forgotten, but writing to a date-based partition folder using the
> "partitionpath" parameter.
> We found that commits using upserts across the partitionpath folders do
> not skip data that is unchanged between two commit dates, as they do when
> using the default commit file system, so we will not save on storage or
> speed up our processing using this technique.
>
> So basically we would like to find a way to apply a strict RTBF, GDPR,
> maintain history and time-travel (large history) and save storage space
> using Hudi.
>
> Can anyone see a way to achieve this?
>
> Kind Regards,
> David Rosalia
>
>
>
> ________________________________
> From: Vinoth Chandar <vi...@apache.org>
> Sent: Friday, August 14, 2020 8:26:22 AM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Subject: Re: Incremental query on partition column
>
> Hi,
>
> On re-ingesting, do you mean to say you want to overwrite the table while
> not getting the changes in the incremental query?  This has not come up
> before.
> As you can imagine, it'd be a tricky scenario, where we would need some
> special handling/action type introduced.
>
> Yes, yes on the next two questions.
> Commit time can be controlled if using the HoodieWriteClient API, but not
> via datasource/deltastreamer atm.
>
> On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash <
> sivaprakashshanmugam@gmail.com>
> wrote:
>
> > Hi,
> >
> >
> > What design can be used when we re-ingest data without affecting
> > incremental queries?
> >
> >    - Is it possible to maintain a delta dataset across partitions (
> >    hoodie.datasource.write.partitionpath.field)? In my case it is a date.
> >    - Can I run a snapshot query across all partitions and on specific partitions?
> >    - Or is it possible to control Hudi's commit time?
> >
> >
> > Thanks
> >

Disclaimer <http://www.kbc.com/KBCmailDisclaimer>


Re: Incremental query on partition column

Posted by Wim Van Leuven <wi...@kbc.be.INVALID>.
Hey David,

The 1st solution they propose is actually not GDPR compliant because the data is still there on disk ... so it is not just cumbersome.

BTW, what's COW and MOR?
-w

Re: Incremental query on partition column

Posted by David Rosalia <da...@hotmail.com>.
Good morning Balaji, Vinoth,

Thank you both for your replies.  I agree that this is a topic that should come up more often and I am surprised that so little is said about this.

Option B in your mail (also writing the delete marker to the historical records) sounds like a good option, but that would only work for MOR and not COW.  I was wondering if you could see a way of deleting the user's data for COW as well, as I understood that COW is the most commonly used, correct?

I think that option A would indeed be cumbersome.
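
To make the cost of option A concrete, here is a minimal sketch in plain Python, not Hudi or Spark code. The `rtbf_users` name comes from the thread; the `user_id` field, the sample rows, and the `filter_forgotten` helper are assumptions made up for illustration. It shows the anti-join that every incremental consumer would have to apply, and that the filter only changes what the query returns.

```python
# Toy illustration of option A: drop forgotten users at query time.
# The row layout and helper name are hypothetical, not part of Hudi.

def filter_forgotten(rows, rtbf_users):
    """Anti-join: keep only rows whose user_id is not in rtbf_users."""
    forgotten = set(rtbf_users)  # small table held in memory, like a map join
    return [row for row in rows if row["user_id"] not in forgotten]


# Stand-in for the result of an incremental query over two commits.
incremental_rows = [
    {"user_id": "u1", "commit_time": "20200813", "value": 10},
    {"user_id": "u2", "commit_time": "20200813", "value": 20},
    {"user_id": "u1", "commit_time": "20200814", "value": 11},
]
rtbf_users = ["u1"]

visible = filter_forgotten(incremental_rows, rtbf_users)
print(visible)                 # only the u2 row is returned
print(len(incremental_rows))   # the underlying rows are untouched
```

Any consumer that skips the `filter_forgotten` step still sees the `u1` rows, which is the part that makes this approach easy to get wrong.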

You mentioned that you didn't quite understand how our replay scenario works.  Siva knows the details better than I do, but the idea is to re-read the whole timeline, and then to re-write the data into folders which are named based on the date of the original write, but not to write the records for the RTBF persons. But in this scenario Hudi would not recognize redundant data between folders and would not save on storage.
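
The storage observation above can be illustrated with a toy model in plain Python. It assumes that an upsert matches an existing record only within the same partition path, which is consistent with what the replay experiment showed but is stated here as an assumption, not as a description of Hudi internals. The `upsert` helper and the sample keys are hypothetical.

```python
# Toy model: record identity is (partition path, record key).
# This is an assumption for illustration, not Hudi's implementation.

def upsert(table, partition, key, value):
    """Write a record; return True if a new entry had to be stored."""
    slot = (partition, key)
    is_new = slot not in table
    table[slot] = value
    return is_new


table = {}
results = [
    # The same unchanged record replayed into two date-based folders.
    upsert(table, "2020-08-13", "u2", "unchanged"),
    upsert(table, "2020-08-14", "u2", "unchanged"),
    # The same record upserted again into one folder.
    upsert(table, "2020-08-14", "u2", "unchanged"),
]
print(results)      # which writes created a new stored entry
print(len(table))   # number of stored copies of record u2
```

Under that assumption the record is stored once per date folder, so replaying unchanged data into a new folder costs storage each time, while repeating the write inside one folder does not.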

Still, I like option B and I would like to speak to my colleagues about this and get their input.

Kind Regards,
David Rosalia



Re: Incremental query on partition column

Posted by Vinoth Chandar <vi...@apache.org>.
Hi David,

Thanks for the detailed email, and apologies for the sudden break in
communication.

>We wanted to manipulate the commit times to rebuild the history.
Yes, best not to try and change the commit times/history.

>- replay the data omitting the data of the persons who have requested to
>be forgotten, but writing to a date-based partition folder using the
>"partitionpath" parameter.
I don't follow this fully :(

The tricky thing here is the combination of RTBF + incremental queries,
which honestly should have come up a lot more often :)

A couple of ideas that came to mind:

A) Do it at the application level: store the users who want to be
forgotten in a separate rtbf_users table and filter out records belonging to
these users in each snapshot/incremental query with a join. Most query
engines will turn this into a cheap map join if the rtbf_users table is
small. Of course, this is onerous, and every query needs to remember to do it.
But since the user record can be deleted in the latest snapshot, only the
ETLs that use incremental queries will need to do this, hopefully a more
tractable subset.

B) Do it inside Hudi: we could log a delete block to all file slices within
the file group, and not just the latest one. It's kind of weird that there
is a write to an older file slice. But functionally, for MOR tables, this
can achieve what you want. Still mulling whether this is indeed the right
approach.
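
As a rough illustration of option B, the following plain-Python toy models a file group as a list of file slices, each with base records and an ordered list of log blocks. It is a sketch of the idea only: the `FileSlice` class, the `delete_everywhere` helper, and the block layout are hypothetical and do not reflect Hudi's storage format or API.

```python
# Toy model of option B. Not Hudi's storage format or API.
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    commit_time: str
    base: dict                                # record key -> value in the base file
    log: list = field(default_factory=list)   # ordered (kind, payload) log blocks

    def read(self):
        """Merge-on-read: apply the log blocks on top of the base records."""
        merged = dict(self.base)
        for kind, payload in self.log:
            if kind == "data":
                merged.update(payload)
            elif kind == "delete":
                for key in payload:
                    merged.pop(key, None)
        return merged


def delete_everywhere(file_group, keys):
    """Append a delete block to every slice, not only the latest one."""
    for file_slice in file_group:
        file_slice.log.append(("delete", list(keys)))


file_group = [
    FileSlice("20200813", {"u1": "a", "u2": "b"}),
    FileSlice("20200814", {"u1": "a2", "u2": "b"}, [("data", {"u3": "c"})]),
]
delete_everywhere(file_group, ["u1"])

for file_slice in file_group:
    print(file_slice.commit_time, file_slice.read())
```

In this toy, reads of both the older and the newer slice no longer return `u1`, while the key is still present in each slice's `base` dict and is only hidden at read time.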



On Fri, Aug 21, 2020 at 1:02 PM Balaji Varadarajan
<v....@ymail.com.invalid> wrote:

>  Thanks for the detailed email, David. We discussed this in last week's
> community meeting, and Vinoth had ideas on how to implement this. This is
> something that can be supported by the timeline layout that Hudi has. It
> would be a new feature (a new write operation) that basically appends the
> delete marker to all versions of the data instead of just the latest.
> Opened a Jira: https://issues.apache.org/jira/browse/HUDI-1212
> Balaji.V
>
>
>
>     On Friday, August 14, 2020, 06:12:26 AM PDT, David Rosalia <
> davidrosalia@hotmail.com> wrote:
>
>  Hello,
>
> I am Siva's colleague and I am working on the problem below as well.
>
> I would like to describe what we are trying to achieve with Hudi, as well
> as our current way of working and our GDPR and "Right To Be Forgotten"
> (RTBF) compliance policies.
>
> Our requirements:
> - We wish to apply a strict interpretation of the RTBF.  In other words,
> when we remove a person's data, it should be throughout the historical data
> and not just the latest snapshot.
> - We wish to use Hudi to reduce our storage requirements using upserts and
> don't want to have duplicates between commits.
> - We wish to retain history for persons who have not requested to be
> forgotten and therefore we do not want to delete commit files from the
> history as some have proposed.
>
> We have tried a couple of solutions, but so far without success:
> - replay the data omitting the data of the persons who have requested to
> be forgotten.  We wanted to manipulate the commit times to rebuild the
> history.
> We found that we couldn't manipulate the commit times and retain the
> history.
>
> - replay the data omitting the data of the persons who have requested to
> be forgotten, but writing to a date-based partition folder using the
> "partitionpath" parameter.
> We found that upserts across the partitionpath folders do not ignore data
> that is unchanged between 2 commit dates, as they do when using the
> default commit file system, so this technique will not save storage or
> speed up our processing.
>
> So basically, we would like to find a way to apply strict RTBF and GDPR
> compliance, maintain history and time travel (large history), and save
> storage space using Hudi.
>
> Can anyone see a way to achieve this?
>
> Kind Regards,
> David Rosalia
>
>
>
> ________________________________
> From: Vinoth Chandar <vi...@apache.org>
> Sent: Friday, August 14, 2020 8:26:22 AM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Subject: Re: Incremental query on partition column
>
> Hi,
>
> On re-ingesting, do you mean to say you want to overwrite the table, while
> not getting the changes in the incremental query?  This has not come up
> before.
> As you can imagine, it'd be a tricky scenario, where we would need some
> special handling/action type introduced.
>
> Yes and yes on the next two questions.
> The commit time can be controlled if using the HoodieWriteClient API, but
> not with the datasource/deltastreamer at the moment.
>
> On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash <
> sivaprakashshanmugam@gmail.com>
> wrote:
>
> > Hi,
> >
> >
> > What is the design that can be used/implemented when we re-ingest the
> data
> > without affecting incremental query?
> >
> >
> >
> >    - Is it possible to maintain a delta dataset across partitions (
> >    hoodie.datasource.write.partitionpath.field) ? In my case it is a
> date.
> >    - Can I do a snapshot query on across and specific partitions?
> >    - Or, possible to control Hudi's commit time?
> >
> >
> > Thanks
> >

Re: Incremental query on partition column

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.
Thanks for the detailed email, David. We discussed this in last week's community meeting, and Vinoth had ideas on how to implement this. This is something that can be supported by the timeline layout that Hudi has. It would be a new feature (a new write operation) that basically appends the delete marker to all versions of the data instead of just the latest.
Opened a Jira: https://issues.apache.org/jira/browse/HUDI-1212
Balaji.V



    On Friday, August 14, 2020, 06:12:26 AM PDT, David Rosalia <da...@hotmail.com> wrote:  
 
 Hello,

I am Siva's colleague and I am working on the problem below as well.

I would like to describe what we are trying to achieve with Hudi, as well as our current way of working and our GDPR and "Right To Be Forgotten" (RTBF) compliance policies.

Our requirements:
- We wish to apply a strict interpretation of the RTBF.  In other words, when we remove a person's data, it should be throughout the historical data and not just the latest snapshot.
- We wish to use Hudi to reduce our storage requirements using upserts and don't want to have duplicates between commits.
- We wish to retain history for persons who have not requested to be forgotten and therefore we do not want to delete commit files from the history as some have proposed.

We have tried a couple of solutions, but so far without success:
- replay the data omitting the data of the persons who have requested to be forgotten.  We wanted to manipulate the commit times to rebuild the history.
We found that we couldn't manipulate the commit times and retain the history.

- replay the data omitting the data of the persons who have requested to be forgotten, but writing to a date-based partition folder using the "partitionpath" parameter.
We found that upserts across the partitionpath folders do not ignore data that is unchanged between 2 commit dates, as they do when using the default commit file system, so this technique will not save storage or speed up our processing.

So basically, we would like to find a way to apply strict RTBF and GDPR compliance, maintain history and time travel (large history), and save storage space using Hudi.

Can anyone see a way to achieve this?

Kind Regards,
David Rosalia



________________________________
From: Vinoth Chandar <vi...@apache.org>
Sent: Friday, August 14, 2020 8:26:22 AM
To: dev@hudi.apache.org <de...@hudi.apache.org>
Subject: Re: Incremental query on partition column

Hi,

On re-ingesting, do you mean to say you want to overwrite the table, while
not getting the changes in the incremental query?  This has not come up
before.
As you can imagine, it'd be a tricky scenario, where we would need some
special handling/action type introduced.

Yes and yes on the next two questions.
The commit time can be controlled if using the HoodieWriteClient API, but not
with the datasource/deltastreamer at the moment.

On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash <si...@gmail.com>
wrote:

> Hi,
>
>
> What is the design that can be used/implemented when we re-ingest the data
> without affecting incremental query?
>
>
>
>    - Is it possible to maintain a delta dataset across partitions (
>    hoodie.datasource.write.partitionpath.field) ? In my case it is a date.
>    - Can I do a snapshot query on across and specific partitions?
>    - Or, possible to control Hudi's commit time?
>
>
> Thanks
>  

Re: Incremental query on partition column

Posted by David Rosalia <da...@hotmail.com>.
Hello,

I am Siva's colleague and I am working on the problem below as well.

I would like to describe what we are trying to achieve with Hudi, as well as our current way of working and our GDPR and "Right To Be Forgotten" (RTBF) compliance policies.

Our requirements:
- We wish to apply a strict interpretation of the RTBF.  In other words, when we remove a person's data, it should be throughout the historical data and not just the latest snapshot.
- We wish to use Hudi to reduce our storage requirements using upserts and don't want to have duplicates between commits.
- We wish to retain history for persons who have not requested to be forgotten and therefore we do not want to delete commit files from the history as some have proposed.

We have tried a couple of solutions, but so far without success:
- replay the data omitting the data of the persons who have requested to be forgotten.  We wanted to manipulate the commit times to rebuild the history.
We found that we couldn't manipulate the commit times and retain the history.

- replay the data omitting the data of the persons who have requested to be forgotten, but writing to a date-based partition folder using the "partitionpath" parameter.
We found that upserts across the partitionpath folders do not ignore data that is unchanged between 2 commit dates, as they do when using the default commit file system, so this technique will not save storage or speed up our processing.

So basically, we would like to find a way to apply strict RTBF and GDPR compliance, maintain history and time travel (large history), and save storage space using Hudi.

Can anyone see a way to achieve this?

Kind Regards,
David Rosalia



________________________________
From: Vinoth Chandar <vi...@apache.org>
Sent: Friday, August 14, 2020 8:26:22 AM
To: dev@hudi.apache.org <de...@hudi.apache.org>
Subject: Re: Incremental query on partition column

Hi,

On re-ingesting, do you mean to say you want to overwrite the table, while
not getting the changes in the incremental query?  This has not come up
before.
As you can imagine, it'd be a tricky scenario, where we would need some
special handling/action type introduced.

Yes and yes on the next two questions.
The commit time can be controlled if using the HoodieWriteClient API, but not
with the datasource/deltastreamer at the moment.

On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash <si...@gmail.com>
wrote:

> Hi,
>
>
> What is the design that can be used/implemented when we re-ingest the data
> without affecting incremental query?
>
>
>
>    - Is it possible to maintain a delta dataset across partitions (
>    hoodie.datasource.write.partitionpath.field) ? In my case it is a date.
>    - Can I do a snapshot query on across and specific partitions?
>    - Or, possible to control Hudi's commit time?
>
>
> Thanks
>

Re: Incremental query on partition column

Posted by Vinoth Chandar <vi...@apache.org>.
Hi,

On re-ingesting, do you mean to say you want to overwrite the table, while
not getting the changes in the incremental query?  This has not come up
before.
As you can imagine, it'd be a tricky scenario, where we would need some
special handling/action type introduced.

Yes and yes on the next two questions.
The commit time can be controlled if using the HoodieWriteClient API, but not
with the datasource/deltastreamer at the moment.
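
For the incremental-query side of the question, a minimal sketch of how the
read options might be assembled for the Spark datasource. The helper
incremental_read_options is hypothetical, and the option keys are the ones I
recall from Hudi's datasource read options; older releases used different key
names, so please verify them against the documentation for your Hudi version.

```python
def incremental_read_options(begin_instant, end_instant=None, partition_glob=None):
    """Build a dict of Hudi datasource read options for an incremental query.

    begin_instant / end_instant are commit times on the Hudi timeline;
    partition_glob optionally restricts the query to matching partition paths.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    if partition_glob is not None:
        # Assumed key for limiting an incremental query to some partitions.
        opts["hoodie.datasource.read.incr.path.glob"] = partition_glob
    return opts


if __name__ == "__main__":
    # Hypothetical commit time and date-based partition layout.
    opts = incremental_read_options("20200813000000", partition_glob="/2020/08/*")
    for key, value in opts.items():
        print(key, "=", value)
    # With PySpark available (not run here), these would be passed as:
    #   spark.read.format("org.apache.hudi").options(**opts).load(base_path)
```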

On Thu, Aug 13, 2020 at 12:13 AM Sivaprakash <si...@gmail.com>
wrote:

> Hi,
>
>
> What is the design that can be used/implemented when we re-ingest the data
> without affecting incremental query?
>
>
>
>    - Is it possible to maintain a delta dataset across partitions (
>    hoodie.datasource.write.partitionpath.field) ? In my case it is a date.
>    - Can I do a snapshot query on across and specific partitions?
>    - Or, possible to control Hudi's commit time?
>
>
> Thanks
>