Posted to dev@hudi.apache.org by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID> on 2021/04/13 20:44:52 UTC

GDPR deletes and Consenting deletes of data from hudi table

Hi All,

I have hundreds of Hudi tables (on AWS S3), each populated via Spark Structured Streaming from Kafka streams. I now have to delete the records for a given user (userId) from every table that holds data for that user, meaning all tables that reference that specific userId. I cannot republish all the events/records for that user to Kafka to perform the delete, since there is around 10-15 years' worth of data for each user and doing so would be costly and time-consuming. So I am wondering how everybody is handling GDPR deletes on their Hudi tables?


How do I get a delete request?
On a delete Kafka topic we get a delete event (which contains only the userId of the user to delete), so we have to use that as the filter condition, read all the matching records from the Hudi tables, and write them back with the data source operation set to 'delete'. But if the streaming job continues to ingest newly arriving data while this delete Spark job runs on the table, what will the side effects be? Will it work at all, since it seems multiple writers are not currently supported?
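
For illustration, the write options for such a delete job might look like the sketch below. The option keys are Hudi datasource write configs as I understand them; the table name and the field names passed in are hypothetical placeholders, and the Spark calls are left as comments because they need a running cluster. Depending on the Hudi version, the read path may also need a partition glob.

```python
from typing import Dict


def hudi_delete_options(table_name: str,
                        record_key_field: str,
                        partition_path_field: str,
                        precombine_field: str) -> Dict[str, str]:
    """Build Hudi datasource write options for a delete write.

    All argument values are supplied by the caller; nothing here is
    specific to the tables discussed in this thread.
    """
    return {
        "hoodie.table.name": table_name,
        # "delete" removes the records whose keys appear in the written DataFrame.
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.partitionpath.field": partition_path_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Hypothetical usage inside a Spark job (not executed here):
#   opts = hudi_delete_options("telemetry_events", "eventId", "eventDate", "ts")
#   df = spark.read.format("hudi").load(base_path)
#   to_delete = df.filter(df["userId"] == user_id)
#   to_delete.write.format("hudi").options(**opts).mode("append").save(base_path)
```

The same options can be reused for every table, changing only the table name and key fields.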

Could you help me with a solution?

Regards,
Felix K Jose

________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.

Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Vinoth,

First of all thank you so much for the response.

Now let me explain in detail my use case:
We receive 5.6 million telemetry data items every day, and each one produces 500-1000 Kafka messages. These high-frequency Kafka topics are compacted, meaning a TTL of 1 week on Kafka storage. The messages land in different tables (Hudi tables on S3) based on their type. Now and then we get a delete request (yes, it's a small fraction), either because data was wrongly sent to an incorrect user or because it's a GDPR request, and the request carries only the userId as an attribute. Based on this userId we have to query those tables, pull the records that belong to that user, and delete them.

So if I understand you correctly, you are asking whether I can query those tables with a Spark job, pull the corresponding records, and write them back (as output) to the same topic we use for ingestion, so that the existing streaming ingestion can handle them by performing a delete or upsert based on flags?

The other option you mention is a batch job that deletes the records while the streaming jobs are still ingesting data, using the new OCC feature, where one of the jobs fails if both touch the same file. But which one will fail: the streaming job or the delete batch job?

I am on AWS EMR, and it doesn't have the latest version (0.8.0) yet, so it would be nice to get a custom build for EMR. Then I could test the feature and, if I get stuck, borrow some help from one of you.

Once again thank you.

Regards,
Felix K Jose
From: Vinoth Chandar <vi...@apache.org>
Date: Wednesday, April 14, 2021 at 3:26 PM
To: dev <de...@hudi.apache.org>
Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
Caution: This e-mail originated from outside of Philips, be careful for phishing.


Hi Felix,

Most people, I think, publish this data into Kafka and apply the
deletes as part of the streaming job itself. This works because,
typically, only a small fraction of users leave the service (<< 0.1%
weekly is what I have heard), so the cost of storage on Kafka is not
much. Is that not the case for you? Are you looking for a one-time
scrubbing of data, for example? The benefit of this approach is that you
eliminate any concurrency issues that arise from the streaming job
producing data for a user while deletes are also being issued for that
user.

On concurrency control, Hudi now supports multiple writers, if you want to
write a background job that performs these deletes for you. It's in
0.8.0; see https://hudi.apache.org/docs/concurrency_control.html. One of us
can help you with trying this out and rolling it out (Nishith is the
feature author). Here, if the delete job touches the same files that the
streaming job is writing to, then only one of them will succeed.
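
As a rough sketch of what enabling multi-writer looks like, both the streaming job and the delete job would carry lock and concurrency options along the lines below. The keys are the optimistic concurrency control settings as I recall them from the concurrency control docs linked above, using the ZooKeeper-based lock provider; the ZooKeeper host, port, lock key, and base path are hypothetical values, and the keys should be checked against the Hudi version in use.

```python
from typing import Dict


def hudi_occ_options(zk_url: str,
                     zk_port: int,
                     lock_key: str,
                     zk_base_path: str) -> Dict[str, str]:
    """Build write options that turn on optimistic concurrency control.

    Every writer to the same table is assumed to use the same lock
    settings so that commits are serialized through one lock.
    """
    return {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        # Failed writes are cleaned up lazily when several writers are active.
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
        "hoodie.write.lock.zookeeper.url": zk_url,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": lock_key,
        "hoodie.write.lock.zookeeper.base_path": zk_base_path,
    }


# Hypothetical usage: merge with the job's normal write options.
#   opts = {**base_write_options,
#           **hudi_occ_options("zk-host", 2181, "telemetry_events", "/hudi/locks")}
```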

We are working on a design for true lock-free concurrency control, which
provides the benefits of both models, but it won't be there for another
month or two.

Thanks
Vinoth



Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by Vinoth Chandar <vi...@apache.org>.
>But which one will fail, either streaming or delete batch job?

That's the pitfall with an OCC-based approach: we can't really choose. You
probably also need to break up your delete job to be more incremental and
run in smaller batches to avoid contention. Otherwise, you'll get into a
scenario where the delete job runs for a few hours and then always fails
to commit because the streaming job wrote some conflicting data.
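
One way to picture the smaller-batches idea is the sketch below: pending userIds are split into small chunks, each chunk is committed separately, and a chunk that fails is retried a few times before being set aside. The `delete_batch` callable is a hypothetical stand-in for the actual Spark delete write, and catching a generic exception is an assumption, since the exact conflict exception type seen by the job is not specified in this thread.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_in_batches(user_ids: Sequence[str],
                   batch_size: int,
                   delete_batch: Callable[[List[str]], None],
                   max_attempts: int = 3) -> List[List[str]]:
    """Run `delete_batch` per chunk and return the chunks that never succeeded."""
    failed: List[List[str]] = []
    for batch in chunked(user_ids, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                delete_batch(batch)  # one small commit per chunk
                break
            except Exception:
                # Assumed: a write conflict surfaces as an exception.
                if attempt == max_attempts:
                    failed.append(batch)
    return failed
```

Short commits narrow the window in which the streaming writer can touch the same files, so a conflict costs only one small retry rather than hours of work.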

>you are asking whether I can query those tables and pull the corresponding
>records by spark job and write it back (output) to the same topic

Yes. That's what we did at Uber, for example.
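
A minimal sketch of that write-back step, assuming the delete flag is Hudi's `_hoodie_is_deleted` marker column and that the streaming job upserts with a payload that honors it: each matching record is reduced to its key fields plus the flag and serialized for the ingestion topic. The field names are hypothetical, and whether the remaining columns may be omitted depends on the table schema allowing nulls.

```python
import json
from typing import Any, Dict, Sequence


def to_delete_message(record: Dict[str, Any],
                      key_fields: Sequence[str],
                      delete_flag: str = "_hoodie_is_deleted") -> str:
    """Serialize a delete marker for one record as a JSON string.

    `key_fields` should cover whatever the table uses as record key,
    partition path, and precombine field.
    """
    payload = {field: record[field] for field in key_fields}
    payload[delete_flag] = True
    return json.dumps(payload, sort_keys=True)


# Hypothetical usage: produce one message per record returned by the
# userId query, then publish it to the same topic used for ingestion.
```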

Thanks
Vinoth

On Wed, Apr 14, 2021 at 12:59 PM nishith agarwal <n3...@gmail.com>
wrote:

> No worries. Is the custom build something you can work with the AWS team to
> get installed so that you can test?
>
> -Nishith
>
> On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
> <fe...@philips.com.invalid> wrote:
>
> > Hi Nishith, Vinoth,
> >
> > Thank you so much for the quick response and offering the help.
> >
> > Regards,
> > Felix K Jose
> > From: Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
> > Date: Wednesday, April 14, 2021 at 3:55 PM
> > To: dev@hudi.apache.org <de...@hudi.apache.org>
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Hi Nishith,
> >
> > As I mentioned, we are on AWS EMR, but I don’t think 0.8.0 is
> > available as part of the existing release, so we need a custom build
> > to get it working on the latest EMR, 6.1.0.
> >
> > Regards,
> > Felix K Jose
> > From: nishith agarwal <n3...@gmail.com>
> > Date: Wednesday, April 14, 2021 at 3:49 PM
> > To: dev <de...@hudi.apache.org>
> > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> >
> >
> > Felix,
> >
> > Happy to help you through trying and rolling out multi-writer on Hudi
> > tables. Do you have a test environment where you can try out the feature
> > by following the doc that Vinoth pointed to above?
> >
> > Thanks,
> > Nishith

Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by Vinoth Chandar <vi...@apache.org>.
If you want to quickly try something, you can also build the jar off master
and run it independently (this works for client mode/spark-shell experiments):
https://dev.to/bytearray/using-your-own-apache-spark-hudi-versions-with-aws-emr-40a0



On Thu, Apr 15, 2021 at 6:09 AM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Nishith,
>
> I will check with Udit M, since he had helped me in the past with a custom
> jar for EMR.
>
> Regards,
> Felix K Jose

Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Nishith,

I will check with Udit M, since he had helped me in the past with a custom jar for EMR.

Regards,
Felix K Jose

Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by nishith agarwal <n3...@gmail.com>.
No worries. Is the custom build something you can work with the AWS team to
get installed so that you can test?

-Nishith

On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
<fe...@philips.com.invalid> wrote:

> Hi Nishith, Vinoth,
>
> Thank you so much for the quick response and offering the help.
>
> Regards,
> Felix K Jose
> From: Kizhakkel Jose, Felix <fe...@philips.com.INVALID>
> Date: Wednesday, April 14, 2021 at 3:55 PM
> To: dev@hudi.apache.org <de...@hudi.apache.org>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> Caution: This e-mail originated from outside of Philips, be careful for
> phishing.
>
>
> Hi Nishith,
>
> As I mentioned we are on AWS EMR, but I don’t think we have this 0.8.0
> version available as part of existing version. So we need a custom build
> for working it on latest EMR 6.1.0
>
> Regards,
> Felix K Jose
> From: nishith agarwal <n3...@gmail.com>
> Date: Wednesday, April 14, 2021 at 3:49 PM
> To: dev <de...@hudi.apache.org>
> Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
> Caution: This e-mail originated from outside of Philips, be careful for
> phishing.
>
>
> Felix,
>
> Happy to help you through trying and rolling out multi-writer on Hudi
> tables. Do you have a test environment where you can try out the feature by
> following the doc that Vinoth pointed above ?
>
> Thanks,
> Nishith
>
> On Wed, Apr 14, 2021 at 12:26 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi Felix,
> >
> > Most people I think are publishing this data into Kafka,and apply the
> > deletes as a part of the streaming job itself. The reason why this works
> is
> > because typically, only a small fraction of users leave the service (say
> <<
> > 0.1% weekly is what I have heard). So, the cost of storage on Kafka is
> not
> > much. Is that not the case for you? Are you looking for one time
> scrubbing
> > of data for e.g? The benefit of this approach is that you eliminate any
> > concurrency issues that arise from streaming job producing data for a
> user,
> > while the deletes are also issued for that user.
> >
> > On concurrency control, Hudi now supports multiple writers, if you want
> to
> > write a background job that will perform these deletes for you. it's in
> > 0.8.0, see
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhudi.apache.org%2Fdocs%2Fconcurrency_control.html&amp;data=04%7C01%7C%7Cd9f8ea00fdba484d5cee08d8ff7f3df6%7C1a407a2d76754d178692b3ac285306e4%7C0%7C1%7C637540269251879953%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=5PqHnY%2FI7i3u7z31irU8Xu5VRW1niA0ljfWUWm0vjDY%3D&amp;reserved=0.
> One of
> > us
> > can help you out with trying this and rolling out. (Nishith is the
> feature
> > author). Here, if the delete job touches same files, that the streaming
> job
> > is writing to, then only one of them will succeed.
> >
> > We are working on a design for true lock free concurrency control, which
> > provides the benefits of both models. But, won't be there for another
> month
> > or two.
> >
> > Thanks
> > Vinoth
> >
> >

Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Nishith, Vinoth,

Thank you so much for the quick response and for offering to help.

Regards,
Felix K Jose


Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.
Hi Nishith,

As I mentioned, we are on AWS EMR, but I don’t think Hudi 0.8.0 is available in the existing EMR release. So we would need a custom build to get it working on the latest EMR, 6.1.0.

Regards,
Felix K Jose


Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by nishith agarwal <n3...@gmail.com>.
Felix,

Happy to help you through trying out and rolling out multi-writer on Hudi
tables. Do you have a test environment where you can try out the feature by
following the doc that Vinoth pointed to above?

Thanks,
Nishith

Re: GDPR deletes and Consenting deletes of data from hudi table

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Felix,

I think most people publish this data into Kafka and apply the deletes as
part of the streaming job itself. This works because typically only a small
fraction of users leave the service (much less than 0.1% weekly is what I
have heard), so the cost of storage on Kafka is not much. Is that not the
case for you? Are you looking for a one-time scrubbing of data, for example?
The benefit of this approach is that you eliminate any concurrency issues
that arise from the streaming job producing data for a user while deletes
are also being issued for that user.
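
One possible way for a streaming job to apply deletes inline is Hudi's
`_hoodie_is_deleted` column: on an upsert, rows where that boolean column is
true are removed from the table instead of being written. The sketch below
shows only the tagging step, on plain Python dicts so that it runs without
Spark. The `userId` field comes from this thread; `tag_deletes`, `eventId`
and the sample records are made up for illustration. In a real job the same
flag would be added as a DataFrame column before the Hudi write, and it only
covers records whose keys actually pass through the stream.

```python
from typing import Any, Dict, Iterable, List, Set

# Column that Hudi checks during an upsert; rows with True are deleted.
DELETE_FLAG = "_hoodie_is_deleted"


def tag_deletes(records: Iterable[Dict[str, Any]],
                deleted_user_ids: Set[str]) -> List[Dict[str, Any]]:
    """Return copies of the records with the delete flag set per user."""
    return [
        {**record, DELETE_FLAG: record["userId"] in deleted_user_ids}
        for record in records
    ]


# Hypothetical micro-batch: user "u2" has asked to be deleted.
batch = [
    {"eventId": "e1", "userId": "u1"},
    {"eventId": "e2", "userId": "u2"},
]
tagged = tag_deletes(batch, {"u2"})
print([(r["eventId"], r[DELETE_FLAG]) for r in tagged])
```

Because the flag travels with the normal upsert, the streaming job stays the
only writer on the table, which is the concurrency benefit described above.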

On concurrency control, Hudi now supports multiple writers, if you want to
write a background job that will perform these deletes for you. It's in
0.8.0; see https://hudi.apache.org/docs/concurrency_control.html. One of us
can help you with trying this out and rolling it out (Nishith is the feature
author). Here, if the delete job touches the same files that the streaming
job is writing to, then only one of them will succeed.
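
A background delete job along these lines might look like the sketch below.
The `hoodie.*` keys are the writer and optimistic concurrency control options
described in the linked doc. The helper names, the ZooKeeper lock path, the
retry policy and the broad `except` are assumptions for illustration, not a
recommended setup. `delete_user` expects a PySpark session and is not run
here. As I understand the doc, the streaming writer needs the same
concurrency and lock settings for the conflict check to work.

```python
import time
from typing import Callable, Dict

ZK_LOCK_PROVIDER = (
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider"
)


def delete_job_options(table_name: str, record_key: str, partition_field: str,
                       precombine_field: str, zk_url: str,
                       zk_port: int = 2181) -> Dict[str, str]:
    """Build Hudi writer options for a delete issued next to another writer."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Multi-writer settings (Hudi 0.8.0 and later).
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider": ZK_LOCK_PROVIDER,
        "hoodie.write.lock.zookeeper.url": zk_url,
        "hoodie.write.lock.zookeeper.port": str(zk_port),
        "hoodie.write.lock.zookeeper.lock_key": table_name,
        "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",  # hypothetical
    }


def delete_user(spark, base_path: str, user_id: str,
                options: Dict[str, str]) -> None:
    """Read one user's rows and write them back with operation=delete."""
    # Older Hudi releases may need a partition glob appended to base_path.
    rows = spark.read.format("hudi").load(base_path)
    to_delete = rows.filter(rows["userId"] == user_id)
    to_delete.write.format("hudi").options(**options).mode("append").save(base_path)


def retry_on_conflict(action: Callable[[], None], attempts: int = 3,
                      wait_seconds: float = 30.0) -> int:
    """Run action, retrying on failure; return the attempt that succeeded."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            action()
            return attempt
        except Exception:
            # Assumption: a real job would check that the error is a write
            # conflict before retrying, rather than catching everything.
            if attempt == attempts:
                raise
            time.sleep(wait_seconds)
    raise AssertionError("unreachable")
```

Since only one of two conflicting writers succeeds, wrapping the delete in a
retry, for example
`retry_on_conflict(lambda: delete_user(spark, path, "u2", opts))`, lets the
background job try again after the streaming commit lands.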

We are working on a design for true lock-free concurrency control, which
provides the benefits of both models, but it won't be there for another
month or two.

Thanks
Vinoth

