Posted to dev@hudi.apache.org by Ivan Panico <iv...@gmail.com> on 2019/04/16 12:51:47 UTC

Re: Hudi and GDPR (Databricks Delta)

Thanks for all these amazing inputs. To go further into the details: let's
say we have a Kafka + HDFS architecture (classical lambda) storing both
real-time and batch customer personal data such as names, and here comes
"john", who wants to be forgotten.

Now with our classical lambda architecture we are quite doomed, as HDFS and
Kafka both have trouble dealing with deletes.

The solution I am currently envisioning is the following:

- Make Kafka retention short (15 days or so) so I don't have to deal with
Kafka (I could also go for log compaction)
- The problem is then focused on HDFS:
    - Update all our processing jobs so that they maintain a key-value store
referencing all files in which a user appears (I now know where John is in
my lake)
    - Maintain a store holding all users asking to be forgotten (john
goes there).
    - Design a job that goes over my forgotten-users database, finds
the associated files and deletes the user's records from those files (now
john is no more)
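
The three HDFS steps above could be sketched roughly as follows. This is only
a toy, in-memory model in Python, not Hudi or HDFS code: `lake`, `user_files`,
`forgotten_users`, `build_user_index` and `forget_users` are hypothetical
names, and each "file" is just a list of records carrying a `user_id` field.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the lake: file path -> list of records.
lake = {
    "/data/orders/part-0001": [
        {"user_id": "john", "name": "John"},
        {"user_id": "mary", "name": "Mary"},
    ],
    "/data/clicks/part-0007": [
        {"user_id": "john", "page": "/home"},
    ],
    "/data/clicks/part-0008": [
        {"user_id": "mary", "page": "/cart"},
    ],
}


def build_user_index(files):
    """Step 1: key-value index mapping each user to the files they appear in."""
    index = defaultdict(set)
    for path, records in files.items():
        for record in records:
            index[record["user_id"]].add(path)
    return index


def forget_users(files, index, forgotten):
    """Step 3: rewrite only the files that contain a forgotten user."""
    touched = set()
    for user in forgotten:
        # pop() also removes the user's entry from the index itself.
        touched |= index.pop(user, set())
    for path in touched:
        files[path] = [r for r in files[path] if r["user_id"] not in forgotten]
    return sorted(touched)


user_files = build_user_index(lake)
forgotten_users = {"john"}  # Step 2: users who asked to be forgotten
rewritten = forget_users(lake, user_files, forgotten_users)
print(rewritten)
```

The point of the index is that only the files listed for john get rewritten;
files that never contained him are left untouched.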

The last part is quite troublesome with pure HDFS, so right now I'm
considering technologies that allow me to do it. Hive is leading (all files
are Parquet and I'm not sure I need the near-real-time paradigm, but I have
doubts about performance), but Hudi and Delta seem quite relevant too.

Regards,
Ivan

On Tue, Apr 16, 2019 at 13:58, Semantic Beeng <ni...@semanticbeeng.com>
wrote:

> Databricks Delta is a tool similar to Apache Hudi but commercial.
>
> This article about it claims it can be used for GDPR "right to be
> forgotten"
>
>
> https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
>
> "supports the MERGE command"
>
> "Databricks Delta also offers rollback capabilities with the time travel
> feature
> <https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html>
> "
>
> This is worth studying and comparing with Hudi.
>
> But the rollback and time travel capabilities (Hudi has this too, right?)
> are very useful.
>
>
> On April 16, 2019 at 7:33 AM Semantic Beeng <ni...@semanticbeeng.com>
> wrote:
>
> Hello Ivan,
>
> First please elaborate more precisely on the GDPR use case you are
> considering, how you see it implemented and where the technology becomes
> the issue.
>
> Some thoughts to consider:
>
> 1. Deleting could be realized by upserting over current data - after all,
> Hudi makes HDFS update-able in a sense. So maybe this is about proving that
> this "sense" is enough. After all, even if you delete files at OS level,
> data is still physically left on disk.
>
> 2. Structure the data so it can be easily made unusable from a GDPR point
> of view.
>
> After all this is about identifiable customer data.
>
> 3. See https://news.ycombinator.com/item?id=16366050 for more ideas.
>
> Let us iterate further.
>
> I will study some resources I found.
>
> You are bringing an important use case.
>
> At this time I would say the premise that the use of HDFS makes it
> impossible to be GDPR compliant does not sound right.
>
> Cheers
>
> Nick.
>
>
>
> On April 16, 2019 at 4:32 AM Ivan Panico < iv.panico@gmail.com> wrote:
>
>
> Hi,
>
> I'm currently facing the challenge of GDPR compliance on an HDFS cluster.
> The most troubling part is "the right to be forgotten", which customers
> can invoke. When invoked, this new right forces companies to delete all
> data related to that user. Since HDFS is WORM (write once, read many), you
> can see the issue.
>
> I reached out to the HDFS mailing list and people told me that HDFS is not
> a fit for my use case (but I can't see myself migrating everything to
> Kudu/HBase), but one person told me to check Hudi and it looks very
> promising. Hence, I wanted to know if my use case (deleting rows in HDFS
> datasets based on a user uuid) seems suitable for Hudi, as I think it is.
> Also, I'm really interested in any feedback from companies using this tool
> (in a production environment), as I'm wondering if it is production ready.
> I believe Uber does, but I'm not aware of anyone else.
>
> Thanks in advance for your help,
> best regards,
> Ivan
>
>
>
>

Re: Hudi and GDPR (Databricks Delta)

Posted by Vinoth Chandar <vi...@apache.org>.
Hello Ivan,

Thanks for writing in. Let me add more perspective on top of what Nick
shared already.

It is possible to hard delete data using Hudi, as long as the dataset is
written by Hudi. We do this at Uber, and it is one of the reasons for
building Hudi: to support mutations over big data.
You can see https://github.com/apache/incubator-hudi/pull/635, which, when
merged, will add a simple way to perform deletes by specifying the keys.
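
To make "deletes by specifying the keys" concrete, here is a minimal sketch
in plain Python. It is a toy model of a keyed dataset, not Hudi's API and not
what the pull request implements: `upsert` and `delete_keys` are hypothetical
helpers, and using `None` as a delete marker is an assumption made only for
this illustration. It shows a delete expressed as an upsert whose payload is
empty, so the key is physically absent from the rewritten data.

```python
def upsert(current, incoming):
    """Merge `incoming` into `current` by record key.

    Both arguments map record key -> payload. A payload of None is treated
    as a delete marker: the key is dropped from the rewritten data instead
    of being kept with blanked-out fields.
    """
    merged = dict(current)
    for key, payload in incoming.items():
        if payload is None:
            merged.pop(key, None)  # hard delete; unknown keys are ignored
        else:
            merged[key] = payload  # insert or update
    return merged


def delete_keys(current, keys):
    """Delete by key, expressed as an upsert of delete markers."""
    return upsert(current, {key: None for key in keys})


dataset = {"u1": {"name": "John"}, "u2": {"name": "Mary"}}
dataset = upsert(dataset, {"u2": {"name": "Mary B."}, "u3": {"name": "Ann"}})
dataset = delete_keys(dataset, ["u1", "missing"])
print(dataset)
```

The caller only supplies the keys to forget; locating and rewriting the
affected records is left to the write path.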

Happy to talk through any specific issues/concerns you have.

Thanks
Vinoth


