Posted to commits@hudi.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2021/08/25 09:06:00 UTC

[jira] [Updated] (HUDI-1212) GDPR: Support deletions of records on all versions of Hudi dataset

     [ https://issues.apache.org/jira/browse/HUDI-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated HUDI-1212:
--------------------------------
    Fix Version/s:     (was: 0.9.0)
                   0.10.0

> GDPR: Support deletions of records on all versions of Hudi dataset
> ------------------------------------------------------------------
>
>                 Key: HUDI-1212
>                 URL: https://issues.apache.org/jira/browse/HUDI-1212
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Incremental Pull, Writer Core
>    Affects Versions: 0.9.0
>            Reporter: Balaji Varadarajan
>            Priority: Major
>             Fix For: 0.10.0
>
>
> Incremental Pull should also stop returning records from the historical dataset once they have been deleted from the latest snapshot.
>  
> Context from the mailing list email:
>  
> Hello,
> I am Siva's colleague and I am working on the problem below as well.
> I would like to describe what we are trying to achieve with Hudi, as well as our current way of working and our GDPR and "Right To Be Forgotten" (RTBF) compliance policies.
> Our requirements:
> - We wish to apply a strict interpretation of the RTBF. In other words, when we remove a person's data, it should be removed throughout the historical data and not just from the latest snapshot.
> - We wish to use Hudi to reduce our storage requirements using upserts and don't want to have duplicates between commits.
> - We wish to retain history for persons who have not requested to be forgotten and therefore we do not want to delete commit files from the history as some have proposed.
> We have tried a couple of solutions, but so far without success:
> - Replay the data, omitting the records of the persons who have requested to be forgotten. We wanted to manipulate the commit times to rebuild the history.
> We found that we couldn't manipulate the commit times and still retain the history.
> - Replay the data, omitting the records of the persons who have requested to be forgotten, but write to a date-based partition folder using the "partitionpath" parameter.
> We found that upserts across the partitionpath folders do not skip data that is unchanged between two commit dates, as they do when using the default commit file system, so this technique would neither save storage nor speed up our processing.
> In short, we would like to find a way to apply a strict interpretation of the RTBF under GDPR, maintain history and time travel (over a large history), and save storage space using Hudi.
> Can anyone see a way to achieve this?
> Kind Regards,
> David Rosalia
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)