Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2020/08/06 06:25:43 UTC

Re: GDPR - Time Travel Query

Hi,

IIUC, you want the deletes to be applied to the older versions of the data
as well, so that no time-travel query can read the deleted field again?
I am afraid this cannot be achieved as-is today. One way to get there would
be to log these deletes against the older base files. This needs more
discussion, but the good thing is that Hudi's log-based design lends itself
to doing this. It's an interesting use case. Thanks for bringing this up!

As a workaround, would it be possible to split the snapshot and time-travel
queries into different tables for now? i.e. the time-travel table would be
insert-only, and you can use snapshot queries on it to achieve the effect of
time travel. Then, at a later time, you can just issue a delete to get rid
of the field from all versions of the record. Maybe this makes the time
travel more expensive, I guess?
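
To make the workaround a bit more concrete, here is a toy sketch in plain
Python. It is not Hudi code and uses no Hudi APIs; the class and method
names (InsertOnlyTable, as_of, delete_key) are made up for illustration.
It only models the idea: every update is appended as a new row, a plain
snapshot read filtered on a version timestamp stands in for time travel,
and one delete by record key removes the record from every version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str         # record key
    version_ts: int  # time at which this version was written
    value: str       # payload, e.g. the field that must be erasable


class InsertOnlyTable:
    """Toy stand-in for an insert-only table (hypothetical, not a Hudi API)."""

    def __init__(self):
        self.rows = []

    def insert(self, key, version_ts, value):
        # Updates are never applied in place; each one is a new row.
        self.rows.append(Row(key, version_ts, value))

    def as_of(self, ts):
        # Emulate time travel with a snapshot read: for each key, keep the
        # newest row whose version_ts is at or before ts.
        latest = {}
        for row in self.rows:
            if row.version_ts > ts:
                continue
            current = latest.get(row.key)
            if current is None or row.version_ts > current.version_ts:
                latest[row.key] = row
        return {key: row.value for key, row in latest.items()}

    def delete_key(self, key):
        # A single delete drops every stored version of the record.
        self.rows = [row for row in self.rows if row.key != key]


table = InsertOnlyTable()
table.insert("user1", 1, "a@example.com")
table.insert("user2", 1, "b@example.com")
table.insert("user1", 2, "a-new@example.com")

before = table.as_of(1)   # both users visible as of ts=1
table.delete_key("user1")
after = table.as_of(1)    # user1 is gone from the ts=1 view as well
print(before)
print(after)
```

The cost shows up in as_of: every "time travel" read has to scan and
de-duplicate versions, which is the extra expense mentioned above.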


On Thu, Jul 30, 2020 at 6:08 AM Sivaprakash <si...@gmail.com>
wrote:

> Hello
>
> What I see is: if we want to implement GDPR (
>
> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIdeleterecordsinthedatasetusingHudi
> )
> then old versions of commit files should be removed (otherwise an
> incremental query with point-in-time options can still read data that was
> deleted at a later stage). Is time travel query no longer possible if we
> want to implement GDPR? Are there any configurations/options to delete
> only specific records in the older commit files instead of removing the
> whole file?
>
> Thanks
>