Posted to dev@parquet.apache.org by Machiel Groeneveld <ma...@gmail.com> on 2017/11/07 09:11:09 UTC

GDPR requirements

Hi,

The upcoming EU-wide law GDPR requires companies to remove data collected
from consumers on request. I'm exploring the options for our Parquet
tables.

I don't see any support for mutating Parquet files. If it's not there, is
it possible to add that?

I wonder if anyone has any knowledge of how a deletion could be processed
in the Parquet world. Of course, there is the option to sift through
billions of records and recreate all our tables for each deletion request,
but I'm hoping for a more efficient method. Perhaps a delete flag could be
added to the format, or is there a way to zero out existing data?

At some point, all companies storing data of EU citizens will need to have
an answer to this. Simply locking the data behind more restrictions is not
an option; the data should be erased. Companies are already looking into
ways to delete data from tape backups; the law is that far-reaching.

Re: GDPR requirements

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Parquet files don't support mutation. If you want to remove records, you
have to rewrite the file and filter out the records you don't want. I think
this is probably better for compliance because delete markers don't really
delete the data or make the data inaccessible.

I'm not sure what else the format could provide here. Maybe making it
easier to delete a column would be sufficient? If you had data that wasn't
tied to a person except for some ID column, you could anonymize by removing
that column without re-encoding the rest of the data (though this would
require rewriting the file). That wouldn't be too difficult to do, but
unfortunately requires planning ahead to know what columns you can delete
to reach compliance. Another idea here is to replace the ID column with
hash(ID) so you'd still have relationships, but no information to tie rows
to individuals.
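
A minimal sketch of the hash(ID) idea, using only the Python standard
library. The function names and the "user_id" field are hypothetical. One
assumption goes beyond what is described above: the sketch uses a keyed
hash (HMAC-SHA256) instead of a bare hash, since a bare hash of a small ID
space can be reversed by hashing every candidate ID.

```python
import hashlib
import hmac


def pseudonymize_id(raw_id, key):
    """Return a hex digest that stands in for raw_id (keyed SHA-256)."""
    return hmac.new(key, str(raw_id).encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_rows(rows, id_field, key):
    """Yield copies of rows (dicts) with id_field replaced by its digest."""
    for row in rows:
        masked = dict(row)  # copy so the caller's rows are left untouched
        masked[id_field] = pseudonymize_id(row[id_field], key)
        yield masked
```

Rows with the same ID map to the same digest, so joins and group-bys on the
column still work. Whoever holds the key can recompute the mapping, so the
key (an assumption of this sketch) would need to be discarded or tightly
controlled.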

rb

-- 
Ryan Blue
Software Engineer
Netflix