Posted to dev@parquet.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2020/08/04 06:19:48 UTC

[DISCUSS] Parquet data masking/anonymization

Hi all,

Now that the encryption mechanism is mostly complete, we are starting a
long-term project on a new security feature on top of encryption. Called
"data obfuscation", it combines masking and anonymization of sensitive
data.
https://issues.apache.org/jira/browse/PARQUET-1376

On the one hand, basic masking can be easily implemented on top of
Parquet by simply adding columns with masked (hashed, redacted, etc.)
versions of the original column data. On the other hand, if done
improperly, data masking can leak the sensitive information. For these
two reasons, we have decided not to rush it; this feature is not planned
for the upcoming Parquet versions. Following an initial discussion, we have
produced a write-up on the goals, challenges and possible approaches.
Before drafting the design, we are starting with a call to the community to
provide feedback on this write-up (e.g. via comments inside the doc). Any
real-life examples, use cases and requirements are very welcome.
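
To illustrate the leakage risk mentioned above, here is a minimal Python sketch, not part of the proposal itself. It assumes a hypothetical sensitive column of 4-digit PINs and compares an unsalted SHA-256 mask with a keyed HMAC-SHA256 mask. The function names, the sample values and the key are all made up for illustration.

```python
import hashlib
import hmac

# Hypothetical sensitive column with a small value space (4-digit PINs).
pins = ["0042", "1234", "9999"]


def naive_mask(value: str) -> str:
    """Unsalted SHA-256: deterministic and computable by anyone."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_mask(value: str, key: bytes) -> str:
    """HMAC-SHA256: recomputing a mask requires the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(masked_values, mask_fn):
    """Mask all 10,000 candidate PINs and look up each masked value."""
    lookup = {mask_fn(f"{i:04d}"): f"{i:04d}" for i in range(10_000)}
    return [lookup.get(m) for m in masked_values]


# Illustrative key only; a real deployment would need proper key management.
secret_key = b"illustrative-key-not-for-production"

naive_column = [naive_mask(p) for p in pins]
keyed_column = [keyed_mask(p, secret_key) for p in pins]

# The attacker can run the unkeyed hash directly...
recovered_naive = dictionary_attack(naive_column, naive_mask)
# ...but without the secret key can only guess at one.
recovered_keyed = dictionary_attack(
    keyed_column, lambda v: keyed_mask(v, b"attacker-guess")
)

print(recovered_naive)
print(recovered_keyed)
```

In this sketch the unsalted hashes are fully reversed by enumeration, while the keyed masks are not recovered without the key. Keyed hashing is only one possible mitigation; it still reveals which rows share the same value, so it is not a complete answer to the leakage problem.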

https://docs.google.com/document/d/1LMs74uhqvMNJacBySPnWq6tM8qIpgcIZz444c7vfibM/edit?usp=sharing


Cheers,
Gidon, Xinli, Shri

Re: [DISCUSS] Parquet data masking/anonymization

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Micah,

Yep, we've been asking ourselves the same question; this is one of the
reasons we are taking this slowly.
The general answer is that we want to spare users the need to implement
the masking mechanism (and the privacy leakage analysis tools) on their own.
The idea is to create a common set of open source tools that implement the
best practices in this field and benefit from the community's contributions
in terms of use case requirements, design improvements and bug fixes.
Also, if we manage to find a way to compress N masked versions of the same
column using an algorithm that produces (way) less than N times the bytes
of a single version, then we might want to integrate the obfuscation
feature deeper in the Parquet stack. But this is an advanced goal, TBD.
We'll proceed top-down, starting with an above-the-surface tool that can
convert a regular file into a file with additional columns (masked versions
of the sensitive columns). Then we'll explore doing the same just under the
surface, where a new file is written directly with masked columns added
automatically. We'll then see if we can/should go deeper.
The doc authors have motivating use cases in their respective
organizations. We do ask for additional use cases / requirements, and
general feedback.
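
As a rough picture of what an above-the-surface tool could do, here is a minimal Python sketch. It works on an in-memory table (a dict of column lists) as a stand-in for a Parquet file, and adds one masked column per entry in a masking policy. The helper names, the policy format and the column-naming scheme are assumptions made for illustration, not an existing Parquet API.

```python
import hashlib
import hmac


def redact(value: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` characters with '*'."""
    if len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]


def keyed_hash(value: str, key: bytes) -> str:
    """HMAC-SHA256 of the value under a secret key, as a hex string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def add_masked_columns(table: dict, policy: dict, key: bytes) -> dict:
    """Return a copy of `table` with one extra masked column per policy entry.

    `policy` maps a column name to "redact" or "hash" (hypothetical format).
    The original columns are kept unchanged.
    """
    out = {name: list(values) for name, values in table.items()}
    for column, method in policy.items():
        if method == "redact":
            out[f"{column}_redacted"] = [redact(v) for v in table[column]]
        elif method == "hash":
            out[f"{column}_hashed"] = [keyed_hash(v, key) for v in table[column]]
        else:
            raise ValueError(f"unknown masking method: {method}")
    return out


# Made-up sample data; the card numbers are well-known test values.
table = {
    "name": ["Ann", "Bob"],
    "card": ["4111111111111111", "5500000000000004"],
}
masked = add_masked_columns(
    table, {"card": "redact", "name": "hash"}, b"demo-key"
)
print(sorted(masked))
```

In a real tool the original sensitive columns would presumably be protected with Parquet encryption, so that readers without the column keys see only the masked versions; the sketch leaves that part out.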

Cheers, Gidon


On Sat, Aug 8, 2020 at 7:10 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Gidon,
> Was there prior discussion on this on the mailing list?  I left a comment
> on the document, but it isn't clear to me why this particular use-case
> needs to be part of the core Parquet library.
>
> Are there motivating use-cases that wouldn't be served by an external
> library/application level?
>
> Thanks,
> Micah

Re: [DISCUSS] Parquet data masking/anonymization

Posted by Micah Kornfield <em...@gmail.com>.
Hi Gidon,
Was there prior discussion on this on the mailing list?  I left a comment
on the document, but it isn't clear to me why this particular use-case
needs to be part of the core Parquet library.

Are there motivating use-cases that wouldn't be served by an external
library/application level?

Thanks,
Micah
