Posted to dev@parquet.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2017/12/20 15:32:56 UTC

Parquet modular encryption

We are working on frameworks that perform secure analytics on encrypted
data. The analytic engine runs in a secured environment, but the data is
kept in untrusted storage. This could be public cloud storage or anything
else; the main requirement is to store/retrieve the data in encrypted
form only. The storage admin should never have the data key. The data
should be decrypted only at the end point (analytic engine), never in the
storage.

This obviously impacts the performance of Parquet selective reads. If a
Parquet file is bulk-encrypted in the storage, it becomes impossible to
extract its footer, retrieve a column subset, a few pages, etc. The file
must be fully delivered from storage to the engine location, decrypted
there, and then processed.
Moreover, even if the storage is trusted, it still has to fully decrypt
the file before parsing it and extracting select columns/pages.

I've searched for available solutions to this problem, but haven't found
any (do let me know if I've missed anything!)

So I have developed a basic Parquet implementation that performs separate
encryption of each header and page. It is fully functional, and allows
retrieving only the required data pieces, while keeping the Parquet file
encrypted in the storage. Actually, it doesn't require deep changes in
Parquet code, since it builds on the existing Thrift and compression
mechanisms. It's also not intrusive, in the sense that if encryption is not
used, the new code is bypassed with a number of 'ifs', so the existing
apps and tests continue to run unaffected.
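
To illustrate the idea, here is a minimal sketch (not the actual
implementation). Each module (here one page per column, plus a footer that
records page offsets) is encrypted on its own, so a reader can decrypt the
footer, locate one column, and decrypt only that byte range. The function
names and the file layout are hypothetical, and the hash-based XOR
keystream is only a stand-in for a real cipher such as AES-GCM; it is not
secure and must not be used for real data.

```python
import hashlib
import json
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_module(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt one module (header, page or footer) independently of the others."""
    nonce = os.urandom(12)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def decrypt_module(key: bytes, blob: bytes) -> bytes:
    """Inverse of encrypt_module: the first 12 bytes are the nonce."""
    nonce, body = blob[:12], blob[12:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def write_file(key: bytes, columns: dict) -> tuple:
    """Encrypt each column page separately; the footer maps name -> (offset, length)."""
    data = bytearray()
    index = {}
    for name, page in columns.items():
        blob = encrypt_module(key, page)
        index[name] = (len(data), len(blob))
        data += blob
    footer = encrypt_module(key, json.dumps(index).encode())
    return bytes(data), footer


def read_column(key: bytes, data: bytes, footer: bytes, name: str) -> bytes:
    """Decrypt the footer, then fetch and decrypt only the requested page."""
    index = json.loads(decrypt_module(key, footer))
    offset, length = index[name]
    return decrypt_module(key, data[offset:offset + length])


if __name__ == "__main__":
    key = os.urandom(16)
    data, footer = write_file(key, {"a": b"1,2,3", "b": b"x,y,z"})
    # Only the footer and the bytes of column "b" are decrypted here.
    assert read_column(key, data, footer, "b") == b"x,y,z"
```

With bulk encryption, `read_column` would need the whole file decrypted
first; here the storage only ever sees ciphertext, yet the reader touches
just two modules.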

It's still raw code, not quite ready yet for upstreaming. Unless you guys
tell me this is pointless :), I'll start on preparing it for a pull request.



Regards,
Gidon

Re: Parquet modular encryption

Posted by Gidon Gershinsky <GI...@il.ibm.com>.
The first PR in the encryption series has been sent (to parquet-format);
please review.



Regards, 
Gidon






Re: Parquet modular encryption

Posted by Gidon Gershinsky <GI...@il.ibm.com>.
A status update: I'm working on a (Java) implementation and testing of
this function; the code should be ready for a pull request in a couple of
weeks.



Regards, 
Gidon







From:   "Gidon Gershinsky" <GI...@il.ibm.com>
To:     dev@parquet.apache.org
Date:   01/02/2018 09:18 AM
Subject:        Re: Parquet modular encryption



Following the Parquet sync call this week, I've updated the design
document to focus on a Phase 1 version of this mechanism (single key,
per-column encryption opt in/out), as we discussed.
Thanks for the constructive comments and suggestions. Feel free to leave
additional comments in the document.



Regards, 
Gidon







Re: Parquet modular encryption

Posted by Gidon Gershinsky <GI...@il.ibm.com>.
Following the Parquet sync call this week, I've updated the design
document to focus on a Phase 1 version of this mechanism (single key,
per-column encryption opt in/out), as we discussed.
Thanks for the constructive comments and suggestions. Feel free to leave
additional comments in the document.



Regards, 
Gidon




Re: Parquet modular encryption

Posted by Gidon Gershinsky <GI...@il.ibm.com>.
I have posted a link to the design draft in the Jira.
All comments are welcome.
Thanks to Julien for the initial feedback and suggestions during our chat
on my SF trip last week.

The design document is relatively detailed; at 8 pages, it's actually
longer than the code used to implement it :).
Still, the code can be split into multiple pull requests, to enable a
staged implementation of this mechanism.

Is there a Parquet call next week? I'd be glad to join and discuss this 
with the community.



Regards, 
Gidon







From:   Gidon Gershinsky <gg...@gmail.com>
To:     dev@parquet.apache.org, rblue@netflix.com
Date:   20/12/2017 08:13 PM
Subject:        Re: Parquet modular encryption



Hi Ryan,

Certainly, I'd be glad to draft a design doc, sounds like a good idea.
Could you assign me to PARQUET-1178? I'll pin the doc link there.

I've seen a brief discussion on creating an 'encrypting compressor', but
indeed for data pages only.
My implementation encrypts pages (data and dictionary), headers and the
file footer. Also, I don't use a separate compressor for encryption; my
code works with any compression supported in Parquet.

Regards, Gidon.





Re: Parquet modular encryption

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Ryan,

Certainly, I'd be glad to draft a design doc, sounds like a good idea.
Could you assign me to PARQUET-1178? I'll pin the doc link there.

I've seen a brief discussion on creating an 'encrypting compressor', but
indeed for data pages only.
My implementation encrypts pages (data and dictionary), headers and the
file footer. Also, I don't use a separate compressor for encryption; my
code works with any compression supported in Parquet.
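
A minimal sketch of how encryption can stay independent of the codec,
assuming each page is compressed first and the compressed bytes are then
encrypted. The ordering, the function names and the toy cipher are
assumptions made for illustration, not the actual code. The standard
library codecs stand in for the ones Parquet supports, and the XOR
keystream stands in for a real cipher such as AES-GCM (it is not secure).

```python
import gzip
import hashlib
import os
import zlib


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Symmetric toy XOR cipher: applying it twice restores the input."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal_page(key: bytes, page: bytes, compress) -> bytes:
    """Compress with whatever codec is passed in, then encrypt the result."""
    nonce = os.urandom(12)
    return nonce + toy_cipher(key, nonce, compress(page))


def open_page(key: bytes, blob: bytes, decompress) -> bytes:
    """Decrypt first, then hand the bytes to the matching decompressor."""
    nonce, body = blob[:12], blob[12:]
    return decompress(toy_cipher(key, nonce, body))


# The encryption step never inspects the codec; it only sees bytes.
CODECS = {
    "uncompressed": (lambda b: b, lambda b: b),
    "zlib": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
}

if __name__ == "__main__":
    key = os.urandom(16)
    page = b"value," * 100
    for name, (compress, decompress) in CODECS.items():
        assert open_page(key, seal_page(key, page, compress), decompress) == page
```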

Regards, Gidon.

On Wed, Dec 20, 2017 at 7:12 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Hi Gidon,
>
> Thanks for working on this. People have talked about using this approach
> for page data in the past, but I haven't seen an implementation of it. You
> encrypt headers as well to make sure column stats are not stored in plain
> text?
>
> I think it would be helpful if you wrote up a small doc on your changes,
> like the design doc for column indexes
> <https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#>.
> That way, we can discuss it in comments to make sure that you didn't miss
> any structures and validate the approach. Would you be willing to do that?
> Then we could figure out what we need to add to the Parquet spec to make
> this portable.
>
> Thanks!
>
> rb

Re: Parquet modular encryption

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Gidon,

Thanks for working on this. People have talked about using this approach
for page data in the past, but I haven't seen an implementation of it. You
encrypt headers as well to make sure column stats are not stored in plain
text?

I think it would be helpful if you wrote up a small doc on your changes,
like the design doc for column indexes
<https://docs.google.com/document/d/1sBACp8Lbutuj1Zxdowvsrlm8ku4BFxf8U_Do5K2wSO4/edit#>.
That way, we can discuss it in comments to make sure that you didn't miss
any structures and validate the approach. Would you be willing to do that?
Then we could figure out what we need to add to the Parquet spec to make
this portable.

Thanks!

rb

On Wed, Dec 20, 2017 at 7:46 AM, Gidon Gershinsky <GI...@il.ibm.com> wrote:

> 'Hi' is missing in the message, due to a faulty copy/paste, sorry about
> that :)
>
> And of course, all comments are most welcome.
>
>
>
> Regards,
> Gidon


-- 
Ryan Blue
Software Engineer
Netflix

Re: Parquet modular encryption

Posted by Gidon Gershinsky <GI...@il.ibm.com>.
'Hi' is missing in the message, due to a faulty copy/paste, sorry about 
that :)

And of course, all comments are most welcome.



Regards, 
Gidon






