You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2018/10/16 08:11:01 UTC

[VOTE] Modular Encryption design sign-off

Hello Parquet developers,

Per the last sync discussion, it is time to call for a vote on the Parquet
Modular Encryption design sign-off. The design doc can be found at the
encryption branch of the parquet-format repository,
https://github.com/apache/parquet-format/blob/encryption/Encryption.md.

The design is stable by now. This work had started 10 months ago, has been
extensively reviewed -  and implemented (in Java, partially in C++), by a
number of folks from different companies. To continue with the
implementation pull requests, we need the design to be formally signed off
by the community.

Cheers, Gidon

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Ryan, Zoltan, Wes,

Thank you for reviewing the encryption spec and voting for it. The last
round of design updates had been complete in early December, so all 3 votes
refer to the current version.

Now that the spec has been signed off, we are getting back to the pull
requests. While the implementation of this mechanism is indeed a
significant amount of work, fortunately we don't start from the scratch.
Much of the Java code is already implemented, we'll need to clean it up,
add column index encryption and update the pull requests. The C++ code is
also in a good shape, though its implementation is based on an older design
version, with a larger gap to close.

The first step would be to merge the parquet-format pull requests - #114
(the spec) and #124 (the parquet.thrift updates for encryption). Once
this is done, we'll proceed to parquet-mr and arrow/cpp pull requests.


Cheers, Gidon.

On Wed, Jan 16, 2019, 09:39 Ryan Blue <rblue@netflix.com.invalid wrote:

> This vote passes, with at least 3 binding +1 votes.
>
> Some votes are from before we made significant changes, so I'm not sure
> which ones to count. Regardless, thank you to everyone that took the time
> to review the proposal.
>
> And thanks to Gidon for taking the lead and getting it done!
>
> rb
>
> On Wed, Jan 16, 2019 at 9:15 AM Wes McKinney <we...@gmail.com> wrote:
>
> > +1 (binding). Sorry for the delay in reviewing. There are a lot of
> > moving parts here and different configurations (same key for footer
> > and columns, different keys, plaintext footer, etc.) and so the
> > implementations are justifiably going to be a lot of work.
> >
> > @Tham, per your comment
> >
> > > We are going to release our product with encrypted parquet soon
> >
> > I haven't seen the C++ encryption PR updated in a while. Have you been
> > doing development on this somewhere else?
> >
> > On Tue, Jan 15, 2019 at 8:37 PM tham@emotiv.com <th...@emotiv.com> wrote:
> > >
> > > Hi all,
> > >
> > > I’m working on C++ parquet encryption module and my project at Emotiv
> is
> > currently using it in development phase. We are going to release our
> > product with encrypted parquet soon. So I’ll appreciate a lot if we can
> > process voting procedure on crypto specification as soon as possible,
> then
> > we can continue with implementation and Java-C++ interop testing.
> > >
> > > Many thanks,
> > > Tham
> > >
> > > On 2019/01/08 17:08:01, Xinli shang <sh...@uber.com.INVALID> wrote:
> > > > Hi all,
> > > >
> > > > Can we have one more binding votes to sign off the encryption design
> > to let
> > > > us move on to implementation?  Or, if you think 2 current binding
> > votes can
> > > > also be considered as sign off in this particular case, given this
> > design
> > > > has been extensively discussed in a year and we have tested it in our
> > > > staging environment with privates, please let us know.
> > > >
> > > > Once we are unblocked, we will work on the implementation and
> > integration
> > > > with our security ecosystem. Maybe someday, we can show the community
> > how
> > > > Parquet encryption can work in large scale production. I would be
> > really
> > > > appreciated if we can be unblocked.
> > > >
> > > > Xinli
> > > >
> > > > On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > This is to remind that we have 2 binding votes - and need one more
> > in order
> > > > > to sign off the encryption.
> > > > >
> > > > > Cheers, Gidon.
> > > > >
> > > > > On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi
> > <zi...@cloudera.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > Thanks for your efforts, Gidon!
> > > > > >
> > > > > > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue
> > <rb...@netflix.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Thanks for all your work on this, Gidon!
> > > > > > >
> > > > > > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <
> > gg5070@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hello Parquet developers,
> > > > > > > >
> > > > > > > > After a productive round of discussions, we have a new
> enhanced
> > > > > version
> > > > > > > of
> > > > > > > > the design document. You can find it here:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> >
> https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > > > > > >
> > > > > > > > This is a call for vote (either new, or resumption of the
> > previous
> > > > > one.
> > > > > > > If
> > > > > > > > you have voted +1 before, feel free to do that again :).
> > > > > > > >
> > > > > > > > Cheers, Gidon
> > > > > > > >
> > > > > > > > ---------- Forwarded message ---------
> > > > > > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > > > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > > > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > > > > > To: <de...@parquet.apache.org>
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello Parquet developers,
> > > > > > > >
> > > > > > > > Per the last sync discussion, it is time to call for a vote
> on
> > the
> > > > > > > Parquet
> > > > > > > > Modular Encryption design sign-off. The design doc can be
> > found at
> > > > > the
> > > > > > > > encryption branch of the parquet-format repository,
> > > > > > > >
> > > > >
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > > > > .
> > > > > > > >
> > > > > > > > The design is stable by now. This work had started 10 months
> > ago, has
> > > > > > > been
> > > > > > > > extensively reviewed -  and implemented (in Java, partially
> in
> > C++),
> > > > > > by a
> > > > > > > > number of folks from different companies. To continue with
> the
> > > > > > > > implementation pull requests, we need the design to be
> formally
> > > > > signed
> > > > > > > off
> > > > > > > > by the community.
> > > > > > > >
> > > > > > > > Cheers, Gidon
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Xinli Shang
> > > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
This vote passes, with at least 3 binding +1 votes.

Some votes are from before we made significant changes, so I'm not sure
which ones to count. Regardless, thank you to everyone that took the time
to review the proposal.

And thanks to Gidon for taking the lead and getting it done!

rb

On Wed, Jan 16, 2019 at 9:15 AM Wes McKinney <we...@gmail.com> wrote:

> +1 (binding). Sorry for the delay in reviewing. There are a lot of
> moving parts here and different configurations (same key for footer
> and columns, different keys, plaintext footer, etc.) and so the
> implementations are justifiably going to be a lot of work.
>
> @Tham, per your comment
>
> > We are going to release our product with encrypted parquet soon
>
> I haven't seen the C++ encryption PR updated in a while. Have you been
> doing development on this somewhere else?
>
> On Tue, Jan 15, 2019 at 8:37 PM tham@emotiv.com <th...@emotiv.com> wrote:
> >
> > Hi all,
> >
> > I’m working on C++ parquet encryption module and my project at Emotiv is
> currently using it in development phase. We are going to release our
> product with encrypted parquet soon. So I’ll appreciate a lot if we can
> process voting procedure on crypto specification as soon as possible, then
> we can continue with implementation and Java-C++ interop testing.
> >
> > Many thanks,
> > Tham
> >
> > On 2019/01/08 17:08:01, Xinli shang <sh...@uber.com.INVALID> wrote:
> > > Hi all,
> > >
> > > Can we have one more binding votes to sign off the encryption design
> to let
> > > us move on to implementation?  Or, if you think 2 current binding
> votes can
> > > also be considered as sign off in this particular case, given this
> design
> > > has been extensively discussed in a year and we have tested it in our
> > > staging environment with privates, please let us know.
> > >
> > > Once we are unblocked, we will work on the implementation and
> integration
> > > with our security ecosystem. Maybe someday, we can show the community
> how
> > > Parquet encryption can work in large scale production. I would be
> really
> > > appreciated if we can be unblocked.
> > >
> > > Xinli
> > >
> > > On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is to remind that we have 2 binding votes - and need one more
> in order
> > > > to sign off the encryption.
> > > >
> > > > Cheers, Gidon.
> > > >
> > > > On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi
> <zi...@cloudera.com.invalid>
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Thanks for your efforts, Gidon!
> > > > >
> > > > > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue
> <rb...@netflix.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > Thanks for all your work on this, Gidon!
> > > > > >
> > > > > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <
> gg5070@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hello Parquet developers,
> > > > > > >
> > > > > > > After a productive round of discussions, we have a new enhanced
> > > > version
> > > > > > of
> > > > > > > the design document. You can find it here:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > > > > >
> > > > > > > This is a call for vote (either new, or resumption of the
> previous
> > > > one.
> > > > > > If
> > > > > > > you have voted +1 before, feel free to do that again :).
> > > > > > >
> > > > > > > Cheers, Gidon
> > > > > > >
> > > > > > > ---------- Forwarded message ---------
> > > > > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > > > > To: <de...@parquet.apache.org>
> > > > > > >
> > > > > > >
> > > > > > > Hello Parquet developers,
> > > > > > >
> > > > > > > Per the last sync discussion, it is time to call for a vote on
> the
> > > > > > Parquet
> > > > > > > Modular Encryption design sign-off. The design doc can be
> found at
> > > > the
> > > > > > > encryption branch of the parquet-format repository,
> > > > > > >
> > > >
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > > > .
> > > > > > >
> > > > > > > The design is stable by now. This work had started 10 months
> ago, has
> > > > > > been
> > > > > > > extensively reviewed -  and implemented (in Java, partially in
> C++),
> > > > > by a
> > > > > > > number of folks from different companies. To continue with the
> > > > > > > implementation pull requests, we need the design to be formally
> > > > signed
> > > > > > off
> > > > > > > by the community.
> > > > > > >
> > > > > > > Cheers, Gidon
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Xinli Shang
> > >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Wes McKinney <we...@gmail.com>.
hi Tham -- I created https://issues.apache.org/jira/browse/ARROW-4302
about adding OpenSSL to the build toolchain. Let's discuss there

On Wed, Jan 16, 2019 at 8:33 PM tham <th...@emotiv.com> wrote:
>
> Hi all,
>
> Thank you all for very quick process. I really appriciated that.
>
> > I haven't seen the C++ encryption PR updated in a while. Have you been
> doing development on this somewhere else?
> Sorry, I’ve been working silently in our own repo, but still some small fixes. I’m waiting for review process, then Gidon and I will discuss about detailed implementation for the update.
>
> Moreover, I have some difficulties in update CMakeList about adding flexible OpenSSL thirdparty. It will be great if someone who has experiences with Cmake can help.
>
> Thanks,
> Tham
>
> From: Wes McKinney
> Sent: Thursday, January 17, 2019 12:15 AM
> To: Parquet Dev
> Subject: Re: [VOTE] Modular Encryption design sign-off
>
> +1 (binding). Sorry for the delay in reviewing. There are a lot of
> moving parts here and different configurations (same key for footer
> and columns, different keys, plaintext footer, etc.) and so the
> implementations are justifiably going to be a lot of work.
>
> @Tham, per your comment
>
> > We are going to release our product with encrypted parquet soon
>
> I haven't seen the C++ encryption PR updated in a while. Have you been
> doing development on this somewhere else?
>
> On Tue, Jan 15, 2019 at 8:37 PM tham@emotiv.com <th...@emotiv.com> wrote:
> >
> > Hi all,
> >
> > I’m working on C++ parquet encryption module and my project at Emotiv is currently using it in development phase. We are going to release our product with encrypted parquet soon. So I’ll appreciate a lot if we can process voting procedure on crypto specification as soon as possible, then we can continue with implementation and Java-C++ interop testing.
> >
> > Many thanks,
> > Tham
> >
> > On 2019/01/08 17:08:01, Xinli shang <sh...@uber.com.INVALID> wrote:
> > > Hi all,
> > >
> > > Can we have one more binding votes to sign off the encryption design to let
> > > us move on to implementation?  Or, if you think 2 current binding votes can
> > > also be considered as sign off in this particular case, given this design
> > > has been extensively discussed in a year and we have tested it in our
> > > staging environment with privates, please let us know.
> > >
> > > Once we are unblocked, we will work on the implementation and integration
> > > with our security ecosystem. Maybe someday, we can show the community how
> > > Parquet encryption can work in large scale production. I would be really
> > > appreciated if we can be unblocked.
> > >
> > > Xinli
> > >
> > > On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is to remind that we have 2 binding votes - and need one more in order
> > > > to sign off the encryption.
> > > >
> > > > Cheers, Gidon.
> > > >
> > > > On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Thanks for your efforts, Gidon!
> > > > >
> > > > > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > Thanks for all your work on this, Gidon!
> > > > > >
> > > > > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hello Parquet developers,
> > > > > > >
> > > > > > > After a productive round of discussions, we have a new enhanced
> > > > version
> > > > > > of
> > > > > > > the design document. You can find it here:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > > > > >
> > > > > > > This is a call for vote (either new, or resumption of the previous
> > > > one.
> > > > > > If
> > > > > > > you have voted +1 before, feel free to do that again :).
> > > > > > >
> > > > > > > Cheers, Gidon
> > > > > > >
> > > > > > > ---------- Forwarded message ---------
> > > > > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > > > > To: <de...@parquet.apache.org>
> > > > > > >
> > > > > > >
> > > > > > > Hello Parquet developers,
> > > > > > >
> > > > > > > Per the last sync discussion, it is time to call for a vote on the
> > > > > > Parquet
> > > > > > > Modular Encryption design sign-off. The design doc can be found at
> > > > the
> > > > > > > encryption branch of the parquet-format repository,
> > > > > > >
> > > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > > > .
> > > > > > >
> > > > > > > The design is stable by now. This work had started 10 months ago, has
> > > > > > been
> > > > > > > extensively reviewed -  and implemented (in Java, partially in C++),
> > > > > by a
> > > > > > > number of folks from different companies. To continue with the
> > > > > > > implementation pull requests, we need the design to be formally
> > > > signed
> > > > > > off
> > > > > > > by the community.
> > > > > > >
> > > > > > > Cheers, Gidon
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Xinli Shang
> > >
>

RE: [VOTE] Modular Encryption design sign-off

Posted by tham <th...@emotiv.com>.
Hi all,

Thank you all for very quick process. I really appriciated that.

> I haven't seen the C++ encryption PR updated in a while. Have you been
doing development on this somewhere else?
Sorry, I’ve been working silently in our own repo, but still some small fixes. I’m waiting for review process, then Gidon and I will discuss about detailed implementation for the update.

Moreover, I have some difficulties in update CMakeList about adding flexible OpenSSL thirdparty. It will be great if someone who has experiences with Cmake can help.

Thanks,
Tham

From: Wes McKinney
Sent: Thursday, January 17, 2019 12:15 AM
To: Parquet Dev
Subject: Re: [VOTE] Modular Encryption design sign-off

+1 (binding). Sorry for the delay in reviewing. There are a lot of
moving parts here and different configurations (same key for footer
and columns, different keys, plaintext footer, etc.) and so the
implementations are justifiably going to be a lot of work.

@Tham, per your comment

> We are going to release our product with encrypted parquet soon

I haven't seen the C++ encryption PR updated in a while. Have you been
doing development on this somewhere else?

On Tue, Jan 15, 2019 at 8:37 PM tham@emotiv.com <th...@emotiv.com> wrote:
>
> Hi all,
>
> I’m working on C++ parquet encryption module and my project at Emotiv is currently using it in development phase. We are going to release our product with encrypted parquet soon. So I’ll appreciate a lot if we can process voting procedure on crypto specification as soon as possible, then we can continue with implementation and Java-C++ interop testing.
>
> Many thanks,
> Tham
>
> On 2019/01/08 17:08:01, Xinli shang <sh...@uber.com.INVALID> wrote:
> > Hi all,
> >
> > Can we have one more binding votes to sign off the encryption design to let
> > us move on to implementation?  Or, if you think 2 current binding votes can
> > also be considered as sign off in this particular case, given this design
> > has been extensively discussed in a year and we have tested it in our
> > staging environment with privates, please let us know.
> >
> > Once we are unblocked, we will work on the implementation and integration
> > with our security ecosystem. Maybe someday, we can show the community how
> > Parquet encryption can work in large scale production. I would be really
> > appreciated if we can be unblocked.
> >
> > Xinli
> >
> > On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > This is to remind that we have 2 binding votes - and need one more in order
> > > to sign off the encryption.
> > >
> > > Cheers, Gidon.
> > >
> > > On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Thanks for your efforts, Gidon!
> > > >
> > > > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid>
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Thanks for all your work on this, Gidon!
> > > > >
> > > > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hello Parquet developers,
> > > > > >
> > > > > > After a productive round of discussions, we have a new enhanced
> > > version
> > > > > of
> > > > > > the design document. You can find it here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > > https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > > > >
> > > > > > This is a call for vote (either new, or resumption of the previous
> > > one.
> > > > > If
> > > > > > you have voted +1 before, feel free to do that again :).
> > > > > >
> > > > > > Cheers, Gidon
> > > > > >
> > > > > > ---------- Forwarded message ---------
> > > > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > > > To: <de...@parquet.apache.org>
> > > > > >
> > > > > >
> > > > > > Hello Parquet developers,
> > > > > >
> > > > > > Per the last sync discussion, it is time to call for a vote on the
> > > > > Parquet
> > > > > > Modular Encryption design sign-off. The design doc can be found at
> > > the
> > > > > > encryption branch of the parquet-format repository,
> > > > > >
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > > .
> > > > > >
> > > > > > The design is stable by now. This work had started 10 months ago, has
> > > > > been
> > > > > > extensively reviewed -  and implemented (in Java, partially in C++),
> > > > by a
> > > > > > number of folks from different companies. To continue with the
> > > > > > implementation pull requests, we need the design to be formally
> > > signed
> > > > > off
> > > > > > by the community.
> > > > > >
> > > > > > Cheers, Gidon
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > >
> >
> >
> > --
> > Xinli Shang
> >


Re: [VOTE] Modular Encryption design sign-off

Posted by Wes McKinney <we...@gmail.com>.
+1 (binding). Sorry for the delay in reviewing. There are a lot of
moving parts here and different configurations (same key for footer
and columns, different keys, plaintext footer, etc.) and so the
implementations are justifiably going to be a lot of work.

@Tham, per your comment

> We are going to release our product with encrypted parquet soon

I haven't seen the C++ encryption PR updated in a while. Have you been
doing development on this somewhere else?

On Tue, Jan 15, 2019 at 8:37 PM tham@emotiv.com <th...@emotiv.com> wrote:
>
> Hi all,
>
> I’m working on C++ parquet encryption module and my project at Emotiv is currently using it in development phase. We are going to release our product with encrypted parquet soon. So I’ll appreciate a lot if we can process voting procedure on crypto specification as soon as possible, then we can continue with implementation and Java-C++ interop testing.
>
> Many thanks,
> Tham
>
> On 2019/01/08 17:08:01, Xinli shang <sh...@uber.com.INVALID> wrote:
> > Hi all,
> >
> > Can we have one more binding votes to sign off the encryption design to let
> > us move on to implementation?  Or, if you think 2 current binding votes can
> > also be considered as sign off in this particular case, given this design
> > has been extensively discussed in a year and we have tested it in our
> > staging environment with privates, please let us know.
> >
> > Once we are unblocked, we will work on the implementation and integration
> > with our security ecosystem. Maybe someday, we can show the community how
> > Parquet encryption can work in large scale production. I would be really
> > appreciated if we can be unblocked.
> >
> > Xinli
> >
> > On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > This is to remind that we have 2 binding votes - and need one more in order
> > > to sign off the encryption.
> > >
> > > Cheers, Gidon.
> > >
> > > On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Thanks for your efforts, Gidon!
> > > >
> > > > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid>
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Thanks for all your work on this, Gidon!
> > > > >
> > > > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hello Parquet developers,
> > > > > >
> > > > > > After a productive round of discussions, we have a new enhanced
> > > version
> > > > > of
> > > > > > the design document. You can find it here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > > https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > > > >
> > > > > > This is a call for vote (either new, or resumption of the previous
> > > one.
> > > > > If
> > > > > > you have voted +1 before, feel free to do that again :).
> > > > > >
> > > > > > Cheers, Gidon
> > > > > >
> > > > > > ---------- Forwarded message ---------
> > > > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > > > To: <de...@parquet.apache.org>
> > > > > >
> > > > > >
> > > > > > Hello Parquet developers,
> > > > > >
> > > > > > Per the last sync discussion, it is time to call for a vote on the
> > > > > Parquet
> > > > > > Modular Encryption design sign-off. The design doc can be found at
> > > the
> > > > > > encryption branch of the parquet-format repository,
> > > > > >
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > > .
> > > > > >
> > > > > > The design is stable by now. This work had started 10 months ago, has
> > > > > been
> > > > > > extensively reviewed -  and implemented (in Java, partially in C++),
> > > > by a
> > > > > > number of folks from different companies. To continue with the
> > > > > > implementation pull requests, we need the design to be formally
> > > signed
> > > > > off
> > > > > > by the community.
> > > > > >
> > > > > > Cheers, Gidon
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > >
> >
> >
> > --
> > Xinli Shang
> >

Re: [VOTE] Modular Encryption design sign-off

Posted by th...@emotiv.com, th...@emotiv.com.
Hi all,
 
I’m working on C++ parquet encryption module and my project at Emotiv is currently using it in development phase. We are going to release our product with encrypted parquet soon. So I’ll appreciate a lot if we can process voting procedure on crypto specification as soon as possible, then we can continue with implementation and Java-C++ interop testing.
 
Many thanks,
Tham

On 2019/01/08 17:08:01, Xinli shang <sh...@uber.com.INVALID> wrote: 
> Hi all,
> 
> Can we have one more binding votes to sign off the encryption design to let
> us move on to implementation?  Or, if you think 2 current binding votes can
> also be considered as sign off in this particular case, given this design
> has been extensively discussed in a year and we have tested it in our
> staging environment with privates, please let us know.
> 
> Once we are unblocked, we will work on the implementation and integration
> with our security ecosystem. Maybe someday, we can show the community how
> Parquet encryption can work in large scale production. I would be really
> appreciated if we can be unblocked.
> 
> Xinli
> 
> On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com> wrote:
> 
> > Hi all,
> >
> > This is to remind that we have 2 binding votes - and need one more in order
> > to sign off the encryption.
> >
> > Cheers, Gidon.
> >
> > On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks for your efforts, Gidon!
> > >
> > > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid>
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Thanks for all your work on this, Gidon!
> > > >
> > > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com>
> > > wrote:
> > > >
> > > > > Hello Parquet developers,
> > > > >
> > > > > After a productive round of discussions, we have a new enhanced
> > version
> > > > of
> > > > > the design document. You can find it here:
> > > > >
> > > > >
> > > >
> > >
> > https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > > >
> > > > > This is a call for vote (either new, or resumption of the previous
> > one.
> > > > If
> > > > > you have voted +1 before, feel free to do that again :).
> > > > >
> > > > > Cheers, Gidon
> > > > >
> > > > > ---------- Forwarded message ---------
> > > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > > To: <de...@parquet.apache.org>
> > > > >
> > > > >
> > > > > Hello Parquet developers,
> > > > >
> > > > > Per the last sync discussion, it is time to call for a vote on the
> > > > Parquet
> > > > > Modular Encryption design sign-off. The design doc can be found at
> > the
> > > > > encryption branch of the parquet-format repository,
> > > > >
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > .
> > > > >
> > > > > The design is stable by now. This work had started 10 months ago, has
> > > > been
> > > > > extensively reviewed -  and implemented (in Java, partially in C++),
> > > by a
> > > > > number of folks from different companies. To continue with the
> > > > > implementation pull requests, we need the design to be formally
> > signed
> > > > off
> > > > > by the community.
> > > > >
> > > > > Cheers, Gidon
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> >
> 
> 
> -- 
> Xinli Shang
> 

Re: [VOTE] Modular Encryption design sign-off

Posted by Xinli shang <sh...@uber.com.INVALID>.
Hi all,

Can we have one more binding votes to sign off the encryption design to let
us move on to implementation?  Or, if you think 2 current binding votes can
also be considered as sign off in this particular case, given this design
has been extensively discussed in a year and we have tested it in our
staging environment with privates, please let us know.

Once we are unblocked, we will work on the implementation and integration
with our security ecosystem. Maybe someday, we can show the community how
Parquet encryption can work in large scale production. I would be really
appreciated if we can be unblocked.

Xinli

On Wed, Dec 19, 2018 at 9:58 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi all,
>
> This is to remind that we have 2 binding votes - and need one more in order
> to sign off the encryption.
>
> Cheers, Gidon.
>
> On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > +1 (binding)
> >
> > Thanks for your efforts, Gidon!
> >
> > On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks for all your work on this, Gidon!
> > >
> > > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com>
> > wrote:
> > >
> > > > Hello Parquet developers,
> > > >
> > > > After a productive round of discussions, we have a new enhanced
> version
> > > of
> > > > the design document. You can find it here:
> > > >
> > > >
> > >
> >
> https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > > >
> > > > This is a call for vote (either new, or resumption of the previous
> one.
> > > If
> > > > you have voted +1 before, feel free to do that again :).
> > > >
> > > > Cheers, Gidon
> > > >
> > > > ---------- Forwarded message ---------
> > > > From: Gidon Gershinsky <gg...@gmail.com>
> > > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > > Subject: [VOTE] Modular Encryption design sign-off
> > > > To: <de...@parquet.apache.org>
> > > >
> > > >
> > > > Hello Parquet developers,
> > > >
> > > > Per the last sync discussion, it is time to call for a vote on the
> > > Parquet
> > > > Modular Encryption design sign-off. The design doc can be found at
> the
> > > > encryption branch of the parquet-format repository,
> > > >
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > .
> > > >
> > > > The design is stable by now. This work had started 10 months ago, has
> > > been
> > > > extensively reviewed -  and implemented (in Java, partially in C++),
> > by a
> > > > number of folks from different companies. To continue with the
> > > > implementation pull requests, we need the design to be formally
> signed
> > > off
> > > > by the community.
> > > >
> > > > Cheers, Gidon
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>


-- 
Xinli Shang

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi all,

This is to remind that we have 2 binding votes - and need one more in order
to sign off the encryption.

Cheers, Gidon.

On Sat, Dec 15, 2018 at 10:13 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> +1 (binding)
>
> Thanks for your efforts, Gidon!
>
> On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > +1 (binding)
> >
> > Thanks for all your work on this, Gidon!
> >
> > On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
> >
> > > Hello Parquet developers,
> > >
> > > After a productive round of discussions, we have a new enhanced version
> > of
> > > the design document. You can find it here:
> > >
> > >
> >
> https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> > >
> > > This is a call for vote (either new, or resumption of the previous one.
> > If
> > > you have voted +1 before, feel free to do that again :).
> > >
> > > Cheers, Gidon
> > >
> > > ---------- Forwarded message ---------
> > > From: Gidon Gershinsky <gg...@gmail.com>
> > > Date: Tue, Oct 16, 2018 at 11:11 AM
> > > Subject: [VOTE] Modular Encryption design sign-off
> > > To: <de...@parquet.apache.org>
> > >
> > >
> > > Hello Parquet developers,
> > >
> > > Per the last sync discussion, it is time to call for a vote on the
> > Parquet
> > > Modular Encryption design sign-off. The design doc can be found at the
> > > encryption branch of the parquet-format repository,
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> .
> > >
> > > The design is stable by now. This work had started 10 months ago, has
> > been
> > > extensively reviewed -  and implemented (in Java, partially in C++),
> by a
> > > number of folks from different companies. To continue with the
> > > implementation pull requests, we need the design to be formally signed
> > off
> > > by the community.
> > >
> > > Cheers, Gidon
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
+1 (binding)

Thanks for your efforts, Gidon!

On Fri, Dec 14, 2018 at 6:55 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> +1 (binding)
>
> Thanks for all your work on this, Gidon!
>
> On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Hello Parquet developers,
> >
> > After a productive round of discussions, we have a new enhanced version
> of
> > the design document. You can find it here:
> >
> >
> https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
> >
> > This is a call for vote (either new, or resumption of the previous one.
> If
> > you have voted +1 before, feel free to do that again :).
> >
> > Cheers, Gidon
> >
> > ---------- Forwarded message ---------
> > From: Gidon Gershinsky <gg...@gmail.com>
> > Date: Tue, Oct 16, 2018 at 11:11 AM
> > Subject: [VOTE] Modular Encryption design sign-off
> > To: <de...@parquet.apache.org>
> >
> >
> > Hello Parquet developers,
> >
> > Per the last sync discussion, it is time to call for a vote on the
> Parquet
> > Modular Encryption design sign-off. The design doc can be found at the
> > encryption branch of the parquet-format repository,
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> >
> > The design is stable by now. This work had started 10 months ago, has
> been
> > extensively reviewed -  and implemented (in Java, partially in C++), by a
> > number of folks from different companies. To continue with the
> > implementation pull requests, we need the design to be formally signed
> off
> > by the community.
> >
> > Cheers, Gidon
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
+1 (binding)

Thanks for all your work on this, Gidon!

On Fri, Dec 14, 2018 at 9:48 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hello Parquet developers,
>
> After a productive round of discussions, we have a new enhanced version of
> the design document. You can find it here:
>
> https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md
>
> This is a call for vote (either new, or resumption of the previous one. If
> you have voted +1 before, feel free to do that again :).
>
> Cheers, Gidon
>
> ---------- Forwarded message ---------
> From: Gidon Gershinsky <gg...@gmail.com>
> Date: Tue, Oct 16, 2018 at 11:11 AM
> Subject: [VOTE] Modular Encryption design sign-off
> To: <de...@parquet.apache.org>
>
>
> Hello Parquet developers,
>
> Per the last sync discussion, it is time to call for a vote on the Parquet
> Modular Encryption design sign-off. The design doc can be found at the
> encryption branch of the parquet-format repository,
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
>
> The design is stable by now. This work had started 10 months ago, has been
> extensively reviewed -  and implemented (in Java, partially in C++), by a
> number of folks from different companies. To continue with the
> implementation pull requests, we need the design to be formally signed off
> by the community.
>
> Cheers, Gidon
>


-- 
Ryan Blue
Software Engineer
Netflix

Fwd: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hello Parquet developers,

After a productive round of discussions, we have a new enhanced version of
the design document. You can find it here:
https://github.com/ggershinsky/parquet-format/blob/p1450-crypto-specification/Encryption.md

This is a call for vote (either new, or resumption of the previous one. If
you have voted +1 before, feel free to do that again :).

Cheers, Gidon

---------- Forwarded message ---------
From: Gidon Gershinsky <gg...@gmail.com>
Date: Tue, Oct 16, 2018 at 11:11 AM
Subject: [VOTE] Modular Encryption design sign-off
To: <de...@parquet.apache.org>


Hello Parquet developers,

Per the last sync discussion, it is time to call for a vote on the Parquet
Modular Encryption design sign-off. The design doc can be found at the
encryption branch of the parquet-format repository,
https://github.com/apache/parquet-format/blob/encryption/Encryption.md.

The design is stable by now. This work had started 10 months ago, has been
extensively reviewed -  and implemented (in Java, partially in C++), by a
number of folks from different companies. To continue with the
implementation pull requests, we need the design to be formally signed off
by the community.

Cheers, Gidon

Re: [VOTE] Modular Encryption design sign-off

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

Regarding whether to initially only support a single encryption algorithm
or immediately add two: Personally I find it advantageous to have multiple
ones right from the start. When trying to create a generic solution
starting with a single implementation one can easily underestimate the
flexibility required to really make the solution generic. These design
faults only show up when one tries to add the second implementation and
realizes that it does not fit in the framework.

(Some possible mistakes with Thrift data structures: Embedding fields
directly instead of putting them in a struct, not realizing that it will
need to be extended with new fields. Making fields required not realizing
that they will not always make sense. Using enums instead of unions not
realizing that they will need to be extended. Using a non-repeating field
not realizing that multiple instances will be needed.)

When these problems surface too late, one may need to introduce breaking
changes or overcomplicate the design to keep backwards compatibility and
add the desired flexibility at the same time. It is not impossible to get
things right at first try, but it is a lot more error-prone.

In my opinion, adding two implementations is like a unit test for the
design.

Br,

Zoltan

On Sat, Oct 20, 2018 at 11:22 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Ryan,
>
> Thank you for the clear and detailed feedback.
>
> 1. Using a new magic string (PARE) for files with a plaintext footer would
> not allow legacy readers to parse these files.
>
> 2. Support for multiple encryption algorithms is one of the differentiating
> features of Parquet encryption, therefore having a second cipher upfront is
> significant.
> The design/review overhead of adding a CTR-based cipher to GCM  is minimal.
> In fact, CTR is a subset of GCM, because the latter is built on CTR (and
> GMAC).
> Since you explicitly mark your suggestion as a minor note, I propose we
> keep this part unchanged.
>
> 3.  EncryptedColumnChunk structure will be identical to a ColumnChunk. The
> reasons are:
> - In both EF and PF modes (encrypted/plaintext footer), the ColumnMetaData
> is not always separated.
> It is kept inside ColumnChunk if the column is plaintext,  or encrypted
> with the footer key in the EF mode. No reason to separate it then.
> - ColumnCryptoMetaData is needed in ColumnChunk to support column
> encryption in PF mode (and in EF mode).
> - We can make the file_offset an optional field. But actually it is not set
> anyway in today's Parquet, which works only with the optional meta_data
> field. So, this reason alone shouldn't be enough to create a new Thrift
> structure that fully replicates an existing one, save for optional vs
> required classifier in one field.
>
> 4. There are no options 1,2 in the NIST spec section 8.2.2. This subject
> had been extensively discussed by community during the design review:
> - If you mean the free field option there - its inapplicable for the 96-bit
> IV we are using.
> Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit is a
> recommended length for AES-GCM IV, see discussion at
>
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths will
> require additional calculations.*"
> - If you mean the 8.2.1 section, with deterministic construction option -
> its highly problematic, since deterministic IV prefix ("fixed field")
> "*shall
> identify the device, or, more generally, the context for the instance of
> the authenticated encryption function*". This is not trivial in a single
> process, and even more challenging in distributed frameworks like Spark
> ("*no
> two distinct devices shall share the same fixed field*").  Moreover, this
> can be unsafe - a mistake in prefix creation can lead to identical IVs -
> which breaks the encryption.
> - The good news is we don't need these complications. The 8.2.2 allows us
> to have 4 billion random IVs for a single key before any collision would
> happen in IVs that are open to attackers. With Parquet modular encryption,
> the number is more like 400 billion, because most (>99%) of IVs are hidden
> from an attacker - their offset is encrypted with the column/footer keys.
> Of course, we use the SecureRandom class, seeded properly.
> - Besides checking the NIST spec and crypto forums, I work directly with
> cipher technology experts - who confirm the Parquet approach to IV
> generation is the right thing to do in distributed systems.
>
>
> All other suggestions sound good, and are not hard to add to the spec. I'll
> update the document accordingly.
>
> Cheers, Gidon.
>
> On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > After thinking about this more last night, I think it also needs to be
> more
> > careful around IV generation. The spec just says that implementations
> > should use RBG-based IV construction, but it needs to be more explicit
> that
> > no two GCM streams use the same IV. If an IV is reused, then there is a
> > possible plaintext leak when blocks are padded with a predictable
> pattern.
> >
> > I think the spec should state that an initial IV should be generated
> using
> > secure random as described by 8.2.2 option 1 and the IV for each
> encrypted
> > stream should be generated using 8.2.2 option 2 until there is a
> collision,
> > when the base IV should be regenerated using 8.2.2 option 1. That will
> > ensure that each IV is unique, but that Parquet doesn't consume a huge
> > amount of crypto randomness.
> >
> > rb
> >
> > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:
> >
> > > -1
> > >
> > >    - I don’t think that it contains enough information for someone to
> > >    successfully implement the proposal. For example, this references
> > thrift
> > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce them
> > and
> > >    explain what they are or detail what they contain. Similarly,
> > >    ColumnCryptoMetaData and its sub-types are not defined. It looks
> like
> > this
> > >    spec relies on thrift, but I think that it should be self-contained.
> > >    - It should be more clear exactly how to construct each GCM or CTR
> > >    encrypted buffer. Is it necessary to specify padding for AES-GCM? I
> > see
> > >    from a quick google search that some implementations do, but there’s
> > >    nothing in this doc.
> > >    - I don’t think that this explains enough of the intent that is
> behind
> > >    the specification. For example, “Key metadata is a free-form byte
> > array
> > >    that can be used by a reader to retrieve the column encryption key”
> > should
> > >    be more clear about what this byte array actually contains or
> > represents.
> > >    Will it identify a key that can be fetched from a store? Will it be
> a
> > >    randomly-generated key stored encrypted by another key?
> > >    - This changes the definition of ColumnChunk’s file_offset field and
> > >    the location of ColumnMetaData in some cases. Instead of repurposing
> > thrift
> > >    structures, the spec should introduce new ones, like an
> > >    EncryptedColumnChunk that preserves the meaning of fields in
> > ColumnChunk
> > >    but separates out ColumnMetaData.
> > >    - This contains details that aren’t required, like using the
> > >    “.parquet.encrypted” file extension.
> > >    - The only time the new magic bytes are mentioned is after the
> footer
> > >    and encryption metadata, but the diagram shows that the first bytes
> > in the
> > >    file are updated as well. This is also only in the encrypted footer
> > mode.
> > >    Should PARE magic bytes be used in plaintext footer mode?
> > >
> > > Minor note: I would also prefer to vote on GCM, leaving out CTR for now
> > > and adding it once the GCM spec is finished. That way we can
> concentrate
> > on
> > > a single cipher mode instead of thinking about multiple modes at once.
> > >
> > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> <szonyi@cloudera.com.invalid
> > >
> > > wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > >> <nk...@cloudera.com.invalid>
> > >> wrote:
> > >>
> > >> > +1 (non-binding)
> > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >> > >
> > >> > > +1 (non-binding)
> > >> > >
> > >> > >
> > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
> > >> > >
> > >> > > > +1 (binding)
> > >> > > >
> > >> > > > Cheers,
> > >> > > >
> > >> > > > Zoltan
> > >> > > >
> > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > gg5070@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hello Parquet developers,
> > >> > > > >
> > >> > > > > Per the last sync discussion, it is time to call for a vote on
> > the
> > >> > > > Parquet
> > >> > > > > Modular Encryption design sign-off. The design doc can be
> found
> > at
> > >> > the
> > >> > > > > encryption branch of the parquet-format repository,
> > >> > > > >
> > >> >
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> > >> > > > >
> > >> > > > > The design is stable by now. This work had started 10 months
> > ago,
> > >> has
> > >> > > > been
> > >> > > > > extensively reviewed -  and implemented (in Java, partially in
> > >> C++),
> > >> > by a
> > >> > > > > number of folks from different companies. To continue with the
> > >> > > > > implementation pull requests, we need the design to be
> formally
> > >> > signed
> > >> > > > off
> > >> > > > > by the community.
> > >> > > > >
> > >> > > > > Cheers, Gidon
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Thanks & Best Regards
> > >> >
> > >>
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
I had a quick look at the new comments, shouldn't be a problem to factor
them in.

Re section 5.4 - it describes an issue that is resolved today with a
workaround (in the current Parquet code, for regular files).
However, the workaround stops working when files are encrypted. Therefore,
the new fields are required for parsing encrypted files, and are
nice-to-have for parsing regular files.
I'll add details in the pr comments (some are already available in the pr
itself). Still, for the purpose of this discussion - these fields are
indeed related to encryption indirectly, and have a wider applicability.
The encryption spec doesn't have to allocate a separate section for them;
can mention this point in a sentence or two in the section 5.2. TBD at the
pr.

Cheers, Gidon.


On Mon, Dec 3, 2018 at 9:48 PM Ryan Blue <rb...@netflix.com> wrote:

> I added another round of reviews. I didn't find many technical changes,
> but I think that several things should be more clear to make it easier for
> people to implement the spec.
>
> The main technical problem I have is the addition of section 5.4 about
> vectorization. These extra fields don't appear to be required for
> encryption, unless I'm missing something. Can someone explain why they were
> added to the encryption doc and not added separately as an optimization for
> certain readers?
>
> rb
>
> On Sat, Dec 1, 2018 at 3:23 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> I'll take a look at it again. Sorry for the delay.
>>
>> On Sat, Dec 1, 2018 at 5:04 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>>
>>> Xinli, thanks, this is a team / community effort. E.g., in the latest
>>> round, I had many constructive comments from Ryan and Zoltan.
>>> The design and document got better as the result.
>>>
>>> Ryan, its been a while since then. Let us know if the doc is ok now. We
>>> will resume the work once the document is signed off.
>>>
>>> Cheers, Gidon.
>>>
>>>
>>> On Fri, Nov 30, 2018 at 10:48 PM Xinli shang <sh...@uber.com.invalid>
>>> wrote:
>>>
>>>> + 1 (non-binding). The latest one looks good to me! Thanks, Gidon!
>>>>
>>>> On Thu, Oct 25, 2018 at 4:17 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>> > Thanks, Gidon! I'll have a look as soon as I get some free time.
>>>> >
>>>> > On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg...@gmail.com>
>>>> wrote:
>>>> >
>>>> > > Ryan,
>>>> > >
>>>> > > Your suggestions have been addressed, please check out the
>>>> > > https://github.com/apache/parquet-format/pull/114/files
>>>> > >
>>>> > > Cheers, Gidon.
>>>> > >
>>>> > >
>>>> > > ---------- Forwarded message ---------
>>>> > > From: Gidon Gershinsky <gg...@gmail.com>
>>>> > > Date: Tue, Oct 23, 2018 at 10:32 AM
>>>> > > Subject: Re: [VOTE] Modular Encryption design sign-off
>>>> > > To: <de...@parquet.apache.org>
>>>> > >
>>>> > >
>>>> > > Regarding the EncryptedColumnChunk, we might be talking about two
>>>> > different
>>>> > > things, looks like we need a deeper tech dive to sync up and get to
>>>> the
>>>> > > bottom of this.
>>>> > > I'll <> send you a separate mail on this subject, but no doubt we'll
>>>> > > figure it out one way or another.
>>>> > >
>>>> > > It is the only discussion point we have left, since all other
>>>> suggestions
>>>> > > sound good to me - the md doc can certainly use a substantial
>>>> addition of
>>>> > > explicit details for every feature
>>>> > > of the encryption format, to make it a full specification document.
>>>> > >
>>>> > > Cheers, Gidon.
>>>> > >
>>>> > > On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rblue@netflix.com.invalid
>>>> >
>>>> > > wrote:
>>>> > >
>>>> > > > Using a new magic string (PARE) for files with a plaintext footer
>>>> would
>>>> > > not
>>>> > > > allow legacy readers to parse these files.
>>>> > > > This should be clear from reading the spec. I see that the
>>>> plaintext
>>>> > > footer
>>>> > > > section states “Then the footer is written as usual, followed by
>>>> …,
>>>> > and a
>>>> > > > final magic string, ‘PAR1’” but, like the other section, this is
>>>> not
>>>> > > > clearly called out and nothing states that this is applies to the
>>>> > header
>>>> > > > magic as well. The spec needs to include a statement like “the
>>>> magic
>>>> > > bytes
>>>> > > > for plaintext footer mode are ‘PAR1’ to allow older readers to
>>>> read
>>>> > > > projections of the file that do not include encrypted columns”.
>>>> > > Similarly,
>>>> > > > the other section needs to clearly state what the magic string is
>>>> for
>>>> > > > encrypted footers.
>>>> > > >
>>>> > > > Support for multiple encryption algorithms is one of the
>>>> > differentiating
>>>> > > > features of Parquet encryption, therefore having a second cipher
>>>> > upfront
>>>> > > is
>>>> > > > significant.
>>>> > > >
>>>> > > > That sounds okay to me since CTR is a subset of GCM so it isn’t
>>>> that
>>>> > > large
>>>> > > > of a change. The spec does need to be explicit about exactly how
>>>> the
>>>> > two
>>>> > > > relate. For example, CTR resources tend to use the term “nonce”
>>>> where
>>>> > GCM
>>>> > > > uses IV, so it needs to be explicitly stated that the GCM IV is
>>>> used as
>>>> > > the
>>>> > > > CTR nonce.
>>>> > > >
>>>> > > > EncryptedColumnChunk structure will be identical to a
>>>> ColumnChunk. The
>>>> > > > reasons are: …
>>>> > > >
>>>> > > > These structures should be defined in the encryption spec. Some
>>>> details
>>>> > > can
>>>> > > > be included by reference, like “the remaining fields match those
>>>> in
>>>> > > > ColumnMetaData” but there isn’t enough information in the current
>>>> spec
>>>> > > > draft.
>>>> > > >
>>>> > > > My point about file_offset is that this field already has a use
>>>> and a
>>>> > > > definition and this spec should not change its use or purpose. If
>>>> you
>>>> > > need
>>>> > > > a field for the location of an encrypted ColumnMetaData, then
>>>> > > > EncryptedColumnChunk should add one.
>>>> > > >
>>>> > > > EncryptedColumnChunk structure will be identical to a
>>>> ColumnChunk. …
>>>> > > > ColumnCryptoMetaData is needed in ColumnChunk
>>>> > > >
>>>> > > > How is EncryptedColumnChunk identical to ColumnChunk if it also
>>>> > includes
>>>> > > an
>>>> > > > extra structure for crypto?
>>>> > > >
>>>> > > > There are no options 1,2 in the NIST spec section 8.2.2.
>>>> > > >
>>>> > > > I’m referring to this section: “The random field shall either
>>>> consist
>>>> > of
>>>> > > 1)
>>>> > > > an output string of r(i) bits from an approved RBG with a
>>>> sufficient
>>>> > > > security strength, or 2) the result of applying the r(i)–bit
>>>> > incrementing
>>>> > > > function to the random field of the preceding IV for the given
>>>> key”
>>>> > > >
>>>> > > > My main point is that reusing any IV with the same key is a
>>>> problem.
>>>> > The
>>>> > > > spec must explicitly state that no IV can be used with the same
>>>> key.
>>>> > > Sounds
>>>> > > > like you’re confident that the approach you’re using is
>>>> sufficient, but
>>>> > > if
>>>> > > > it isn’t stated in the spec then other implementations might do it
>>>> > > > incorrectly. Referencing section 8.2.2 or relying on other people
>>>> > having
>>>> > > > knowledge that you do is not sufficient for a spec.
>>>> > > >
>>>> > > > I also have a minor concern over the amount of crypto random that
>>>> this
>>>> > > may
>>>> > > > require because Parquet requires so many encryption streams, but
>>>> that’s
>>>> > > an
>>>> > > > implementation detail and doesn’t need to block the spec.
>>>> > > >
>>>> > > > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <
>>>> gg5070@gmail.com>
>>>> > > wrote:
>>>> > > >
>>>> > > > > Ryan,
>>>> > > > >
>>>> > > > > Thank you for the clear and detailed feedback.
>>>> > > > >
>>>> > > > > 1. Using a new magic string (PARE) for files with a plaintext
>>>> footer
>>>> > > > would
>>>> > > > > not allow legacy readers to parse these files.
>>>> > > > >
>>>> > > > > 2. Support for multiple encryption algorithms is one of the
>>>> > > > differentiating
>>>> > > > > features of Parquet encryption, therefore having a second cipher
>>>> > > upfront
>>>> > > > is
>>>> > > > > significant.
>>>> > > > > The design/review overhead of adding a CTR-based cipher to GCM
>>>> is
>>>> > > > minimal.
>>>> > > > > In fact, CTR is a subset of GCM, because the latter is built on
>>>> CTR
>>>> > > (and
>>>> > > > > GMAC).
>>>> > > > > Since you explicitly mark your suggestion as a minor note, I
>>>> propose
>>>> > we
>>>> > > > > keep this part unchanged.
>>>> > > > >
>>>> > > > > 3.  EncryptedColumnChunk structure will be identical to a
>>>> > ColumnChunk.
>>>> > > > The
>>>> > > > > reasons are:
>>>> > > > > - In both EF and PF modes (encrypted/plaintext footer), the
>>>> > > > ColumnMetaData
>>>> > > > > is not always separated.
>>>> > > > > It is kept inside ColumnChunk if the column is plaintext,  or
>>>> > encrypted
>>>> > > > > with the footer key in the EF mode. No reason to separate it
>>>> then.
>>>> > > > > - ColumnCryptoMetaData is needed in ColumnChunk to support
>>>> column
>>>> > > > > encryption in PF mode (and in EF mode).
>>>> > > > > - We can make the file_offset an optional field. But actually
>>>> it is
>>>> > not
>>>> > > > set
>>>> > > > > anyway in today's Parquet, which works only with the optional
>>>> > meta_data
>>>> > > > > field. So, this reason alone shouldn't be enough to create a new
>>>> > Thrift
>>>> > > > > structure that fully replicates an existing one, save for
>>>> optional vs
>>>> > > > > required classifier in one field.
>>>> > > > >
>>>> > > > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This
>>>> > > subject
>>>> > > > > had been extensively discussed by community during the design
>>>> review:
>>>> > > > > - If you mean the free field option there - its inapplicable
>>>> for the
>>>> > > > 96-bit
>>>> > > > > IV we are using.
>>>> > > > > Per the spec, 96-bit IV is fully random, 0 free field length.
>>>> 96 bit
>>>> > > is a
>>>> > > > > recommended length for AES-GCM IV, see discussion at
>>>> > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
>>>> > > > > eg "*For GCM a 12 byte IV is strongly suggested as other IV
>>>> lengths
>>>> > > will
>>>> > > > > require additional calculations.*"
>>>> > > > > - If you mean the 8.2.1 section, with deterministic construction
>>>> > > option -
>>>> > > > > its highly problematic, since deterministic IV prefix ("fixed
>>>> field")
>>>> > > > > "*shall
>>>> > > > > identify the device, or, more generally, the context for the
>>>> instance
>>>> > > of
>>>> > > > > the authenticated encryption function*". This is not trivial in
>>>> a
>>>> > > single
>>>> > > > > process, and even more challenging in distributed frameworks
>>>> like
>>>> > Spark
>>>> > > > > ("*no
>>>> > > > > two distinct devices shall share the same fixed field*").
>>>> Moreover,
>>>> > > this
>>>> > > > > can be unsafe - a mistake in prefix creation can lead to
>>>> identical
>>>> > IVs
>>>> > > -
>>>> > > > > which breaks the encryption.
>>>> > > > > - The good news is we don't need these complications. The 8.2.2
>>>> > allows
>>>> > > us
>>>> > > > > to have 4 billion random IVs for a single key before any
>>>> collision
>>>> > > would
>>>> > > > > happen in IVs that are open to attackers. With Parquet modular
>>>> > > > encryption,
>>>> > > > > the number is more like 400 billion, because most (>99%) of IVs
>>>> are
>>>> > > > hidden
>>>> > > > > from an attacker - their offset is encrypted with the
>>>> column/footer
>>>> > > keys.
>>>> > > > > Of course, we use the SecureRandom class, seeded properly.
>>>> > > > > - Besides checking the NIST spec and crypto forums, I work
>>>> directly
>>>> > > with
>>>> > > > > cipher technology experts - who confirm the Parquet approach to
>>>> IV
>>>> > > > > generation is the right thing to do in distributed systems.
>>>> > > > >
>>>> > > > >
>>>> > > > > All other suggestions sound good, and are not hard to add to the
>>>> > spec.
>>>> > > > I'll
>>>> > > > > update the document accordingly.
>>>> > > > >
>>>> > > > > Cheers, Gidon.
>>>> > > > >
>>>> > > > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue
>>>> <rblue@netflix.com.invalid
>>>> > >
>>>> > > > > wrote:
>>>> > > > >
>>>> > > > > > After thinking about this more last night, I think it also
>>>> needs to
>>>> > > be
>>>> > > > > more
>>>> > > > > > careful around IV generation. The spec just says that
>>>> > implementations
>>>> > > > > > should use RBG-based IV construction, but it needs to be more
>>>> > > explicit
>>>> > > > > that
>>>> > > > > > no two GCM streams use the same IV. If an IV is reused, then
>>>> there
>>>> > > is a
>>>> > > > > > possible plaintext leak when blocks are padded with a
>>>> predictable
>>>> > > > > pattern.
>>>> > > > > >
>>>> > > > > > I think the spec should state that an initial IV should be
>>>> > generated
>>>> > > > > using
>>>> > > > > > secure random as described by 8.2.2 option 1 and the IV for
>>>> each
>>>> > > > > encrypted
>>>> > > > > > stream should be generated using 8.2.2 option 2 until there
>>>> is a
>>>> > > > > collision,
>>>> > > > > > when the base IV should be regenerated using 8.2.2 option 1.
>>>> That
>>>> > > will
>>>> > > > > > ensure that each IV is unique, but that Parquet doesn't
>>>> consume a
>>>> > > huge
>>>> > > > > > amount of crypto randomness.
>>>> > > > > >
>>>> > > > > > rb
>>>> > > > > >
>>>> > > > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com>
>>>> > wrote:
>>>> > > > > >
>>>> > > > > > > -1
>>>> > > > > > >
>>>> > > > > > >    - I don’t think that it contains enough information for
>>>> > someone
>>>> > > to
>>>> > > > > > >    successfully implement the proposal. For example, this
>>>> > > references
>>>> > > > > > thrift
>>>> > > > > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t
>>>> > introduce
>>>> > > > them
>>>> > > > > > and
>>>> > > > > > >    explain what they are or detail what they contain.
>>>> Similarly,
>>>> > > > > > >    ColumnCryptoMetaData and its sub-types are not defined.
>>>> It
>>>> > looks
>>>> > > > > like
>>>> > > > > > this
>>>> > > > > > >    spec relies on thrift, but I think that it should be
>>>> > > > self-contained.
>>>> > > > > > >    - It should be more clear exactly how to construct each
>>>> GCM or
>>>> > > CTR
>>>> > > > > > >    encrypted buffer. Is it necessary to specify padding for
>>>> > > AES-GCM?
>>>> > > > I
>>>> > > > > > see
>>>> > > > > > >    from a quick google search that some implementations do,
>>>> but
>>>> > > > there’s
>>>> > > > > > >    nothing in this doc.
>>>> > > > > > >    - I don’t think that this explains enough of the intent
>>>> that
>>>> > is
>>>> > > > > behind
>>>> > > > > > >    the specification. For example, “Key metadata is a
>>>> free-form
>>>> > > byte
>>>> > > > > > array
>>>> > > > > > >    that can be used by a reader to retrieve the column
>>>> encryption
>>>> > > > key”
>>>> > > > > > should
>>>> > > > > > >    be more clear about what this byte array actually
>>>> contains or
>>>> > > > > > represents.
>>>> > > > > > >    Will it identify a key that can be fetched from a store?
>>>> Will
>>>> > it
>>>> > > > be
>>>> > > > > a
>>>> > > > > > >    randomly-generated key stored encrypted by another key?
>>>> > > > > > >    - This changes the definition of ColumnChunk’s
>>>> file_offset
>>>> > field
>>>> > > > and
>>>> > > > > > >    the location of ColumnMetaData in some cases. Instead of
>>>> > > > repurposing
>>>> > > > > > thrift
>>>> > > > > > >    structures, the spec should introduce new ones, like an
>>>> > > > > > >    EncryptedColumnChunk that preserves the meaning of
>>>> fields in
>>>> > > > > > ColumnChunk
>>>> > > > > > >    but separates out ColumnMetaData.
>>>> > > > > > >    - This contains details that aren’t required, like using
>>>> the
>>>> > > > > > >    “.parquet.encrypted” file extension.
>>>> > > > > > >    - The only time the new magic bytes are mentioned is
>>>> after the
>>>> > > > > footer
>>>> > > > > > >    and encryption metadata, but the diagram shows that the
>>>> first
>>>> > > > bytes
>>>> > > > > > in the
>>>> > > > > > >    file are updated as well. This is also only in the
>>>> encrypted
>>>> > > > footer
>>>> > > > > > mode.
>>>> > > > > > >    Should PARE magic bytes be used in plaintext footer mode?
>>>> > > > > > >
>>>> > > > > > > Minor note: I would also prefer to vote on GCM, leaving out
>>>> CTR
>>>> > for
>>>> > > > now
>>>> > > > > > > and adding it once the GCM spec is finished. That way we can
>>>> > > > > concentrate
>>>> > > > > > on
>>>> > > > > > > a single cipher mode instead of thinking about multiple
>>>> modes at
>>>> > > > once.
>>>> > > > > > >
>>>> > > > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
>>>> > > > > <szonyi@cloudera.com.invalid
>>>> > > > > > >
>>>> > > > > > > wrote:
>>>> > > > > > >
>>>> > > > > > >> +1 (non-binding)
>>>> > > > > > >>
>>>> > > > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
>>>> > > > > > >> <nk...@cloudera.com.invalid>
>>>> > > > > > >> wrote:
>>>> > > > > > >>
>>>> > > > > > >> > +1 (non-binding)
>>>> > > > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cjjnjust@gmail.com
>>>> >
>>>> > > wrote:
>>>> > > > > > >> > >
>>>> > > > > > >> > > +1 (non-binding)
>>>> > > > > > >> > >
>>>> > > > > > >> > >
>>>> > > > > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
>>>> > > > 下午4:46写道:
>>>> > > > > > >> > >
>>>> > > > > > >> > > > +1 (binding)
>>>> > > > > > >> > > >
>>>> > > > > > >> > > > Cheers,
>>>> > > > > > >> > > >
>>>> > > > > > >> > > > Zoltan
>>>> > > > > > >> > > >
>>>> > > > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
>>>> > > > > > gg5070@gmail.com
>>>> > > > > > >> >
>>>> > > > > > >> > > > wrote:
>>>> > > > > > >> > > >
>>>> > > > > > >> > > > > Hello Parquet developers,
>>>> > > > > > >> > > > >
>>>> > > > > > >> > > > > Per the last sync discussion, it is time to call
>>>> for a
>>>> > > vote
>>>> > > > on
>>>> > > > > > the
>>>> > > > > > >> > > > Parquet
>>>> > > > > > >> > > > > Modular Encryption design sign-off. The design doc
>>>> can
>>>> > be
>>>> > > > > found
>>>> > > > > > at
>>>> > > > > > >> > the
>>>> > > > > > >> > > > > encryption branch of the parquet-format repository,
>>>> > > > > > >> > > > >
>>>> > > > > > >> >
>>>> > > > > >
>>>> > >
>>>> https://github.com/apache/parquet-format/blob/encryption/Encryption.md
>>>> > > > .
>>>> > > > > > >> > > > >
>>>> > > > > > >> > > > > The design is stable by now. This work had started
>>>> 10
>>>> > > months
>>>> > > > > > ago,
>>>> > > > > > >> has
>>>> > > > > > >> > > > been
>>>> > > > > > >> > > > > extensively reviewed -  and implemented (in Java,
>>>> > > partially
>>>> > > > in
>>>> > > > > > >> C++),
>>>> > > > > > >> > by a
>>>> > > > > > >> > > > > number of folks from different companies. To
>>>> continue
>>>> > with
>>>> > > > the
>>>> > > > > > >> > > > > implementation pull requests, we need the design
>>>> to be
>>>> > > > > formally
>>>> > > > > > >> > signed
>>>> > > > > > >> > > > off
>>>> > > > > > >> > > > > by the community.
>>>> > > > > > >> > > > >
>>>> > > > > > >> > > > > Cheers, Gidon
>>>> > > > > > >> > > > >
>>>> > > > > > >> > > >
>>>> > > > > > >> > >
>>>> > > > > > >> > >
>>>> > > > > > >> > > --
>>>> > > > > > >> > > Thanks & Best Regards
>>>> > > > > > >> >
>>>> > > > > > >>
>>>> > > > > > >
>>>> > > > > > >
>>>> > > > > > > --
>>>> > > > > > > Ryan Blue
>>>> > > > > > > Software Engineer
>>>> > > > > > > Netflix
>>>> > > > > > >
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > --
>>>> > > > > > Ryan Blue
>>>> > > > > > Software Engineer
>>>> > > > > > Netflix
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > > >
>>>> > > > --
>>>> > > > Ryan Blue
>>>> > > > Software Engineer
>>>> > > > Netflix
>>>> > > >
>>>> > >
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>> >
>>>>
>>>>
>>>> --
>>>> Xinli Shang
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I added another round of reviews. I didn't find many technical changes, but
I think that several things should be more clear to make it easier for
people to implement the spec.

The main technical problem I have is the addition of section 5.4 about
vectorization. These extra fields don't appear to be required for
encryption, unless I'm missing something. Can someone explain why they were
added to the encryption doc and not added separately as an optimization for
certain readers?

rb

On Sat, Dec 1, 2018 at 3:23 PM Ryan Blue <rb...@netflix.com> wrote:

> I'll take a look at it again. Sorry for the delay.
>
> On Sat, Dec 1, 2018 at 5:04 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
>> Xinli, thanks, this is a team / community effort. E.g., in the latest
>> round, I had many constructive comments from Ryan and Zoltan.
>> The design and document got better as the result.
>>
>> Ryan, its been a while since then. Let us know if the doc is ok now. We
>> will resume the work once the document is signed off.
>>
>> Cheers, Gidon.
>>
>>
>> On Fri, Nov 30, 2018 at 10:48 PM Xinli shang <sh...@uber.com.invalid>
>> wrote:
>>
>>> + 1 (non-binding). The latest one looks good to me! Thanks, Gidon!
>>>
>>> On Thu, Oct 25, 2018 at 4:17 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>> > Thanks, Gidon! I'll have a look as soon as I get some free time.
>>> >
>>> > On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg...@gmail.com>
>>> wrote:
>>> >
>>> > > Ryan,
>>> > >
>>> > > Your suggestions have been addressed, please check out the
>>> > > https://github.com/apache/parquet-format/pull/114/files
>>> > >
>>> > > Cheers, Gidon.
>>> > >
>>> > >
>>> > > ---------- Forwarded message ---------
>>> > > From: Gidon Gershinsky <gg...@gmail.com>
>>> > > Date: Tue, Oct 23, 2018 at 10:32 AM
>>> > > Subject: Re: [VOTE] Modular Encryption design sign-off
>>> > > To: <de...@parquet.apache.org>
>>> > >
>>> > >
>>> > > Regarding the EncryptedColumnChunk, we might be talking about two
>>> > different
>>> > > things, looks like we need a deeper tech dive to sync up and get to
>>> the
>>> > > bottom of this.
>>> > > I'll <> send you a separate mail on this subject, but no doubt we'll
>>> > > figure it out one way or another.
>>> > >
>>> > > It is the only discussion point we have left, since all other
>>> suggestions
>>> > > sound good to me - the md doc can certainly use a substantial
>>> addition of
>>> > > explicit details for every feature
>>> > > of the encryption format, to make it a full specification document.
>>> > >
>>> > > Cheers, Gidon.
>>> > >
>>> > > On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rblue@netflix.com.invalid
>>> >
>>> > > wrote:
>>> > >
>>> > > > Using a new magic string (PARE) for files with a plaintext footer
>>> would
>>> > > not
>>> > > > allow legacy readers to parse these files.
>>> > > > This should be clear from reading the spec. I see that the
>>> plaintext
>>> > > footer
>>> > > > section states “Then the footer is written as usual, followed by …,
>>> > and a
>>> > > > final magic string, ‘PAR1’” but, like the other section, this is
>>> not
>>> > > > clearly called out and nothing states that this is applies to the
>>> > header
>>> > > > magic as well. The spec needs to include a statement like “the
>>> magic
>>> > > bytes
>>> > > > for plaintext footer mode are ‘PAR1’ to allow older readers to read
>>> > > > projections of the file that do not include encrypted columns”.
>>> > > Similarly,
>>> > > > the other section needs to clearly state what the magic string is
>>> for
>>> > > > encrypted footers.
>>> > > >
>>> > > > Support for multiple encryption algorithms is one of the
>>> > differentiating
>>> > > > features of Parquet encryption, therefore having a second cipher
>>> > upfront
>>> > > is
>>> > > > significant.
>>> > > >
>>> > > > That sounds okay to me since CTR is a subset of GCM so it isn’t
>>> that
>>> > > large
>>> > > > of a change. The spec does need to be explicit about exactly how
>>> the
>>> > two
>>> > > > relate. For example, CTR resources tend to use the term “nonce”
>>> where
>>> > GCM
>>> > > > uses IV, so it needs to be explicitly stated that the GCM IV is
>>> used as
>>> > > the
>>> > > > CTR nonce.
>>> > > >
>>> > > > EncryptedColumnChunk structure will be identical to a ColumnChunk.
>>> The
>>> > > > reasons are: …
>>> > > >
>>> > > > These structures should be defined in the encryption spec. Some
>>> details
>>> > > can
>>> > > > be included by reference, like “the remaining fields match those in
>>> > > > ColumnMetaData” but there isn’t enough information in the current
>>> spec
>>> > > > draft.
>>> > > >
>>> > > > My point about file_offset is that this field already has a use
>>> and a
>>> > > > definition and this spec should not change its use or purpose. If
>>> you
>>> > > need
>>> > > > a field for the location of an encrypted ColumnMetaData, then
>>> > > > EncryptedColumnChunk should add one.
>>> > > >
>>> > > > EncryptedColumnChunk structure will be identical to a ColumnChunk.
>>> …
>>> > > > ColumnCryptoMetaData is needed in ColumnChunk
>>> > > >
>>> > > > How is EncryptedColumnChunk identical to ColumnChunk if it also
>>> > includes
>>> > > an
>>> > > > extra structure for crypto?
>>> > > >
>>> > > > There are no options 1,2 in the NIST spec section 8.2.2.
>>> > > >
>>> > > > I’m referring to this section: “The random field shall either
>>> consist
>>> > of
>>> > > 1)
>>> > > > an output string of r(i) bits from an approved RBG with a
>>> sufficient
>>> > > > security strength, or 2) the result of applying the r(i)–bit
>>> > incrementing
>>> > > > function to the random field of the preceding IV for the given key”
>>> > > >
>>> > > > My main point is that reusing any IV with the same key is a
>>> problem.
>>> > The
>>> > > > spec must explicitly state that no IV can be used with the same
>>> key.
>>> > > Sounds
>>> > > > like you’re confident that the approach you’re using is
>>> sufficient, but
>>> > > if
>>> > > > it isn’t stated in the spec then other implementations might do it
>>> > > > incorrectly. Referencing section 8.2.2 or relying on other people
>>> > having
>>> > > > knowledge that you do is not sufficient for a spec.
>>> > > >
>>> > > > I also have a minor concern over the amount of crypto random that
>>> this
>>> > > may
>>> > > > require because Parquet requires so many encryption streams, but
>>> that’s
>>> > > an
>>> > > > implementation detail and doesn’t need to block the spec.
>>> > > >
>>> > > > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg5070@gmail.com
>>> >
>>> > > wrote:
>>> > > >
>>> > > > > Ryan,
>>> > > > >
>>> > > > > Thank you for the clear and detailed feedback.
>>> > > > >
>>> > > > > 1. Using a new magic string (PARE) for files with a plaintext
>>> footer
>>> > > > would
>>> > > > > not allow legacy readers to parse these files.
>>> > > > >
>>> > > > > 2. Support for multiple encryption algorithms is one of the
>>> > > > differentiating
>>> > > > > features of Parquet encryption, therefore having a second cipher
>>> > > upfront
>>> > > > is
>>> > > > > significant.
>>> > > > > The design/review overhead of adding a CTR-based cipher to GCM
>>> is
>>> > > > minimal.
>>> > > > > In fact, CTR is a subset of GCM, because the latter is built on
>>> CTR
>>> > > (and
>>> > > > > GMAC).
>>> > > > > Since you explicitly mark your suggestion as a minor note, I
>>> propose
>>> > we
>>> > > > > keep this part unchanged.
>>> > > > >
>>> > > > > 3.  EncryptedColumnChunk structure will be identical to a
>>> > ColumnChunk.
>>> > > > The
>>> > > > > reasons are:
>>> > > > > - In both EF and PF modes (encrypted/plaintext footer), the
>>> > > > ColumnMetaData
>>> > > > > is not always separated.
>>> > > > > It is kept inside ColumnChunk if the column is plaintext,  or
>>> > encrypted
>>> > > > > with the footer key in the EF mode. No reason to separate it
>>> then.
>>> > > > > - ColumnCryptoMetaData is needed in ColumnChunk to support column
>>> > > > > encryption in PF mode (and in EF mode).
>>> > > > > - We can make the file_offset an optional field. But actually it
>>> is
>>> > not
>>> > > > set
>>> > > > > anyway in today's Parquet, which works only with the optional
>>> > meta_data
>>> > > > > field. So, this reason alone shouldn't be enough to create a new
>>> > Thrift
>>> > > > > structure that fully replicates an existing one, save for
>>> optional vs
>>> > > > > required classifier in one field.
>>> > > > >
>>> > > > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This
>>> > > subject
>>> > > > > had been extensively discussed by community during the design
>>> review:
>>> > > > > - If you mean the free field option there - its inapplicable for
>>> the
>>> > > > 96-bit
>>> > > > > IV we are using.
>>> > > > > Per the spec, 96-bit IV is fully random, 0 free field length. 96
>>> bit
>>> > > is a
>>> > > > > recommended length for AES-GCM IV, see discussion at
>>> > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
>>> > > > > eg "*For GCM a 12 byte IV is strongly suggested as other IV
>>> lengths
>>> > > will
>>> > > > > require additional calculations.*"
>>> > > > > - If you mean the 8.2.1 section, with deterministic construction
>>> > > option -
>>> > > > > its highly problematic, since deterministic IV prefix ("fixed
>>> field")
>>> > > > > "*shall
>>> > > > > identify the device, or, more generally, the context for the
>>> instance
>>> > > of
>>> > > > > the authenticated encryption function*". This is not trivial in a
>>> > > single
>>> > > > > process, and even more challenging in distributed frameworks like
>>> > Spark
>>> > > > > ("*no
>>> > > > > two distinct devices shall share the same fixed field*").
>>> Moreover,
>>> > > this
>>> > > > > can be unsafe - a mistake in prefix creation can lead to
>>> identical
>>> > IVs
>>> > > -
>>> > > > > which breaks the encryption.
>>> > > > > - The good news is we don't need these complications. The 8.2.2
>>> > allows
>>> > > us
>>> > > > > to have 4 billion random IVs for a single key before any
>>> collision
>>> > > would
>>> > > > > happen in IVs that are open to attackers. With Parquet modular
>>> > > > encryption,
>>> > > > > the number is more like 400 billion, because most (>99%) of IVs
>>> are
>>> > > > hidden
>>> > > > > from an attacker - their offset is encrypted with the
>>> column/footer
>>> > > keys.
>>> > > > > Of course, we use the SecureRandom class, seeded properly.
>>> > > > > - Besides checking the NIST spec and crypto forums, I work
>>> directly
>>> > > with
>>> > > > > cipher technology experts - who confirm the Parquet approach to
>>> IV
>>> > > > > generation is the right thing to do in distributed systems.
>>> > > > >
>>> > > > >
>>> > > > > All other suggestions sound good, and are not hard to add to the
>>> > spec.
>>> > > > I'll
>>> > > > > update the document accordingly.
>>> > > > >
>>> > > > > Cheers, Gidon.
>>> > > > >
>>> > > > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue
>>> <rblue@netflix.com.invalid
>>> > >
>>> > > > > wrote:
>>> > > > >
>>> > > > > > After thinking about this more last night, I think it also
>>> needs to
>>> > > be
>>> > > > > more
>>> > > > > > careful around IV generation. The spec just says that
>>> > implementations
>>> > > > > > should use RBG-based IV construction, but it needs to be more
>>> > > explicit
>>> > > > > that
>>> > > > > > no two GCM streams use the same IV. If an IV is reused, then
>>> there
>>> > > is a
>>> > > > > > possible plaintext leak when blocks are padded with a
>>> predictable
>>> > > > > pattern.
>>> > > > > >
>>> > > > > > I think the spec should state that an initial IV should be
>>> > generated
>>> > > > > using
>>> > > > > > secure random as described by 8.2.2 option 1 and the IV for
>>> each
>>> > > > > encrypted
>>> > > > > > stream should be generated using 8.2.2 option 2 until there is
>>> a
>>> > > > > collision,
>>> > > > > > when the base IV should be regenerated using 8.2.2 option 1.
>>> That
>>> > > will
>>> > > > > > ensure that each IV is unique, but that Parquet doesn't
>>> consume a
>>> > > huge
>>> > > > > > amount of crypto randomness.
>>> > > > > >
>>> > > > > > rb
>>> > > > > >
>>> > > > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com>
>>> > wrote:
>>> > > > > >
>>> > > > > > > -1
>>> > > > > > >
>>> > > > > > >    - I don’t think that it contains enough information for
>>> > someone
>>> > > to
>>> > > > > > >    successfully implement the proposal. For example, this
>>> > > references
>>> > > > > > thrift
>>> > > > > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t
>>> > introduce
>>> > > > them
>>> > > > > > and
>>> > > > > > >    explain what they are or detail what they contain.
>>> Similarly,
>>> > > > > > >    ColumnCryptoMetaData and its sub-types are not defined. It
>>> > looks
>>> > > > > like
>>> > > > > > this
>>> > > > > > >    spec relies on thrift, but I think that it should be
>>> > > > self-contained.
>>> > > > > > >    - It should be more clear exactly how to construct each
>>> GCM or
>>> > > CTR
>>> > > > > > >    encrypted buffer. Is it necessary to specify padding for
>>> > > AES-GCM?
>>> > > > I
>>> > > > > > see
>>> > > > > > >    from a quick google search that some implementations do,
>>> but
>>> > > > there’s
>>> > > > > > >    nothing in this doc.
>>> > > > > > >    - I don’t think that this explains enough of the intent
>>> that
>>> > is
>>> > > > > behind
>>> > > > > > >    the specification. For example, “Key metadata is a
>>> free-form
>>> > > byte
>>> > > > > > array
>>> > > > > > >    that can be used by a reader to retrieve the column
>>> encryption
>>> > > > key”
>>> > > > > > should
>>> > > > > > >    be more clear about what this byte array actually
>>> contains or
>>> > > > > > represents.
>>> > > > > > >    Will it identify a key that can be fetched from a store?
>>> Will
>>> > it
>>> > > > be
>>> > > > > a
>>> > > > > > >    randomly-generated key stored encrypted by another key?
>>> > > > > > >    - This changes the definition of ColumnChunk’s file_offset
>>> > field
>>> > > > and
>>> > > > > > >    the location of ColumnMetaData in some cases. Instead of
>>> > > > repurposing
>>> > > > > > thrift
>>> > > > > > >    structures, the spec should introduce new ones, like an
>>> > > > > > >    EncryptedColumnChunk that preserves the meaning of fields
>>> in
>>> > > > > > ColumnChunk
>>> > > > > > >    but separates out ColumnMetaData.
>>> > > > > > >    - This contains details that aren’t required, like using
>>> the
>>> > > > > > >    “.parquet.encrypted” file extension.
>>> > > > > > >    - The only time the new magic bytes are mentioned is
>>> after the
>>> > > > > footer
>>> > > > > > >    and encryption metadata, but the diagram shows that the
>>> first
>>> > > > bytes
>>> > > > > > in the
>>> > > > > > >    file are updated as well. This is also only in the
>>> encrypted
>>> > > > footer
>>> > > > > > mode.
>>> > > > > > >    Should PARE magic bytes be used in plaintext footer mode?
>>> > > > > > >
>>> > > > > > > Minor note: I would also prefer to vote on GCM, leaving out
>>> CTR
>>> > for
>>> > > > now
>>> > > > > > > and adding it once the GCM spec is finished. That way we can
>>> > > > > concentrate
>>> > > > > > on
>>> > > > > > > a single cipher mode instead of thinking about multiple
>>> modes at
>>> > > > once.
>>> > > > > > >
>>> > > > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
>>> > > > > <szonyi@cloudera.com.invalid
>>> > > > > > >
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > >> +1 (non-binding)
>>> > > > > > >>
>>> > > > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
>>> > > > > > >> <nk...@cloudera.com.invalid>
>>> > > > > > >> wrote:
>>> > > > > > >>
>>> > > > > > >> > +1 (non-binding)
>>> > > > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com>
>>> > > wrote:
>>> > > > > > >> > >
>>> > > > > > >> > > +1 (non-binding)
>>> > > > > > >> > >
>>> > > > > > >> > >
>>> > > > > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
>>> > > > 下午4:46写道:
>>> > > > > > >> > >
>>> > > > > > >> > > > +1 (binding)
>>> > > > > > >> > > >
>>> > > > > > >> > > > Cheers,
>>> > > > > > >> > > >
>>> > > > > > >> > > > Zoltan
>>> > > > > > >> > > >
>>> > > > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
>>> > > > > > gg5070@gmail.com
>>> > > > > > >> >
>>> > > > > > >> > > > wrote:
>>> > > > > > >> > > >
>>> > > > > > >> > > > > Hello Parquet developers,
>>> > > > > > >> > > > >
>>> > > > > > >> > > > > Per the last sync discussion, it is time to call
>>> for a
>>> > > vote
>>> > > > on
>>> > > > > > the
>>> > > > > > >> > > > Parquet
>>> > > > > > >> > > > > Modular Encryption design sign-off. The design doc
>>> can
>>> > be
>>> > > > > found
>>> > > > > > at
>>> > > > > > >> > the
>>> > > > > > >> > > > > encryption branch of the parquet-format repository,
>>> > > > > > >> > > > >
>>> > > > > > >> >
>>> > > > > >
>>> > >
>>> https://github.com/apache/parquet-format/blob/encryption/Encryption.md
>>> > > > .
>>> > > > > > >> > > > >
>>> > > > > > >> > > > > The design is stable by now. This work had started
>>> 10
>>> > > months
>>> > > > > > ago,
>>> > > > > > >> has
>>> > > > > > >> > > > been
>>> > > > > > >> > > > > extensively reviewed -  and implemented (in Java,
>>> > > partially
>>> > > > in
>>> > > > > > >> C++),
>>> > > > > > >> > by a
>>> > > > > > >> > > > > number of folks from different companies. To
>>> continue
>>> > with
>>> > > > the
>>> > > > > > >> > > > > implementation pull requests, we need the design to
>>> be
>>> > > > > formally
>>> > > > > > >> > signed
>>> > > > > > >> > > > off
>>> > > > > > >> > > > > by the community.
>>> > > > > > >> > > > >
>>> > > > > > >> > > > > Cheers, Gidon
>>> > > > > > >> > > > >
>>> > > > > > >> > > >
>>> > > > > > >> > >
>>> > > > > > >> > >
>>> > > > > > >> > > --
>>> > > > > > >> > > Thanks & Best Regards
>>> > > > > > >> >
>>> > > > > > >>
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > --
>>> > > > > > > Ryan Blue
>>> > > > > > > Software Engineer
>>> > > > > > > Netflix
>>> > > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > --
>>> > > > > > Ryan Blue
>>> > > > > > Software Engineer
>>> > > > > > Netflix
>>> > > > > >
>>> > > > >
>>> > > >
>>> > > >
>>> > > > --
>>> > > > Ryan Blue
>>> > > > Software Engineer
>>> > > > Netflix
>>> > > >
>>> > >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>> >
>>>
>>>
>>> --
>>> Xinli Shang
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'll take a look at it again. Sorry for the delay.

On Sat, Dec 1, 2018 at 5:04 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Xinli, thanks, this is a team / community effort. E.g., in the latest
> round, I had many constructive comments from Ryan and Zoltan.
> The design and document got better as the result.
>
> Ryan, its been a while since then. Let us know if the doc is ok now. We
> will resume the work once the document is signed off.
>
> Cheers, Gidon.
>
>
> On Fri, Nov 30, 2018 at 10:48 PM Xinli shang <sh...@uber.com.invalid>
> wrote:
>
>> + 1 (non-binding). The latest one looks good to me! Thanks, Gidon!
>>
>> On Thu, Oct 25, 2018 at 4:17 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>> > Thanks, Gidon! I'll have a look as soon as I get some free time.
>> >
>> > On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg...@gmail.com>
>> wrote:
>> >
>> > > Ryan,
>> > >
>> > > Your suggestions have been addressed, please check out the
>> > > https://github.com/apache/parquet-format/pull/114/files
>> > >
>> > > Cheers, Gidon.
>> > >
>> > >
>> > > ---------- Forwarded message ---------
>> > > From: Gidon Gershinsky <gg...@gmail.com>
>> > > Date: Tue, Oct 23, 2018 at 10:32 AM
>> > > Subject: Re: [VOTE] Modular Encryption design sign-off
>> > > To: <de...@parquet.apache.org>
>> > >
>> > >
>> > > Regarding the EncryptedColumnChunk, we might be talking about two
>> > different
>> > > things, looks like we need a deeper tech dive to sync up and get to
>> the
>> > > bottom of this.
>> > > I'll <> send you a separate mail on this subject, but no doubt we'll
>> > > figure it out one way or another.
>> > >
>> > > It is the only discussion point we have left, since all other
>> suggestions
>> > > sound good to me - the md doc can certainly use a substantial
>> addition of
>> > > explicit details for every feature
>> > > of the encryption format, to make it a full specification document.
>> > >
>> > > Cheers, Gidon.
>> > >
>> > > On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid>
>> > > wrote:
>> > >
>> > > > Using a new magic string (PARE) for files with a plaintext footer
>> would
>> > > not
>> > > > allow legacy readers to parse these files.
>> > > > This should be clear from reading the spec. I see that the plaintext
>> > > footer
>> > > > section states “Then the footer is written as usual, followed by …,
>> > and a
>> > > > final magic string, ‘PAR1’” but, like the other section, this is not
>> > > > clearly called out and nothing states that this is applies to the
>> > header
>> > > > magic as well. The spec needs to include a statement like “the magic
>> > > bytes
>> > > > for plaintext footer mode are ‘PAR1’ to allow older readers to read
>> > > > projections of the file that do not include encrypted columns”.
>> > > Similarly,
>> > > > the other section needs to clearly state what the magic string is
>> for
>> > > > encrypted footers.
>> > > >
>> > > > Support for multiple encryption algorithms is one of the
>> > differentiating
>> > > > features of Parquet encryption, therefore having a second cipher
>> > upfront
>> > > is
>> > > > significant.
>> > > >
>> > > > That sounds okay to me since CTR is a subset of GCM so it isn’t that
>> > > large
>> > > > of a change. The spec does need to be explicit about exactly how the
>> > two
>> > > > relate. For example, CTR resources tend to use the term “nonce”
>> where
>> > GCM
>> > > > uses IV, so it needs to be explicitly stated that the GCM IV is
>> used as
>> > > the
>> > > > CTR nonce.
>> > > >
>> > > > EncryptedColumnChunk structure will be identical to a ColumnChunk.
>> The
>> > > > reasons are: …
>> > > >
>> > > > These structures should be defined in the encryption spec. Some
>> details
>> > > can
>> > > > be included by reference, like “the remaining fields match those in
>> > > > ColumnMetaData” but there isn’t enough information in the current
>> spec
>> > > > draft.
>> > > >
>> > > > My point about file_offset is that this field already has a use and
>> a
>> > > > definition and this spec should not change its use or purpose. If
>> you
>> > > need
>> > > > a field for the location of an encrypted ColumnMetaData, then
>> > > > EncryptedColumnChunk should add one.
>> > > >
>> > > > EncryptedColumnChunk structure will be identical to a ColumnChunk. …
>> > > > ColumnCryptoMetaData is needed in ColumnChunk
>> > > >
>> > > > How is EncryptedColumnChunk identical to ColumnChunk if it also
>> > includes
>> > > an
>> > > > extra structure for crypto?
>> > > >
>> > > > There are no options 1,2 in the NIST spec section 8.2.2.
>> > > >
>> > > > I’m referring to this section: “The random field shall either
>> consist
>> > of
>> > > 1)
>> > > > an output string of r(i) bits from an approved RBG with a sufficient
>> > > > security strength, or 2) the result of applying the r(i)–bit
>> > incrementing
>> > > > function to the random field of the preceding IV for the given key”
>> > > >
>> > > > My main point is that reusing any IV with the same key is a problem.
>> > The
>> > > > spec must explicitly state that no IV can be used with the same key.
>> > > Sounds
>> > > > like you’re confident that the approach you’re using is sufficient,
>> but
>> > > if
>> > > > it isn’t stated in the spec then other implementations might do it
>> > > > incorrectly. Referencing section 8.2.2 or relying on other people
>> > having
>> > > > knowledge that you do is not sufficient for a spec.
>> > > >
>> > > > I also have a minor concern over the amount of crypto random that
>> this
>> > > may
>> > > > require because Parquet requires so many encryption streams, but
>> that’s
>> > > an
>> > > > implementation detail and doesn’t need to block the spec.
>> > > >
>> > > > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Ryan,
>> > > > >
>> > > > > Thank you for the clear and detailed feedback.
>> > > > >
>> > > > > 1. Using a new magic string (PARE) for files with a plaintext
>> footer
>> > > > would
>> > > > > not allow legacy readers to parse these files.
>> > > > >
>> > > > > 2. Support for multiple encryption algorithms is one of the
>> > > > differentiating
>> > > > > features of Parquet encryption, therefore having a second cipher
>> > > upfront
>> > > > is
>> > > > > significant.
>> > > > > The design/review overhead of adding a CTR-based cipher to GCM  is
>> > > > minimal.
>> > > > > In fact, CTR is a subset of GCM, because the latter is built on
>> CTR
>> > > (and
>> > > > > GMAC).
>> > > > > Since you explicitly mark your suggestion as a minor note, I
>> propose
>> > we
>> > > > > keep this part unchanged.
>> > > > >
>> > > > > 3.  EncryptedColumnChunk structure will be identical to a
>> > ColumnChunk.
>> > > > The
>> > > > > reasons are:
>> > > > > - In both EF and PF modes (encrypted/plaintext footer), the
>> > > > ColumnMetaData
>> > > > > is not always separated.
>> > > > > It is kept inside ColumnChunk if the column is plaintext,  or
>> > encrypted
>> > > > > with the footer key in the EF mode. No reason to separate it then.
>> > > > > - ColumnCryptoMetaData is needed in ColumnChunk to support column
>> > > > > encryption in PF mode (and in EF mode).
>> > > > > - We can make the file_offset an optional field. But actually it
>> is
>> > not
>> > > > set
>> > > > > anyway in today's Parquet, which works only with the optional
>> > meta_data
>> > > > > field. So, this reason alone shouldn't be enough to create a new
>> > Thrift
>> > > > > structure that fully replicates an existing one, save for
>> optional vs
>> > > > > required classifier in one field.
>> > > > >
>> > > > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This
>> > > subject
>> > > > > had been extensively discussed by community during the design
>> review:
>> > > > > - If you mean the free field option there - its inapplicable for
>> the
>> > > > 96-bit
>> > > > > IV we are using.
>> > > > > Per the spec, 96-bit IV is fully random, 0 free field length. 96
>> bit
>> > > is a
>> > > > > recommended length for AES-GCM IV, see discussion at
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
>> > > > > eg "*For GCM a 12 byte IV is strongly suggested as other IV
>> lengths
>> > > will
>> > > > > require additional calculations.*"
>> > > > > - If you mean the 8.2.1 section, with deterministic construction
>> > > option -
>> > > > > its highly problematic, since deterministic IV prefix ("fixed
>> field")
>> > > > > "*shall
>> > > > > identify the device, or, more generally, the context for the
>> instance
>> > > of
>> > > > > the authenticated encryption function*". This is not trivial in a
>> > > single
>> > > > > process, and even more challenging in distributed frameworks like
>> > Spark
>> > > > > ("*no
>> > > > > two distinct devices shall share the same fixed field*").
>> Moreover,
>> > > this
>> > > > > can be unsafe - a mistake in prefix creation can lead to identical
>> > IVs
>> > > -
>> > > > > which breaks the encryption.
>> > > > > - The good news is we don't need these complications. The 8.2.2
>> > allows
>> > > us
>> > > > > to have 4 billion random IVs for a single key before any collision
>> > > would
>> > > > > happen in IVs that are open to attackers. With Parquet modular
>> > > > encryption,
>> > > > > the number is more like 400 billion, because most (>99%) of IVs
>> are
>> > > > hidden
>> > > > > from an attacker - their offset is encrypted with the
>> column/footer
>> > > keys.
>> > > > > Of course, we use the SecureRandom class, seeded properly.
>> > > > > - Besides checking the NIST spec and crypto forums, I work
>> directly
>> > > with
>> > > > > cipher technology experts - who confirm the Parquet approach to IV
>> > > > > generation is the right thing to do in distributed systems.
>> > > > >
>> > > > >
>> > > > > All other suggestions sound good, and are not hard to add to the
>> > spec.
>> > > > I'll
>> > > > > update the document accordingly.
>> > > > >
>> > > > > Cheers, Gidon.
>> > > > >
>> > > > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue
>> <rblue@netflix.com.invalid
>> > >
>> > > > > wrote:
>> > > > >
>> > > > > > After thinking about this more last night, I think it also
>> needs to
>> > > be
>> > > > > more
>> > > > > > careful around IV generation. The spec just says that
>> > implementations
>> > > > > > should use RBG-based IV construction, but it needs to be more
>> > > explicit
>> > > > > that
>> > > > > > no two GCM streams use the same IV. If an IV is reused, then
>> there
>> > > is a
>> > > > > > possible plaintext leak when blocks are padded with a
>> predictable
>> > > > > pattern.
>> > > > > >
>> > > > > > I think the spec should state that an initial IV should be
>> > generated
>> > > > > using
>> > > > > > secure random as described by 8.2.2 option 1 and the IV for each
>> > > > > encrypted
>> > > > > > stream should be generated using 8.2.2 option 2 until there is a
>> > > > > collision,
>> > > > > > when the base IV should be regenerated using 8.2.2 option 1.
>> That
>> > > will
>> > > > > > ensure that each IV is unique, but that Parquet doesn't consume
>> a
>> > > huge
>> > > > > > amount of crypto randomness.
>> > > > > >
>> > > > > > rb
>> > > > > >
>> > > > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com>
>> > wrote:
>> > > > > >
>> > > > > > > -1
>> > > > > > >
>> > > > > > >    - I don’t think that it contains enough information for
>> > someone
>> > > to
>> > > > > > >    successfully implement the proposal. For example, this
>> > > references
>> > > > > > thrift
>> > > > > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t
>> > introduce
>> > > > them
>> > > > > > and
>> > > > > > >    explain what they are or detail what they contain.
>> Similarly,
>> > > > > > >    ColumnCryptoMetaData and its sub-types are not defined. It
>> > looks
>> > > > > like
>> > > > > > this
>> > > > > > >    spec relies on thrift, but I think that it should be
>> > > > self-contained.
>> > > > > > >    - It should be more clear exactly how to construct each
>> GCM or
>> > > CTR
>> > > > > > >    encrypted buffer. Is it necessary to specify padding for
>> > > AES-GCM?
>> > > > I
>> > > > > > see
>> > > > > > >    from a quick google search that some implementations do,
>> but
>> > > > there’s
>> > > > > > >    nothing in this doc.
>> > > > > > >    - I don’t think that this explains enough of the intent
>> that
>> > is
>> > > > > behind
>> > > > > > >    the specification. For example, “Key metadata is a
>> free-form
>> > > byte
>> > > > > > array
>> > > > > > >    that can be used by a reader to retrieve the column
>> encryption
>> > > > key”
>> > > > > > should
>> > > > > > >    be more clear about what this byte array actually contains
>> or
>> > > > > > represents.
>> > > > > > >    Will it identify a key that can be fetched from a store?
>> Will
>> > it
>> > > > be
>> > > > > a
>> > > > > > >    randomly-generated key stored encrypted by another key?
>> > > > > > >    - This changes the definition of ColumnChunk’s file_offset
>> > field
>> > > > and
>> > > > > > >    the location of ColumnMetaData in some cases. Instead of
>> > > > repurposing
>> > > > > > thrift
>> > > > > > >    structures, the spec should introduce new ones, like an
>> > > > > > >    EncryptedColumnChunk that preserves the meaning of fields
>> in
>> > > > > > ColumnChunk
>> > > > > > >    but separates out ColumnMetaData.
>> > > > > > >    - This contains details that aren’t required, like using
>> the
>> > > > > > >    “.parquet.encrypted” file extension.
>> > > > > > >    - The only time the new magic bytes are mentioned is after
>> the
>> > > > > footer
>> > > > > > >    and encryption metadata, but the diagram shows that the
>> first
>> > > > bytes
>> > > > > > in the
>> > > > > > >    file are updated as well. This is also only in the
>> encrypted
>> > > > footer
>> > > > > > mode.
>> > > > > > >    Should PARE magic bytes be used in plaintext footer mode?
>> > > > > > >
>> > > > > > > Minor note: I would also prefer to vote on GCM, leaving out
>> CTR
>> > for
>> > > > now
>> > > > > > > and adding it once the GCM spec is finished. That way we can
>> > > > > concentrate
>> > > > > > on
>> > > > > > > a single cipher mode instead of thinking about multiple modes
>> at
>> > > > once.
>> > > > > > >
>> > > > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
>> > > > > <szonyi@cloudera.com.invalid
>> > > > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > >> +1 (non-binding)
>> > > > > > >>
>> > > > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
>> > > > > > >> <nk...@cloudera.com.invalid>
>> > > > > > >> wrote:
>> > > > > > >>
>> > > > > > >> > +1 (non-binding)
>> > > > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com>
>> > > wrote:
>> > > > > > >> > >
>> > > > > > >> > > +1 (non-binding)
>> > > > > > >> > >
>> > > > > > >> > >
>> > > > > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
>> > > > 下午4:46写道:
>> > > > > > >> > >
>> > > > > > >> > > > +1 (binding)
>> > > > > > >> > > >
>> > > > > > >> > > > Cheers,
>> > > > > > >> > > >
>> > > > > > >> > > > Zoltan
>> > > > > > >> > > >
>> > > > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
>> > > > > > gg5070@gmail.com
>> > > > > > >> >
>> > > > > > >> > > > wrote:
>> > > > > > >> > > >
>> > > > > > >> > > > > Hello Parquet developers,
>> > > > > > >> > > > >
>> > > > > > >> > > > > Per the last sync discussion, it is time to call for
>> a
>> > > vote
>> > > > on
>> > > > > > the
>> > > > > > >> > > > Parquet
>> > > > > > >> > > > > Modular Encryption design sign-off. The design doc
>> can
>> > be
>> > > > > found
>> > > > > > at
>> > > > > > >> > the
>> > > > > > >> > > > > encryption branch of the parquet-format repository,
>> > > > > > >> > > > >
>> > > > > > >> >
>> > > > > >
>> > >
>> https://github.com/apache/parquet-format/blob/encryption/Encryption.md
>> > > > .
>> > > > > > >> > > > >
>> > > > > > >> > > > > The design is stable by now. This work had started 10
>> > > months
>> > > > > > ago,
>> > > > > > >> has
>> > > > > > >> > > > been
>> > > > > > >> > > > > extensively reviewed -  and implemented (in Java,
>> > > partially
>> > > > in
>> > > > > > >> C++),
>> > > > > > >> > by a
>> > > > > > >> > > > > number of folks from different companies. To continue
>> > with
>> > > > the
>> > > > > > >> > > > > implementation pull requests, we need the design to
>> be
>> > > > > formally
>> > > > > > >> > signed
>> > > > > > >> > > > off
>> > > > > > >> > > > > by the community.
>> > > > > > >> > > > >
>> > > > > > >> > > > > Cheers, Gidon
>> > > > > > >> > > > >
>> > > > > > >> > > >
>> > > > > > >> > >
>> > > > > > >> > >
>> > > > > > >> > > --
>> > > > > > >> > > Thanks & Best Regards
>> > > > > > >> >
>> > > > > > >>
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Ryan Blue
>> > > > > > > Software Engineer
>> > > > > > > Netflix
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Ryan Blue
>> > > > > > Software Engineer
>> > > > > > Netflix
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Ryan Blue
>> > > > Software Engineer
>> > > > Netflix
>> > > >
>> > >
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>> >
>>
>>
>> --
>> Xinli Shang
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Xinli, thanks, this is a team / community effort. E.g., in the latest
round, I had many constructive comments from Ryan and Zoltan.
The design and document got better as the result.

Ryan, its been a while since then. Let us know if the doc is ok now. We
will resume the work once the document is signed off.

Cheers, Gidon.


On Fri, Nov 30, 2018 at 10:48 PM Xinli shang <sh...@uber.com.invalid>
wrote:

> + 1 (non-binding). The latest one looks good to me! Thanks, Gidon!
>
> On Thu, Oct 25, 2018 at 4:17 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Thanks, Gidon! I'll have a look as soon as I get some free time.
> >
> > On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
> >
> > > Ryan,
> > >
> > > Your suggestions have been addressed, please check out the
> > > https://github.com/apache/parquet-format/pull/114/files
> > >
> > > Cheers, Gidon.
> > >
> > >
> > > ---------- Forwarded message ---------
> > > From: Gidon Gershinsky <gg...@gmail.com>
> > > Date: Tue, Oct 23, 2018 at 10:32 AM
> > > Subject: Re: [VOTE] Modular Encryption design sign-off
> > > To: <de...@parquet.apache.org>
> > >
> > >
> > > Regarding the EncryptedColumnChunk, we might be talking about two
> > different
> > > things, looks like we need a deeper tech dive to sync up and get to the
> > > bottom of this.
> > > I'll <> send you a separate mail on this subject, but no doubt we'll
> > > figure it out one way or another.
> > >
> > > It is the only discussion point we have left, since all other
> suggestions
> > > sound good to me - the md doc can certainly use a substantial addition
> of
> > > explicit details for every feature
> > > of the encryption format, to make it a full specification document.
> > >
> > > Cheers, Gidon.
> > >
> > > On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid>
> > > wrote:
> > >
> > > > Using a new magic string (PARE) for files with a plaintext footer
> would
> > > not
> > > > allow legacy readers to parse these files.
> > > > This should be clear from reading the spec. I see that the plaintext
> > > footer
> > > > section states “Then the footer is written as usual, followed by …,
> > and a
> > > > final magic string, ‘PAR1’” but, like the other section, this is not
> > > > clearly called out and nothing states that this is applies to the
> > header
> > > > magic as well. The spec needs to include a statement like “the magic
> > > bytes
> > > > for plaintext footer mode are ‘PAR1’ to allow older readers to read
> > > > projections of the file that do not include encrypted columns”.
> > > Similarly,
> > > > the other section needs to clearly state what the magic string is for
> > > > encrypted footers.
> > > >
> > > > Support for multiple encryption algorithms is one of the
> > differentiating
> > > > features of Parquet encryption, therefore having a second cipher
> > upfront
> > > is
> > > > significant.
> > > >
> > > > That sounds okay to me since CTR is a subset of GCM so it isn’t that
> > > large
> > > > of a change. The spec does need to be explicit about exactly how the
> > two
> > > > relate. For example, CTR resources tend to use the term “nonce” where
> > GCM
> > > > uses IV, so it needs to be explicitly stated that the GCM IV is used
> as
> > > the
> > > > CTR nonce.
> > > >
> > > > EncryptedColumnChunk structure will be identical to a ColumnChunk.
> The
> > > > reasons are: …
> > > >
> > > > These structures should be defined in the encryption spec. Some
> details
> > > can
> > > > be included by reference, like “the remaining fields match those in
> > > > ColumnMetaData” but there isn’t enough information in the current
> spec
> > > > draft.
> > > >
> > > > My point about file_offset is that this field already has a use and a
> > > > definition and this spec should not change its use or purpose. If you
> > > need
> > > > a field for the location of an encrypted ColumnMetaData, then
> > > > EncryptedColumnChunk should add one.
> > > >
> > > > EncryptedColumnChunk structure will be identical to a ColumnChunk. …
> > > > ColumnCryptoMetaData is needed in ColumnChunk
> > > >
> > > > How is EncryptedColumnChunk identical to ColumnChunk if it also
> > includes
> > > an
> > > > extra structure for crypto?
> > > >
> > > > There are no options 1,2 in the NIST spec section 8.2.2.
> > > >
> > > > I’m referring to this section: “The random field shall either consist
> > of
> > > 1)
> > > > an output string of r(i) bits from an approved RBG with a sufficient
> > > > security strength, or 2) the result of applying the r(i)–bit
> > incrementing
> > > > function to the random field of the preceding IV for the given key”
> > > >
> > > > My main point is that reusing any IV with the same key is a problem.
> > The
> > > > spec must explicitly state that no IV can be used with the same key.
> > > Sounds
> > > > like you’re confident that the approach you’re using is sufficient,
> but
> > > if
> > > > it isn’t stated in the spec then other implementations might do it
> > > > incorrectly. Referencing section 8.2.2 or relying on other people
> > having
> > > > knowledge that you do is not sufficient for a spec.
> > > >
> > > > I also have a minor concern over the amount of crypto random that
> this
> > > may
> > > > require because Parquet requires so many encryption streams, but
> that’s
> > > an
> > > > implementation detail and doesn’t need to block the spec.
> > > >
> > > > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com>
> > > wrote:
> > > >
> > > > > Ryan,
> > > > >
> > > > > Thank you for the clear and detailed feedback.
> > > > >
> > > > > 1. Using a new magic string (PARE) for files with a plaintext
> footer
> > > > would
> > > > > not allow legacy readers to parse these files.
> > > > >
> > > > > 2. Support for multiple encryption algorithms is one of the
> > > > differentiating
> > > > > features of Parquet encryption, therefore having a second cipher
> > > upfront
> > > > is
> > > > > significant.
> > > > > The design/review overhead of adding a CTR-based cipher to GCM  is
> > > > minimal.
> > > > > In fact, CTR is a subset of GCM, because the latter is built on CTR
> > > (and
> > > > > GMAC).
> > > > > Since you explicitly mark your suggestion as a minor note, I
> propose
> > we
> > > > > keep this part unchanged.
> > > > >
> > > > > 3.  EncryptedColumnChunk structure will be identical to a
> > ColumnChunk.
> > > > The
> > > > > reasons are:
> > > > > - In both EF and PF modes (encrypted/plaintext footer), the
> > > > ColumnMetaData
> > > > > is not always separated.
> > > > > It is kept inside ColumnChunk if the column is plaintext,  or
> > encrypted
> > > > > with the footer key in the EF mode. No reason to separate it then.
> > > > > - ColumnCryptoMetaData is needed in ColumnChunk to support column
> > > > > encryption in PF mode (and in EF mode).
> > > > > - We can make the file_offset an optional field. But actually it is
> > not
> > > > set
> > > > > anyway in today's Parquet, which works only with the optional
> > meta_data
> > > > > field. So, this reason alone shouldn't be enough to create a new
> > Thrift
> > > > > structure that fully replicates an existing one, save for optional
> vs
> > > > > required classifier in one field.
> > > > >
> > > > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This
> > > subject
> > > > > had been extensively discussed by community during the design
> review:
> > > > > - If you mean the free field option there - its inapplicable for
> the
> > > > 96-bit
> > > > > IV we are using.
> > > > > Per the spec, 96-bit IV is fully random, 0 free field length. 96
> bit
> > > is a
> > > > > recommended length for AES-GCM IV, see discussion at
> > > > >
> > > > >
> > > >
> > >
> >
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> > > > > eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths
> > > will
> > > > > require additional calculations.*"
> > > > > - If you mean the 8.2.1 section, with deterministic construction
> > > option -
> > > > > its highly problematic, since deterministic IV prefix ("fixed
> field")
> > > > > "*shall
> > > > > identify the device, or, more generally, the context for the
> instance
> > > of
> > > > > the authenticated encryption function*". This is not trivial in a
> > > single
> > > > > process, and even more challenging in distributed frameworks like
> > Spark
> > > > > ("*no
> > > > > two distinct devices shall share the same fixed field*").
> Moreover,
> > > this
> > > > > can be unsafe - a mistake in prefix creation can lead to identical
> > IVs
> > > -
> > > > > which breaks the encryption.
> > > > > - The good news is we don't need these complications. The 8.2.2
> > allows
> > > us
> > > > > to have 4 billion random IVs for a single key before any collision
> > > would
> > > > > happen in IVs that are open to attackers. With Parquet modular
> > > > encryption,
> > > > > the number is more like 400 billion, because most (>99%) of IVs are
> > > > hidden
> > > > > from an attacker - their offset is encrypted with the column/footer
> > > keys.
> > > > > Of course, we use the SecureRandom class, seeded properly.
> > > > > - Besides checking the NIST spec and crypto forums, I work directly
> > > with
> > > > > cipher technology experts - who confirm the Parquet approach to IV
> > > > > generation is the right thing to do in distributed systems.
> > > > >
> > > > >
> > > > > All other suggestions sound good, and are not hard to add to the
> > spec.
> > > > I'll
> > > > > update the document accordingly.
> > > > >
> > > > > Cheers, Gidon.
> > > > >
> > > > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue
> <rblue@netflix.com.invalid
> > >
> > > > > wrote:
> > > > >
> > > > > > After thinking about this more last night, I think it also needs
> to
> > > be
> > > > > more
> > > > > > careful around IV generation. The spec just says that
> > implementations
> > > > > > should use RBG-based IV construction, but it needs to be more
> > > explicit
> > > > > that
> > > > > > no two GCM streams use the same IV. If an IV is reused, then
> there
> > > is a
> > > > > > possible plaintext leak when blocks are padded with a predictable
> > > > > pattern.
> > > > > >
> > > > > > I think the spec should state that an initial IV should be
> > generated
> > > > > using
> > > > > > secure random as described by 8.2.2 option 1 and the IV for each
> > > > > encrypted
> > > > > > stream should be generated using 8.2.2 option 2 until there is a
> > > > > collision,
> > > > > > when the base IV should be regenerated using 8.2.2 option 1. That
> > > will
> > > > > > ensure that each IV is unique, but that Parquet doesn't consume a
> > > huge
> > > > > > amount of crypto randomness.
> > > > > >
> > > > > > rb
> > > > > >
> > > > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com>
> > wrote:
> > > > > >
> > > > > > > -1
> > > > > > >
> > > > > > >    - I don’t think that it contains enough information for
> > someone
> > > to
> > > > > > >    successfully implement the proposal. For example, this
> > > references
> > > > > > thrift
> > > > > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t
> > introduce
> > > > them
> > > > > > and
> > > > > > >    explain what they are or detail what they contain.
> Similarly,
> > > > > > >    ColumnCryptoMetaData and its sub-types are not defined. It
> > looks
> > > > > like
> > > > > > this
> > > > > > >    spec relies on thrift, but I think that it should be
> > > > self-contained.
> > > > > > >    - It should be more clear exactly how to construct each GCM
> or
> > > CTR
> > > > > > >    encrypted buffer. Is it necessary to specify padding for
> > > AES-GCM?
> > > > I
> > > > > > see
> > > > > > >    from a quick google search that some implementations do, but
> > > > there’s
> > > > > > >    nothing in this doc.
> > > > > > >    - I don’t think that this explains enough of the intent that
> > is
> > > > > behind
> > > > > > >    the specification. For example, “Key metadata is a free-form
> > > byte
> > > > > > array
> > > > > > >    that can be used by a reader to retrieve the column
> encryption
> > > > key”
> > > > > > should
> > > > > > >    be more clear about what this byte array actually contains
> or
> > > > > > represents.
> > > > > > >    Will it identify a key that can be fetched from a store?
> Will
> > it
> > > > be
> > > > > a
> > > > > > >    randomly-generated key stored encrypted by another key?
> > > > > > >    - This changes the definition of ColumnChunk’s file_offset
> > field
> > > > and
> > > > > > >    the location of ColumnMetaData in some cases. Instead of
> > > > repurposing
> > > > > > thrift
> > > > > > >    structures, the spec should introduce new ones, like an
> > > > > > >    EncryptedColumnChunk that preserves the meaning of fields in
> > > > > > ColumnChunk
> > > > > > >    but separates out ColumnMetaData.
> > > > > > >    - This contains details that aren’t required, like using the
> > > > > > >    “.parquet.encrypted” file extension.
> > > > > > >    - The only time the new magic bytes are mentioned is after
> the
> > > > > footer
> > > > > > >    and encryption metadata, but the diagram shows that the
> first
> > > > bytes
> > > > > > in the
> > > > > > >    file are updated as well. This is also only in the encrypted
> > > > footer
> > > > > > mode.
> > > > > > >    Should PARE magic bytes be used in plaintext footer mode?
> > > > > > >
> > > > > > > Minor note: I would also prefer to vote on GCM, leaving out CTR
> > for
> > > > now
> > > > > > > and adding it once the GCM spec is finished. That way we can
> > > > > concentrate
> > > > > > on
> > > > > > > a single cipher mode instead of thinking about multiple modes
> at
> > > > once.
> > > > > > >
> > > > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> > > > > <szonyi@cloudera.com.invalid
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> +1 (non-binding)
> > > > > > >>
> > > > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > > > > > >> <nk...@cloudera.com.invalid>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > +1 (non-binding)
> > > > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com>
> > > wrote:
> > > > > > >> > >
> > > > > > >> > > +1 (non-binding)
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
> > > > 下午4:46写道:
> > > > > > >> > >
> > > > > > >> > > > +1 (binding)
> > > > > > >> > > >
> > > > > > >> > > > Cheers,
> > > > > > >> > > >
> > > > > > >> > > > Zoltan
> > > > > > >> > > >
> > > > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > > > > > gg5070@gmail.com
> > > > > > >> >
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hello Parquet developers,
> > > > > > >> > > > >
> > > > > > >> > > > > Per the last sync discussion, it is time to call for a
> > > vote
> > > > on
> > > > > > the
> > > > > > >> > > > Parquet
> > > > > > >> > > > > Modular Encryption design sign-off. The design doc can
> > be
> > > > > found
> > > > > > at
> > > > > > >> > the
> > > > > > >> > > > > encryption branch of the parquet-format repository,
> > > > > > >> > > > >
> > > > > > >> >
> > > > > >
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > > .
> > > > > > >> > > > >
> > > > > > >> > > > > The design is stable by now. This work had started 10
> > > months
> > > > > > ago,
> > > > > > >> has
> > > > > > >> > > > been
> > > > > > >> > > > > extensively reviewed -  and implemented (in Java,
> > > partially
> > > > in
> > > > > > >> C++),
> > > > > > >> > by a
> > > > > > >> > > > > number of folks from different companies. To continue
> > with
> > > > the
> > > > > > >> > > > > implementation pull requests, we need the design to be
> > > > > formally
> > > > > > >> > signed
> > > > > > >> > > > off
> > > > > > >> > > > > by the community.
> > > > > > >> > > > >
> > > > > > >> > > > > Cheers, Gidon
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > --
> > > > > > >> > > Thanks & Best Regards
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
>
> --
> Xinli Shang
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Xinli shang <sh...@uber.com.INVALID>.
+ 1 (non-binding). The latest one looks good to me! Thanks, Gidon!

On Thu, Oct 25, 2018 at 4:17 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks, Gidon! I'll have a look as soon as I get some free time.
>
> On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Ryan,
> >
> > Your suggestions have been addressed, please check out the
> > https://github.com/apache/parquet-format/pull/114/files
> >
> > Cheers, Gidon.
> >
> >
> > ---------- Forwarded message ---------
> > From: Gidon Gershinsky <gg...@gmail.com>
> > Date: Tue, Oct 23, 2018 at 10:32 AM
> > Subject: Re: [VOTE] Modular Encryption design sign-off
> > To: <de...@parquet.apache.org>
> >
> >
> > Regarding the EncryptedColumnChunk, we might be talking about two
> different
> > things, looks like we need a deeper tech dive to sync up and get to the
> > bottom of this.
> > I'll <> send you a separate mail on this subject, but no doubt we'll
> > figure it out one way or another.
> >
> > It is the only discussion point we have left, since all other suggestions
> > sound good to me - the md doc can certainly use a substantial addition of
> > explicit details for every feature
> > of the encryption format, to make it a full specification document.
> >
> > Cheers, Gidon.
> >
> > On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > Using a new magic string (PARE) for files with a plaintext footer would
> > not
> > > allow legacy readers to parse these files.
> > > This should be clear from reading the spec. I see that the plaintext
> > footer
> > > section states “Then the footer is written as usual, followed by …,
> and a
> > > final magic string, ‘PAR1’” but, like the other section, this is not
> > > clearly called out and nothing states that this is applies to the
> header
> > > magic as well. The spec needs to include a statement like “the magic
> > bytes
> > > for plaintext footer mode are ‘PAR1’ to allow older readers to read
> > > projections of the file that do not include encrypted columns”.
> > Similarly,
> > > the other section needs to clearly state what the magic string is for
> > > encrypted footers.
> > >
> > > Support for multiple encryption algorithms is one of the
> differentiating
> > > features of Parquet encryption, therefore having a second cipher
> upfront
> > is
> > > significant.
> > >
> > > That sounds okay to me since CTR is a subset of GCM so it isn’t that
> > large
> > > of a change. The spec does need to be explicit about exactly how the
> two
> > > relate. For example, CTR resources tend to use the term “nonce” where
> GCM
> > > uses IV, so it needs to be explicitly stated that the GCM IV is used as
> > the
> > > CTR nonce.
> > >
> > > EncryptedColumnChunk structure will be identical to a ColumnChunk. The
> > > reasons are: …
> > >
> > > These structures should be defined in the encryption spec. Some details
> > can
> > > be included by reference, like “the remaining fields match those in
> > > ColumnMetaData” but there isn’t enough information in the current spec
> > > draft.
> > >
> > > My point about file_offset is that this field already has a use and a
> > > definition and this spec should not change its use or purpose. If you
> > need
> > > a field for the location of an encrypted ColumnMetaData, then
> > > EncryptedColumnChunk should add one.
> > >
> > > EncryptedColumnChunk structure will be identical to a ColumnChunk. …
> > > ColumnCryptoMetaData is needed in ColumnChunk
> > >
> > > How is EncryptedColumnChunk identical to ColumnChunk if it also
> includes
> > an
> > > extra structure for crypto?
> > >
> > > There are no options 1,2 in the NIST spec section 8.2.2.
> > >
> > > I’m referring to this section: “The random field shall either consist
> of
> > 1)
> > > an output string of r(i) bits from an approved RBG with a sufficient
> > > security strength, or 2) the result of applying the r(i)–bit
> incrementing
> > > function to the random field of the preceding IV for the given key”
> > >
> > > My main point is that reusing any IV with the same key is a problem.
> The
> > > spec must explicitly state that no IV can be used with the same key.
> > Sounds
> > > like you’re confident that the approach you’re using is sufficient, but
> > if
> > > it isn’t stated in the spec then other implementations might do it
> > > incorrectly. Referencing section 8.2.2 or relying on other people
> having
> > > knowledge that you do is not sufficient for a spec.
> > >
> > > I also have a minor concern over the amount of crypto random that this
> > may
> > > require because Parquet requires so many encryption streams, but that’s
> > an
> > > implementation detail and doesn’t need to block the spec.
> > >
> > > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com>
> > wrote:
> > >
> > > > Ryan,
> > > >
> > > > Thank you for the clear and detailed feedback.
> > > >
> > > > 1. Using a new magic string (PARE) for files with a plaintext footer
> > > would
> > > > not allow legacy readers to parse these files.
> > > >
> > > > 2. Support for multiple encryption algorithms is one of the
> > > differentiating
> > > > features of Parquet encryption, therefore having a second cipher
> > upfront
> > > is
> > > > significant.
> > > > The design/review overhead of adding a CTR-based cipher to GCM  is
> > > minimal.
> > > > In fact, CTR is a subset of GCM, because the latter is built on CTR
> > (and
> > > > GMAC).
> > > > Since you explicitly mark your suggestion as a minor note, I propose
> we
> > > > keep this part unchanged.
> > > >
> > > > 3.  EncryptedColumnChunk structure will be identical to a
> ColumnChunk.
> > > The
> > > > reasons are:
> > > > - In both EF and PF modes (encrypted/plaintext footer), the
> > > ColumnMetaData
> > > > is not always separated.
> > > > It is kept inside ColumnChunk if the column is plaintext,  or
> encrypted
> > > > with the footer key in the EF mode. No reason to separate it then.
> > > > - ColumnCryptoMetaData is needed in ColumnChunk to support column
> > > > encryption in PF mode (and in EF mode).
> > > > - We can make the file_offset an optional field. But actually it is
> not
> > > set
> > > > anyway in today's Parquet, which works only with the optional
> meta_data
> > > > field. So, this reason alone shouldn't be enough to create a new
> Thrift
> > > > structure that fully replicates an existing one, save for optional vs
> > > > required classifier in one field.
> > > >
> > > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This
> > subject
> > > > had been extensively discussed by community during the design review:
> > > > - If you mean the free field option there - its inapplicable for the
> > > 96-bit
> > > > IV we are using.
> > > > Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit
> > is a
> > > > recommended length for AES-GCM IV, see discussion at
> > > >
> > > >
> > >
> >
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> > > > eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths
> > will
> > > > require additional calculations.*"
> > > > - If you mean the 8.2.1 section, with deterministic construction
> > option -
> > > > its highly problematic, since deterministic IV prefix ("fixed field")
> > > > "*shall
> > > > identify the device, or, more generally, the context for the instance
> > of
> > > > the authenticated encryption function*". This is not trivial in a
> > single
> > > > process, and even more challenging in distributed frameworks like
> Spark
> > > > ("*no
> > > > two distinct devices shall share the same fixed field*").  Moreover,
> > this
> > > > can be unsafe - a mistake in prefix creation can lead to identical
> IVs
> > -
> > > > which breaks the encryption.
> > > > - The good news is we don't need these complications. The 8.2.2
> allows
> > us
> > > > to have 4 billion random IVs for a single key before any collision
> > would
> > > > happen in IVs that are open to attackers. With Parquet modular
> > > encryption,
> > > > the number is more like 400 billion, because most (>99%) of IVs are
> > > hidden
> > > > from an attacker - their offset is encrypted with the column/footer
> > keys.
> > > > Of course, we use the SecureRandom class, seeded properly.
> > > > - Besides checking the NIST spec and crypto forums, I work directly
> > with
> > > > cipher technology experts - who confirm the Parquet approach to IV
> > > > generation is the right thing to do in distributed systems.
> > > >
> > > >
> > > > All other suggestions sound good, and are not hard to add to the
> spec.
> > > I'll
> > > > update the document accordingly.
> > > >
> > > > Cheers, Gidon.
> > > >
> > > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rblue@netflix.com.invalid
> >
> > > > wrote:
> > > >
> > > > > After thinking about this more last night, I think it also needs to
> > be
> > > > more
> > > > > careful around IV generation. The spec just says that
> implementations
> > > > > should use RBG-based IV construction, but it needs to be more
> > explicit
> > > > that
> > > > > no two GCM streams use the same IV. If an IV is reused, then there
> > is a
> > > > > possible plaintext leak when blocks are padded with a predictable
> > > > pattern.
> > > > >
> > > > > I think the spec should state that an initial IV should be
> generated
> > > > using
> > > > > secure random as described by 8.2.2 option 1 and the IV for each
> > > > encrypted
> > > > > stream should be generated using 8.2.2 option 2 until there is a
> > > > collision,
> > > > > when the base IV should be regenerated using 8.2.2 option 1. That
> > will
> > > > > ensure that each IV is unique, but that Parquet doesn't consume a
> > huge
> > > > > amount of crypto randomness.
> > > > >
> > > > > rb
> > > > >
> > > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com>
> wrote:
> > > > >
> > > > > > -1
> > > > > >
> > > > > >    - I don’t think that it contains enough information for
> someone
> > to
> > > > > >    successfully implement the proposal. For example, this
> > references
> > > > > thrift
> > > > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t
> introduce
> > > them
> > > > > and
> > > > > >    explain what they are or detail what they contain. Similarly,
> > > > > >    ColumnCryptoMetaData and its sub-types are not defined. It
> looks
> > > > like
> > > > > this
> > > > > >    spec relies on thrift, but I think that it should be
> > > self-contained.
> > > > > >    - It should be more clear exactly how to construct each GCM or
> > CTR
> > > > > >    encrypted buffer. Is it necessary to specify padding for
> > AES-GCM?
> > > I
> > > > > see
> > > > > >    from a quick google search that some implementations do, but
> > > there’s
> > > > > >    nothing in this doc.
> > > > > >    - I don’t think that this explains enough of the intent that
> is
> > > > behind
> > > > > >    the specification. For example, “Key metadata is a free-form
> > byte
> > > > > array
> > > > > >    that can be used by a reader to retrieve the column encryption
> > > key”
> > > > > should
> > > > > >    be more clear about what this byte array actually contains or
> > > > > represents.
> > > > > >    Will it identify a key that can be fetched from a store? Will
> it
> > > be
> > > > a
> > > > > >    randomly-generated key stored encrypted by another key?
> > > > > >    - This changes the definition of ColumnChunk’s file_offset
> field
> > > and
> > > > > >    the location of ColumnMetaData in some cases. Instead of
> > > repurposing
> > > > > thrift
> > > > > >    structures, the spec should introduce new ones, like an
> > > > > >    EncryptedColumnChunk that preserves the meaning of fields in
> > > > > ColumnChunk
> > > > > >    but separates out ColumnMetaData.
> > > > > >    - This contains details that aren’t required, like using the
> > > > > >    “.parquet.encrypted” file extension.
> > > > > >    - The only time the new magic bytes are mentioned is after the
> > > > footer
> > > > > >    and encryption metadata, but the diagram shows that the first
> > > bytes
> > > > > in the
> > > > > >    file are updated as well. This is also only in the encrypted
> > > footer
> > > > > mode.
> > > > > >    Should PARE magic bytes be used in plaintext footer mode?
> > > > > >
> > > > > > Minor note: I would also prefer to vote on GCM, leaving out CTR
> for
> > > now
> > > > > > and adding it once the GCM spec is finished. That way we can
> > > > concentrate
> > > > > on
> > > > > > a single cipher mode instead of thinking about multiple modes at
> > > once.
> > > > > >
> > > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> > > > <szonyi@cloudera.com.invalid
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> +1 (non-binding)
> > > > > >>
> > > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > > > > >> <nk...@cloudera.com.invalid>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > +1 (non-binding)
> > > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > >> > >
> > > > > >> > > +1 (non-binding)
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
> > > 下午4:46写道:
> > > > > >> > >
> > > > > >> > > > +1 (binding)
> > > > > >> > > >
> > > > > >> > > > Cheers,
> > > > > >> > > >
> > > > > >> > > > Zoltan
> > > > > >> > > >
> > > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > > > > gg5070@gmail.com
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hello Parquet developers,
> > > > > >> > > > >
> > > > > >> > > > > Per the last sync discussion, it is time to call for a
> > vote
> > > on
> > > > > the
> > > > > >> > > > Parquet
> > > > > >> > > > > Modular Encryption design sign-off. The design doc can
> be
> > > > found
> > > > > at
> > > > > >> > the
> > > > > >> > > > > encryption branch of the parquet-format repository,
> > > > > >> > > > >
> > > > > >> >
> > > > >
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > > .
> > > > > >> > > > >
> > > > > >> > > > > The design is stable by now. This work had started 10
> > months
> > > > > ago,
> > > > > >> has
> > > > > >> > > > been
> > > > > >> > > > > extensively reviewed -  and implemented (in Java,
> > partially
> > > in
> > > > > >> C++),
> > > > > >> > by a
> > > > > >> > > > > number of folks from different companies. To continue
> with
> > > the
> > > > > >> > > > > implementation pull requests, we need the design to be
> > > > formally
> > > > > >> > signed
> > > > > >> > > > off
> > > > > >> > > > > by the community.
> > > > > >> > > > >
> > > > > >> > > > > Cheers, Gidon
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > > Thanks & Best Regards
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Xinli Shang

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks, Gidon! I'll have a look as soon as I get some free time.

On Thu, Oct 25, 2018 at 6:20 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Ryan,
>
> Your suggestions have been addressed, please check out the
> https://github.com/apache/parquet-format/pull/114/files
>
> Cheers, Gidon.
>
>
> ---------- Forwarded message ---------
> From: Gidon Gershinsky <gg...@gmail.com>
> Date: Tue, Oct 23, 2018 at 10:32 AM
> Subject: Re: [VOTE] Modular Encryption design sign-off
> To: <de...@parquet.apache.org>
>
>
> Regarding the EncryptedColumnChunk, we might be talking about two different
> things, looks like we need a deeper tech dive to sync up and get to the
> bottom of this.
> I'll <> send you a separate mail on this subject, but no doubt we'll
> figure it out one way or another.
>
> It is the only discussion point we have left, since all other suggestions
> sound good to me - the md doc can certainly use a substantial addition of
> explicit details for every feature
> of the encryption format, to make it a full specification document.
>
> Cheers, Gidon.
>
> On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Using a new magic string (PARE) for files with a plaintext footer would
> not
> > allow legacy readers to parse these files.
> > This should be clear from reading the spec. I see that the plaintext
> footer
> > section states “Then the footer is written as usual, followed by …, and a
> > final magic string, ‘PAR1’” but, like the other section, this is not
> > clearly called out and nothing states that this is applies to the header
> > magic as well. The spec needs to include a statement like “the magic
> bytes
> > for plaintext footer mode are ‘PAR1’ to allow older readers to read
> > projections of the file that do not include encrypted columns”.
> Similarly,
> > the other section needs to clearly state what the magic string is for
> > encrypted footers.
> >
> > Support for multiple encryption algorithms is one of the differentiating
> > features of Parquet encryption, therefore having a second cipher upfront
> is
> > significant.
> >
> > That sounds okay to me since CTR is a subset of GCM so it isn’t that
> large
> > of a change. The spec does need to be explicit about exactly how the two
> > relate. For example, CTR resources tend to use the term “nonce” where GCM
> > uses IV, so it needs to be explicitly stated that the GCM IV is used as
> the
> > CTR nonce.
> >
> > EncryptedColumnChunk structure will be identical to a ColumnChunk. The
> > reasons are: …
> >
> > These structures should be defined in the encryption spec. Some details
> can
> > be included by reference, like “the remaining fields match those in
> > ColumnMetaData” but there isn’t enough information in the current spec
> > draft.
> >
> > My point about file_offset is that this field already has a use and a
> > definition and this spec should not change its use or purpose. If you
> need
> > a field for the location of an encrypted ColumnMetaData, then
> > EncryptedColumnChunk should add one.
> >
> > EncryptedColumnChunk structure will be identical to a ColumnChunk. …
> > ColumnCryptoMetaData is needed in ColumnChunk
> >
> > How is EncryptedColumnChunk identical to ColumnChunk if it also includes
> an
> > extra structure for crypto?
> >
> > There are no options 1,2 in the NIST spec section 8.2.2.
> >
> > I’m referring to this section: “The random field shall either consist of
> 1)
> > an output string of r(i) bits from an approved RBG with a sufficient
> > security strength, or 2) the result of applying the r(i)–bit incrementing
> > function to the random field of the preceding IV for the given key”
> >
> > My main point is that reusing any IV with the same key is a problem. The
> > spec must explicitly state that no IV can be used with the same key.
> Sounds
> > like you’re confident that the approach you’re using is sufficient, but
> if
> > it isn’t stated in the spec then other implementations might do it
> > incorrectly. Referencing section 8.2.2 or relying on other people having
> > knowledge that you do is not sufficient for a spec.
> >
> > I also have a minor concern over the amount of crypto random that this
> may
> > require because Parquet requires so many encryption streams, but that’s
> an
> > implementation detail and doesn’t need to block the spec.
> >
> > On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
> >
> > > Ryan,
> > >
> > > Thank you for the clear and detailed feedback.
> > >
> > > 1. Using a new magic string (PARE) for files with a plaintext footer
> > would
> > > not allow legacy readers to parse these files.
> > >
> > > 2. Support for multiple encryption algorithms is one of the
> > differentiating
> > > features of Parquet encryption, therefore having a second cipher
> upfront
> > is
> > > significant.
> > > The design/review overhead of adding a CTR-based cipher to GCM  is
> > minimal.
> > > In fact, CTR is a subset of GCM, because the latter is built on CTR
> (and
> > > GMAC).
> > > Since you explicitly mark your suggestion as a minor note, I propose we
> > > keep this part unchanged.
> > >
> > > 3.  EncryptedColumnChunk structure will be identical to a ColumnChunk.
> > The
> > > reasons are:
> > > - In both EF and PF modes (encrypted/plaintext footer), the
> > ColumnMetaData
> > > is not always separated.
> > > It is kept inside ColumnChunk if the column is plaintext,  or encrypted
> > > with the footer key in the EF mode. No reason to separate it then.
> > > - ColumnCryptoMetaData is needed in ColumnChunk to support column
> > > encryption in PF mode (and in EF mode).
> > > - We can make the file_offset an optional field. But actually it is not
> > set
> > > anyway in today's Parquet, which works only with the optional meta_data
> > > field. So, this reason alone shouldn't be enough to create a new Thrift
> > > structure that fully replicates an existing one, save for optional vs
> > > required classifier in one field.
> > >
> > > 4. There are no options 1,2 in the NIST spec section 8.2.2. This
> subject
> > > had been extensively discussed by community during the design review:
> > > - If you mean the free field option there - its inapplicable for the
> > 96-bit
> > > IV we are using.
> > > Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit
> is a
> > > recommended length for AES-GCM IV, see discussion at
> > >
> > >
> >
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> > > eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths
> will
> > > require additional calculations.*"
> > > - If you mean the 8.2.1 section, with deterministic construction
> option -
> > > its highly problematic, since deterministic IV prefix ("fixed field")
> > > "*shall
> > > identify the device, or, more generally, the context for the instance
> of
> > > the authenticated encryption function*". This is not trivial in a
> single
> > > process, and even more challenging in distributed frameworks like Spark
> > > ("*no
> > > two distinct devices shall share the same fixed field*").  Moreover,
> this
> > > can be unsafe - a mistake in prefix creation can lead to identical IVs
> -
> > > which breaks the encryption.
> > > - The good news is we don't need these complications. The 8.2.2 allows
> us
> > > to have 4 billion random IVs for a single key before any collision
> would
> > > happen in IVs that are open to attackers. With Parquet modular
> > encryption,
> > > the number is more like 400 billion, because most (>99%) of IVs are
> > hidden
> > > from an attacker - their offset is encrypted with the column/footer
> keys.
> > > Of course, we use the SecureRandom class, seeded properly.
> > > - Besides checking the NIST spec and crypto forums, I work directly
> with
> > > cipher technology experts - who confirm the Parquet approach to IV
> > > generation is the right thing to do in distributed systems.
> > >
> > >
> > > All other suggestions sound good, and are not hard to add to the spec.
> > I'll
> > > update the document accordingly.
> > >
> > > Cheers, Gidon.
> > >
> > > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rb...@netflix.com.invalid>
> > > wrote:
> > >
> > > > After thinking about this more last night, I think it also needs to
> be
> > > more
> > > > careful around IV generation. The spec just says that implementations
> > > > should use RBG-based IV construction, but it needs to be more
> explicit
> > > that
> > > > no two GCM streams use the same IV. If an IV is reused, then there
> is a
> > > > possible plaintext leak when blocks are padded with a predictable
> > > pattern.
> > > >
> > > > I think the spec should state that an initial IV should be generated
> > > using
> > > > secure random as described by 8.2.2 option 1 and the IV for each
> > > encrypted
> > > > stream should be generated using 8.2.2 option 2 until there is a
> > > collision,
> > > > when the base IV should be regenerated using 8.2.2 option 1. That
> will
> > > > ensure that each IV is unique, but that Parquet doesn't consume a
> huge
> > > > amount of crypto randomness.
> > > >
> > > > rb
> > > >
> > > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:
> > > >
> > > > > -1
> > > > >
> > > > >    - I don’t think that it contains enough information for someone
> to
> > > > >    successfully implement the proposal. For example, this
> references
> > > > thrift
> > > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce
> > them
> > > > and
> > > > >    explain what they are or detail what they contain. Similarly,
> > > > >    ColumnCryptoMetaData and its sub-types are not defined. It looks
> > > like
> > > > this
> > > > >    spec relies on thrift, but I think that it should be
> > self-contained.
> > > > >    - It should be more clear exactly how to construct each GCM or
> CTR
> > > > >    encrypted buffer. Is it necessary to specify padding for
> AES-GCM?
> > I
> > > > see
> > > > >    from a quick google search that some implementations do, but
> > there’s
> > > > >    nothing in this doc.
> > > > >    - I don’t think that this explains enough of the intent that is
> > > behind
> > > > >    the specification. For example, “Key metadata is a free-form
> byte
> > > > array
> > > > >    that can be used by a reader to retrieve the column encryption
> > key”
> > > > should
> > > > >    be more clear about what this byte array actually contains or
> > > > represents.
> > > > >    Will it identify a key that can be fetched from a store? Will it
> > be
> > > a
> > > > >    randomly-generated key stored encrypted by another key?
> > > > >    - This changes the definition of ColumnChunk’s file_offset field
> > and
> > > > >    the location of ColumnMetaData in some cases. Instead of
> > repurposing
> > > > thrift
> > > > >    structures, the spec should introduce new ones, like an
> > > > >    EncryptedColumnChunk that preserves the meaning of fields in
> > > > ColumnChunk
> > > > >    but separates out ColumnMetaData.
> > > > >    - This contains details that aren’t required, like using the
> > > > >    “.parquet.encrypted” file extension.
> > > > >    - The only time the new magic bytes are mentioned is after the
> > > footer
> > > > >    and encryption metadata, but the diagram shows that the first
> > bytes
> > > > in the
> > > > >    file are updated as well. This is also only in the encrypted
> > footer
> > > > mode.
> > > > >    Should PARE magic bytes be used in plaintext footer mode?
> > > > >
> > > > > Minor note: I would also prefer to vote on GCM, leaving out CTR for
> > now
> > > > > and adding it once the GCM spec is finished. That way we can
> > > concentrate
> > > > on
> > > > > a single cipher mode instead of thinking about multiple modes at
> > once.
> > > > >
> > > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> > > <szonyi@cloudera.com.invalid
> > > > >
> > > > > wrote:
> > > > >
> > > > >> +1 (non-binding)
> > > > >>
> > > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > > > >> <nk...@cloudera.com.invalid>
> > > > >> wrote:
> > > > >>
> > > > >> > +1 (non-binding)
> > > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > >> > >
> > > > >> > > +1 (non-binding)
> > > > >> > >
> > > > >> > >
> > > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
> > 下午4:46写道:
> > > > >> > >
> > > > >> > > > +1 (binding)
> > > > >> > > >
> > > > >> > > > Cheers,
> > > > >> > > >
> > > > >> > > > Zoltan
> > > > >> > > >
> > > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > > > gg5070@gmail.com
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Hello Parquet developers,
> > > > >> > > > >
> > > > >> > > > > Per the last sync discussion, it is time to call for a
> vote
> > on
> > > > the
> > > > >> > > > Parquet
> > > > >> > > > > Modular Encryption design sign-off. The design doc can be
> > > found
> > > > at
> > > > >> > the
> > > > >> > > > > encryption branch of the parquet-format repository,
> > > > >> > > > >
> > > > >> >
> > > >
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> > .
> > > > >> > > > >
> > > > >> > > > > The design is stable by now. This work had started 10
> months
> > > > ago,
> > > > >> has
> > > > >> > > > been
> > > > >> > > > > extensively reviewed -  and implemented (in Java,
> partially
> > in
> > > > >> C++),
> > > > >> > by a
> > > > >> > > > > number of folks from different companies. To continue with
> > the
> > > > >> > > > > implementation pull requests, we need the design to be
> > > formally
> > > > >> > signed
> > > > >> > > > off
> > > > >> > > > > by the community.
> > > > >> > > > >
> > > > >> > > > > Cheers, Gidon
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> > > --
> > > > >> > > Thanks & Best Regards
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Ryan,

Your suggestions have been addressed, please check out the
https://github.com/apache/parquet-format/pull/114/files

Cheers, Gidon.


---------- Forwarded message ---------
From: Gidon Gershinsky <gg...@gmail.com>
Date: Tue, Oct 23, 2018 at 10:32 AM
Subject: Re: [VOTE] Modular Encryption design sign-off
To: <de...@parquet.apache.org>


Regarding the EncryptedColumnChunk, we might be talking about two different
things, looks like we need a deeper tech dive to sync up and get to the
bottom of this.
I'll <> send you a separate mail on this subject, but no doubt we'll
figure it out one way or another.

It is the only discussion point we have left, since all other suggestions
sound good to me - the md doc can certainly use a substantial addition of
explicit details for every feature
of the encryption format, to make it a full specification document.

Cheers, Gidon.

On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Using a new magic string (PARE) for files with a plaintext footer would not
> allow legacy readers to parse these files.
> This should be clear from reading the spec. I see that the plaintext footer
> section states “Then the footer is written as usual, followed by …, and a
> final magic string, ‘PAR1’” but, like the other section, this is not
> clearly called out and nothing states that this is applies to the header
> magic as well. The spec needs to include a statement like “the magic bytes
> for plaintext footer mode are ‘PAR1’ to allow older readers to read
> projections of the file that do not include encrypted columns”. Similarly,
> the other section needs to clearly state what the magic string is for
> encrypted footers.
>
> Support for multiple encryption algorithms is one of the differentiating
> features of Parquet encryption, therefore having a second cipher upfront is
> significant.
>
> That sounds okay to me since CTR is a subset of GCM so it isn’t that large
> of a change. The spec does need to be explicit about exactly how the two
> relate. For example, CTR resources tend to use the term “nonce” where GCM
> uses IV, so it needs to be explicitly stated that the GCM IV is used as the
> CTR nonce.
>
> EncryptedColumnChunk structure will be identical to a ColumnChunk. The
> reasons are: …
>
> These structures should be defined in the encryption spec. Some details can
> be included by reference, like “the remaining fields match those in
> ColumnMetaData” but there isn’t enough information in the current spec
> draft.
>
> My point about file_offset is that this field already has a use and a
> definition and this spec should not change its use or purpose. If you need
> a field for the location of an encrypted ColumnMetaData, then
> EncryptedColumnChunk should add one.
>
> EncryptedColumnChunk structure will be identical to a ColumnChunk. …
> ColumnCryptoMetaData is needed in ColumnChunk
>
> How is EncryptedColumnChunk identical to ColumnChunk if it also includes an
> extra structure for crypto?
>
> There are no options 1,2 in the NIST spec section 8.2.2.
>
> I’m referring to this section: “The random field shall either consist of 1)
> an output string of r(i) bits from an approved RBG with a sufficient
> security strength, or 2) the result of applying the r(i)–bit incrementing
> function to the random field of the preceding IV for the given key”
>
> My main point is that reusing any IV with the same key is a problem. The
> spec must explicitly state that no IV can be used with the same key. Sounds
> like you’re confident that the approach you’re using is sufficient, but if
> it isn’t stated in the spec then other implementations might do it
> incorrectly. Referencing section 8.2.2 or relying on other people having
> knowledge that you do is not sufficient for a spec.
>
> I also have a minor concern over the amount of crypto random that this may
> require because Parquet requires so many encryption streams, but that’s an
> implementation detail and doesn’t need to block the spec.
>
> On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Ryan,
> >
> > Thank you for the clear and detailed feedback.
> >
> > 1. Using a new magic string (PARE) for files with a plaintext footer
> would
> > not allow legacy readers to parse these files.
> >
> > 2. Support for multiple encryption algorithms is one of the
> differentiating
> > features of Parquet encryption, therefore having a second cipher upfront
> is
> > significant.
> > The design/review overhead of adding a CTR-based cipher to GCM  is
> minimal.
> > In fact, CTR is a subset of GCM, because the latter is built on CTR (and
> > GMAC).
> > Since you explicitly mark your suggestion as a minor note, I propose we
> > keep this part unchanged.
> >
> > 3.  EncryptedColumnChunk structure will be identical to a ColumnChunk.
> The
> > reasons are:
> > - In both EF and PF modes (encrypted/plaintext footer), the
> ColumnMetaData
> > is not always separated.
> > It is kept inside ColumnChunk if the column is plaintext,  or encrypted
> > with the footer key in the EF mode. No reason to separate it then.
> > - ColumnCryptoMetaData is needed in ColumnChunk to support column
> > encryption in PF mode (and in EF mode).
> > - We can make the file_offset an optional field. But actually it is not
> set
> > anyway in today's Parquet, which works only with the optional meta_data
> > field. So, this reason alone shouldn't be enough to create a new Thrift
> > structure that fully replicates an existing one, save for optional vs
> > required classifier in one field.
> >
> > 4. There are no options 1,2 in the NIST spec section 8.2.2. This subject
> > had been extensively discussed by community during the design review:
> > - If you mean the free field option there - its inapplicable for the
> 96-bit
> > IV we are using.
> > Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit is a
> > recommended length for AES-GCM IV, see discussion at
> >
> >
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> > eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths will
> > require additional calculations.*"
> > - If you mean the 8.2.1 section, with deterministic construction option -
> > its highly problematic, since deterministic IV prefix ("fixed field")
> > "*shall
> > identify the device, or, more generally, the context for the instance of
> > the authenticated encryption function*". This is not trivial in a single
> > process, and even more challenging in distributed frameworks like Spark
> > ("*no
> > two distinct devices shall share the same fixed field*").  Moreover, this
> > can be unsafe - a mistake in prefix creation can lead to identical IVs -
> > which breaks the encryption.
> > - The good news is we don't need these complications. The 8.2.2 allows us
> > to have 4 billion random IVs for a single key before any collision would
> > happen in IVs that are open to attackers. With Parquet modular
> encryption,
> > the number is more like 400 billion, because most (>99%) of IVs are
> hidden
> > from an attacker - their offset is encrypted with the column/footer keys.
> > Of course, we use the SecureRandom class, seeded properly.
> > - Besides checking the NIST spec and crypto forums, I work directly with
> > cipher technology experts - who confirm the Parquet approach to IV
> > generation is the right thing to do in distributed systems.
> >
> >
> > All other suggestions sound good, and are not hard to add to the spec.
> I'll
> > update the document accordingly.
> >
> > Cheers, Gidon.
> >
> > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > After thinking about this more last night, I think it also needs to be
> > more
> > > careful around IV generation. The spec just says that implementations
> > > should use RBG-based IV construction, but it needs to be more explicit
> > that
> > > no two GCM streams use the same IV. If an IV is reused, then there is a
> > > possible plaintext leak when blocks are padded with a predictable
> > pattern.
> > >
> > > I think the spec should state that an initial IV should be generated
> > using
> > > secure random as described by 8.2.2 option 1 and the IV for each
> > encrypted
> > > stream should be generated using 8.2.2 option 2 until there is a
> > collision,
> > > when the base IV should be regenerated using 8.2.2 option 1. That will
> > > ensure that each IV is unique, but that Parquet doesn't consume a huge
> > > amount of crypto randomness.
> > >
> > > rb
> > >
> > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:
> > >
> > > > -1
> > > >
> > > >    - I don’t think that it contains enough information for someone to
> > > >    successfully implement the proposal. For example, this references
> > > thrift
> > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce
> them
> > > and
> > > >    explain what they are or detail what they contain. Similarly,
> > > >    ColumnCryptoMetaData and its sub-types are not defined. It looks
> > like
> > > this
> > > >    spec relies on thrift, but I think that it should be
> self-contained.
> > > >    - It should be more clear exactly how to construct each GCM or CTR
> > > >    encrypted buffer. Is it necessary to specify padding for AES-GCM?
> I
> > > see
> > > >    from a quick google search that some implementations do, but
> there’s
> > > >    nothing in this doc.
> > > >    - I don’t think that this explains enough of the intent that is
> > behind
> > > >    the specification. For example, “Key metadata is a free-form byte
> > > array
> > > >    that can be used by a reader to retrieve the column encryption
> key”
> > > should
> > > >    be more clear about what this byte array actually contains or
> > > represents.
> > > >    Will it identify a key that can be fetched from a store? Will it
> be
> > a
> > > >    randomly-generated key stored encrypted by another key?
> > > >    - This changes the definition of ColumnChunk’s file_offset field
> and
> > > >    the location of ColumnMetaData in some cases. Instead of
> repurposing
> > > thrift
> > > >    structures, the spec should introduce new ones, like an
> > > >    EncryptedColumnChunk that preserves the meaning of fields in
> > > ColumnChunk
> > > >    but separates out ColumnMetaData.
> > > >    - This contains details that aren’t required, like using the
> > > >    “.parquet.encrypted” file extension.
> > > >    - The only time the new magic bytes are mentioned is after the
> > footer
> > > >    and encryption metadata, but the diagram shows that the first
> bytes
> > > in the
> > > >    file are updated as well. This is also only in the encrypted
> footer
> > > mode.
> > > >    Should PARE magic bytes be used in plaintext footer mode?
> > > >
> > > > Minor note: I would also prefer to vote on GCM, leaving out CTR for
> now
> > > > and adding it once the GCM spec is finished. That way we can
> > concentrate
> > > on
> > > > a single cipher mode instead of thinking about multiple modes at
> once.
> > > >
> > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> > <szonyi@cloudera.com.invalid
> > > >
> > > > wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > > >> <nk...@cloudera.com.invalid>
> > > >> wrote:
> > > >>
> > > >> > +1 (non-binding)
> > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >> > >
> > > >> > > +1 (non-binding)
> > > >> > >
> > > >> > >
> > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
> 下午4:46写道:
> > > >> > >
> > > >> > > > +1 (binding)
> > > >> > > >
> > > >> > > > Cheers,
> > > >> > > >
> > > >> > > > Zoltan
> > > >> > > >
> > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > > gg5070@gmail.com
> > > >> >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hello Parquet developers,
> > > >> > > > >
> > > >> > > > > Per the last sync discussion, it is time to call for a vote
> on
> > > the
> > > >> > > > Parquet
> > > >> > > > > Modular Encryption design sign-off. The design doc can be
> > found
> > > at
> > > >> > the
> > > >> > > > > encryption branch of the parquet-format repository,
> > > >> > > > >
> > > >> >
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> .
> > > >> > > > >
> > > >> > > > > The design is stable by now. This work had started 10 months
> > > ago,
> > > >> has
> > > >> > > > been
> > > >> > > > > extensively reviewed -  and implemented (in Java, partially
> in
> > > >> C++),
> > > >> > by a
> > > >> > > > > number of folks from different companies. To continue with
> the
> > > >> > > > > implementation pull requests, we need the design to be
> > formally
> > > >> > signed
> > > >> > > > off
> > > >> > > > > by the community.
> > > >> > > > >
> > > >> > > > > Cheers, Gidon
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Thanks & Best Regards
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Regarding the EncryptedColumnChunk, we might be talking about two different
things, looks like we need a deeper tech dive to sync up and get to the
bottom of this.
I'll you send a separate mail on this subject, but no doubt we'll figure it
out one way or another.

It is the only discussion point we have left, since all other suggestions
sound good to me - the md doc can certainly use a substantial addition of
explicit details for every feature
of the encryption format, to make it a full specification document.

Cheers, Gidon.

On Mon, Oct 22, 2018 at 9:36 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Using a new magic string (PARE) for files with a plaintext footer would not
> allow legacy readers to parse these files.
> This should be clear from reading the spec. I see that the plaintext footer
> section states “Then the footer is written as usual, followed by …, and a
> final magic string, ‘PAR1’” but, like the other section, this is not
> clearly called out and nothing states that this is applies to the header
> magic as well. The spec needs to include a statement like “the magic bytes
> for plaintext footer mode are ‘PAR1’ to allow older readers to read
> projections of the file that do not include encrypted columns”. Similarly,
> the other section needs to clearly state what the magic string is for
> encrypted footers.
>
> Support for multiple encryption algorithms is one of the differentiating
> features of Parquet encryption, therefore having a second cipher upfront is
> significant.
>
> That sounds okay to me since CTR is a subset of GCM so it isn’t that large
> of a change. The spec does need to be explicit about exactly how the two
> relate. For example, CTR resources tend to use the term “nonce” where GCM
> uses IV, so it needs to be explicitly stated that the GCM IV is used as the
> CTR nonce.
>
> EncryptedColumnChunk structure will be identical to a ColumnChunk. The
> reasons are: …
>
> These structures should be defined in the encryption spec. Some details can
> be included by reference, like “the remaining fields match those in
> ColumnMetaData” but there isn’t enough information in the current spec
> draft.
>
> My point about file_offset is that this field already has a use and a
> definition and this spec should not change its use or purpose. If you need
> a field for the location of an encrypted ColumnMetaData, then
> EncryptedColumnChunk should add one.
>
> EncryptedColumnChunk structure will be identical to a ColumnChunk. …
> ColumnCryptoMetaData is needed in ColumnChunk
>
> How is EncryptedColumnChunk identical to ColumnChunk if it also includes an
> extra structure for crypto?
>
> There are no options 1,2 in the NIST spec section 8.2.2.
>
> I’m referring to this section: “The random field shall either consist of 1)
> an output string of r(i) bits from an approved RBG with a sufficient
> security strength, or 2) the result of applying the r(i)–bit incrementing
> function to the random field of the preceding IV for the given key”
>
> My main point is that reusing any IV with the same key is a problem. The
> spec must explicitly state that no IV can be used with the same key. Sounds
> like you’re confident that the approach you’re using is sufficient, but if
> it isn’t stated in the spec then other implementations might do it
> incorrectly. Referencing section 8.2.2 or relying on other people having
> knowledge that you do is not sufficient for a spec.
>
> I also have a minor concern over the amount of crypto random that this may
> require because Parquet requires so many encryption streams, but that’s an
> implementation detail and doesn’t need to block the spec.
>
> On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Ryan,
> >
> > Thank you for the clear and detailed feedback.
> >
> > 1. Using a new magic string (PARE) for files with a plaintext footer
> would
> > not allow legacy readers to parse these files.
> >
> > 2. Support for multiple encryption algorithms is one of the
> differentiating
> > features of Parquet encryption, therefore having a second cipher upfront
> is
> > significant.
> > The design/review overhead of adding a CTR-based cipher to GCM  is
> minimal.
> > In fact, CTR is a subset of GCM, because the latter is built on CTR (and
> > GMAC).
> > Since you explicitly mark your suggestion as a minor note, I propose we
> > keep this part unchanged.
> >
> > 3.  EncryptedColumnChunk structure will be identical to a ColumnChunk.
> The
> > reasons are:
> > - In both EF and PF modes (encrypted/plaintext footer), the
> ColumnMetaData
> > is not always separated.
> > It is kept inside ColumnChunk if the column is plaintext,  or encrypted
> > with the footer key in the EF mode. No reason to separate it then.
> > - ColumnCryptoMetaData is needed in ColumnChunk to support column
> > encryption in PF mode (and in EF mode).
> > - We can make the file_offset an optional field. But actually it is not
> set
> > anyway in today's Parquet, which works only with the optional meta_data
> > field. So, this reason alone shouldn't be enough to create a new Thrift
> > structure that fully replicates an existing one, save for optional vs
> > required classifier in one field.
> >
> > 4. There are no options 1,2 in the NIST spec section 8.2.2. This subject
> > had been extensively discussed by community during the design review:
> > - If you mean the free field option there - its inapplicable for the
> 96-bit
> > IV we are using.
> > Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit is a
> > recommended length for AES-GCM IV, see discussion at
> >
> >
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> > eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths will
> > require additional calculations.*"
> > - If you mean the 8.2.1 section, with deterministic construction option -
> > its highly problematic, since deterministic IV prefix ("fixed field")
> > "*shall
> > identify the device, or, more generally, the context for the instance of
> > the authenticated encryption function*". This is not trivial in a single
> > process, and even more challenging in distributed frameworks like Spark
> > ("*no
> > two distinct devices shall share the same fixed field*").  Moreover, this
> > can be unsafe - a mistake in prefix creation can lead to identical IVs -
> > which breaks the encryption.
> > - The good news is we don't need these complications. The 8.2.2 allows us
> > to have 4 billion random IVs for a single key before any collision would
> > happen in IVs that are open to attackers. With Parquet modular
> encryption,
> > the number is more like 400 billion, because most (>99%) of IVs are
> hidden
> > from an attacker - their offset is encrypted with the column/footer keys.
> > Of course, we use the SecureRandom class, seeded properly.
> > - Besides checking the NIST spec and crypto forums, I work directly with
> > cipher technology experts - who confirm the Parquet approach to IV
> > generation is the right thing to do in distributed systems.
> >
> >
> > All other suggestions sound good, and are not hard to add to the spec.
> I'll
> > update the document accordingly.
> >
> > Cheers, Gidon.
> >
> > On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > After thinking about this more last night, I think it also needs to be
> > more
> > > careful around IV generation. The spec just says that implementations
> > > should use RBG-based IV construction, but it needs to be more explicit
> > that
> > > no two GCM streams use the same IV. If an IV is reused, then there is a
> > > possible plaintext leak when blocks are padded with a predictable
> > pattern.
> > >
> > > I think the spec should state that an initial IV should be generated
> > using
> > > secure random as described by 8.2.2 option 1 and the IV for each
> > encrypted
> > > stream should be generated using 8.2.2 option 2 until there is a
> > collision,
> > > when the base IV should be regenerated using 8.2.2 option 1. That will
> > > ensure that each IV is unique, but that Parquet doesn't consume a huge
> > > amount of crypto randomness.
> > >
> > > rb
> > >
> > > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:
> > >
> > > > -1
> > > >
> > > >    - I don’t think that it contains enough information for someone to
> > > >    successfully implement the proposal. For example, this references
> > > thrift
> > > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce
> them
> > > and
> > > >    explain what they are or detail what they contain. Similarly,
> > > >    ColumnCryptoMetaData and its sub-types are not defined. It looks
> > like
> > > this
> > > >    spec relies on thrift, but I think that it should be
> self-contained.
> > > >    - It should be more clear exactly how to construct each GCM or CTR
> > > >    encrypted buffer. Is it necessary to specify padding for AES-GCM?
> I
> > > see
> > > >    from a quick google search that some implementations do, but
> there’s
> > > >    nothing in this doc.
> > > >    - I don’t think that this explains enough of the intent that is
> > behind
> > > >    the specification. For example, “Key metadata is a free-form byte
> > > array
> > > >    that can be used by a reader to retrieve the column encryption
> key”
> > > should
> > > >    be more clear about what this byte array actually contains or
> > > represents.
> > > >    Will it identify a key that can be fetched from a store? Will it
> be
> > a
> > > >    randomly-generated key stored encrypted by another key?
> > > >    - This changes the definition of ColumnChunk’s file_offset field
> and
> > > >    the location of ColumnMetaData in some cases. Instead of
> repurposing
> > > thrift
> > > >    structures, the spec should introduce new ones, like an
> > > >    EncryptedColumnChunk that preserves the meaning of fields in
> > > ColumnChunk
> > > >    but separates out ColumnMetaData.
> > > >    - This contains details that aren’t required, like using the
> > > >    “.parquet.encrypted” file extension.
> > > >    - The only time the new magic bytes are mentioned is after the
> > footer
> > > >    and encryption metadata, but the diagram shows that the first
> bytes
> > > in the
> > > >    file are updated as well. This is also only in the encrypted
> footer
> > > mode.
> > > >    Should PARE magic bytes be used in plaintext footer mode?
> > > >
> > > > Minor note: I would also prefer to vote on GCM, leaving out CTR for
> now
> > > > and adding it once the GCM spec is finished. That way we can
> > concentrate
> > > on
> > > > a single cipher mode instead of thinking about multiple modes at
> once.
> > > >
> > > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> > <szonyi@cloudera.com.invalid
> > > >
> > > > wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > > >> <nk...@cloudera.com.invalid>
> > > >> wrote:
> > > >>
> > > >> > +1 (non-binding)
> > > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >> > >
> > > >> > > +1 (non-binding)
> > > >> > >
> > > >> > >
> > > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二
> 下午4:46写道:
> > > >> > >
> > > >> > > > +1 (binding)
> > > >> > > >
> > > >> > > > Cheers,
> > > >> > > >
> > > >> > > > Zoltan
> > > >> > > >
> > > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > > gg5070@gmail.com
> > > >> >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hello Parquet developers,
> > > >> > > > >
> > > >> > > > > Per the last sync discussion, it is time to call for a vote
> on
> > > the
> > > >> > > > Parquet
> > > >> > > > > Modular Encryption design sign-off. The design doc can be
> > found
> > > at
> > > >> > the
> > > >> > > > > encryption branch of the parquet-format repository,
> > > >> > > > >
> > > >> >
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md
> .
> > > >> > > > >
> > > >> > > > > The design is stable by now. This work had started 10 months
> > > ago,
> > > >> has
> > > >> > > > been
> > > >> > > > > extensively reviewed -  and implemented (in Java, partially
> in
> > > >> C++),
> > > >> > by a
> > > >> > > > > number of folks from different companies. To continue with
> the
> > > >> > > > > implementation pull requests, we need the design to be
> > formally
> > > >> > signed
> > > >> > > > off
> > > >> > > > > by the community.
> > > >> > > > >
> > > >> > > > > Cheers, Gidon
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Thanks & Best Regards
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Using a new magic string (PARE) for files with a plaintext footer would not
allow legacy readers to parse these files.
This should be clear from reading the spec. I see that the plaintext footer
section states “Then the footer is written as usual, followed by …, and a
final magic string, ‘PAR1’” but, like the other section, this is not
clearly called out and nothing states that this is applies to the header
magic as well. The spec needs to include a statement like “the magic bytes
for plaintext footer mode are ‘PAR1’ to allow older readers to read
projections of the file that do not include encrypted columns”. Similarly,
the other section needs to clearly state what the magic string is for
encrypted footers.

Support for multiple encryption algorithms is one of the differentiating
features of Parquet encryption, therefore having a second cipher upfront is
significant.

That sounds okay to me since CTR is a subset of GCM so it isn’t that large
of a change. The spec does need to be explicit about exactly how the two
relate. For example, CTR resources tend to use the term “nonce” where GCM
uses IV, so it needs to be explicitly stated that the GCM IV is used as the
CTR nonce.

EncryptedColumnChunk structure will be identical to a ColumnChunk. The
reasons are: …

These structures should be defined in the encryption spec. Some details can
be included by reference, like “the remaining fields match those in
ColumnMetaData” but there isn’t enough information in the current spec
draft.

My point about file_offset is that this field already has a use and a
definition and this spec should not change its use or purpose. If you need
a field for the location of an encrypted ColumnMetaData, then
EncryptedColumnChunk should add one.

EncryptedColumnChunk structure will be identical to a ColumnChunk. …
ColumnCryptoMetaData is needed in ColumnChunk

How is EncryptedColumnChunk identical to ColumnChunk if it also includes an
extra structure for crypto?

There are no options 1,2 in the NIST spec section 8.2.2.

I’m referring to this section: “The random field shall either consist of 1)
an output string of r(i) bits from an approved RBG with a sufficient
security strength, or 2) the result of applying the r(i)–bit incrementing
function to the random field of the preceding IV for the given key”

My main point is that reusing any IV with the same key is a problem. The
spec must explicitly state that no IV can be used with the same key. Sounds
like you’re confident that the approach you’re using is sufficient, but if
it isn’t stated in the spec then other implementations might do it
incorrectly. Referencing section 8.2.2 or relying on other people having
knowledge that you do is not sufficient for a spec.

I also have a minor concern over the amount of crypto random that this may
require because Parquet requires so many encryption streams, but that’s an
implementation detail and doesn’t need to block the spec.

On Sat, Oct 20, 2018 at 2:22 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Ryan,
>
> Thank you for the clear and detailed feedback.
>
> 1. Using a new magic string (PARE) for files with a plaintext footer would
> not allow legacy readers to parse these files.
>
> 2. Support for multiple encryption algorithms is one of the differentiating
> features of Parquet encryption, therefore having a second cipher upfront is
> significant.
> The design/review overhead of adding a CTR-based cipher to GCM  is minimal.
> In fact, CTR is a subset of GCM, because the latter is built on CTR (and
> GMAC).
> Since you explicitly mark your suggestion as a minor note, I propose we
> keep this part unchanged.
>
> 3.  EncryptedColumnChunk structure will be identical to a ColumnChunk. The
> reasons are:
> - In both EF and PF modes (encrypted/plaintext footer), the ColumnMetaData
> is not always separated.
> It is kept inside ColumnChunk if the column is plaintext,  or encrypted
> with the footer key in the EF mode. No reason to separate it then.
> - ColumnCryptoMetaData is needed in ColumnChunk to support column
> encryption in PF mode (and in EF mode).
> - We can make the file_offset an optional field. But actually it is not set
> anyway in today's Parquet, which works only with the optional meta_data
> field. So, this reason alone shouldn't be enough to create a new Thrift
> structure that fully replicates an existing one, save for optional vs
> required classifier in one field.
>
> 4. There are no options 1,2 in the NIST spec section 8.2.2. This subject
> had been extensively discussed by community during the design review:
> - If you mean the free field option there - its inapplicable for the 96-bit
> IV we are using.
> Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit is a
> recommended length for AES-GCM IV, see discussion at
>
> https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
> eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths will
> require additional calculations.*"
> - If you mean the 8.2.1 section, with deterministic construction option -
> its highly problematic, since deterministic IV prefix ("fixed field")
> "*shall
> identify the device, or, more generally, the context for the instance of
> the authenticated encryption function*". This is not trivial in a single
> process, and even more challenging in distributed frameworks like Spark
> ("*no
> two distinct devices shall share the same fixed field*").  Moreover, this
> can be unsafe - a mistake in prefix creation can lead to identical IVs -
> which breaks the encryption.
> - The good news is we don't need these complications. The 8.2.2 allows us
> to have 4 billion random IVs for a single key before any collision would
> happen in IVs that are open to attackers. With Parquet modular encryption,
> the number is more like 400 billion, because most (>99%) of IVs are hidden
> from an attacker - their offset is encrypted with the column/footer keys.
> Of course, we use the SecureRandom class, seeded properly.
> - Besides checking the NIST spec and crypto forums, I work directly with
> cipher technology experts - who confirm the Parquet approach to IV
> generation is the right thing to do in distributed systems.
>
>
> All other suggestions sound good, and are not hard to add to the spec. I'll
> update the document accordingly.
>
> Cheers, Gidon.
>
> On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > After thinking about this more last night, I think it also needs to be
> more
> > careful around IV generation. The spec just says that implementations
> > should use RBG-based IV construction, but it needs to be more explicit
> that
> > no two GCM streams use the same IV. If an IV is reused, then there is a
> > possible plaintext leak when blocks are padded with a predictable
> pattern.
> >
> > I think the spec should state that an initial IV should be generated
> using
> > secure random as described by 8.2.2 option 1 and the IV for each
> encrypted
> > stream should be generated using 8.2.2 option 2 until there is a
> collision,
> > when the base IV should be regenerated using 8.2.2 option 1. That will
> > ensure that each IV is unique, but that Parquet doesn't consume a huge
> > amount of crypto randomness.
> >
> > rb
> >
> > On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:
> >
> > > -1
> > >
> > >    - I don’t think that it contains enough information for someone to
> > >    successfully implement the proposal. For example, this references
> > thrift
> > >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce them
> > and
> > >    explain what they are or detail what they contain. Similarly,
> > >    ColumnCryptoMetaData and its sub-types are not defined. It looks
> like
> > this
> > >    spec relies on thrift, but I think that it should be self-contained.
> > >    - It should be more clear exactly how to construct each GCM or CTR
> > >    encrypted buffer. Is it necessary to specify padding for AES-GCM? I
> > see
> > >    from a quick google search that some implementations do, but there’s
> > >    nothing in this doc.
> > >    - I don’t think that this explains enough of the intent that is
> behind
> > >    the specification. For example, “Key metadata is a free-form byte
> > array
> > >    that can be used by a reader to retrieve the column encryption key”
> > should
> > >    be more clear about what this byte array actually contains or
> > represents.
> > >    Will it identify a key that can be fetched from a store? Will it be
> a
> > >    randomly-generated key stored encrypted by another key?
> > >    - This changes the definition of ColumnChunk’s file_offset field and
> > >    the location of ColumnMetaData in some cases. Instead of repurposing
> > thrift
> > >    structures, the spec should introduce new ones, like an
> > >    EncryptedColumnChunk that preserves the meaning of fields in
> > ColumnChunk
> > >    but separates out ColumnMetaData.
> > >    - This contains details that aren’t required, like using the
> > >    “.parquet.encrypted” file extension.
> > >    - The only time the new magic bytes are mentioned is after the
> footer
> > >    and encryption metadata, but the diagram shows that the first bytes
> > in the
> > >    file are updated as well. This is also only in the encrypted footer
> > mode.
> > >    Should PARE magic bytes be used in plaintext footer mode?
> > >
> > > Minor note: I would also prefer to vote on GCM, leaving out CTR for now
> > > and adding it once the GCM spec is finished. That way we can
> concentrate
> > on
> > > a single cipher mode instead of thinking about multiple modes at once.
> > >
> > > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi
> <szonyi@cloudera.com.invalid
> > >
> > > wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> > >> <nk...@cloudera.com.invalid>
> > >> wrote:
> > >>
> > >> > +1 (non-binding)
> > >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >> > >
> > >> > > +1 (non-binding)
> > >> > >
> > >> > >
> > >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
> > >> > >
> > >> > > > +1 (binding)
> > >> > > >
> > >> > > > Cheers,
> > >> > > >
> > >> > > > Zoltan
> > >> > > >
> > >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> > gg5070@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hello Parquet developers,
> > >> > > > >
> > >> > > > > Per the last sync discussion, it is time to call for a vote on
> > the
> > >> > > > Parquet
> > >> > > > > Modular Encryption design sign-off. The design doc can be
> found
> > at
> > >> > the
> > >> > > > > encryption branch of the parquet-format repository,
> > >> > > > >
> > >> >
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> > >> > > > >
> > >> > > > > The design is stable by now. This work had started 10 months
> > ago,
> > >> has
> > >> > > > been
> > >> > > > > extensively reviewed -  and implemented (in Java, partially in
> > >> C++),
> > >> > by a
> > >> > > > > number of folks from different companies. To continue with the
> > >> > > > > implementation pull requests, we need the design to be
> formally
> > >> > signed
> > >> > > > off
> > >> > > > > by the community.
> > >> > > > >
> > >> > > > > Cheers, Gidon
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Thanks & Best Regards
> > >> >
> > >>
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
Ryan,

Thank you for the clear and detailed feedback.

1. Using a new magic string (PARE) for files with a plaintext footer would
not allow legacy readers to parse these files.

2. Support for multiple encryption algorithms is one of the differentiating
features of Parquet encryption, therefore having a second cipher upfront is
significant.
The design/review overhead of adding a CTR-based cipher to GCM  is minimal.
In fact, CTR is a subset of GCM, because the latter is built on CTR (and
GMAC).
Since you explicitly mark your suggestion as a minor note, I propose we
keep this part unchanged.

3.  EncryptedColumnChunk structure will be identical to a ColumnChunk. The
reasons are:
- In both EF and PF modes (encrypted/plaintext footer), the ColumnMetaData
is not always separated.
It is kept inside ColumnChunk if the column is plaintext,  or encrypted
with the footer key in the EF mode. No reason to separate it then.
- ColumnCryptoMetaData is needed in ColumnChunk to support column
encryption in PF mode (and in EF mode).
- We can make the file_offset an optional field. But actually it is not set
anyway in today's Parquet, which works only with the optional meta_data
field. So, this reason alone shouldn't be enough to create a new Thrift
structure that fully replicates an existing one, save for optional vs
required classifier in one field.

4. There are no options 1,2 in the NIST spec section 8.2.2. This subject
had been extensively discussed by community during the design review:
- If you mean the free field option there - its inapplicable for the 96-bit
IV we are using.
Per the spec, 96-bit IV is fully random, 0 free field length. 96 bit is a
recommended length for AES-GCM IV, see discussion at
https://crypto.stackexchange.com/questions/41601/aes-gcm-recommended-iv-size-why-12-bytes/41610
eg "*For GCM a 12 byte IV is strongly suggested as other IV lengths will
require additional calculations.*"
- If you mean the 8.2.1 section, with deterministic construction option -
its highly problematic, since deterministic IV prefix ("fixed field") "*shall
identify the device, or, more generally, the context for the instance of
the authenticated encryption function*". This is not trivial in a single
process, and even more challenging in distributed frameworks like Spark ("*no
two distinct devices shall share the same fixed field*").  Moreover, this
can be unsafe - a mistake in prefix creation can lead to identical IVs -
which breaks the encryption.
- The good news is we don't need these complications. The 8.2.2 allows us
to have 4 billion random IVs for a single key before any collision would
happen in IVs that are open to attackers. With Parquet modular encryption,
the number is more like 400 billion, because most (>99%) of IVs are hidden
from an attacker - their offset is encrypted with the column/footer keys.
Of course, we use the SecureRandom class, seeded properly.
- Besides checking the NIST spec and crypto forums, I work directly with
cipher technology experts - who confirm the Parquet approach to IV
generation is the right thing to do in distributed systems.


All other suggestions sound good, and are not hard to add to the spec. I'll
update the document accordingly.

Cheers, Gidon.

On Fri, Oct 19, 2018 at 7:38 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> After thinking about this more last night, I think it also needs to be more
> careful around IV generation. The spec just says that implementations
> should use RBG-based IV construction, but it needs to be more explicit that
> no two GCM streams use the same IV. If an IV is reused, then there is a
> possible plaintext leak when blocks are padded with a predictable pattern.
>
> I think the spec should state that an initial IV should be generated using
> secure random as described by 8.2.2 option 1 and the IV for each encrypted
> stream should be generated using 8.2.2 option 2 until there is a collision,
> when the base IV should be regenerated using 8.2.2 option 1. That will
> ensure that each IV is unique, but that Parquet doesn't consume a huge
> amount of crypto randomness.
>
> rb
>
> On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:
>
> > -1
> >
> >    - I don’t think that it contains enough information for someone to
> >    successfully implement the proposal. For example, this references
> thrift
> >    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce them
> and
> >    explain what they are or detail what they contain. Similarly,
> >    ColumnCryptoMetaData and its sub-types are not defined. It looks like
> this
> >    spec relies on thrift, but I think that it should be self-contained.
> >    - It should be more clear exactly how to construct each GCM or CTR
> >    encrypted buffer. Is it necessary to specify padding for AES-GCM? I
> see
> >    from a quick google search that some implementations do, but there’s
> >    nothing in this doc.
> >    - I don’t think that this explains enough of the intent that is behind
> >    the specification. For example, “Key metadata is a free-form byte
> array
> >    that can be used by a reader to retrieve the column encryption key”
> should
> >    be more clear about what this byte array actually contains or
> represents.
> >    Will it identify a key that can be fetched from a store? Will it be a
> >    randomly-generated key stored encrypted by another key?
> >    - This changes the definition of ColumnChunk’s file_offset field and
> >    the location of ColumnMetaData in some cases. Instead of repurposing
> thrift
> >    structures, the spec should introduce new ones, like an
> >    EncryptedColumnChunk that preserves the meaning of fields in
> ColumnChunk
> >    but separates out ColumnMetaData.
> >    - This contains details that aren’t required, like using the
> >    “.parquet.encrypted” file extension.
> >    - The only time the new magic bytes are mentioned is after the footer
> >    and encryption metadata, but the diagram shows that the first bytes
> in the
> >    file are updated as well. This is also only in the encrypted footer
> mode.
> >    Should PARE magic bytes be used in plaintext footer mode?
> >
> > Minor note: I would also prefer to vote on GCM, leaving out CTR for now
> > and adding it once the GCM spec is finished. That way we can concentrate
> on
> > a single cipher mode instead of thinking about multiple modes at once.
> >
> > On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi <szonyi@cloudera.com.invalid
> >
> > wrote:
> >
> >> +1 (non-binding)
> >>
> >> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> >> <nk...@cloudera.com.invalid>
> >> wrote:
> >>
> >> > +1 (non-binding)
> >> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> >> > >
> >> > > +1 (non-binding)
> >> > >
> >> > >
> >> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
> >> > >
> >> > > > +1 (binding)
> >> > > >
> >> > > > Cheers,
> >> > > >
> >> > > > Zoltan
> >> > > >
> >> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <
> gg5070@gmail.com
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Hello Parquet developers,
> >> > > > >
> >> > > > > Per the last sync discussion, it is time to call for a vote on
> the
> >> > > > Parquet
> >> > > > > Modular Encryption design sign-off. The design doc can be found
> at
> >> > the
> >> > > > > encryption branch of the parquet-format repository,
> >> > > > >
> >> >
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> >> > > > >
> >> > > > > The design is stable by now. This work had started 10 months
> ago,
> >> has
> >> > > > been
> >> > > > > extensively reviewed -  and implemented (in Java, partially in
> >> C++),
> >> > by a
> >> > > > > number of folks from different companies. To continue with the
> >> > > > > implementation pull requests, we need the design to be formally
> >> > signed
> >> > > > off
> >> > > > > by the community.
> >> > > > >
> >> > > > > Cheers, Gidon
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Thanks & Best Regards
> >> >
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
After thinking about this more last night, I think it also needs to be more
careful around IV generation. The spec just says that implementations
should use RBG-based IV construction, but it needs to be more explicit that
no two GCM streams use the same IV. If an IV is reused, then there is a
possible plaintext leak when blocks are padded with a predictable pattern.

I think the spec should state that an initial IV should be generated using
secure random as described by 8.2.2 option 1 and the IV for each encrypted
stream should be generated using 8.2.2 option 2 until there is a collision,
when the base IV should be regenerated using 8.2.2 option 1. That will
ensure that each IV is unique, but that Parquet doesn't consume a huge
amount of crypto randomness.

rb

On Thu, Oct 18, 2018 at 6:15 PM Ryan Blue <rb...@netflix.com> wrote:

> -1
>
>    - I don’t think that it contains enough information for someone to
>    successfully implement the proposal. For example, this references thrift
>    structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce them and
>    explain what they are or detail what they contain. Similarly,
>    ColumnCryptoMetaData and its sub-types are not defined. It looks like this
>    spec relies on thrift, but I think that it should be self-contained.
>    - It should be more clear exactly how to construct each GCM or CTR
>    encrypted buffer. Is it necessary to specify padding for AES-GCM? I see
>    from a quick google search that some implementations do, but there’s
>    nothing in this doc.
>    - I don’t think that this explains enough of the intent that is behind
>    the specification. For example, “Key metadata is a free-form byte array
>    that can be used by a reader to retrieve the column encryption key” should
>    be more clear about what this byte array actually contains or represents.
>    Will it identify a key that can be fetched from a store? Will it be a
>    randomly-generated key stored encrypted by another key?
>    - This changes the definition of ColumnChunk’s file_offset field and
>    the location of ColumnMetaData in some cases. Instead of repurposing thrift
>    structures, the spec should introduce new ones, like an
>    EncryptedColumnChunk that preserves the meaning of fields in ColumnChunk
>    but separates out ColumnMetaData.
>    - This contains details that aren’t required, like using the
>    “.parquet.encrypted” file extension.
>    - The only time the new magic bytes are mentioned is after the footer
>    and encryption metadata, but the diagram shows that the first bytes in the
>    file are updated as well. This is also only in the encrypted footer mode.
>    Should PARE magic bytes be used in plaintext footer mode?
>
> Minor note: I would also prefer to vote on GCM, leaving out CTR for now
> and adding it once the GCM spec is finished. That way we can concentrate on
> a single cipher mode instead of thinking about multiple modes at once.
>
> On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi <sz...@cloudera.com.invalid>
> wrote:
>
>> +1 (non-binding)
>>
>> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
>> <nk...@cloudera.com.invalid>
>> wrote:
>>
>> > +1 (non-binding)
>> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
>> > >
>> > > +1 (non-binding)
>> > >
>> > >
>> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
>> > >
>> > > > +1 (binding)
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Zoltan
>> > > >
>> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <gg5070@gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > Hello Parquet developers,
>> > > > >
>> > > > > Per the last sync discussion, it is time to call for a vote on the
>> > > > Parquet
>> > > > > Modular Encryption design sign-off. The design doc can be found at
>> > the
>> > > > > encryption branch of the parquet-format repository,
>> > > > >
>> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
>> > > > >
>> > > > > The design is stable by now. This work had started 10 months ago,
>> has
>> > > > been
>> > > > > extensively reviewed -  and implemented (in Java, partially in
>> C++),
>> > by a
>> > > > > number of folks from different companies. To continue with the
>> > > > > implementation pull requests, we need the design to be formally
>> > signed
>> > > > off
>> > > > > by the community.
>> > > > >
>> > > > > Cheers, Gidon
>> > > > >
>> > > >
>> > >
>> > >
>> > > --
>> > > Thanks & Best Regards
>> >
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
-1

   - I don’t think that it contains enough information for someone to
   successfully implement the proposal. For example, this references thrift
   structures like AesGcmV1 and AesGcmCtrV1, but doesn’t introduce them and
   explain what they are or detail what they contain. Similarly,
   ColumnCryptoMetaData and its sub-types are not defined. It looks like this
   spec relies on thrift, but I think that it should be self-contained.
   - It should be more clear exactly how to construct each GCM or CTR
   encrypted buffer. Is it necessary to specify padding for AES-GCM? I see
   from a quick google search that some implementations do, but there’s
   nothing in this doc.
   - I don’t think that this explains enough of the intent that is behind
   the specification. For example, “Key metadata is a free-form byte array
   that can be used by a reader to retrieve the column encryption key” should
   be more clear about what this byte array actually contains or represents.
   Will it identify a key that can be fetched from a store? Will it be a
   randomly-generated key stored encrypted by another key?
   - This changes the definition of ColumnChunk’s file_offset field and the
   location of ColumnMetaData in some cases. Instead of repurposing thrift
   structures, the spec should introduce new ones, like an
   EncryptedColumnChunk that preserves the meaning of fields in ColumnChunk
   but separates out ColumnMetaData.
   - This contains details that aren’t required, like using the
   “.parquet.encrypted” file extension.
   - The only time the new magic bytes are mentioned is after the footer
   and encryption metadata, but the diagram shows that the first bytes in the
   file are updated as well. This is also only in the encrypted footer mode.
   Should PARE magic bytes be used in plaintext footer mode?

Minor note: I would also prefer to vote on GCM, leaving out CTR for now and
adding it once the GCM spec is finished. That way we can concentrate on a
single cipher mode instead of thinking about multiple modes at once.

On Tue, Oct 16, 2018 at 2:44 AM Anna Szonyi <sz...@cloudera.com.invalid>
wrote:

> +1 (non-binding)
>
> On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar
> <nk...@cloudera.com.invalid>
> wrote:
>
> > +1 (non-binding)
> > On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
> > >
> > > > +1 (binding)
> > > >
> > > > Cheers,
> > > >
> > > > Zoltan
> > > >
> > > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <gg...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello Parquet developers,
> > > > >
> > > > > Per the last sync discussion, it is time to call for a vote on the
> > > > Parquet
> > > > > Modular Encryption design sign-off. The design doc can be found at
> > the
> > > > > encryption branch of the parquet-format repository,
> > > > >
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> > > > >
> > > > > The design is stable by now. This work had started 10 months ago,
> has
> > > > been
> > > > > extensively reviewed -  and implemented (in Java, partially in
> C++),
> > by a
> > > > > number of folks from different companies. To continue with the
> > > > > implementation pull requests, we need the design to be formally
> > signed
> > > > off
> > > > > by the community.
> > > > >
> > > > > Cheers, Gidon
> > > > >
> > > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Modular Encryption design sign-off

Posted by Anna Szonyi <sz...@cloudera.com.INVALID>.
+1 (non-binding)

On Tue, Oct 16, 2018 at 11:14 AM Nandor Kollar <nk...@cloudera.com.invalid>
wrote:

> +1 (non-binding)
> On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > +1 (non-binding)
> >
> >
> > Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
> >
> > > +1 (binding)
> > >
> > > Cheers,
> > >
> > > Zoltan
> > >
> > > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <gg...@gmail.com>
> > > wrote:
> > >
> > > > Hello Parquet developers,
> > > >
> > > > Per the last sync discussion, it is time to call for a vote on the
> > > Parquet
> > > > Modular Encryption design sign-off. The design doc can be found at
> the
> > > > encryption branch of the parquet-format repository,
> > > >
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> > > >
> > > > The design is stable by now. This work had started 10 months ago, has
> > > been
> > > > extensively reviewed -  and implemented (in Java, partially in C++),
> by a
> > > > number of folks from different companies. To continue with the
> > > > implementation pull requests, we need the design to be formally
> signed
> > > off
> > > > by the community.
> > > >
> > > > Cheers, Gidon
> > > >
> > >
> >
> >
> > --
> > Thanks & Best Regards
>

Re: [VOTE] Modular Encryption design sign-off

Posted by Nandor Kollar <nk...@cloudera.com.INVALID>.
+1 (non-binding)
On Tue, Oct 16, 2018 at 10:59 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> +1 (non-binding)
>
>
> Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:
>
> > +1 (binding)
> >
> > Cheers,
> >
> > Zoltan
> >
> > On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <gg...@gmail.com>
> > wrote:
> >
> > > Hello Parquet developers,
> > >
> > > Per the last sync discussion, it is time to call for a vote on the
> > Parquet
> > > Modular Encryption design sign-off. The design doc can be found at the
> > > encryption branch of the parquet-format repository,
> > > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> > >
> > > The design is stable by now. This work had started 10 months ago, has
> > been
> > > extensively reviewed -  and implemented (in Java, partially in C++), by a
> > > number of folks from different companies. To continue with the
> > > implementation pull requests, we need the design to be formally signed
> > off
> > > by the community.
> > >
> > > Cheers, Gidon
> > >
> >
>
>
> --
> Thanks & Best Regards

Re: [VOTE] Modular Encryption design sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
+1 (non-binding)


Zoltan Ivanfi <zi...@cloudera.com.invalid> 于2018年10月16日周二 下午4:46写道:

> +1 (binding)
>
> Cheers,
>
> Zoltan
>
> On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
>
> > Hello Parquet developers,
> >
> > Per the last sync discussion, it is time to call for a vote on the
> Parquet
> > Modular Encryption design sign-off. The design doc can be found at the
> > encryption branch of the parquet-format repository,
> > https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
> >
> > The design is stable by now. This work had started 10 months ago, has
> been
> > extensively reviewed -  and implemented (in Java, partially in C++), by a
> > number of folks from different companies. To continue with the
> > implementation pull requests, we need the design to be formally signed
> off
> > by the community.
> >
> > Cheers, Gidon
> >
>


-- 
Thanks & Best Regards

Re: [VOTE] Modular Encryption design sign-off

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
+1 (binding)

Cheers,

Zoltan

On Tue, Oct 16, 2018 at 10:11 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hello Parquet developers,
>
> Per the last sync discussion, it is time to call for a vote on the Parquet
> Modular Encryption design sign-off. The design doc can be found at the
> encryption branch of the parquet-format repository,
> https://github.com/apache/parquet-format/blob/encryption/Encryption.md.
>
> The design is stable by now. This work had started 10 months ago, has been
> extensively reviewed -  and implemented (in Java, partially in C++), by a
> number of folks from different companies. To continue with the
> implementation pull requests, we need the design to be formally signed off
> by the community.
>
> Cheers, Gidon
>