Posted to dev@parquet.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2018/09/18 07:44:47 UTC

Old readers & encrypted files

Hi all,

This week, 8 months after the first call for goals feedback and
requirements :), I got a new one - enabling old Parquet readers to access
data of unencrypted columns in encrypted files.
Better late than never.. But actually it doesn't sound unreasonable, and
deserves at least some consideration.

Let me describe the options (the way I see them). Any community feedback is
welcome.

But first, a little tech intro. Encrypted Parquet files can be created in
two modes - with an encrypted footer (let's call this 'EF' mode for the
purpose of this discussion), or with a plaintext footer ('PF' mode).
EF is significantly more secure - it protects all data and metadata in a
file, including the schema, number of rows, key-value properties, column
names, column sort order, list of encrypted columns and metadata of the
column encryption keys.
PF hides the data, but leaks all of these metadata fields. Moreover, EF
makes the footer tamper-proof, while PF doesn't.
The reason we have the PF option is to let users with relaxed security
requirements enable readers that don't have access to any keys to read
unencrypted columns in a file.

For encrypted columns, both EF and PF hide the ColumnMetaData - including
the min/max stats, number of values, data offset, data size and other
fields. Old Parquet readers obviously can't read EF files. But they also
can't read PF files - because old readers need access to the data offset and
size of every column in a file, even if they try to read just one column
(this is fixed in an encryption pull request).
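
To make the PF limitation concrete, here is a minimal Python sketch of a reader
that touches the offset and size of every column chunk before applying the
projection. The ColumnChunk class, its field names and old_reader_plan are
hypothetical illustrations, not parquet-mr code; None stands for a value that
is hidden inside encrypted ColumnMetaData.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class ColumnChunk:
    name: str
    data_offset: Optional[int]  # None: hidden inside encrypted ColumnMetaData
    data_size: Optional[int]


def old_reader_plan(chunks: List[ColumnChunk], wanted: Set[str]) -> List[Tuple[int, int]]:
    """Toy model of an old reader: it inspects every chunk, then projects."""
    for chunk in chunks:
        if chunk.data_offset is None or chunk.data_size is None:
            raise ValueError(f"no offset/size for column {chunk.name!r}")
    return [(c.data_offset, c.data_size) for c in chunks if c.name in wanted]


# PF file (assumed layout): the encrypted column's offset and size are not visible.
pf_file = [ColumnChunk("id", 4, 100), ColumnChunk("ssn", None, None)]

try:
    old_reader_plan(pf_file, {"id"})
    pf_readable = True
except ValueError:
    pf_readable = False  # fails although only the plaintext column was requested
```

In this toy model the read fails on the encrypted column's missing offset,
even though only the plaintext column was asked for.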

Now, the options:

1) Don't allow old Parquet readers to read encrypted files. Organizations
that start working with encrypted data will update their analytic
frameworks to use a Parquet version that supports encryption. This includes both
frameworks that write/read encrypted columns, and frameworks that work only
with unencrypted columns. The former and latter can technically be the same
framework, just different instances of it. The update can be done in one of
the following ways:
a. Upgrade Parquet to the latest version, which supports encryption. This
might require some changes in framework code, unrelated to encryption.
b. Use the original old Parquet version, with added encryption support
(requires rebuilding the framework, no code changes). This is not hard, I'm
doing it for Parquet 1.8.2 in order to build and run Spark 2.3.0 with
encrypted data.
I think I can post this for 1.8.2 and other versions, with some help from
the community.

2) Replace PF with PF~, in order to allow old Parquet readers to read
unencrypted columns in encrypted files. PF~ is a little less secure and a
little less elegant version of PF. Less secure because it has to expose the
offset and size of encrypted column data. But actually it's not
catastrophic, and in any case, organizations with higher security
requirements will use the EF mode. Others can start with PF~ for a
transition period, and switch to EF later.
PF~ requires changing 2 lines in the parquet.thrift file, and a few dozen
lines in the implementation. I've played with this today, and it seems quite
feasible.
So, unless the community strongly favors option 1, I'm inclined to proceed
with 2; it should take up to a week to get the PRs submitted.
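
The trade-off between the three modes can be summarized in a small Python
sketch. The field labels are taken from the description above and are not
actual parquet.thrift field names; the VISIBLE table and the extra_exposure
function are hypothetical and only model what a reader without keys can see.

```python
# What a reader without any keys can see, per mode (labels are illustrative).
FOOTER_FIELDS = {"schema", "number of rows", "key-value properties",
                 "column names", "column sort order",
                 "list of encrypted columns", "key metadata"}
ENCRYPTED_COLUMN_LOCATION = {"data offset", "data size"}

VISIBLE = {
    "EF": set(),                                       # footer is encrypted
    "PF": FOOTER_FIELDS,                               # footer in plaintext
    "PF~": FOOTER_FIELDS | ENCRYPTED_COLUMN_LOCATION,  # plus offset and size
}
# Stats and value counts of encrypted columns are not listed for any mode here.


def extra_exposure(mode: str, baseline: str) -> set:
    """Fields visible in `mode` that are hidden in `baseline`."""
    return VISIBLE[mode] - VISIBLE[baseline]


print(sorted(extra_exposure("PF~", "PF")))  # ['data offset', 'data size']
```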

Cheers, Gidon.

Re: Old readers & encrypted files

Posted by Gidon Gershinsky <gg...@gmail.com>.
and sorry for the EH, PH typos in a couple of places, should've been EF, PF.

On Tue, Sep 18, 2018 at 11:19 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> Just to clarify: PF~ allows older readers to read data as long as they only
> try to access unencrypted columns. What happens when older readers do try
> to access encrypted columns?
>
> Also, by older readers do you specifically mean the current Java library
> or all existing language bindings?
>
> Thanks,
>
> Zoltan

Re: Old readers & encrypted files

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi,

Sounds good. I've submitted a PR for this, and updated the Encryption.md PR.

Cheers, Gidon.

On Wed, Sep 19, 2018 at 9:54 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> Sounds good to me. The filename extension could really help to prevent
> confusion.
>
> Br,
>
> Zoltan

Re: Old readers & encrypted files

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

Sounds good to me. The filename extension could really help to prevent
confusion.

Br,

Zoltan

On Tue, Sep 18, 2018 at 4:35 PM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi,
>
> 2 cents re the first point - the encrypted files will have an extension
> "parquet.encrypted", which should help people understand the reason for
> their error. They also should be aware that using old readers for encrypted
> files is a temporary solution; the right thing to do is to upgrade to a new
> Parquet version.
> But I'm also ok with a truly incompatible format for the encrypted files.

Re: Old readers & encrypted files

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi,

2 cents re the first point - the encrypted files will have an extension
"parquet.encrypted", which should help people understand the reason for
their error. They also should be aware that using old readers for encrypted
files is a temporary solution; the right thing to do is to upgrade to a new
Parquet version.
But I'm also ok with a truly incompatible format for the encrypted files.
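
As a rough illustration of how the extension could help, a wrapper around an
old reader might turn a low-level failure into a clearer hint. The function
name and the message text below are hypothetical, and Python is used only for
brevity.

```python
ENCRYPTED_SUFFIX = ".parquet.encrypted"


def explain_failure(path: str, error: Exception) -> str:
    """Return a friendlier message when the failing file looks encrypted."""
    if path.endswith(ENCRYPTED_SUFFIX):
        return (f"{path}: this file is encrypted; an old reader can only read "
                f"its unencrypted columns (original error: {error})")
    return f"{path}: {error}"
```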

On Tue, Sep 18, 2018 at 5:07 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I'm a little bit worried that the misleading error message could lead to
> serious confusion. For this reason, I would slightly prefer a truly
> incompatible format for the encrypted files, but I don't have strong
> feelings against doing it the other way either.
>
> One idea that came to my mind (which could easily be a bad idea) is to
> write two metadata sections, one for new readers and one for older ones.
> The latter would not contain references to encrypted columns at all.
>
> Br,
>
> Zoltan

Re: Old readers & encrypted files

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

I'm a little bit worried that the misleading error message could lead to
serious confusion. For this reason, I would slightly prefer a truly
incompatible format for the encrypted files, but I don't have strong
feelings against doing it the other way either.

One idea that came to my mind (which could easily be a bad idea) is to
write two metadata sections, one for new readers and one for older ones.
The latter would not contain references to encrypted columns at all.
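
The second idea could look roughly like this Python sketch: the writer derives
a metadata section for old readers by dropping every encrypted column. The
dictionary layout and all names are hypothetical and do not correspond to the
actual Parquet footer format.

```python
from typing import Dict, List


def legacy_footer(columns: List[Dict]) -> List[Dict]:
    """Metadata section for old readers: encrypted columns are left out."""
    return [col for col in columns if not col["encrypted"]]


columns = [
    {"name": "id", "encrypted": False, "offset": 4, "size": 100},
    {"name": "ssn", "encrypted": True, "offset": 104, "size": 80},
]

new_section = columns                 # for new readers: all columns
old_section = legacy_footer(columns)  # for old readers: plaintext columns only
print([col["name"] for col in old_section])  # ['id']
```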

Br,

Zoltan

On Tue, Sep 18, 2018 at 10:40 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi Zoltan,
>
> Old readers, trying to access encrypted columns in PF~ files, get a Thrift
> parsing exception, since they expect a plaintext PageHeader structure at
> the page offset.
> In encrypted columns, PageHeaders are encrypted with the column key.
>
> Old Parquet bindings in any language should be able to read plaintext
> columns in PF~ files.
>
> Cheers, Gidon.
>
>
> On Tue, Sep 18, 2018 at 11:19 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > Hi,
> >
> > Just to clarify: PF~ allows older readers to read data as long as they
> only
> > try to access unencrypted columns. What happens when older readers do try
> > to access encrypted columns?
> >
> > Also, by older readers do you specificially mean the current Java library
> > or all existing language bindings?
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Tue, Sep 18, 2018 at 9:45 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
> >
> > > Hi all,
> > >
> > > This week, 8 months after the first call for goals feedback and
> > > requirements :), I got a new one - enabling old Parquet readers to
> access
> > > data of unencrypted columns in encrypted files.
> > > Better late than never.. But actually it doesn't sound unreasonable,
> and
> > > deserved at least a consideration.
> > >
> > > Let me describe the options (the way I see them). Any community
> feedback
> > is
> > > welcome.
> > >
> > > But first, a little tech intro. Encrypted Parquet files can be created
> in
> > > two modes - with an encrypted footer (lets call this an 'EF' mode for
> the
> > > purpose of this discussion), or with a plaintext footer ('PF' mode).
> > > EF is significantly more secure - it protects all data and metadata in
> a
> > > file, including the schema, number of rows, key-value properties,
> column
> > > names, column sort order, list of encrypted columns and metadata of the
> > > column encryption keys.
> > > PF hides the data, but leaks all of these metadata fields. Moreover, EF
> > > makes the footer tamper-proof, while PF doesn't.
> > > The reason we have the PF option is to let users with relaxed security
> > > requirements to enable readers, that don't have access to any keys, to
> > read
> > > unencrypted columns in a file.
> > >
> > > For encrypted columns, both EH and PH hide the ColumnMetaData -
> including
> > > the min/max stats, number of values, data offset, data size and other
> > > fields. Old Parquet readers obviously can't read EF files. But they
> can't
> > > also read PF files - because old readers need access to data offset and
> > > size of every column in a file, event if they try to read just one
> column
> > > (this is fixed in an encryption pull request).
> > >
> > > Now, the options:
> > >
> > > 1) Don't allow old Parquet readers to read encrypted files. Organizations
> > > that start working with encrypted data will update their analytic
> > > frameworks to use an encrypting Parquet version. This includes both
> > > frameworks that write/read encrypted columns, and frameworks that work
> > > only with unencrypted columns. The former and latter can technically be
> > > the same framework, just different instances of it. The update can be
> > > done in one of the following ways:
> > > a. Upgrade Parquet version to the latest one, supporting encryption. This
> > > might require some changes in framework code, unrelated to encryption.
> > > b. Use the original old Parquet version, with added encryption support
> > > (requires rebuilding the framework, no code changes). This is not hard,
> > > I'm doing it for Parquet 1.8.2 in order to build and run Spark 2.3.0 with
> > > encrypted data.
> > > I think I can post this for 1.8.2 and other versions, with some help from
> > > the community.
> > >
> > > 2) Replace PF with PF~, in order to allow old Parquet readers to read
> > > unencrypted columns in encrypted files. PF~ is a little less secure and a
> > > little less elegant version of PF. Less secure because it has to expose
> > > the offset and size of encrypted column data. But actually it's not
> > > catastrophic, and in any case, organizations with higher security
> > > requirements will use the EF mode. Others can start with PF~ for a
> > > transition period, and switch to EF later.
> > > PF~ requires changing 2 lines in the parquet.thrift file, and a few dozen
> > > lines in the implementation. I've played with this today, seems quite
> > > feasible.
> > > So, unless the community strongly favors option 1, I'm inclined to proceed
> > > with 2, should take up to a week to get the PRs submitted.
> > >
> > > Cheers, Gidon.
> > >
> >
>

Re: Old readers & encrypted files

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi Zoltan,

Old readers, trying to access encrypted columns in PF~ files, get a Thrift
parsing exception, since they expect a plaintext PageHeader structure at
the page offset.
In encrypted columns, PageHeaders are encrypted with the column key.

Old Parquet bindings in any language should be able to read plaintext
columns in PF~ files.
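
The behaviour described above can be illustrated with a toy model. This is
only a sketch under stated assumptions, not the Parquet format: the PGHD
marker is a hypothetical stand-in for a Thrift-serialized PageHeader, XOR
stands in for real column encryption, and all function names are invented
for the illustration. The footer keeps the offset and size of every column
in plaintext, which is the information PF~ exposes.

```python
import struct

# Hypothetical stand-in for a Thrift-serialized PageHeader (not Parquet's).
MAGIC = b"PGHD"


class PageHeaderParseError(Exception):
    """Stands in for the Thrift parsing exception an old reader would get."""


def make_page(values: bytes) -> bytes:
    # Toy page: marker, 4-byte little-endian payload length, payload.
    return MAGIC + struct.pack("<I", len(values)) + values


def toy_encrypt(data: bytes, key: int) -> bytes:
    # XOR is NOT real encryption; it only makes the header unparseable here.
    return bytes(b ^ key for b in data)


def write_file(columns: dict, encrypted: dict) -> tuple:
    """Return (body, footer). The footer lists (offset, size) for every
    column in plaintext, including the encrypted ones, as in PF~."""
    body = b""
    footer = {}
    for name, values in columns.items():
        page = make_page(values)
        if name in encrypted:
            # The page header is encrypted together with the data.
            page = toy_encrypt(page, encrypted[name])
        footer[name] = (len(body), len(page))
        body += page
    return body, footer


def old_reader_read(body: bytes, footer: dict, name: str) -> bytes:
    """An 'old reader': knows nothing about encryption and expects a
    plaintext page header at the column's offset."""
    offset, size = footer[name]
    chunk = body[offset:offset + size]
    if chunk[:4] != MAGIC:
        raise PageHeaderParseError(f"cannot parse page header of {name!r}")
    (length,) = struct.unpack("<I", chunk[4:8])
    return chunk[8:8 + length]
```

In this model, reading a plaintext column succeeds because its offset and
size are available and its header parses, while reading an encrypted column
raises PageHeaderParseError at the first page.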

Cheers, Gidon.


On Tue, Sep 18, 2018 at 11:19 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> Just to clarify: PF~ allows older readers to read data as long as they only
> try to access unencrypted columns. What happens when older readers do try
> to access encrypted columns?
>
> Also, by older readers do you specifically mean the current Java library
> or all existing language bindings?
>
> Thanks,
>
> Zoltan
>
> On Tue, Sep 18, 2018 at 9:45 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > Hi all,
> >
> > This week, 8 months after the first call for goals feedback and
> > requirements :), I got a new one - enabling old Parquet readers to access
> > data of unencrypted columns in encrypted files.
> > Better late than never.. But actually it doesn't sound unreasonable, and
> > deserves at least a consideration.
> >
> > Let me describe the options (the way I see them). Any community feedback
> > is welcome.
> >
> > But first, a little tech intro. Encrypted Parquet files can be created in
> > two modes - with an encrypted footer (let's call this an 'EF' mode for the
> > purpose of this discussion), or with a plaintext footer ('PF' mode).
> > EF is significantly more secure - it protects all data and metadata in a
> > file, including the schema, number of rows, key-value properties, column
> > names, column sort order, list of encrypted columns and metadata of the
> > column encryption keys.
> > PF hides the data, but leaks all of these metadata fields. Moreover, EF
> > makes the footer tamper-proof, while PF doesn't.
> > The reason we have the PF option is to let users with relaxed security
> > requirements enable readers that don't have access to any keys to read
> > unencrypted columns in a file.
> >
> > For encrypted columns, both EF and PF hide the ColumnMetaData - including
> > the min/max stats, number of values, data offset, data size and other
> > fields. Old Parquet readers obviously can't read EF files. But they also
> > can't read PF files - because old readers need access to the data offset
> > and size of every column in a file, even if they try to read just one
> > column (this is fixed in an encryption pull request).
> >
> > Now, the options:
> >
> > 1) Don't allow old Parquet readers to read encrypted files. Organizations
> > that start working with encrypted data will update their analytic
> > frameworks to use an encrypting Parquet version. This includes both
> > frameworks that write/read encrypted columns, and frameworks that work
> > only with unencrypted columns. The former and latter can technically be
> > the same framework, just different instances of it. The update can be done
> > in one of the following ways:
> > a. Upgrade Parquet version to the latest one, supporting encryption. This
> > might require some changes in framework code, unrelated to encryption.
> > b. Use the original old Parquet version, with added encryption support
> > (requires rebuilding the framework, no code changes). This is not hard,
> > I'm doing it for Parquet 1.8.2 in order to build and run Spark 2.3.0 with
> > encrypted data.
> > I think I can post this for 1.8.2 and other versions, with some help from
> > the community.
> >
> > 2) Replace PF with PF~, in order to allow old Parquet readers to read
> > unencrypted columns in encrypted files. PF~ is a little less secure and a
> > little less elegant version of PF. Less secure because it has to expose
> > the offset and size of encrypted column data. But actually it's not
> > catastrophic, and in any case, organizations with higher security
> > requirements will use the EF mode. Others can start with PF~ for a
> > transition period, and switch to EF later.
> > PF~ requires changing 2 lines in the parquet.thrift file, and a few dozen
> > lines in the implementation. I've played with this today, seems quite
> > feasible.
> > So, unless the community strongly favors option 1, I'm inclined to proceed
> > with 2, should take up to a week to get the PRs submitted.
> >
> > Cheers, Gidon.
> >
>

Re: Old readers & encrypted files

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

Just to clarify: PF~ allows older readers to read data as long as they only
try to access unencrypted columns. What happens when older readers do try
to access encrypted columns?

Also, by older readers do you specifically mean the current Java library
or all existing language bindings?

Thanks,

Zoltan

On Tue, Sep 18, 2018 at 9:45 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi all,
>
> This week, 8 months after the first call for goals feedback and
> requirements :), I got a new one - enabling old Parquet readers to access
> data of unencrypted columns in encrypted files.
> Better late than never.. But actually it doesn't sound unreasonable, and
> deserves at least a consideration.
>
> Let me describe the options (the way I see them). Any community feedback is
> welcome.
>
> But first, a little tech intro. Encrypted Parquet files can be created in
> two modes - with an encrypted footer (let's call this an 'EF' mode for the
> purpose of this discussion), or with a plaintext footer ('PF' mode).
> EF is significantly more secure - it protects all data and metadata in a
> file, including the schema, number of rows, key-value properties, column
> names, column sort order, list of encrypted columns and metadata of the
> column encryption keys.
> PF hides the data, but leaks all of these metadata fields. Moreover, EF
> makes the footer tamper-proof, while PF doesn't.
> The reason we have the PF option is to let users with relaxed security
> requirements enable readers that don't have access to any keys to read
> unencrypted columns in a file.
>
> For encrypted columns, both EF and PF hide the ColumnMetaData - including
> the min/max stats, number of values, data offset, data size and other
> fields. Old Parquet readers obviously can't read EF files. But they also
> can't read PF files - because old readers need access to the data offset and
> size of every column in a file, even if they try to read just one column
> (this is fixed in an encryption pull request).
>
> Now, the options:
>
> 1) Don't allow old Parquet readers to read encrypted files. Organizations
> that start working with encrypted data will update their analytic
> frameworks to use an encrypting Parquet version. This includes both
> frameworks that write/read encrypted columns, and frameworks that work only
> with unencrypted columns. The former and latter can technically be the same
> framework, just different instances of it. The update can be done in one of
> the following ways:
> a. Upgrade Parquet version to the latest one, supporting encryption. This
> might require some changes in framework code, unrelated to encryption.
> b. Use the original old Parquet version, with added encryption support
> (requires rebuilding the framework, no code changes). This is not hard, I'm
> doing it for Parquet 1.8.2 in order to build and run Spark 2.3.0 with
> encrypted data.
> I think I can post this for 1.8.2 and other versions, with some help from
> the community.
>
> 2) Replace PF with PF~, in order to allow old Parquet readers to read
> unencrypted columns in encrypted files. PF~ is a little less secure and a
> little less elegant version of PF. Less secure because it has to expose the
> offset and size of encrypted column data. But actually it's not
> catastrophic, and in any case, organizations with higher security
> requirements will use the EF mode. Others can start with PF~ for a
> transition period, and switch to EF later.
> PF~ requires changing 2 lines in the parquet.thrift file, and a few dozen
> lines in the implementation. I've played with this today, seems quite
> feasible.
> So, unless the community strongly favors option 1, I'm inclined to proceed
> with 2, should take up to a week to get the PRs submitted.
>
> Cheers, Gidon.
>