Posted to dev@parquet.apache.org by Zoltan Ivanfi <zi...@cloudera.com.INVALID> on 2018/09/19 06:53:43 UTC

***UNCHECKED*** Re: Old readers & encrypted files

Hi,

Sounds good to me. The filename extension could really help to prevent
confusion.

Br,

Zoltan

On Tue, Sep 18, 2018 at 4:35 PM Gidon Gershinsky <gg...@gmail.com> wrote:

> Hi,
>
> 2 cents re the first point - the encrypted files will have an extension
> "parquet.encrypted", which should help people understand the reason for
> their error. They should also be aware that using old readers for encrypted
> files is a temporary solution; the right thing to do is to upgrade to a new
> Parquet version.
> But I'm also ok with a truly incompatible format for the encrypted files.
>
> On Tue, Sep 18, 2018 at 5:07 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > Hi,
> >
> > I'm a little bit worried that the misleading error message could lead to
> > serious confusion. For this reason, I would slightly prefer a truly
> > incompatible format for the encrypted files, but I don't have strong
> > feelings against doing it the other way either.
> >
> > One idea that came to my mind (which could easily be a bad idea) is to
> > write two metadata sections, one for new readers and one for older ones.
> > The latter would not contain references to encrypted columns at all.
> >
> > Br,
> >
> > Zoltan
> >
> > On Tue, Sep 18, 2018 at 10:40 AM Gidon Gershinsky <gg...@gmail.com>
> > wrote:
> >
> > > Hi Zoltan,
> > >
> > > Old readers, trying to access encrypted columns in PF~ files, get a
> > > Thrift parsing exception, since they expect a plaintext PageHeader
> > > structure at the page offset.
> > > In encrypted columns, PageHeaders are encrypted with the column key.
> > >
> > > Old Parquet bindings in any language should be able to read plaintext
> > > columns in PF~ files.
> > >
> > > Cheers, Gidon.
> > >
> > >
> > > On Tue, Sep 18, 2018 at 11:19 AM Zoltan Ivanfi <zi@cloudera.com.invalid>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Just to clarify: PF~ allows older readers to read data as long as
> > > > they only try to access unencrypted columns. What happens when older
> > > > readers do try to access encrypted columns?
> > > >
> > > > Also, by older readers do you specifically mean the current Java
> > > > library or all existing language bindings?
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Tue, Sep 18, 2018 at 9:45 AM Gidon Gershinsky <gg...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > This week, 8 months after the first call for goals feedback and
> > > > > requirements :), I got a new one - enabling old Parquet readers to
> > > > > access data of unencrypted columns in encrypted files.
> > > > > Better late than never.. But actually it doesn't sound unreasonable,
> > > > > and deserves at least a consideration.
> > > > >
> > > > > Let me describe the options (the way I see them). Any community
> > > > > feedback is welcome.
> > > > >
> > > > > But first, a little tech intro. Encrypted Parquet files can be
> > > > > created in two modes - with an encrypted footer (let's call this an
> > > > > 'EF' mode for the purpose of this discussion), or with a plaintext
> > > > > footer ('PF' mode).
> > > > > EF is significantly more secure - it protects all data and metadata
> > > > > in a file, including the schema, number of rows, key-value
> > > > > properties, column names, column sort order, list of encrypted
> > > > > columns and metadata of the column encryption keys.
> > > > > PF hides the data, but leaks all of these metadata fields. Moreover,
> > > > > EF makes the footer tamper-proof, while PF doesn't.
> > > > > The reason we have the PF option is to let users with relaxed
> > > > > security requirements enable readers that don't have access to any
> > > > > keys to read unencrypted columns in a file.
> > > > >
> > > > > For encrypted columns, both EF and PF hide the ColumnMetaData -
> > > > > including the min/max stats, number of values, data offset, data
> > > > > size and other fields. Old Parquet readers obviously can't read EF
> > > > > files. But they also can't read PF files - because old readers need
> > > > > access to the data offset and size of every column in a file, even
> > > > > if they try to read just one column (this is fixed in an encryption
> > > > > pull request).
> > > > >
> > > > > Now, the options:
> > > > >
> > > > > 1) Don't allow old Parquet readers to read encrypted files.
> > > > > Organizations that start working with encrypted data will update
> > > > > their analytic frameworks to use an encrypting Parquet version. This
> > > > > includes both frameworks that write/read encrypted columns, and
> > > > > frameworks that work only with unencrypted columns. The former and
> > > > > latter can technically be the same framework, just different
> > > > > instances of it. The update can be done in one of the following
> > > > > ways:
> > > > > a. Upgrade Parquet version to the latest one, supporting encryption.
> > > > > This might require some changes in framework code, unrelated to
> > > > > encryption.
> > > > > b. Use the original old Parquet version, with an added encryption
> > > > > support (requires rebuilding the framework, no code changes). This
> > > > > is not hard, I'm doing it for Parquet 1.8.2 in order to build and
> > > > > run Spark 2.3.0 with encrypted data.
> > > > > I think I can post this for 1.8.2 and other versions, with some help
> > > > > from the community.
> > > > >
> > > > > 2) Replace PF with PF~, in order to allow old Parquet readers to
> > > > > read unencrypted columns in encrypted files. PF~ is a little less
> > > > > secure and a little less elegant version of PF. Less secure because
> > > > > it has to expose the offset and size of encrypted column data. But
> > > > > actually it's not catastrophic, and in any case, organizations with
> > > > > higher security requirements will use the EF mode. Others can start
> > > > > with PF~ for a transition period, and switch to EF later.
> > > > > PF~ requires changing 2 lines in the parquet.thrift file, and a few
> > > > > dozen lines in the implementation. I've played with this today, and
> > > > > it seems quite feasible.
> > > > > So, unless the community strongly favors option 1, I'm inclined to
> > > > > proceed with 2; it should take up to a week to get the PRs
> > > > > submitted.
> > > > >
> > > > > Cheers, Gidon.
> > > > >
> > > >
> > >
> >
>
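
For anyone skimming the thread, the compatibility rules discussed above can be
summarized in a small Python sketch. This is only an illustration of the
behaviour as described in the quoted messages, not code from any Parquet
implementation. The names Mode, read_outcome, new_reader and has_key are
hypothetical, the file is assumed to contain at least one encrypted column,
and a single has_key flag stands in for whatever keys a reader may hold.

```python
from enum import Enum


class Mode(Enum):
    EF = "encrypted footer"
    PF = "plaintext footer"
    PF_TILDE = "plaintext footer, encrypted column offset/size exposed (PF~)"


def read_outcome(mode, new_reader, has_key, column_encrypted):
    """Describe what a reader gets when it asks for a single column.

    Assumes the file has at least one encrypted column (hypothetical model).
    """
    if not new_reader:
        # Old readers know nothing about encryption.
        if mode is Mode.EF:
            return "fail: footer is encrypted"
        if mode is Mode.PF:
            # Old readers need offset and size of every column in the file.
            return "fail: offset/size of encrypted columns is hidden"
        # PF~: offsets and sizes are exposed, page headers are not.
        if column_encrypted:
            return "fail: Thrift parsing exception on encrypted PageHeader"
        return "ok"
    # Encryption-aware readers.
    if mode is Mode.EF:
        return "ok" if has_key else "fail: no key for encrypted footer"
    if column_encrypted:
        return "ok" if has_key else "fail: no column key"
    return "ok"
```

For example, read_outcome(Mode.PF_TILDE, new_reader=False, has_key=False,
column_encrypted=False) returns "ok", while the same call with Mode.PF
returns a failure, which is the gap PF~ is meant to close.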

Re: ***UNCHECKED*** Re: Old readers & encrypted files

Posted by Gidon Gershinsky <gg...@gmail.com>.
Hi,

Sounds good. I've submitted a PR for this, and updated the Encryption.md PR.

Cheers, Gidon.
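
As a rough illustration of how the "parquet.encrypted" extension could help
people understand a confusing error, here is a minimal Python sketch. The
helper name explain_read_failure and the message wording are hypothetical;
only the extension itself comes from this thread.

```python
ENCRYPTED_SUFFIX = ".parquet.encrypted"  # extension mentioned in the thread


def explain_read_failure(path, error):
    """Add a hint to a read error when the file name marks it as encrypted."""
    if path.endswith(ENCRYPTED_SUFFIX):
        return (f"{path}: looks like an encrypted Parquet file; "
                f"an encryption-aware Parquet version is needed ({error})")
    return f"{path}: {error}"
```

A wrapper around an old reader could call this when it catches the Thrift
parsing exception, so that the user sees the likely cause next to the
original error.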
