Posted to dev@parquet.apache.org by Gidon Gershinsky <gg...@gmail.com> on 2018/06/07 13:56:26 UTC

Parquet encryption - updates

Hi,

As I mentioned in the last sync, there have been a number of significant
developments on this subject. The design doc and the implementation (PRs)
are now updated; if you have reviewed the design before, please do so
again,
https://docs.google.com/document/d/1T89G7xR0zHFV1f2pjTO28jtfVm8qoNVGEJQ70Rsk-bY/edit?usp=sharing
with a focus on the following new features:

1. Support for multiple keys (a footer key and a key per column). This seems
to be required by everybody interested in Parquet encryption, so I've added
it to the design and implementation. We'll skip the single-key phase and go
directly to the multi-key mechanism. If you recall, we discussed the
challenge of protecting column-specific sensitive information (e.g. stats)
in the footer when working with multiple keys. It turns out there is a very
clean solution: this information is kept in a single structure that is
already defined as an optional field in the footer, while its file offset
is a required field. This is perfect for separate serialization and
encryption with a column-specific key. See the doc (and PR code) for
details.

2. Second encryption algorithm. It trades some of the integrity checks for
higher throughput. This is useful in certain situations (light workloads
running on Java 8 - see the performance doc link below), but it was also
added as an example of support for multiple encryption algorithms, so that
we can tune the design accordingly.

3. API samples section - modified to demo multiple keys, multiple algorithms
and the concept of hidden columns (those encrypted with a key not available
to a reader).

4. Performance report posted as a separate doc:
https://docs.google.com/document/d/1JpqaNIvkZZ5Hl39UNaYA7XIPB4sy7kQ_PPjqN7fWGqM/edit?usp=sharing
It's a very initial version that currently shows only the raw Java numbers
(as background for the second encryption algorithm). Later, I will add
measurement results for Parquet encryption with different workloads. The
doc had some of these, but they were measured for single-key Parquet
encryption, so I have removed them for now.
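
To make items 1 and 3 a bit more concrete, here is a minimal sketch of the
idea: each column's metadata is serialized on its own and encrypted with that
column's key, so a reader without the key simply cannot see that column. This
is not the code from the PRs. The class and method names (ColumnKeySketch,
readVisible, newKey) are hypothetical, the metadata is a plain string rather
than a real footer structure, and the use of AES-GCM from javax.crypto is an
assumption made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration only; not the Parquet encryption API.
final class ColumnKeySketch {
    private static final int NONCE_LEN = 12;  // bytes, the usual GCM nonce size
    private static final int TAG_BITS = 128;  // GCM authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a random 128-bit AES key.
    static byte[] newKey() {
        byte[] key = new byte[16];
        RANDOM.nextBytes(key);
        return key;
    }

    // Encrypts one serialized metadata blob; output is nonce || ciphertext+tag.
    static byte[] encrypt(byte[] key, byte[] plain) throws GeneralSecurityException {
        byte[] nonce = new byte[NONCE_LEN];
        RANDOM.nextBytes(nonce);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(TAG_BITS, nonce));
        byte[] sealed = cipher.doFinal(plain);
        byte[] out = new byte[NONCE_LEN + sealed.length];
        System.arraycopy(nonce, 0, out, 0, NONCE_LEN);
        System.arraycopy(sealed, 0, out, NONCE_LEN, sealed.length);
        return out;
    }

    // Reverses encrypt(); throws if the key is wrong or the blob was modified.
    static byte[] decrypt(byte[] key, byte[] blob) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(TAG_BITS, blob, 0, NONCE_LEN));
        return cipher.doFinal(blob, NONCE_LEN, blob.length - NONCE_LEN);
    }

    // Decrypts metadata for every column the reader holds a key for.
    // Columns without a key are skipped, i.e. they stay hidden.
    static Map<String, String> readVisible(Map<String, byte[]> encryptedMeta,
                                           Map<String, byte[]> readerKeys)
            throws GeneralSecurityException {
        Map<String, String> visible = new TreeMap<>();
        for (Map.Entry<String, byte[]> e : encryptedMeta.entrySet()) {
            byte[] key = readerKeys.get(e.getKey());
            if (key == null) {
                continue;  // hidden column: no key available to this reader
            }
            byte[] plain = decrypt(key, e.getValue());
            visible.put(e.getKey(), new String(plain, StandardCharsets.UTF_8));
        }
        return visible;
    }

    public static void main(String[] args) throws GeneralSecurityException {
        // Writer side: one key per column, each metadata blob encrypted separately.
        Map<String, byte[]> writerKeys = new TreeMap<>();
        writerKeys.put("name", newKey());
        writerKeys.put("ssn", newKey());
        Map<String, byte[]> encryptedMeta = new TreeMap<>();
        for (Map.Entry<String, byte[]> e : writerKeys.entrySet()) {
            byte[] stats = ("stats for " + e.getKey()).getBytes(StandardCharsets.UTF_8);
            encryptedMeta.put(e.getKey(), encrypt(e.getValue(), stats));
        }

        // Reader side: holds only the "name" key, so "ssn" is a hidden column.
        Map<String, byte[]> readerKeys = new TreeMap<>();
        readerKeys.put("name", writerKeys.get("name"));
        System.out.println(readVisible(encryptedMeta, readerKeys).keySet());
    }
}
```

In this sketch the reader is given only the "name" key, so readVisible
returns the "name" metadata and silently skips "ssn". How the real
implementation locates and serializes the column structure is described in
the design doc.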

Cheers, Gidon.