Posted to dev@parquet.apache.org by Julien Le Dem <ju...@wework.com> on 2018/06/07 18:19:53 UTC

Parquet sync notes

Attendees / Agenda:
Gidon (IBM): Parquet encryption. Uber, Vertica, Amazon
Anna, Gabor, Nandor (Cloudera): Review for column indexing
Junjie (Tencent): Bloom filter
Lars (Cloudera Impala)
Jim (Cloudera): Bloom filter
Deepak (Vertica): Encryption
Qinghui, Benoit (Criteo): parquet protobuf.

Parquet encryption:
* Deepak will look at the code this week.
* Gidon update:
    * multi key encryption (one for keys and one for footer)
        * Implementation available.
    * Working on performance evaluation
        * Starting in Java 9, encryption is hardware accelerated and much better, with very little overhead.
        * Java 8 encryption has more overhead.
            * If using gzip overhead is small
            * If using snappy, overhead is high
    * Added a second encryption implementation for Java 8 that is faster but less secure
        * Advantage of 2 algorithms: makes us think about formalizing the choice of algorithm in metadata as well.
    * Use case: using encryption without the API, passing the info through the Hadoop config.
    * Modified design document
* Discussion on metadata.
* Column indexes do not replace the statistics in the footer but replace the statistics in the page header.

Column indexing:
* Parquet-mr/pr/481

* Encryption
    * [Some things covered already before these notes started]
    * Hardware support for encryption? Yes on POWER. Not sure about ARM. Definitely on x86-64.
* Bloom filters: C++ needs review, but also doing performance tests
    * Guava Bloom filter: not sure if it is compatible between versions. Impala BFs might be much faster.
    * Java vs. C++ compat: there will be tests
* Column indexing
    * parquet-mr 481 https://github.com/apache/parquet-mr/pull/481
    * Right now this is being done in a separate branch for compat reasons. Not sure the write path will work.
        * That branch has 3 or more commits
    * Column indexes will be stored just before the filter. Will the statistics (before the footer) still be useful with column indexing, or can we just leave them out?
        * Filter is for row-groups, column indexing is for pages?
        * Do we store the maximum value in a page, or a value that is greater than or equal to the largest value in the page? Impala does the latter; PR#481 does that for some pages, but not all (?)
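
To illustrate the last question, here is a minimal sketch of page pruning with exact versus loose upper bounds. It is not the parquet-mr or Impala implementation: the PageBounds class, the pages_to_read helper, and the round-up-to-a-multiple-of-10 rule are all hypothetical and chosen only for illustration. It suggests that a stored maximum that is merely greater than or equal to the largest value stays correct for filtering, and costs only some extra page reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageBounds:
    """Hypothetical per-page entry of a column index (illustration only)."""
    min_value: int  # less than or equal to every value in the page
    max_value: int  # greater than or equal to every value in the page


def pages_to_read(index: List[PageBounds], lo: int, hi: int) -> List[int]:
    """Return ordinals of pages that may contain a value in [lo, hi]."""
    return [
        i for i, page in enumerate(index)
        if page.max_value >= lo and page.min_value <= hi
    ]


# Three pages of a sorted integer column.
pages = [[1, 5, 9], [10, 14, 19], [20, 27, 33]]

# Exact bounds: the stored max is the largest value in the page.
exact = [PageBounds(min(p), max(p)) for p in pages]

# Loose bounds: the stored max is only an upper bound. Rounding up to the
# next multiple of 10 is an assumption made for this example.
loose = [PageBounds(min(p), -(-max(p) // 10) * 10) for p in pages]

exact_hits = pages_to_read(exact, 10, 12)
loose_hits = pages_to_read(loose, 10, 12)
print(exact_hits, loose_hits)
```

With loose bounds the reader may scan an extra page (page 0 here, whose stored max is 10 rather than 9), but it never skips a page that holds a matching value, so both conventions are safe for filtering.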