Posted to dev@parquet.apache.org by Julien Le Dem <ju...@gmail.com> on 2017/09/27 15:57:17 UTC
parquet sync
starting now at:
https://meet.google.com/wgv-qske-hzs
Re: parquet sync
Posted by Julien Le Dem <ju...@gmail.com>.
Parquet Sync Sept 27 2017:
Attendance and agenda:
Lars (Cloudera Impala):
- Parquet page index status
Zoltan (Cloudera Impala):
- vectorization
- api annotation (Private/Public)
Ryan (Netflix):
- logical types commit
- Compression tests
Wes (TwoSigma):
- Compression C++
Julien:
- testing parquet files: JSON and Parquet.
Jim (Cloudera)
Notes:
Page Index status:
- need feedback on PR: https://github.com/apache/parquet-format/pull/63
Action: Julien, Marcel Review
Vectorization:
- https://issues.apache.org/jira/browse/PARQUET-131
original discussion in Parquet, which stalled.
- https://issues.apache.org/jira/browse/HIVE-14815
Hive vectorized parquet read.
Use annotations to clarify the state of an API (Private/Public)
- Zoltan to open jira: annotations.
- need to reopen vectorized reader discussion. Follow up on PARQUET-131
Logical types:
- action: need to review PR:
https://github.com/apache/parquet-format/pull/51
Compression tests:
- Ryan: used parquet-cli with the 4 largest/most expensive tables
=> some are big maps of k/v pairs, others are features/structured
each test was run 5 times and the results averaged.
will send spreadsheet with results for brotli/zstandard/lz4
brotli/zstandard look like winners: need more extensive tests
brotli level 5 seems to be a good tradeoff between compression cost and size
lz4 quickest compression time but largest output
zstandard a bit faster and a bit smaller than brotli
uses:
- jbrotli: embedded native library in jar
- zstd: zlibnative path. packaged in ubuntu
- action: Ryan to clean up and send out report
- Wes: C++
speed: gzip, snappy, lz4, zstd
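The "run 5 times and average" methodology from the compression tests can be sketched as below. This is only an illustration, not the harness Ryan used: the codecs are standard-library stand-ins (zlib, bz2, lzma) rather than brotli/zstandard/lz4, the sample data is synthetic, and the names benchmark, run_all and CODECS are hypothetical.

```python
import bz2
import lzma
import time
import zlib


def benchmark(compress, data, runs=5):
    """Compress `data` `runs` times; return (average seconds, compressed size)."""
    elapsed = []
    size = 0
    for _ in range(runs):
        start = time.perf_counter()
        out = compress(data)
        elapsed.append(time.perf_counter() - start)
        size = len(out)
    return sum(elapsed) / runs, size


# Assumption: stdlib codecs stand in for the codecs discussed in the meeting.
CODECS = {
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-6": lambda d: zlib.compress(d, 6),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def run_all(data, runs=5):
    """Benchmark every codec in CODECS on the same input."""
    return {name: benchmark(fn, data, runs) for name, fn in CODECS.items()}


if __name__ == "__main__":
    # Synthetic k/v-style payload; real tests would read actual table data.
    sample = b"key=value;feature=0.125;" * 50_000
    for name, (secs, size) in run_all(sample).items():
        ratio = len(sample) / size
        print(f"{name:8s} {secs * 1000:9.2f} ms {size:9d} bytes {ratio:8.1f}x")
```

Reporting both time and output size per codec makes the cost/size tradeoff mentioned above directly comparable.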
parquet files for tests:
- Impala has a repository of files for tests:
https://github.com/apache/incubator-impala/tree/master/testdata
- old compat test repo: https://github.com/Parquet/parquet-compatibility
- we should have a repository of test files.
- open a JIRA: Lars.
parquet-tools merge command:
- merge command: puts row groups after one another.
- need jira to add comment on how this works (concatenates existing
row groups without combining them into larger ones)
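The merge behavior described above can be modeled with a toy sketch. This is not the parquet-tools implementation; RowGroup, ParquetFile and merge are hypothetical names, and a file is reduced to a list of row-group row counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    num_rows: int


@dataclass
class ParquetFile:
    row_groups: List[RowGroup]


def merge(files: List[ParquetFile]) -> ParquetFile:
    """Append each input's row groups in order, without resizing any of them."""
    merged: List[RowGroup] = []
    for f in files:
        merged.extend(f.row_groups)
    return ParquetFile(merged)


if __name__ == "__main__":
    a = ParquetFile([RowGroup(100), RowGroup(50)])
    b = ParquetFile([RowGroup(10)])
    out = merge([a, b])
    print([rg.num_rows for rg in out.row_groups])
```

In this model, merging many small files yields one file that still has many small row groups, which is why the behavior is worth documenting.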