Posted to dev@hive.apache.org by Dong Chen <do...@intel.com> on 2015/07/21 10:44:47 UTC
Re: Review Request 36540: HIVE-8128: Improve Parquet Vectorization
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36540/
-----------------------------------------------------------
(Updated July 21, 2015, 8:44 a.m.)
Review request for hive, Ryan Blue, cheng xu, and Sergio Pena.
Changes
-------
Review request
Repository: hive-git
Description
-------
This patch is based on the Parquet vector API at https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector
In this POC, the general workflow is in place, two tests pass, and the INT type is supported. The idea is to create a VectorizedParquetRecordReader that wraps the ParquetRecordReader provided by Parquet. Its next() method then converts each Parquet RowBatch into a Hive VectorizedRowBatch.
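A minimal sketch of the wrapping idea, not code from this patch: every class below (SourceBatch, TargetBatch, WrappingReader) is a hypothetical stand-in, so that the example compiles without Parquet or Hive on the classpath. It assumes a single INT column and widens each value into a long array, similar in spirit to how Hive's LongColumnVector backs integer columns with a long[].

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class VectorizedReaderSketch {

    /** Hypothetical stand-in for a Parquet-side batch with one INT column. */
    static final class SourceBatch {
        final int[] values;
        SourceBatch(int... values) { this.values = values; }
    }

    /** Hypothetical stand-in for a Hive-side batch: one long-backed column plus a row count. */
    static final class TargetBatch {
        final long[] vector;
        int size; // number of valid rows in vector
        TargetBatch(int capacity) { this.vector = new long[capacity]; }
    }

    /** Hypothetical wrapper: pulls batches from the wrapped source and converts them in next(). */
    static final class WrappingReader {
        private final Iterator<SourceBatch> wrapped;

        WrappingReader(Iterator<SourceBatch> wrapped) { this.wrapped = wrapped; }

        /** Fills 'out' with the next batch; returns false once the source is exhausted. */
        boolean next(TargetBatch out) {
            if (!wrapped.hasNext()) {
                out.size = 0;
                return false;
            }
            SourceBatch in = wrapped.next();
            if (in.values.length > out.vector.length) {
                throw new IllegalArgumentException("source batch exceeds target capacity");
            }
            // Widen each INT to long; slots at index >= size keep stale data and must be ignored.
            for (int i = 0; i < in.values.length; i++) {
                out.vector[i] = in.values[i];
            }
            out.size = in.values.length;
            return true;
        }
    }

    public static void main(String[] args) {
        List<SourceBatch> batches =
                Arrays.asList(new SourceBatch(1, 2, 3), new SourceBatch(4, 5));
        WrappingReader reader = new WrappingReader(batches.iterator());
        TargetBatch out = new TargetBatch(4); // reused across calls, as a batch reader typically would
        while (reader.next(out)) {
            System.out.println(Arrays.toString(Arrays.copyOf(out.vector, out.size)));
        }
    }
}
```

Reusing one target batch across next() calls avoids a per-batch allocation; the trade-off is that consumers must honor the size field rather than the array length.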
This is the first patch. To complete the vectorization feature, follow-up work is still needed: 1) support all data types, 2) support partition columns, 3) add more test cases, and 4) evaluate performance on a real cluster.
Diffs
-----
pom.xml 1abf738
ql/pom.xml 6026c49
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java e1b6dd8
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/VectorizedParquetInputFormat.java 98691c7
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java adeb971
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedParquetReader.java PRE-CREATION
ql/src/test/queries/clientpositive/vectorized_parquet_data_types.q PRE-CREATION
ql/src/test/results/clientpositive/vectorized_parquet_data_types.q.out PRE-CREATION
Diff: https://reviews.apache.org/r/36540/diff/
Testing
-------
Unit tests passed.
Thanks,
Dong Chen