Posted to dev@hive.apache.org by Dong Chen <do...@intel.com> on 2015/07/21 10:44:47 UTC

Re: Review Request 36540: HIVE-8128: Improve Parquet Vectorization

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36540/
-----------------------------------------------------------

(Updated July 21, 2015, 8:44 a.m.)


Review request for hive, Ryan Blue, cheng xu, and Sergio Pena.


Changes
-------

Review request


Repository: hive-git


Description
-------

This patch is based on the Parquet vector API at https://github.com/nezihyigitbasi-nflx/parquet-mr/commits/vector

This proof of concept (POC) implements the general workflow, passes two tests, and supports the INT type. The idea is to create a VectorizedParquetRecordReader that wraps the ParquetRecordReader provided by Parquet. Its next() method then converts each Parquet RowBatch into a Hive VectorizedRowBatch.
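
The wrap-and-convert idea can be sketched roughly as below. This is only an illustration, not the code in the patch: SourceBatch, TargetBatch, and WrappingReader are hypothetical stand-ins for Parquet's RowBatch, Hive's VectorizedRowBatch, and the VectorizedParquetRecordReader, and the sketch assumes a single INT column that is widened into a long-backed target column.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class VectorizedReaderSketch {

    /** Hypothetical stand-in for a Parquet row batch holding one INT column. */
    static final class SourceBatch {
        final int[] values;

        SourceBatch(int... values) {
            this.values = values;
        }
    }

    /** Hypothetical stand-in for a Hive row batch with one long-backed column. */
    static final class TargetBatch {
        final long[] column;
        int size; // number of valid rows currently held in column

        TargetBatch(int capacity) {
            this.column = new long[capacity];
        }
    }

    /** Hypothetical wrapper: pulls source batches and fills a reusable target batch. */
    static final class WrappingReader {
        private final Iterator<SourceBatch> source;

        WrappingReader(Iterator<SourceBatch> source) {
            this.source = source;
        }

        /** Fills out with the next batch; returns false once the source is exhausted. */
        boolean next(TargetBatch out) {
            if (!source.hasNext()) {
                out.size = 0;
                return false;
            }
            SourceBatch in = source.next();
            if (in.values.length > out.column.length) {
                throw new IllegalArgumentException("source batch exceeds target capacity");
            }
            // Widen each int to long while copying into the target column.
            for (int i = 0; i < in.values.length; i++) {
                out.column[i] = in.values[i];
            }
            out.size = in.values.length;
            return true;
        }
    }

    /** Drains a reader over the given batches and sums every value that was read. */
    static long sumAll(List<SourceBatch> batches, int capacity) {
        WrappingReader reader = new WrappingReader(batches.iterator());
        TargetBatch out = new TargetBatch(capacity);
        long sum = 0;
        while (reader.next(out)) {
            for (int i = 0; i < out.size; i++) {
                sum += out.column[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<SourceBatch> batches =
            Arrays.asList(new SourceBatch(1, 2, 3), new SourceBatch(4, 5));
        System.out.println(sumAll(batches, 4));
    }
}
```

The target batch is allocated once and refilled on every call to next(), so the caller reuses the same arrays across batches instead of allocating per row.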

This is the first patch. To complete the vectorization feature, follow-up work remains: 1) support all data types, 2) support partition columns, 3) add more test cases, and 4) evaluate performance on a real cluster.


Diffs
-----

  pom.xml 1abf738 
  ql/pom.xml 6026c49 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java e1b6dd8 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/VectorizedParquetInputFormat.java 98691c7 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java adeb971 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedParquetReader.java PRE-CREATION 
  ql/src/test/queries/clientpositive/vectorized_parquet_data_types.q PRE-CREATION 
  ql/src/test/results/clientpositive/vectorized_parquet_data_types.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/36540/diff/


Testing
-------

Unit tests passed.


Thanks,

Dong Chen