
Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2015/07/07 18:25:11 UTC

Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/
-----------------------------------------------------------

(Updated July 7, 2015, 4:25 p.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Changes
-------

Addressed review feedback.


Bugs: HIVE-11131
    https://issues.apache.org/jira/browse/HIVE-11131


Repository: hive-git


Description
-------

Implemented data type writers that are created before the first Hive row is written to Parquet. Each writer holds the object inspector and schema information for a specific data type and calls the corresponding addXXXX() method that Parquet provides for that type.
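To illustrate the idea, here is a minimal sketch of resolving one writer per column up front so that no type dispatch happens per value. It is not the code from the patch: ValueConsumer is a hypothetical stand-in for Parquet's RecordConsumer (which does expose addLong(), addBoolean(), and addDouble()), and ColumnWriter, ColumnType, and WriterSketch are names made up for this example. Object inspectors are left out to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Parquet's RecordConsumer (assumption, not the real class). */
interface ValueConsumer {
    void addLong(long value);
    void addBoolean(boolean value);
    void addDouble(double value);
}

/** Hypothetical per-column writer, resolved once before the first row is written. */
interface ColumnWriter {
    void write(Object value, ValueConsumer out);
}

final class WriterSketch {
    enum ColumnType { BIGINT, BOOLEAN, DOUBLE }

    /** The type dispatch happens here, once per column, instead of once per value. */
    static ColumnWriter[] createWriters(ColumnType[] schema) {
        ColumnWriter[] writers = new ColumnWriter[schema.length];
        for (int i = 0; i < schema.length; i++) {
            switch (schema[i]) {
                case BIGINT:
                    writers[i] = (v, out) -> out.addLong((Long) v);
                    break;
                case BOOLEAN:
                    writers[i] = (v, out) -> out.addBoolean((Boolean) v);
                    break;
                case DOUBLE:
                    writers[i] = (v, out) -> out.addDouble((Double) v);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + schema[i]);
            }
        }
        return writers;
    }

    /** Writing a row only walks the prebuilt writers; no type inspection is repeated. */
    static void writeRow(Object[] row, ColumnWriter[] writers, ValueConsumer out) {
        for (int i = 0; i < row.length; i++) {
            writers[i].write(row[i], out);
        }
    }

    /** Writes two rows and returns the sequence of consumer calls that were made. */
    static List<String> demo() {
        List<String> calls = new ArrayList<>();
        ValueConsumer out = new ValueConsumer() {
            public void addLong(long value) { calls.add("addLong(" + value + ")"); }
            public void addBoolean(boolean value) { calls.add("addBoolean(" + value + ")"); }
            public void addDouble(double value) { calls.add("addDouble(" + value + ")"); }
        };
        ColumnWriter[] writers = createWriters(new ColumnType[] {
            ColumnType.BIGINT, ColumnType.BOOLEAN, ColumnType.DOUBLE
        });
        writeRow(new Object[] {1L, true, 2.5}, writers, out);
        writeRow(new Object[] {2L, false, 0.5}, writers, out);
        return calls;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The writers array is built once and reused for every row, which is the property the description credits for the better write performance.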


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java c195c3ec3ddae19bf255fc2c9633f8bf4390f428 

Diff: https://reviews.apache.org/r/35950/diff/


Testing
-------

Tests from TestDataWritableWriter pass.

I also ran micro-benchmarks, and the new implementation produced better results:

Using repeated rows across the file, this is the throughput increase using 1 million records:

          bigint   boolean  double   float    int      string
before    7.598    7.491    7.488    7.588    7.53     0.270
after     10.137   11.511   10.155   10.297   10.242   0.286

Using random rows across the file, this is the throughput increase using 1 million records:

          bigint   boolean  double   float    int      string
before    5.268    7.723    4.107    4.173    4.729    0.20
after     6.236    10.466   5.944    4.749    5.234    0.22
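For readers who want the relative gain rather than the raw figures, the following small sketch computes (after / before - 1) * 100 for each type. The numbers are copied from the two tables above. The class name SpeedupSketch and the helper gainPercent() are made up for this example, and the throughput unit is not stated in the results, so only the unitless ratio is derived.

```java
final class SpeedupSketch {
    static final String[] TYPES = {"bigint", "boolean", "double", "float", "int", "string"};

    /** Relative gain in percent: (after / before - 1) * 100. */
    static double gainPercent(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    /** Prints one line per data type with its relative gain. */
    static void report(String label, double[] before, double[] after) {
        System.out.println(label);
        for (int i = 0; i < TYPES.length; i++) {
            System.out.printf("  %-8s %+.1f%%%n", TYPES[i], gainPercent(before[i], after[i]));
        }
    }

    public static void main(String[] args) {
        // Values copied from the benchmark tables in this message.
        report("repeated rows",
            new double[] {7.598, 7.491, 7.488, 7.588, 7.53, 0.270},
            new double[] {10.137, 11.511, 10.155, 10.297, 10.242, 0.286});
        report("random rows",
            new double[] {5.268, 7.723, 4.107, 4.173, 4.729, 0.20},
            new double[] {6.236, 10.466, 5.944, 4.749, 5.234, 0.22});
    }
}
```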


Thanks,

Sergio Pena

Re: Review Request 35950: HIVE-11131: Get row information on DataWritableWriter once for better writing performance

Posted by cheng xu <ch...@intel.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35950/#review90838
-----------------------------------------------------------

Ship it!



- cheng xu

