Posted to dev@hive.apache.org by Dong Chen <do...@intel.com> on 2015/02/02 03:27:57 UTC
Re: Review Request 30281: Move parquet serialize implementation to
DataWritableWriter to improve write speeds
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review70538
-----------------------------------------------------------
Sorry for the late review. The patch looks good, and I see there has already been a lot of great discussion. Thanks!
I left just one minor comment below.
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment115753>
This check seems to be a duplicate, since the same condition has already been checked before invoking writeMap(...)
- Dong Chen
On Jan. 29, 2015, 5:12 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30281/
> -----------------------------------------------------------
>
> (Updated Jan. 29, 2015, 5:12 p.m.)
>
>
> Review request for hive, Ryan Blue, cheng xu, and Dong Chen.
>
>
> Bugs: HIVE-9333
> https://issues.apache.org/jira/browse/HIVE-9333
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> This patch moves the ParquetHiveSerDe.serialize() implementation to the DataWritableWriter class, so that serialize() no longer spends time materializing data.
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15
> serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION
>
> Diff: https://reviews.apache.org/r/30281/diff/
>
>
> Testing
> -------
>
> The tests run were the following:
>
> 1. JMH (Java microbenchmark)
>
> This benchmark called the Parquet serialize/write methods using Text writable objects.
>
> Class.method                 Before Change (ops/s)   After Change (ops/s)   Change
> ------------------------------------------------------------------------------------------------
> ParquetHiveSerDe.serialize   19,113                  249,528                19x speed increase
> DataWritableWriter.write     5,033                   5,201                  3.34% speed increase
>
>
> 2. Write 20 million rows (~1GB file) from Text to Parquet
>
> I wrote a ~1 GB file in TextFile format, then converted it to Parquet format using the following
> statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;
>
> Time (s) it took to write the whole file BEFORE changes: 93.758 s
> Time (s) it took to write the whole file AFTER changes: 83.903 s
>
> This is roughly a 10% speed increase.
>
>
> Thanks,
>
> Sergio Pena
>
>
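The description quoted above can be illustrated with a minimal sketch. It assumes hypothetical stand-in types (`RowInspector`, `RecordingConsumer`, `DeferredRow`) rather than Hive's or Parquet's actual classes, and is not the code from the patch. The idea shown is that serialize() only wraps the raw row together with the object that knows how to read it, and the writer later walks the fields directly, so no intermediate copy of the row is built.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an object inspector: reads fields out of a raw row.
interface RowInspector {
    int fieldCount(Object row);
    Object field(Object row, int index);
}

// Hypothetical stand-in for the consumer that receives values at write time.
final class RecordingConsumer {
    final List<String> values = new ArrayList<>();

    void addValue(Object value) {
        values.add(String.valueOf(value));
    }
}

// Lightweight wrapper: holds references only, no field is copied.
final class DeferredRow {
    final Object row;
    final RowInspector inspector;

    DeferredRow(Object row, RowInspector inspector) {
        this.row = row;
        this.inspector = inspector;
    }
}

final class DeferredSerializeSketch {
    // Inspector for rows represented as Object[] (an assumption of this sketch).
    static RowInspector arrayInspector() {
        return new RowInspector() {
            @Override
            public int fieldCount(Object row) {
                return ((Object[]) row).length;
            }

            @Override
            public Object field(Object row, int index) {
                return ((Object[]) row)[index];
            }
        };
    }

    // Constant-time "serialize": wrap the row instead of materializing it.
    static DeferredRow serialize(Object row, RowInspector inspector) {
        return new DeferredRow(row, inspector);
    }

    // The writer reads each field through the inspector and emits it directly.
    static void write(DeferredRow record, RecordingConsumer consumer) {
        int count = record.inspector.fieldCount(record.row);
        for (int i = 0; i < count; i++) {
            consumer.addValue(record.inspector.field(record.row, i));
        }
    }

    public static void main(String[] args) {
        Object[] row = {1, "a", 2.5};
        RecordingConsumer consumer = new RecordingConsumer();
        write(serialize(row, arrayInspector()), consumer);
        System.out.println(consumer.values);
    }
}
```

In this sketch the per-row cost of serialize() is a single small allocation, and all field traversal happens once, in write(). That would be consistent with a large change in the serialize benchmark and a small change in the write benchmark, although the sketch itself measures nothing.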
Re: Review Request 30281: Move parquet serialize implementation to
DataWritableWriter to improve write speeds
Posted by Sergio Pena <se...@cloudera.com>.
> On Feb. 2, 2015, 2:27 a.m., Dong Chen wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java, line 215
> > <https://reviews.apache.org/r/30281/diff/4/?file=840163#file840163line215>
> >
> > This check seems to be a duplicate, since the same condition has already been checked before invoking writeMap(...)
Thanks Dong. I did not see this extra validation.
- Sergio