Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2015/02/11 17:21:23 UTC

Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/
-----------------------------------------------------------

(Updated Feb. 11, 2015, 4:21 p.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Summary (updated)
-----------------

HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds


Bugs: HIVE-9333
    https://issues.apache.org/jira/browse/HIVE-9333


Repository: hive-git


Description
-------

This patch moves the ParquetHiveSerDe.serialize() implementation to DataWritableWriter class in order to save time in materializing data on serialize().
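
As a rough illustration of the refactoring idea (hypothetical, simplified classes — not Hive's actual ParquetHiveSerDe/DataWritableWriter API): serialize() becomes a cheap wrapper around the row object, and the writer materializes field values in a single pass while writing, instead of copying every field into Writable objects up front.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for ParquetHiveRecord: holds the raw row instead of copied Writables.
final class ParquetRecord {
    final Object row;
    ParquetRecord(Object row) { this.row = row; }
}

final class SerDeSketch {
    // Before the patch: serialize() walked every field and copied it into
    // Writable objects. After: it is a constant-time wrap; the writer reads
    // the fields exactly once, while writing.
    ParquetRecord serialize(Object row) {
        return new ParquetRecord(row);
    }
}

final class WriterSketch {
    // The writer consumes the raw row in a single pass.
    long write(ParquetRecord record) {
        List<?> fields = (List<?>) record.row;
        long written = 0;
        for (Object f : fields) {
            if (f != null) {
                written++; // stand-in for emitting one Parquet value
            }
        }
        return written;
    }
}

public class SerializeSketch {
    public static void main(String[] args) {
        SerDeSketch serde = new SerDeSketch();
        WriterSketch writer = new WriterSketch();
        ParquetRecord rec = serde.serialize(Arrays.asList(1, "a", null));
        System.out.println(writer.write(rec)); // prints 2
    }
}
```

The point of the design is that the per-row copy disappears from serialize(), which is why the serialize benchmark below improves far more than the end-to-end write.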


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9caa4ed169ba92dbd863e4a2dc6d06ab226a4465 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 41b5f1c3b0ab43f734f8a211e3e03d5060c75434 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15 
  serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetWritable.java PRE-CREATION 

Diff: https://reviews.apache.org/r/30281/diff/


Testing
-------

The following tests were run:

1. JMH (Java microbenchmark)

This benchmark calls the Parquet serialize/write methods using Text writable objects.

Class.method                  Before Change (ops/s)      After Change (ops/s)       
-------------------------------------------------------------------------------
ParquetHiveSerDe.serialize:          19,113                   249,528   ->  ~13x speed increase
DataWritableWriter.write:             5,033                     5,201   ->  3.34% speed increase
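
For context, the ops/s figures above come from JMH; the loop below is only a rough stand-in sketch (my own, not the actual benchmark code) showing how such a throughput number is produced.

```java
public class MicroBench {
    static long sink = 0;

    // Stand-in for the method under test (e.g. a serialize() call).
    static void serializeStub() { sink++; }

    // Run the stub repeatedly for roughly durationNanos and report ops/s.
    static double measureOpsPerSec(long durationNanos) {
        long ops = 0;
        long start = System.nanoTime();
        while (System.nanoTime() - start < durationNanos) {
            serializeStub();
            ops++;
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return ops / seconds;
    }

    public static void main(String[] args) {
        // JMH additionally handles warm-up, forking, and dead-code
        // elimination; this loop only illustrates where ops/s comes from.
        System.out.printf("%.0f ops/s%n", measureOpsPerSec(1_000_000_000L));
    }
}
```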


2. Write 20 million rows (~1GB file) from Text to Parquet

I wrote a ~1 GB file in Textfile format, then converted it to Parquet using the following
statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;

Time (s) it took to write the whole file BEFORE changes: 93.758 s
Time (s) it took to write the whole file AFTER changes: 83.903 s

This is roughly a 10% speed increase.


Thanks,

Sergio Pena


Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

Posted by Ryan Blue <bl...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review72398
-----------------------------------------------------------

Ship it!

- Ryan Blue



Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

Posted by Sergio Pena <se...@cloudera.com>.

> On Feb. 11, 2015, 11:40 p.m., Ryan Blue wrote:
> >

Thanks Ryan for your comments.

I will make these changes in another JIRA, as this one was already merged. I did not add a comment on the JIRA because I was waiting for the merge.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review72053
-----------------------------------------------------------



Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

Posted by Ryan Blue <bl...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/#review72053
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117985>

    Will the keyValue ever be null? I don't think map implementations ever return nulls here.
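
For what it's worth, the standard JDK maps bear this out: entrySet() never yields a null Map.Entry, even when a key or value is null. A quick standalone check (unrelated to Hive's classes):

```java
import java.util.HashMap;
import java.util.Map;

public class MapEntryCheck {
    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("k", null);   // HashMap allows a null value...
        m.put(null, "v");   // ...and a null key
        for (Map.Entry<String, String> e : m.entrySet()) {
            // The Entry object itself is never null; only its key/value may be.
            System.out.println(e != null); // prints true for both entries
        }
    }
}
```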



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117995>

    Like the test for PARQUET-113 structure, this should test the old structure, using "bag" and "array_element" for the names.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117992>

    I think part of the reasoning behind this was to make sure if the Parquet code is passed a schema with required fields, it will correctly write out required fields rather than rejecting them. That might be the case in the future (if we add an underlying schema, for example), so I think this should continue to use required.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
<https://reviews.apache.org/r/30281/#comment117994>

    Part of the purpose is to validate that the write code accepts schemas with the new structure mandated by PARQUET-113. This would be a good place, so I think this should actually be "array" => "list" and "element" for elements. That way we're testing the new structure exactly.


- Ryan Blue



Re: Review Request 30281: HIVE-9333: Move parquet serialize implementation to DataWritableWriter to improve write speeds

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30281/
-----------------------------------------------------------

(Updated Feb. 11, 2015, 11:19 p.m.)


Review request for hive, Ryan Blue, cheng xu, and Dong Chen.


Changes
-------

Applied changes due to Ryan comments.


Bugs: HIVE-9333
    https://issues.apache.org/jira/browse/HIVE-9333


Repository: hive-git


Description
-------

This patch moves the ParquetHiveSerDe.serialize() implementation to DataWritableWriter class in order to save time in materializing data on serialize().


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java ea4109d358f7c48d1e2042e5da299475de4a0a29 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 9199127735533f9a324c5ef456786dda10766c46 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java 060b1b722d32f3b2f88304a1a73eb249e150294b 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java 1d83bf31a3dbcbaa68b3e75a72cec2ec67e7faa5 
  ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java e52c4bc0b869b3e60cb4bfa9e11a09a0d605ac28 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java a693aff18516d133abf0aae4847d3fe00b9f1c96 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestMapredParquetOutputFormat.java 667d3671547190d363107019cd9a2d105d26d336 
  ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetSerDe.java 007a665529857bcec612f638a157aa5043562a15 
  serde/src/java/org/apache/hadoop/hive/serde2/io/ParquetHiveRecord.java PRE-CREATION 

Diff: https://reviews.apache.org/r/30281/diff/


Testing
-------

The following tests were run:

1. JMH (Java microbenchmark)

This benchmark calls the Parquet serialize/write methods using Text writable objects.

Class.method                  Before Change (ops/s)      After Change (ops/s)       
-------------------------------------------------------------------------------
ParquetHiveSerDe.serialize:          19,113                   249,528   ->  ~13x speed increase
DataWritableWriter.write:             5,033                     5,201   ->  3.34% speed increase


2. Write 20 million rows (~1GB file) from Text to Parquet

I wrote a ~1 GB file in Textfile format, then converted it to Parquet using the following
statement: CREATE TABLE parquet STORED AS parquet AS SELECT * FROM text;

Time (s) it took to write the whole file BEFORE changes: 93.758 s
Time (s) it took to write the whole file AFTER changes: 83.903 s

This is roughly a 10% speed increase.


Thanks,

Sergio Pena