Posted to dev@hive.apache.org by Tim Armstrong <ti...@gmail.com> on 2011/08/12 00:44:48 UTC

Re: Review Request: rcfilecat 16x performance improvement

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------

(Updated 2011-08-11 22:44:48.620762)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Summary (updated)
-------

This patch improves rcfilecat throughput from 0.32 MB/s to 5.15 MB/s on one benchmark, a 16x speedup. I got there through a series of changes, listed here with the throughput measured after each:

Initial:
0.32 MB/s

Change System.out to use a bigger buffer (not line buffered):
1.7 MB/s

Unchecked Get:
1.75 MB/s

Use StringBuilder to construct each row before writing output:
3.7 MB/s

Streamline decoding:
4.16 MB/s

Use StringBuilder to buffer multiple lines:
5 MB/s

Tuning buffer sizes:
5.15 MB/s
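
For readers without the diff open, here is a minimal sketch of the two output-side steps above: replacing the line-buffered System.out with a large buffer, and accumulating several rows in a StringBuilder before each write. It is not the code from the patch. The class name, the buffer sizes, and the tab-separated row layout are assumptions made for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

class BufferedStdoutSketch {
  // Hypothetical sizes: the tuned values from the patch are not quoted in this thread.
  static final int STDOUT_BUFFER_BYTES = 1024 * 1024;
  static final int ROW_BATCH_CHARS = 8 * 1024;

  /** Wraps a raw stream in a large buffer with autoflush disabled. */
  static PrintStream buffered(OutputStream raw) throws UnsupportedEncodingException {
    return new PrintStream(new BufferedOutputStream(raw, STDOUT_BUFFER_BYTES), false, "UTF-8");
  }

  /** Builds each row in a StringBuilder and writes the builder out in batches. */
  static void writeRows(Iterable<String[]> rows, PrintStream out) {
    StringBuilder buf = new StringBuilder(ROW_BATCH_CHARS);
    for (String[] row : rows) {
      for (int i = 0; i < row.length; i++) {
        if (i > 0) {
          buf.append('\t'); // assumed column separator
        }
        buf.append(row[i]);
      }
      buf.append('\n');
      if (buf.length() >= ROW_BATCH_CHARS) {
        out.print(buf.toString()); // one write for many rows
        buf.setLength(0);
      }
    }
    out.print(buf.toString()); // write whatever is left
    out.flush();               // nothing is flushed automatically any more
  }

  public static void main(String[] args) throws Exception {
    System.setOut(buffered(new FileOutputStream(FileDescriptor.out)));
    writeRows(Arrays.asList(new String[] {"1", "alice"}, new String[] {"2", "bob"}), System.out);
  }
}
```

With autoflush off, the explicit flush at the end matters: without it, buffered output can be lost when the program exits.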


I also added a --verbose mode which writes progress updates to stderr.
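
A rough sketch of how such a progress reporter could look, for illustration only. The class name, the message text, and the reporting interval are hypothetical; only the --verbose flag and the use of stderr come from the description above.

```java
import java.io.PrintStream;
import java.util.Arrays;

class ProgressReporterSketch {
  // Hypothetical interval between progress lines.
  static final long REPORT_EVERY_BYTES = 1024 * 1024;

  private final boolean verbose;
  private final PrintStream err;
  private long totalBytes = 0;
  private long bytesSinceReport = 0;

  ProgressReporterSketch(boolean verbose, PrintStream err) {
    this.verbose = verbose;
    this.err = err;
  }

  /** Records that more bytes were processed and reports if the interval was reached. */
  void advance(long bytes) {
    totalBytes += bytes;
    bytesSinceReport += bytes;
    if (verbose && bytesSinceReport >= REPORT_EVERY_BYTES) {
      err.println("Read " + totalBytes + " bytes");
      bytesSinceReport = 0;
    }
  }

  public static void main(String[] args) {
    boolean verbose = Arrays.asList(args).contains("--verbose");
    ProgressReporterSketch progress = new ProgressReporterSketch(verbose, System.err);
    for (int i = 0; i < 5; i++) {
      progress.advance(512 * 1024); // stand-in for reading a chunk of the file
    }
  }
}
```

Writing progress to stderr keeps it out of the data stream, so the output on stdout can still be piped or compared with diff.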


This addresses bug HIVE-2370.
    https://issues.apache.org/jira/browse/HIVE-2370


Diffs
-----

  trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839 

Diff: https://reviews.apache.org/r/1474/diff


Testing
-------

Used diff to check that the output is the same as that of the old version of RCFileCat.
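
The check itself was done with diff. As an illustration only, the same byte-for-byte comparison could be scripted in Java as below. The class name and the idea of passing the two captured output files as arguments are assumptions, not part of the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class OutputDiffSketch {
  /** Returns true when both files hold exactly the same bytes. */
  static boolean sameBytes(Path a, Path b) throws IOException {
    // Reads both files fully into memory; fine for a small test file.
    return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
  }

  public static void main(String[] args) throws IOException {
    // args[0]: output captured from the old version, args[1]: output from the new one
    boolean same = sameBytes(Paths.get(args[0]), Paths.get(args[1]));
    System.out.println(same ? "outputs match" : "outputs differ");
  }
}
```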


Thanks,

Tim


Re: Review Request: rcfilecat 16x performance improvement

Posted by Tim Armstrong <ti...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------

(Updated 2011-08-12 03:56:39.298860)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

Turned magic numbers into named constants; output buffering is now enabled only after the arguments have been processed.


Diffs (updated)
-----

  trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839 

Diff: https://reviews.apache.org/r/1474/diff


Testing
-------

Used diff to check that the output is the same as that of the old version of RCFileCat.


Thanks,

Tim


Re: Review Request: rcfilecat 16x performance improvement

Posted by Carl Steinbach <ca...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/#review1414
-----------------------------------------------------------



trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3271>

    This should probably be done after we finish processing the command line options.



trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3269>

    "1024*1024" should be replaced with a static final variable.



trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3270>

    Another constant that should be converted to a static final.
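
To make the three comments above concrete, here is a small sketch of the shape being asked for: the buffer size lives in a static final constant, and System.out is replaced only after the command line has been parsed, so usage errors are reported before any buffering is set up. The class name, the Options holder, the constant name, and the usage text are all hypothetical; this is not the code under review.

```java
import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

class DeferredBufferingSketch {
  // Named constant instead of a bare "1024*1024" in the middle of the code.
  private static final int STDOUT_BUFFER_SIZE = 1024 * 1024;

  /** Hypothetical holder for the parsed options. */
  static final class Options {
    boolean verbose;
    String path;
  }

  /** Parses the arguments and throws IllegalArgumentException on bad usage. */
  static Options parse(String[] args) {
    Options opts = new Options();
    for (String arg : args) {
      if ("--verbose".equals(arg)) {
        opts.verbose = true;
      } else if (opts.path == null) {
        opts.path = arg;
      } else {
        throw new IllegalArgumentException("unexpected argument: " + arg);
      }
    }
    if (opts.path == null) {
      throw new IllegalArgumentException("usage: <tool> [--verbose] <path>");
    }
    return opts;
  }

  public static void main(String[] args) {
    Options opts;
    try {
      opts = parse(args);
    } catch (IllegalArgumentException e) {
      System.err.println(e.getMessage());
      return;
    }
    // Only now, with valid options in hand, swap in the large unflushed buffer.
    System.setOut(new PrintStream(
        new BufferedOutputStream(new FileOutputStream(FileDescriptor.out), STDOUT_BUFFER_SIZE),
        false));
    System.out.println("would read " + opts.path + (opts.verbose ? " (verbose)" : ""));
    System.out.flush();
  }
}
```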


- Carl




Re: Review Request: rcfilecat 16x performance improvement

Posted by Tim Armstrong <ti...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------

(Updated 2011-08-12 00:22:11.295461)


Review request for hive, Yongqiang He, Ning Zhang, and namit jain.


Changes
-------

Stripped the trailing whitespace left at the ends of lines in the old version.


Diffs (updated)
-----

  trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839 

Diff: https://reviews.apache.org/r/1474/diff


Testing
-------

Used diff to check that the output is the same as that of the old version of RCFileCat.


Thanks,

Tim


Re: Review Request: rcfilecat 16x performance improvement

Posted by Ning Zhang <nz...@fb.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/#review1412
-----------------------------------------------------------


Great job! Does this number measure the combined read and write speed, or only the read (including decompression) part?


trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3266>

    Can you remove all these tabs?



trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3267>

    Make 2048 a static final constant.


- Ning

