Posted to dev@hive.apache.org by Tim Armstrong <ti...@gmail.com> on 2011/08/12 00:43:38 UTC
Review Request: rcfilecat 16x performance improvement
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------
Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
Summary
-------
This patch greatly improves rcfilecat performance: on one benchmark, throughput rose from 0.32 MB/s to 5.15 MB/s, roughly a 16x speedup. The gain came from a series of changes, listed here with the throughput measured after each one:
Initial: 0.32 MB/s
Change System.out to use a bigger buffer (not line buffered): 1.7 MB/s
Unchecked get: 1.75 MB/s
Use StringBuilder to construct each row before writing output: 3.7 MB/s
Streamline decoding: 4.16 MB/s
Use StringBuilder to buffer multiple lines: 5 MB/s
Tune buffer sizes: 5.15 MB/s
I also added a --verbose mode, which writes progress updates to stderr.
This addresses bug HIVE-2370.
https://issues.apache.org/jira/browse/HIVE-2370
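For illustration only, here is a minimal sketch of two of the ideas in the summary: giving standard output a large buffer with autoflush off, and collecting several rows in a StringBuilder before handing them to the stream. This is not the code from the patch. The class name, method names, buffer sizes, and the tab/newline separators are all assumptions made for the example.

```java
import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

public class BufferedCatSketch {
  // Hypothetical sizes; the values tuned in the actual patch are not shown in this thread.
  static final int STDOUT_BUFFER_SIZE = 128 * 1024;
  static final int STRING_BUFFER_FLUSH_SIZE = 16 * 1024;

  /** Wraps a raw stream in a large buffer, with autoflush disabled. */
  static PrintStream bufferedPrintStream(OutputStream raw) {
    return new PrintStream(new BufferedOutputStream(raw, STDOUT_BUFFER_SIZE), false);
  }

  /** Builds rows in a StringBuilder and writes them out in batches. */
  static void writeRows(List<String[]> rows, PrintStream out) {
    StringBuilder buf = new StringBuilder(STRING_BUFFER_FLUSH_SIZE + 1024);
    for (String[] row : rows) {
      for (int i = 0; i < row.length; i++) {
        if (i > 0) {
          buf.append('\t'); // assumed column separator
        }
        buf.append(row[i]);
      }
      buf.append('\n'); // assumed row terminator
      // Hand the batch to the stream only once it is large enough.
      if (buf.length() >= STRING_BUFFER_FLUSH_SIZE) {
        out.append(buf);
        buf.setLength(0);
      }
    }
    out.append(buf); // write whatever is left over
    out.flush();     // nothing is flushed automatically, so do it explicitly
  }

  public static void main(String[] args) {
    PrintStream out = bufferedPrintStream(new FileOutputStream(FileDescriptor.out));
    writeRows(Arrays.asList(new String[] {"1", "a"}, new String[] {"2", "b"}), out);
  }
}
```

The explicit flush at the end matters in a sketch like this: with autoflush off, any output still sitting in the buffer is lost if the program exits without flushing.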
Diffs
-----
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839
Diff: https://reviews.apache.org/r/1474/diff
Testing
-------
Used diff to check output was same as old version of RCFileCat
Thanks,
Tim
Re: Review Request: rcfilecat 16x performance improvement
Posted by Tim Armstrong <ti...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------
(Updated 2011-08-12 03:56:39.298860)
Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
Changes
-------
Turned magic numbers into named constants; output buffering is now enabled only after the arguments have been processed.
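As a rough sketch of what these two changes can look like (not the actual patch), the example below keeps the buffer size in a named constant and swaps in the buffered System.out only after the command line has been parsed. Presumably the point of the ordering is that usage or error messages printed during option processing are not left sitting in a buffer. The class, the Options holder, the constant's value, and the positional path argument are hypothetical; only the --verbose flag comes from the summary.

```java
import java.io.BufferedOutputStream;
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class ArgsThenBufferSketch {
  // Named constant instead of an inline "1024*1024"-style literal (value is hypothetical).
  private static final int STDOUT_BUFFER_SIZE = 128 * 1024;

  /** Hypothetical holder for the parsed command line. */
  static final class Options {
    boolean verbose;
    String path;
  }

  /** Parses the arguments; throws IllegalArgumentException on bad input. */
  static Options parse(String[] args) {
    Options opts = new Options();
    for (String arg : args) {
      if ("--verbose".equals(arg)) {
        opts.verbose = true;
      } else if (opts.path == null) {
        opts.path = arg;
      } else {
        throw new IllegalArgumentException("unexpected argument: " + arg);
      }
    }
    if (opts.path == null) {
      throw new IllegalArgumentException("missing input path");
    }
    return opts;
  }

  public static void main(String[] args) {
    Options opts;
    try {
      opts = parse(args); // System.out is still the default stream here
    } catch (IllegalArgumentException e) {
      System.err.println(e.getMessage());
      System.exit(1);
      return;
    }
    // Only now replace System.out with a large, non-autoflushing buffer.
    System.setOut(new PrintStream(
        new BufferedOutputStream(new FileOutputStream(FileDescriptor.out), STDOUT_BUFFER_SIZE),
        false));
    if (opts.verbose) {
      System.err.println("reading " + opts.path); // progress goes to stderr
    }
    System.out.flush();
  }
}
```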
Diffs (updated)
-----
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839
Diff: https://reviews.apache.org/r/1474/diff
Testing
-------
Used diff to check output was same as old version of RCFileCat
Thanks,
Tim
Re: Review Request: rcfilecat 16x performance improvement
Posted by Carl Steinbach <ca...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/#review1414
-----------------------------------------------------------
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3271>
This should probably be done after we finish processing the command line options.
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3269>
"1024*1024" should be replaced with a static final variable.
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3270>
Another constant that should be converted to a static final.
- Carl
Re: Review Request: rcfilecat 16x performance improvement
Posted by Tim Armstrong <ti...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------
(Updated 2011-08-12 00:22:11.295461)
Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
Changes
-------
Stripped trailing whitespace carried over from the old version.
Diffs (updated)
-----
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839
Diff: https://reviews.apache.org/r/1474/diff
Testing
-------
Used diff to check output was same as old version of RCFileCat
Thanks,
Tim
Re: Review Request: rcfilecat 16x performance improvement
Posted by Ning Zhang <nz...@fb.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/#review1412
-----------------------------------------------------------
Great job! Does this number indicate the read and write speed or just the read (including decompression) part?
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3266>
Can you remove all these TABs?
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java
<https://reviews.apache.org/r/1474/#comment3267>
Make 2048 a static final constant.
- Ning
Re: Review Request: rcfilecat 16x performance improvement
Posted by Tim Armstrong <ti...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1474/
-----------------------------------------------------------
(Updated 2011-08-11 22:44:48.620762)
Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
Summary (updated)
-------
This patch improves rcfilecat performance enormously: throughput increased from 0.32MB/s to 5.15MB/s on one benchmark: 16x. There were a number of improvements I made to get to this performance:
Initial:
0.32 MB/s
Change System.out to use bigger buffer (not line buffered)
1.7MB/s
Unchecked Get:
1.75MB/s
Use StringBuilder to construct each row before writing output.
3.7MB/s
Streamline decoding:
4.16 MB/s
Use StringBuilder to buffer multiple lines:
5 MB/s
Tuning buffer sizes:
5.15 MB/s
I also added a --verbose mode which writes progress updates to stderr.
This addresses bug HIVE-2370.
https://issues.apache.org/jira/browse/HIVE-2370
Diffs
-----
trunk/cli/src/java/org/apache/hadoop/hive/cli/RCFileCat.java 1156839
Diff: https://reviews.apache.org/r/1474/diff
Testing
-------
Used diff to check output was same as old version of RCFileCat
Thanks,
Tim