Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2007/12/12 00:51:43 UTC
[jira] Updated: (HADOOP-2406) Micro-benchmark to measure read/write times through InputFormats
[ https://issues.apache.org/jira/browse/HADOOP-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-2406:
----------------------------------
Attachment: 2406-0.patch
> Micro-benchmark to measure read/write times through InputFormats
> ----------------------------------------------------------------
>
> Key: HADOOP-2406
> URL: https://issues.apache.org/jira/browse/HADOOP-2406
> Project: Hadoop
> Issue Type: Test
> Components: fs, test
> Reporter: Chris Douglas
> Assignee: Chris Douglas
> Fix For: 0.16.0
>
> Attachments: 2406-0.patch
>
>
> The attached test writes and reads X GB to and from the default filesystem through SequenceFileInputFormat and TextInputFormat, with LzoCodec, with GzipCodec, and without compression. SequenceFiles are tested with both block and record compression.
> The following results, using 10GB of data through RawLocalFileSystem with 5-word keys and 20-word values (as generated by RandomTextWriter with the same seed for each file), are pretty stable:
> Writes:
> || Format || Compression || Type || Time (sec) || Filesize (bytes) ||
> | SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
> | SEQ | LZO | RECORD | 367 | 11 689 969 413 |
> | SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
> | SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
> | SEQ | | | 201 | 11 282 745 683 |
> | TXT | LZO | | 742 | 12 671 065 769 |
> | TXT | ZIP | | 1320 | 2 597 397 680 |
> | TXT | | | 392 | 10 818 058 643 |
> Reads:
> || Format || Compression || Type || Time (sec) ||
> | SEQ | LZO | BLOCK | 150 |
> | SEQ | LZO | RECORD | 281 |
> | SEQ | ZIP | BLOCK | 155 |
> | SEQ | ZIP | RECORD | 548 |
> | SEQ | | | 209 |
> | TXT | LZO | | 620 |
> | TXT | ZIP | | 355 |
> | TXT | | | 284 |
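
The two tables are easier to compare once each run is normalized against the uncompressed run of the same format. The sketch below does that arithmetic in Python with the figures copied from the tables; the names RESULTS and derived are illustrative and not part of the patch, and using the uncompressed run as the baseline is an assumption made here for comparison.

```python
# Figures copied from the Writes and Reads tables above.
# Key: (format, compression, type); value: (write_sec, read_sec, size_bytes).
RESULTS = {
    ("SEQ", "LZO", "BLOCK"):  (318, 150, 8_604_288_397),
    ("SEQ", "LZO", "RECORD"): (367, 281, 11_689_969_413),
    ("SEQ", "ZIP", "BLOCK"):  (929, 155, 2_827_697_769),
    ("SEQ", "ZIP", "RECORD"): (1737, 548, 9_324_730_365),
    ("SEQ", "", ""):          (201, 209, 11_282_745_683),
    ("TXT", "LZO", ""):       (742, 620, 12_671_065_769),
    ("TXT", "ZIP", ""):       (1320, 355, 2_597_397_680),
    ("TXT", "", ""):          (392, 284, 10_818_058_643),
}


def derived(results):
    """Return size and time ratios relative to the uncompressed run of the same format."""
    rows = {}
    for (fmt, codec, ctype), (write_sec, read_sec, size) in results.items():
        base_write, base_read, base_size = results[(fmt, "", "")]
        rows[(fmt, codec, ctype)] = {
            "size_ratio": size / base_size,
            "write_ratio": write_sec / base_write,
            "read_ratio": read_sec / base_read,
        }
    return rows


if __name__ == "__main__":
    for key, row in derived(RESULTS).items():
        label = "/".join(part for part in key if part)
        print(f"{label:16s} size x{row['size_ratio']:.2f} "
              f"write x{row['write_ratio']:.2f} read x{row['read_ratio']:.2f}")
```

A size ratio above 1.0 means the compressed file is larger than its uncompressed counterpart.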
> Of note:
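> - LZO-compressed TextOutput is larger than the uncompressed output (12 671 065 769 vs. 10 818 058 643 bytes; HADOOP-2402), and lzop cannot read it.
> - Zip compression is expensive, especially on writes (929 sec for block and 1737 sec for record compression, vs. 201 sec uncompressed for SequenceFiles). Short values are responsible for the unimpressive compression of record-compressed SequenceFiles, since each value is compressed on its own.
> - TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect: uncompressed text writes take 392 sec vs. 201 sec for uncompressed SequenceFiles.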
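
The point about short values can be reproduced outside Hadoop. The following is a minimal sketch, not the benchmark itself: it uses Python's zlib (DEFLATE, the same algorithm family as GzipCodec, but without the SequenceFile framing), a made-up vocabulary in place of RandomTextWriter's word list, and a hypothetical batch size of 1000 values as a stand-in for a compression block.

```python
import random
import zlib

# Hypothetical vocabulary; the real benchmark draws words from RandomTextWriter.
WORDS = [
    "alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
    "india", "juliet", "kilo", "lima", "mike", "november", "oscar", "papa",
    "quebec", "romeo", "sierra", "tango", "uniform", "victor", "whiskey", "xray",
]


def make_values(count, words_per_value=20, seed=0):
    """Generate `count` short values of `words_per_value` random words each."""
    rng = random.Random(seed)
    return [
        " ".join(rng.choice(WORDS) for _ in range(words_per_value)).encode("ascii")
        for _ in range(count)
    ]


def record_compressed_size(values):
    """Total size when every value is compressed on its own."""
    return sum(len(zlib.compress(value)) for value in values)


def block_compressed_size(values, values_per_block=1000):
    """Total size when values are compressed together in batches."""
    total = 0
    for start in range(0, len(values), values_per_block):
        block = b"\n".join(values[start:start + values_per_block])
        total += len(zlib.compress(block))
    return total


if __name__ == "__main__":
    values = make_values(5000)
    raw = sum(len(value) for value in values)
    print("raw bytes:            ", raw)
    print("record-compressed:    ", record_compressed_size(values))
    print("block-compressed:     ", block_compressed_size(values))
```

Compressing each value separately pays the stream overhead once per value and gives the compressor no history from earlier values, which is consistent with the gap between the RECORD and BLOCK rows in the Writes table.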
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.