Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2007/12/12 00:51:43 UTC

[jira] Updated: (HADOOP-2406) Micro-benchmark to measure read/write times through InputFormats

     [ https://issues.apache.org/jira/browse/HADOOP-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-2406:
----------------------------------

    Attachment: 2406-0.patch

> Micro-benchmark to measure read/write times through InputFormats
> ----------------------------------------------------------------
>
>                 Key: HADOOP-2406
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2406
>             Project: Hadoop
>          Issue Type: Test
>          Components: fs, test
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>             Fix For: 0.16.0
>
>         Attachments: 2406-0.patch
>
>
> The attached test writes and reads X GB to and from the default filesystem through SequenceFileInputFormat and TextInputFormat, using LzoCodec, GzipCodec, and no compression. For SequenceFiles it covers both block and record compression.
> The following results, from running 10GB of data through RawLocalFileSystem with 5-word keys and 20-word values (as generated by RandomTextWriter with the same seed for each file), are pretty stable:
> Writes:
> || Format || Compression || Type || Time (sec) || Filesize (bytes) ||
> | SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
> | SEQ | LZO | RECORD | 367 | 11 689 969 413 |
> | SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
> | SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
> | SEQ |  |  | 201 | 11 282 745 683 |
> | TXT | LZO |  | 742 | 12 671 065 769 |
> | TXT | ZIP |  | 1320 | 2 597 397 680 |
> | TXT |  |  | 392 | 10 818 058 643 |
> Reads:
> || Format || Compression || Type || Time (sec) ||
> | SEQ | LZO | BLOCK | 150 |
> | SEQ | LZO | RECORD | 281 |
> | SEQ | ZIP | BLOCK | 155 |
> | SEQ | ZIP | RECORD | 548 |
> | SEQ |  |  | 209 |
> | TXT | LZO |  | 620 |
> | TXT | ZIP |  | 355 |
> | TXT |  |  | 284 |
> Of note:
> - LZO-compressed TextOutput is larger than the uncompressed output (HADOOP-2402); lzop cannot read it.
> - Zip (GzipCodec) compression is expensive. Short values are responsible for the unimpressive compression for record-compressed SequenceFiles.
> - TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect.
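
The observations above can be checked directly against the "Writes" table. The following is a minimal sketch, not part of the attached patch: it copies the reported times and file sizes into a dictionary and derives two ratios per row, each relative to the uncompressed run of the same format. The names `WRITES`, `size_ratio`, and `time_ratio` are hypothetical and chosen only for this illustration.

```python
# Write times (seconds) and output sizes (bytes), copied from the "Writes" table.
# Keys are (format, compression, type); empty strings mean "none".
WRITES = {
    ("SEQ", "LZO", "BLOCK"): (318, 8_604_288_397),
    ("SEQ", "LZO", "RECORD"): (367, 11_689_969_413),
    ("SEQ", "ZIP", "BLOCK"): (929, 2_827_697_769),
    ("SEQ", "ZIP", "RECORD"): (1737, 9_324_730_365),
    ("SEQ", "", ""): (201, 11_282_745_683),
    ("TXT", "LZO", ""): (742, 12_671_065_769),
    ("TXT", "ZIP", ""): (1320, 2_597_397_680),
    ("TXT", "", ""): (392, 10_818_058_643),
}


def size_ratio(fmt, codec, ctype):
    """Output size divided by the uncompressed output size for the same format."""
    _, size = WRITES[(fmt, codec, ctype)]
    _, baseline = WRITES[(fmt, "", "")]
    return size / baseline


def time_ratio(fmt, codec, ctype):
    """Write time divided by the uncompressed write time for the same format."""
    secs, _ = WRITES[(fmt, codec, ctype)]
    baseline, _ = WRITES[(fmt, "", "")]
    return secs / baseline


# Print one summary line per table row.
for key in WRITES:
    label = "/".join(part or "none" for part in key)
    print(f"{label:18s} size x{size_ratio(*key):.2f}  time x{time_ratio(*key):.2f}")
```

A size ratio above 1 means the "compressed" file came out larger than the uncompressed one, which is the case for TXT with LZO. A time ratio well above 1 for the ZIP rows reflects the cost of that codec on writes.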
