Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2016/06/17 15:40:05 UTC
[jira] [Updated] (HADOOP-13286) add a scale test to do gunzip and linecount

     [ https://issues.apache.org/jira/browse/HADOOP-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-13286:
------------------------------------
    Attachment: HADOOP-13286-branch-2-001.patch

Patch 001: streams the test data through the (presumably) non-native gz codec, then into LineReader. This simulates a mapper applied to a .csv.gz file.
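
For readers without the patch to hand, the shape of the pipeline can be sketched with JDK classes only. This is not the code in the patch: java.util.zip.GZIPInputStream stands in for the Hadoop gz codec, BufferedReader stands in for LineReader, and the class and method names (GzLineCount, countLines, gzip) are hypothetical, chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical stand-in for the test: gunzip a stream and count its lines. */
public class GzLineCount {

  /** Decompress the raw gzip stream and return the number of lines read. */
  static long countLines(InputStream raw) {
    long lines = 0;
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
      // Read line by line, as a mapper over a .csv.gz file would.
      while (reader.readLine() != null) {
        lines++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return lines;
  }

  /** Demo helper (an assumption, not part of any test suite): gzip text in memory. */
  static byte[] gzip(String text) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    byte[] data = gzip("a,1\nb,2\nc,3\n");
    long start = System.nanoTime();
    long lines = countLines(new ByteArrayInputStream(data));
    long elapsed = System.nanoTime() - start;
    System.out.println("Read " + lines + " lines from " + data.length
        + " raw bytes in " + elapsed + " nS");
  }
}
```

In the real test the input would be the S3A input stream rather than an in-memory buffer, so the readahead setting and remote read latency would dominate timings that this local sketch cannot reproduce.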

Timings:
{code}
testDecompression128K: Decompress with a 128K readahead

2016-06-17 16:30:42,408 [Thread-0] INFO  compress.CodecPool (CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.gz]
2016-06-17 16:30:47,345 [Thread-0] INFO  contract.ContractTestUtils (ContractTestUtils.java:end(1262)) - Duration of Time to read 514690 lines [99896260 bytes expanded, 22633778 raw] with readahead = 131072: 5,107,155,982 nS
2016-06-17 16:30:47,345 [Thread-0] INFO  scale.TestS3AInputStreamPerformance (TestS3AInputStreamPerformance.java:logTimePerIOP(144)) - Time per IOP: 9,922 nS
2016-06-17 16:30:47,346 [Thread-0] INFO  scale.TestS3AInputStreamPerformance (TestS3AInputStreamPerformance.java:logStreamStatistics(301)) - Stream Statistics
StreamStatistics{OpenOperations=1, CloseOperations=1, Closed=1, Aborted=0, SeekOperations=0, ReadExceptions=0, ForwardSeekOperations=0, BackwardSeekOperations=0, BytesSkippedOnSeek=0, BytesBackwardsOnSeek=0, BytesRead=22633778, BytesRead excluding skipped=22633778, ReadOperations=5708, ReadFullyOperations=0, ReadsIncomplete=243}
{code}

That is: roughly 10 microseconds/line (the 9,922 nS per IOP in the log); 5.1s for the entire 22.6MB file, which expands to 99.9MB on the way through the pipeline.
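
The derived figures can be rechecked from the numbers in the log output. A minimal sketch of that arithmetic follows; the class and method names (ScaleTestMetrics, nanosPerLine, expansionRatio, megabytesPerSecond) are hypothetical and are not taken from ContractTestUtils or the test itself. Only the four input constants come from the log.

```java
/** Hypothetical helper that recomputes the summary figures from the logged values. */
public class ScaleTestMetrics {

  // Values copied from the log output of testDecompression128K.
  static final long DURATION_NANOS = 5_107_155_982L;
  static final long LINES = 514_690L;
  static final long BYTES_EXPANDED = 99_896_260L;
  static final long BYTES_RAW = 22_633_778L;

  /** Average time per line, in whole nanoseconds (integer division). */
  static long nanosPerLine(long durationNanos, long lines) {
    return durationNanos / lines;
  }

  /** How many times larger the decompressed data is than the raw .gz data. */
  static double expansionRatio(long expandedBytes, long rawBytes) {
    return (double) expandedBytes / rawBytes;
  }

  /** Throughput in decimal megabytes (10^6 bytes) per second. */
  static double megabytesPerSecond(long bytes, long durationNanos) {
    return (bytes / 1e6) / (durationNanos / 1e9);
  }

  public static void main(String[] args) {
    System.out.println("nS per line:      " + nanosPerLine(DURATION_NANOS, LINES));
    System.out.println("expansion ratio:  " + expansionRatio(BYTES_EXPANDED, BYTES_RAW));
    System.out.println("raw MB/s:         " + megabytesPerSecond(BYTES_RAW, DURATION_NANOS));
    System.out.println("expanded MB/s:    " + megabytesPerSecond(BYTES_EXPANDED, DURATION_NANOS));
  }
}
```

The per-line value matches the "Time per IOP" line in the log, which suggests that each line read is counted as one operation in that metric.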

> add a scale test to do gunzip and linecount
> -------------------------------------------
>
>                 Key: HADOOP-13286
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13286
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>         Attachments: HADOOP-13286-branch-2-001.patch
>
>
> The HADOOP-13203 patch proposal showed that there were performance problems downstream which weren't surfacing in the current scale tests.
> Decompressing the .gz test file and then reading it through LineReader models a basic use case: parsing a .csv.gz data source.
> Add this test, with its metrics printed.


