Posted to common-dev@hadoop.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2008/09/02 01:06:44 UTC

[jira] Commented: (HADOOP-4010) Changing LineRecordReader algo so that it does not need to skip backwards in the stream

    [ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627549#action_12627549 ] 

Hadoop QA commented on HADOOP-4010:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389242/Hadoop-4010_version2.patch
  against trunk revision 690641.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3151/console

This message is automatically generated.

> Changing LineRecordReader algo so that it does not need to skip backwards in the stream
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4010
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-4010.patch, Hadoop-4010_version2.patch
>
>
> The current LineRecordReader algorithm has to move backwards in the stream (in its constructor) to position itself correctly.  It moves back one byte from the start of its split, tries to read a record (i.e. a line), and throws that record away, because it is sure that this line will be taken care of by some other mapper.  This algorithm is difficult and inefficient to use with a compressed stream, where data reaches the LineRecordReader via a codec. (In the current implementation, Hadoop does not split a compressed file; it makes a single split from the start to the end of the file, so only one mapper handles it.  We are currently working on a BZip2 codec that makes splitting possible in Hadoop, and the proposed change will make it possible to handle plain and compressed streams uniformly.)
> In the new algorithm, each mapper always skips its first line, because it is sure that this line will have been read by some other mapper.  Each mapper must therefore finish reading at a record boundary, which is always beyond its upper split limit.  With this change, LineRecordReader no longer needs to move backwards in the stream.
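
For illustration only, here is a minimal sketch of the forward-only scheme described above. It is not the code from the attached patch. The class and method names (SplitLineReaderSketch, readSplit, endOfLine) are hypothetical, the input is a plain in-memory byte array rather than a Hadoop stream, and lines are assumed to end in '\n'. It also assumes that the split starting at offset 0 does not skip its first line, since no earlier split exists to read it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: forward-only line reading over the split [start, end).
class SplitLineReaderSketch {

    // Returns the index of the next '\n' at or after pos, or data.length if none.
    public static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return pos;
    }

    // Returns the lines owned by the split [start, end) without ever seeking backwards.
    public static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Assumption: every split except the first discards its first
            // (possibly partial) line; the previous split reads it instead.
            pos = Math.min(endOfLine(data, pos) + 1, data.length);
        }
        List<String> lines = new ArrayList<>();
        // A line is owned by this split if it begins at or before the split end,
        // so the last line read may extend beyond the upper split limit.
        while (pos <= end && pos < data.length) {
            int eol = endOfLine(data, pos);
            lines.add(new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            pos = eol + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\nc\ndddd\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 5));   // [aa, bbb]
        System.out.println(readSplit(data, 5, 10));  // [c, dddd]
        System.out.println(readSplit(data, 10, 14)); // []
    }
}
```

In this sketch the boundary test is pos <= end rather than pos < end. A line that begins exactly at a split boundary is therefore read by the earlier split and skipped by the later one, so every line is read exactly once.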

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.