Posted to mapreduce-issues@hadoop.apache.org by "jay vyas (JIRA)" <ji...@apache.org> on 2013/10/08 00:43:42 UTC

[jira] [Updated] (MAPREDUCE-5572) Provide alternative logic for getPos() implementation in custom RecordReader of mapred implementation of MultiFileWordCount

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated MAPREDUCE-5572:
--------------------------------

    Description: 
The custom RecordReader class in MultiFileWordCount (MultiFileLineRecordReader) has been replaced in newer examples with a better implementation that uses CombineFileInputFormat and does not have this bug. However, the bug still exists in the 1.x versions of MultiFileWordCount, which rely on the mapred API.


The older MultiFileWordCount implementation defines getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when the underlying stream is null. But it is not guaranteed to work: the RawLocalFileSystem, for example, currently closes the underlying file stream once it is consumed, so currentStream itself is non-null and currentStream.getPos() throws a NullPointerException when it tries to access the null underlying stream.

This is only seen when the job runs under the MapTask class, which is only relevant to the mapred.* API and which calls getPos() twice in tandem, before and after reading a record.

This custom record reader should be guarded, or else eliminated, since it assumes something that is not in the FileSystem contract: that getPos() will always return an integral value.



  was:
The custom RecordReader class defines getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when the underlying stream is null. But it is not guaranteed to work: the RawLocalFileSystem, for example, currently closes the underlying file stream once it is consumed, so currentStream itself is non-null and currentStream.getPos() throws a NullPointerException when it tries to access the null underlying stream.

This is only seen when the job runs under the MapTask class, which is only relevant to the mapred.* API and which calls getPos() twice in tandem, before and after reading a record.

This custom record reader should be guarded, or else eliminated, since it assumes something that is not in the FileSystem contract: that getPos() will always return an integral value.

        Summary: Provide alternative logic for getPos() implementation in custom RecordReader of mapred implementation of MultiFileWordCount  (was: Provide alternative logic for getPos() implementation in custom RecordReader)

> Provide alternative logic for getPos() implementation in custom RecordReader of mapred implementation of MultiFileWordCount
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5572
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5572
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: examples
>    Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.3, 1.2.1, 1.2.2
>            Reporter: jay vyas
>            Priority: Minor
>
> The custom RecordReader class in MultiFileWordCount (MultiFileLineRecordReader) has been replaced in newer examples with a better implementation that uses CombineFileInputFormat and does not have this bug. However, the bug still exists in the 1.x versions of MultiFileWordCount, which rely on the mapred API.
> The older MultiFileWordCount implementation defines getPos() as follows:
> long currentOffset = currentStream == null ? 0 : currentStream.getPos();
> ...
> This is meant to prevent errors when the underlying stream is null. But it is not guaranteed to work: the RawLocalFileSystem, for example, currently closes the underlying file stream once it is consumed, so currentStream itself is non-null and currentStream.getPos() throws a NullPointerException when it tries to access the null underlying stream.
> This is only seen when the job runs under the MapTask class, which is only relevant to the mapred.* API and which calls getPos() twice in tandem, before and after reading a record.
> This custom record reader should be guarded, or else eliminated, since it assumes something that is not in the FileSystem contract: that getPos() will always return an integral value.
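
One way the guard described in the issue might look is sketched below. This is only an illustration, not the patch for this issue: PositionedStream and GuardedPositionTracker are hypothetical names standing in for the real Hadoop stream and record reader, and falling back to the last successfully read position (rather than 0) is an assumption about what behavior is wanted.

```java
import java.io.IOException;

/** Hypothetical stand-in for a stream that reports its position (not a Hadoop type). */
interface PositionedStream {
    long getPos() throws IOException;
}

/** Remembers the last position read successfully so getPos() has a fallback. */
class GuardedPositionTracker {
    private long lastKnownPos = 0L;

    long getPos(PositionedStream currentStream) {
        if (currentStream == null) {
            // Same case the original null check covers.
            return lastKnownPos;
        }
        try {
            lastKnownPos = currentStream.getPos();
        } catch (IOException | NullPointerException e) {
            // The wrapper is non-null but its underlying stream is gone:
            // keep the last position instead of propagating the failure.
        }
        return lastKnownPos;
    }
}

public class GuardedGetPosExample {
    public static void main(String[] args) {
        GuardedPositionTracker tracker = new GuardedPositionTracker();

        // A stream that is still open and reports offset 42.
        PositionedStream open = () -> 42L;
        // A stream whose underlying file stream was closed after being consumed.
        PositionedStream consumed = () -> {
            throw new NullPointerException("underlying stream is null");
        };

        System.out.println(tracker.getPos(open));     // prints 42
        System.out.println(tracker.getPos(consumed)); // prints 42
        System.out.println(tracker.getPos(null));     // prints 42
    }
}
```

The second call models MapTask asking for the position again after the record has been read and the stream consumed: the tracker returns the earlier offset instead of failing. Catching NullPointerException is a blunt workaround, so a real fix would more likely track the stream's open/closed state explicitly.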



--
This message was sent by Atlassian JIRA
(v6.1#6144)