Posted to mapreduce-issues@hadoop.apache.org by "jay vyas (JIRA)" <ji...@apache.org> on 2013/09/16 17:43:52 UTC

[jira] [Commented] (MAPREDUCE-5511) Multifilewc and the mapred.* API: Is the use of getPos() valid?

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768433#comment-13768433 ] 

jay vyas commented on MAPREDUCE-5511:
-------------------------------------

Another note: The newer mapreduce.* implementations of MultiFileWordCount, which don't provide a RecordReader.getPos() implementation, don't have this problem.

So this is really also a question of support for the MultiFileWordCount class.

With new filesystem implementations that MapReduce can run on top of, it is important to define the expected semantics of getPos() for FSInputStreams.


                
> Multifilewc and the mapred.* API:  Is the use of getPos() valid?
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-5511
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: examples
>            Reporter: jay vyas
>            Priority: Minor
>
> The MultiFileWordCount class in the hadoop examples libraries uses a record reader which switches between files.  This behaviour can cause the RawLocalFileSystem to break in a concurrent environment because of the way buffering works (in RawLocalFileSystem, switching between streams results in a temporarily "null" inner stream, and that inner stream is called by the getPos() implementation in the custom RecordReader for MultiFileWordCount).
> There are basically 2 ways to handle this:
> 1) Wrap the getPos() implementation in the object returned by open() in the RawLocalFileSystem to cache the value of getPos() every time it is called, so that calls to getPos() can return a valid long even if the underlying stream is null. OR
> 2) Update the RecordReader in multifilewc to not rely on the inner input stream and cache the position / return 0 if the stream cannot return a valid value. 
> The final question here is:  Is the RecordReader for MultiFileWordCount doing the right thing?  Or is it breaking the contract of getPos()... and really... what SHOULD getPos() return if the underlying stream has already been consumed?
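
For illustration only, the caching idea behind both options in the quoted description can be sketched roughly as follows. This is not the Hadoop code: CachedPositionSketch, CachingPosition and setInner are hypothetical names, and java.util.function.LongSupplier is used as a stand-in for the inner stream. The real getPos() methods in Hadoop declare IOException, which is left out here for brevity. Returning 0 before any position has been read is an assumption, matching the "return 0" fallback suggested in option 2.

```java
import java.util.function.LongSupplier;

public class CachedPositionSketch {

    /**
     * Hypothetical wrapper: remembers the last position reported by the
     * inner stream so getPos() still returns a valid long while the inner
     * stream is temporarily null (for example, while switching files).
     */
    public static final class CachingPosition {
        // Stand-in for the inner stream; may be null between files.
        private LongSupplier inner;
        // Last position successfully read; 0 until a stream has reported one (assumption).
        private long lastPos = 0L;

        /** Swap the inner stream; passing null models the "null inner stream" window. */
        public synchronized void setInner(LongSupplier stream) {
            this.inner = stream;
        }

        /** Return the live position if a stream is present, otherwise the cached one. */
        public synchronized long getPos() {
            if (inner != null) {
                lastPos = inner.getAsLong();
            }
            return lastPos;
        }
    }

    public static void main(String[] args) {
        CachingPosition pos = new CachingPosition();
        System.out.println(pos.getPos()); // no stream yet, falls back to the initial value
        pos.setInner(() -> 42L);
        System.out.println(pos.getPos()); // live value from the stand-in stream
        pos.setInner(null);
        System.out.println(pos.getPos()); // stream gone, cached value is returned
    }
}
```

Whether a cached (stale) position is acceptable is exactly the contract question raised above; the sketch only shows that the null case can be made safe, not what the right answer is.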

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira