Posted to mapreduce-issues@hadoop.apache.org by "jay vyas (JIRA)" <ji...@apache.org> on 2013/09/16 18:13:54 UTC

[jira] [Updated] (MAPREDUCE-5511) Multifilewc and the mapred.* API: Is the use of getPos() valid?

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jay vyas updated MAPREDUCE-5511:
--------------------------------

    Affects Version/s: 1.0.0
                       1.2.0
    
> Multifilewc and the mapred.* API:  Is the use of getPos() valid?
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-5511
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: examples
>    Affects Versions: 1.0.0, 1.2.0
>            Reporter: jay vyas
>            Priority: Minor
>
> The MultiFileWordCount class in the Hadoop examples library uses a record reader that switches between files. This behaviour can break RawLocalFileSystem in a concurrent environment because of the way buffering works: in RawLocalFileSystem, switching between streams leaves the inner stream temporarily null, and the getPos() implementation in MultiFileWordCount's custom RecordReader calls that inner stream.
> There are basically 2 ways to handle this:
> 1) Wrap the getPos() implementation in the object returned by open() in RawLocalFileSystem so that it caches the value of getPos() every time it is called. Calls to getPos() can then return a valid long even if the underlying stream is null. OR
> 2) Update the RecordReader in multifilewc so that it does not rely on the inner input stream: cache the position, and return the cached value (or 0) if the stream cannot return a valid one.
> The remaining question is: is the RecordReader for MultiFileWordCount doing the right thing, or is it breaking the contract of getPos()? And what SHOULD getPos() return if the underlying stream has already been consumed?
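
For illustration only, option 2 could be sketched roughly as below. This is not the MultiFileWordCount code and it does not use any Hadoop types: PositionedStream and CachedPositionTracker are hypothetical names invented for this sketch, and getPos() is declared without IOException purely to keep the example short. The idea is to remember the last offset the inner stream reported and to fall back to it (0 if nothing was ever reported) while the inner stream is null.

```java
/** Hypothetical stand-in for a stream that can report its offset (not a Hadoop type). */
interface PositionedStream {
    // Simplification: Hadoop's getPos() declares IOException; this sketch omits it.
    long getPos();
}

/**
 * Sketch of option 2: cache the last offset reported by the inner stream and
 * return the cached value while the inner stream is unavailable.
 */
final class CachedPositionTracker {
    private PositionedStream current; // null between files or after the stream is released
    private long lastKnownPos = 0L;   // assumption: 0 until a stream has reported an offset

    /** Switch to the next file's stream; pass null to model the gap between streams. */
    void setCurrent(PositionedStream next) {
        current = next;
    }

    /** Never dereferences a null stream; falls back to the cached offset instead. */
    long getPos() {
        if (current != null) {
            lastKnownPos = current.getPos();
        }
        return lastKnownPos;
    }
}
```

Note that the cache is refreshed only when getPos() is called while a stream is present, so the fallback value can lag behind the true offset. Whether a stale offset is acceptable is exactly the contract question raised above.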

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira