Posted to mapreduce-issues@hadoop.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2011/06/10 18:47:59 UTC

[jira] [Commented] (MAPREDUCE-2583) DistributedCache for M-R chains

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13047284#comment-13047284 ] 

Robert Joseph Evans commented on MAPREDUCE-2583:
------------------------------------------------

I am not really sure what you are getting at here.  I have tried this sort of thing before, writing small statistics files out-of-band and reading them back in the next map/reduce job through the distributed cache, but it did not turn out very well.  Even with the distributed cache, once you scale up to hundreds of mappers/reducers this approach puts a lot of load on the NameNode.  If it really is a requirement, it is best to post-process the files into a single, highly replicated file before passing it off to the next phase.

If you are turning this into a formal Map/Reduce feature, then you probably want to do this compaction in the cleanup task and put some sort of size limit on how much data can flow through this channel.
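
As a rough sketch of the compaction step described above, assuming the per-task statistics files are small and use the usual part-* naming: the code merges them into one file and refuses to do so past a size cap. It runs against a local directory as a stand-in for HDFS, and the names compact_stats and max_bytes and the 1 MB default are hypothetical, chosen only for illustration. On a real cluster the merged file would be written to HDFS and its replication raised (for example with FileSystem.setReplication) before being added to the next job's distributed cache.

```python
import tempfile
from pathlib import Path


def compact_stats(parts_dir: Path, merged_path: Path, max_bytes: int = 1 << 20) -> int:
    """Concatenate every part-* file in parts_dir into merged_path.

    Raises ValueError if the combined size exceeds max_bytes, so the
    out-of-band channel cannot grow into a second full-sized data set.
    Returns the number of bytes written.
    """
    # Sort so the merged file has a stable, reproducible order.
    parts = sorted(p for p in parts_dir.glob("part-*") if p.is_file())
    total = sum(p.stat().st_size for p in parts)
    if total > max_bytes:
        raise ValueError(f"side data is {total} bytes, limit is {max_bytes}")
    with merged_path.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return total


if __name__ == "__main__":
    # Hypothetical demo: two tasks each left a tiny statistics file behind.
    work = Path(tempfile.mkdtemp())
    (work / "part-00000").write_bytes(b"rows=120\n")
    (work / "part-00001").write_bytes(b"rows=95\n")
    written = compact_stats(work, work / "merged.stats")
    print(f"merged {written} bytes into {work / 'merged.stats'}")
```

With one merged file, the next phase's tasks each open a single path instead of one path per upstream task, which is the point of doing the compaction once in the cleanup task.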

> DistributedCache for M-R chains
> -------------------------------
>
>                 Key: MAPREDUCE-2583
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2583
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mitch McCuiston
>
> Currently the DistributedCache appears to be created at the granularity of a job.  In the case of an M-R chain, it is sometimes useful to share information out-of-band (as small files in HDFS) with each task in the chain.  For instance, the first M-R phase within a two-phase M-R chain might produce useful statistics that could be used to configure the second phase.
