Posted to mapreduce-issues@hadoop.apache.org by "Greg Roelofs (JIRA)" <ji...@apache.org> on 2011/03/08 17:06:00 UTC

[jira] Updated: (MAPREDUCE-1220) Implement an in-cluster LocalJobRunner

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated MAPREDUCE-1220:
------------------------------------

    Attachment: MR-1220.v1.trunk-hadoop-common.Progress-dumper.patch.txt

Not sure if this is worthy of its own HADOOP-xxx issue, but it was useful while debugging UberTask's 3-level Progress/phase tree. (Progress needs more help than this, but that's a topic for another day.)

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Greg Roelofs
>         Attachments: MAPREDUCE-1220_yhadoop20.patch, MR-1220.v1.trunk-hadoop-common.Progress-dumper.patch.txt, MR-1220.v2.trunk-hadoop-mapreduce.patch.txt, MR-1220.v2.trunk-hadoop-mapreduce.patch.txt
>
>
> Currently, very small map-reduce jobs suffer from latency issues due to overheads in Hadoop Map-Reduce such as scheduling, JVM startup, etc. We've periodically tried to optimize all parts of the framework to achieve lower latencies.
> I'd like to turn the problem around a little bit. I propose we allow very small jobs to run as a single-task job with multiple maps and reduces, i.e. similar to our current implementation of the LocalJobRunner. Thus, under certain conditions (maybe user-set configuration, or if input data is small, i.e. less than a DFS blocksize) we could launch a special task which will run all maps in a serial manner, followed by the reduces. This would really help small jobs achieve significantly smaller latencies, thanks to reduced scheduling overhead and JVM startup, no shuffle over the network, etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig queries.
> Thoughts?
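
For readers skimming the archive, the single-task idea in the quoted description can be sketched roughly as below. This is only an illustration in plain Java, not the attached patch and not Hadoop API: the class name SingleTaskJobSketch, the methods runAsSingleTask and runSerially, and the word-count workload are all hypothetical, and the eligibility rule simply mirrors the "user-set configuration, or input smaller than a DFS blocksize" suggestion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from Hadoop or from the patch.
public class SingleTaskJobSketch {

    // Assumed eligibility rule: explicit opt-in, or input smaller than one DFS block.
    static boolean runAsSingleTask(boolean userEnabled, long inputBytes, long dfsBlockSizeBytes) {
        return userEnabled || inputBytes < dfsBlockSizeBytes;
    }

    // Runs every "map" serially in this JVM, groups values by key in memory
    // (standing in for the network shuffle), then runs the "reduces".
    static Map<String, Integer> runSerially(List<List<String>> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();

        // Map phase: one input split after another, no per-map task launch.
        for (List<String> split : splits) {
            for (String line : split) {
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }

        // Reduce phase: starts only after all maps have finished.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // illustrative block size, not a claim about any cluster
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a b a"),
                Arrays.asList("b c"));

        if (runAsSingleTask(false, 9L, blockSize)) {
            System.out.println(runSerially(splits)); // prints {a=2, b=2, c=1}
        }
    }
}
```

The point of the sketch is the shape of the control flow: one eligibility check up front, then maps and reduces executed back to back in a single process, so the per-task scheduling and JVM startup costs are paid once.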

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira