Posted to mapreduce-issues@hadoop.apache.org by "Li Junjun (JIRA)" <ji...@apache.org> on 2013/02/17 08:53:13 UTC
[jira] [Created] (MAPREDUCE-5010) use multithreading to speed up
Merger and try MapPartitionsCompleteEvent to schedule fetch in reduce
Li Junjun created MAPREDUCE-5010:
------------------------------------
Summary: use multithreading to speed up Merger and try MapPartitionsCompleteEvent to schedule fetch in reduce
Key: MAPREDUCE-5010
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5010
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: mrv1
Affects Versions: 1.0.1
Reporter: Li Junjun
Assignee: Todd Lipcon
Use multithreading to speed up the Merger, and try a MapPartitionsCompleteEvent to schedule fetches in reduce.
This is aimed at multicore CPUs; the performance gain will depend on your hardware and configuration.
In the map task:
[code]
for (int parts = 0; parts < partitions; parts++) {
    // merge this partition and append it to the final output file (file.out)
}
[/code]
this loop uses only one thread.
So I think we could use more threads (conf: mapred.map.mergerthreads) to do the merge, if the machine has many cores or CPUs.
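A rough sketch of what I mean, not a patch against MapTask. It assumes each partition can be merged independently, and that the merged partitions must still be appended to file.out in partition order. The class, the in-memory "spill" model and mergePartition() are made up for illustration; sorting the concatenated records only stands in for the real k-way merge, and a real version would write to temporary segments instead of buffering whole partitions in memory. mergerThreads is the value that would come from the proposed mapred.map.mergerthreads.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelPartitionMergeSketch {

    // Hypothetical stand-in for merging every spill segment of one partition.
    // spills.get(s).get(p) holds the sorted records of partition p in spill s.
    static byte[] mergePartition(List<List<List<String>>> spills, int part) {
        List<String> merged = new ArrayList<>();
        for (List<List<String>> spill : spills) {
            merged.addAll(spill.get(part));
        }
        Collections.sort(merged); // placeholder for the real k-way merge
        StringBuilder sb = new StringBuilder();
        for (String rec : merged) {
            sb.append(rec).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Merge partitions on a fixed-size pool, then append the results to the
    // final output strictly in partition order.
    static void mergeAll(List<List<List<String>>> spills, int partitions,
                         int mergerThreads, OutputStream finalOut)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(mergerThreads);
        try {
            List<Future<byte[]>> results = new ArrayList<>();
            for (int parts = 0; parts < partitions; parts++) {
                final int part = parts;
                Callable<byte[]> task = () -> mergePartition(spills, part);
                results.add(pool.submit(task));
            }
            // Futures are consumed in submission order, so partition 0 is
            // written first even if a later partition finished earlier.
            for (Future<byte[]> f : results) {
                finalOut.write(f.get());
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two spills, two partitions each (made-up data).
        List<List<List<String>>> spills = Arrays.asList(
                Arrays.asList(Arrays.asList("b", "d"), Arrays.asList("x")),
                Arrays.asList(Arrays.asList("a", "c"), Arrays.asList("w", "y")));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        mergeAll(spills, 2, 3, out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The append step stays single-threaded on purpose: only the merge work is parallel, so the layout of file.out and its index would not change.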
Currently, the reduce tasks fetch a map's output only after that map task completes, which means that
when map x completes, all the reduces fetch its output at the same time, even though we use
[code]
// Randomize the map output locations to prevent
// all reduce-tasks swamping the same tasktracker
List<String> hostList = new ArrayList<String>();
hostList.addAll(mapLocations.keySet());
Collections.shuffle(hostList, this.random);
[/code]
in the reduce task.
For example, 100 reduces wait for 2 maps to complete, because the cluster's map task capacity is 98 but the job has
100 map tasks.
So I think: while the threads are merging (for example, a map has 8 partitions and uses 3 threads for the merge),
as soon as one thread completes a partition we can inform the reduce to fetch that partition's file immediately,
or we can wait until 3 partitions are complete and then send the event (conf: mapred.map.parts.inform) to reduce the load on the JT,
rather than waiting for the whole map task to complete. Doing this would more effectively prevent all reduce tasks
from swamping the same tasktracker.
Is this acceptable?
Are there any other good ideas?
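A rough sketch of the batching idea, under my own assumptions. PartitionBatchNotifier and its sender callback are hypothetical names; the callback stands in for whatever would actually deliver the proposed MapPartitionsCompleteEvent to the JT. informEvery is the value that would come from the proposed mapred.map.parts.inform. Merger threads call partitionMerged() when they finish a partition, and a leftover batch is flushed once the last partition is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class PartitionBatchNotifier {
    private final int totalPartitions;
    private final int informEvery;                 // proposed mapred.map.parts.inform
    private final Consumer<List<Integer>> sender;  // hypothetical event sender
    private final List<Integer> pending = new ArrayList<>();
    private int completed = 0;

    PartitionBatchNotifier(int totalPartitions, int informEvery,
                           Consumer<List<Integer>> sender) {
        if (informEvery < 1) {
            throw new IllegalArgumentException("informEvery must be >= 1");
        }
        this.totalPartitions = totalPartitions;
        this.informEvery = informEvery;
        this.sender = sender;
    }

    // Called by a merger thread when one partition has been fully merged.
    // synchronized so that concurrent merger threads do not corrupt the batch.
    synchronized void partitionMerged(int part) {
        pending.add(part);
        completed++;
        boolean batchFull = pending.size() >= informEvery;
        boolean lastPartition = completed == totalPartitions;
        if (batchFull || lastPartition) {
            sender.accept(new ArrayList<>(pending)); // send a copy of the batch
            pending.clear();
        }
    }

    public static void main(String[] args) {
        // 8 partitions, one event per 3 completed partitions.
        PartitionBatchNotifier notifier = new PartitionBatchNotifier(
                8, 3, parts -> System.out.println("inform reduces: " + parts));
        for (int p = 0; p < 8; p++) {
            notifier.partitionMerged(p);
        }
    }
}
```

With 8 partitions and informEvery = 3 this sends three events instead of eight: two full batches and a final one for the remaining two partitions. Setting informEvery = 1 gives the "inform immediately" variant.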