Posted to mapreduce-issues@hadoop.apache.org by "Li Junjun (JIRA)" <ji...@apache.org> on 2013/02/17 08:53:13 UTC

[jira] [Created] (MAPREDUCE-5010) use multithreading to speed up Merger and try MapPartitionsCompleteEvent to schedule fetch in reduce

Li Junjun created MAPREDUCE-5010:
------------------------------------

             Summary: use multithreading to speed up Merger and try MapPartitionsCompleteEvent to schedule fetch in reduce 
                 Key: MAPREDUCE-5010
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5010
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: mrv1
    Affects Versions: 1.0.1
            Reporter: Li Junjun
            Assignee: Todd Lipcon


use multithreading to speed up Merger and try MapPartitionsCompleteEvent to schedule fetch in reduce 


This is for multicore CPUs; the performance gain will depend on your hardware and configuration.

In the map task:
[code]
for (int parts = 0; parts < partitions; parts++) {
	// merge this partition's segments and append the result to the final output file (file.out)
}
[/code]
the merge uses only one thread.
So I think we can use more threads (conf: mapred.map.mergerthreads) to do the merge, if you have many cores or CPUs.
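
A minimal sketch of the idea, not a patch against MapTask: each partition is merged on a thread pool and the results are collected in partition order, assuming file.out must still hold the partitions in order. The class name, the method names, and the use of sorted int arrays in place of spill segments are all hypothetical; mergerThreads stands for the proposed mapred.map.mergerthreads value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPartitionMerge {

    // Hypothetical stand-in for merging the spill segments of one partition:
    // a k-way merge of already sorted int arrays.
    static int[] mergePartition(List<int[]> segments) {
        int total = 0;
        for (int[] s : segments) {
            total += s.length;
        }
        // Heap entries are {value, segmentIndex, positionInSegment}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).length > 0) {
                heap.add(new int[] {segments.get(i)[0], i, 0});
            }
        }
        int[] merged = new int[total];
        int n = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged[n++] = top[0];
            int[] seg = segments.get(top[1]);
            int next = top[2] + 1;
            if (next < seg.length) {
                heap.add(new int[] {seg[next], top[1], next});
            }
        }
        return merged;
    }

    // Merges every partition on a fixed-size pool. Results are returned in
    // partition order, so a caller could append them to one file in order.
    static List<int[]> mergeAll(List<List<int[]>> partitions, int mergerThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(mergerThreads);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (List<int[]> segments : partitions) {
                futures.add(pool.submit(() -> mergePartition(segments)));
            }
            List<int[]> results = new ArrayList<>();
            for (Future<int[]> f : futures) {
                results.add(f.get()); // blocks until that partition is merged
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("partition merge failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the output layout independent of which thread finishes first; only the merge work itself runs in parallel.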


Before, the reduce tasks fetch a map's output only after the whole map task completes. That means
when map x completes, all the reduces fetch its output at the same time, even though we use
[code]   
   // Randomize the map output locations to prevent 
   // all reduce-tasks swamping the same tasktracker
   List<String> hostList = new ArrayList<String>();
   hostList.addAll(mapLocations.keySet());       
   Collections.shuffle(hostList, this.random);
[/code]
in the reduce task.
For example, 100 reduces wait for 2 maps to complete, because the cluster's map task capacity is 98 but the job has
100 map tasks.


So I think: while the threads are merging, for example if a map has 8 partitions and uses 3 threads to do the merge,
then as soon as one of the threads completes one partition we can inform the reduce to fetch that partition file immediately,
instead of waiting for the whole map task to complete.
Or, to reduce the JobTracker's load, we can wait until 3 partitions complete and then send the event (conf: mapred.map.parts.inform).
Doing this will more effectively prevent all reduce tasks from swamping the same tasktracker.
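
The batching rule described above can be sketched as follows. This is only an illustration under assumptions: PartitionEventBatcher and its methods are hypothetical names, batchSize stands for the proposed mapred.map.parts.inform value, and "sending" an event is modelled by recording the list of partition numbers it would carry.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionEventBatcher {
    private final int totalPartitions;
    private final int batchSize; // proposed mapred.map.parts.inform value
    private final List<Integer> pending = new ArrayList<>();
    private final List<List<Integer>> sentEvents = new ArrayList<>();
    private int completed = 0;

    public PartitionEventBatcher(int totalPartitions, int batchSize) {
        this.totalPartitions = totalPartitions;
        this.batchSize = batchSize;
    }

    // Called by a merger thread when it finishes one partition.
    // An event goes out once batchSize partitions are pending, or when the
    // last partition of the map completes so that no remainder is left behind.
    public synchronized void partitionMerged(int partition) {
        pending.add(partition);
        completed++;
        if (pending.size() >= batchSize || completed == totalPartitions) {
            sentEvents.add(new ArrayList<>(pending));
            pending.clear();
        }
    }

    // Returns a copy of the events recorded so far, in the order they were sent.
    public synchronized List<List<Integer>> sentEvents() {
        return new ArrayList<>(sentEvents);
    }
}
```

With 8 partitions and a batch size of 3, completing partitions 0 through 7 in order yields three events carrying partitions 0-2, 3-5, and 6-7; a batch size of 1 gives the "inform immediately" behaviour.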



Is this acceptable?
Are there other good ideas?


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira