Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2008/10/29 01:08:44 UTC

[jira] Commented: (HADOOP-2560) Processing multiple input splits per mapper task

    [ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643381#action_12643381 ] 

dhruba borthakur commented on HADOOP-2560:
------------------------------------------

The resources needed in the Map phase typically depend on the amount of data it processes. Instead of specifying a grouping value of N, can we say that grouping will occur based on the size of the input data? The JT can then group as many mappers as needed to make the total input data come close to the specified value.
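A minimal sketch of this size-based grouping idea: instead of grouping a fixed count N of splits, pack split sizes greedily until each group's total approaches a target byte count. The class and method names below are purely illustrative, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeBasedGrouper {

    /** Greedily packs split sizes into groups whose totals stay near targetBytes. */
    public static List<List<Long>> group(long[] splitSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : splitSizes) {
            // Close the current group once adding this split would exceed the target.
            if (currentBytes > 0 && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Four 64 MB blocks with a 128 MB target collapse into two groups of two.
        long mb = 1024L * 1024L;
        long[] sizes = {64 * mb, 64 * mb, 64 * mb, 64 * mb};
        List<List<Long>> groups = group(sizes, 128 * mb);
        System.out.println(groups.size());  // 2
    }
}
```

A real implementation in the JobTracker would operate on the splits' reported lengths rather than a bare array of sizes, but the packing logic would be the same.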

> Processing multiple input splits per mapper task
> ------------------------------------------------
>
>                 Key: HADOOP-2560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2560
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>
> Currently, an input split contains a consecutive chunk of an input file, which, by default, corresponds to a DFS block.
> This may lead to a large number of mapper tasks if the input data is large, which causes the following problems:
> 1. Shuffling cost: since the framework has to move M * R map output segments to the nodes running reducers,
> a larger M means a larger shuffling cost.
> 2. High JVM initialization overhead
> 3. Disk fragmentation: larger number of map output files means lower read throughput for accessing them.
> Ideally, you want to keep the number of mappers to no more than 16 times the number of nodes in the cluster.
> To achieve that, we can increase the input split size. However, if a split spans more than one DFS block,
> you lose the data-locality scheduling benefits.
> One way to address this problem is to combine multiple input blocks on the same rack into one split.
> If, on average, we combine B blocks into one split, then we reduce the number of mappers by a factor of B.
> Since all the blocks for one mapper share a rack, we can still benefit from rack-aware scheduling.
> Thoughts?
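The rack-aware combining described above can be sketched as follows: bucket the blocks by rack, then combine up to B consecutive blocks from each rack's bucket into one split, so every combined split still schedules against a single rack. The Block record and rack ids here are hypothetical placeholders, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RackAwareCombiner {

    record Block(String id, String rack) {}

    /** Combines up to blocksPerSplit same-rack blocks into each split. */
    public static List<List<Block>> combine(List<Block> blocks, int blocksPerSplit) {
        // Bucket the blocks by rack, preserving the input order.
        Map<String, List<Block>> byRack = new LinkedHashMap<>();
        for (Block b : blocks) {
            byRack.computeIfAbsent(b.rack(), r -> new ArrayList<>()).add(b);
        }
        // Chunk each rack's bucket into splits of at most blocksPerSplit blocks.
        List<List<Block>> splits = new ArrayList<>();
        for (List<Block> rackBlocks : byRack.values()) {
            for (int i = 0; i < rackBlocks.size(); i += blocksPerSplit) {
                int end = Math.min(i + blocksPerSplit, rackBlocks.size());
                splits.add(new ArrayList<>(rackBlocks.subList(i, end)));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
                new Block("b1", "rack1"), new Block("b2", "rack2"),
                new Block("b3", "rack1"), new Block("b4", "rack2"));
        // B = 2: four blocks spread over two racks collapse into two splits.
        List<List<Block>> splits = combine(blocks, 2);
        System.out.println(splits.size());  // 2
    }
}
```

With B = 2 the mapper count halves while every split remains rack-local, which is the trade-off the issue proposes.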

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.