Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2008/10/30 18:34:44 UTC

[jira] Updated: (HADOOP-2560) Processing multiple input splits per mapper task

     [ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-2560:
-------------------------------------

    Attachment: multipleSplitsPerMapper.patch

Here is a sample patch that combines N splits to be executed by the same mapper instance. The logic is like this: the normal logic is used to find the first split to assign to a mapper; then, if there exist other splits that are rack-local to that TaskTracker, N-1 of them are selected and assigned to the same mapper. N is configurable per job.

This code still needs more testing; I am posting it early just as a reference.
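To make the selection logic above concrete, here is a minimal, self-contained sketch of the idea. This is illustrative code written for this note, not an excerpt from multipleSplitsPerMapper.patch; the PendingSplit type and the simplified "normal" first pick are assumptions made for the example.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Illustrative sketch only: pick one split the usual way, then pull up to
 * N-1 additional splits that live on the same rack, so a single mapper
 * instance processes all of them.
 */
public class RackLocalSplitGrouper {

  /** Minimal stand-in for an input split plus its rack location. */
  static class PendingSplit {
    final String path;
    final String rack;
    PendingSplit(String path, String rack) { this.path = path; this.rack = rack; }
    public String toString() { return path + "@" + rack; }
  }

  /**
   * Selects up to maxSplitsPerMapper splits for one mapper: the first pending
   * split (the "normally" chosen one) plus rack-local companions. Selected
   * splits are removed from the pending list.
   */
  static List<PendingSplit> selectForMapper(List<PendingSplit> pending, int maxSplitsPerMapper) {
    List<PendingSplit> chosen = new ArrayList<PendingSplit>();
    if (pending.isEmpty()) {
      return chosen;
    }
    // Step 1: the first split is picked by the usual scheduling logic
    // (simplified here to "first in the pending list").
    PendingSplit first = pending.remove(0);
    chosen.add(first);

    // Step 2: add up to N-1 more splits that are rack-local to the first one.
    Iterator<PendingSplit> it = pending.iterator();
    while (it.hasNext() && chosen.size() < maxSplitsPerMapper) {
      PendingSplit candidate = it.next();
      if (candidate.rack.equals(first.rack)) {
        chosen.add(candidate);
        it.remove();
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    List<PendingSplit> pending = new ArrayList<PendingSplit>();
    pending.add(new PendingSplit("part-0000", "/rack1"));
    pending.add(new PendingSplit("part-0001", "/rack2"));
    pending.add(new PendingSplit("part-0002", "/rack1"));
    pending.add(new PendingSplit("part-0003", "/rack1"));

    // N = 3 splits per mapper (in the real patch this would come from a
    // per-job configuration setting).
    System.out.println("Mapper 1 gets: " + selectForMapper(pending, 3));
    System.out.println("Mapper 2 gets: " + selectForMapper(pending, 3));
  }
}

In the actual patch, the first split would presumably be chosen by the JobTracker's existing locality-aware scheduling rather than "first in the list", and N would be read from a job-level configuration property whose name is not given in this message.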

> Processing multiple input splits per mapper task
> ------------------------------------------------
>
>                 Key: HADOOP-2560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2560
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>         Attachments: multipleSplitsPerMapper.patch
>
>
> Currently, an input split contains a consecutive chunk of an input file, which, by default, corresponds to a DFS block.
> This may lead to a large number of mapper tasks if the input data is large, which causes the following problems:
> 1. Shuffling cost: since the framework has to move M * R map output segments to the nodes running reducers,
> a larger M means a larger shuffling cost.
> 2. High JVM initialization overhead: each map task pays the cost of starting its own task JVM.
> 3. Disk fragmentation: a larger number of map output files means lower read throughput when accessing them.
> Ideally, you want to keep the number of mappers to no more than 16 times the number of nodes in the cluster.
> To achieve that, we can increase the input split size. However, if a split spans more than one DFS block,
> you lose the data-locality scheduling benefits.
> One way to address this problem is to combine multiple input blocks from the same rack into one split.
> If on average we combine B blocks into one split, we reduce the number of mappers by a factor of B.
> Since all the blocks for one mapper share a rack, we can still benefit from rack-aware scheduling.
> Thoughts?
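To make the M * R shuffle arithmetic and the factor-of-B reduction above concrete, here is a purely illustrative calculation (the numbers are made up for this example, not taken from the issue):

    M = 100,000 maps, R = 500 reduces      ->  M * R = 50,000,000 map output segments to shuffle
    combine B = 8 rack-local blocks/split  ->  M' = M / B = 12,500 maps, M' * R = 6,250,000 segments

Every block in a combined split still shares a rack, so rack-aware scheduling is preserved even though node-local placement may be lost for some of the blocks.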
