Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2008/04/21 16:51:22 UTC
[jira] Commented: (HADOOP-3285) map tasks with node local splits do not always read from local nodes
[ https://issues.apache.org/jira/browse/HADOOP-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590939#action_12590939 ]
Amar Kamat commented on HADOOP-3285:
------------------------------------
The namenode (via {{Namenode.getBlockLocations()}}) returns block location information different from that in {{split.getLocations()}}. Note that {{split.getLocations()}} is used for task cache creation in JobInProgress, while {{Namenode.getBlockLocations()}} is used by the DFS client for pulling the split before starting the actual map phase. One possible cause is HADOOP-2027. We are investigating.
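The mismatch described above can be sketched as follows. This is a minimal, self-contained illustration (not Hadoop code): the node names and helper methods are hypothetical, standing in for the scheduler's cached {{split.getLocations()}} list and the namenode's current {{getBlockLocations()}} view. If the two views disagree, a task the scheduler counted as node-local can still read its block over the network.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the JobTracker caches split.getLocations() at job
// submission, while the map task later asks the namenode for the block's
// current locations. If a replica moved or a datanode re-registered under a
// different name, the cached view is stale and a "node-local" task reads
// remotely.
public class LocalityCheck {

    // True if the node the task was scheduled on actually holds a replica
    // according to the namenode's (possibly newer) view.
    static boolean isReadLocal(String taskNode,
                               List<String> namenodeBlockLocations) {
        return namenodeBlockLocations.contains(taskNode);
    }

    // Nodes the scheduler believed were local but that the namenode no
    // longer reports as holding a replica.
    static Set<String> staleCachedLocations(List<String> cachedSplitLocations,
                                            List<String> namenodeBlockLocations) {
        Set<String> stale = new HashSet<>(cachedSplitLocations);
        stale.removeAll(namenodeBlockLocations);
        return stale;
    }

    public static void main(String[] args) {
        // Scheduler's cached view (from split.getLocations() at submit time).
        List<String> cached = Arrays.asList("node1", "node2", "node3");
        // Namenode's current view (from getBlockLocations() at read time).
        List<String> current = Arrays.asList("node2", "node3", "node4");

        System.out.println(isReadLocal("node1", current));         // false: remote read
        System.out.println(staleCachedLocations(cached, current)); // [node1]
    }
}
```

Under this sketch, the web GUI would count the task on node1 as node-local (it was in the cached list), while the actual read goes over the wire, which matches the 50/50 loopback-versus-ethernet split reported below.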
> map tasks with node local splits do not always read from local nodes
> --------------------------------------------------------------------
>
> Key: HADOOP-3285
> URL: https://issues.apache.org/jira/browse/HADOOP-3285
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Runping Qi
>
> I ran a simple map/reduce job counting the number of records in the input data.
> The number of reducers was set to 1.
> I did not set the number of mappers. Thus by default, all splits except the last split of a file contain one dfs block (128MB in my case).
> The web gui indicated that 99% of map tasks were with local splits.
> Thus I expected that most of the dfs reads should have come from the local data nodes.
> However, when I examined the traffic on each node's network interfaces,
> I found that about 50% of each node's traffic went through the loopback interface and the other 50% through the ethernet card!
> Also, the switch monitoring indicated that a lot of traffic went through the inter-switch links and across racks!
> This indicated that the data locality feature does not work as expected.
> To confirm that, I set the number of map tasks to a very high number so that it forced the split size down to about 27MB.
> The web gui indicated that 99% of map tasks were with local splits, as expected.
> The ethernet interface monitor showed that almost 100% traffic went through the loopback interface, as it should be.
> Also, the switch monitoring indicated that there was very little traffic through the links and across racks.
> This implies that some corner cases are not handled properly.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.