Posted to mapreduce-issues@hadoop.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2014/12/16 02:11:47 UTC
[jira] [Created] (MAPREDUCE-6197) Cache MapOutputLocations in ShuffleHandler
Siddharth Seth created MAPREDUCE-6197:
-----------------------------------------
Summary: Cache MapOutputLocations in ShuffleHandler
Key: MAPREDUCE-6197
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6197
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Siddharth Seth
ShuffleHandler currently seems to build a map of mapId -> mapInfo (file.out / index information) each time it receives a message.
It should cache map info across requests, so that a scan of all directories is not required for each reducer fetching from the same map.
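A minimal sketch of the kind of cross-request cache suggested above, not the actual ShuffleHandler code. The class `MapOutputLocationCache`, the `Location` holder, the `scanner` function, and the paths are all hypothetical names for illustration. A real cache would presumably also need to key on job and user, bound its size, and drop entries when an application's files are deleted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Sketch: remember per-map output locations across shuffle requests. */
public class MapOutputLocationCache {

    /** Hypothetical holder for the resolved file.out / index paths of one map. */
    public static final class Location {
        public final String outputPath;
        public final String indexPath;

        public Location(String outputPath, String indexPath) {
            this.outputPath = outputPath;
            this.indexPath = indexPath;
        }
    }

    // Shared by all requests; unbounded here, which a real cache should not be.
    private final Map<String, Location> cache = new ConcurrentHashMap<>();

    // Stands in for the expensive scan over the local directories (assumption).
    private final Function<String, Location> scanner;

    public MapOutputLocationCache(Function<String, Location> scanner) {
        this.scanner = scanner;
    }

    /** Returns the cached location, running the scan only on a miss. */
    public Location get(String mapId) {
        return cache.computeIfAbsent(mapId, scanner);
    }

    /** Drops an entry, e.g. once the map output has been cleaned up. */
    public void remove(String mapId) {
        cache.remove(mapId);
    }

    public static void main(String[] args) {
        AtomicInteger scans = new AtomicInteger();
        MapOutputLocationCache cache = new MapOutputLocationCache(mapId -> {
            scans.incrementAndGet(); // count how often the "scan" really runs
            // Placeholder paths, not the real NodeManager layout.
            return new Location("/local/" + mapId + "/file.out",
                                "/local/" + mapId + "/file.out.index");
        });

        // Three reducers fetch from the same map; only the first triggers a scan.
        String mapId = "attempt_0001_m_000000_0";
        cache.get(mapId);
        cache.get(mapId);
        cache.get(mapId);

        System.out.println("scans = " + scans.get()); // prints "scans = 1"
    }
}
```

Because every lookup goes through `get`, a second lookup for the same mapId within one request is also served from the cache rather than triggering another scan.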
Also, the scan for each map output / index file is performed twice per mapId within a request: in populateHeaders, once in the call to getMapOutputInfo and then again directly in the method.
For an invocation where we do end up with more than 1000 (the default) mapIds in a single call, and so don't cache all of them in the map, the path constructed for the uncached entries will be invalid. This is highly unlikely to be the case though, until there's proper caching.
{code}
MapOutputInfo info = mapOutputInfoMap.get(mapId);
if (info == null) {
  info = getMapOutputInfo(outputBasePathStr, mapId, reduceId, user);
}
{code}