You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2015/05/02 06:36:08 UTC
[jira] [Commented] (MAPREDUCE-5611) CombineFileInputFormat only
requests a single location per split when more could be optimal
[ https://issues.apache.org/jira/browse/MAPREDUCE-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524745#comment-14524745 ]
Hadoop QA commented on MAPREDUCE-5611:
--------------------------------------
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12613866/CombineFileInputFormat-trunk.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5536/console |
This message was automatically generated.
> CombineFileInputFormat only requests a single location per split when more could be optimal
> -------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-5611
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5611
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 1.2.1
> Reporter: Chandra Prakash Bhagtani
> Assignee: Chandra Prakash Bhagtani
> Attachments: CombineFileInputFormat-trunk.patch
>
>
> I have come across an issue with CombineFileInputFormat. Actually I ran a hive query on approx 1.2 GB data with CombineHiveInputFormat which internally uses CombineFileInputFormat. My cluster size is 9 datanodes and max.split.size is 256 MB
> When I ran this query with replication factor 9, hive consistently creates all 6 rack-local tasks and with replication factor 3 it creates 5 rack-local and 1 data local tasks.
> When replication factor is 9 (equal to cluster size), all the tasks should be data-local as each datanode contains all the replicas of the input data, but that is not happening i.e all the tasks are rack-local.
> When I dug into CombineFileInputFormat.java code in getMoreSplits method, I found the issue with the following snippet (specially in case of higher replication factor)
> {code:title=CombineFileInputFormat.java|borderStyle=solid}
> for (Iterator<Map.Entry<String,
> List<OneBlockInfo>>> iter = nodeToBlocks.entrySet().iterator();
> iter.hasNext();) {
> Map.Entry<String, List<OneBlockInfo>> one = iter.next();
> nodes.add(one.getKey());
> List<OneBlockInfo> blocksInNode = one.getValue();
> // for each block, copy it into validBlocks. Delete it from
> // blockToNodes so that the same block does not appear in
> // two different splits.
> for (OneBlockInfo oneblock : blocksInNode) {
> if (blockToNodes.containsKey(oneblock)) {
> validBlocks.add(oneblock);
> blockToNodes.remove(oneblock);
> curSplitSize += oneblock.length;
> // if the accumulated split size exceeds the maximum, then
> // create this split.
> if (maxSize != 0 && curSplitSize >= maxSize) {
> // create an input split and add it to the splits array
> addCreatedSplit(splits, nodes, validBlocks);
> curSplitSize = 0;
> validBlocks.clear();
> }
> }
> }
> {code}
> First node in the map nodeToBlocks has all the replicas of input file, so the above code creates 6 splits all with only one location. Now if JT doesn't schedule these tasks on that node, all the tasks will be rack-local, even though all the other datanodes have all the other replicas.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)