Posted to mapreduce-dev@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2014/07/23 23:52:40 UTC
[jira] [Resolved] (MAPREDUCE-902) Map output merge still uses unnecessary seeks
[ https://issues.apache.org/jira/browse/MAPREDUCE-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved MAPREDUCE-902.
----------------------------------------
Resolution: Fixed
Probably stale.
> Map output merge still uses unnecessary seeks
> ---------------------------------------------
>
> Key: MAPREDUCE-902
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-902
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 0.20.1
> Reporter: Christian Kunz
>
> HADOOP-3638 improved the merge of the map output by caching the index files.
> But why not also cache the data files?
> In our use case we are still on hadoop-0.18.3, and HADOOP-3638 would help only partially: an individual map task finishes in less than 30 minutes but needs 4 hours to merge 70 spills for 20,000 partitions (with lzo compression), reading about 10 kB from each spill file (which is re-opened for every partition). As this is just a merge sort, there is no reason not to keep the input files open and eliminate seeks altogether by using sequential access.
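
The idea in the report can be illustrated with a minimal sketch. This is not Hadoop's merge code: the names write_spill, read_segment and merge_spills are hypothetical, in-memory streams stand in for spill files, and the sketch assumes each spill stores its partitions contiguously in partition order, with a per-partition record count playing the role of the cached index. Under those assumptions, every spill is opened once and read front to back, with no seek and no re-open per partition.

```python
import heapq
import io


def write_spill(partitions):
    """Serialize one spill (hypothetical format): partitions are written
    back to back in partition order, one integer record per line.
    Returns (data, index), where index[p] is the record count of partition p."""
    buf = io.StringIO()
    index = []
    for records in partitions:
        for record in records:
            buf.write(f"{record}\n")
        index.append(len(records))
    return buf.getvalue(), index


def read_segment(stream, count):
    """Yield `count` records starting at the stream's current position.
    No seek is issued; the stream simply continues where it left off."""
    for _ in range(count):
        yield int(stream.readline())


def merge_spills(spills):
    """Merge all spills partition by partition.
    Each spill is opened once and kept open for the whole merge."""
    streams = [io.StringIO(data) for data, _ in spills]
    num_partitions = len(spills[0][1])
    merged = []
    for p in range(num_partitions):
        segments = [
            read_segment(stream, index[p])
            for stream, (_, index) in zip(streams, spills)
        ]
        # heapq.merge performs the k-way merge of the sorted segments.
        # Consuming it fully leaves every stream at the start of
        # partition p + 1, so the next round again needs no seek.
        merged.append(list(heapq.merge(*segments)))
    return merged


if __name__ == "__main__":
    spills = [
        write_spill([[1, 4], [2, 9]]),
        write_spill([[3], [0, 5]]),
    ]
    print(merge_spills(spills))
```

For scale, the figures in the report imply 70 x 20,000 = 1,400,000 segment reads of about 10 kB each, roughly 14 GB in total. If the whole 4 hours (14,400 s) were spent on those reads, which is an assumption, that would be about 10 ms per read, on the order of a disk seek.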
--
This message was sent by Atlassian JIRA
(v6.2#6252)