You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Jeff Hammerbacher (JIRA)" <ji...@apache.org> on 2009/09/18 06:39:57 UTC

[jira] Commented: (MAPREDUCE-64) Map-side sort is hampered by io.sort.record.percent

    [ https://issues.apache.org/jira/browse/MAPREDUCE-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756962#action_12756962 ] 

Jeff Hammerbacher commented on MAPREDUCE-64:
--------------------------------------------

Hey Chris,

Any progress on this patch? We've held off on working on it, since you mentioned you had some work already done. Should we continue to hold off?

Thanks,
Jeff

> Map-side sort is hampered by io.sort.record.percent
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-64
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-64
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Arun C Murthy
>
> Currently io.sort.record.percent is a fairly obscure, per-job configurable, expert-level parameter which controls how much accounting space is available for records in the map-side sort buffer (io.sort.mb). Typically values for io.sort.mb (100) and io.sort.record.percent (0.05) imply that we can store ~350,000 records in the buffer before necessitating a sort/combine/spill.
> However for many applications which deal with small records e.g. the world-famous wordcount and it's family this implies we can only use 5-10% of io.sort.mb i.e. (5-10M) before we spill inspite of having _much_ more memory available in the sort-buffer. The word-count for e.g. results in ~12 spills (given hdfs block size of 64M). The presence of a combiner exacerbates the problem by piling serialization/deserialization of records too...
> Sure, jobs can configure io.sort.record.percent, but it's tedious and obscure; we really can do better by getting the framework to automagically pick it by using all available memory (upto io.sort.mb) for either the data or accounting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.