Posted to mapreduce-issues@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2014/04/04 20:04:19 UTC

[jira] [Updated] (MAPREDUCE-5821) IFile merge allocates new byte array for every value

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated MAPREDUCE-5821:
-----------------------------------

    Attachment: after-patch.png
                before-patch.png
                mapreduce-5821.txt

Attached patch should fix the issue by making the Merger call nextRawKeyValue on the "disk" DataInputBuffer instead of calling it on "value" and resetting "value" every time to an empty buffer.
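
To illustrate the idea, here is a minimal standalone sketch of the allocation pattern. It does not use the real Hadoop classes: ValueBuffer, readValue, and merge are hypothetical stand-ins for DataInputBuffer, the IFile reader, and the Merger loop, and the "allocate when the target array is too small" behavior is an assumption made for the example. It only shows why reading into a buffer that has just been reset to empty forces one allocation per value, while reading into a long-lived "disk" buffer and then pointing "value" at its bytes does not.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-ins for illustration; these are NOT Hadoop's
// DataInputBuffer, IFile.Reader, or Merger classes.
final class BufferReuseSketch {

    /** Minimal reusable buffer: a backing array plus the count of valid bytes. */
    static final class ValueBuffer {
        byte[] data = new byte[0];
        int length = 0;

        /** Points this buffer at an existing array without copying. */
        void reset(byte[] newData, int newLength) {
            data = newData;
            length = newLength;
        }
    }

    static final byte[] EMPTY = new byte[0];

    /** Number of byte[] allocations made by readValue since the last merge() started. */
    static int allocations = 0;

    /** Reads one value into target, allocating only when target's array is too small. */
    static void readValue(ByteBuffer src, int valueLength, ValueBuffer target) {
        byte[] bytes = target.data;
        if (bytes.length < valueLength) {
            bytes = new byte[valueLength * 2]; // grow with some headroom
            allocations++;
        }
        src.get(bytes, 0, valueLength);
        target.reset(bytes, valueLength);
    }

    /** Simulates a merge over numValues fixed-size values; returns the allocation count. */
    static int merge(int numValues, int valueLength, boolean reuseDiskBuffer) {
        allocations = 0;
        ByteBuffer src = ByteBuffer.wrap(new byte[numValues * valueLength]);
        ValueBuffer value = new ValueBuffer();
        ValueBuffer disk = new ValueBuffer(); // long-lived buffer for on-disk values
        for (int i = 0; i < numValues; i++) {
            if (reuseDiskBuffer) {
                // Patched pattern: read into the long-lived buffer, then alias it.
                readValue(src, valueLength, disk);
                value.reset(disk.data, disk.length);
            } else {
                // Unpatched pattern: "value" is emptied first, so the read must allocate.
                value.reset(EMPTY, 0);
                readValue(src, valueLength, value);
            }
        }
        return allocations;
    }

    public static void main(String[] args) {
        System.out.println("without reuse: " + merge(1000, 64, false)); // prints 1000
        System.out.println("with reuse: " + merge(1000, 64, true));     // prints 1
    }
}
```

In this sketch the reused buffer allocates once, on the first read, and then serves every later value of the same or smaller size. The trade-off is that "value" aliases the disk buffer's array, so its contents are only valid until the next read.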

Also attached are YourKit-generated heap graphs from before and after the patch. The graphs show a map task that sorts 1 GB of data with default settings (io.sort.mb of 100 MB) and 100 reducers, which causes a lot of merges. Without the patch, there are many allocations starting about halfway through (when the merge phase begins), which force the heap to grow. With the patch, there is only a gentle upward slope at that point, caused by allocations of the various input readers.

Wall-clock time doesn't seem to change significantly, at least in my micro-benchmark, since there is enough heap headroom to grow and absorb the useless allocations. That said, the patch might yield more meaningful improvements on a more heap-constrained workload or with a larger heap.

> IFile merge allocates new byte array for every value
> ----------------------------------------------------
>
>                 Key: MAPREDUCE-5821
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5821
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: performance, task
>    Affects Versions: 2.4.1
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: after-patch.png, before-patch.png, mapreduce-5821.txt
>
>
> I wrote a standalone benchmark of the MapOutputBuffer and found that it did a lot of allocations during the merge phase. After looking at an allocation profile, I found that IFile.Reader.nextRawValue() would always allocate a new byte array for every value, so the allocation rate goes way up during the merge phase of the mapper. I imagine this also affects the reducer input, though I didn't profile that.


