Posted to mapreduce-issues@hadoop.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2012/10/27 10:49:11 UTC

[jira] [Commented] (MAPREDUCE-4755) Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485388#comment-13485388 ] 

Gopal V commented on MAPREDUCE-4755:
------------------------------------

TeraSort benchmarks for 10M entries showed wall-clock time improving from 70s to 53s, roughly 24% less. In the faster run, user CPU time is higher than wall-clock time because the sort runs asynchronously in a FutureTask.

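Running "terasort /tmp/data/ file:///tmp/t.$RANDOM/" gave the output below. The first block is the faster run with the patch, the second is the slower baseline: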

{code}
	Map-Reduce Framework
		Map input records=10000000
		Map output records=10000000
		GC time elapsed (ms)=81
		Total committed heap usage (bytes)=4242079744
real	0m53.355s
user	0m56.392s
sys	0m6.548s
{code}

{code}
	Map-Reduce Framework
		Map input records=10000000
		Map output records=10000000
		GC time elapsed (ms)=374
		Total committed heap usage (bytes)=4878761984
real	1m10.191s
user	1m8.908s
sys	0m8.609s
{code}

The results from both runs are identical byte-for-byte:

{code}
$ md5sum t.19982/part-r-00000 t.13037/part-r-00000 
d3368a9e0897ea8efcd2a290d8e27906  t.19982/part-r-00000
d3368a9e0897ea8efcd2a290d8e27906  t.13037/part-r-00000
{code}

The combiner remains to be tested and the counters+progress indicators need to be fixed.
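
To illustrate why user CPU time can exceed wall-clock time, here is a minimal sketch of overlapping collect and sort with FutureTask. It is not code from the attached patch: the class AsyncSortSketch, its methods, and the use of plain int keys are all assumptions made for illustration. Each filled chunk is handed to a background sorter while the caller keeps collecting the next chunk, and the sorted chunks are merged at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

// Hypothetical sketch: names and structure are illustrative, not from the patch.
class AsyncSortSketch {

    // Collects input in chunks of chunkSize (must be > 0), sorts each chunk on a
    // background thread while collection continues, then merges the sorted chunks.
    static int[] collectAndSort(int[] input, int chunkSize) {
        ExecutorService sorter = Executors.newSingleThreadExecutor();
        List<FutureTask<int[]>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < input.length; start += chunkSize) {
                // "Collect": copy the next run of records into its own buffer.
                int end = Math.min(input.length, start + chunkSize);
                int[] chunk = Arrays.copyOfRange(input, start, end);
                // "Sort": runs asynchronously, so this loop does not wait for it.
                FutureTask<int[]> task = new FutureTask<>(() -> {
                    Arrays.sort(chunk);
                    return chunk;
                });
                sorter.execute(task);
                pending.add(task);
            }
            int[] merged = new int[0];
            for (FutureTask<int[]> task : pending) {
                merged = merge(merged, task.get()); // blocks until that chunk is sorted
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException(e);
        } finally {
            sorter.shutdown();
        }
    }

    // Standard two-way merge of two sorted arrays.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = collectAndSort(new int[] {5, 3, 9, 1, 7, 2, 8}, 3);
        System.out.println(Arrays.toString(sorted));
    }
}
```

With two threads busy at once, the CPU seconds consumed can add up to more than the elapsed time, which matches the pattern in the faster run above.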
                
> Rewrite MapOutputBuffer to use direct buffers & allow parallel sort+collect
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4755
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4755
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 12.10 x86_64 (Bulldozer 8-core)
>            Reporter: Gopal V
>            Assignee: Gopal V
>              Labels: optimization, sort
>         Attachments: 0001-first-cut-of-MMapOutputBuffer.patch
>
>
> The MapOutputBuffer has been written with a very severe constraint on the amount of memory it can consume. This results in code that has to page in and page out (i.e. spill) data as it passes through the map buffers.
> With the java.nio package, there is a fast and portable mmap alternative to managing your own buffers. The mapped memory lives outside the Java GC heap and yet provides reasonably fast access to all the data.
> The suggestion is that, given enough address space, mmap() direct buffers can be faster when a spill is involved and simpler than the current spill logic, because they rely on the buffer cache to deliver best-effort I/O.
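
For readers unfamiliar with the java.nio mapping API mentioned in the description, the following is a minimal sketch of writing records through a memory-mapped buffer. It is not taken from the attached patch: the class MappedBufferSketch, the length-prefixed record layout, and the method names are assumptions for illustration only. The calls used (FileChannel.open, FileChannel.map, MappedByteBuffer.force) are standard JDK APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: names and record layout are illustrative, not from the patch.
class MappedBufferSketch {

    // Writes each record as (int length, bytes) into a memory-mapped file and
    // returns the number of bytes used. The mapping lives outside the Java heap.
    static int writeRecords(Path file, int capacity, byte[][] records) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
            for (byte[] record : records) {
                buf.putInt(record.length);
                buf.put(record);
            }
            int used = buf.position();
            buf.force(); // ask the OS to flush dirty pages to the file
            return used;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the file read-only and returns the first length-prefixed record.
    static byte[] readFirstRecord(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] record = new byte[buf.getInt()];
            buf.get(record);
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "mmap-sketch-" + System.nanoTime() + ".buf");
        int used = writeRecords(file, 1024, new byte[][] {"alpha".getBytes(), "beta".getBytes()});
        System.out.println(used + " bytes used, first record: " + new String(readFirstRecord(file)));
        file.toFile().delete();
    }
}
```

In this sketch the operating system decides when mapped pages are written back, so there is no explicit spill step in the Java code. Whether that is faster in practice depends on available address space and memory pressure, as the description notes.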
