Posted to common-dev@hadoop.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2007/10/15 16:50:50 UTC
[jira] Commented: (HADOOP-2054) Improve memory model for map-side sorts
[ https://issues.apache.org/jira/browse/HADOOP-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534860 ]
Koji Noguchi commented on HADOOP-2054:
--------------------------------------
Does map output always go through DataOutputBuffer.Buffer.write?
From the HADOOP-2053 stack trace:
<stack>
task_200710112103_0001_m_000015_1: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:340)
</stack>
Text.write calls DataOutputStream.write directly, which calls ByteArrayOutputStream.write.
What I expected was DataOutputBuffer.write --> DataOutputBuffer.Buffer.write.
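The doubling-and-copying behaviour at the root of that OutOfMemoryError can be sketched as follows (a minimal illustration, not Hadoop code; the class and method names here are mine):

```java
// Minimal sketch of the grow-by-doubling that ByteArrayOutputStream
// (and DataOutputBuffer.Buffer) perform under the hood: each overflow
// allocates a new, larger array and copies the old contents across,
// so the old and new arrays briefly coexist on the heap.
public class GrowthSketch {
    // Count how many allocate-and-copy steps a doubling buffer needs
    // to grow from `initial` bytes to at least `target` bytes.
    static int countDoublings(int initial, long target) {
        long size = initial;
        int copies = 0;
        while (size < target) {
            size <<= 1; // double, as in buf.length << 1
            copies++;   // each step is a new array plus an arraycopy
        }
        return copies;
    }

    public static void main(String[] args) {
        // Growing from 32 bytes to 128MB takes 22 doublings; each one
        // re-copies everything written so far, and the final step needs
        // ~128MB + ~256MB live at once during the copy.
        System.out.println(countDoublings(32, 128L << 20)); // prints 22
    }
}
```

That transient "old array + new array" footprint during the last few copies is why the map can die in Arrays.copyOf even when the output itself would fit in the heap.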
> Improve memory model for map-side sorts
> ---------------------------------------
>
> Key: HADOOP-2054
> URL: https://issues.apache.org/jira/browse/HADOOP-2054
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
> Fix For: 0.16.0
>
>
> {{MapTask#MapOutputBuffer}} uses a plain-jane {{DataOutputBuffer}}, which defaults to a buffer of 32 bytes, and the {{DataOutputBuffer#write}} call doubles the underlying byte-array when it needs more space.
> However, for maps which output any decent amount of data (e.g. 128MB in examples/Sort.java) this means the buffer grows painfully slowly from 2^6 to 2^28, and each doubling creates a new array, followed by an array-copy:
> {noformat}
> public void write(DataInput in, int len) throws IOException {
>   int newcount = count + len;
>   if (newcount > buf.length) {
>     byte newbuf[] = new byte[Math.max(buf.length << 1, newcount)];
>     System.arraycopy(buf, 0, newbuf, 0, count);
>     buf = newbuf;
>   }
>   in.readFully(buf, count, len);
>   count = newcount;
> }
> {noformat}
> I reckon we could do much better in the {{MapTask}}, specifically...
> E.g. we start with a buffer of size 1/4KB and quadruple, rather than double, up to, say, 4/8/16MB; then we resume doubling (or less).
> This way the buffer quickly ramps up, minimizing the number of {{System.arraycopy}} calls and the number of small, short-lived buffers to GC; later it falls back to doubling so we don't over-allocate, minimizing memory wasted to fragmentation.
> Of course, this issue is about benchmarking and figuring if all this is worth it, and, if so, what are the right set of trade-offs to make.
> Thoughts?
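The quadruple-then-double policy proposed above could look roughly like this (a sketch under my own assumptions; the crossover size, class, and method names are illustrative, not from any Hadoop patch):

```java
// Sketch of a two-phase growth policy: quadruple from a small initial
// size up to a crossover point, then fall back to doubling to limit
// over-allocation and fragmentation.
public class GrowthPolicySketch {
    // One of the suggested cut-over sizes (4MB); 8MB or 16MB would
    // trade fewer copies for more potential waste.
    static final int CROSSOVER = 4 << 20;

    // Smallest capacity >= `needed`, reachable from `current` by
    // quadrupling below CROSSOVER and doubling at or above it.
    static int nextCapacity(int current, int needed) {
        int cap = current;
        while (cap < needed) {
            cap = (cap < CROSSOVER) ? cap << 2 : cap << 1;
        }
        return cap;
    }

    public static void main(String[] args) {
        // From 256 bytes (1/4KB), quadrupling reaches 4MB in 7 steps
        // where a doubling buffer would need 14 allocate-and-copy steps.
        System.out.println(nextCapacity(256, 4 << 20)); // prints 4194304
    }
}
```

Whether the crossover should be 4, 8, or 16MB is exactly the benchmarking question the issue raises.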
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.