Posted to general@hadoop.apache.org by Oded Rosen <od...@legolas-media.com> on 2010/07/29 08:35:52 UTC

OutOfMemoryError when writing a huge writable

Hi,

My job uses the 0.18 API (my cluster's Hadoop version is 0.20.1), and I have my
own set of nested Writables as input and output; some of them can be really
big, memory-wise (a simplified sketch of the structure follows the stack trace
below).
During the finishing steps of my reduce() function, when I call
output.collect() on a particularly large Writable, the task fails with a Java
heap space error:

2010-07-28 16:08:46,727 FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2786)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
	at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
	at com.legolas.data.users.EntryWritable.write(EntryWritable.java:339)
	at com.legolas.data.users.UserWritable.write(UserWritable.java:1313)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
	at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1006)
	at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
	at org.apache.hadoop.mapred.lib.MultipleOutputs$1.collect(MultipleOutputs.java:521)
	at com.legolas.data.users.UserReducer.reduce(UserReducer.java:267)
	at com.legolas.data.users.UserReducer.reduce(UserReducer.java:32)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
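
To give a feel for what I mean by nested Writables, here is the simplified
sketch of the structure the stack trace goes through. The fields and helper
methods here are placeholders I made up for illustration, not my real classes:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Placeholder stand-in for the inner writable; the real class has many more fields.
class EntryWritable implements Writable {
    private long timestamp;   // assumed field, just to show where writeLong() comes from

    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);          // the writeLong() visible in the trace
    }

    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
    }
}

// Placeholder stand-in for the outer writable that nests many entries.
class UserWritable implements Writable {
    private List<EntryWritable> entries = new ArrayList<EntryWritable>();

    public void addEntry(EntryWritable e) {
        entries.add(e);
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (EntryWritable e : entries) {  // a "big" user has a huge entry list
            e.write(out);                  // nested write, as in the trace
        }
    }

    public void readFields(DataInput in) throws IOException {
        entries.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            EntryWritable e = new EntryWritable();
            e.readFields(in);
            entries.add(e);
        }
    }
}

So a "big" object is simply one whose nested list has grown very large, and a
single write() call ends up pushing a lot of bytes into the stream at once.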

This failure was during writeLong(), but many other writes had already been
made to the same stream before that call, on that particular instance of my
object (it was a big one; I can tell because that specific output.collect()
only happens on very big objects).
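
For context, the collect() call in question looks roughly like this; the named
output, buildUser() and addEntry() are placeholders for illustration, not my
exact code:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class UserReducer extends MapReduceBase
        implements Reducer<Text, EntryWritable, Text, UserWritable> {

    private MultipleOutputs multipleOutputs;

    public void configure(JobConf job) {
        multipleOutputs = new MultipleOutputs(job);
    }

    public void reduce(Text key, Iterator<EntryWritable> values,
                       OutputCollector<Text, UserWritable> output,
                       Reporter reporter) throws IOException {
        // Accumulate every entry for this user into a single UserWritable.
        UserWritable user = buildUser(values);

        // This is the collect() that fails for very big users; judging by the
        // trace, the whole UserWritable is serialized into an in-memory buffer
        // before it is appended to the SequenceFile.
        multipleOutputs.getCollector("bigUsers", reporter).collect(key, user);
    }

    public void close() throws IOException {
        multipleOutputs.close();
    }

    private UserWritable buildUser(Iterator<EntryWritable> values) {
        UserWritable user = new UserWritable();
        while (values.hasNext()) {
            // Real code would copy the value, since the old API reuses it.
            user.addEntry(values.next());
        }
        return user;
    }
}
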
My first guess is that I should flush() the stream periodically during these
big writes, and I want to know whether the DataOutputStream Hadoop uses
actually honours flush(), since many other streams do not really pay attention
to flush() calls (according to the Java API documentation for
DataOutputStream's flush()).
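
My understanding (an assumption on my part, based only on the stack trace) is
that the value is serialized into an in-memory ByteArrayOutputStream before it
is appended to the SequenceFile, and on that chain flush() releases nothing:
DataOutputStream just forwards flush() to the underlying stream, and
ByteArrayOutputStream ignores it. A tiny standalone check along these lines
shows the buffered bytes are still held after flush():

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlushCheck {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);

        for (int i = 0; i < 1000; i++) {
            out.writeLong(i);              // 8 bytes each, all kept in the buffer
        }
        out.flush();                       // forwarded to ByteArrayOutputStream.flush(), which is a no-op

        System.out.println("bytes still buffered after flush(): " + buffer.size());  // prints 8000
    }
}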

Does anyone know how I can outwit this problem and write my big Writables
without failure?

Again, this only happens when I write() a particularly large Writable. Other
write() operations end without a problem.

-- 
Oded