Posted to user@avro.apache.org by Michael Burr <mi...@engisys.com> on 2013/08/30 23:58:21 UTC

BufferedBinaryEncoder OOM from mapred

Hello,

We are starting a project that uses map/reduce to produce Avro files. In short, our job produces Avro records that can contain very large arrays, and we can't practically predict how large some of them will get.

When we hit one of these "very large" records, the BufferedBinaryEncoder seems to blow out the heap when calling org.apache.avro.mapred.AvroMultipleOutputs$1.collect() from a reducer (see stack trace below).

Browsing through the Avro code and the Jira issues, it seems that AVRO-105 could be part of the solution here: I believe we would want to use the BlockingBinaryEncoder (or perhaps even the DirectBinaryEncoder?) to write these large arrays in a memory-efficient manner.

Am I on the right track here? If so, it also seems that we would need an additional feature to configure/enable this from mapred via the JobConf.

Since I'm not yet that familiar with the internals of Avro, I would appreciate it if anyone could give me a sanity check and/or offer other suggestions for how we might work around this problem.

Thanks in advance for your help,
-Mike


Error running child : java.lang.OutOfMemoryError: Java heap space
         at java.util.Arrays.copyOf(Arrays.java:2786)
         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
         at org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
         at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
         at org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
         at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
         at org.apache.avro.io.Encoder.writeFixed(Encoder.java:174)
         at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:164)
         at org.apache.avro.io.BinaryEncoder.writeBytes(BinaryEncoder.java:65)
         at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:212)
         at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:93)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:77)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:131)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
         at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
         at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
         at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:257)
         at org.apache.avro.mapred.AvroOutputFormat$1.write(AvroOutputFormat.java:160)
         at org.apache.avro.mapred.AvroOutputFormat$1.write(AvroOutputFormat.java:157)
         at org.apache.avro.mapred.AvroMultipleOutputs$RecordWriterWithCounter.write(AvroMultipleOutputs.java:436)
         at org.apache.avro.mapred.AvroMultipleOutputs$1.collect(AvroMultipleOutputs.java:499)
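
For reference, the top two frames of the trace show the heap running out while a java.io.ByteArrayOutputStream grows its backing array via Arrays.copyOf. The following is a minimal, JDK-only sketch of that growth pattern; it is not Avro code, and the class and method names (BufferGrowthSketch, InspectableBuffer, countReallocations) are made up for illustration. It only shows that each growth step copies the whole buffer, so the old and the new array are both live during the copy.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Avro or Hadoop.
public class BufferGrowthSketch {

    /** Exposes the length of the protected backing array of ByteArrayOutputStream. */
    static class InspectableBuffer extends ByteArrayOutputStream {
        InspectableBuffer(int initialCapacity) {
            super(initialCapacity);
        }

        int capacity() {
            return buf.length; // 'buf' is a protected field of ByteArrayOutputStream
        }
    }

    /**
     * Writes totalBytes into an in-memory stream in chunks of chunkSize and
     * counts how often the backing array was replaced by a larger copy.
     */
    static int countReallocations(int totalBytes, int chunkSize) {
        InspectableBuffer out = new InspectableBuffer(32);
        byte[] chunk = new byte[chunkSize];
        int reallocations = 0;
        int written = 0;
        while (written < totalBytes) {
            int n = Math.min(chunkSize, totalBytes - written);
            int before = out.capacity();
            out.write(chunk, 0, n); // may call Arrays.copyOf internally to grow
            if (out.capacity() != before) {
                reallocations++;
            }
            written += n;
        }
        return reallocations;
    }

    public static void main(String[] args) {
        // Simulate buffering a 1 MiB "record" that arrives in 4 KiB pieces.
        System.out.println("reallocations: " + countReallocations(1 << 20, 4096));
    }
}
```

If the serialized record has to sit in such a buffer in full before it is written out, the peak heap needed is at least the record size plus the size of the array being copied from, which would be consistent with the OOM surfacing in Arrays.copyOf.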
