Posted to common-user@hadoop.apache.org by Calvin <ip...@gmail.com> on 2012/01/19 15:27:45 UTC

processing binary sequencefiles, java heap space and child errors, and skipping bad records

Hi everyone,

I'm attempting to do something a bit unconventional with Hadoop and
haven't had much luck in getting things to work as smoothly as I'd
like. I'd like to do some processing over a large set of SequenceFiles
that contain data in both text and binary formats. I'm using the
SequenceFileAsTextInputFormat class as my InputFormat, unhexifying the
value of each <key, value> pair, and then manipulating the decoded
bytes.
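
For concreteness, here's a minimal sketch of the kind of mapper I'm
running (the class and helper names are illustrative, not my actual
code). SequenceFileAsTextInputFormat hands the mapper the value as the
toString() of the underlying writable, which for raw bytes comes out
as space-separated hex pairs, so I decode that back first:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class HexDecodingMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      public void map(Text key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        byte[] raw = unhexify(value.toString());
        // ... manipulate raw and emit results ...
      }

      // Decode "0a 1b 2c ..." back into raw bytes (error handling
      // elided; assumes a non-empty, well-formed hex string).
      private static byte[] unhexify(String hex) {
        String[] pairs = hex.split(" ");
        byte[] out = new byte[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
          out[i] = (byte) Integer.parseInt(pairs[i], 16);
        }
        return out;
      }
    }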

This works to a degree: in one case the job fails consistently, across
different machines, on specific parts of the SequenceFile (which is
approximately 60 GB), with the rather ambiguous message "Error: Java
heap space". Some additional digging around in
/usr/local/hadoop/logs/userlogs turns up the following:

stderr> Could not create the Java virtual machine.
stdout> Error occurred during initialization of VM
stdout> Could not reserve enough space for object heap
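
Those messages mean the child JVM never started at all. One quick
sanity check (not a definitive diagnosis) is to impose a
virtual-memory ulimit matching mapred.child.ulimit on a worker node
and see whether a bare JVM will even start with the requested heap:

    # on a worker node; values mirror the job settings below
    ulimit -v 3145728      # virtual-memory limit, in KB (3 GB)
    java -Xmx1500m -version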

I've attempted to tweak the configuration settings by allocating more memory:

    conf.set("mapred.child.java.opts", "-Xmx1500m");
    conf.set("mapred.map.java.opts", "-Xmx1500m");
    conf.set("mapred.reduce.java.opts", "-Xmx1024m");
    conf.set("mapred.child.ulimit", "3145728");
    conf.set("mapreduce.map.maxattempts", "45");

None of this has helped. Most of the machines in the cluster I'm
working on have multiple processors and 4-8 GB of RAM, with low
utilization. Increasing the heap size beyond 1500 MB results in
another ambiguous error:

java.lang.Throwable: Child Error
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:225)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:212)

I wouldn't mind skipping over these problematic records, but record
skipping doesn't seem to be supported for SequenceFiles [1] (if I'm
wrong, please let me know!). And short of manually splitting up the
large SequenceFile, skipping the whole 60 GB of data doesn't seem like
a great solution either.
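
The only fallback I can think of is catching per-record failures
inside the mapper and counting them rather than letting the task die.
A rough sketch (hand-rolled, not the framework's skipping feature, and
it obviously can't help when the child JVM fails to launch at all):

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      try {
        byte[] raw = unhexify(value.toString());
        // ... normal processing ...
      } catch (RuntimeException e) {
        // Skip the bad record, but make the damage visible in the
        // job counters instead of failing silently.
        reporter.incrCounter("MyJob", "BadRecords", 1);
      }
    }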

I've encountered the same errors using Python with Hadoop Streaming
and a "dummy" MapReduce script (e.g., read the input and emit nothing
but zeros), though I did manage to get Streaming/Python to work at one
point by increasing the number of maps to an absurd 4000 (usually 863
maps are allocated). I could tolerate that inefficiency if I had a way
to set the number of mappers manually in Java.
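
From what I can tell, the old API does take a map-count hint, though
I'm not sure it is honored the way Streaming's setting was, which is
part of why I'm asking. A minimal sketch (MyJob is a placeholder):

    JobConf conf = new JobConf(MyJob.class);
    // A hint rather than a hard setting: FileInputFormat divides the
    // total input size by this number when sizing splits, so a larger
    // value yields more, smaller splits.
    conf.setNumMapTasks(4000);
    // equivalently: conf.set("mapred.map.tasks", "4000");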

Any thoughts or experiences in working with SequenceFiles containing
binary data? Are there any additional configuration settings I should
be tweaking or configuring?

Thanks,
Calvin

[1] https://issues.apache.org/jira/browse/MAPREDUCE-15