Posted to user@nutch.apache.org by Barry Haddow <bh...@inf.ed.ac.uk> on 2008/07/10 15:49:04 UTC
Out of memory error in readseg
Hi
We have a collection of blogs that we've crawled using Nutch, and I'd like to
copy them out to the local UNIX filesystem. I'm using nutch readseg to dump each
segment, but sometimes this dies with an OutOfMemoryError (below). The
particular segment it dies on is about 500MB in size, as opposed to 100-200MB
for most of the other segments. I've increased the max heap size on the
slaves to 1500MB, but that hasn't helped. The slaves only have 500MB of
physical RAM, so I'm going to get a lot of swapping if I push the heap
size up any further.
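For reference, the command I'm running is essentially the following (the
output directory name is just an example):

  bin/nutch readseg -dump /user/bhaddow/blog-crawl/2008/02/14/segments/20080214070430 blog-dump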
Should I keep increasing the heap size until I can load the segment, or is
there anything else I can do? Surely readseg doesn't need to hold the whole
segment in memory at once? We're using a Nutch snapshot from 2008-01-25.
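In case it's relevant, the heap increase was made by setting
mapred.child.java.opts in hadoop-site.xml on the slaves (I'm assuming that's
the setting that controls the heap of the map task JVMs), along these lines:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1500m</value>
  </property>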
thanks in advance
regards
Barry
Dumping /user/bhaddow/blog-crawl/2008/02/14/segments/20080214070430
SegmentReader: dump
segment: /user/bhaddow/blog-crawl/2008/02/14/segments/20080214070430
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
at org.apache.nutch.protocol.Content.write(Content.java:173)
at org.apache.hadoop.io.GenericWritable.write(GenericWritable.java:135)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
at org.apache.nutch.segment.SegmentReader$InputCompatMapper.map(SegmentReader.java:90)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.