Posted to user@nutch.apache.org by Barry Haddow <bh...@inf.ed.ac.uk> on 2008/07/10 15:49:04 UTC

Out of memory error in readseg

Hi 

We have a collection of blogs that we've crawled using Nutch, and I'd like to 
copy them to the UNIX filesystem. I'm using nutch readseg to copy each 
segment, but sometimes this dies with an OutOfMemoryError (below). The 
segment it dies on is about 500MB in size, as opposed to 100-200MB 
for most of the other segments. I've increased the max heap size on the 
slaves to 1500MB, but that hasn't helped. The slaves only have 500MB of 
physical RAM, so I'm going to get a lot of swapping if I try to push the heap 
size up.

Should I keep increasing the heap size until I can load the segment, or is 
there anything else I can do? Surely readseg doesn't need to hold the whole 
segment in memory at once? We're using a Nutch snapshot from 2008-01-25.

thanks in advance
regards
Barry


Dumping /user/bhaddow/blog-crawl/2008/02/14/segments/20080214070430
SegmentReader: dump segment: /user/bhaddow/blog-crawl/2008/02/14/segments/20080214070430
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
        at org.apache.nutch.protocol.Content.write(Content.java:173)
        at org.apache.hadoop.io.GenericWritable.write(GenericWritable.java:135)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:349)
        at org.apache.nutch.segment.SegmentReader$InputCompatMapper.map(SegmentReader.java:90)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.