Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/31 01:47:05 UTC

Parser crash with HeapSpace error

Hello,

I ran into a bad situation.

After crawling and parsing about 130k pages in multiple
generate/fetch/parse/update cycles, the parser crashed today with:

Error parsing: http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)

and in the hadoop.log, more verbosely:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.nutch.protocol.Content.readFields(Content.java:140)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The strange thing is that the parser didn't stop running. It remains in
a state where it consumes 100 % CPU but doesn't do anything any more.

The last lines it wrote to the hadoop.log file were:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.nutch.protocol.Content.readFields(Content.java:140)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2011-08-31 01:27:00,722 INFO  mapred.JobClient - Job complete: job_local_0001
2011-08-31 01:27:08,975 INFO  mapred.JobClient - Counters: 11
2011-08-31 01:27:08,975 INFO  mapred.JobClient -   ParserStatus
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     failed=313
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     success=14826
2011-08-31 01:27:08,975 INFO  mapred.JobClient -   FileSystemCounters
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     FILE_BYTES_READ=2047029532
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
2011-08-31 01:27:08,975 INFO  mapred.JobClient -   Map-Reduce Framework
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Combine output records=0
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map input records=15746
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Spilled Records=15138
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map output bytes=83235364
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map input bytes=306386116
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Combine input records=0
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map output records=15139

Now it is 01:42 and nothing has happened since that last log entry, but
the Java process is still using all of the CPU.

I think there is something wrong.

It seems to me that my machine has too little memory (2 GB). But I am a
little curious that top says the Java process is only using 52 % of the
memory.

Any suggestions?

BTW: I don't want to parse UNFO files. In fact I have no idea what they
are! But there are many strange file types in our university network.
Handling them is another topic for me :)
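
For completeness, files like this can be kept out of later generate/fetch/parse rounds with Nutch's URL filters rather than fetched and parsed at all. A minimal sketch, assuming the stock urlfilter-regex plugin and conf/regex-urlfilter.txt from a 1.x install (the suffix list is only an example):

    # conf/regex-urlfilter.txt -- rules are checked top to bottom;
    # '-' rejects a matching URL, '+' accepts it, first match wins.

    # reject the binary data files seen above (regexes are
    # case-sensitive, so list the variants you actually encounter)
    -\.UNF0$
    -\.unf0$

    # ... keep the stock rules, ending with the default accept-all:
    +.

This does not remove anything already sitting in a segment; it only prevents matching URLs from being fetched in future rounds.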




Re: Parser crash with HeapSpace error

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 31 August 2011 15:12:25 Marek Bachmann wrote:
> Am 31.08.2011 12:58, schrieb Markus Jelsma:
> > UNFO?? That's interesting! Anyway, I understand you don't want to parse
> > this file. See your other thread.
> 
> Interesting? Do you know the file type? Is this something that shouldn't
> be public? Actually, I noticed it is UNF0 (ZERO!) not the letter O.

No, just interesting because I never heard of it. Could be:
http://www.file-extensions.org/unf-file-extension

> 
> > The OOM can happen for many reasons. By default Nutch takes 1G of RAM
> > (hence the 52%). You can toggle the setting via the Xmx JVM parameter.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Parser crash with HeapSpace error

Posted by Markus Jelsma <ma...@openindex.io>.
UNFO?? That's interesting! Anyway, I understand you don't want to parse this
file. See your other thread.

The OOM can happen for many reasons. By default Nutch takes 1G of RAM (hence 
the 52%). You can toggle the setting via the Xmx JVM parameter.
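
In a local run like this one (note the job_local_0001 id, i.e. LocalJobRunner), the map task executes inside the same JVM that bin/nutch starts, so raising that JVM's heap is usually the whole fix. A rough sketch, assuming the stock bin/nutch script from a 1.x install; 1500 MB is only an example for a 2 GB box:

    # bin/nutch reads NUTCH_HEAPSIZE (in MB); the built-in default is
    # 1000m, which is where the ~52% of 2 GB comes from.
    export NUTCH_HEAPSIZE=1500
    bin/nutch parse <segment_dir>    # <segment_dir> is a placeholder

On a real Hadoop cluster the mappers run in separate child JVMs and take their heap from mapred.child.java.opts (e.g. -Xmx1024m) rather than from the client's setting. Independently of the heap, http.content.limit (and ftp.content.limit for FTP URLs) in nutch-site.xml can truncate oversized downloads at fetch time, which keeps individual Content records small enough to deserialize in the first place.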


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

spellchecking in nutch solr

Posted by al...@aim.com.

Hello,
I have tried to implement an index-based spellchecker in Nutch/Solr by adding a spell field to schema.xml and making it a copy of the content field. However, this doubled the size of the data folder, and the spell field, being a copy of the content field, appears in the XML feed, which is not necessary. Is it possible to implement the spellchecker without this issue?

Thanks.
Alex.
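
For what it's worth, the setup described above typically boils down to a copyField in schema.xml like the sketch below (field and type names are illustrative, not taken from the poster's schema). Making the copy target indexed but not stored addresses both complaints to a large degree: an index-based spellchecker only needs indexed terms, and a field with stored="false" never appears in the XML response.

    <!-- source field, returned to clients as usual -->
    <field name="content" type="text"      indexed="true" stored="true"/>

    <!-- spellcheck dictionary source: indexed only, so it is never
         included in query responses ("textSpell" is assumed to be a
         lightly-analyzed type defined elsewhere in the schema) -->
    <field name="spell"   type="textSpell" indexed="true" stored="false"/>

    <copyField source="content" dest="spell"/>

The index still grows somewhat, because the terms of content are indexed a second time under spell, but the stored duplicate of every document body goes away.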
 

Re: spellchecking in nutch solr

Posted by Markus Jelsma <ma...@openindex.io>.
Wrong list

> Hello,
> 
> I have tried to implement spellchecker based on index in nutch-solr by
> adding spell field to schema.xml and making it a copy from content field.
> However, this increased data folder size twice and spell filed as a copy
> of content field appears in xml feed which is not necessary. Is it
> possible to implement spellchecker without this issue?
> 
> Thanks.
> Alex.
