Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/08/31 01:47:05 UTC
Parser crash with HeapSpace error
Hello,
I ran into a bad situation.
After crawling and parsing about 130k pages in multiple
generate/fetch/parse/update cycles, today the parser crashed with:
Error parsing:
http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0:
failed(2,200): org.apache.nutch.parse.ParseException: Unable to
successfully parse content
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
and in the hadoop.log, more verbosely:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.nutch.protocol.Content.readFields(Content.java:140)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
The strange thing is that the parser didn't stop running. It remains in
a state where it consumes 100% CPU but doesn't do anything any more.
The last lines it wrote to the hadoop.log file were:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.nutch.protocol.Content.readFields(Content.java:140)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2011-08-31 01:27:00,722 INFO mapred.JobClient - Job complete: job_local_0001
2011-08-31 01:27:08,975 INFO mapred.JobClient - Counters: 11
2011-08-31 01:27:08,975 INFO mapred.JobClient - ParserStatus
2011-08-31 01:27:08,975 INFO mapred.JobClient - failed=313
2011-08-31 01:27:08,975 INFO mapred.JobClient - success=14826
2011-08-31 01:27:08,975 INFO mapred.JobClient - FileSystemCounters
2011-08-31 01:27:08,975 INFO mapred.JobClient - FILE_BYTES_READ=2047029532
2011-08-31 01:27:08,975 INFO mapred.JobClient - FILE_BYTES_WRITTEN=819506637
2011-08-31 01:27:08,975 INFO mapred.JobClient - Map-Reduce Framework
2011-08-31 01:27:08,975 INFO mapred.JobClient - Combine output records=0
2011-08-31 01:27:08,975 INFO mapred.JobClient - Map input records=15746
2011-08-31 01:27:08,975 INFO mapred.JobClient - Spilled Records=15138
2011-08-31 01:27:08,975 INFO mapred.JobClient - Map output bytes=83235364
2011-08-31 01:27:08,975 INFO mapred.JobClient - Map input bytes=306386116
2011-08-31 01:27:08,975 INFO mapred.JobClient - Combine input records=0
2011-08-31 01:27:08,975 INFO mapred.JobClient - Map output records=15139
Now it is 01:42 and nothing has happened since that last log entry, but the
java process is still using all of the CPU.
I think there is something wrong.
It seems to me that my machine has too little memory (2 GB). But I am a
little curious that, according to top, the java process is only using 52%
of the memory.
Any suggestions?
BTW: I don't want to parse UNFO files. In fact, I have no idea what they
are! But there are many strange file types in our university network.
Handling those is another topic for me :)
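[Editor's note: one way to keep files like this from being fetched and parsed at all is a URL filter rule. A minimal sketch for conf/regex-urlfilter.txt, assuming the regex-urlfilter plugin is enabled (it is in the default plugin.includes); the extension list here is purely illustrative:]

```
# reject URLs ending in extensions we never want to fetch or parse
-(?i)\.(unf0|gz|zip|tar|exe)$
```

Rules are applied top to bottom and the first match wins, so a reject line like this must appear before the final catch-all "+." accept rule.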
Re: Parser crash with HeapSpace error
Posted by Markus Jelsma <ma...@openindex.io>.
On Wednesday 31 August 2011 15:12:25 Marek Bachmann wrote:
> On 31.08.2011 12:58, Markus Jelsma wrote:
> > UNFO?? That's interesting! Anyway, I understand you don't want to parse
> > this file. See your other thread.
>
> Interesting? Do you know the file type? Is this something that shouldn't
> be public? Actually, I noticed it is UNF0 (ZERO!), not the letter O.
No, just interesting because I had never heard of it. Could be:
http://www.file-extensions.org/unf-file-extension
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Parser crash with HeapSpace error
Posted by Markus Jelsma <ma...@openindex.io>.
UNFO?? That's interesting! Anyway, I understand you don't want to parse this
file. See your other thread.
The OOM can happen for many reasons. By default Nutch takes 1 GB of RAM (hence
the 52%). You can raise the limit via the -Xmx JVM parameter.
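[Editor's note: the heap advice above can be sketched as follows, assuming the stock bin/nutch launcher from Nutch 1.x, which reads the NUTCH_HEAPSIZE environment variable (in megabytes, default 1000) to build the -Xmx flag; the segment path is purely illustrative, and older scripts may need a direct -Xmx edit instead:]

```
# give the parser 1.5 GB of heap instead of the ~1 GB default
export NUTCH_HEAPSIZE=1500
bin/nutch parse crawl/segments/20110831014705
```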
spellchecking in nutch solr
Posted by al...@aim.com.
Hello,
I have tried to implement a spellchecker based on the index in nutch-solr by adding a spell field to schema.xml and making it a copy of the content field. However, this doubled the size of the data folder, and the spell field, as a copy of the content field, appears in the XML feed, which is not necessary. Is it possible to implement the spellchecker without this issue?
Thanks.
Alex.
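[Editor's note: the usual Solr-side fix is to index the copied field but not store it, so it stays out of the XML response. A sketch against a typical Solr 1.4/3.x schema.xml; the textSpell field type is taken from Solr's example schema and is an assumption here:]

```
<!-- index the spell copy for the spellchecker, but do not store it:
     stored="false" keeps it out of query responses and out of the
     stored data, which is the main source of the doubled folder size -->
<field name="spell" type="textSpell" indexed="true" stored="false"/>
<copyField source="content" dest="spell"/>
```

The indexed terms for the copy still take some space, but the stored duplicate of the content field goes away.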
Re: spellchecking in nutch solr
Posted by Markus Jelsma <ma...@openindex.io>.
Wrong list