You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@nutch.apache.org by Koch Martina <Ko...@huberverlag.de> on 2009/02/12 16:16:22 UTC

Fetcher2 crashes with current trunk

Hi all,

we use the current trunk of 04.02.09 with the patch for CrawlDbMerger (Nutch-683) manually applied.
We're doing an inject - generate - fetch - parse - updatedb - invertlinks cycle at depth 1.
When we use Fetcher2, we can do this cycle four times in a row without any problems. If we start the fifth cycle the Injector crashes with the following error log:

2009-02-12 00:00:05,015 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2009-02-12 00:00:05,023 INFO  jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-02-12 00:00:05,358 INFO  mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,524 INFO  mapred.JobClient - Running job: job_local_0002
2009-02-12 00:00:05,528 INFO  mapred.FileInputFormat - Total input paths to process : 2
2009-02-12 00:00:05,553 INFO  mapred.MapTask - numReduceTasks: 1
2009-02-12 00:00:05,554 INFO  mapred.MapTask - io.sort.mb = 100
2009-02-12 00:00:05,828 INFO  mapred.MapTask - data buffer = 79691776/99614720
2009-02-12 00:00:05,828 INFO  mapred.MapTask - record buffer = 262144/327680
2009-02-12 00:00:06,538 INFO  mapred.JobClient -  map 0% reduce 0%
2009-02-12 00:00:07,262 WARN  mapred.LocalJobRunner - job_local_0002
java.lang.RuntimeException: java.lang.NullPointerException
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
       at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:164)
       at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:262)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
       at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
       at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
       at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
       at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:186)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:170)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.NullPointerException
       at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:73)
       ... 13 more
2009-02-12 00:00:07,550 FATAL crawl.Injector - Injector: java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:169)
       at org.apache.nutch.crawl.Injector.run(Injector.java:190)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.nutch.crawl.Injector.main(Injector.java:180)

After that the crawldb is broken and can't be accessed e.g. with the readdb <crawldb> -stats command.
When we use for exactly the same task Fetcher instead of Fetcher2, we can do as many cycles as we like without any problems or crashes.

Besides this error we've observed that the fetch-cycle with Fetcher is about twice as fast as Fetcher2, although we use the exact same settings in the nutch-site:
generate.max.per.host  - 100
fetcher.threads.per.host - 1
fetcher.server.delay - 0
for an initial url list with 30 URLs of different hosts.

Has anybody observed similar errors or performance issues?

Kind regards,
Martina