Posted to user@nutch.apache.org by Mike Brzozowski <bi...@gmail.com> on 2007/04/27 19:51:44 UTC

Nutch crawl crashing during merge with ArrayIndexOutOfBoundsException

Hi,

I'm running an intranet crawl of a fairly large site with Nutch 0.9,
using this command line:

nice nohup bin/nutch crawl /data/crawl/urls -dir /data/crawl/intranet3
-threads 625 -depth 10
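
(For what it's worth, my understanding is that -threads just overrides
fetcher.threads.fetch. The related fetcher settings live in
conf/nutch-site.xml; the snippet below is only meant to show which knobs
I think are in play, with made-up values rather than my actual config:)

<!-- conf/nutch-site.xml: illustrative only, values are not my real settings -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>625</value>
  <description>Total fetcher threads (what crawl -threads overrides, as I understand it).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>50</value>
  <description>Maximum concurrent requests against any single host.</description>
</property>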

After fetching 300,000 or so pages in the first segment, it crashes
unceremoniously. I see this near the end of hadoop.log:

2007-04-27 04:50:55,415 INFO  fetcher.Fetcher - fetching
http://(internal url).html
2007-04-27 04:50:55,489 WARN  mapred.LocalJobRunner - job_6qek1v
java.lang.ArrayIndexOutOfBoundsException: 401
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:509)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-04-27 04:50:57,295 INFO  fetcher.Fetcher - fetch of
http://(internal url).doc failed with: Http code=406,
url=http://(internal url).doc

...and in the console:
fetch of http://(internal url).doc failed with: Http code=406,
url=http://(internal url).doc
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

(Environment: RHEL, 8 GB RAM, lots of disk space. Logs show the system
never ran out of disk space.)
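
One thing I wondered about: the failing call is MapOutputBuffer.mergeParts,
which I believe is the map-side merge of spilled map output, so would raising
the sort/merge settings in conf/hadoop-site.xml make any difference? Something
like the following is what I have in mind; the values are guesses on my part,
not a known fix:

<!-- conf/hadoop-site.xml: speculative tuning, not a confirmed fix -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
  <description>Buffer memory (MB) used to sort map output before spilling to disk.</description>
</property>

<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <description>Number of spill files merged at once.</description>
</property>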

Does anyone have any idea what's going on? How could I continue the
crawl from this point? And how can I avoid this sort of crash in the future?
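
On continuing: as far as I can tell, bin/nutch crawl is just a wrapper around
the lower-level steps, so in principle I could pick up from the existing
crawldb and segments with something like the sketch below (the paths match my
layout above, the -topN value is arbitrary, and the sequence is my best guess
rather than something I've verified):

# rough, unverified sketch of continuing by hand from /data/crawl/intranet3
bin/nutch generate /data/crawl/intranet3/crawldb /data/crawl/intranet3/segments -topN 100000
segment=`ls -d /data/crawl/intranet3/segments/* | tail -1`
bin/nutch fetch $segment -threads 100
bin/nutch updatedb /data/crawl/intranet3/crawldb $segment
bin/nutch invertlinks /data/crawl/intranet3/linkdb -dir /data/crawl/intranet3/segments
bin/nutch index /data/crawl/intranet3/indexes /data/crawl/intranet3/crawldb /data/crawl/intranet3/linkdb /data/crawl/intranet3/segments/*

(with the generate/fetch/updatedb part repeated for each additional round of
depth). Does that sound right, or would I just hit the same merge failure again?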

Thanks in advance for your help.
--Mike