Posted to user@nutch.apache.org by Mike Brzozowski <bi...@gmail.com> on 2007/04/27 19:51:44 UTC
Nutch crawl crashing during merge with ArrayIndexOutOfBoundsException
Hi,
I'm running an intranet crawl on a fairly large site with Nutch 0.9,
using this command line:
nice nohup bin/nutch crawl /data/crawl/urls -dir /data/crawl/intranet3
-threads 625 -depth 10
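(For reference, my understanding is that the one-shot crawl command above is roughly equivalent to running the step-by-step tools by hand, one generate/fetch/updatedb round per depth level. The sketch below is how I picture it; the exact loop and segment naming in 0.9 may differ, so treat the paths and the segment-selection trick as illustrative only.)

```shell
# Rough sketch of what I believe "bin/nutch crawl" does internally.
# Paths match my setup; segment selection via ls/tail is an assumption.
bin/nutch inject /data/crawl/intranet3/crawldb /data/crawl/urls
for depth in $(seq 1 10); do   # one round per -depth level
  bin/nutch generate /data/crawl/intranet3/crawldb /data/crawl/intranet3/segments
  segment=$(ls -d /data/crawl/intranet3/segments/* | tail -1)
  bin/nutch fetch "$segment" -threads 625
  bin/nutch updatedb /data/crawl/intranet3/crawldb "$segment"
done
```

If that's right, running the steps individually might at least let me restart from the last good segment instead of redoing the whole crawl.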
After fetching 300,000 or so pages in the first segment, it crashes
unceremoniously. I see this near the end of hadoop.log:
2007-04-27 04:50:55,415 INFO fetcher.Fetcher - fetching
http://(internal url).html
2007-04-27 04:50:55,489 WARN mapred.LocalJobRunner - job_6qek1v
java.lang.ArrayIndexOutOfBoundsException: 401
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:509)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-04-27 04:50:57,295 INFO fetcher.Fetcher - fetch of
http://(internal url).doc failed with: Http code=406,
url=http://(internal url).doc
...and in the console:
fetch of http://(internal url).doc failed with: Http code=406,
url=http://(internal url).doc
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
(Environment: RHEL, 8 GB RAM, lots of disk space. Logs show the system
never ran out of disk space.)
Does anyone have any idea what's going on? How could I continue from
this point, and how can I avoid this sort of crash in the future?
Thanks in advance for your help.
--Mike