Posted to user@nutch.apache.org by Vinci <vi...@polyu.edu.hk> on 2008/01/30 08:36:59 UTC

Dedup: Job Failed and crawl stopped at depth 1

I ran the 0.9 crawler with the parameters -depth 2 -threads 1, and I got a "Job
failed" message for a dynamic-content site:
Dedup: starting
Dedup: adding indexes in: /var/crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
In the hadoop.log:
2008-01-30 15:08:12,402 INFO  indexer.Indexer - Optimizing index.
2008-01-30 15:08:12,601 INFO  indexer.Indexer - Indexer: done
2008-01-30 15:08:12,602 INFO  indexer.DeleteDuplicates - Dedup: starting
2008-01-30 15:08:12,622 INFO  indexer.DeleteDuplicates - Dedup: adding
indexes in: /var/crawl/indexes
2008-01-30 15:08:12,882 WARN  mapred.LocalJobRunner - job_b5nenb
java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
	at
org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
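
The ArrayIndexOutOfBoundsException: -1 comes from the Lucene reader that Dedup
opens on the indexes under /var/crawl/indexes (the stack shows a MultiReader),
so an empty or partially written part-index is worth ruling out. A minimal
diagnostic sketch, assuming Lucene 2.x (as bundled with Nutch 0.9) and the
local path from the log above:

import java.io.File;
import org.apache.lucene.index.IndexReader;

// Open each part-index that Dedup reads and print its document counts.
// Assumption: Lucene 2.x API (IndexReader.open(String)) and the path
// /var/crawl/indexes taken from the log above.
public class CheckIndexes {
  public static void main(String[] args) throws Exception {
    File indexesDir = new File("/var/crawl/indexes");
    File[] parts = indexesDir.listFiles();
    if (parts == null) {
      System.out.println("no part-indexes under " + indexesDir);
      return;
    }
    for (File part : parts) {
      if (!part.isDirectory()) continue;
      IndexReader reader = IndexReader.open(part.getPath());
      try {
        System.out.println(part.getName() + ": maxDoc=" + reader.maxDoc()
            + " numDocs=" + reader.numDocs());
      } finally {
        reader.close();
      }
    }
  }
}

A part-index with maxDoc=0, or a directory with no Lucene files at all, would
be one plausible trigger for the exception above.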

Also, the crawl stopped at depth=1:
2008-01-30 15:08:10,083 WARN  crawl.Generator - Generator: 0 records
selected for fetching, exiting ...
2008-01-30 15:08:10,084 INFO  crawl.Crawl - Stopping at depth=1 - no more
URLs to fetch.
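
Regarding "0 records selected for fetching" at depth 2 on a dynamic-content
site: if I remember the 0.9 defaults correctly, the one-step crawl tool applies
conf/crawl-urlfilter.txt, whose stock rules skip URLs containing query
characters (the "-[?*!@=]" line) and only accept hosts matching MY.DOMAIN.NAME,
so the outlinks of a dynamic site can easily all be filtered away. A rough way
to test a sample outlink against the configured filter plugins (a hedged
sketch; the URL below is a made-up placeholder, and a standalone run picks up
regex-urlfilter.txt rather than the crawl tool's crawl-urlfilter.txt, which may
need to be checked by hand as well):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.util.NutchConfiguration;

// Run one URL through the configured URL filter plugins; a null result
// means the URL was rejected and would never reach the fetch list.
// Assumption: the placeholder URL below stands in for a real outlink
// from the crawled site.
public class FilterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    URLFilters filters = new URLFilters(conf);
    String url = args.length > 0 ? args[0]
        : "http://www.example.com/page.php?id=1";
    System.out.println(url + " -> " + filters.filter(url));
  }
}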

I checked the index in Luke and it works, but it only contains the pages for
the URLs in the seed list. Searching in Luke seems to work fine, yet the Nutch
searcher returns nothing to me... Did I miss some setting, or is this a
problem caused by the aborted indexing?
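
Since Luke can see the documents but the Nutch searcher cannot, it may help to
query through the search API directly and see whether it finds the index at
all. A rough sketch, assuming the 0.9 search API (NutchBean, Query.parse)
behaves as sketched and that searcher.dir should point at the crawl directory
holding index/ or indexes/ (/var/crawl here, taken from the log above):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

// Query through the same code path the web app uses, bypassing the JSP
// layer. Assumptions: searcher.dir points at the crawl directory
// (/var/crawl from the log above), and the query term is a placeholder
// for a word known to be in the fetched page.
public class SearchTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    conf.set("searcher.dir", "/var/crawl");
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse(args.length > 0 ? args[0] : "test", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("total hits: " + hits.getTotal());
  }
}

If this returns hits while the web app does not, the searcher.dir setting seen
by the web app (nutch-site.xml in its classpath, or its working directory) is
the usual suspect.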


Re: Dedup: Job Failed and crawl stopped at depth 1

Posted by pranesh <pr...@hcl.in>.
While crawling I got this error; how can I solve it?

Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20090403101008
Generator: filtering: false
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20090403101008
Fetcher: threads: 10
fetching http://www.hcltech.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20090403101008]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20090403101008
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20090403101008
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
	at org.apache.nutch.indexer.Indexer.index(Indexer.java:273)
	at test.Crawl.doCrawl(Crawl.java:143)
	at test.Crawl.main(Crawl.java:50)
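
With the local job runner, the "Job failed!" from JobClient.runJob hides the
real cause; as in the first post, the underlying exception for the Indexer step
should be logged to logs/hadoop.log rather than the console. For comparison, a
hypothetical minimal driver that delegates to the stock crawl tool instead of a
hand-rolled test.Crawl (argument layout follows the 0.9 "bin/nutch crawl"
usage: crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]):

import org.apache.nutch.crawl.Crawl;

// Hypothetical alternative to a custom test.Crawl driver: call the stock
// crawl tool, which runs inject, generate/fetch/updatedb, invertlinks,
// index and dedup in the same order as the log above.
public class RunCrawl {
  public static void main(String[] args) throws Exception {
    Crawl.main(new String[] {
        "urls", "-dir", "crawl", "-depth", "2", "-topN", "10", "-threads", "10"
    });
  }
}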

