Posted to dev@nutch.apache.org by kraman <ki...@gmail.com> on 2010/01/20 20:10:23 UTC

Tried to run Crawl with depth of only 2 and getting IOException

kirthi10@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
crawl started in: tinycrawl
rootUrlDir = url
threads = 10
depth = 2
Injector: starting
Injector: crawlDb: tinycrawl/crawldb
Injector: urlDir: url
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: tinycrawl/segments/20100120130316
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: tinycrawl/segments/20100120130316
Fetcher: threads: 10
fetching http://www.mywebsite.us/
fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: tinycrawl/crawldb
CrawlDb update: segments: [tinycrawl/segments/20100120130316]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: tinycrawl/segments/20100120130323
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: tinycrawl/segments/20100120130323
Fetcher: threads: 10
fetching http://www.mywebsite.us/
fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: tinycrawl/crawldb
CrawlDb update: segments: [tinycrawl/segments/20100120130323]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: tinycrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: tinycrawl/segments/20100120130323
LinkDb: adding segment: tinycrawl/segments/20100120130316
LinkDb: done
Indexer: starting
Indexer: linkdb: tinycrawl/linkdb
Indexer: adding segment: tinycrawl/segments/20100120130323
Indexer: adding segment: tinycrawl/segments/20100120130316
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: tinycrawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

The log file gives:
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)


Re: Tried to run Crawl with depth of only 2 and getting IOException

Posted by kraman <ki...@gmail.com>.
Yes, the agent name was empty. It works now.

Thanks much.


Nutch Newbie wrote:
> 
> On Wed, Jan 20, 2010 at 7:10 PM, kraman <ki...@gmail.com> wrote:
>> [...]
>> fetching http://www.mywebsite.us/
>> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!
> 
> You need to fix the Nutch config file as per the README.
> 
>> [...]



Re: Tried to run Crawl with depth of only 2 and getting IOException

Posted by Nutch Newbie <nu...@gmail.com>.
On Wed, Jan 20, 2010 at 7:10 PM, kraman <ki...@gmail.com> wrote:
>
> kirthi10@cerebrum [~/www/nutch]# ./bin/nutch crawl url -dir tinycrawl -depth 2
> [...]
> fetching http://www.mywebsite.us/
> fetch of http://www.mywebsite.us/ failed with: java.lang.RuntimeException: Agent name not configured!

You need to fix the Nutch config file as per the README.
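
For reference, a minimal sketch of the fix, assuming the stock layout: set
http.agent.name in conf/nutch-site.xml, which overrides conf/nutch-default.xml.
The value "MyTestCrawler" is only a placeholder; use a name that identifies
your crawler.

<?xml version="1.0"?>
<configuration>
  <!-- The Fetcher rejects every request with "Agent name not configured!"
       while this property is empty, so it must be a non-empty string. -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>

Reading your log, the Dedup IOException looks like a downstream symptom:
every fetch failed, so nothing was indexed, and DeleteDuplicates then trips
over the empty indexes in tinycrawl/indexes.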




> [...]