Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2006/09/01 00:55:00 UTC
nutch protocol-file
Hello,
I wanted to index my files so I followed the instructions at
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
I get: Exception in thread "main" java.io.IOException: Job failed!
and looking at the log file:
2006-09-01 01:49:43,166 WARN mapred.LocalJobRunner - job_p2pnnk
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
    at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:84)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
    at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
    at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
My plugin.includes looks like this:
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
</property>
How can I add a scoring plugin? By default we don't have to add one,
so I don't know where to start.
Any ideas appreciated,
Best Regards,
-C.B.
Re: nutch protocol-file
Posted by Cam Bazz <ca...@gmail.com>.
Hello,
I almost got it to work: it starts to crawl, but after some time it
finishes. I have something like 200000 small HTML pages (2-3k each) under
/nutch/data.
I am only getting about 100 or 200 pages, then it stops.
The command I am giving is:
# bin/nutch crawl urls -dir crawl -threads 10 -depth 2
The urls directory contains a urls file, which contains:
file:///nutch/data/
Any ideas?
-Thanks a bunch.
When I look at the logs, I see:
2006-09-03 18:27:04,231 INFO indexer.Indexer - Indexing [file:///nutch/data] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@6aba4211 (null)
2006-09-03 18:27:04,381 INFO indexer.Indexer - maxFieldLength 10000 reached, ignoring following tokens
2006-09-03 18:27:04,454 INFO indexer.Indexer - Optimizing index.
2006-09-03 18:27:04,529 INFO indexer.Indexer - merging segments _0 (1 docs) into _1 (1 docs)
2006-09-03 18:27:05,442 INFO indexer.Indexer - Indexer: done
2006-09-03 18:27:05,444 INFO indexer.DeleteDuplicates - Dedup: starting
2006-09-03 18:27:05,459 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2006-09-03 18:27:08,042 INFO indexer.DeleteDuplicates - Dedup: done
2006-09-03 18:27:08,043 INFO indexer.IndexMerger - Adding crawl/indexes/part-00000
2006-09-03 18:27:08,080 INFO crawl.Crawl - crawl finished: crawl
2006-09-03 18:28:07,452 INFO crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2006-09-03 18:28:09,387 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2006-09-03 18:28:09,387 INFO crawl.CrawlDbReader - TOTAL urls: 101
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - avg score: 1.008
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - max score: 1.009
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - min score: 1.0
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - retry 0: 1
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - retry 1: 100
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - status 1 (DB_unfetched): 100
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - status 2 (DB_fetched): 1
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - CrawlDb statistics: done
Thomas Delnoij wrote:
> Just add scoring-opic to your plugin.includes in nutch-site.xml.
>
> Rgrds, Thomas
Re: nutch protocol-file
Posted by Thomas Delnoij <di...@gmail.com>.
Just add scoring-opic to your plugin.includes in nutch-site.xml.
Rgrds, Thomas
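For illustration, the change suggested above might look roughly like the fragment below. This is only a sketch: it takes the plugin.includes value from the original question and adds scoring-opic to it. The position of scoring-opic within the list is an arbitrary choice made here, and the exact location of nutch-site.xml in your install (typically the conf directory) is an assumption.

```xml
<!-- Sketch only: the poster's plugin.includes value with scoring-opic added.
     Assumed to go in nutch-site.xml; the placement of scoring-opic
     within the list is arbitrary. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```

With a scoring plugin listed, the "No scoring plugins" exception raised from ScoringFilters during the inject step should no longer occur.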