Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2006/09/01 00:55:00 UTC

nutch protocol-file

Hello,

I wanted to index my files so I followed the instructions at

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I get: Exception in thread "main" java.io.IOException: Job failed!

and looking at the log file:

2006-09-01 01:49:43,166 WARN  mapred.LocalJobRunner - job_p2pnnk
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
        at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:84)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)


My plugin.includes looks like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
</property>

How can I add a scoring plugin? With the default configuration we don't 
have to add one, so I don't know where to start.

Any ideas appreciated,

Best Regards,
-C.B.

Re: nutch protocol-file

Posted by Cam Bazz <ca...@gmail.com>.

Hello,

I almost got it to work: the crawl starts, but it finishes after a short 
time. I have about 200,000 small HTML pages (2-3 KB each) under 
/nutch/data, but I only get 100 or 200 pages before it stops.

The command I am giving is:

# bin/nutch crawl urls -dir crawl -threads 10 -depth 2

The urls directory contains a file named urls, which contains:

file:///nutch/data/

Any ideas?

-Thanks a bunch.

When I look at the logs, I see:

2006-09-03 18:27:04,231 INFO  indexer.Indexer -  Indexing [file:///nutch/data] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@6aba4211 (null)
2006-09-03 18:27:04,381 INFO  indexer.Indexer - maxFieldLength 10000 reached, ignoring following tokens
2006-09-03 18:27:04,454 INFO  indexer.Indexer - Optimizing index.
2006-09-03 18:27:04,529 INFO  indexer.Indexer - merging segments _0 (1 docs) into _1 (1 docs)
2006-09-03 18:27:05,442 INFO  indexer.Indexer - Indexer: done
2006-09-03 18:27:05,444 INFO  indexer.DeleteDuplicates - Dedup: starting
2006-09-03 18:27:05,459 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2006-09-03 18:27:08,042 INFO  indexer.DeleteDuplicates - Dedup: done
2006-09-03 18:27:08,043 INFO  indexer.IndexMerger - Adding crawl/indexes/part-00000
2006-09-03 18:27:08,080 INFO  crawl.Crawl - crawl finished: crawl
2006-09-03 18:28:07,452 INFO  crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2006-09-03 18:28:09,387 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2006-09-03 18:28:09,387 INFO  crawl.CrawlDbReader - TOTAL urls: 101
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - avg score:  1.008
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - max score:  1.009
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - min score:  1.0
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - retry 0:    1
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - retry 1:    100
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - status 1 (DB_unfetched):    100
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - status 2 (DB_fetched):      1
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - CrawlDb statistics: done



Thomas Delnoij wrote:
> Just add scoring-opic to your plugin.includes in nutch-site.xml.
>
> Rgrds, Thomas

Re: nutch protocol-file

Posted by Thomas Delnoij <di...@gmail.com>.

Just add scoring-opic to your plugin.includes in nutch-site.xml.

Rgrds, Thomas
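
For illustration, the suggestion above could look like the following in nutch-site.xml. This is only a sketch: it assumes the plugin.includes value from the original question is kept as it is and that scoring-opic is the only addition. Its position at the end of the list is an arbitrary choice, since the value is a regular expression of alternatives.

```xml
<!-- Sketch only: the value quoted in the original question, with scoring-opic appended. -->
<!-- Assumption: every other plugin in the list stays unchanged. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```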
