Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/07/02 21:13:44 UTC

Re: IOException using feed plugin - NUTCH-444

I hope someone can suggest a way to proceed with this RuntimeException I'm getting.

java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
    at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:87)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

As far as I can tell I'm using NUTCH-444 out-of-the-box since I have a nightly build.

--Kai M.


----- Original Message ----
From: Kai_testing Middleton <ka...@yahoo.com>
To: nutch-user@lucene.apache.org
Sent: Friday, June 29, 2007 5:24:57 PM
Subject: Re: IOException using feed plugin - NUTCH-444

The exception is:
   java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!

I note that my nutch-site.xml does contain a reference to scoring-opic, so I wonder why it would give that exception.
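
For example, a quick way to double-check the file (assuming the standard conf/ layout):

$ grep -n "scoring-opic" conf/nutch-site.xml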

--Kai M.

----- Original Message ----
From: Kai_testing Middleton <ka...@yahoo.com>
To: nutch-user@lucene.apache.org
Sent: Friday, June 29, 2007 11:36:11 AM
Subject: Re: IOException using feed plugin - NUTCH-444

Here is the more detailed stack trace:
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
    at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:87)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

In fact, here is the complete hadoop.log for the command I attempted:
nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log

2007-06-29 11:28:58,785 INFO  crawl.Crawl - crawl started in: /usr/tmp/lee_apollo
2007-06-29 11:28:58,788 INFO  crawl.Crawl - rootUrlDir = /usr/tmp/lee_urls.txt
2007-06-29 11:28:58,789 INFO  crawl.Crawl - threads = 10
2007-06-29 11:28:58,790 INFO  crawl.Crawl - depth = 2
2007-06-29 11:28:58,925 INFO  crawl.Injector - Injector: starting
2007-06-29 11:28:58,925 INFO  crawl.Injector - Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
2007-06-29 11:28:58,925 INFO  crawl.Injector - Injector: urlDir: /usr/tmp/lee_urls.txt
2007-06-29 11:28:58,926 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2007-06-29 11:28:59,936 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch-2007-06-27_06-52-44/plugins
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository - Registered Plugins:
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository -     CyberNeko HTML Parser (lib-nekohtml)
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository -     Site Query Filter (query-site)
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository -     Basic URL Normalizer (urlnormalizer-basic)
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository -     Html Parse Plug-in (parse-html)
2007-06-29 11:29:00,253 INFO  plugin.PluginRepository -     Pass-through URL Normalizer (urlnormalizer-pass)
2007-06-29 11:29:00,260 INFO  plugin.PluginRepository -     Regex URL Filter Framework (lib-regex-filter)
2007-06-29 11:29:00,260 INFO  plugin.PluginRepository -     Feed Parse/Index/Query Plug-in (feed)
2007-06-29 11:29:00,260 INFO  plugin.PluginRepository -     Basic Indexing Filter (index-basic)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Basic Summarizer Plug-in (summary-basic)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Text Parse Plug-in (parse-text)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     JavaScript Parser (parse-js)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Basic Query Filter (query-basic)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Regex URL Filter (urlfilter-regex)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     HTTP Framework (lib-http)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     XML Libraries (lib-xml)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     URL Query Filter (query-url)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Http Protocol Plug-in (protocol-http)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository - Registered Extension-Points:
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2007-06-29 11:29:00,261 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2007-06-29 11:29:00,262 INFO  plugin.PluginRepository -     Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2007-06-29 11:29:00,367 WARN  mapred.LocalJobRunner - job_w7bra3
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
    at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:87)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)


----- Original Message ----
From: Doğacan Güney <do...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Friday, June 29, 2007 12:45:36 AM
Subject: Re: IOException using feed plugin - NUTCH-444

Hi,

On 6/29/07, Kai_testing Middleton <ka...@yahoo.com> wrote:
> I have tried the NUTCH-444 "feed" plugin to enable spidering of RSS feeds:
> /nutch-2007-06-27_06-52-44/plugins/feed
> (that's a recent nightly build of nutch).
>
> When I attempt a crawl I get an IOException:
>
> $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2
> crawl started in: /usr/tmp/lee_apollo
> rootUrlDir = /usr/tmp/lee_urls.txt
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
> Injector: urlDir: /usr/tmp/lee_urls.txt
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>         3.14 real         1.92 user         0.30 sys

This stack trace is not useful; it is only the JobTracker (or
LocalJobRunner) reporting back that the job has failed. If you are
running in a distributed environment, check your tasktracker logs; if
you are running locally, check logs/hadoop.log.
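
For example, something along these lines, run from the Nutch install directory (Nutch writes the local log to logs/hadoop.log):

$ tail -n 100 logs/hadoop.log           # last lines of the most recent run
$ grep -n "Exception" logs/hadoop.log   # jump straight to any stack traces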

>
> The seed URL is:
> http://www.mt-olympus.com/apollo/feed/
>
> I enabled the feed plugin via this property in nutch-site.xml:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opi
> c|urlnormalizer-(pass|regex|basic)|feed</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
>
> As a sanity check, when I take out "feed" from <value> above, it no longer throws an exception (but it also doesn't fetch anything):
>
> $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log
> crawl started in: /usr/tmp/lee_apollo
> rootUrlDir = /usr/tmp/lee_urls.txt
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
> Injector: urlDir: /usr/tmp/lee_urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: /usr/tmp/lee_apollo/segments/20070628155854
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: /usr/tmp/lee_apollo/segments/20070628155854
> Fetcher: threads: 10
> fetching http://www.mt-olympus.com/apollo/feed/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: /usr/tmp/lee_apollo/crawldb
> CrawlDb update: segments: [/usr/tmp/lee_apollo/segments/20070628155854]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: /usr/tmp/lee_apollo/segments/20070628155907
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: /usr/tmp/lee_apollo/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: /usr/tmp/lee_apollo/linkdb
> Indexer: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
>  Indexing [http://www.mt-olympus.com/apollo/feed/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@114b82b (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) into _0 (1 docs)
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: now checkpoint "segments_2" [isCommit = true]
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.fnm": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.fdx": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.fdt": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.tii": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.tis": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.frq": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.prx": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   IncRef "_0.nrm": pre-incr count is 0
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: deleteCommits: now remove commit "segments_1"
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36:   DecRef "segments_1": pre-decr count is 1
> org.apache.lucene.index.IndexFileDeleter@1f99eea Thread-36: delete "segments_1"
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: /usr/tmp/lee_apollo/indexes
> Dedup: done
> merging indexes to: /usr/tmp/lee_apollo/index
> Adding /usr/tmp/lee_apollo/indexes/part-00000
> done merging
> crawl finished: /usr/tmp/lee_apollo
>        30.45 real         8.40 user         2.26 sys
>
>
> ----- Original Message ----
> From: Doğacan Güney <do...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Wednesday, June 27, 2007 10:59:52 PM
> Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility
>
> On 6/28/07, Kai_testing Middleton <ka...@yahoo.com> wrote:
> > I am choosing to use NUTCH-444 for my RSS functionality.  Doğacan commented on how to do this; he wrote:
> >     ...if you need the functionality of NUTCH-444, I would suggest
> >     trying a nightly version of Nutch. Because NUTCH-444 by itself is not
> >     enough. You also need two patches from NUTCH-443 and probably
> >     NUTCH-504.
> >
> > I have a couple newbie questions about the mechanics of installing this.
> >
> > Prefatory comments: I have already installed another patch (for NUTCH-505) so I think I already have a nightly build (I'm guessing trunk==nightly?).  These were the steps I did:
> > $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> > $ cd nutch
> > $ wget https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
> > $ patch -p0 < NUTCH-505_draft_v2.patch
> > $ ant clean && ant
> >
> > ---
> >
> > Now I need NUTCH-443, NUTCH-504, and NUTCH-444. Here's my guess:
> >
> > $ cd nutch
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12359953/NUTCH_443_reopened_v3.patch
> > $ patch -p0 < NUTCH_443_reopened_v3.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350644/parse-map-core-draft-v1.patch
> > $ patch -p0 < parse-map-core-draft-v1.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350634/parse-map-core-untested.patch
> > $ patch -p0 < parse-map-core-untested.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12357183/redirect_and_index.patch
> > $ patch -p0 < redirect_and_index.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12357300/redirect_and_index_v2.patch
> > $ patch -p0 < redirect_and_index_v2.patch
> >
> > I'm really guessing on the above ... continuing:
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12360361/NUTCH-504_v2.patch
> > $ patch -p0 < NUTCH-504_v2.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12360348/parse_in_fetchers.patch
> > $ patch -p0 < parse_in_fetchers.patch
> >
> > ... that felt like less of a guess, but now:
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
> > $ patch -p0 < NUTCH-444.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
> > $ tar xjvf parse-feed.tar.bz2
> >
> > what do I do with this newly created parse-feed directory?
> >
> > so then I would do:
> >
> > $ ant clean && ant
> >
> >
> > Wait a minute:  do I have this whole thing wrong?  Maybe Doğacan means that the nightly builds ALREADY contain NUTCH-443 and NUTCH-504 so that I would do this:
> >
> >
> > $ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
> > $ tar xvzf nutch-2007-06-27_06-52-44.tar.gz
> > $ cd nutch-2007-06-27_06-52-44
> >
> > then this business:
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
> > $ patch -p0 < NUTCH-444.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
> > $ tar xjvf parse-feed.tar.bz2
> >
> > what do I do with this newly created parse-feed directory?
> >
> > so then I would do:
> >
> > $ ant clean && ant
> >
> > I guess this is why "release engineer" is a job in and of itself!
> > Please advise.
>
> If you downloaded the nightly build of 27th June, it already contains the
> feed plugin (the plugin is called "feed", not "parse-feed"; parse-feed was
> an older plugin and was never committed. In my earlier comment, I meant to
> write parse-rss but wrote parse-feed). So you don't have to apply any
> patches; just download a recent nightly build and you are good to go :).
>
> You can also check out trunk from svn and it will work too.
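>
> That is, roughly (repeating the checkout-and-build commands from earlier in the thread):
>
> $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> $ cd nutch
> $ ant clean && ant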
>
> >
> > --Kai Middleton
> >
> > ----- Original Message ----
> > From: Doğacan Güney <do...@gmail.com>
> > To: nutch-user@lucene.apache.org
> > Sent: Friday, June 22, 2007 1:39:12 AM
> > Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility
> >
> > On 6/21/07, Kai_testing Middleton <ka...@yahoo.com> wrote:
> > > I am a new nutch user and the ability to crawl RSS feeds is critical to my mission.  Do I understand from this (lengthy) discussion that in order to get the new RSS functionality I need to either a) download one of the nightly builds and run ant, or b) download and apply a patch (NUTCH-444.patch, I gather)?
> >
> > Nutch 0.9 can already parse RSS feeds (via the parse-feed plugin).
> > However, if you need the functionality of NUTCH-444, I would suggest
> > trying a nightly version of Nutch, because NUTCH-444 by itself is not
> > enough: you also need two patches from NUTCH-443 and probably
> > NUTCH-504. If you are worried about stability, nightlies of Nutch are
> > generally pretty stable.
> >
> > --
> > Doğacan Güney

Re: IOException using feed plugin - NUTCH-444

Posted by Sami Siren <ss...@gmail.com>.
Kai_testing Middleton wrote:
> I hope someone can suggest a way to proceed with this RuntimeException I'm getting.

Recheck that you have the scoring plugin (scoring-opic) enabled properly
in your Nutch configuration. In the plugin.includes snippet you gave, the
value is broken across a line wrap ("scoring-opi" / "c"), which would keep
scoring-opic from ever matching, and the PluginRepository log you showed
does not list any scoring plugin among the registered plugins.
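
For reference, a corrected snippet using the same value you posted, with scoring-opic kept intact on a single line (stray whitespace inside the value can stop the plugin name from matching):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
</property>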

--
 Sami Siren

