Posted to dev@nutch.apache.org by Sagar Vibhute <sa...@gmail.com> on 2007/10/05 15:09:27 UTC
First Plugin
Hi,
I have recently downloaded and used nutch and I need to develop a few
plugins for my work. I took the plugin example given on the wiki,
http://wiki.apache.org/nutch/WritingPluginExample-0%2e9
and followed the instructions as given there. Now when I start crawling
again it aborts and throws the following exception:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
I could crawl successfully before I added this plugin.
Please give any insights you can to get this fixed.
Thank You!
- Sagar
Re: First Plugin
Posted by Sagar Vibhute <sa...@gmail.com>.
I started a crawl after adding a plugin given on the wiki (
http://wiki.apache.org/nutch/WritingPluginExample-0%2e9)
When I crawled, it stopped after throwing an exception. Here is what the
hadoop.log file says:
----------------------------------------------------------------------------------------------------------------
2007-10-07 16:42:25,407 INFO crawl.Crawl - crawl started in:
/home/sagar/nutch_crawl
2007-10-07 16:42:25,422 INFO crawl.Crawl - rootUrlDir =
/home/sagar/urls/iiitb
2007-10-07 16:42:25,422 INFO crawl.Crawl - threads = 10
2007-10-07 16:42:25,422 INFO crawl.Crawl - depth = 3
2007-10-07 16:42:25,608 INFO crawl.Injector - Injector: starting
2007-10-07 16:42:25,608 INFO crawl.Injector - Injector: crawlDb:
/home/sagar/nutch_crawl/crawldb
2007-10-07 16:42:25,608 INFO crawl.Injector - Injector: urlDir:
/home/sagar/urls/iiitb
2007-10-07 16:42:25,626 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2007-10-07 16:42:27,207 INFO plugin.PluginRepository - Plugins: looking in:
/home/sagar/nutch-0.9/src/plugin
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Registered Plugins:
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Basic URL
Normalizer (urlnormalizer-basic)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-10-07 16:42:27,620 INFO plugin.PluginRepository - Regex URL
Normalizer (urlnormalizer-regex)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - OPIC Scoring
Plug-in (scoring-opic)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - JavaScript
Parser (parse-js)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Protocol (
org.apache.nutch.protocol.Protocol)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Analysis (
org.apache.nutch.analysis.NutchAnalyzer)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (
org.apache.nutch.clustering.OnlineClusterer)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Scoring (
org.apache.nutch.scoring.ScoringFilter)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-10-07 16:42:27,621 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-10-07 16:42:27,625 WARN net.URLNormalizers - URLNormalizers:PluginRuntimeException when initializing url normalizer plugin urlnormalizer-basic instance in getURLNormalizers function: attempting to continue instantiating plugins
2007-10-07 16:42:27,628 WARN net.URLNormalizers - URLNormalizers:PluginRuntimeException when initializing url normalizer plugin urlnormalizer-regex instance in getURLNormalizers function: attempting to continue instantiating plugins
2007-10-07 16:42:27,632 WARN net.URLNormalizers - URLNormalizers:PluginRuntimeException when initializing url normalizer plugin urlnormalizer-pass instance in getURLNormalizers function: attempting to continue instantiating plugins
2007-10-07 16:42:27,667 WARN mapred.LocalJobRunner - job_l8t6s1
java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.urlfilter.regex.RegexURLFilter
at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:74)
at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:60)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.urlfilter.regex.RegexURLFilter
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:54)
... 8 more
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.urlfilter.regex.RegexURLFilter
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)
... 9 more
----------------------------------------------------------------------------------------------------------------
How do I get past this exception? I checked the Nutch sources, and there
is no urlfilter package under src/java/org/apache/nutch.
Please advise...
Re: First Plugin
Posted by Doğacan Güney <do...@gmail.com>.
On 10/5/07, Sagar Vibhute <sa...@gmail.com> wrote:
> Well, the initial value for plugin.includes was
>
> protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> The example on the site states I put in the following
>
> nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)
You have to include a scoring filter (scoring-opic in the original
list) or some of the jobs won't work. If you want to enable your
plugin, just append it (with |<your plugin's name>) to the end of the
original list.
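To make the difference between the two lists concrete, here is a minimal sketch in plain Java (no Nutch classes). It assumes that a plugin is activated when its id matches the whole plugin.includes regular expression, which is only my understanding of how the plugin list is filtered. The id my-plugin is hypothetical and stands in for your plugin's name; the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

// Sketch only: mimics matching plugin ids against plugin.includes with
// java.util.regex. It does not call into Nutch.
class PluginIncludesAppend {
    // Original value, as quoted in this thread.
    static final String ORIGINAL =
        "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"
        + "|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // Value from the wiki example, as quoted in this thread.
    static final String WIKI =
        "nutch-extensionpoints|protocol-http|urlfilter-regex"
        + "|parse-(text|html)|index-basic|query-(basic|site|url)";

    // Assumption: an id is included when the entire id matches the regex.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    // Append a plugin id to an existing list instead of replacing the list.
    static String append(String includes, String pluginId) {
        return includes + "|" + pluginId;
    }

    public static void main(String[] args) {
        // "my-plugin" is a hypothetical plugin id used for illustration.
        String appended = append(ORIGINAL, "my-plugin");
        System.out.println(isIncluded(WIKI, "scoring-opic"));     // false
        System.out.println(isIncluded(appended, "scoring-opic")); // true
        System.out.println(isIncluded(appended, "my-plugin"));    // true
    }
}
```

Under that assumption, the wiki's shorter list drops scoring-opic, while appending to the original list keeps every plugin that was already active and adds the new one.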
>
> So I did the necessary. By the way I reverted to the original value and
> crawled. It is throwing a different set of exceptions now ... :-)
>
> - Sagar
>
--
Doğacan Güney
Re: First Plugin
Posted by Sagar Vibhute <sa...@gmail.com>.
Well, the initial value for plugin.includes was
protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
The example on the wiki says to put in the following
nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)
So I did as instructed. By the way, I reverted to the original value and
crawled again. It is throwing a different set of exceptions now ... :-)
- Sagar
Re: First Plugin
Posted by Doğacan Güney <do...@gmail.com>.
On 10/5/07, Sagar Vibhute <sa...@gmail.com> wrote:
> I am really sorry for that. Will take some time to get used to this one :-)
>
> I am totally new to this. Your insights please.
OK, it seems you have removed the scoring-opic plugin (and any other
scoring plugins you may have) by accident. You should check the
plugin.includes option in your nutch-site.xml; there is probably
something wrong with it. Perhaps you put a line break in the value?
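As a rough illustration of why a line break could matter, here is a small plain-Java sketch. It assumes the plugin.includes value is compiled as a regular expression exactly as written in the file (no trimming or whitespace cleanup) and that each plugin id must match it in full; both points are assumptions on my part, and the class name is made up for the example.

```java
import java.util.regex.Pattern;

// Sketch only: shows how a literal newline inside the regex value
// changes which plugin ids match. It does not call into Nutch.
class PluginIncludesNewline {
    // Assumption: an id is included when the entire id matches the regex.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Tail of the original list, written on a single line.
        String oneLine =
            "summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";
        // Same tail, but with a line break right before scoring-opic.
        String wrapped =
            "summary-basic|\nscoring-opic|urlnormalizer-(pass|regex|basic)";

        System.out.println(isIncluded(oneLine, "scoring-opic")); // true
        System.out.println(isIncluded(wrapped, "scoring-opic")); // false
    }
}
```

In the wrapped value the alternative becomes "newline followed by scoring-opic", so the plain id no longer matches and the scoring plugin would silently drop out of the list.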
>
> - Sagar
>
--
Doğacan Güney
Re: First Plugin
Posted by Sagar Vibhute <sa...@gmail.com>.
I am really sorry for that. Will take some time to get used to this one :-)
This is the log for the last nutch crawl I tried to execute:
-----------------------------------------------------------------------------------------------------------------
2007-10-05 12:16:33,416 INFO crawl.Crawl - crawl started in:
/home/sagar/nutch_crawl
2007-10-05 12:16:33,417 INFO crawl.Crawl - rootUrlDir =
/home/sagar/urls/iiitb
2007-10-05 12:16:33,417 INFO crawl.Crawl - threads = 10
2007-10-05 12:16:33,417 INFO crawl.Crawl - depth = 3
2007-10-05 12:16:33,522 INFO crawl.Injector - Injector: starting
2007-10-05 12:16:33,523 INFO crawl.Injector - Injector: crawlDb:
/home/sagar/nutch_crawl/crawldb
2007-10-05 12:16:33,523 INFO crawl.Injector - Injector: urlDir:
/home/sagar/urls/iiitb
2007-10-05 12:16:33,524 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2007-10-05 12:16:34,116 INFO plugin.PluginRepository - Plugins: looking in:
/home/sagar/nutch-0.9/plugins
2007-10-05 12:16:34,277 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Registered Plugins:
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Basic Query
Filter (query-basic)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - CyberNeko HTML
Parser (lib-nekohtml)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Basic Indexing
Filter (index-basic)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Html Parse
Plug-in (parse-html)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Site Query
Filter (query-site)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - URL Query Filter
(query-url)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Text Parse
Plug-in (parse-text)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Regex URL Filter
(urlfilter-regex)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Registered
Extension-Points:
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Protocol (
org.apache.nutch.protocol.Protocol)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Analysis (
org.apache.nutch.analysis.NutchAnalyzer)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Online
Search Results Clustering Plugin (
org.apache.nutch.clustering.OnlineClusterer)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-10-05 12:16:34,278 INFO plugin.PluginRepository - Nutch Scoring (
org.apache.nutch.scoring.ScoringFilter)
2007-10-05 12:16:34,279 INFO plugin.PluginRepository - Nutch Query
Filter (org.apache.nutch.searcher.QueryFilter)
2007-10-05 12:16:34,279 INFO plugin.PluginRepository - Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-10-05 12:16:34,296 WARN mapred.LocalJobRunner - job_fx2l2k
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:85)
at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
---------------------------------------------------------------------------------------------------------------------
I am totally new to this. Your insights please.
- Sagar
Re: First Plugin
Posted by Doğacan Güney <do...@gmail.com>.
Hi,
On 10/5/07, Sagar Vibhute <sa...@gmail.com> wrote:
> Hi,
>
> I have recently downloaded and used nutch and I need to develop a few
> plugins for my work. I took the plugin example given on the wiki,
>
> http://wiki.apache.org/nutch/WritingPluginExample-0%2e9
>
> and followed the instructions as given there. Now when I start crawling
> again it aborts and throws the following exception:
>
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
>
> I could crawl successfully before I added this plugin.
>
> Please give any insights you can to get this fixed.
(I really should add this to the FAQ.)
This log doesn't help us; it only tells us that the crawl has failed.
You have to check your logs elsewhere (logs/hadoop.log if you are
running locally, or your tasktracker's logs if you are running in
distributed mode). If you can send those logs, we can make a more
informed analysis of your problem.
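If the log is long, a small filter can pull out the lines worth posting. The sketch below is plain Java and is not part of Nutch; the class name, the pattern, and the default path logs/hadoop.log (relative to the directory you run it from) are assumptions for illustration. It keeps lines logged at WARN, ERROR or FATAL, plus any line that mentions an exception.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: filters a log for lines that usually point at the real
// cause of a "Job failed!" message.
class HadoopLogScan {
    // WARN/ERROR/FATAL surrounded by whitespace, or any mention of an exception.
    private static final Pattern PROBLEM =
        Pattern.compile("\\s(WARN|ERROR|FATAL)\\s|Exception");

    // Returns the lines that match the pattern, in their original order.
    static List<String> problemLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (PROBLEM.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "logs/hadoop.log";
        for (String line : problemLines(Files.readAllLines(Paths.get(path)))) {
            System.out.println(line);
        }
    }
}
```

The stack frames following each exception line are not captured by this filter, so it is still best to send the surrounding part of the log as well.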
>
> Thank You!
>
> - Sagar
>
--
Doğacan Güney