Posted to user@nutch.apache.org by Alaak <al...@gmx.de> on 2012/08/12 11:58:20 UTC
Problem creating a simple Plugin
Hi
I need to create a simple Nutch extension that indexes only web pages
matching certain criteria.
I followed the instructions for setting up Nutch in Eclipse and got a
running basic system. Then I followed the instructions for setting up a
simple plugin here: http://wiki.apache.org/nutch/WritingPluginExample.
However, after adding the plugin I always get output with the following
exception, which basically tells me nothing:
...
Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
ParseSegment: starting at 2012-08-12 11:06:47
ParseSegment: segment: crawl/segments/20120812110633
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I wanted to simplify the example by using only one extension, which simply
prints out "test" for every crawled page. Here is the code for my plugin
class:
package testplugin;

import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class SimpleFilter implements IndexingFilter {
    public static final Logger LOGGER =
        LoggerFactory.getLogger(SimpleFilter.class);
    public static final Logger LOGGER =
        LoggerFactory.getLogger(FocusedForumCrawler.class);
    private Configuration conf;

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        if (conf == null)
            return;
    }

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks)
            throws IndexingException {
        LOGGER.info("test");
        return doc;
    }
}
I also adapted the plugin.xml to look like:
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter"
        version="1.0.0" provider-name="alaak">
  <runtime>
    <library name="simpletestplugin.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="testplugin" name="Some Simple Test Plugin"
             point="org.apache.nutch.segment.SegmentMergeFilter">
    <implementation id="page-filter" class="testplugin.SimpleFilter"/>
  </extension>
</plugin>
Can someone please give me a clue what I am doing wrong or which
additional information you would need to help me?
Thanks and regards.
Re: Problem creating a simple Plugin
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Unless you are indexing, nothing will happen. You specified an indexing
filter, so you actually need to index something before the filter is
run.
Although the plugin is loaded, this doesn't mean that anything is being
indexed.
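To illustrate the point, here is a minimal, self-contained sketch (not Nutch
code) of why a loaded filter stays silent until an indexing step runs. The
names Doc, DocFilter, CountingFilter and runIndexStep are hypothetical
stand-ins, not Nutch APIs, and the "/forum/" criterion is only an example.
The behaviour assumed from Nutch is that indexing filters are invoked per
document during indexing, and that a filter returning null discards the
document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexStepSketch {

    /** Hypothetical stand-in for a parsed page; not a Nutch class. */
    static final class Doc {
        final String url;
        Doc(String url) { this.url = url; }
    }

    /** Hypothetical stand-in for an indexing filter: return null to drop the page. */
    interface DocFilter {
        Doc filter(Doc doc);
    }

    /** Counts its invocations so we can see when the filter actually runs. */
    static final class CountingFilter implements DocFilter {
        int calls = 0;

        @Override
        public Doc filter(Doc doc) {
            calls++;
            // Example criterion (an assumption): keep only URLs containing "/forum/".
            return doc.url.contains("/forum/") ? doc : null;
        }
    }

    /** Simulated index step: the only place where filters are invoked. */
    static List<Doc> runIndexStep(List<Doc> parsed, List<DocFilter> filters) {
        List<Doc> indexed = new ArrayList<>();
        for (Doc doc : parsed) {
            Doc current = doc;
            for (DocFilter f : filters) {
                current = f.filter(current);
                if (current == null) {
                    break; // a filter discarded the document
                }
            }
            if (current != null) {
                indexed.add(current);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        CountingFilter filter = new CountingFilter();
        List<Doc> parsed = Arrays.asList(
                new Doc("http://example.org/forum/1"),
                new Doc("http://example.org/about"));

        // Fetching and parsing alone never touch the filter.
        System.out.println("calls before indexing: " + filter.calls);

        List<Doc> indexed = runIndexStep(parsed, Arrays.<DocFilter>asList(filter));
        System.out.println("calls after indexing: " + filter.calls);
        System.out.println("documents kept: " + indexed.size());
    }
}
```

In the sketch the filter is only ever reached from runIndexStep, which
mirrors the situation here: a crawl that stops after fetching and parsing
gives the filter no chance to log anything or hit a breakpoint.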
Lewis
Re: Problem creating a simple Plugin
Posted by Alaak <al...@gmx.de>.
Thanks for your answer.
I managed to make it run now. The problem was in the parse-html plugin:
it was missing its dependencies on nekohtml and tagsoup. I added both as
external jars to my environment.
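A quick way to confirm that the two libraries are visible is a small
classpath probe like the sketch below. It uses only the JDK. The class name
org.ccil.cowan.tagsoup.Parser is taken from the hadoop.log trace posted in
this thread; org.cyberneko.html.parsers.DOMFragmentParser is my assumption
for the nekohtml class, and ParserDependencyCheck is a hypothetical helper
name. It assumes the jars are expected on the same classpath the check runs
with (for example the Eclipse project classpath).

```java
final class ParserDependencyCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, ParserDependencyCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.cyberneko.html.parsers.DOMFragmentParser", // assumed nekohtml class
            "org.ccil.cowan.tagsoup.Parser"                 // named in the hadoop.log trace
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run from the same launch configuration as the crawler, this should report
both classes as found once the jars have been added.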
Currently I get the message that my plugin is loaded successfully in
hadoop.log
2012-08-12 16:06:43,712 INFO plugin.PluginRepository - URL Meta
Indexing Filter (simpletestplugin)
However, it is never called by the crawler. My 'test' message is not
printed, and execution does not stop at a breakpoint set within the
filter method of my plugin class.
I didn't see any error message. I also double-checked the plugin.xml,
build.xml, src/plugin/build.xml and nutch-site.xml and compared all of
them to some existing plugin code. Everything seems to be correct, so I
am at a loss as to how to proceed.
Do you have any tips?
Re: Problem creating a simple Plugin
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please carefully read the XML configuration in the file you have pasted.
On Sun, Aug 12, 2012 at 12:11 PM, Alaak <al...@gmx.de> wrote:
>
> <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
> point="org.apache.nutch.indexer.IndexingFilter">
> <implementation id="page-filter" class="testplugin.SimpleFilter"/>
> </extension>
> </plugin>
The extension id attribute should equal the package name followed by
your class name. Looking at your Java code, this should be
testplugin.SimpleFilter
Additionally, the implementation id attribute should be SimpleFilter.
Do you have the build.xml correctly configured? Have you added the
plugin to the plugin.includes property in nutch-site.xml?
Re: Problem creating a simple Plugin
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Klemens,
Please don't hijack others' threads. It is impolite and your threads
will not be answered.
Thank you
Lewis
Re: Problem creating a simple Plugin
Posted by Klemens Muthmann <kl...@googlemail.com>.
Hi,
I found the following exception in hadoop.log:
java.lang.Error: Unresolved compilation problems:
    The import org.cyberneko cannot be resolved
    org.ccil cannot be resolved to a type
    org.ccil cannot be resolved to a type
    org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
    org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
    DOMFragmentParser cannot be resolved to a type
    DOMFragmentParser cannot be resolved to a type
    at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at java.lang.Class.newInstance0(Class.java:372)
    at java.lang.Class.newInstance(Class.java:325)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Eclipse does indeed show me that cyberneko is missing, but it worked
until I added:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
</property>
to my nutch-site.xml file. I can only assume that parse-(html) is
normally not part of the plugin.includes property. So I think I have
two possible courses of action: either get the default value of
plugin.includes from somewhere and add my plugin to that list, or fix
the missing dependencies, which I do not know exactly how to do because
I usually use Maven and have never worked with Ant or Ivy for dependency
management. It would be nice if you could give me a pointer in either
direction.
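For anyone weighing the first option: as far as I understand it,
plugin.includes is treated as a regular expression that is matched against
plugin ids, so a value listing only a few ids switches every other plugin
off. The sketch below uses only java.util.regex to show which ids the value
above would admit. Whole-string matching is my assumption about how Nutch
applies the pattern, PluginIncludesCheck is a hypothetical helper name, and
parse-tika and index-basic are ids I recall from the stock distribution, so
treat them as illustrative.

```java
import java.util.regex.Pattern;

final class PluginIncludesCheck {

    /** True if the plugin id matches the plugin.includes value as a whole-string regex. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "protocol-http|urlfilter-regex|parse-(html)|simpletestplugin";
        String[] ids = {
            "protocol-http", "parse-html", "parse-tika", "index-basic", "simpletestplugin"
        };
        for (String id : ids) {
            System.out.println(id + " -> " + (isIncluded(includes, id) ? "included" : "excluded"));
        }
    }
}
```

If the stock value is wanted as a starting point, the plugin.includes
property in conf/nutch-default.xml is the place to copy it from before
appending the new plugin id in nutch-site.xml.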
Re: Problem creating a simple Plugin
Posted by Alaak <al...@gmx.de>.
Hi,
Ah, sorry. Both are actually copy-and-paste errors. Of course I only have
one logger, with the correct class name, and the extension point is:
"org.apache.nutch.indexer.IndexingFilter"
This is the actual plugin.xml I am using.
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter"
        version="1.0.0" provider-name="alaak">
  <runtime>
    <library name="simpletestplugin.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="page-filter" class="testplugin.SimpleFilter"/>
  </extension>
</plugin>
Re: Problem creating a simple Plugin
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alaak,
On Sun, Aug 12, 2012 at 10:58 AM, Alaak <al...@gmx.de> wrote:
> I always get output with the following
> exception which basically tells me nothing:
>
> ...
> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
> ParseSegment: starting at 2012-08-12 11:06:47
> ParseSegment: segment: crawl/segments/20120812110633
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
It tells you that there is a problem whilst parsing a particular
segment. This is quite a lot to go on.
All the Java code looks fine. I don't see any problems except that you
have an additional logging variable which seems to point outside of the
class.
>
> <extension id="testplugin" name="Some Simple Test Plugin"
> point="org.apache.nutch.segment.SegmentMergeFilter">
> <implementation id="page-filter" class="testplugin.SimpleFilter"/>
> </extension>
> </plugin>
Now we come to the main point of concern. As far as I understand what
you are trying to do, you should not extend the SegmentMergeFilter
point. The point attribute should refer to the IndexingFilter extension
point you wish to extend. A list of extension points can be seen here [0]
[0] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
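A small sketch of how the point attribute could be checked mechanically,
using only the JDK's DOM parser. ExtensionPointCheck and extensionPoints are
hypothetical helper names, and the XML string is a shortened copy of the
plugin.xml from the original post; nothing here is part of Nutch itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ExtensionPointCheck {

    /** Returns the point attribute of every <extension> element in the given plugin.xml text. */
    static List<String> extensionPoints(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
            NodeList extensions = doc.getElementsByTagName("extension");
            List<String> points = new ArrayList<>();
            for (int i = 0; i < extensions.getLength(); i++) {
                points.add(((Element) extensions.item(i)).getAttribute("point"));
            }
            return points;
        } catch (Exception e) {
            // Malformed XML (for example a stray quote) ends up here.
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the plugin.xml from the original post.
        String pluginXml =
                "<plugin id=\"simpletestplugin\" name=\"URL Meta Indexing Filter\""
              + " version=\"1.0.0\" provider-name=\"alaak\">"
              + "<extension id=\"testplugin\" name=\"Some Simple Test Plugin\""
              + " point=\"org.apache.nutch.segment.SegmentMergeFilter\">"
              + "<implementation id=\"page-filter\" class=\"testplugin.SimpleFilter\"/>"
              + "</extension></plugin>";
        String expected = "org.apache.nutch.indexer.IndexingFilter";
        for (String point : extensionPoints(pluginXml)) {
            System.out.println(point
                    + (point.equals(expected) ? " OK" : " does not match " + expected));
        }
    }
}
```

Because the helper also rejects XML that does not parse, it would flag
simple paste errors in plugin.xml as well as a wrong extension point.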
hth
Lewis