Posted to user@nutch.apache.org by Alaak <al...@gmx.de> on 2012/08/12 11:58:20 UTC

Problem creating a simple Plugin

Hi

I need to create a simple extension to Nutch that indexes only web pages 
matching certain criteria.

I followed the explanation on how to set up Nutch using Eclipse and got a 
running basic system. Then I followed the explanations on setting up a 
simple plugin here: http://wiki.apache.org/nutch/WritingPluginExample. 
However, after adding the plugin I always get output with the following 
exception, which basically tells me nothing:

...
Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
ParseSegment: starting at 2012-08-12 11:06:47
ParseSegment: segment: crawl/segments/20120812110633
Exception in thread "main" java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I wanted to simplify the example by using only one extension, which simply 
prints out "test" for every crawled page. Here is the code for my plugin 
class:

package testplugin;

import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class SimpleFilter implements IndexingFilter {

     public static final Logger LOGGER = 
LoggerFactory.getLogger(SimpleFilter.class);

    public static final Logger LOGGER = 
LoggerFactory.getLogger(FocusedForumCrawler.class);
     private Configuration conf;

     @Override
     public Configuration getConf() {
         return conf;
     }

     @Override
     public void setConf(Configuration conf) {
         this.conf = conf;

         if (conf == null)
             return;
     }

     @Override
     public NutchDocument filter(NutchDocument doc, Parse parse, Text 
url, CrawlDatum datum, Inlinks inlinks)
             throws IndexingException {
         LOGGER.info("test");
         return doc;
     }

}
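
For the filtering I eventually want, my understanding is that an 
IndexingFilter can exclude a page from the index by returning null from 
filter(). So the filter method above would probably end up looking roughly 
like this sketch, where matchesCriteria is only a hypothetical placeholder 
for my real check:

     @Override
     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
             CrawlDatum datum, Inlinks inlinks) throws IndexingException {
         // Returning null tells Nutch not to index this document.
         if (!matchesCriteria(url.toString(), parse)) {
             return null;
         }
         return doc;
     }

     // Hypothetical placeholder for the actual matching criteria.
     private boolean matchesCriteria(String url, Parse parse) {
         return true;
     }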

I also adapted the plugin.xml to look like:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter" 
version="1.0.0" provider-name="alaak">
     <runtime>
         <library name="simpletestplugin.jar">
             <export name="*"/>
         </library>
     </runtime>

     <requires>
         <import plugin="nutch-extensionpoints"/>
     </requires>

     <extension id="testplugin" name="Some Simple Test Plugin" 
point="org.apache.nutch.segment.SegmentMergeFilter">
         <implementation id="page-filter" class="testplugin.SimpleFilter"/>
     </extension>
</plugin>

Can someone please give me a clue about what I am doing wrong, or tell me 
which additional information you would need to help me?

Thanks and regards.

Re: Problem creating a simple Plugin

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Unless you are indexing, nothing will happen. You specify an indexing
filter, so you actually need to index something before the filter is
run.
Although the plugin is loaded, this doesn't mean that anything is being
indexed.
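
In the all-in-one Crawl tool being run here (per the stack trace), the
indexing step where IndexingFilters are invoked only happens, as far as I
recall, when a Solr URL is passed with -solr. So something along the lines
of the following should trigger it (a sketch assuming Solr on its default
port; adjust directories and parameters to your setup):

bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr -depth 3 -topN 50

Alternatively, run the solrindex job over the existing crawldb, linkdb and
segments once the crawl has finished.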

Lewis

On Sun, Aug 12, 2012 at 3:22 PM, Alaak <al...@gmx.de> wrote:
> Thanks for your answer.
>
> I managed to make it run now. The problem was in the parse-html plugin. It
> was missing dependencies on nekohtml and tagsoup. I added both as
> external jars to my environment.
>
> Currently I get the message in hadoop.log that my plugin is loaded
> successfully:
>
> 2012-08-12 16:06:43,712 INFO  plugin.PluginRepository -     URL Meta
> Indexing Filter (simpletestplugin)
>
> However, it is never called by the crawler. Neither is my 'Test' message
> printed, nor does execution stop if I set a breakpoint within the filter
> method of my plugin class.
>
> I didn't see any error message. I also double-checked the plugin.xml,
> build.xml, src/plugin/build.xml, and nutch-site.xml and compared all of them
> to some existing plugin code. Everything seems to be correct, so I am
> basically quite clueless on how to proceed.
>
> Do you have any tips?
>
>
> Am 12.08.2012 14:01, schrieb Lewis John Mcgibbney:
>>
>> Please carefully read the xml configuration in the file you have pasted
>>
>>
>> On Sun, Aug 12, 2012 at 12:11 PM, Alaak <al...@gmx.de> wrote:
>>
>>>      <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
>>> point="org.apache.nutch.indexer.IndexingFilter">
>>>          <implementation id="page-filter"
>>> class="testplugin.SimpleFilter"/>
>>>      </extension>
>>> </plugin>
>>
>> The extension id attribute should equal the package name followed by
>> your class name. Looking at your Java code, this should be
>>
>> testplugin.SimpleFilter
>>
>> Additionally, the implementation id attribute should be SimpleFilter.
>>
>> Do you have the build.xml correctly configured? Have you added the
>> plugin to the plugin.includes property in nutch-site.xml?
>
>



-- 
Lewis

Re: Problem creating a simple Plugin

Posted by Alaak <al...@gmx.de>.
Thanks for your answer.

I managed to make it run now. The problem was in the parse-html plugin. 
It was missing dependencies on nekohtml and tagsoup. I added both as 
external jars to my environment.

Currently I get the message in hadoop.log that my plugin is loaded 
successfully:

2012-08-12 16:06:43,712 INFO  plugin.PluginRepository -     URL Meta 
Indexing Filter (simpletestplugin)

However, it is never called by the crawler. Neither is my 'Test' message 
printed, nor does execution stop if I set a breakpoint within the 
filter method of my plugin class.

I didn't see any error message. I also double-checked the plugin.xml, 
build.xml, src/plugin/build.xml, and nutch-site.xml and compared all of 
them to some existing plugin code. Everything seems to be correct, so I 
am basically quite clueless on how to proceed.

Do you have any tips?


Am 12.08.2012 14:01, schrieb Lewis John Mcgibbney:
> Please carefully read the xml configuration in the file you have pasted
>
> On Sun, Aug 12, 2012 at 12:11 PM, Alaak <al...@gmx.de> wrote:
>
>>      <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
>> point="org.apache.nutch.indexer.IndexingFilter">
>>          <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>>      </extension>
>> </plugin>
> The extension id attribute should equal the package name followed by
> your class name. Looking at your Java code, this should be
>
> testplugin.SimpleFilter
>
> Additionally, the implementation id attribute should be SimpleFilter.
>
> Do you have the build.xml correctly configured? Have you added the
> plugin to the plugin.includes property in nutch-site.xml?


Re: Problem creating a simple Plugin

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please carefully read the xml configuration in the file you have pasted

On Sun, Aug 12, 2012 at 12:11 PM, Alaak <al...@gmx.de> wrote:

>
>     <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
> point="org.apache.nutch.indexer.IndexingFilter">
>         <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>     </extension>
> </plugin>

The extension id attribute should equal the package name followed by
your class name. Looking at your Java code, this should be

testplugin.SimpleFilter

Additionally, the implementation id attribute should be SimpleFilter.

Do you have the build.xml correctly configured? Have you added the
plugin to the plugin.includes property in nutch-site.xml?
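
Putting that together, the extension element would presumably look
something like this (your pasted plugin.xml with only the id attributes
changed as suggested above):

     <extension id="testplugin.SimpleFilter" name="Some Simple Test Plugin"
point="org.apache.nutch.indexer.IndexingFilter">
         <implementation id="SimpleFilter" class="testplugin.SimpleFilter"/>
     </extension>

And for reference, the build.xml of a plugin under src/plugin usually just
imports the shared plugin build file, roughly:

<project name="simpletestplugin" default="jar-core">
    <import file="../build-plugin.xml"/>
</project>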

Re: Problem creating a simple Plugin

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Klemens,

Please don't hijack others' threads. It is impolite and your threads
will not be answered.

Thank you
Lewis

On Sun, Aug 12, 2012 at 12:23 PM, Klemens Muthmann
<kl...@googlemail.com> wrote:
> Hi,
>
> I found the following exception in hadoop.log
>
> java.lang.Error: Unresolved compilation problems:
>         The import org.cyberneko cannot be resolved
>         org.ccil cannot be resolved to a type
>         org.ccil cannot be resolved to a type
>         org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
>         org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
>         DOMFragmentParser cannot be resolved to a type
>         DOMFragmentParser cannot be resolved to a type
>
>         at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at java.lang.Class.newInstance0(Class.java:372)
>         at java.lang.Class.newInstance(Class.java:325)
>         at
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
>         at
> org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> Eclipse indeed does show me that cyberneko is missing, but it worked until I
> added:
>
> <property>
>         <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
> </property>
>
> to my nutch-site.xml file. I can only assume that parse-(html) is normally
> not part of the plugin.includes property. So I think I have two possible
> directions of action: either get the default value of plugin.includes from
> somewhere and add my plugin to that list, or fix the missing dependencies,
> which I do not exactly know how to do because I usually use Maven and have
> never worked with Ant or Ivy for dependency management. It would be nice if
> you could give me a pointer in either direction.
>
> Am So 12 Aug 2012 13:11:16 CEST schrieb Alaak:
>
>> Hi,
>>
>> Ah sorry. Both are actually copy and paste errors. Of course I only
>> have one logger with the correct class name and the extension point
>> is: "org.apache.nutch.indexer.IndexingFilter"
>>
>> This is the actual plugin.xml I am using.
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <plugin id="simpletestplugin" name="URL Meta Indexing Filter"
>> version="1.0.0" provider-name="alaak">
>>     <runtime>
>>         <library name="simpletestplugin.jar">
>>             <export name="*"/>
>>         </library>
>>     </runtime>
>>
>>     <requires>
>>         <import plugin="nutch-extensionpoints"/>
>>     </requires>
>>
>>     <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
>> point="org.apache.nutch.indexer.IndexingFilter">
>>         <implementation id="page-filter"
>> class="testplugin.SimpleFilter"/>
>>     </extension>
>> </plugin>
>>
>> Am So 12 Aug 2012 12:31:46 CEST schrieb Lewis John Mcgibbney:
>>>
>>>
>>> Hi Alaak,
>>>
>>> On Sun, Aug 12, 2012 at 10:58 AM, Alaak <al...@gmx.de> wrote:
>>>>
>>>>
>>>> I always get output with the following
>>>> exception which basically tells me nothing:
>>>>
>>>> ...
>>>> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
>>>> ParseSegment: starting at 2012-08-12 11:06:47
>>>> ParseSegment: segment: crawl/segments/20120812110633
>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>>>
>>>
>>>
>>> It tells you that there is a problem whilst parsing a particular
>>> segment. This is quite a lot to go on.
>>>
>>> All the Java code looks fine. I don't see any problems except that you
>>> have an additional logging variable which seems to point outside of the
>>> class.
>>>
>>>>
>>>>
>>>> <extension id="testplugin" name="Some Simple Test Plugin"
>>>> point="org.apache.nutch.segment.SegmentMergeFilter">
>>>> <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>>>> </extension>
>>>> </plugin>
>>>
>>>
>>>
>>> Now we come to the main point of concern. For me (as far as I
>>> understand what you are trying to do) you should not extend the
>>> SegmentMergeFilter point. This should refer to the IndexingFilter you
>>> wish to extend. A list of extension points can be seen here [0]
>>>
>>> [0]
>>>
>>> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
>>>
>>>
>>> hth
>>>
>>> Lewis



-- 
Lewis

Re: Problem creating a simple Plugin

Posted by Klemens Muthmann <kl...@googlemail.com>.
Hi,

I found the following exception in hadoop.log

java.lang.Error: Unresolved compilation problems:
	The import org.cyberneko cannot be resolved
	org.ccil cannot be resolved to a type
	org.ccil cannot be resolved to a type
	org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
	org.ccil.cowan.tagsoup.Parser cannot be resolved to a type
	DOMFragmentParser cannot be resolved to a type
	DOMFragmentParser cannot be resolved to a type

	at org.apache.nutch.parse.html.HtmlParser.<init>(HtmlParser.java:28)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
Method)
	at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
	at java.lang.Class.newInstance0(Class.java:372)
	at java.lang.Class.newInstance(Class.java:325)
	at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
	at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:77)
	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:1)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
	at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

Eclipse indeed does show me that cyberneko is missing, but it worked 
until I added:

<property>
	<name>plugin.includes</name>
	<value>protocol-http|urlfilter-regex|parse-(html)|simpletestplugin</value>
</property>

to my nutch-site.xml file. I can only assume that parse-(html) is 
normally not part of the plugin.includes property. So I think I have 
two possible directions of action: either get the default value of 
plugin.includes from somewhere and add my plugin to that list, or fix 
the missing dependencies, which I do not exactly know how to do because 
I usually use Maven and have never worked with Ant or Ivy for 
dependency management. It would be nice if you could give me a pointer 
in either direction.
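
For the first direction, the default appears to live in 
conf/nutch-default.xml and looks roughly like the value below in this 
release, so I could presumably copy it into nutch-site.xml and append my 
plugin (sketch only; I have not verified the exact default value):

<property>
	<name>plugin.includes</name>
	<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|simpletestplugin</value>
</property>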

Am So 12 Aug 2012 13:11:16 CEST schrieb Alaak:
> Hi,
>
> Ah sorry. Both are actually copy and paste errors. Of course I only
> have one logger with the correct class name and the extension point
> is: "org.apache.nutch.indexer.IndexingFilter"
>
> This is the actual plugin.xml I am using.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <plugin id="simpletestplugin" name="URL Meta Indexing Filter"
> version="1.0.0" provider-name="alaak">
>     <runtime>
>         <library name="simpletestplugin.jar">
>             <export name="*"/>
>         </library>
>     </runtime>
>
>     <requires>
>         <import plugin="nutch-extensionpoints"/>
>     </requires>
>
>     <extension id="de.effingo.crawler" name="Some Simple Test Plugin"
> point="org.apache.nutch.indexer.IndexingFilter">
>         <implementation id="page-filter"
> class="testplugin.SimpleFilter"/>
>     </extension>
> </plugin>
>
> Am So 12 Aug 2012 12:31:46 CEST schrieb Lewis John Mcgibbney:
>>
>> Hi Alaak,
>>
>> On Sun, Aug 12, 2012 at 10:58 AM, Alaak <al...@gmx.de> wrote:
>>>
>>> I always get output with the following
>>> exception which basically tells me nothing:
>>>
>>> ...
>>> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
>>> ParseSegment: starting at 2012-08-12 11:06:47
>>> ParseSegment: segment: crawl/segments/20120812110633
>>> Exception in thread "main" java.io.IOException: Job failed!
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>>
>>
>> It tells you that there is a problem whilst parsing a particular
>> segment. This is quite a lot to go on.
>>
>> All the Java code looks fine. I don't see any problems except that you
>> have an additional logging variable which seems to point outside of the
>> class.
>>
>>>
>>>
>>> <extension id="testplugin" name="Some Simple Test Plugin"
>>> point="org.apache.nutch.segment.SegmentMergeFilter">
>>> <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>>> </extension>
>>> </plugin>
>>
>>
>> Now we come to the main point of concern. For me (as far as I
>> understand what you are trying to do) you should not extend the
>> SegmentMergeFilter point. This should refer to the IndexingFilter you
>> wish to extend. A list of extension points can be seen here [0]
>>
>> [0]
>> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
>>
>>
>> hth
>>
>> Lewis

Re: Problem creating a simple Plugin

Posted by Alaak <al...@gmx.de>.
Hi,

Ah sorry. Both are actually copy and paste errors. Of course I only have 
one logger with the correct class name and the extension point is: 
"org.apache.nutch.indexer.IndexingFilter"

This is the actual plugin.xml I am using.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="simpletestplugin" name="URL Meta Indexing Filter" 
version="1.0.0" provider-name="alaak">
     <runtime>
         <library name="simpletestplugin.jar">
             <export name="*"/>
         </library>
     </runtime>

     <requires>
         <import plugin="nutch-extensionpoints"/>
     </requires>

     <extension id="de.effingo.crawler" name="Some Simple Test Plugin" 
point="org.apache.nutch.indexer.IndexingFilter">
         <implementation id="page-filter" class="testplugin.SimpleFilter"/>
     </extension>
</plugin>

Am So 12 Aug 2012 12:31:46 CEST schrieb Lewis John Mcgibbney:
>
> Hi Alaak,
>
> On Sun, Aug 12, 2012 at 10:58 AM, Alaak <al...@gmx.de> wrote:
>>
>> I always get output with the following
>> exception which basically tells me nothing:
>>
>> ...
>> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
>> ParseSegment: starting at 2012-08-12 11:06:47
>> ParseSegment: segment: crawl/segments/20120812110633
>> Exception in thread "main" java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
>
>
> It tells you that there is a problem whilst parsing a particular
> segment. This is quite a lot to go on.
>
> All the Java code looks fine. I don't see any problems except that you
> have an additional logging variable which seems to point outside of the
> class.
>
>>
>>
>> <extension id="testplugin" name="Some Simple Test Plugin"
>> point="org.apache.nutch.segment.SegmentMergeFilter">
>> <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>> </extension>
>> </plugin>
>
>
> Now we come to the main point of concern. For me (as far as I
> understand what you are trying to do) you should not extend the
> SegmentMergeFilter point. This should refer to the IndexingFilter you
> wish to extend. A list of extension points can be seen here [0]
>
> [0] 
> http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
>
> hth
>
> Lewis

Re: Problem creating a simple Plugin

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Alaak,

On Sun, Aug 12, 2012 at 10:58 AM, Alaak <al...@gmx.de> wrote:
> I always get output with the following
> exception which basically tells me nothing:
>
> ...
> Fetcher: finished at 2012-08-12 11:06:47, elapsed: 00:00:07
> ParseSegment: starting at 2012-08-12 11:06:47
> ParseSegment: segment: crawl/segments/20120812110633
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)

It tells you that there is a problem whilst parsing a particular
segment. This is quite a lot to go on.

All the Java code looks fine. I don't see any problems except that you
have an additional logging variable which seems to point outside of the
class.

>
>     <extension id="testplugin" name="Some Simple Test Plugin"
> point="org.apache.nutch.segment.SegmentMergeFilter">
>         <implementation id="page-filter" class="testplugin.SimpleFilter"/>
>     </extension>
> </plugin>

Now we come to the main point of concern. For me (as far as I
understand what you are trying to do) you should not extend the
SegmentMergeFilter point. This should refer to the IndexingFilter
extension point you wish to extend. A list of extension points can be
seen here [0]

[0] http://svn.apache.org/repos/asf/nutch/trunk/src/plugin/nutch-extensionpoints/plugin.xml
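
For example, keeping the rest of the pasted extension element and changing
only the point attribute would give roughly:

     <extension id="testplugin" name="Some Simple Test Plugin"
point="org.apache.nutch.indexer.IndexingFilter">
         <implementation id="page-filter" class="testplugin.SimpleFilter"/>
     </extension>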

hth

Lewis