Posted to dev@nutch.apache.org by zzcgiacomini <zz...@echo.fr> on 2006/11/08 11:25:05 UTC
Nutch 0.9 not loading plugins (sorry very long)
Hi everybody,
Sorry to bring up this issue again with such a long mail, but I really
can't get my plugin loaded.
I have read and applied the suggestions given in various previous
postings on this list,
but I still have not gotten any results.
Basically, I have used part of the code written for the "recommended"
plugin example from the Nutch wiki, and kept only the Parse extension.
I have ported it to Nutch 0.9 and run the inject/generate/fetch cycle.
The plugin is compiled and correctly installed in the
$NUTCH_HOME/plugins/parse-rec directory.
My problem is that my plugin never seems to be executed, even though
it appears to be correctly registered.
Another problem is that I cannot get the plugin system to produce any
logs unless I invoke it directly (see below).
I include here all my code, config, etc., hoping someone can point out my
mistakes or misunderstandings.
-Corrado
I took the code from the latest nightly ("At revision 472436")
and put my plugin code in
trunk/src/plugin/parse-rec/src/java/org/apache/nutch/parse/rec/RecParseFilter.java
Here are the code and config files:
__________________________ RecParseFilter.java
______________________________________
package org.apache.nutch.parse.rec;
// JDK imports
import java.util.Enumeration;
import java.util.Properties;
import java.util.logging.Logger;
// Nutch imports
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.w3c.dom.DocumentFragment;
public class RecParseFilter implements HtmlParseFilter {
/** Configuration */
private Configuration conf;
public static final Log LOG = LogFactory.getLog("RecParseFilter.class");
/** The Recommended meta data attribute name */
public static final String META_RECOMMENDED_NAME="Recommended";
/** Scan the HTML document looking for a recommended meta tag. */
public Parse filter(Content content, Parse parse, HTMLMetaTags
metaTags, DocumentFragment doc) {
LOG.debug("RecParseFilter::filter() --->");
/** Trying to find the document's recommended term */
String recommendation = null;
Properties generalMetaTags = metaTags.getGeneralTags();
String title = parse.getData().getTitle();
LOG.debug("RecParseFilter::filter() - Document Title : " + title);
for(Enumeration tagNames = generalMetaTags.propertyNames();
tagNames.hasMoreElements(); ) {
if (tagNames.nextElement().equals("recommended")) {
recommendation = generalMetaTags.getProperty("recommended");
LOG.debug("RecParseFilter::filter() - Found a Recommendation for "
+ recommendation);
}
}
if(recommendation == null)
LOG.debug("RecParseFilter::filter() - No Recommendation");
else {
LOG.debug("RecParseFilter::filter() - Adding Recommendation for "
+ recommendation);
parse.getData().getContentMeta().set(META_RECOMMENDED_NAME,
recommendation);
}
LOG.debug("RecParseFilter::filter() <--");
return parse;
}
public Configuration getConf() {
LOG.debug("RecParseFilter::getConf() -->");
LOG.debug("RecParseFilter::getConf() <--");
return this.conf;
}
public void setConf(Configuration conf) {
LOG.debug("RecParseFilter::setConf() -->");
LOG.debug("RecParseFilter::setConf() <--");
this.conf = conf;
}
}
________________________________________________________________
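As a side note on the lookup loop in filter() above, here is a minimal stand-alone sketch of the same meta-tag lookup using only java.util.Properties. The class name MetaTagLookup and the method findRecommendation are hypothetical and not part of Nutch; the sketch assumes, as the filter does, that the tag name is stored in lower case.

```java
import java.util.Properties;

/** Stand-alone sketch of the meta-tag lookup; MetaTagLookup is a hypothetical name. */
final class MetaTagLookup {

    /** Name of the meta tag, lower-cased as in the filter above (assumption). */
    static final String TAG = "recommended";

    /**
     * Returns the value of the "recommended" tag, or null if it is absent.
     * Properties.getProperty already returns null for a missing key, so
     * enumerating the property names first is not strictly needed.
     */
    static String findRecommendation(Properties generalTags) {
        return generalTags.getProperty(TAG);
    }

    public static void main(String[] args) {
        Properties tags = new Properties();
        tags.setProperty("recommended", "plugins");
        System.out.println(findRecommendation(tags));             // prints "plugins"
        System.out.println(findRecommendation(new Properties())); // prints "null"
    }
}
```

This only illustrates the lookup itself; it says nothing about whether the filter is actually invoked by the parser.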
_________________________plugin.xml_______________________________
<?xml version="1.0" encoding="UTF-8"?>
<plugin
id="parse-rec"
name="Recommended Parser/Filter"
version="0.0.1"
provider-name="nutch.org">
<runtime>
<!-- As defined in build.xml this plugin will end up bundled as
parse-rec.jar -->
<library name="parse-rec.jar">
<export name="*"/>
</library>
</runtime>
<requires>
<import plugin="nutch-extensionpoints"/>
</requires>
<!-- The RecommendedParser extends the HtmlParseFilter to grab the
contents of any recommended meta tags -->
<extension id="org.apache.nutch.parse.rec.RecParseFilter"
name="Recommended Parser"
point="org.apache.nutch.parse.HtmlParseFilter">
<implementation id="RecParseFilter"
class="org.apache.nutch.parse.rec.RecParseFilter">
<parameter name="contentType" value="text/html"/>
<parameter name="pathSuffix" value=""/>
</implementation>
</extension>
</plugin>
________________________________________________________________
I have added this property to nutch-site.xml:
___________________________nutch-site.xml__________________________
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|rec)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
________________________________________________________________
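To sanity-check a plugin.includes value outside Nutch, a small sketch like the following may help. It assumes that the value is treated as a regular expression that must match the whole plugin id, and it uses a cleaned-up form of the expression (which is my assumption about the intended value). The class name PluginIncludesCheck is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: tests plugin ids against a plugin.includes-style regex. */
final class PluginIncludesCheck {

    /** Assumed intended value of plugin.includes, split for readability. */
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex"
        + "|parse-(text|html|js|rec)|index-basic|query-(basic|site|url)"
        + "|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

    private static final Pattern PATTERN = Pattern.compile(INCLUDES);

    /** True if the whole plugin id matches the includes expression. */
    static boolean isIncluded(String pluginId) {
        return PATTERN.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("parse-rec"));  // prints "true"
        System.out.println(isIncluded("parse-html")); // prints "true"
        System.out.println(isIncluded("parse-pdf"));  // prints "false"
    }
}
```

If parse-rec matches here but the plugin still does not run, the includes expression is probably not the culprit.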
I have added these lines to parse-plugins.xml.
I also tried listing only my plugin, with the same results.
___________________________parse-plugins.xml__________________________
<mimeType name="text/html">
<plugin id="parse-rec" />
<plugin id="parse-html" />
</mimeType>
________________________________________________________________
Finally, I added a line to log4j.properties to make the plugin system log.
But despite this line I get no plugin logs at all.
___________________________log4j.properties__________________________
log4j.logger.org.apache.nutch.plugin=DEBUG
________________________________________________________________
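For reference, log4j logger names form a dotted hierarchy: a level set for org.apache.nutch.plugin applies to that logger and to loggers below it, while other loggers fall back to their own nearest configured ancestor or to the root logger. The sketch below only models that naming rule in plain Java and does not use log4j itself; the class name LoggerPrefixCheck is hypothetical. Note that the filter above creates its logger with the literal name "RecParseFilter.class".

```java
/** Hypothetical model of log4j's dotted logger-name hierarchy (no log4j dependency). */
final class LoggerPrefixCheck {

    /**
     * True if loggerName is the configured logger itself or one of its
     * descendants, i.e. it starts with the configured name plus a dot.
     */
    static boolean inherits(String loggerName, String configuredName) {
        return loggerName.equals(configuredName)
            || loggerName.startsWith(configuredName + ".");
    }

    public static void main(String[] args) {
        String configured = "org.apache.nutch.plugin";
        // Logger of the plugin system: covered by the line above.
        System.out.println(inherits("org.apache.nutch.plugin.PluginRepository", configured)); // prints "true"
        // Literal name used by the filter's getLog(...) call: not covered.
        System.out.println(inherits("RecParseFilter.class", configured));                     // prints "false"
        // Fully qualified class name of the filter: still a different branch.
        System.out.println(inherits("org.apache.nutch.parse.rec.RecParseFilter", configured)); // prints "false"
    }
}
```

Under this rule, the DEBUG line above would affect the plugin repository's own messages but not the filter's, whatever else is going on in the fetcher.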
After running the fetcher I was expecting to see the "recommended"
meta tag in my segment:
nutch readseg -get test/segments/20061108110142
"http://testmachine.toto.net/index.html"
SegmentReader: get 'http://testmachine.toto.net/index.html'
Content::
Version: 2
url: http://testmachine.toto.net/index.html
base: http://testmachine.toto.net/index.html
contentType: text/html
metadata: nutch.segment.name=20061108110142 nutch.crawl.score=1.0
Content:
Crawl Generate::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Wed Nov 08 10:54:39 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null
Crawl Fetch::
Version: 4
Status: 6 (fetch_retry)
Fetch time: Wed Nov 08 11:02:46 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null
I then tried to invoke the plugin directly:
nutch plugin parse-rec org.apache.nutch.parse.rec.RecParseFilter
This way I got the plugin logs I wanted in hadoop.log, showing that
the plugin is registered:
.....
2006-11-08 11:07:33,520 DEBUG plugin.PluginRepository - parsing:
/home/opt/nutch-0.9-dev/plugins/parse-rec/plugin.xml
2006-11-08 11:07:33,526 DEBUG plugin.PluginRepository - plugin:
id=parse-rec name=Recommended Parser/Filter version=0.0.1
provider=nutch.orgclass=null
2006-11-08 11:07:33,527 DEBUG plugin.PluginRepository - impl:
point=org.apache.nutch.parse.HtmlParseFilter
class=org.apache.nutch.parse.rec.RecParseFilter
2006-11-08 11:07:33,528 DEBUG plugin.PluginRepository - parsing:
/home/opt/nutch-0.9-dev/plugins/parse-text/plugin.xml
.....
Registered Plugins:
....
2006-11-08 11:07:34,014 INFO plugin.PluginRepository -
Recommended Parser/Filter (parse-rec)
....
2006-11-08 11:07:51,827 DEBUG plugin.PluginRepository - parsing:
/home/opt/nutch-0.9-dev/plugins/parse-rec/plugin.xml
2006-11-08 11:07:51,837 DEBUG plugin.PluginRepository - plugin:
id=parse-rec name=Recommended Parser/Filter version=0.0.1
provider=nutch.orgclass=null
...
Re: Nutch 0.9 not loading plugins (sorry very long)
Posted by zzcgiacomini <zz...@echo.fr>.
Sorry, in my previous posting the output of nutch "readseg -get" was
wrong. Here is the actual output:
-Corrado
SegmentReader: get 'http://testmachine.test.net/index.html'
Content::
Version: 2
url: http://testmachine.test.net/index.html
base: http://testmachine.test.net/index.html
contentType: text/html
metadata: Content-Length=345 Connection=close
ETag="2f4ac-159-421166c12a140" nutch.segment.name=20061108113703
nutch.crawl.score=1.0 Recommended=plugins
nutch.content.digest=82e307c71d7476ce729a8e6d3b0de50a
Accept-Ranges=bytes Server=Apache/2.2.0 (Fedora) Content-Type=text/html;
charset=UTF-8 date=Wed, 08 Nov 2006 10:37:57 GMT Last-Modified=Tue, 31
Oct 2006 07:34:53 GMT
Content:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<HTML>
<HEAD>
<TITLE>
PLUG-IN TEST
</TITLE>
</HEAD>
<meta name="recommended" content="plugins">
<A HREF="http://testmachine.test.net/omniORB/index.html">omniORB</A>
<BR>
<A HREF="http://testmachine.test.net/nutch/index.html">Nutch</A>
</HTML>
Crawl Generate::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Wed Nov 08 11:36:31 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null
Crawl Fetch::
Version: 4
Status: 5 (fetch_success)
Fetch time: Wed Nov 08 11:37:58 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: 82e307c71d7476ce729a8e6d3b0de50a
Metadata: null
Crawl Parse::
Version: 4
Status: 4 (linked)
Fetch time: Wed Nov 08 11:38:05 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.5
Signature: null
Metadata: null
ParseData::
Version: 5
Status: success(1,0)
Title: PLUG-IN TEST
Outlinks: 2
outlink: toUrl: http://testmachine.test.net/omniORB/index.html anchor:
omniORB
outlink: toUrl: http://testmachine.test.net/nutch/index.html anchor: Nutch
Content Metadata: Connection=close Content-Length=345
nutch.crawl.score=1.0 nutch.segment.name=20061108113703
ETag="2f4ac-159-421166c12a140" Recommended=plugins
nutch.content.digest=82e307c71d7476ce729a8e6d3b0de50a
Accept-Ranges=bytes Content-Type=text/html; charset=UTF-8
Server=Apache/2.2.0 (Fedora) Last-Modified=Tue, 31 Oct 2006 07:34:53 GMT
date=Wed, 08 Nov 2006 10:37:57 GMT
Parse Metadata: OriginalCharEncoding=UTF-8 CharEncodingForConversion=UTF-8
ParseText::
PLUG-IN TEST omniORB Nutch