Posted to dev@nutch.apache.org by zzcgiacomini <zz...@echo.fr> on 2006/11/08 11:25:05 UTC

Nutch 0.9 not loading plugins (sorry very long)

Hi everybody,
Sorry to come back to this issue with such a long mail, but I really
cannot get my plugin loaded.
I have read and applied the suggestions given in various previous
postings on this list,
but I still have not gotten any results.

Basically, I have used part of the code written for the "recommended"
plugin example from the Nutch wiki, and kept only the Parse extension.
I have ported it to Nutch 0.9 and run the inject/generate/fetch cycle.
The plugin is compiled and correctly installed in the
$NUTCH_HOME/plugins/parse-rec directory.

My problem is that my plugin never seems to be executed, even though
it appears to be correctly registered.
Another problem is that I cannot get the plugin system to produce any
logs unless I invoke it directly (see below).

I include all my code, config, etc. here, hoping someone can point out my
mistakes or misunderstandings.

-Corrado

I took the code from the latest nightly ("At revision 472436") and
put my plugin code in
trunk/src/plugin/parse-rec/src/java/org/apache/nutch/parse/rec/RecParseFilter.java

Here are the code and config files:
__________________________ RecParseFilter.java ______________________________________
package org.apache.nutch.parse.rec;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;
import java.util.logging.Logger;

// Nutch imports
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import org.apache.nutch.util.NutchConfiguration;
import org.apache.hadoop.conf.Configuration;

import org.w3c.dom.DocumentFragment;

public class RecParseFilter implements HtmlParseFilter {

  /** Configuration  */
  private Configuration conf;

  public static final Log LOG = LogFactory.getLog(RecParseFilter.class);

  /** The Recommended meta data attribute name */
  public static final String META_RECOMMENDED_NAME="Recommended";

  /** Scan the HTML document looking for a recommended meta tag.  */
  public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc) {

        LOG.debug("RecParseFilter::filter() --->");
        /** Trying to find the document's recommended term */
        String recommendation = null;
        Properties generalMetaTags = metaTags.getGeneralTags();
        String title = parse.getData().getTitle();
        LOG.debug("RecParseFilter::filter() - Document Title : " + title);

        for(Enumeration tagNames = generalMetaTags.propertyNames(); tagNames.hasMoreElements(); ) {
            if (tagNames.nextElement().equals("recommended")) {
                recommendation = generalMetaTags.getProperty("recommended");
                LOG.debug("RecParseFilter::filter() - Found a Recommendation for " + recommendation);
             }
        }

        if(recommendation == null)
           LOG.debug("RecParseFilter::filter() - No Recommendation");
        else {
           LOG.debug("RecParseFilter::filter() - Adding Recommendation for " + recommendation);
           parse.getData().getContentMeta().set(META_RECOMMENDED_NAME, recommendation);
        }
        LOG.debug("RecParseFilter::filter() <--");
        return parse;
  }

  public Configuration getConf() {
    LOG.debug("RecParseFilter::getConf() -->");
    LOG.debug("RecParseFilter::getConf() <--");
    return this.conf;
  }

  public void setConf(Configuration conf) {
    LOG.debug("RecParseFilter::setConf() -->");
    LOG.debug("RecParseFilter::setConf() <--");
    this.conf = conf;
  }
}
________________________________________________________________
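
The tag lookup inside filter() can be tried out on its own, without Nutch. The following is only a minimal sketch under stated assumptions: RecommendedLookup and findRecommended are hypothetical names, and a plain java.util.Properties object stands in for what metaTags.getGeneralTags() returns. It assumes, as the loop above does, that the tag name is stored in lowercase.

```java
import java.util.Properties;

/** Stand-alone sketch; RecommendedLookup is a hypothetical name, not a Nutch class. */
final class RecommendedLookup {

    /** Returns the value stored under "recommended", or null when the key is absent. */
    static String findRecommended(Properties generalMetaTags) {
        // getProperty performs the key lookup directly, so no enumeration is needed.
        // Keys are case-sensitive: an entry stored as "Recommended" is not found.
        return generalMetaTags.getProperty("recommended");
    }

    public static void main(String[] args) {
        // Stand-in for the general meta tags of the test page (assumed content).
        Properties tags = new Properties();
        tags.setProperty("recommended", "plugins");

        System.out.println("with tag:    " + findRecommended(tags));
        System.out.println("without tag: " + findRecommended(new Properties()));
    }
}
```

Because the lookup is a single getProperty call, the enumeration over propertyNames() is not strictly needed to get the same result.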

_________________________plugin.xml_______________________________

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-rec"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as parse-rec.jar -->
      <library name="parse-rec.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
    <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the 
contents of any recommended meta tags -->
   <extension id="org.apache.nutch.parse.rec.RecParseFilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecParseFilter"
                      class="org.apache.nutch.parse.rec.RecParseFilter">
         <parameter name="contentType" value="text/html"/>
         <parameter name="pathSuffix"  value=""/>
      </implementation>
   </extension>
</plugin>
________________________________________________________________

I have added this property to nutch-site.xml:

___________________________nutch-site.xml__________________________
      <property>
        <name>plugin.includes</name>
        <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|rec)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      </property>
________________________________________________________________
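
One way to check the plugin.includes value in isolation is to run plugin ids against it as a regular expression. This is a rough sketch only: PluginIncludesCheck and isIncluded are hypothetical names, and it assumes that a plugin is included when its whole id matches the expression, which may not be exactly how Nutch evaluates the property. The expression lists the same alternatives as the value above, written as a plain regular expression.

```java
import java.util.regex.Pattern;

/** Stand-alone sketch; PluginIncludesCheck is a hypothetical name, not a Nutch class. */
final class PluginIncludesCheck {

    // Same alternatives as the plugin.includes value in nutch-site.xml.
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|rec)"
        + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    /** True when the entire plugin id matches the expression (assumed matching rule). */
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"parse-rec", "parse-html", "parse-pdf"}) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption parse-rec and parse-html are accepted and parse-pdf is not, so the expression itself does not exclude the new plugin.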

I have added these lines to parse-plugins.xml.
I also tried listing only my plugin, with the same results.

___________________________parse-plugins.xml__________________________
        <mimeType name="text/html">
                <plugin id="parse-rec" />
                <plugin id="parse-html" />
        </mimeType>
________________________________________________________________

Finally, I added a line to log4j.properties to make the plugin system log.
But despite this line I get no plugin logs at all.
___________________________log4j.properties__________________________
log4j.logger.org.apache.nutch.plugin=DEBUG
________________________________________________________________
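
It may help to spell out which loggers a line like the one above can reach. log4j loggers form a hierarchy by dotted name, so a level set on org.apache.nutch.plugin applies to that logger and to loggers below it. The sketch below only models that naming rule; LoggerScope and covers are hypothetical names and this is not log4j code.

```java
/** Stand-alone sketch of log4j-style name hierarchy; LoggerScope is a hypothetical name. */
final class LoggerScope {

    /**
     * True when a level configured on configuredName would be inherited by loggerName:
     * either the names are equal, or loggerName continues configuredName after a dot.
     */
    static boolean covers(String configuredName, String loggerName) {
        return loggerName.equals(configuredName)
            || loggerName.startsWith(configuredName + ".");
    }

    public static void main(String[] args) {
        String configured = "org.apache.nutch.plugin";
        String[] loggers = {
            "org.apache.nutch.plugin.PluginRepository",   // plugin system
            "org.apache.nutch.parse.rec.RecParseFilter",  // logger named after the filter class
            "RecParseFilter.class"                        // logger named by a bare string
        };
        for (String name : loggers) {
            System.out.println(name + " -> " + covers(configured, name));
        }
    }
}
```

Only the PluginRepository logger falls under org.apache.nutch.plugin. To see the filter's own debug messages, a separate entry for the name its logger actually uses would be needed, for example log4j.logger.org.apache.nutch.parse.rec=DEBUG if the logger is named after the class.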

After running the fetcher I was expecting to find the "recommended"
meta tag in my segment:

nutch readseg -get test/segments/20061108110142 "http://testmachine.toto.net/index.html"
SegmentReader: get 'http://testmachine.toto.net/index.html'
Content::
Version: 2
url: http://testmachine.toto.net/index.html
base: http://testmachine.toto.net/index.html
contentType: text/html
metadata: nutch.segment.name=20061108110142 nutch.crawl.score=1.0
Content:

Crawl Generate::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Wed Nov 08 10:54:39 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Crawl Fetch::
Version: 4
Status: 6 (fetch_retry)
Fetch time: Wed Nov 08 11:02:46 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 1
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

I then tried to invoke the plugin directly:
nutch plugin parse-rec org.apache.nutch.parse.rec.RecParseFilter

This way I got the plugin logs I wanted in hadoop.log, showing that
the plugin is registered:


.....
2006-11-08 11:07:33,520 DEBUG plugin.PluginRepository - parsing: 
/home/opt/nutch-0.9-dev/plugins/parse-rec/plugin.xml
2006-11-08 11:07:33,526 DEBUG plugin.PluginRepository - plugin: 
id=parse-rec name=Recommended Parser/Filter version=0.0.1 
provider=nutch.orgclass=null
2006-11-08 11:07:33,527 DEBUG plugin.PluginRepository - impl: 
point=org.apache.nutch.parse.HtmlParseFilter 
class=org.apache.nutch.parse.rec.RecParseFilter
2006-11-08 11:07:33,528 DEBUG plugin.PluginRepository - parsing: 
/home/opt/nutch-0.9-dev/plugins/parse-text/plugin.xml
.....
Registered Plugins:
....
2006-11-08 11:07:34,014 INFO  plugin.PluginRepository -         
Recommended Parser/Filter (parse-rec)
....
2006-11-08 11:07:51,827 DEBUG plugin.PluginRepository - parsing: 
/home/opt/nutch-0.9-dev/plugins/parse-rec/plugin.xml
2006-11-08 11:07:51,837 DEBUG plugin.PluginRepository - plugin: 
id=parse-rec name=Recommended Parser/Filter version=0.0.1 
provider=nutch.orgclass=null
...




Re: Nutch 0.9 not loading plugins (sorry very long)

Posted by zzcgiacomini <zz...@echo.fr>.
Sorry, the output of "nutch readseg -get" in my previous posting was
wrong. Here is the actual output:

-Corrado

SegmentReader: get 'http://testmachine.test.net/index.html'
Content::
Version: 2
url: http://testmachine.test.net/index.html
base: http://testmachine.test.net/index.html
contentType: text/html
metadata: Content-Length=345 Connection=close 
ETag="2f4ac-159-421166c12a140" nutch.segment.name=20061108113703 
nutch.crawl.score=1.0 Recommended=plugins 
nutch.content.digest=82e307c71d7476ce729a8e6d3b0de50a 
Accept-Ranges=bytes Server=Apache/2.2.0 (Fedora) Content-Type=text/html; 
charset=UTF-8 date=Wed, 08 Nov 2006 10:37:57 GMT Last-Modified=Tue, 31 
Oct 2006 07:34:53 GMT
Content:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" 
"http://www.w3.org/TR/html4/frameset.dtd">
<HTML>
<HEAD>
<TITLE>
PLUG-IN TEST
</TITLE>
</HEAD>
<meta name="recommended" content="plugins">
<A HREF="http://testmachine.test.net/omniORB/index.html">omniORB</A>
<BR>
<A HREF="http://testmachine.test.net/nutch/index.html">Nutch</A>
</HTML>

Crawl Generate::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Wed Nov 08 11:36:31 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Crawl Fetch::
Version: 4
Status: 5 (fetch_success)
Fetch time: Wed Nov 08 11:37:58 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: 82e307c71d7476ce729a8e6d3b0de50a
Metadata: null

Crawl Parse::
Version: 4
Status: 4 (linked)
Fetch time: Wed Nov 08 11:38:05 CET 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.5
Signature: null
Metadata: null

ParseData::
Version: 5
Status: success(1,0)
Title: PLUG-IN TEST
Outlinks: 2
  outlink: toUrl: http://testmachine.test.net/omniORB/index.html anchor: 
omniORB
  outlink: toUrl: http://testmachine.test.net/nutch/index.html anchor: Nutch
Content Metadata: Connection=close Content-Length=345 
nutch.crawl.score=1.0 nutch.segment.name=20061108113703 
ETag="2f4ac-159-421166c12a140" Recommended=plugins 
nutch.content.digest=82e307c71d7476ce729a8e6d3b0de50a 
Accept-Ranges=bytes Content-Type=text/html; charset=UTF-8 
Server=Apache/2.2.0 (Fedora) Last-Modified=Tue, 31 Oct 2006 07:34:53 GMT 
date=Wed, 08 Nov 2006 10:37:57 GMT
Parse Metadata: OriginalCharEncoding=UTF-8 CharEncodingForConversion=UTF-8

ParseText::
PLUG-IN TEST omniORB Nutch