Posted to user@nutch.apache.org by "Krishnanand, Kartik" <ka...@bankofamerica.com> on 2014/09/10 14:49:09 UTC
Parser plugin not being invoked from nutch jobs

Hi, Nutch Gurus,

I am new to Nutch and would like help getting a custom plugin to execute. I have written a plugin that extracts all the JavaScript URLs from a page and returns them as outlinks wrapped in a Parse object. Ideally, the generated outlinks would then be inserted into the crawldb during one of the crawl phases.
Unfortunately, the plugin is not being invoked, and I would appreciate any assistance in this matter.

I have tried running this on both Windows and Linux machines, to no avail. The setup for the Windows machine is given below. I referred to
http://wiki.apache.org/nutch/WritingPluginExample, http://florianhartl.com/nutch-plugin-tutorial.html, http://sujitpal.blogspot.de/2009/07/nutch-custom-plugin-to-parse-and-add.html, and http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html

I have two questions:


1.       How do I get the plugin invoked so that it generates outlinks?

2.       How do I update the crawldb with the generated outlinks?

Any suggestions would be greatly appreciated.

######################
My nutch-config is given below
########################

<property>
  <name>plugin.folders</name>
  <value>C:\apache-nutch-2.2.1\build</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
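
Nutch looks for each plugin in its own subdirectory of a plugin folder, with a plugin.xml inside that subdirectory. As a quick check of what a given folder actually contains, here is a minimal sketch that lists the subdirectories that look like plugins. The class name PluginFolderCheck is hypothetical and not part of Nutch; the folder to inspect is passed as the first argument.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Nutch.
class PluginFolderCheck {

  // Returns the names of the direct subdirectories of 'folder' that contain a plugin.xml.
  static List<String> pluginDirs(Path folder) throws IOException {
    List<String> found = new ArrayList<String>();
    try (DirectoryStream<Path> children = Files.newDirectoryStream(folder)) {
      for (Path child : children) {
        if (Files.isDirectory(child) && Files.isRegularFile(child.resolve("plugin.xml"))) {
          found.add(child.getFileName().toString());
        }
      }
    }
    Collections.sort(found);
    return found;
  }

  public static void main(String[] args) throws IOException {
    // Defaults to the current directory when no folder is given.
    Path folder = Paths.get(args.length > 0 ? args[0] : ".");
    for (String name : pluginDirs(folder)) {
      System.out.println(name);
    }
  }
}
```

If the folder named in plugin.folders prints nothing here, the plugin directories are probably one level further down.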

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automatically
  activated if they are needed by some activated plugins.
  </description>
</property>

<!-- localeextractor is my custom plugin -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
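
To see which plugin ids the plugin.includes expression above accepts, here is a minimal standalone sketch. It assumes that an id is included only when the whole id matches the expression; the class name PluginIncludesCheck and the list of ids tried in main are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking the expression; not Nutch code.
class PluginIncludesCheck {

  // The plugin.includes value from the configuration above.
  static final String INCLUDES =
      "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
      + "|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)";

  // Assumption: the whole plugin id has to match, not just a prefix of it.
  static boolean isIncluded(String pluginId) {
    return Pattern.matches(INCLUDES, pluginId);
  }

  public static void main(String[] args) {
    String[] ids = {"localeextractor", "parse-html", "parse-js", "nutch-extensionpoints"};
    for (String id : ids) {
      System.out.println(id + " -> " + isIncluded(id));
    }
  }
}
```

Under that assumption, localeextractor and parse-html are accepted, while parse-js and nutch-extensionpoints are not matched by the expression itself.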

######################
My plugin code
########################

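package com.bofa.ecom.search;

import java.net.MalformedURLException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.storage.WebPage.Field;
// Assumption: the Bytes class used below is the Nutch 2.x utility class.
import org.apache.nutch.util.Bytes;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LocaleExtractorFilter implements Parser {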

  private static final Logger LOG = LoggerFactory.getLogger(LocaleExtractorFilter.class);

  private Configuration configuration;

  private static final Set<Field> FIELDS = new HashSet<Field>();

  static {
    FIELDS.add(WebPage.Field.OUTLINKS);
  }

  @Override
  public Collection<Field> getFields() {
    // WebPage fields that this parser declares it needs.
    return FIELDS;
  }

  @Override
  public void setConf(Configuration conf) {
    this.configuration = conf;
  }

  @Override
  public Configuration getConf() {
    return this.configuration;
  }

  /**
   * Extracts the JS links to create outlinks.
   * {@inheritDoc}
   */
  @Override
  public Parse getParse(String url, WebPage page) {
    // Decode the raw page content so the link pattern can be applied to it.
    String stringContent = Bytes.toString(page.getContent());
    Set<Outlink> jsOutlinks = this.addUrlsToBeParsed(stringContent);
    return new Parse(
        page.getText().toString(), page.getTitle().toString(),
        jsOutlinks.toArray(new Outlink[0]), page.getParseStatus());
  }

  private static final Pattern PATTERN_WITH_ASCII_QUOTES =
      Pattern.compile("^(?:.*?goto\\(&#39;(\\w+)&#39;\\).*|.*?OOLPopUp\\(&#39;(.+?&#39;\\)).*)$",
          Pattern.MULTILINE);

  private static final String REDIRECT = "/accounts/redirect.go?target=";
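  /**
   * Extracts URLs from the string content of an HTML file. Links of the following forms are
   * recognized:
   * <ul>
   *   <li>{@code goto} links, for example
   *       {@code &lt;a href='javascript:goto(&#39;billpay&#39;);'&gt;Accounts&lt;/a&gt;}</li>
   *   <li>{@code OOLPopUp} links, matched by the second alternative of the pattern</li>
   * </ul>
   *
   * @param stringContent from which multiple urls can be constructed
   * @return the outlinks built by appending each captured target to {@code REDIRECT}
   */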
  Set<Outlink> addUrlsToBeParsed(String stringContent) {
    Set<Outlink> outlinks = new TreeSet<Outlink>();
    Matcher matcher = PATTERN_WITH_ASCII_QUOTES.matcher(stringContent);
    while (matcher.find()) {
      String url = "";
      try {
        url = new StringBuilder(REDIRECT).append(
            matcher.group(1) != null ? matcher.group(1) : matcher.group(2)).toString();
        outlinks.add(new Outlink(url, ""));
      } catch (MalformedURLException mue) {
        LOG.warn("Error generating outlink urls for " + url, mue);
      }
    }

    return outlinks;
  }

}
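
To exercise the link pattern without running a Nutch job, here is a minimal standalone sketch that applies the same regular expression and REDIRECT prefix to two sample lines. The class name JsLinkPatternDemo and the sample HTML (including the /help/faq.go target) are hypothetical and only serve as an illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone harness; not part of the plugin.
class JsLinkPatternDemo {

  // Same pattern and prefix as in the plugin code above.
  static final Pattern PATTERN = Pattern.compile(
      "^(?:.*?goto\\(&#39;(\\w+)&#39;\\).*|.*?OOLPopUp\\(&#39;(.+?&#39;\\)).*)$",
      Pattern.MULTILINE);
  static final String REDIRECT = "/accounts/redirect.go?target=";

  // Returns one redirect URL per matching line, using whichever group captured.
  static List<String> extractTargets(String content) {
    List<String> urls = new ArrayList<String>();
    Matcher matcher = PATTERN.matcher(content);
    while (matcher.find()) {
      String target = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
      urls.add(REDIRECT + target);
    }
    return urls;
  }

  public static void main(String[] args) {
    String sample =
        "<a href='javascript:goto(&#39;billpay&#39;);'>Accounts</a>\n"
        + "<a href='javascript:OOLPopUp(&#39;/help/faq.go&#39;);'>Help</a>\n";
    for (String url : extractTargets(sample)) {
      System.out.println(url);
    }
  }
}
```

With the pattern as written, the second capture group also includes the closing `&#39;)`, so the OOLPopUp target in this sample comes out as `/help/faq.go&#39;)` rather than `/help/faq.go`.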

##############
Plugin.xml
###############

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="localeextractor" name="Locale extractor Filter" version="1.0.0"
  provider-name="nutch.org">

  <runtime>
    <library name="localeextractor">
      <export name="*" />
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>

  <extension id="com.bofa.ecom.search.LocaleExtractorFilter"
    name="Nutch Links Generator"
    point="org.apache.nutch.parse.Parser">
    <implementation id="parser-localeextractor"
      class="com.bofa.ecom.search.LocaleExtractorFilter" />
  </extension>

</plugin>

##############
Build.xml
###############
<project name="locale-detector" default="jar-core">

  <import file="../build-plugin.xml" />

</project>
