Posted to user@nutch.apache.org by "Krishnanand, Kartik" <ka...@bankofamerica.com> on 2014/09/10 14:49:09 UTC
Parser plugin not being invoked from nutch jobs
Hi, Nutch Gurus,
I am a Nutch newbie and would like help getting a Nutch plugin to execute. I have written a plugin that extracts all the JavaScript URLs and creates outlinks wrapped within a Parse object. Ideally, the generated outlinks would be inserted into the crawldb during one of the crawl phases.
Unfortunately, the plugin is not being invoked and I would appreciate any assistance in this matter.
I have tried running this on both Windows and Linux machines, to no avail. The setup for the Windows machine is given below. I referred to the following pages:
http://wiki.apache.org/nutch/WritingPluginExample
http://florianhartl.com/nutch-plugin-tutorial.html
http://sujitpal.blogspot.de/2009/07/nutch-custom-plugin-to-parse-and-add.html
http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
I would like help with two questions:
1. How do I get the plugin invoked so that it generates outlinks?
2. How do I update the crawldb with the new outlinks?
Any suggestions would be greatly appreciated.
######################
My nutch-config is given below
########################
<property>
<name>plugin.folders</name>
<value>C:\apache-nutch-2.2.1\build</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
<name>plugin.auto-activation</name>
<value>true</value>
<description>Defines if some plugins that are not activated regarding
the plugin.includes and plugin.excludes properties must be automatically
activated if they are needed by some activated plugins.
</description>
</property>
<!-- localeextractor is my custom plugin -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
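As a quick sanity check on the plugin.includes value above, the expression can be tested against plugin ids outside of Nutch. This is only a sketch: it assumes Nutch applies the expression as a full match against each plugin's id (the id attribute in plugin.xml), and the class name PluginIncludesCheck is hypothetical.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Value copied from the plugin.includes property above.
    static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|urlnormalizer-(pass|regex|basic)|scoring-opic|(localeextractor)";

    // Assumption: a plugin is included when the whole id matches the expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"localeextractor", "parse-html", "locale-detector", "nutch-extensionpoints"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption, "localeextractor" (the id in plugin.xml) matches, while "locale-detector" (the project name in build.xml) does not, so the id is the name that matters here.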
######################
My plugin code
########################
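package com.bofa.ecom.search;

import java.net.MalformedURLException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.storage.WebPage.Field;
import org.apache.nutch.util.Bytes; // assumed to be the Nutch 2.x Bytes utility
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LocaleExtractorFilter implements Parser {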
private static final Logger LOG = LoggerFactory.getLogger(LocaleExtractorFilter.class);
private Configuration configuration;
private static final Set<Field> FIELDS = new HashSet<Field>();
static {
FIELDS.add(WebPage.Field.OUTLINKS);
}
@Override
public Collection<Field> getFields() {
// Fields of the WebPage that this parser declares it works with.
return FIELDS;
}
@Override
public void setConf(Configuration conf) {
this.configuration = conf;
}
@Override
public Configuration getConf() {
return this.configuration;
}
/**
* Extracts the JS links to create outlinks.
* {@inheritDoc}
*/
@Override
public Parse getParse(String url, WebPage page) {
// Decode the raw page content so the JavaScript links can be matched.
String stringContent = Bytes.toString(page.getContent());
Set<Outlink> jsOutlinks = this.addUrlsToBeParsed(stringContent);
return new Parse(
page.getText().toString(), page.getTitle().toString(),
jsOutlinks.toArray(new Outlink[0]), page.getParseStatus());
}
private static final Pattern PATTERN_WITH_ASCII_QUOTES =
Pattern.compile("^(?:.*?goto\\('(\\w+)'\\).*|.*?OOLPopUp\\('(.+?'\\)).*)$",
Pattern.MULTILINE);
private static final String REDIRECT = "/accounts/redirect.go?target=";
/**
* The implementation parses the URLs from the string content of HTML files. The URLs are of the
* following format:
* <ul>
* <li>{@code goto} links, Example
* {@code <a href='javascript:goto('billpay');'>Accounts</a>}
* </ul>
*
* @param stringContent from which multiple urls can be constructed
*/
Set<Outlink> addUrlsToBeParsed(String stringContent) {
// A HashSet avoids relying on Outlink being Comparable, which a TreeSet requires.
Set<Outlink> outlinks = new HashSet<Outlink>();
Matcher matcher = PATTERN_WITH_ASCII_QUOTES.matcher(stringContent);
while (matcher.find()) {
String url = "";
try {
url = new StringBuilder(REDIRECT).append(
matcher.group(1) != null ? matcher.group(1) : matcher.group(2)).toString();
outlinks.add(new Outlink(url, ""));
} catch (MalformedURLException mue) {
LOG.warn("Error generating outlink urls for " + url, mue);
}
}
return outlinks;
}
}
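The matching logic in addUrlsToBeParsed can be exercised on its own, without Nutch on the classpath. The sketch below reuses the same pattern and redirect prefix but collects plain strings instead of Outlink objects; the class name GotoLinkDemo and the sample HTML are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GotoLinkDemo {
    // Same pattern and prefix as in the plugin code above.
    private static final Pattern PATTERN = Pattern.compile(
        "^(?:.*?goto\\('(\\w+)'\\).*|.*?OOLPopUp\\('(.+?'\\)).*)$",
        Pattern.MULTILINE);
    private static final String REDIRECT = "/accounts/redirect.go?target=";

    // Returns the redirect URLs built from every matching line, in order of appearance.
    static Set<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<String>();
        Matcher m = PATTERN.matcher(html);
        while (m.find()) {
            String target = m.group(1) != null ? m.group(1) : m.group(2);
            urls.add(REDIRECT + target);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, one link per line.
        String html = "<a href=\"javascript:goto('billpay');\">Accounts</a>\n"
                    + "<a href=\"javascript:goto('transfers');\">Transfers</a>\n"
                    + "<p>no link here</p>";
        System.out.println(extract(html));
    }
}
```

Two things are worth noting from running this kind of check. Each alternative consumes the whole line, so at most one target per line is found. Also, the second capture group ends with '\), so an OOLPopUp target keeps its trailing quote and parenthesis, which may or may not be intended.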
##############
Plugin.xml
###############
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="localeextractor" name="Locale extractor Filter" version="1.0.0"
provider-name="nutch.org">
<runtime>
<library name="localeextractor">
<export name="*" />
</library>
</runtime>
<requires>
<import plugin="nutch-extensionpoints" />
</requires>
<extension id="com.bofa.ecom.search.LocaleExtractorFilter"
name="Nutch Links Generator"
point="org.apache.nutch.parse.Parser">
<implementation id="parser-localeextractor"
class="com.bofa.ecom.search.LocaleExtractorFilter" />
</extension>
</plugin>
##############
Build.xml
###############
<project name="locale-detector" default="jar-core">
<import file="../build-plugin.xml" />
</project>