Posted to user@nutch.apache.org by "Krishnanand, Kartik" <ka...@bankofamerica.com> on 2014/08/29 10:44:41 UTC
How do I pass custom URL filter URL configuration to filter plugins?
Hi, Nutch Gurus,
I have a use case that I need to implement and I hope that someone can help.
I have a situation where I need to generate and build URLs dynamically and pass them to the respective filter.
I want to pass a newly constructed rules string, such as the following, to the filter implementation that normally reads regex-urlfilter.txt.
# URLs to be excluded
-http://foo[a-zA-Z0-9]\.mydomain\.com
-https://foo[a-zA-Z0-9]\.mydomain\.com
# URLs to be crawled
+http://newfoo[a-zA-Z0-9]\.mydomain\.com
+https://newfoo[a-zA-Z0-9]\.mydomain\.com
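For reference, rules in this format are usually evaluated in order: each non-comment line is a `+` or `-` followed by a regular expression, and the first rule that matches decides whether the URL is kept. The following is a minimal stand-alone sketch of that idea using only java.util.regex. It is not Nutch's RegexURLFilter; the class name RuleSketch and its methods are hypothetical, and it assumes that patterns are matched with find() and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Nutch.
final class RuleSketch {
    /** One parsed rule: accept (+) or reject (-) plus its compiled pattern. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses one rule per line; blank lines and lines starting with '#' are skipped. */
    RuleSketch(String rulesText) {
        for (String line : rulesText.split("\\R")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** Returns the decision of the first matching rule; assumed to reject when nothing matches. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }
}
```

With the rules above, `new RuleSketch(rules).accepts(url)` would reject a foo host and keep a newfoo host, because the exclusion pattern requires "foo" to follow the scheme directly.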
From Nutch's RegexURLFilter.java implementation, we have the following setup.
public static final String URLFILTER_REGEX_FILE = "urlfilter.regex.file";
public static final String URLFILTER_REGEX_RULES = "urlfilter.regex.rules";
/**
 * Rules specified as a config property will override rules specified
 * as a config file.
 */
protected Reader getRulesReader(Configuration conf) throws IOException {
    String stringRules = conf.get(URLFILTER_REGEX_RULES);
    LOG.debug("The string rules = " + stringRules);
    if (stringRules != null) {
        LOG.debug("The string rules are not null. Returning a String Reader object.");
        return new StringReader(stringRules);
    }
    String fileRules = conf.get(URLFILTER_REGEX_FILE);
    LOG.debug("The fileRules rules = " + fileRules);
    LOG.debug("Getting the rules as an input stream.");
    return conf.getConfResourceAsReader(fileRules);
}
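The precedence in that method can be exercised without Hadoop on the classpath. The sketch below mirrors the same two branches, with plain maps standing in for the Configuration object and for classpath resources. The class RulesSourceSketch and its helper firstLine are hypothetical names used only for illustration; only the two property keys come from the code above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;

// Hypothetical stand-in; a Map replaces Hadoop's Configuration here.
final class RulesSourceSketch {
    static final String RULES_KEY = "urlfilter.regex.rules";
    static final String FILE_KEY = "urlfilter.regex.file";

    /**
     * Returns the inline rules if the property is set; otherwise falls back
     * to the named resource (looked up in a map instead of the classpath).
     */
    static Reader getRulesReader(Map<String, String> conf, Map<String, String> resources) {
        String stringRules = conf.get(RULES_KEY);
        if (stringRules != null) {
            return new StringReader(stringRules);
        }
        String fileName = conf.get(FILE_KEY);
        String contents = resources.get(fileName);
        if (contents == null) {
            throw new IllegalStateException("Rules resource not found: " + fileName);
        }
        return new StringReader(contents);
    }

    /** Reads the first line from the reader, to show which source was chosen. */
    static String firstLine(Reader reader) {
        try (BufferedReader in = new BufferedReader(reader)) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The fallback branch is taken whenever the rules property is absent from the particular configuration instance handed to the method, regardless of what other instances contain.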
I have a TimerTask implementation that, based on certain conditions, updates the Configuration object.
public class MyTask extends TimerTask {
    private Configuration configuration;
    // Getter and setter omitted.
    @Override
    public void run() {
        // Some backend logic that involves constructing the URL if updated.
        String urlFilterRegexRules = new StringBuilder(. . . . ).toString();
        Map<String, Object> argsMap = new HashMap<>();
        // Random(long) needs an integral seed; the double literal 1e8 does not compile.
        Random random = new Random(100000000L);
        long num = random.nextLong();
        argsMap.put(NUTCH.ARGS_SEEDDIR, "/tmp/seed" + num + ".txt");
        // Override the file-based rules with the dynamically built string.
        this.configuration.set(RegexURLFilter.URLFILTER_REGEX_RULES, urlFilterRegexRules);
        InjectorJob job = new InjectorJob(this.configuration);
        job.run(argsMap);
    }
}
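Since the rules string is wrapped in a StringReader and handed to the same code path as the rules file, it presumably needs the same layout as the file: one rule per line, separated by newlines. Below is a minimal sketch of assembling such a string. The class RulesBuilderSketch and its build method are hypothetical, and the sketch assumes the caller supplies ready-made regular expressions rather than literal host names.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of Nutch.
final class RulesBuilderSketch {
    /**
     * Joins exclusion and inclusion regexes into one newline-separated
     * rules string, exclusions first so that they are tried first.
     */
    static String build(List<String> excluded, List<String> included) {
        StringJoiner rules = new StringJoiner("\n");
        rules.add("# URLs to be excluded");
        for (String regex : excluded) {
            rules.add("-" + regex);
        }
        rules.add("# URLs to be crawled");
        for (String regex : included) {
            rules.add("+" + regex);
        }
        return rules.toString();
    }
}
```

The result could then be passed to configuration.set(...) in place of the elided StringBuilder expression.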
From the logs:
2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:71 - The string rules = null
2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:77 - The fileRules rules = regex-urlfilter.txt
2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:78 - Getting the rules as an input stream.
What am I doing wrong? Any advice would be greatly appreciated. My modified crawler main method is as follows.
Crawler.java
public static void main(String[] args) throws Exception {
    Configuration configuration = NutchConfiguration.create();
    Timer timer = new Timer();
    MyTask myTask = new MyTask();
    myTask.setConfiguration(configuration);
    // Run the task immediately, then every four hours.
    timer.scheduleAtFixedRate(myTask, 0, 4 * 60 * 60 * 1000);
    // crawler is the Tool instance to run; its construction is not shown here.
    ToolRunner.run(myTask.getConfiguration(), crawler, args);
}
Thanks,
Kartik