Posted to user@nutch.apache.org by "Krishnanand, Kartik" <ka...@bankofamerica.com> on 2014/08/29 10:44:41 UTC

How do I pass custom URL filter URL configuration to filter plugins?

Hi, Nutch Gurus,

I have a use case that I need to implement and I hope that someone can help.

I have a situation where I need to generate and build URLs dynamically and pass them to the respective filter.

I want to pass the following newly constructed string to the filter implementation associated with regex-urlfilter.txt for it to parse:

# URLs to be excluded
-http://foo[a-zA-Z0-9]\.mydomain\.com
-https://foo[a-zA-Z0-9]\.mydomain\.com

# URLs to be crawled
+http://newfoo[a-zA-Z0-9]\.mydomain\.com
+https://newfoo[a-zA-Z0-9]\.mydomain\.com
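
To make the intent of rules like these concrete, here is a minimal, self-contained sketch of how a newline-separated rules string with `+`/`-` prefixes could be evaluated. It is not Nutch's code: the class `RuleSketch` and its methods are hypothetical, and it assumes that the first matching rule decides, that a URL matching no rule is rejected, and that the character class in the rules is meant to be `[a-zA-Z0-9]`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RuleSketch {

  /** One parsed rule: whether it accepts, and the pattern it applies. */
  private static final class Rule {
    final boolean accept;
    final Pattern pattern;

    Rule(boolean accept, Pattern pattern) {
      this.accept = accept;
      this.pattern = pattern;
    }
  }

  /** Parses newline-separated rules, skipping blank lines and '#' comments. */
  private static List<Rule> parse(String rules) {
    List<Rule> parsed = new ArrayList<>();
    for (String raw : rules.split("\n")) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Rule must start with + or -: " + line);
      }
      parsed.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
    }
    return parsed;
  }

  /** Assumed semantics: first matching rule decides; no match means rejected. */
  public static boolean accepts(String rules, String url) {
    for (Rule rule : parse(rules)) {
      if (rule.pattern.matcher(url).find()) {
        return rule.accept;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // The whole rule set travels as one string, one rule per line.
    String rules = String.join("\n",
        "# URLs to be excluded",
        "-http://foo[a-zA-Z0-9]\\.mydomain\\.com",
        "-https://foo[a-zA-Z0-9]\\.mydomain\\.com",
        "# URLs to be crawled",
        "+http://newfoo[a-zA-Z0-9]\\.mydomain\\.com",
        "+https://newfoo[a-zA-Z0-9]\\.mydomain\\.com");

    System.out.println(accepts(rules, "http://foo1.mydomain.com/"));
    System.out.println(accepts(rules, "https://newfooX.mydomain.com/"));
  }
}
```

Under these assumptions the first URL is rejected by an exclusion rule and the second is accepted by a `+` rule. The point of the sketch is only that the rules are a single multi-line string, which is the shape a rules property would need to carry.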

From Nutch's RegexURLFilter.java implementation, we have the following setup:

  public static final String URLFILTER_REGEX_FILE = "urlfilter.regex.file";
  public static final String URLFILTER_REGEX_RULES = "urlfilter.regex.rules";

  /**
   * Rules specified as a config property will override rules specified
   * as a config file.
   */
  protected Reader getRulesReader(Configuration conf) throws IOException {
    String stringRules = conf.get(URLFILTER_REGEX_RULES);
    LOG.debug("The string rules = " + stringRules);
    if (stringRules != null) {
      LOG.debug("The string rules are not null. Returning a String Reader object.");
      return new StringReader(stringRules);
    }
    String fileRules = conf.get(URLFILTER_REGEX_FILE);
    LOG.debug("The fileRules rules = " + fileRules);
    LOG.debug("Getting the rules as an input stream.");
    return conf.getConfResourceAsReader(fileRules);
  }
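
The lookup order in that method can be illustrated in isolation. The sketch below uses `java.util.Properties` as a stand-in for Hadoop's `Configuration` so that it runs without Nutch on the classpath; the class `RulesSourceSketch` and the method `describeSource` are hypothetical names. It also shows, purely as an illustration and not as a diagnosis, that a copy of the settings taken before the property is set would still fall back to the file.

```java
import java.util.Properties;

public class RulesSourceSketch {
  public static final String URLFILTER_REGEX_FILE = "urlfilter.regex.file";
  public static final String URLFILTER_REGEX_RULES = "urlfilter.regex.rules";

  /** Mirrors the lookup order: inline rules first, otherwise the rules file. */
  public static String describeSource(Properties conf) {
    String stringRules = conf.getProperty(URLFILTER_REGEX_RULES);
    if (stringRules != null) {
      return "inline";
    }
    return "file:" + conf.getProperty(URLFILTER_REGEX_FILE);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(URLFILTER_REGEX_FILE, "regex-urlfilter.txt");

    // Hypothetical scenario: some component copied the settings earlier.
    Properties earlierCopy = new Properties();
    earlierCopy.putAll(conf);

    // Setting the rules property affects only the instance it is set on.
    conf.setProperty(URLFILTER_REGEX_RULES,
        "+http://newfoo[a-zA-Z0-9]\\.mydomain\\.com");

    System.out.println(describeSource(conf));
    System.out.println(describeSource(earlierCopy));
  }
}
```

In this stand-in, the updated instance reports the inline rules while the earlier copy still reports the file. Whether anything comparable happens between the object updated by the timer task and the one the filter reads is not established here.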

I have a TimerTask implementation that, based on certain conditions, updates the Configuration object.

public class MyTask extends TimerTask {

  private Configuration configuration;

  // Getter and setter.

  @Override
  public void run() {
     // Some backend logic that involves constructing the URL if updated.
     String urlFilterRegexRules = new StringBuilder(. . . . ).toString();

     Map<String, Object> argsMap = new HashMap<>();
     // Random takes a long seed; the double literal 1e8 needs a cast to compile.
     Random random = new Random((long) 1e8);
     long num = random.nextLong();
     argsMap.put(NUTCH.ARGS_SEEDDIR, "/tmp/seed" + num + ".txt");

     this.configuration.set(RegexURLFilter.URLFILTER_REGEX_RULES, urlFilterRegexRules);
     InjectorJob job = new InjectorJob(this.configuration);
     job.run(argsMap);
  }
}

From the logs:

2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:71 - The string rules = null
2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:77 - The fileRules rules = regex-urlfilter.txt
2014-08-28 13:55:36 DEBUG org.apache.nutch.urlfilter.regex.RegexURLFilter:78 - Getting the rules as an input stream.

What am I doing wrong? Any advice would be greatly appreciated. My modified crawler main method is below.

Crawler.java

public static void main(String[] args) throws Exception {
  Configuration configuration = NutchConfiguration.create();
  Timer timer = new Timer();
  MyTask myTask = new MyTask();
  myTask.setConfiguration(configuration);
  // Re-run the task every four hours, starting immediately.
  timer.scheduleAtFixedRate(myTask, 0, 4 * 60 * 60 * 1000);
  // crawler is the Tool to run; its construction is not shown.
  ToolRunner.run(myTask.getConfiguration(), crawler, args);
}

Thanks,

Kartik
