Posted to dev@nutch.apache.org by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org> on 2012/03/07 07:20:24 UTC

[jira] [Commented] (NUTCH-366) Merge URLFilters and URLNormalizers

    [ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224020#comment-13224020 ] 

Chris A. Mattmann commented on NUTCH-366:
-----------------------------------------

I'd favor option #1 here, especially if there is a GSoC student interested. I'd also be willing to help mentor. I'll tag the issue with gsoc2012.
                
> Merge URLFilters and URLNormalizers
> -----------------------------------
>
>                 Key: NUTCH-366
>                 URL: https://issues.apache.org/jira/browse/NUTCH-366
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>              Labels: gsoc2012
>
> Currently Nutch uses two subsystems related to URL validation and normalization:
> * URLFilter: this interface checks if URLs are valid for further processing. Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or removes unneeded URL components, or performs any other URL mangling as necessary. Input URLs are changed, and are returned as result.
> However, various Nutch tools run filters and normalizers in a predetermined order, i.e. normalizers first, and then filters. In some cases, where normalizers are complex and running them is costly (e.g. numerous regex rules, DNS lookups), it would make sense to run some of the filters first (e.g. prefix-based filters that select only certain protocols, or suffix-based filters that select only known "extensions"). This is currently not possible: we always have to run normalizers, only to later throw away URLs because they failed to pass through filters.
> I would like to solicit comments on the following two solutions, and work on implementation of one of them:
> 1) we could make URLFilters and URLNormalizers implement the same interface, and basically make them interchangeable. This way users could configure their order arbitrarily, even mixing filters and normalizers out of order. This is more complicated, but gives much more flexibility, and NUTCH-365 already provides a sufficient framework to implement this, including the ability to define different sequences for different steps in the workflow.
> 2) we could use a property "url.mangling.order" ;) to define whether normalizers or filters should run first. This is simple to implement, but provides only limited improvement - because either all filters or all normalizers would run, they couldn't be mixed in arbitrary order.
> Any comments?
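
For illustration, option 1 from the description above might look roughly like the following in Java. This is only a sketch under stated assumptions, not Nutch code: the names UrlProcessor, UrlPipeline, prefixFilter, fragmentStripper and counting are hypothetical, and the fragment stripper merely stands in for a costly normalizer. The idea shown is that filters and normalizers share one interface, so a cheap filter can be configured to run before an expensive normalizer and reject a URL early.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical common interface (an assumption, not Nutch's actual API).
// A filter returns the URL unchanged or empty; a normalizer returns a rewritten URL.
interface UrlProcessor {
    Optional<String> process(String url);
}

// Runs the configured steps in order and stops at the first rejection.
final class UrlPipeline {
    private final List<UrlProcessor> steps;

    UrlPipeline(List<UrlProcessor> steps) {
        this.steps = new ArrayList<>(steps);
    }

    Optional<String> run(String url) {
        Optional<String> current = Optional.of(url);
        for (UrlProcessor step : steps) {
            current = current.flatMap(step::process);
            if (!current.isPresent()) {
                return current; // rejected: later (possibly costly) steps are skipped
            }
        }
        return current;
    }
}

final class UrlPipelineDemo {
    // Cheap filter: accept only URLs that start with the given prefix.
    static UrlProcessor prefixFilter(String prefix) {
        return url -> url.startsWith(prefix) ? Optional.of(url) : Optional.empty();
    }

    // Stand-in for an expensive normalizer: drop the "#fragment" part.
    static UrlProcessor fragmentStripper() {
        return url -> {
            int i = url.indexOf('#');
            return Optional.of(i < 0 ? url : url.substring(0, i));
        };
    }

    // Wraps a step and counts how often it is actually invoked.
    static UrlProcessor counting(UrlProcessor inner, AtomicInteger calls) {
        return url -> {
            calls.incrementAndGet();
            return inner.process(url);
        };
    }

    public static void main(String[] args) {
        AtomicInteger normalizerCalls = new AtomicInteger();
        UrlPipeline pipeline = new UrlPipeline(Arrays.asList(
                prefixFilter("http://"),                          // filter first
                counting(fragmentStripper(), normalizerCalls)));  // normalizer second

        System.out.println(pipeline.run("ftp://example.org/a#x"));
        System.out.println(pipeline.run("http://example.org/a#x"));
        System.out.println("normalizer calls: " + normalizerCalls.get());
    }
}
```

Option 2 corresponds to a restricted form of the same pipeline in which all filters are grouped before all normalizers, or the reverse; the list-based ordering above is what would allow them to be interleaved.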
