Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/11/26 02:03:03 UTC

[jira] Commented: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

    [ http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ] 
            
Doug Cook commented on NUTCH-409:
---------------------------------

I should also note that this approach is still not optimal (though it is faster for my usage pattern). I'm still running the site-independent regular expressions (ad removal, etc.) on *every* URL; really, they should be run only on the URLs that belong to the set of sites I'm crawling. 

One could imagine a slight extension to the change here, where each filter has a parameter: (A) "run me on all URLs which have passed the prior filters" or (B) "run me only on the non-short-circuit matches." This would allow us to put the RegexURLFilter *after* the PrefixURLFilter and make it a type "A" (site-independent) filter, while the Automaton would be type "B" (site-dependent). Simple code-wise, but a little more complexity in configuration.

Or one could return to the notion of a "super filter" which takes one config file, internally combines these effects and automatically optimizes the filtering. A little more ambitious code-wise, but ultimately easier to use.

At any rate, the attached change is pretty simple and, if not perfect, at least helpful for me; I thought I would share it.

D
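
A minimal sketch of the short-circuit semantics described here (hypothetical names and structure, not the actual patch code; the real filters implement Nutch's URLFilter plugin interface):

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class ShortCircuitChain {

    // The patch reportedly passes a special token in place of the URL to
    // mean "accept this URL and skip the remaining filters."
    static final String PASS = "_PASS_";

    // Each filter maps a URL to: the URL itself (accept), null (reject),
    // or PASS (accept and short-circuit the rest of the chain).
    static String filter(List<UnaryOperator<String>> chain, String url) {
        for (UnaryOperator<String> f : chain) {
            String result = f.apply(url);
            if (result == null) {
                return null;            // rejected: stop, URL is filtered out
            }
            if (PASS.equals(result)) {
                return url;             // short-circuit: accept immediately
            }
            url = result;               // accepted: continue down the chain
        }
        return url;                     // passed every filter
    }
}
```

With a chain ordered as suggested (site-independent filters first, then the short-circuiting prefix filter), a URL that short-circuits never reaches the expensive pattern-matching filters at the end.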

> Add "short circuit" notion to filters to speedup mixed site/subsite crawling
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-409
>                 URL: http://issues.apache.org/jira/browse/NUTCH-409
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Doug Cook
>            Priority: Minor
>         Attachments: shortcircuit.patch
>
>
> In the case where one is crawling a mixture of sites and sub-sites, the prefix matcher can match the sites quite quickly, but either the regex or automaton filters are considerably slower at matching the sub-sites. In the current model of AND-ing all the filters together, the pattern-matching filter will be run on every URL that matches the prefix matcher -- even if that entire site is to be crawled and there are no sub-site patterns. If only a small portion of the sites actually need sub-site pattern matching, this is much slower than it should be.
> I propose (and attach) a simple modification allowing considerable speedup for this usage pattern. I define the notion of a "short circuit" match that means "accept this URL and don't run any of the remaining filters in the filter chain." 
> Though any filter plugin can, in theory, return a short-circuit match with this change, I have only implemented it for the PrefixURLFilter. The configuration file format is backwards-compatible; short-circuit matches just have SHORTCIRCUIT: in front of them.
> One minor "gotcha":
> * Because the shortcircuit matches will avoid running any later filters, all of the site-independent filters need to be BEFORE the PrefixURLFilter in the chain.
> I get my best performance using the following filter chain:
> 1) The SuffixURLFilter to throw away anything with unwanted extensions
> 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping mailto:, bulletin-board pages, etc.)
> 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT the sites needing subsite matching
> 4) The AutomatonURLFilter to match those sites needing subsite pattern matching.
> I have tens of thousands of sites and an order of magnitude fewer subsites, so skipping step #4 90% of the time speeds things up considerably (my reduce time on a round of crawling is down from some 26 hours to less than 10).
> There are only two drawbacks to the implementation, and I think they're pretty minor:
> 1) Because I pass a special token (_PASS_) in the place of the URL to implement the short circuit, if for some reason someone wanted to crawl a URL named "_PASS_", there would be problems. I find this highly unlikely, since that's an invalid URL.
> 2) The correct behavior of steps #3 and #4 above depends upon coordination of the config files between the prefix and automaton filters, making an opportunity for user screwup. I thought about creating a "new kind of filter" which essentially combined prefix & automaton's behaviors, took one config file, and internally handled the short-circuiting. But I think the approach I took is simpler, cleaner, more flexible, and avoids creating yet another kind of filter. Coordinating the config files is pretty easy (I generate them programmatically).
> As this is my first contribution to Nutch I'm sure that there are things I've missed, whether in coding style or desired patch format. I welcome any feedback, suggestions, etc.
> Doug
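
For reference, the coordinated config files for steps #3 and #4 might look like this (site names are purely illustrative; only the SHORTCIRCUIT: prefix syntax comes from the description above):

```
# prefix-urlfilter.txt: whole sites are short-circuited so no later
# filter runs on them; the site needing sub-site patterns falls through.
SHORTCIRCUIT:http://www.siteA.com/
SHORTCIRCUIT:http://www.siteB.com/
http://www.siteC.com/

# automaton-urlfilter.txt: sub-site patterns only for the fall-through site.
+http://www\.siteC\.com/articles/.*
```

As the last drawback notes, the two files must stay in sync: every non-short-circuit prefix needs a corresponding pattern in the automaton file.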

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira