Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/11/26 01:18:01 UTC

[jira] Created: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

Add "short circuit" notion to filters to speedup mixed site/subsite crawling
----------------------------------------------------------------------------

                 Key: NUTCH-409
                 URL: http://issues.apache.org/jira/browse/NUTCH-409
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.8
            Reporter: Doug Cook
            Priority: Minor


When crawling a mixture of sites and sub-sites, the prefix matcher can match the sites quite quickly, but the regex and automaton filters are considerably slower at matching the sub-sites. In the current model of AND-ing all the filters together, the pattern-matching filter is run on every site that matches the prefix matcher -- even if that entire site is to be crawled and there are no sub-site patterns. If only a small portion of the sites actually need sub-site pattern matching, this is much slower than it should be.

I propose (and attach) a simple modification allowing considerable speedup for this usage pattern. I define the notion of a "short circuit" match that means "accept this URL and don't run any of the remaining filters in the filter chain." 

With this change any filter plugin can, in theory, return a short-circuit match, but I have implemented it only for the PrefixURLFilter. The configuration file format is backwards-compatible: short-circuit entries simply have SHORTCIRCUIT: in front of them.
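
As a rough illustration of the config format described above, here is a minimal sketch of how marked and unmarked prefixes might be told apart. Only the SHORTCIRCUIT: marker comes from this issue; the PrefixRules class, its methods, the "#" comment handling and the example hosts are hypothetical, and the linear scan is a simplification for readability, not the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the attached patch: splits prefix-filter config
// lines into ordinary prefixes and short-circuit prefixes.
class PrefixRules {
    // The marker is taken from the issue text; everything else is assumed.
    static final String MARKER = "SHORTCIRCUIT:";

    final List<String> plain = new ArrayList<>();
    final List<String> shortCircuit = new ArrayList<>();

    static PrefixRules parse(List<String> lines) {
        PrefixRules rules = new PrefixRules();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blanks and comments (assumed convention)
            }
            if (line.startsWith(MARKER)) {
                rules.shortCircuit.add(line.substring(MARKER.length()).trim());
            } else {
                rules.plain.add(line); // unmarked lines keep their old meaning
            }
        }
        return rules;
    }

    // Returns "SHORTCIRCUIT", "ACCEPT" or "REJECT" for a URL.
    // A simple linear scan is used here purely for clarity.
    String classify(String url) {
        for (String p : shortCircuit) {
            if (url.startsWith(p)) return "SHORTCIRCUIT";
        }
        for (String p : plain) {
            if (url.startsWith(p)) return "ACCEPT";
        }
        return "REJECT";
    }

    public static void main(String[] args) {
        PrefixRules rules = parse(Arrays.asList(
            "# whole site wanted: later filters are skipped",
            "SHORTCIRCUIT:http://whole.example/",
            "# only part of this site wanted: later filters still run",
            "http://part.example/"));
        System.out.println(rules.classify("http://whole.example/a.html"));
        System.out.println(rules.classify("http://part.example/docs/b.html"));
        System.out.println(rules.classify("http://other.example/"));
    }
}
```

An existing config file with no marked lines parses into ordinary prefixes only, which is the sense in which the format stays backwards-compatible.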

One minor "gotcha":

* Because a short-circuit match skips all later filters, every site-independent filter needs to come BEFORE the PrefixURLFilter in the chain.

I get my best performance using the following filter chain:

1) The SuffixURLFilter to throw away anything with unwanted extensions
2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping mailto:, bulletin-board pages, etc.)
3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT the sites needing subsite matching
4) The AutomatonURLFilter to match those sites needing subsite pattern matching.

I have tens of thousands of sites and an order of magnitude fewer subsites, so skipping step #4 90% of the time speeds things up considerably (my reduce time on a round of crawling is down from some 26 hours to less than 10).

There are only two drawbacks to the implementation, and I think they're pretty minor:

1) Because I pass a special token (_PASS_) in the place of the URL to implement the short circuit, if for some reason someone wanted to crawl a URL named "_PASS_", there would be problems. I find this highly unlikely, since that's an invalid URL.

2) The correct behavior of steps #3 and #4 above depends on keeping the config files of the prefix and automaton filters coordinated, which creates an opportunity for user error. I thought about creating a "new kind of filter" which essentially combined the behaviors of the prefix and automaton filters, took one config file, and internally handled the short-circuiting. But I think the approach I took is simpler, cleaner, more flexible, and avoids creating yet another kind of filter. Coordinating the config files is pretty easy (I generate them programmatically).
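
To make the token mechanism from drawback 1 concrete, here is a minimal sketch of a chain runner that treats _PASS_ as "accept and stop". Only the _PASS_ token comes from this issue; the ShortCircuitChain class, its Filter interface and the example hosts are hypothetical stand-ins, not the Nutch classes or the attached patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the attached patch.
class ShortCircuitChain {
    // Token named in the issue text.
    static final String PASS = "_PASS_";

    // Stand-in for a URL filter: returns the URL to accept it, null to
    // reject it, or PASS to accept it and skip the rest of the chain.
    interface Filter {
        String filter(String url);
    }

    // Runs the filters in order and returns the accepted URL or null.
    static String run(List<Filter> chain, String url) {
        String current = url;
        for (Filter f : chain) {
            String result = f.filter(current);
            if (result == null) {
                return null;      // rejected: stop immediately
            }
            if (PASS.equals(result)) {
                return current;   // short circuit: accept, skip later filters
            }
            current = result;
        }
        return current;
    }

    public static void main(String[] args) {
        // Prefix stage: whole.example short-circuits, part.example continues.
        Filter prefix = u -> u.startsWith("http://whole.example/") ? PASS
                : u.startsWith("http://part.example/") ? u : null;
        // Stand-in for the slower sub-site pattern matcher.
        Filter subsite = u -> u.contains("/docs/") ? u : null;
        List<Filter> chain = Arrays.asList(prefix, subsite);

        System.out.println(run(chain, "http://whole.example/anything.html"));
        System.out.println(run(chain, "http://part.example/docs/page.html"));
        System.out.println(run(chain, "http://part.example/blog/page.html"));
    }
}
```

The sketch also shows where the drawback comes from: a filter that legitimately returned the string _PASS_ as a URL would be indistinguishable from a short-circuit signal.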

As this is my first contribution to Nutch I'm sure that there are things I've missed, whether in coding style or desired patch format. I welcome any feedback, suggestions, etc.

Doug

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

Posted by "Doug Cook (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-409?page=all ]

Doug Cook updated NUTCH-409:
----------------------------

    Attachment: shortcircuit.patch


[jira] Commented: (NUTCH-409) Add "short circuit" notion to filters to speedup mixed site/subsite crawling

Posted by "Doug Cook (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-409?page=comments#action_12452617 ] 
            
Doug Cook commented on NUTCH-409:
---------------------------------

I should also note that this approach is still not optimal (though it is faster for my usage pattern). I'm still running the site-independent regular expressions (ad removal, etc) on *every* URL; really, they should just be run on the URLs which belong to the set of sites I'm crawling. 

One could think of a slight extension to the change here, where each filter has a parameter: (A) "run me on all URLs which have passed the prior filters" or (B) "run me on only the non-shortcircuit matches." This would allow us to put the RegexURLFilter *after* the PrefixURLFilter and make it a type "A" (site-independent) filter, while the AutomatonURLFilter would be type "B" (site-dependent). This is simple code-wise, but adds a little more complexity in configuration.
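
A minimal sketch of what that type A / type B extension might look like, under stated assumptions: the ModalChain class, its Mode and Verdict enums and the Stage wrapper are all hypothetical names invented for illustration, and none of this is in the attached patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed extension, not existing Nutch code.
class ModalChain {
    // (A) run on every URL that survived earlier filters;
    // (B) run only on URLs that were not short-circuit matched.
    enum Mode { ALL_SURVIVORS, NON_SHORTCIRCUIT_ONLY }

    enum Verdict { REJECT, ACCEPT, SHORTCIRCUIT }

    interface Check {
        Verdict check(String url);
    }

    static final class Stage {
        final Mode mode;
        final Check check;

        Stage(Mode mode, Check check) {
            this.mode = mode;
            this.check = check;
        }
    }

    static boolean accepts(List<Stage> chain, String url) {
        boolean shortCircuited = false;
        for (Stage stage : chain) {
            if (shortCircuited && stage.mode == Mode.NON_SHORTCIRCUIT_ONLY) {
                continue; // type B stages are skipped after a short circuit
            }
            Verdict v = stage.check.check(url);
            if (v == Verdict.REJECT) {
                return false;
            }
            if (v == Verdict.SHORTCIRCUIT) {
                shortCircuited = true; // type A stages still run afterwards
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Stage> chain = Arrays.asList(
            // Prefix stage first, so unknown sites are dropped cheaply.
            new Stage(Mode.ALL_SURVIVORS, u ->
                u.startsWith("http://whole.example/") ? Verdict.SHORTCIRCUIT
                : u.startsWith("http://part.example/") ? Verdict.ACCEPT
                : Verdict.REJECT),
            // Site-independent cleanup (type A), now after the prefix stage.
            new Stage(Mode.ALL_SURVIVORS, u ->
                u.contains("/ads/") ? Verdict.REJECT : Verdict.ACCEPT),
            // Site-dependent sub-site matching (type B).
            new Stage(Mode.NON_SHORTCIRCUIT_ONLY, u ->
                u.contains("/docs/") ? Verdict.ACCEPT : Verdict.REJECT));

        System.out.println(accepts(chain, "http://whole.example/page.html"));
        System.out.println(accepts(chain, "http://whole.example/ads/banner.html"));
        System.out.println(accepts(chain, "http://part.example/blog/page.html"));
    }
}
```

With this ordering the site-independent cleanup only sees URLs from sites that are actually being crawled, while short-circuit matched sites still skip the sub-site stage.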

Or one could return to the notion of a "super filter" which takes one config file, internally combines these effects and automatically optimizes the filtering. A little more ambitious code-wise, but ultimately easier to use.

At any rate, the attached change is pretty simple, and at least helpful for me, if not perfect; thought I would share it.

D
