Posted to dev@nutch.apache.org by Doug Cook <na...@candiru.com> on 2006/11/16 17:30:32 UTC

More fetcher speed increases

Hi, folks,

I, too, was slowed down by reduce operations in fetch. Some benchmarking
showed that in my case, the limiting operation was URL filtering. A distant
second was the time spent calculating Levenshtein distances, presumably
part of the spellchecking that Sami just removed to speed things up, though
I haven't looked at it yet.

I've fixed the problem, and my reduce speed is better by about a factor of
three. However, the fix is limited to certain usage patterns.

In my case, I have tens of thousands of sites and subsites I'm crawling, and
I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
essentially use the prefix filter to limit to the set of sites, and then
automaton to pattern-match within those sites. I only have subsite matches
on < 10% of the sites, however, so I was clearly wasting a lot of time
running the automaton patterns on URLs that didn't need them. And automaton,
though much faster than RegexURLFilter, is still dog-slow with that many
patterns.

A simple fix was to extend the current "AND all the filters together" model
to have the notion of a "short-circuit" match, which allows a filter to say
"let this URL through and DON'T run the other filters" by returning a
special token to URLFilters. Now I have a version of PrefixURLFilter that
can return both "normal" matches and "short circuit" matches, and only
returns "normal" matches for those sites that need to run subsite patterns.
It seems to work well: the overhead is negligible when not in use, and the
speedup is massive for my usage pattern.
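
A minimal sketch of the short-circuit idea described above, not the actual
patch. It assumes a hypothetical three-valued Verdict enum in place of the
"special token", and a hypothetical Filter interface rather than the real
Nutch URLFilter API; all names here are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class ShortCircuitSketch {

    /** Hypothetical three-valued result standing in for the "special token". */
    enum Verdict { REJECT, ACCEPT, ACCEPT_AND_STOP }

    /** Hypothetical filter interface; this is not the Nutch URLFilter API. */
    interface Filter {
        Verdict check(String url);
    }

    /** ANDs the filters together, but lets any filter end the chain early. */
    static boolean accepts(List<Filter> chain, String url) {
        for (Filter f : chain) {
            Verdict v = f.check(url);
            if (v == Verdict.REJECT) {
                return false;   // any rejection drops the URL
            }
            if (v == Verdict.ACCEPT_AND_STOP) {
                return true;    // short circuit: skip the remaining filters
            }
        }
        return true;            // every filter returned a normal ACCEPT
    }

    public static void main(String[] args) {
        // Cheap prefix check: a.example needs no subsite patterns, b.example does.
        Filter prefix = url ->
            url.startsWith("http://a.example/") ? Verdict.ACCEPT_AND_STOP
          : url.startsWith("http://b.example/") ? Verdict.ACCEPT
          : Verdict.REJECT;
        // Stand-in for the expensive pattern filter.
        Filter pattern = url ->
            url.contains("/docs/") ? Verdict.ACCEPT : Verdict.REJECT;

        List<Filter> chain = Arrays.asList(prefix, pattern);
        System.out.println(accepts(chain, "http://a.example/anything"));
        System.out.println(accepts(chain, "http://b.example/docs/x"));
        System.out.println(accepts(chain, "http://b.example/other"));
    }
}
```

With the prefix filter placed first, URLs on sites without subsite patterns
never reach the expensive filter, which is where the savings would come from
in this sketch.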

I'd like to contribute it back, if people would find this useful (not that
it's rocket science!).

First, is there anyone out there besides me who would find this useful?

Second, I've been thinking about the best way to handle PrefixURLFilter
configuration. I can see a few options:

1. Have two different config files, one for "normal" matches, and one for
"short-circuit" matches.
2. Have one config file, with a syntax to say "make this pattern a
short-circuit match," and make the default be a "normal" match, so it is
backwards compatible with the current version.
3. Make a new type of filter which internally combines Prefix and Automaton,
takes one config file, and decides internally which patterns should generate
automaton inputs vs "normal" or "short circuit" prefix matches.

Approach #3 requires no changes to the URLFilter model, and makes it
difficult to screw up by making config files which are inconsistent (e.g.
forgetting to put in a prefix pattern for one of the automaton patterns). It
is also the least flexible, requires the most code, and introduces yet
another kind of filter.

I tend to like the changed URLFilter model; it's more flexible, even if it
requires a little more care in configuration (a simple Perl script, in my
case, to generate the config files correctly and consistently). I'm leaning
towards approach #2. I'm thinking something simple, syntax-wise, like
putting SHORTCIRCUIT: before the patterns which should short-circuit. Any
suggestions for a better syntax? Or reasons why I should consider a
different approach?
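
To make the proposed syntax for approach #2 concrete, here is a rough sketch
of how such a config file might be read. The class and method names are
hypothetical, and skipping blank lines and "#" comment lines is an assumption
on my part; only the SHORTCIRCUIT: marker comes from the proposal above.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixConfigSketch {

    /** Marker proposed in the message; unmarked lines stay "normal" matches. */
    static final String MARKER = "SHORTCIRCUIT:";

    /** One configured prefix plus whether a match should short-circuit. */
    static final class Rule {
        final String prefix;
        final boolean shortCircuit;

        Rule(String prefix, boolean shortCircuit) {
            this.prefix = prefix;
            this.shortCircuit = shortCircuit;
        }
    }

    /** Parses config lines into rules; existing files parse as all-normal. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comments carry no pattern.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            boolean shortCircuit = line.startsWith(MARKER);
            String prefix = shortCircuit
                ? line.substring(MARKER.length()).trim()
                : line;
            if (prefix.isEmpty()) {
                continue;   // a bare marker with no prefix is ignored
            }
            rules.add(new Rule(prefix, shortCircuit));
        }
        return rules;
    }
}
```

A file containing the two lines "http://b.example/" and
"SHORTCIRCUIT:http://a.example/" would yield one normal rule and one
short-circuit rule, and a file written for the current filter would parse
unchanged.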

Doug

-- 
View this message in context: http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: More fetcher speed increases

Posted by Doug Cook <na...@candiru.com>.

Done. See http://issues.apache.org/jira/browse/NUTCH-409

This is my first Nutch contribution, so hopefully I've got it right ;-) Any
suggestions/questions/feedback welcome.

Hope this is useful to others.

D


scott green wrote:
> 
> Hi Doug,
> 
> Your idea about PrefixURLFilter and AutomatonURLFilter combination
> sounds interesting. Could you please attach the patch to JIRA? Thanks
> 
> - Scott
> 



Re: More fetcher speed increases

Posted by scott green <sm...@gmail.com>.

Hi Doug,

Your idea about PrefixURLFilter and AutomatonURLFilter combination
sounds interesting. Could you please attach the patch to JIRA? Thanks

- Scott
