Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2011/06/02 21:42:29 UTC

Big regex-urlfilter size

What will be the impact of a big and growing regex-urlfilter?

I ask because there are more and more sites that I want to filter out;
filtering will limit the number of unnecessary pages, at the cost of a lot
of URL checking.
Side question: since I already have pages from those sites in the crawldb,
will they ever be removed? What would be the method to remove them?

-- 
-MilleBii-

Re: Big regex-urlfilter size

Posted by Kirby Bohling <ki...@gmail.com>.
The underlying Automaton documentation is here:

http://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/RegExp.html

From a quick scan of the code, I am pretty sure Nutch just extracts the
RegExp and hands it to that library, without doing any pre-processing or
parsing of its own.
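
For illustration, a minimal sketch of driving that library directly, which
can be handy for trying a pattern before putting it into a filter file. The
pattern and URL below are made-up examples; the calls (RegExp, RunAutomaton)
are the documented dk.brics.automaton API:

import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class AutomatonCheck {
  public static void main(String[] args) {
    // Compile once; RunAutomaton is the table-driven (DFA) matcher.
    // Note: dk.brics patterns match the whole input string; there is no ^ or $.
    RunAutomaton matcher =
        new RunAutomaton(new RegExp("http://www\\.domain\\.com/.*").toAutomaton());

    String url = "http://www.domain.com/some/page.html";
    // run() walks the DFA once over the URL, so matching is linear in its length.
    System.out.println(url + " -> " + matcher.run(url));
  }
}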

Kirby



Re: Big regex-urlfilter size

Posted by MilleBii <mi...@gmail.com>.
I could not find any documentation on the syntax the automaton URL filter
accepts. Any ideas? I will update the wiki accordingly.




-- 
-MilleBii-

Re: Big regex-urlfilter size

Posted by MilleBii <mi...@gmail.com>.
My regexes were not optimized against backtracking, as I did not know about
the issue. A typical one looked like this:

-.*\.domain\.*.*

I guess something like this would be better and cause less backtracking:

http:\/\/www\.domain\..*

As for the automaton filter, no reason a priori; I have simply never looked
at it. Does it use trie-like pattern matching? That would be very fast and
appropriate, I guess.
I will have a go at it and see how it helps. Thanks.
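
For what it is worth, a sketch of how such an exclusion could look in a
regex-urlfilter.txt-style file, assuming the usual conventions of that file
(one Java regex per line, a leading '-' to reject, '+' to accept, first
matching rule wins, '#' for comments); 'domain.com' is only a placeholder,
and note that the second pattern quoted above would also need the leading
'-' to act as an exclusion. Anchoring with ^ and avoiding a leading .* keeps
the engine from rescanning the URL:

# reject domain.com and all of its subdomains
-^https?://([a-zA-Z0-9-]+\.)*domain\.com/
# accept everything not rejected above
+.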




-- 
-MilleBii-

Re: Big regex-urlfilter size

Posted by Julien Nioche <li...@gmail.com>.
As Kirby pointed out, the automaton-based filter should be far more
efficient; its syntax is more restricted than the regex one, but not
dissimilar. What do your filters look like? Any reason why you can't use
the automaton filter instead?

Julien




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Big regex-urlfilter size

Posted by MilleBii <mi...@gmail.com>.
Just for the record, the impact can be very bad if you add too many
regexes. I just finished a test, and the generate step alone got a factor of
20 slower after adding 30 or so regexes to the filter. So beware.
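
To make that comparison concrete, here is a rough, self-contained
micro-benchmark sketch. It is not the Nutch code path itself: the 30
generated patterns and the test URL are invented, and it simply pits a list
of java.util.regex patterns against a single dk.brics.automaton DFA built
from their union:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class UrlFilterBench {
  public static void main(String[] args) {
    // 30 invented exclusion patterns, one per unwanted site.
    List<String> rules = new ArrayList<String>();
    for (int i = 0; i < 30; i++) {
      rules.add("http://www\\.site" + i + "\\.com/.*");
    }

    // Option 1: one java.util.regex Pattern per rule, tried one after another.
    List<Pattern> patterns = new ArrayList<Pattern>();
    for (String r : rules) {
      patterns.add(Pattern.compile(r));
    }

    // Option 2: union all rules into a single deterministic automaton up front.
    Automaton union = Automaton.makeEmpty();
    for (String r : rules) {
      union = union.union(new RegExp(r).toAutomaton());
    }
    union.determinize();
    RunAutomaton dfa = new RunAutomaton(union);

    String url = "http://www.site29.com/some/long/path.html";
    int n = 200000;

    long t0 = System.nanoTime();
    boolean rejectedByRegex = false;
    for (int i = 0; i < n; i++) {
      rejectedByRegex = false;
      for (Pattern p : patterns) {
        if (p.matcher(url).matches()) { rejectedByRegex = true; break; }
      }
    }
    long t1 = System.nanoTime();
    boolean rejectedByDfa = false;
    for (int i = 0; i < n; i++) {
      rejectedByDfa = dfa.run(url);
    }
    long t2 = System.nanoTime();

    System.out.println("regex list: " + rejectedByRegex + ", " + (t1 - t0) / 1e6 + " ms");
    System.out.println("single DFA: " + rejectedByDfa + ", " + (t2 - t1) / 1e6 + " ms");
  }
}

The regex list pays one pass per pattern per URL, while the DFA pays a
single pass no matter how many rules were unioned into it. If I read the
Nutch code correctly, the regex filter also uses find() rather than
matches(), so unanchored patterns starting with .* can cost even more than
this sketch suggests.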




-- 
-MilleBii-

Re: Big regex-urlfilter size

Posted by MilleBii <mi...@gmail.com>.
Indeed, I'm running a vertical search engine too; however, I want to improve
it on several fronts.
Scoring is one way, but it does not prevent uninteresting content from
creeping into the crawldb, which eventually grows too big, wasting resources
for nothing.

Filtering is another way, at the cost of a lot of regexes, hence this
question.

Third, I see crawldb pruning: you want to ditch all URLs that are below a
certain score. I asked about that a long time ago, and the answer was to
write your own MapReduce job for it, which has been a bit too far-fetched
for me so far.

What would be ideal for me is to be able to extract properties about
pages/URLs in whatever phase (scoring, indexing) and to use those properties
during the generate phase as a kind of feedback loop. It is a real pain to
be forced to merge this information into a single score.
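
Since crawldb pruning keeps coming up: below is a rough, untested sketch of
the kind of map-only job that earlier answer was alluding to. It assumes the
Nutch 1.x crawldb layout (MapFiles of Text URL -> CrawlDatum under
crawldb/current) and the old org.apache.hadoop.mapred API; the class name
and the prune.min.score parameter are invented for illustration, and the
output would still have to be swapped in as the new crawldb/current by hand.
Also, if I remember correctly, bin/nutch mergedb can apply the current URL
filters while copying a crawldb (its -filter option), which may already
cover the "drop what the new filters now reject" case without any custom
code; worth checking against your version.

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical usage: hadoop jar ... PruneCrawlDbByScore <crawldb> <output> <minScore>
public class PruneCrawlDbByScore extends Configured implements Tool {

  public static class ScoreFilterMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {
    private float minScore;

    public void configure(JobConf job) {
      minScore = job.getFloat("prune.min.score", 0.0f);
    }

    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, CrawlDatum> out, Reporter reporter) throws IOException {
      // Keep only entries whose score reaches the threshold; the rest are dropped.
      if (datum.getScore() >= minScore) {
        out.collect(url, datum);
      }
    }
  }

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), PruneCrawlDbByScore.class);
    job.setJobName("prune-crawldb-by-score");
    job.setFloat("prune.min.score", Float.parseFloat(args[2]));

    FileInputFormat.addInputPath(job, new Path(args[0], "current"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    job.setMapperClass(ScoreFilterMapper.class);
    // The default identity reducer keeps keys sorted, as MapFileOutputFormat requires.
    job.setNumReduceTasks(1);

    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(NutchConfiguration.create(), new PruneCrawlDbByScore(), args));
  }
}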






-- 
-MilleBii-

Re: Big regex-urlfilter size

Posted by Kirby Bohling <ki...@gmail.com>.
I see from your e-mails that you are modifying the scoring algorithm. The
only other option I see is to write a scoring algorithm that detects content
you don't want to crawl and lowers its score.  As I recall, links with the
highest score are crawled first, so in the end that might be easier.  That
sounds like writing a vertical search engine of some type (either that, or a
spam detector with your personal/custom definition of spam).
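
For what it is worth, the core of that check is small. Here is a sketch of
the kind of helper such a scoring plugin could delegate to; it is not the
actual Nutch ScoringFilter interface (whose methods I won't reproduce from
memory), and the host list and damping factor are invented:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UnwantedHostScorer {
  private final Set<String> unwantedHosts;
  private final float damping;

  public UnwantedHostScorer(Set<String> unwantedHosts, float damping) {
    this.unwantedHosts = unwantedHosts;
    this.damping = damping;
  }

  /** Returns the score to use when sorting URLs for generation. */
  public float adjust(String url, float originalScore) {
    try {
      String host = new URL(url).getHost().toLowerCase();
      for (String bad : unwantedHosts) {
        // Catch the host itself and any of its subdomains.
        if (host.equals(bad) || host.endsWith("." + bad)) {
          return originalScore * damping;  // push it to the back of the queue
        }
      }
    } catch (MalformedURLException e) {
      // Leave the score untouched if the URL cannot be parsed.
    }
    return originalScore;
  }

  public static void main(String[] args) {
    UnwantedHostScorer scorer = new UnwantedHostScorer(
        new HashSet<String>(Arrays.asList("spammy.example", "junk.example")), 0.001f);
    System.out.println(scorer.adjust("http://www.spammy.example/page", 1.0f)); // damped
    System.out.println(scorer.adjust("http://good.example/page", 1.0f));       // unchanged
  }
}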

I know several people on this list or the dev list are writing vertical
search engines; maybe they have more thoughts or info.

Kirby


Re: Big regex-urlfilter size

Posted by MilleBii <mi...@gmail.com>.
Yes, I remember reading that a few years ago.
But frankly, I can't design such a finite automaton by hand, and it would be
ever-changing anyway.

Even adding regexes by hand is most likely a daunting task for me.




-- 
-MilleBii-

Re: Big regex-urlfilter size

Posted by Kirby Bohling <ki...@gmail.com>.
From what I remember of earlier advice, you really want to use the
Automaton filter if at all possible, rather than a series of plain regexes.
Using the Automaton should be linear with respect to the number of
characters in the URL.  Building the actual automaton could be fairly time
consuming, but as you'll be reusing it often, it is likely worth the cost.

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html

A series of Java regexes should also be linear in the number of characters
in the URL, assuming you avoid the specific constructs that cause
backtracking (the primary culprit is anything that effectively tries to
ensure that one group/subgroup is equal to a later group/subgroup).  Each
regex adds to the constant multiple in front of the number of characters.
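
As a concrete illustration of how badly a backtracking construct can behave,
here is a tiny, self-contained demo. It uses nested quantifiers rather than
a backreference (another classic trap), and the pattern and input are
deliberately pathological:

import java.util.regex.Pattern;

public class BacktrackDemo {
  public static void main(String[] args) {
    // Nested quantifiers: on a near-miss input the engine explores
    // exponentially many ways to split the run of 'a's before giving up.
    Pattern bad = Pattern.compile("(a+)+b");
    // Same language written without the nested quantifier: fails fast.
    Pattern good = Pattern.compile("a+b");

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 25; i++) sb.append('a');
    sb.append('c');                       // 25 a's, and no 'b' at the end
    String input = sb.toString();

    long t0 = System.nanoTime();
    boolean g = good.matcher(input).matches();
    long t1 = System.nanoTime();
    boolean b = bad.matcher(input).matches();   // expect a noticeable pause here
    long t2 = System.nanoTime();

    System.out.println("a+b    -> " + g + " in " + (t1 - t0) / 1e6 + " ms");
    System.out.println("(a+)+b -> " + b + " in " + (t2 - t1) / 1e6 + " ms");
  }
}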

I've used the Automaton library, and it works well if you can live within
its limitations (it is a classic regex matcher with a limited set of
operators relative to, say, Perl 5 compatible regular expressions).

I don't have any practical experience with Nutch for a large-scale crawl,
but based upon my experience with regular expressions and the Automaton
library, I know the latter is much faster.  I recall Andrej talking about it
being much faster.  It might also be worthwhile for Nutch to look into
Lucene's optimized versions of Automaton (they ported over several critical
operations for use in Lucene, and for the fuzzy matching used when computing
Levenshtein distance).

I can't seem to find the thread where I saw that advice given, but you can
see the thread where they discuss adding the Automaton URL filter back in
Nutch 0.8; it agrees with my experience using both.

http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html

Kirby


