Posted to user@nutch.apache.org by fxmy wang <fx...@gmail.com> on 2015/01/13 06:16:43 UTC

Proper regex-urlfilter syntax to filter out certain numbers in urls

Hi Nutch users,


We are trying to crawl a forum site with the help of Nutch-2.2.1.

The URLs are like far.boo.com/f?kw=SomeTopic&pn=150
where pn means PageNumber.

The goal is to filter out old posts; say, I want every URL with pn>1000
filtered out.

So in conf/regex-urlfilter.txt I added the pn rule above the '# accept
anything else' line:

        -[*!@]                    # skip certain queries
        -pn=[0-9]{4,}$       # filter out pn>1000
        +.                          # accept anything else

And... no effect :(
After several generate-fetch-parse-updatedb cycles, the URL
far.boo.com/f?kw=SomeTopic&pn=649800 still got fetched.

To verify further, I ran the command below (see [0])
        bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
and pasted 'far.boo.com/f?kw=SomeTopic&pn=649800' in. The output is
        +far.boo.com/f?kw=SomeTopic&pn=649800
It seems Nutch did not filter it out.
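
For reference, here is a minimal sketch of how the rule and the URL quoted
above behave in plain java.util.regex when tested with a partial match
(find). The class and method names are hypothetical. Whether Nutch treats
text after the pattern on the same line (the trailing spaces and the
'# ...' comment) as part of the regex is an assumption that this thread
does not confirm; the sketch only shows what the regex engine does if that
text is kept.

```java
import java.util.regex.Pattern;

public class PnRuleCheck {
    // Partial-match check: true if the regex occurs anywhere in the URL.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "far.boo.com/f?kw=SomeTopic&pn=649800";

        // The rule on its own, without the leading '-' sign.
        String bare = "pn=[0-9]{4,}$";

        // ASSUMPTION: the rest of the line, including the inline comment,
        // is handed to the regex engine unchanged.
        String withComment = "pn=[0-9]{4,}$       # filter out pn>1000";

        System.out.println(found(bare, url));         // true
        System.out.println(found(withComment, url));  // false: '$' is followed by more literal text
        System.out.println(found(bare, "far.boo.com/f?kw=SomeTopic&pn=150")); // false: fewer than 4 digits
    }
}
```

Note that [0-9]{4,} matches any page number with four or more digits, so
pn=1000 itself would also be caught.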

What is the proper way to deal with numbers in URLs?
Did I do something wrong?
Any advice would be much appreciated.

----------------------------------------------------------------------
[0]http://www.mail-archive.com/user%40nutch.apache.org/msg09536.html
----------------------------------------------------------------------

BR, fxmy

Re: Proper regex-urlfilter syntax to filter out certain numbers in urls

Posted by fxmy wang <fx...@gmail.com>.

Hi Sebastian,

I did change runtime/local/conf/regex-urlfilter.txt, so that's not
the problem.

After some unfruitful Googling, I turned to some online Java regex
testers and was finally able to solve this problem.
It seems that for regexes containing numbers to work, you need to provide
a 'full match' regex in the regex-urlfilter.txt file.
So by changing
        -pn=[0-9]{4,}$
to
        -^http://far\.boo\.com/f\?kw=SomeTopic&pn=[0-9]{4,}$

Nutch worked as expected.
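
A minimal sketch of the difference between a partial match (find) and a
full match (matches) in java.util.regex, using the two rules above. The
class and method names are hypothetical, and which of the two modes Nutch
uses internally is not confirmed anywhere in this thread, so this only
illustrates the regex behaviour, not the filter's implementation.

```java
import java.util.regex.Pattern;

public class FullMatchDemo {
    // Partial match: the regex may occur anywhere inside the URL.
    static boolean partial(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    // Full match: the regex must cover the whole URL.
    static boolean full(String regex, String url) {
        return Pattern.compile(regex).matcher(url).matches();
    }

    public static void main(String[] args) {
        String url = "http://far.boo.com/f?kw=SomeTopic&pn=649800";
        String shortRule = "pn=[0-9]{4,}$";
        String fullRule = "^http://far\\.boo\\.com/f\\?kw=SomeTopic&pn=[0-9]{4,}$";

        System.out.println(partial(shortRule, url)); // true
        System.out.println(full(shortRule, url));    // false: the rule does not cover the whole URL
        System.out.println(partial(fullRule, url));  // true
        System.out.println(full(fullRule, url));     // true

        // The full rule requires the scheme, so a URL pasted without
        // "http://" is not matched by it in either mode.
        System.out.println(partial(fullRule, "far.boo.com/f?kw=SomeTopic&pn=649800")); // false
    }
}
```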


2015-01-14 3:13 GMT+08:00 Sebastian Nagel <wa...@googlemail.com>:

> Hi,
>
> the regular expression looks good.
> Which conf/regex-urlfilter.txt has been changed?
>  runtime/local/conf/regex-urlfilter.txt  ?
> If
>  conf/regex-urlfilter.txt is changed
> you need to run "ant runtime" again
> to install the configuration changes
> into runtime/local/conf.
> For distributed mode you need to rebuild
> and deploy after any configuration change
> because configuration files are included
> in the job file.
>
> Sebastian

Re: Proper regex-urlfilter syntax to filter out certain numbers in urls

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

the regular expression looks good.
Which conf/regex-urlfilter.txt has been changed?
runtime/local/conf/regex-urlfilter.txt?
If conf/regex-urlfilter.txt is changed, you need to run "ant runtime"
again to install the configuration changes into runtime/local/conf.
For distributed mode you need to rebuild and deploy after any
configuration change, because configuration files are included in the
job file.

Sebastian
