Posted to user@nutch.apache.org by Lucas Rockwell <lu...@tsw.berkeley.edu> on 2005/05/09 02:43:44 UTC

[Solved - probably] Re: Need help with URL regex

Hi again,

I think I may have it working now, though not exactly the way I want. I 
put this:

+^http://www.myorg.org/*

and

-.

and now it seems to be picking up things only from my site. But I have 
not yet tried to put the ([a-z0-9]*\.)* back in.
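
For anyone checking their own rules, here is a minimal sketch of how I 
understand the filter file to be evaluated. It is written in Python as a 
stand-in, not taken from the Nutch source, and it assumes two things: 
rules are tried top to bottom with the first matching rule deciding, and 
a rule matches if the regex is found anywhere in the URL. The names 
parse_rules and accepts are hypothetical helpers, and myorg.org is just 
my placeholder domain.

```python
import re

# Rules in the style of conf/crawl-urlfilter.txt: a leading "+" accepts,
# a leading "-" rejects, and the rest of the line is a regular expression.
RULES = [
    "# skip URLs containing certain characters as probable queries, etc.",
    r"-[?*!@=]",
    "# limit to the site and its subdomains (dots escaped)",
    r"+^http://([a-z0-9]*\.)*myorg\.org/",
    "# do not accept anything else",
    r"-.",
]


def parse_rules(lines):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose regex is found in the URL decides."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumed: a URL that matches no rule is rejected


rules = parse_rules(RULES)
for url in [
    "http://www.myorg.org/index.html",
    "http://myorg.org/",
    "http://www.myorg.org/search?q=nutch",
    "http://www.example.com/",
]:
    print("+" if accepts(rules, url) else "-", url)

# The shorter rule I am using now is looser than it looks: the dots are
# unescaped and "/*" means "zero or more slashes", so it is really a
# prefix match on http://www.myorg.org and also matches this URL.
loose = re.compile(r"^http://www.myorg.org/*")
print(bool(loose.search("http://www.myorg.org.example.com/")))
```

Note that the input to a test like this is a list of URLs, one per line, 
and not the filter file itself.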

-lucas

On May 8, 2005, at 4:54 PM, Lucas Rockwell wrote:

> Hi all,
>
> I have looked in the archive and followed the instructions in the 
> tutorial, and I am still having problems limiting nutch to just my 
> site.
>
> For instance, the tutorial reads:
>
> 	2.  	Edit the file conf/crawl-urlfilter.txt and replace 
> MY.DOMAIN.NAME with the name of the domain you wish to crawl. For 
> example, if you wished to limit the crawl to the nutch.org domain, the 
> line should read:
> +^http://([a-z0-9]*\.)*nutch.org/
>
> But when I test the above regex according to a comment in the archives 
> on April 16 using:
>
> cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter
>
> I get this for the output:
>
> <snip>
> +# skip URLs containing certain characters as probable queries, etc.
> --[?*!@=]
> -
> +# limit to org site only
> -+^http://([a-z0-9]*\.)*nutch.org/
> -
> +# do not accept anything else
> ++.
> </snip>
>
> So, according to the filter test, the regex in the tutorial does 
> not work. When I use Doug's example from another email 
> (+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$) I also get 
> the "-" sign when I run the test, and the "-[?*!@=]" line gets a "-" 
> sign as well...
>
> So, can anyone out there give me the exact syntax so that nutch will 
> *only* crawl the domain (and subdomain(s)) for the site I want to 
> crawl?
>
> Many thanks.
>
> -lucas
>