Posted to user@nutch.apache.org by Lucas Rockwell <lu...@tsw.berkeley.edu> on 2005/05/09 02:43:44 UTC
[Solved - probably] Re: Need help with URL regex
Hi again,
I think I may have it working now. Not exactly the way I want, but I
put this:
+^http://www.myorg.org/*
and
-.
and now it seems to just be picking up things from my site. But I have
not tried to put the ([a-z0-9]*\.)* back in.
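
For anyone who wants to check rules like these offline, here is a rough
Python sketch of how such a rule list could be evaluated. It assumes the
filter works top to bottom, that the first rule whose regex is found
anywhere in the URL decides, and that a URL matching no rule is rejected.
That is my understanding of the behaviour, not something taken from the
Nutch source. Python's re module stands in for whatever regex engine Nutch
uses, myorg.org is the placeholder from above, and load_rules and accepts
are made-up helper names.

```python
import re

def load_rules(text):
    """Parse filter lines: '+regex' accepts, '-regex' rejects, '#' starts a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule found anywhere in the URL decides;
    a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Hypothetical rule set modelled on the ones in this thread.
RULES = load_rules(r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# limit to the site and its subdomains (dots in the domain escaped here)
+^http://([a-z0-9]*\.)*myorg\.org/
# do not accept anything else
-.
""")

if __name__ == "__main__":
    for url in ["http://www.myorg.org/index.html",
                "http://docs.myorg.org/a/b",
                "http://myorg.org/",
                "http://www.other.com/",
                "http://www.myorg.org/search?q=x"]:
        print("+" if accepts(RULES, url) else "-", url)
```

Under those assumptions the three myorg.org URLs come out with "+", while
the other.com URL and the one containing "?" come out with "-". The order
of the rules matters: with "-." moved to the top, everything would be
rejected.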
-lucas
On May 8, 2005, at 4:54 PM, Lucas Rockwell wrote:
> Hi all,
>
> I have looked in the archive and have followed the instructions in the
> tutorial and I am still having problems limiting nutch to just my
> site.
>
> For instance, the tutorial reads:
>
> 2. Edit the file conf/crawl-urlfilter.txt and replace
> MY.DOMAIN.NAME with the name of the domain you wish to crawl. For
> example, if you wished to limit the crawl to the nutch.org domain, the
> line should read:
> +^http://([a-z0-9]*\.)*nutch.org/
>
> But when I test the above regex according to a comment in the archives
> on April 16 using:
>
> cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter
>
> I get this for the output:
>
> <snip>
> +# skip URLs containing certain characters as probable queries, etc.
> --[?*!@=]
> -
> +# limit to org site only
> -+^http://([a-z0-9]*\.)*nutch.org/
> -
> +# do not accept anything else
> ++.
> </snip>
>
> So, according to the filter test, the regex in the tutorial does
> not work. When I use Doug's example from another email
> (+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$) I also get
> the "-" sign when I run the test, and the "-[?*!@=]" line gets a "-"
> sign too...
>
> So, can anyone out there give me the exact syntax so that nutch will
> *only* crawl the domain (and subdomain(s)) for the site I want to
> crawl?
>
> Many thanks.
>
> -lucas
>