
Posted to user@nutch.apache.org by SriramG <sg...@etrade.com> on 2007/03/22 22:00:26 UTC

Need Help with crawl-urlfilter.txt

I am trying to crawl a wikipedia site.

I want to skip any URL that contains the term Special:

E.g.:
https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
https://wiki.mydomain.com/index.php/Special:Watchlist
https://wiki.mydomain.com/index.php/Special:Contributions/SName
https://wiki.mydomain.com/index.php/Special:Recentchanges

This is my crawl-urlfilter.txt:
-^http://wiki.mydomain.com/index.php/Special:
-^http://wiki.mydomain.com/index.php/Special:*
-^http://wiki.mydomain.com/index.php/Special:*/
-^http://wiki.mydomain.com/index.php/Special:*/*
-^https://wiki.mydomain.com/index.php/Special:Upload
+^https://wiki.mydomain.com/index.php
-.

But I still see the Special: URLs being fetched in the fetcher logs:

2007-03-22 12:52:15,387 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
2007-03-22 12:52:32,128 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
2007-03-22 12:52:32,179 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
2007-03-22 12:52:32,198 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
2007-03-22 12:52:32,322 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
2007-03-22 12:52:32,323 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
2007-03-22 12:52:32,326 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
2007-03-22 12:52:32,339 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
2007-03-22 12:52:32,343 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering


Not sure what's wrong with my regular expressions.

Any help would be appreciated.


-- 
View this message in context: http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Need Help with crawl-urlfilter.txt

Posted by Ravi Chintakunta <ra...@gmail.com>.

Hi Sriram,

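In regex, . matches any single character, and following . with a *
matches any character zero or more times. A * on its own repeats only
the character before it, so Special:* means "Special" followed by zero
or more colons. That is, .* in combination is the wildcard match.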
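
To make the difference between :* and :.* concrete, here is a small sketch using Python's re module as a stand-in for the Java regex engine (the two behave the same for patterns this simple). The variable names are only for illustration, and the http:// test URL is made up to fit the patterns as written.

```python
import re

# Hypothetical URL using the same scheme as the patterns under test.
url = "http://wiki.mydomain.com/index.php/Special:Watchlist"

# "Special:*" is "Special" followed by zero or more ':' characters.
colon_star = re.compile(r"^http://wiki.mydomain.com/index.php/Special:*")
# "Special:.*" is "Special:" followed by any run of characters.
dot_star = re.compile(r"^http://wiki.mydomain.com/index.php/Special:.*")

# A substring search succeeds for both; only the matched span differs.
print(colon_star.search(url).group(0))  # stops right after "Special:"
print(dot_star.search(url).group(0))    # covers the whole URL

# A whole-string match succeeds only with ".*".
print(colon_star.fullmatch(url) is not None)
print(dot_star.fullmatch(url) is not None)
```

Whether the trailing .* changes the filter's verdict therefore depends on whether the filter searches for the pattern anywhere in the URL or requires the whole URL to match, which this thread does not state.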

So modifying your regex to:

-^http://wiki.mydomain.com/index.php/Special:.*

should fix the problem.
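
One more thing worth checking: the exclude rules in the original message begin with http://, while every URL in the fetcher log begins with https://. The sketch below is not Nutch code. It is a toy filter that assumes rules are tried top to bottom, that the first rule whose pattern is found in the URL decides, and that an unmatched URL is rejected. The accepts helper and the https? variant of the rule are illustrative suggestions, not something taken from this thread.

```python
import re

def accepts(rules, url):
    """Toy filter: the first rule whose regex is found in url decides (assumed semantics)."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

http_only = [
    "-^http://wiki.mydomain.com/index.php/Special:.*",
    "+^https://wiki.mydomain.com/index.php",
    "-.",
]
either_scheme = [
    "-^https?://wiki.mydomain.com/index.php/Special:",  # "s?" makes the 's' optional
    "+^https://wiki.mydomain.com/index.php",
    "-.",
]

special = "https://wiki.mydomain.com/index.php/Special:Watchlist"
normal = "https://wiki.mydomain.com/index.php/Telecom"

print(accepts(http_only, special))      # True: the http:// rule never matches an https URL
print(accepts(either_scheme, special))  # False: the Special: page is now excluded
print(accepts(either_scheme, normal))   # True: ordinary pages are still accepted
```

Under those assumptions, an exclude rule written for http:// lets the https Special: pages fall through to the + rule, so the scheme in the rule has to match the scheme of the URLs being crawled.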

- Ravi Chintakunta

