Posted to user@nutch.apache.org by SriramG <sg...@etrade.com> on 2007/03/22 22:00:26 UTC
Need Help with crawl-urlfilter.txt
I am trying to crawl a wikipedia site.
I want to skip any URL that contains the term Special:
E.g.:
https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
https://wiki.mydomain.com/index.php/Special:Watchlist
https://wiki.mydomain.com/index.php/Special:Contributions/SName
https://wiki.mydomain.com/index.php/Special:Recentchanges
This is my crawl-urlfilter.txt:
-^http://wiki.mydomain.com/index.php/Special:
-^http://wiki.mydomain.com/index.php/Special:*
-^http://wiki.mydomain.com/index.php/Special:*/
-^http://wiki.mydomain.com/index.php/Special:*/*
-^https://wiki.mydomain.com/index.php/Special:Upload
+^https://wiki.mydomain.com/index.php
-.
But I still see these URLs in the fetcher logs:
2007-03-22 12:52:15,387 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
2007-03-22 12:52:32,128 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
2007-03-22 12:52:32,179 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
2007-03-22 12:52:32,198 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
2007-03-22 12:52:32,322 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
2007-03-22 12:52:32,323 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
2007-03-22 12:52:32,326 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
2007-03-22 12:52:32,339 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
2007-03-22 12:52:32,343 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering
Not sure what's wrong with my regular expressions.
Any help would be appreciated.
--
View this message in context: http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Need Help with crawl-urlfilter.txt
Posted by Ravi Chintakunta <ra...@gmail.com>.
Hi Sriram,
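In regex, . matches any single character, and * matches the preceding
element zero or more times. So :* in your rules matches zero or more
colons, while .* in combination is the wildcard that matches any
sequence of characters.
Note also that your Special: rules start with http://, but the URLs being
fetched are https://, so those rules never apply to them.
So modifying your regex to:
-^https?://wiki.mydomain.com/index.php/Special:.*
should fix the problem.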
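To check a rule set before running a crawl, a small stand-alone sketch can help. The Python below is not Nutch code. It assumes the filter takes the verdict of the first rule whose pattern is found anywhere in the URL and rejects a URL that no rule matches, which is my understanding of how the regex URL filter behaves. The helper names load_rules and accepts are hypothetical, and the rule sets are trimmed versions of the one in the question. It shows that the scheme in the rule (http vs. https) decides whether a Special: URL is skipped.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url.

    Assumption: first match wins and matching is an unanchored search.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: reject

# Rule written for http:// only, as in the question.
ORIGINAL = """
-^http://wiki.mydomain.com/index.php/Special:
+^https://wiki.mydomain.com/index.php
-.
"""

# Same rule, but the optional 's' lets it match https:// as well.
FIXED = """
-^https?://wiki.mydomain.com/index.php/Special:
+^https://wiki.mydomain.com/index.php
-.
"""

special = "https://wiki.mydomain.com/index.php/Special:Watchlist"
article = "https://wiki.mydomain.com/index.php/Telecom"

for name, text in (("original", ORIGINAL), ("fixed", FIXED)):
    rules = load_rules(text)
    print(name, "special:", accepts(rules, special), "article:", accepts(rules, article))
```

With the http-only rule the Special: URL falls through to the + rule and is accepted. With https? it is rejected by the first rule, while ordinary article pages are still accepted.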
- Ravi Chintakunta