Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2008/09/02 23:10:33 UTC

Skipping certain characters to special urls

Hi,

I would like to exclude URLs that contain 'GO:' from crawling, for example:
http://domain.com/NEW-IMAGE?object=GO:0005737

I added each of the variants below to my crawl-urlfilter.txt (I am using
the 'crawl' command) and tested them, but these types of pages still get
fetched.

Variant #1:
-GO:

Variant #2:
-GO:.*

Variant #3:
-object=GO

I also tried each of the above variants wrapped in double quotation marks,
opening right after the '-' and closing after the last character,
e.g.: -"GO:"
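
As a sanity check on the expressions themselves, here is a minimal Python
sketch (not Nutch code). It assumes that each filter line is a '+' or '-'
sign followed by a regular expression that is searched for anywhere in the
URL, and that the first matching line decides, as the comments in the stock
crawl-urlfilter.txt describe. Python's re.search stands in for the filter's
regex engine, and the helper names and the two rule lists are hypothetical.

```python
import re

URL = "http://domain.com/NEW-IMAGE?object=GO:0005737"

# The three expressions tried above, without the leading '-'.
VARIANTS = ["GO:", "GO:.*", "object=GO"]


def matches(expression, url):
    """Return True if the expression matches anywhere in the URL."""
    return re.search(expression, url) is not None


def first_match_decision(rules, url):
    """Apply '+'/'-' rules top to bottom; the first matching rule decides.

    Returns True (fetch) or False (skip). A URL matching no rule is skipped.
    """
    for rule in rules:
        sign, expression = rule[0], rule[1:]
        if matches(expression, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Does each expression, taken on its own, match the sample URL?
    for expression in VARIANTS:
        print(expression, matches(expression, URL))

    # Hypothetical rule lists: the same two rules in a different order.
    print(first_match_decision(["-GO:", "+^http://domain\\.com/"], URL))
    print(first_match_decision(["+^http://domain\\.com/", "-GO:"], URL))
```

Under those assumptions, all three unquoted expressions match the sample
URL, while the quoted form does not (the URL contains no quotation marks).
So the position of the '-' line relative to any '+' line may matter more
than the expression itself.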

I cannot simply add '=' to the character class under '# skip URLs
containing certain characters as probable queries, etc.':
-[*!@]
because there are other pages with '=' in their URLs that need to be fetched.

Any help is appreciated. 

Thanks,
Karthik