Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2008/09/02 23:10:33 UTC
Skipping certain characters to special urls
Hi,
I would like to exclude URLs containing 'GO:' from crawling, for example:
http://domain.com/NEW-IMAGE?object=GO:0005737
I added the variants described below to my crawl-urlfilter.txt (I use the
'crawl' command to crawl) and tested each one, but pages of this type still
get fetched.
Variant #1:
-GO:
Variant #2:
-GO:.*
Variant #3:
-object=GO
I also tried each of the above variants wrapped in double quotation marks,
starting after the '-' and ending after the last character,
e.g. -"GO:"
I also cannot add '=' to the character class in the rule under '# skip URLs
containing certain characters as probable queries, etc.':
-[*!@]
because there are other pages with '=' that need to be fetched.
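One way to check the variants outside Nutch is a minimal sketch like the following. It assumes the regex URL filter tries rules from top to bottom, lets the first pattern found anywhere in the URL decide, and rejects URLs that match no rule. Python's re module stands in for Java's regex engine here, which should behave the same for patterns this simple. The function first_match_filter and both rule lists are hypothetical and exist only for illustration.

```python
import re


def first_match_filter(rules, url):
    """Return True if the URL is accepted, False if it is rejected.

    Assumed semantics (not confirmed against the Nutch source): rules are
    tried in order, the first pattern found anywhere in the URL decides,
    and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


URL = "http://domain.com/NEW-IMAGE?object=GO:0005737"

# The variants from the post, each as a standalone pattern.
# The quoted form only matches URLs that contain literal '"' characters.
for pattern in ["GO:", "GO:.*", "object=GO", '"GO:"']:
    print(repr(pattern), "found in URL:", bool(re.search(pattern, URL)))

# Hypothetical rule files: the same two rules in different order.
ACCEPT_DOMAIN = ("+", r"^http://([a-z0-9]*\.)*domain\.com/")
REJECT_GO = ("-", "GO:")

accept_first = [ACCEPT_DOMAIN, REJECT_GO]  # the accept rule is reached first
reject_first = [REJECT_GO, ACCEPT_DOMAIN]  # the reject rule is reached first

print("accept rule first:", first_match_filter(accept_first, URL))
print("reject rule first:", first_match_filter(reject_first, URL))
```

If these assumptions hold, rule order is worth checking: a '-' rule placed below an accepting '+' rule for the domain would never be reached for these URLs.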
Any help is appreciated.
Thanks,
Karthik
--
View this message in context: http://www.nabble.com/Skipping-certain-characters-to-special-urls-tp19278456p19278456.html