Posted to user@nutch.apache.org by cemsoft <bc...@yahoo.com> on 2009/02/19 15:23:25 UTC
fetch pattern
Hi,
How or where can I restrict which URLs are fetched while crawling?
I want to index only pages whose links have a certain format, e.g.
http://www.myCompany.com/myServlet?
(At the moment the crawl picks up every link under my company host, but I
need more filtering.)
My current filter is:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*myCompany.com/
What I want is to index all pages whose link starts with
"http://www.myCompany.com/myServlet?".....
Thanks for any ideas.
Regards,
cem
--
View this message in context: http://www.nabble.com/fetch-pattern-tp22101517p22101517.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: fetch pattern
Posted by ahammad <ah...@gmail.com>.
Hello,
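All you need to do is change that line to:
+^http://([a-z0-9]*\.)*myCompany\.com/myServlet\?
Note the backslashes. An unescaped ? makes the preceding "t" optional
instead of matching a literal question mark, and an unescaped . matches any
character. With this rule the filter accepts only URLs on myCompany.com or
any of its subdomains whose path starts with /myServlet?.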
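If you want to check a rule before starting a crawl, here is a minimal sketch in Python. It is not Nutch code. It assumes the filter file is read top to bottom, that the first rule whose regex is found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is dropped. The accepts() helper, the rule lists and the test URLs are all made up for illustration, and Python's re.search stands in for the Java regex engine, which behaves the same way for patterns this simple.

```python
import re


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule (hypothetical helper)."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


# Unescaped version: "?" makes the final "t" optional, "." matches any character.
UNESCAPED = [("+", r"^http://([a-z0-9]*\.)*myCompany.com/myServlet?")]

# Escaped version: literal dot and literal question mark.
ESCAPED = [("+", r"^http://([a-z0-9]*\.)*myCompany\.com/myServlet\?")]

# Assumption: a rule rejecting query characters sits above the accept rule.
SHADOWED = [("-", r"[?*!@=]")] + ESCAPED

if __name__ == "__main__":
    for url in (
        "http://www.myCompany.com/myServlet?id=1",
        "http://www.myCompany.com/myServletFoo.html",
        "http://www.myCompany.com/index.html",
    ):
        print(url, accepts(UNESCAPED, url), accepts(ESCAPED, url), accepts(SHADOWED, url))
```

The SHADOWED case is worth a look in your own setup: if your filter file still has a line like -[?*!@=] above the accept rule, every URL containing a ? is rejected before the new rule is reached, so that line would have to be removed or moved below it.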
For further filtering, there are other options you can adjust; they are
listed in nutch-default.xml. Crawl with the default settings first, and if
you get too many (or too few) results, look through that file and override
the relevant properties in nutch-site.xml.
Cheers