Posted to user@nutch.apache.org by cemsoft <bc...@yahoo.com> on 2009/02/19 15:23:25 UTC

fetch pattern


hi

How or where can I define which URLs are fetched while crawling?
I want to index only the pages that have a certain link format, e.g.

http://www.myCompany.com/myServlet?
(While crawling I currently get all the links under my company host, but I need
more filtering.)

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*myCompany.com/

I want to index all pages whose link starts with
"http://www.myCompany.com/myServlet?".....

Thanks for any ideas.

regards
cem
-- 
View this message in context: http://www.nabble.com/fetch-pattern-tp22101517p22101517.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: fetch pattern

Posted by ahammad <ah...@gmail.com>.
Hello,

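All you would need to do is change that line to:

+^http://([a-z0-9]*\.)*myCompany\.com/myServlet\?

The filter will then accept only URLs on myCompany.com, or any of its
subdomains, whose path starts with /myServlet?. The "?" and the "." in the
host name are escaped because they are regex metacharacters: an unescaped "?"
makes the preceding "t" optional instead of matching a literal question mark.

If your filter file still has the default rule that skips URLs containing
characters such as "?" (-[?*!@=]), make sure it does not come before this
line, since the first matching rule wins.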
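To see why escaping "?" and "." matters, here is a minimal sketch in Python. It is not Nutch code: the helper accepts() is hypothetical and only imitates the assumed behaviour of the URL filter (rules are tried in order, the first pattern found in the URL decides, and a URL matching no rule is rejected). Nutch itself uses Java regular expressions, but these particular constructs behave the same way in Python.

```python
import re


def accepts(rules, url):
    """Hypothetical stand-in for the URL filter: first matching rule wins."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Patterns are searched for, not anchored, unless they start with ^.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


# "?" makes the final "t" optional and "." matches any character.
UNESCAPED = [r"+^http://([a-z0-9]*\.)*myCompany.com/myServlet?"]

# "\?" and "\." match a literal question mark and a literal dot.
ESCAPED = [r"+^http://([a-z0-9]*\.)*myCompany\.com/myServlet\?"]

SAMPLE_URLS = [
    "http://www.myCompany.com/myServlet?id=1",
    "http://www.myCompany.com/myServlets/index.html",
    "http://www.myCompany.com/about.html",
]

for url in SAMPLE_URLS:
    print(url, accepts(UNESCAPED, url), accepts(ESCAPED, url))
```

Both rule sets accept the first URL and reject the third, but only the unescaped one also lets the second URL through, because "myServlet?" matches "myServlet" with or without a following question mark.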

In terms of filtering, there are other options that you can play with in
nutch-default.xml. Crawl with the default settings first, and if you get too
many (or too few) results, start looking at the nutch-default.xml file.

Cheers


