Posted to user@nutch.apache.org by Markus Thomas <Ma...@Hamburg.de> on 2009/01/10 12:18:39 UTC

Crawl the Internet - Limit the fetchlist of unfetched urls

Hello everyone,

first of all, I am new to Nutch. I installed Nutch on my internet server 
and tried to start crawling the internet.
I understood that there are two ways to generate a fetchlist. 
First, using the parameter -topN to generate a limited list of the 
top-scoring URLs. Second, without the -topN parameter it generates a 
fetchlist of all unfetched URLs. That's what I want to do now, but I 
don't want to fetch ALL uncrawled URLs at once.
So is there a way to crawl unfetched URLs, but limit that to 1000 
URLs, or some other approach?


Thank you and best regards,
Markus Thomas

Re: Crawl the Internet - Limit the fetchlist of unfetched urls

Posted by Dennis Kubes <ku...@apache.org>.

Markus Thomas wrote:
> Hello everyone,
> 
> first of all, I am new to Nutch. I installed Nutch on my internet server 
> and tried to start crawling the internet.
> I understood that there are two ways to generate a fetchlist. 
> First, using the parameter -topN to generate a limited list of the 
> top-scoring URLs. Second, without the -topN parameter it generates a 
> fetchlist of all unfetched URLs. That's what I want to do now, but I 
> don't want to fetch ALL uncrawled URLs at once.
> So is there a way to crawl unfetched URLs, but limit that to 1000 
> URLs, or some other approach?

Yes. First you would need to inject a list of URLs.  Search the Nutch 
mailing list or take a look at the wiki for injecting the DMOZ database.  
That will give you a starting point.
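
As a rough sketch (assuming a Nutch 1.x style layout, with the seed 
URLs in plain text files under a local urls/ directory and the crawl 
data under crawl/), the inject step looks like this:

  bin/nutch inject crawl/crawldb urls/

The wiki also describes a DmozParser tool that can turn the DMOZ RDF 
dump into such a seed list before injecting it.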

All URLs start off with the same score.  Using -topN once a list is 
injected will limit the fetchlist to only X URLs.  So you would inject 
once, then loop on the (generate, fetch, update crawldb) cycle for x 
number of segments, then either merge the segments and index, or index 
and merge the indexes, or deploy the shards out to individual servers.  
Rinse, lather, repeat: start the whole process over again from generate.
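
A minimal sketch of that cycle with the fetchlist capped at 1000 URLs 
(command names as in Nutch 1.x; the directory names are just examples):

  # generate a fetchlist of at most the 1000 highest-scoring unfetched URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000

  # pick up the segment that generate just created
  segment=$(ls -d crawl/segments/* | tail -1)

  # fetch it and feed the results back into the crawldb
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment

Each pass creates one new segment with at most 1000 URLs; repeat the 
cycle as often as you like, then merge and index the segments once you 
have enough of them.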

Dennis

> 
> 
> Thank you and best regards,
> Markus Thomas