Posted to dev@nutch.apache.org by "Daniel D." <nu...@gmail.com> on 2005/06/13 15:38:36 UTC

Crawling method control !!

Hi,

 As I understand it, the Nutch crawler employs crawl-and-stop-with-threshold,
controlled by the -topN parameter. Please correct me if I'm wrong. This also
means that some sites will end up crawled to a different depth than others.

Is there a way to control the crawl depth per domain and the number of URLs
per domain, as well as the total number of domains crawled (which in this
case would be -topN)?
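
For comparison, the only explicit knobs I can see are the ones on the
one-step crawl tool, roughly as below (a sketch; I'm assuming the
-depth/-topN flags of the intranet "bin/nutch crawl" tool, and the urls /
crawl.test names are just example names):

  # one-step intranet crawl: global depth and size limits,
  # but (as far as I can tell) nothing per-domain
  bin/nutch crawl urls -dir crawl.test -depth 3 -topN 1000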

 Thanks,

 Daniel

Re: Crawling method control !!

Posted by "Daniel D." <nu...@gmail.com>.
Gents,

 I have one more question. I hope someone will respond!

 The whole-web crawling tutorial advises the following command sequence:

  fetch

  updatedb db

and then

  generate db segments -topN 1000
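
Spelled out, the fetch cycle I take from the tutorial looks roughly like
this (a sketch from memory; segment paths and exact flags may differ by
Nutch version):

  # generate a fetchlist of the 1000 top-scoring URLs
  bin/nutch generate db segments -topN 1000
  s1=`ls -d segments/2* | tail -1`   # the segment just generated
  # fetch it, then fold the newly discovered links back into the WebDB
  bin/nutch fetch $s1
  bin/nutch updatedb db $s1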

 Use of the -topN parameter implies that updatedb db does some analysis on
the fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool) is
not mentioned in the tutorial, but the DissectingTheNutchCrawler article
(http://wiki.apache.org/nutch/DissectingTheNutchCrawler) does include this
command in its sequence of commands for whole-internet crawling.

 When should I use the analyze command, and when can I safely skip it?
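
For reference, the invocation I have in mind is the one from that article,
something like the line below (a sketch; I may be misremembering the exact
syntax, and the iteration count is just an example):

  # run link analysis over the WebDB before generating the next fetchlist;
  # the trailing number is the number of score-propagation iterations
  bin/nutch analyze db 5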

 I'm trying to get a sense of how much disk space and RAM the WebDB will
require, and now I'm also concerned about how much of the machine's
resources analyze will consume. Nobody has provided this information yet;
I would appreciate it if somebody could share their knowledge and thoughts
here.

 I'm looking for something like: for 1,000,000 documents the WebDB will take
approximately XX GB, and running bin/nutch updatedb on 1,000,000 documents
will use up to XX MB of RAM.

 Thanks,

Daniel


On 6/13/05, Daniel D. <nu...@gmail.com> wrote: 
> 
> Hi,
> 
> As I understand it, the Nutch crawler employs crawl-and-stop-with-threshold,
> controlled by the -topN parameter. Please correct me if I'm wrong. This also
> means that some sites will end up crawled to a different depth than others.
> 
> Is there a way to control the crawl depth per domain and the number of URLs
> per domain, as well as the total number of domains crawled (which in this
> case would be -topN)?
> 
>  Thanks,
> 
>  Daniel
>