Posted to dev@nutch.apache.org by "Daniel D." <nu...@gmail.com> on 2005/06/13 15:38:36 UTC
Crawling method control !!
Hi,
As I understand it, the Nutch crawler employs crawl-and-stop-with-threshold, where the threshold is set via the -topN parameter. Please correct me if I'm wrong. This also means that some sites will be crawled to a different depth than others.
Is there a way to control the crawl depth per domain and the number of URLs per domain, as well as the total number of domains crawled (in this case it's -topN)?
Thanks,
Daniel
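The crawl-and-stop-with-threshold behavior described above can be sketched in a few lines. This is a toy model, not Nutch's actual generator code, and the max_per_host knob is hypothetical, added only to illustrate the per-domain cap Daniel asks about:

```python
# Toy sketch of crawl-and-stop-with-threshold (not Nutch's actual code):
# each generate round ranks all unfetched URLs by score and keeps only
# the top N overall, so well-linked sites win more slots and end up
# crawled deeper than others.
from urllib.parse import urlsplit

def generate_topn(scores, top_n, max_per_host=None):
    """Pick the top_n best-scoring URLs, optionally capping URLs per host.

    max_per_host is a hypothetical knob illustrating the per-domain
    limit asked about in the thread; it is not a Nutch option here.
    """
    picked, per_host = [], {}
    for url in sorted(scores, key=scores.get, reverse=True):
        host = urlsplit(url).netloc
        if max_per_host is not None and per_host.get(host, 0) >= max_per_host:
            continue  # this host already used up its slots
        picked.append(url)
        per_host[host] = per_host.get(host, 0) + 1
        if len(picked) == top_n:
            break
    return picked

scores = {
    "http://site-a/p1": 0.9, "http://site-a/p2": 0.8,
    "http://site-a/p3": 0.7, "http://site-b/p1": 0.2,
}
# Without a per-host cap, site-a fills every slot:
print(generate_topn(scores, 3))
# With max_per_host=2, site-b gets a slot too:
print(generate_topn(scores, 3, max_per_host=2))
```

Because the ranking is global, a heavily linked site can occupy every topN slot; a per-host cap is one way a generator could spread slots across domains.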
Re: Crawling method control !!
Posted by "Daniel D." <nu...@gmail.com>.
Gents,
I have one more question. I hope someone will respond!
The whole-web crawling tutorial advises the following command sequence:
fetch
updatedb db
and then generate db segments -topN 1000.
Use of the topN parameter implies that updatedb db does some analysis on the fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool) is not mentioned in the tutorial.
The DissectingTheNutchCrawler article (http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes this command in its sequence of commands for whole-internet crawling.
When should I use the analyze command, and when can I skip it?
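For background, analyze invoked net.nutch.tools.LinkAnalysisTool, which iteratively computed link-based page scores over the WebDB; a PageRank-style power iteration is a reasonable mental model. The sketch below is illustrative only, assuming a tiny in-memory graph and a 0.85 damping factor; it is not Nutch's implementation:

```python
# Toy power-iteration link analysis (a PageRank-style sketch, not
# Nutch's LinkAnalysisTool). `graph` maps each page to its outlinks.
def link_scores(graph, damping=0.85, iters=100):
    pages = list(graph)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page keeps a teleport share, then receives an equal
        # split of each inlinking page's damped score.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if not outlinks:
                continue
            share = damping * score[p] / len(outlinks)
            for q in outlinks:
                if q in new:
                    new[q] += share
        score = new
    return score

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
s = link_scores(graph)
# "c" collects links from both "a" and "b" and ends up ranked highest.
print(sorted(s, key=s.get, reverse=True))
```

Running such an analysis before generate lets topN selection favor well-linked pages instead of treating all unfetched URLs equally, which may be why the whole-internet walkthrough includes it while the basic tutorial omits it.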
I'm trying to get a sense of how much memory (hard drive and RAM) the WebDB will require, and now I'm also concerned about how many machine resources analyze will consume. Nobody has provided this information yet. I would appreciate it if somebody would share their knowledge and thoughts here.
I'm looking for something like: for 1,000,000 documents, the WebDB will take approximately XX GB, and running bin/nutch updatedb on 1,000,000 documents will use up to XX MB of RAM.
Thanks,
Daniel
On 6/13/05, Daniel D. <nu...@gmail.com> wrote:
>
> Hi,
>
> As I understand it, the Nutch crawler employs crawl-and-stop-with-threshold, where the threshold is set via the -topN parameter. Please correct me if I'm wrong. This also means that some sites will be crawled to a different depth than others.
>
> Is there a way to control the crawl depth per domain and the number of URLs per domain, as well as the total number of domains crawled (in this case it's -topN)?
>
> Thanks,
>
> Daniel
>