Posted to dev@nutch.apache.org by "Daniel D." <nu...@gmail.com> on 2005/06/16 17:06:30 UTC

Analyze command purpose ....

Dear Nutch Developers,

I'm trying to get answers to the questions below, but nobody has responded yet, 
so I'm posting them again.

----------- Question # 1 ------------------------
As I understand it, the Nutch crawler employs a crawl-and-stop-with-threshold 
strategy when the -topN parameter is used. Please correct me if I'm wrong. This 
also means that some sites will be crawled to a different depth than others.

Is there a way to control the crawl depth per domain and the number of URLs 
per domain, as well as the total number of domains crawled (in this case that 
is -topN)?

----------- Question # 2 ------------------------
The whole-web crawling tutorial advises using the following command 
sequence:

Fetch

updatedb db

and then generate db segments -topN 1000

Use of the topN parameter implies that updatedb db does some analysis on the 
fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool) is not 
mentioned in the tutorial, but the DissectingTheNutchCrawler article 
(http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes this command 
in the sequence of commands for whole-internet crawling.

When should I use the analyze command, and when can I skip it?

I'm also trying to get a sense of how much storage (hard drive and RAM) the 
WebDB will require, and now I'm concerned about how many machine resources 
analyze will consume. Nobody has provided this information yet. I would 
appreciate it if somebody would share their knowledge and thoughts here.

I'm looking for something like: for 1,000,000 documents the WebDB will take 
approximately XX GB, and running bin/nutch updatedb on 1,000,000 documents 
will use up to XX MB of RAM.


----------- Question # 3 ------------------------


After the initial inject and the subsequent fetch and updatedb command(s), can 
I use inject to add more URLs to the WebDB?

I would greatly appreciate your help.

 Thanks,
Daniel

Re: Analyze command purpose ....

Posted by "Daniel D." <nu...@gmail.com>.
Andrzej,
Thanks a lot for the answers.
Sorry for being persistent in my posts... I was about to leave on vacation for 3 
weeks and needed to finish my work beforehand. I appreciate your help.
Regards,
Daniel


Re: Analyze command purpose ....

Posted by Andrzej Bialecki <ab...@getopt.org>.
Daniel D. wrote:
> Dear Nutch Developers,
> 
> I'm trying to get answers to the questions below, but nobody has responded 
> yet, so I'm posting them again.

Hi Daniel,

Please see my answers below. Sometimes it takes patience; people have 
busy schedules...

> 
> ----------- Question # 1 ------------------------
> As I understand it, the Nutch crawler employs a crawl-and-stop-with-threshold 
> strategy when the -topN parameter is used. Please correct me if I'm wrong. 
> This also means that some sites will be crawled to a different depth than others.

Yes and no - some pages that are located deep could have a high score 
(because of many inlinks), so they would be put on the list for 
fetching, even though pages that are closer to the root URL may not have 
been fetched yet, or indeed will never be fetched because they score too 
low.

> 
> Is there a way to control the crawl depth per domain and the number of URLs 
> per domain, as well as the total number of domains crawled (in this case that 
> is -topN)?

-topN controls fetching by score. What you want is to control fetching 
by depth. Currently the FetchListTool doesn't implement this, but it 
would be trivial to add.

> 
> ----------- Question # 2 ------------------------
> The whole-web crawling tutorial advises using the following command 
> sequence:
> 
> Fetch
> 
> updatedb db
> 
> and then generate db segments -topN 1000
> 
> Use of the topN parameter implies that updatedb db does some analysis on the 
> fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool) is not 
> mentioned in the tutorial, but the DissectingTheNutchCrawler article 
> (http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes this command 
> in the sequence of commands for whole-internet crawling.
> 
> When should I use the analyze command, and when can I skip it?

With the default settings you don't need to use this command. Nutch 
approximates the full web-graph scoring by using scoring based on the 
number of inlinks. Additionally, this command is known to be slightly 
broken...
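
To make that concrete: the per-round sequence from the tutorial is all you 
need, and analyze never has to appear in it. The segment directory name below 
is only a placeholder for whatever generate creates under segments/:

  bin/nutch generate db segments -topN 1000
  bin/nutch fetch segments/20050616120000
  bin/nutch updatedb db segments/20050616120000

You simply repeat these three steps for each fetch round.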

> 
> I'm also trying to get a sense of how much storage (hard drive and RAM) the 
> WebDB will require, and now I'm concerned about how many machine resources 
> analyze will consume. Nobody has provided this information yet. I would 
> appreciate it if somebody would share their knowledge and thoughts here.

Don't use analyze - it will consume any disk space that you throw at it ;-)

WebDB normally consumes ca. 2kB per page. This may temporarily increase 
to 3x this number during DB updating.
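
To put a rough number on the 1,000,000-document case you ask about below 
(treat this as a back-of-the-envelope estimate, since 2kB/page is only an 
average):

  1,000,000 pages x ~2 kB/page   =~ 2 GB for the WebDB itself
  temporary peak during updatedb =~ 3 x 2 GB = 6 GB of free disk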

> 
> I'm looking for something like: for 1,000,000 documents the WebDB will take 
> approximately XX GB, and running bin/nutch updatedb on 1,000,000 documents 
> will use up to XX MB of RAM.

The last figure depends on the settings of your JVM, i.e. what heap size 
you set for the JVM. Updatedb should not consume much memory in any case.
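
Just as an illustration - and this part is an assumption on my side, since it 
depends on how your bin/nutch script builds the java command line - a 512 MB 
cap is the standard JVM flag -Xmx512m. Some versions of the script read a 
NUTCH_HEAPSIZE variable (in MB); if yours does not, edit the -Xmx value inside 
the script directly:

  # hypothetical - check whether your bin/nutch honours NUTCH_HEAPSIZE
  NUTCH_HEAPSIZE=512 bin/nutch updatedb db segments/20050616120000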

> 
> 
> ----------- Question # 3 ------------------------
> 
> 
> After the initial inject and the subsequent fetch and updatedb command(s), can 
> I use inject to add more URLs to the WebDB?

Yes, of course.
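
For example (the flag name is from memory, so check the usage message that 
bin/nutch inject prints for your version), injecting a plain text file with 
one URL per line would look roughly like:

  bin/nutch inject db -urlfile more-urls.txt

The new pages are then picked up by the next generate/fetch/updatedb round.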

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com