Posted to dev@nutch.apache.org by "Daniel D." <nu...@gmail.com> on 2005/06/16 17:06:30 UTC
Analyze command purpose ....
Dear Nutch Developers,
I'm trying to get answers to my questions below, but nobody has responded.
This is why I'm posting my questions again.
----------- Question # 1 ------------------------
As I understand it, the Nutch crawler employs crawl-and-stop with a
threshold, controlled by the -topN parameter. Please correct me if I'm
wrong. This also means that some sites will be crawled to a different
depth than others.
Is there a way to control the crawl depth per domain and the number of
URLs per domain, as well as the total number of domains crawled (in this
case it's -topN)?
----------- Question # 2 ------------------------
The whole-web crawling tutorial advises the following command sequence:
fetch
updatedb db
and then generate db segments -topN 1000
Use of the topN parameter implies that updatedb db does some analysis on
the fetched data. The analyze command (net.nutch.tools.LinkAnalysisTool)
is not mentioned in the tutorial, but the DissectingTheNutchCrawler
article (http://wiki.apache.org/nutch/DissectingTheNutchCrawler)
includes it in the sequence of commands for whole-internet crawling.
When should I use the analyze command, and when can I skip it?
I'm trying to get a sense of how much storage (hard drive and RAM) the
WebDB will require, and I'm now also concerned about how many machine
resources analyze will consume. Nobody has provided this information
yet; I would appreciate it if somebody shared their knowledge here.
I'm looking for something like: for 1,000,000 documents the WebDB will
take approximately XX GB, and running bin/nutch updatedb on 1,000,000
documents will use up to XX MB of RAM.
----------- Question # 3 ------------------------
After the initial inject and the subsequent fetch and updatedb
command(s), can I use inject to add more URLs to the WebDB?
Will greatly appreciate your help.
Thanks,
Daniel
Re: Analyze command purpose ....
Posted by "Daniel D." <nu...@gmail.com>.
Andrej,
Thanks a lot for the answers.
Sorry for being persistent in my posts... I was going on vacation for 3
weeks and needed to finish my work beforehand. I appreciate your help.
Regards,
Daniel
Re: Analyze command purpose ....
Posted by Andrzej Bialecki <ab...@getopt.org>.
Daniel D. wrote:
> Dear Nutch Developers,
>
> I'm trying to get answers to my questions below, but nobody has responded.
> This is why I'm posting my questions again.
Hi Daniel,
Please see my answers below. Sometimes it takes patience, people have
busy schedules...
>
> ----------- Question # 1 ------------------------
> As I understand it, the Nutch crawler employs crawl-and-stop with a
> threshold, controlled by the -topN parameter. Please correct me if I'm
> wrong. This also means that some sites will be crawled to a different
> depth than others.
Yes and no - some pages that are located deep could have a high score
(because of many inlinks), so they would be put on the list for
fetching, even though pages that are closer to the root URL may not have
been fetched yet, or indeed will never be fetched because they score too
low.
>
> Is there a way to control the crawl depth per domain and the number of
> URLs per domain, as well as the total number of domains crawled (in
> this case it's -topN)?
-topN controls fetching by score. What you want is to control fetching
by depth. Currently the FetchListTool doesn't implement this, but it
would be trivial to add.
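To make the distinction concrete, here is a hypothetical sketch in Python of the two selection strategies. This is not Nutch code - the page records and function names are invented for illustration - it only shows the core logic of score-based selection (what -topN does) versus a depth-limited variant:

```python
# Hypothetical illustration only - not Nutch code. Each page record
# carries a link-analysis score and its link depth from a seed URL.

def select_top_n(pages, n):
    """-topN-style selection: the n highest-scoring pages, regardless of depth."""
    return sorted(pages, key=lambda p: p["score"], reverse=True)[:n]

def select_by_depth(pages, max_depth, n):
    """A depth-limited variant: drop pages deeper than max_depth, then rank."""
    shallow = [p for p in pages if p["depth"] <= max_depth]
    return sorted(shallow, key=lambda p: p["score"], reverse=True)[:n]

pages = [
    {"url": "http://a.example/",      "depth": 0, "score": 1.0},
    {"url": "http://a.example/x/y/z", "depth": 3, "score": 9.0},  # deep but popular
    {"url": "http://b.example/",      "depth": 1, "score": 0.5},
]

# Score-based selection picks the deep, high-scoring page first;
# the depth-limited variant skips it entirely.
print([p["url"] for p in select_top_n(pages, 2)])
print([p["url"] for p in select_by_depth(pages, max_depth=1, n=2)])
```

This is the "yes and no" from the previous answer in miniature: a deep page with many inlinks outranks shallower pages under pure score selection.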
>
> ----------- Question # 2 ------------------------
> The whole-web crawling tutorial advises the following command sequence:
>
> fetch
>
> updatedb db
>
> and then generate db segments -topN 1000
>
> Use of the topN parameter implies that updatedb db does some analysis
> on the fetched data. The analyze command
> (net.nutch.tools.LinkAnalysisTool) is not mentioned in the tutorial,
> but the DissectingTheNutchCrawler article
> (http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes it in
> the sequence of commands for whole-internet crawling.
>
> When should I use the analyze command, and when can I skip it?
With the default settings you don't need to use this command. Nutch
approximates the full web-graph scoring by using scoring based on the
number of inlinks. Additionally, this command is known to be slightly
broken...
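So one round of the cycle, without analyze, would look roughly like this (paths and the -topN value are examples taken from the tutorial's style, not prescriptions; check the syntax against your Nutch version):

```shell
# One round of the whole-web crawl cycle, with no analyze step.
# Assumes a WebDB in ./db and segment directories under ./segments.
bin/nutch generate db segments -topN 1000   # pick the 1000 top-scoring URLs
s=`ls -d segments/2* | tail -1`             # newest segment directory
bin/nutch fetch $s                          # fetch the pages in that segment
bin/nutch updatedb db $s                    # fold results back into the WebDB
# Repeat generate/fetch/updatedb for each subsequent round.
```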
>
> I'm trying to get a sense of how much storage (hard drive and RAM) the
> WebDB will require, and I'm now also concerned about how many machine
> resources analyze will consume. Nobody has provided this information
> yet; I would appreciate it if somebody shared their knowledge here.
Don't use analyze - it will consume any disk space that you throw at it ;-)
WebDB normally consumes ca. 2kB per page. This may temporarily increase
to 3x this number during DB updating.
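Plugging that ~2 kB/page figure into the 1,000,000-document example from the question gives a back-of-the-envelope estimate (an extrapolation from the rule of thumb above, not a measurement):

```python
# Back-of-the-envelope WebDB sizing from the ~2 kB/page rule of thumb.
pages = 1_000_000
bytes_per_page = 2 * 1024              # ca. 2 kB per page
steady_gb = pages * bytes_per_page / 1024**3
peak_gb = 3 * steady_gb                # may grow up to 3x during updatedb

print(f"steady state: ~{steady_gb:.1f} GB, peak during updatedb: ~{peak_gb:.1f} GB")
```

So roughly 2 GB at rest, with disk headroom for about three times that while a DB update is in progress.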
>
> I'm looking for something like: for 1,000,000 documents the WebDB will
> take approximately XX GB, and running bin/nutch updatedb on 1,000,000
> documents will use up to XX MB of RAM.
The last figure depends on the settings of your JVM, i.e., the heap size
you set for the JVM. Updatedb should not consume much memory in any case.
>
>
> ----------- Question # 3 ------------------------
>
>
> After the initial inject and the subsequent fetch and updatedb
> command(s), can I use inject to add more URLs to the WebDB?
Yes, of course.
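For example, something like the following (the file name is made up, and the -urlfile flag is the 0.7-era syntax from the tutorial; double-check it against your version):

```shell
# Add more seed URLs to an existing WebDB from a flat file of URLs.
bin/nutch inject db -urlfile more-urls.txt
# The new URLs are picked up by the next generate/fetch/updatedb round.
```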
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com