Posted to user@nutch.apache.org by al...@aim.com on 2009/08/19 19:13:56 UTC

topN value in crawl

 Hi,

I have read a few tutorials on running Nutch to crawl the web. However, I still do not understand the meaning of the topN parameter in the crawl command. The tutorials suggest creating 3 segments and fetching them with topN=1000. What if I create 100 segments, or only one? What would be the difference? My goal is to index the urls I have in my seed file and nothing more.

Thanks.
Alex.
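
For reference, the tutorial-style invocation being discussed looks roughly like this (a sketch of the classic one-step crawl command; the urls/ seed directory and crawl/ output directory are placeholder names):

    # three generate/fetch rounds, each limited to the 1000
    # highest-scoring unfetched urls
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000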




Re: topN value in crawl

Posted by al...@aim.com.
In the tutorial on the wiki the depth is not specified and topN=1000. I ran those commands yesterday and the crawl is still running. Will it index all my urls? My seed file has about 20K urls.

Thanks.
Alex.
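
A quick back-of-envelope check, using the topN-times-depth bound Kirby describes further down the thread (the figures are just the ones quoted above, not measured output):

    # each round fetches at most topN urls, so reaching all
    # 20,000 seed urls at topN=1000 takes about
    echo $(( 20000 / 1000 ))   # => 20 generate/fetch rounds
    # a single round with topN=-1 (or topN >= 20000) avoids this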



 


 



Re: topN value in crawl

Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 19, 2009, at 8:42 PM, alxsss@aim.com wrote:

> Thanks. What if the urls in my seed file do not have outlinks, let's
> say .pdf files. Should I still specify the topN variable? All I need
> is to index all the urls in my seed file. And they are about 1 M.

hi

topN means that each generated shard (segment) contains at most the N
most popular urls from your crawldb which are not yet fetched.
popular urls means urls with the highest score.

You can set topN to "-1". If you do this, you generate and fetch all
unfetched urls in one shard.
If you set topN=330,000, you fetch 330,000 urls in one shard.
If you specify the depth parameter, you generate depth shards.

For example, with -topN 330000 and -depth 3
you generate/fetch/parse/index 3 shards, each containing at most
330,000 urls, so ~990,000 urls in total.


marko
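
As a concrete sketch of Marko's settings against the one-step crawl command (directory names are placeholders; -topN -1 follows his suggestion above):

    # fetch every unfetched url in a single shard
    bin/nutch crawl urls -dir crawl -depth 1 -topN -1

    # three shards of at most 330,000 urls each (~990,000 total)
    bin/nutch crawl urls -dir crawl -depth 3 -topN 330000

For the stated goal of indexing only the seed urls, one round (depth 1) should be enough, since outlinks only come into play on later rounds.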


Re: topN value in crawl

Posted by al...@aim.com.
 


Thanks. What if the urls in my seed file do not have outlinks, let's say .pdf files. Should I still specify the topN variable? All I need is to index all the urls in my seed file. And they are about 1 M.

Alex.


 



Re: topN value in crawl

Posted by Kirby Bohling <ki...@gmail.com>.
On Wed, Aug 19, 2009 at 12:13 PM, <al...@aim.com> wrote:
>
>  Hi,
>
> I have read a few tutorials on running Nutch to crawl the web. However, I still do not understand the meaning of the topN parameter in the crawl command. The tutorials suggest creating 3 segments and fetching them with topN=1000. What if I create 100 segments, or only one? What would be the difference? My goal is to index the urls I have in my seed file and nothing more.
>

My understanding of "topN" is that it interacts with the depth to help
you keep crawling "interesting" areas.  So suppose you have a depth of
3 and a topN of, let's say, 100 (just to keep the math easy), every
page I go to has 20 outlinks, and I have 10 pages listed in my seed
list.

This is my understanding from reading the documentation and watching
what happens, not from reading the code, I could be all wrong.
Hopefully someone corrects any details I have wrong:

depth 0:
10 pages fetched, 10 * 20 = 200 pending links to be fetched.

depth 1:
Because I have a "topN" of 100, of the 200 pending links it will pick
the "100" most interesting (using whatever algorithm is configured, I
believe it is OPIC by default): 100 pages fetched, 100 + 100 * 20 =
2100 pages to fetch. (100 existing, 100 pages with 20 outlinks)

depth 2:
100 pages fetched, 2000 + 100 * 20 = 4000 pages to fetch. (2000
existing pages, 100 pages with 20 outlinks)

depth 3:
100 pages fetched, 3900 + 100 * 20 = 5900 pages to fetch. (3900
existing pages, 100 pages with 20 outlinks)

(NOTE: This analysis assumes all the links are unique, which is highly
unlikely).

I believe the point is to not force you into a depth-first search of
the web.  Note that the algorithm might still not have fetched all of
the pending links from depth 0 by depth 3 (or depth 100, for that
matter).  If they were deemed less interesting than other links, they
could sit in the queue effectively forever.

I view it as a latency vs. throughput trade-off: how much effort are
you willing to spend to always fetch _the most_ interesting page next?
Evaluating and managing the ordering of that list is expensive.  So
queue the "topN" most interesting links you know about now, and
process that batch without re-evaluating "interesting" as new
information is gathered that would change the ordering.

I also believe that "topN * depth" is an upper bound on the number of
pages you will fetch during a crawl.

However, take all this with a grain of salt.  I haven't read the code
closely; this was gleaned while tracking down why some pages I
expected to be fetched were not, reading the documentation, and
modifying the topN parameter to fix my issues.

Thanks,
   Kirby
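
Kirby's arithmetic is easy to replay; here is a minimal bash sketch under his stated assumptions (10 seeds, 20 unique outlinks per page, topN=100; purely illustrative, not Nutch output):

    #!/usr/bin/env bash
    # each round fetches at most topN pending urls; every fetched
    # page contributes 20 new pending links
    seeds=10 outlinks=20 topn=100
    fetched=$seeds
    pending=$(( seeds * outlinks ))
    echo "depth 0: fetched=$fetched pending=$pending"
    for depth in 1 2 3; do
      batch=$(( pending < topn ? pending : topn ))
      fetched=$(( fetched + batch ))
      pending=$(( pending - batch + batch * outlinks ))
      echo "depth $depth: fetched=$fetched pending=$pending"
    done

This prints pending counts of 2100, 4000, and 5900, and the total fetched never exceeds seeds + topN * depth, matching the upper bound described above.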


