Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/08/19 15:08:22 UTC

Nutch.SIGNATURE_KEY

Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it?  I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send a Last-Modified header because it uses shtml
(server-parsed HTML).  I assume it's some sort of cryptographic hash
of the entire page?

Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?


-- 
http://www.linkedin.com/in/paultomblin

Re: topN value in crawl

Posted by al...@aim.com.
 In the tutorial on the wiki the depth is not specified and topN=1000. I ran those commands yesterday and it is still running. Will it index all my urls? My seed file has about 20K urls.

Thanks.
Alex.



 


 


Re: topN value in crawl

Posted by Marko Bauhardt <mb...@101tec.com>.
On Aug 19, 2009, at 8:42 PM, alxsss@aim.com wrote:

>
>

hi

>
>
> Thanks. What if the urls in my seed file do not have outlinks, let's
> say .pdf files. Should I still specify the topN variable? All I need is
> to index all the urls in my seed file, and they are about 1 M.

topN means that each generated shard (segment) contains at most the N
most popular unfetched urls from your crawldb.
Popular urls means the urls with the highest score.

You can set topN to -1; if you do this, then you generate and fetch all
pending urls in one shard.
If you set topN=330,000 then you fetch 330,000 urls in one shard.
If you specify the depth parameter then you generate depth shards, one
per round.

For example, with topN=330,000 and depth=3 you generate/fetch/parse/index
3 shards; each shard contains at most 330,000 urls, so roughly 990,000
urls in total.
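
For reference, an invocation along these lines (a sketch; the seed
directory "urls" and output directory "crawl" are placeholders, and the
exact option syntax can vary between Nutch versions):

    bin/nutch crawl urls -dir crawl -depth 3 -topN 330000

would run three generate/fetch/parse rounds, each capped at 330,000 urls.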


marko


Re: topN value in crawl

Posted by al...@aim.com.
 


 Thanks. What if the urls in my seed file do not have outlinks, let's say .pdf files. Should I still specify the topN variable? All I need is to index all the urls in my seed file, and they are about 1 M.

Alex.


 

 


Re: topN value in crawl

Posted by Kirby Bohling <ki...@gmail.com>.
On Wed, Aug 19, 2009 at 12:13 PM, <al...@aim.com> wrote:
>
>  Hi,
>
> I have read a few tutorials on running Nutch to crawl the web. However, I still do not understand the meaning of the topN variable in the crawl command. The tutorials suggest creating 3 segments and fetching them with topN=1000. What if I create 100 segments, or only one? What would the difference be? My goal is to index the urls I have in my seed file and nothing more.
>

My understanding of "topN" is that it interacts with the depth to help
you keep crawling "interesting" areas.  Suppose you have a depth of 3
and a topN of, let's say, 100 (just to keep the math easy), every page
I go to has 20 outlinks, and I have 10 pages listed in my seed list.

This is my understanding from reading the documentation and watching
what happens, not from reading the code, I could be all wrong.
Hopefully someone corrects any details I have wrong:

depth 0:
10 pages fetched, 10 * 20 = 200 pending links to be fetched.

depth 1:
Because I have a "topN" of 100, of the 200 links I have, it will pick
the "100" most interesting (using whatever algorithm is configured, I
believe it is OPIC by default).

depth 2:
100 pages fetched, 100 + 100 * 20 = 2100 pages to fetch. (100
existing, 100 pages with 20 outlinks)

depth 3:
100 pages fetched, 2000 + 100 * 20 = 4000 pages to fetch. (2000
existing pages, 100 pages with 20 outlinks).

(NOTE: This analysis assumes all the links are unique, which is highly
unlikely).

I believe the point is to keep you from having to exhaustively crawl
everything you discover.  Note that the algorithm might still not have
fetched all of the pending links from depth 0 by depth 3 (or depth 100
for that matter).  If they were deemed less interesting than other
links, they could sit in the queue effectively forever.

I view it as a latency vs. throughput trade-off: how much effort are
you willing to spend to always fetch _the most_ interesting page next?
Evaluating and maintaining the ordering of that list is expensive.  So
you queue the "topN" most interesting links you know about now, and
process that batch without re-evaluating "interesting" as new
information is gathered that would change the ordering.

I also believe that "topN * depth" is an upper bound on the number of
pages you will fetch during a crawl.
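
As a rough illustration of that bound (a sketch; the seed and output
directory names are placeholders):

    bin/nutch crawl urls -dir crawl -depth 3 -topN 100

should fetch at most 3 * 100 = 300 pages in total, since each of the
three generate/fetch rounds is capped at 100 urls.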

However, take all this with a grain of salt.  I haven't read the code
closely; this was gleaned while tracking down why some pages I expected
to be fetched were not, reading the documentation, and adjusting the
topN parameter to fix my issues.

Thanks,
   Kirby



> Thanks.
> Alex.
>
>
>
>

topN value in crawl

Posted by al...@aim.com.
 Hi,

I have read a few tutorials on running Nutch to crawl the web. However, I still do not understand the meaning of the topN variable in the crawl command. The tutorials suggest creating 3 segments and fetching them with topN=1000. What if I create 100 segments, or only one? What would the difference be? My goal is to index the urls I have in my seed file and nothing more.

Thanks.
Alex.




Re: Nutch.SIGNATURE_KEY

Posted by Andrzej Bialecki <ab...@getopt.org>.
Paul Tomblin wrote:
> On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler<kk...@transpac.com> wrote:
>>> Another question: is Nutch smart enough to use that signature to
>>> determine that, say, http://xcski.com/ and http://xcski.com/index.html
>>> are the same page?
>> I believe the hashes would be the same for either raw MD5 or text signature,
>> yes. So on the search side these would get collapsed. Don't know about what
>> else you mean as far as "same page" - e.g. one entry in the CrawlDB? If so,
>> then somebody else with more up-to-date knowledge of Nutch would need to
>> chime in here. Older versions of Nutch would still have these as separate
>> entries, FWIR.
> 
> Actually, I just checked some of my own pages, and http://xcski.com/
> and http://xcski.com/index.html have different signatures, in spite of
> them being the same page.  So I guess the answer to that is no, even
> if there were logic to make them the same page in CrawlDB, it wouldn't
> work.

There is nothing magic about the process of calculating a signature - 
e.g. MD5Signature just takes Content.getContent() (an array of bytes) and 
runs it through MD5. So if you get different MD5 values, then your 
content was indeed different (even if the difference was only an 
advertisement link somewhere on the page).
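
You can see the same effect outside of Nutch (a sketch, assuming curl
and md5sum are available; this hashes whatever the server returns right
now, which is not necessarily byte-for-byte what Nutch fetched):

    curl -s http://xcski.com/ | md5sum
    curl -s http://xcski.com/index.html | md5sum

If anything in the returned bytes differs - a rotating ad, a timestamp
emitted by a server-side include - the two digests will differ.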

You could use a urlnormalizer to collapse www.example.com/ and 
www.example.com/index.html into a single entry; in fact there is a 
commented-out rule like that in the urlnormalizer config file. But as you 
observed above, there may be cases when these two are not really the 
same page, so you need to be careful ...
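
If you want to check what a normalizer rule actually does to a url,
something along these lines may work (a sketch; it assumes your Nutch
version ships the org.apache.nutch.net.URLNormalizerChecker helper,
which reads urls from stdin and prints the normalized form):

    echo "http://www.example.com/index.html" | bin/nutch org.apache.nutch.net.URLNormalizerChecker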


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch.SIGNATURE_KEY

Posted by Paul Tomblin <pt...@xcski.com>.
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler<kk...@transpac.com> wrote:
>> Another question: is Nutch smart enough to use that signature to
>> determine that, say, http://xcski.com/ and http://xcski.com/index.html
>> are the same page?
>
> I believe the hashes would be the same for either raw MD5 or text signature,
> yes. So on the search side these would get collapsed. Don't know about what
> else you mean as far as "same page" - e.g. one entry in the CrawlDB? If so,
> then somebody else with more up-to-date knowledge of Nutch would need to
> chime in here. Older versions of Nutch would still have these as separate
> entries, FWIR.

Actually, I just checked some of my own pages, and http://xcski.com/
and http://xcski.com/index.html have different signatures, in spite of
them being the same page.  So I guess the answer to that is no, even
if there were logic to make them the same page in CrawlDB, it wouldn't
work.


-- 
http://www.linkedin.com/in/paultomblin

Re: Nutch.SIGNATURE_KEY

Posted by Ken Krugler <kk...@transpac.com>.
Hi Paul,

On Aug 19, 2009, at 6:08am, Paul Tomblin wrote:

> Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
> page has changed since the last time I crawled it?  I patched Nutch to
> properly handle modification dates, and then discovered that my web
> site doesn't send a Last-Modified header because it uses shtml
> (server-parsed HTML).

Yes, that's why nobody uses the modification date in the response  
headers - even when it's there, it often lies.

> I assume it's some sort of cryptographic hash
> of the entire page?

There are two ways for Nutch to calculate the page signature - one is  
an MD5 of the raw page contents. The other is a "text signature" that  
tries to be tolerant of minor changes to a web page. Which one to use  
depends on your situation.
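
Either way, you can check which signature is currently stored for a url
by dumping its CrawlDb entry (a sketch; the crawldb path is a
placeholder, and it assumes your version's readdb supports the -url
option):

    bin/nutch readdb crawl/crawldb -url http://xcski.com/

That prints the CrawlDatum for the url, including its signature, which
you can compare across crawls.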

> Another question: is Nutch smart enough to use that signature to
> determine that, say, http://xcski.com/ and http://xcski.com/index.html
> are the same page?

I believe the hashes would be the same for either raw MD5 or text  
signature, yes. So on the search side these would get collapsed. Don't  
know about what else you mean as far as "same page" - e.g. one entry  
in the CrawlDB? If so, then somebody else with more up-to-date  
knowledge of Nutch would need to chime in here. Older versions of  
Nutch would still have these as separate entries, FWIR.

-- Ken