Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2005/12/01 20:15:49 UTC

incremental crawling

It would be good to improve the support for incremental crawling added 
to Nutch.  Here are some ideas about how we might implement it.  Andrzej 
has posted in the past about this, so he probably has better ideas.

Incremental crawling could proceed as follows:

1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify 
CrawlDatum to store the MD5Hash of the content of fetched urls.

2. Reduce the fetch interval for high-scoring urls.  If the default is 
monthly, then the top-scoring 1% of urls might be set to daily, and the 
top-scoring 10% of urls might be set to weekly.

3. Generate a fetch list & fetch it.  When the url has been previously 
fetched, and its content is unchanged, increase its fetch interval by an 
amount, e.g., 50%.  If the content is changed, decrease the fetch 
interval.  The percentage of increase and decrease might be influenced 
by the url's score.

4. Update the crawl db & link db, index the new segment, dedup, etc. 
When updating the crawl db, scores for existing urls should not change, 
since the scoring method we're using (OPIC) assumes each page is fetched 
only once.

Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be 
included in the 'crawl' command, so that crawled indexes are always 
ready for update.
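
A rough sketch of the adjustment in step 3, just to make the proposal
concrete (the class, method and constant names are invented here for
illustration, not existing Nutch code):

// Illustrative only: adjust a url's fetch interval depending on whether
// its content hash changed since the last fetch.
public class IntervalAdjuster {

  private static final float INCREASE = 1.5f;   // +50% when unchanged
  private static final float DECREASE = 0.66f;  // shrink when changed
  private static final float MIN_DAYS = 1.0f;   // clamp so intervals stay sane
  private static final float MAX_DAYS = 90.0f;

  /** Returns the new fetch interval in days.  The url's score could be
   *  folded into the factor as well, as suggested above. */
  public static float adjust(float intervalDays, boolean contentChanged) {
    float factor = contentChanged ? DECREASE : INCREASE;
    return Math.max(MIN_DAYS, Math.min(MAX_DAYS, intervalDays * factor));
  }
}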

Comments?

Doug

Re: incremental crawling

Posted by Jack Tang <hi...@gmail.com>.
Hi Doug

1. How do we deal with "dead urls"? If a url is removed after Nutch's
first crawl, should Nutch keep the "dead url" and never fetch it
again?
2. Should Nutch expose dedup as an extension point? In my project we
add an information extraction layer on top of Nutch, and I think it
would be a good idea to expose dedup as an extension point, so that we
can build our own "duplicates rule" on the extracted data objects; the
default would of course be the page url.

Thoughts?
/Jack

On 12/2/05, Doug Cutting <cu...@nutch.org> wrote:
> It would be good to improve the support for incremental crawling added
> to Nutch.  Here are some ideas about how we might implement it.  Andrzej
> has posted in the past about this, so he probably has better ideas.
>
> Incremental crawling could proceed as follows:
>
> 1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify
> CrawlDatum to store the MD5Hash of the content of fetched urls.
>
> 2. Reduce the fetch interval for high-scoring urls.  If the default is
> monthly, then the top-scoring 1% of urls might be set to daily, and the
> top-scoring 10% of urls might be set to weekly.
>
> 3. Generate a fetch list & fetch it.  When the url has been previously
> fetched, and its content is unchanged, increase its fetch interval by an
> amount, e.g., 50%.  If the content is changed, decrease the fetch
> interval.  The percentage of increase and decrease might be influenced
> by the url's score.
>
> 4. Update the crawl db & link db, index the new segment, dedup, etc.
> When updating the crawl db, scores for existing urls should not change,
> since the scoring method we're using (OPIC) assumes each page is fetched
> only once.
>
> Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be
> included in the 'crawl' command, so that crawled indexes are always
> ready for update.
>
> Comments?
>
> Doug
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: [Nutch-dev] incremental crawling

Posted by Jérôme Charron <je...@gmail.com>.
Sounds really good (and it has been requested by a lot of Nutch users!).
+1

Jérôme

On 12/1/05, Doug Cutting <cu...@nutch.org> wrote:
>
> Matt Kangas wrote:
> > #2 should be a pluggable/hookable parameter. "high-scoring" sounds  like
> > a reasonable default basis for choosing recrawl intervals, but  I'm sure
> > that nearly everyone will think of a way to improve upon  that for their
> > particular system.
> >
> > e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)
>
> In NUTCH-61, Andrzej has a pluggable FetchSchedule.  That looks like a
> good idea.
>
> http://issues.apache.org/jira/browse/NUTCH-61
>
> Doug
>



--
http://motrech.free.fr/
http://www.frutch.org/

Re: [Nutch-dev] incremental crawling

Posted by Doug Cutting <cu...@nutch.org>.
Matt Kangas wrote:
> #2 should be a pluggable/hookable parameter. "high-scoring" sounds  like 
> a reasonable default basis for choosing recrawl intervals, but  I'm sure 
> that nearly everyone will think of a way to improve upon  that for their 
> particular system.
> 
> e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)

In NUTCH-61, Andrzej has a pluggable FetchSchedule.  That looks like a 
good idea.

http://issues.apache.org/jira/browse/NUTCH-61

Doug

Re: [Nutch-dev] incremental crawling

Posted by Matt Kangas <ka...@gmail.com>.
#2 should be a pluggable/hookable parameter. "high-scoring" sounds  
like a reasonable default basis for choosing recrawl intervals, but  
I'm sure that nearly everyone will think of a way to improve upon  
that for their particular system.

e.g. "high-scoring" ain't gonna cut it for my needs. (0.5 wink ;)

--matt

On Dec 1, 2005, at 2:15 PM, Doug Cutting wrote:

> It would be good to improve the support for incremental crawling  
> added to Nutch.  Here are some ideas about how we might implement  
> it.  Andrzej has posted in the past about this, so he probably has  
> better ideas.
>
> Incremental crawling could proceed as follows:
>
> 1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify  
> CrawlDatum to store the MD5Hash of the content of fetched urls.
>
> 2. Reduce the fetch interval for high-scoring urls.  If the default  
> is monthly, then the top-scoring 1% of urls might be set to daily,  
> and the top-scoring 10% of urls might be set to weekly.
>
> 3. Generate a fetch list & fetch it.  When the url has been  
> previously fetched, and its content is unchanged, increase its  
> fetch interval by an amount, e.g., 50%.  If the content is changed,  
> decrease the fetch interval.  The percentage of increase and  
> decrease might be influenced by the url's score.
>
> 4. Update the crawl db & link db, index the new segment, dedup,  
> etc. When updating the crawl db, scores for existing urls should  
> not change, since the scoring method we're using (OPIC) assumes  
> each page is fetched only once.
>
> Steps 3 & 4 can be packaged as an 'update' command.  Step 2 can be  
> included in the 'crawl' command, so that crawled indexes are always  
> ready for update.
>
> Comments?
>
> Doug

--
Matt Kangas / kangas@gmail.com



Re: incremental crawling

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Doug Cutting wrote:
>> Modify CrawlDatum to store the MD5Hash of the content of fetched urls.
> 
> Yes, this is required to detect unmodified content. A small note: plain 
> MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages 
> with a counter, or with ads. It would be good to provide a framework for 
> other implementations of "page equality" - for now perhaps we should 
> just say that this value is a byte[], and not specifically an MD5Hash.

That's reasonable.  But we should also make it clear that the larger 
this gets, the slower that crawldb updates will be.

> Other additions to CrawlDatum for consideration:
> 
> * last modified time, not just the last fetched time - these two are 
> different, and the fetching policy will depend on both. E.g. to 
> synchronize with the page change cycle it is necessary to know the time 
> of the previous modification seen by Nutch. I've done simulations, which 
> show that if we don't track this value then the fetchInterval 
> adjustments won't stabilize even if the page change cycle is fixed.

Instead of a long, these could be two unsigned ints, seconds since 
epoch.  That would be good for another 100 years.
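
For what it's worth, a tiny sketch of that round trip using plain
java.io (illustrative only): an int holding seconds since the epoch,
masked on read so it behaves as unsigned, covers dates up to roughly
the year 2106.

import java.io.*;

// Illustrative only: store a timestamp as 32-bit seconds-since-epoch
// rather than a 64-bit millisecond long, treating the int as unsigned
// when reading it back.
public class SecondsTimestamp {
  public static void write(DataOutput out, long timeMillis) throws IOException {
    out.writeInt((int) (timeMillis / 1000L));
  }
  public static long read(DataInput in) throws IOException {
    return (in.readInt() & 0xFFFFFFFFL) * 1000L;  // mask recovers the unsigned value
  }
}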

> * segment name from the last updatedb. I'm not fully convinced about 
> this, but consider the following:
> 
> I think this is needed in order to check which segments may be safely 
> deleted, because there are no more active pages in them. If you enable a 
> variable fetchInterval, then after a while you will end up with widely 
> ranging intervals - some pages will have a daily or hourly period, some 
> others will have a period of several months. Add to this the fact that 
> you start counting the time for each page at different moments, and then 
> the oldest page you have could be as old as maxFetchInterval (whatever 
> that is, Float.MAX_VALUE or some other maximum you set). Most likely 
> such old pages would live in segments with very little current data.

This is only possible if maxFetchInterval is very long and one is doing 
frequent large updates.  That to me is misuse.  If maxFetchInterval is, 
e.g., 30 days and one updates 1% of the collection every day, then the 
wasted space would be small, and segments may be discarded after 
maxFetchInterval.  Is that impractical?

I also think we could go crazy trying to make things work indefinitely 
with incremental crawling.  Rather I think one should periodically 
re-crawl from scratch, using the existing crawldb to bootstrap.  This 
way dead links are eventually retried, and urls that are no longer 
referred to can be GC'd.

> Alternatively, we could add Properties to CrawlDatum, and let people put 
> whatever they wish there...

This is probably a good idea, although it makes CrawlDatum bigger and 
algorithms slower, since strings must be parsed.  So I'd argue that the 
default mechanism not rely on properties.  Is that a premature optimization?

> In the original patchset I had a notion of pluggable FetchSchedule-s. I 
> think this would be an ideal place to make such decisions. 
> Implementations would be pluggable in a similar way as URLFilter, with 
> the DefaultFetchSchedule doing what we do today.

+1

>> 4. Update the crawl db & link db, index the new segment, dedup, etc. 
>> When updating the crawl db, scores for existing urls should not 
>> change, since the scoring method we're using (OPIC) assumes each page 
>> is fetched only once.
> 
> 
> 
> I would love to refactor this part too, to make the scoring mechanism 
> abstracted in a similar way, so that you could plug in different scoring 
> implementations. The float value in CrawlDatum is opaque enough to 
> support different scoring mechanisms.

+1

Doug

Re: incremental crawling

Posted by Stefan Groschupf <sg...@media-style.com>.
On 02.12.2005, at 10:15, Andrzej Bialecki wrote:

> Yes, this is required to detect unmodified content. A small note:  
> plain MD5Hash(byte[] content) is quite ineffective for many pages,  
> e.g. pages with a counter, or with ads. It would be good to provide  
> a framework for other implementations of "page equality" - for now  
> perhaps we should just say that this value is a byte[], and not  
> specifically an MD5Hash.

Some time ago I found an interesting mechanism that might help us here;
it is called Locality-Sensitive Hashing (LSH).
From my point of view it would also be a perfect way to remove a lot of
spam pages. On my todo list I have a task to write a kind of proof of
concept, but like all of us I have been too busy with other things.
You will find the paper behind the link below. I would really love to
see this in the Nutch sources, and I would offer to work with others on
such a solution.

http://dbpubs.stanford.edu:8090/pub/2000-23
or the pdf:
http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=2000-23&format=pdf&compression=&name=2000-23.pdf
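
To make the idea a bit more concrete, here is a very small MinHash-style
sketch of near-duplicate detection (one common flavor of LSH).  The
shingle size, number of hash functions and class name are arbitrary
choices for illustration and are not taken from Nutch or from the paper:

import java.util.Arrays;

// Illustrative MinHash sketch: two pages whose token shingles overlap
// heavily will agree on most minimum hash values, so small differences
// (counters, ads) no longer flip the whole signature the way a plain
// MD5 over the raw bytes does.
public class MinHashSketch {

  private static final int NUM_HASHES = 64;
  private static final int SHINGLE_SIZE = 4;

  public static int[] sketch(String text) {
    String[] tokens = text.toLowerCase().split("\\s+");
    int[] mins = new int[NUM_HASHES];
    Arrays.fill(mins, Integer.MAX_VALUE);
    for (int i = 0; i + SHINGLE_SIZE <= tokens.length; i++) {
      StringBuilder shingle = new StringBuilder();
      for (int j = i; j < i + SHINGLE_SIZE; j++) shingle.append(tokens[j]).append(' ');
      int h = shingle.toString().hashCode();
      for (int k = 0; k < NUM_HASHES; k++) {
        int hk = h ^ (k * 0x9E3779B9);      // cheap per-function mixing, illustration only
        hk *= 0x85EBCA6B;
        hk ^= hk >>> 13;
        if (hk < mins[k]) mins[k] = hk;
      }
    }
    return mins;
  }

  /** Rough estimate of the Jaccard similarity of the two shingle sets. */
  public static double similarity(int[] a, int[] b) {
    int same = 0;
    for (int k = 0; k < NUM_HASHES; k++) if (a[k] == b[k]) same++;
    return (double) same / NUM_HASHES;
  }
}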

Greetings,
Stefan

Re: incremental crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> It would be good to improve the support for incremental crawling added 
> to Nutch.  Here are some ideas about how we might implement it.  
> Andrzej has posted in the past about this, so he probably has better 
> ideas.
>
> Incremental crawling could proceed as follows:
>
> 1. Bootstrap with a batch crawl, using the 'crawl' command.  Modify 
> CrawlDatum to store the MD5Hash of the content of fetched urls.


Yes, this is required to detect unmodified content. A small note: plain 
MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages 
with a counter, or with ads. It would be good to provide a framework for 
other implementations of "page equality" - for now perhaps we should 
just say that this value is a byte[], and not specifically an MD5Hash.
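
One possible shape for such a framework, purely as a sketch (the
interface name and methods below are invented for illustration, not
anything that exists in Nutch):

// Hypothetical "page equality" extension point: implementations decide
// which bytes identify a page's content, so MD5-over-raw-bytes becomes
// just the default implementation rather than a hardwired assumption.
public interface PageSignature {
  /** Compute an opaque signature for the fetched content. */
  byte[] calculate(byte[] content, String contentType);

  /** True if the two signatures denote "the same" page. */
  boolean equivalent(byte[] a, byte[] b);
}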

Other additions to CrawlDatum for consideration:

* last modified time, not just the last fetched time - these two are 
different, and the fetching policy will depend on both. E.g. to 
synchronize with the page change cycle it is necessary to know the time 
of the previous modification seen by Nutch. I've done simulations, which 
show that if we don't track this value then the fetchInterval 
adjustments won't stabilize even if the page change cycle is fixed.

* segment name from the last updatedb. I'm not fully convinced about 
this, but consider the following:

I think this is needed in order to check which segments may be safely 
deleted, because there are no more active pages in them. If you enable a 
variable fetchInterval, then after a while you will end up with widely 
ranging intervals - some pages will have a daily or hourly period, some 
others will have a period of several months. Add to this the fact that 
you start counting the time for each page at different moments, and then 
the oldest page you have could be as old as maxFetchInterval (whatever 
that is, Float.MAX_VALUE or some other maximum you set). Most likely 
such old pages would live in segments with very little current data.

Now, you need to minimize the number of active segments (because of 
search performance and the time to deduplicate). However, with variable 
fetchInterval you no longer know which segments it is safe to delete. I 
imagine a tool could collect all segment names from CrawlDB, and prepare 
a list of (segmentName, numRecords) pairs. Segments that are not found
on this list would be safe to delete. Segments that have only a few
records could be processed to extract those records and move them to a
single segment (discarding the rest of the old segment data).
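
As a sketch of such a tool (assuming the crawl db record carried the
segment name of its last updatedb, which it does not today -- the class
and method below are invented for illustration):

import java.util.HashMap;
import java.util.Map;

// Hypothetical report for segment GC: count how many live crawl db
// entries still point at each segment.  Segments with no entries could
// be deleted; segments with very few could be compacted into one.
public class SegmentUsageReport {
  public static Map<String, Integer> countBySegment(Iterable<String> segmentNameOfEachRecord) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String segment : segmentNameOfEachRecord) {
      Integer n = counts.get(segment);
      counts.put(segment, n == null ? 1 : n + 1);
    }
    return counts;
  }
}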

..

Alternatively, we could add Properties to CrawlDatum, and let people put 
whatever they wish there...

>
> 2. Reduce the fetch interval for high-scoring urls.  If the default is 
> monthly, then the top-scoring 1% of urls might be set to daily, and 
> the top-scoring 10% of urls might be set to weekly.
>

In the original patchset I had a notion of pluggable FetchSchedule-s. I 
think this would be an ideal place to make such decisions. 
Implementations would be pluggable in a similar way as URLFilter, with 
the DefaultFetchSchedule doing what we do today.
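
The real interface is whatever the NUTCH-61 patch defines; just to give
the flavor of the extension point, something along these lines (the
method names here are illustrative only):

// Hypothetical shape of a pluggable fetch schedule, analogous to
// URLFilter.  See the NUTCH-61 patch for the actual interface.
public interface FetchSchedule {
  /** New fetch interval (in days) for a url after a fetch attempt,
   *  given its score and whether the fetched content changed. */
  float newFetchInterval(String url, float currentIntervalDays,
                         float score, boolean contentChanged);

  /** Initial interval for a url that has never been fetched. */
  float initialFetchInterval(String url, float score);
}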

> 3. Generate a fetch list & fetch it.  When the url has been previously 
> fetched, and its content is unchanged, increase its fetch interval by 
> an amount, e.g., 50%.  If the content is changed, decrease the fetch 
> interval.  The percentage of increase and decrease might be influenced 
> by the url's score.
>

Again, that's the task for a FetchSchedule.

> 4. Update the crawl db & link db, index the new segment, dedup, etc. 
> When updating the crawl db, scores for existing urls should not 
> change, since the scoring method we're using (OPIC) assumes each page 
> is fetched only once.


I would love to refactor this part too, to make the scoring mechanism 
abstracted in a similar way, so that you could plug in different scoring 
implementations. The float value in CrawlDatum is opaque enough to 
support different scoring mechanisms.
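
As a sketch only, the abstraction might look roughly like this (the name
and methods are invented here; the point is just that the float in
CrawlDatum stays opaque and only the plugin interprets it):

// Hypothetical scoring extension point; OPIC would be one implementation,
// and other scoring schemes could be dropped in beside it.
public interface ScoringPlugin {
  /** Score assigned to a newly discovered url. */
  float initialScore(String url);

  /** Contribution passed from a fetched page to each of its outlinks. */
  float outlinkContribution(float pageScore, int outlinkCount);

  /** Combine an existing crawl db score with the contributions
   *  accumulated from a newly fetched segment. */
  float updateScore(float oldScore, float sumOfContributions);
}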

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: incremental crawling

Posted by Matthias Jaekle <ja...@eventax.de>.
> 3. Generate a fetch list & fetch it.  When the url has been previously 
> fetched, and its content is unchanged, increase its fetch interval by an 
> amount, e.g., 50%.  If the content is changed, decrease the fetch 
> interval.  The percentage of increase and decrease might be influenced 
> by the url's score.
Hi,

if we track the amount of change this way, we could also prefer pages
in the ranking algorithm which change more often.
Frequently changing pages might be more up-to-date and could have a
higher value than pages that never change.
Also, pages which are unchanged for a long time might go out of date
and lose a little bit of their general scoring.
So maybe the fetch interval value could be used as a multiplier for
boosting pages in the final result set.
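
Just to illustrate the multiplier idea (the method name and the exponent
below are invented for this sketch, not anything in Nutch):

// Illustrative only: use the adaptive fetch interval as a freshness
// signal.  Pages refetched more often than the default get a boost
// above 1.0; rarely-changing pages drift slightly below it.
public static float freshnessBoost(float score, float intervalDays,
                                   float defaultIntervalDays) {
  double ratio = defaultIntervalDays / intervalDays;
  return score * (float) Math.pow(ratio, 0.25);  // damped so it only nudges the ranking
}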

Matthias