Posted to dev@nutch.apache.org by Anton Potehin <an...@orbita1.ru> on 2005/11/24 09:26:39 UTC

Incremental crawling

We have worked out the following scheme for incremental crawling:

 

1. Depth = 1, topN = big enough (for example, 100000)
2. clear the partial indexes from the previous iteration
3. copy the global index into the indexes directory
4. crawl a new segment
5. create an index for the new segment
6. dedup (run against the total index and the new one)
7. merge the old total index and the new index into a new global index (see the sketch after this list)
8. replace the old total index with the new total index

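For step 7, here is a minimal sketch of the merge using plain Lucene calls (roughly Lucene 1.4-era API, which Nutch used at the time). It only illustrates the idea, it is not Nutch's own merge tool, and the directory paths are made up:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
  public static void main(String[] args) throws IOException {
    // Existing indexes: the current total index and the new segment's index.
    // Paths are hypothetical.
    Directory oldTotal   = FSDirectory.getDirectory("index", false);
    Directory newSegment = FSDirectory.getDirectory("indexes/part-new", false);

    // Write the merged result into a fresh directory, so the old total index
    // stays searchable until it is swapped out in step 8.
    Directory newTotal = FSDirectory.getDirectory("index.new", true);
    IndexWriter writer = new IndexWriter(newTotal, new StandardAnalyzer(), true);

    // addIndexes merges already-built segments; nothing is re-analyzed,
    // so the analyzer passed above does not matter for the merge itself.
    writer.addIndexes(new Directory[] { oldTotal, newSegment });
    writer.close();
  }
}

Because the merge goes into a separate directory, step 8 can be a simple rename of "index.new" to "index", which keeps the window of inaccessibility very short.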
 

Are there ways to optimize this scheme?

Is there a way to append new indexes to the total index without copying it?

 

We need all of this so that the search engine can be used while a new crawl is in progress.

Therefore the total index must be accessible all the time, or at least the time it is inaccessible must be minimized.


RE: Incremental crawling

Posted by an...@orbita1.ru.
We have implemented this scheme, and (to our surprise! :)) it works!
But after each iteration we have to restart Tomcat, otherwise the search
page shows no results at all!

How can we work around this problem?
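A likely cause (an assumption, not confirmed in the post) is that the search webapp opens one IndexSearcher at startup and keeps it for the life of the JVM, so it never sees the replaced index files. Below is a minimal sketch of reopening the searcher when the Lucene index version changes; the class name and path are hypothetical and this is not the Nutch webapp's actual code:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReopeningSearcher {
  private final Directory dir;
  private IndexSearcher searcher;
  private long version;

  public ReopeningSearcher(String indexPath) throws IOException {
    this.dir = FSDirectory.getDirectory(indexPath, false);
    this.searcher = new IndexSearcher(dir);
    this.version = IndexReader.getCurrentVersion(dir);
  }

  // Call before each search: swap in a fresh searcher if the index changed.
  public synchronized IndexSearcher getSearcher() throws IOException {
    long current = IndexReader.getCurrentVersion(dir);
    if (current != version) {
      searcher.close();                 // release the old index files
      searcher = new IndexSearcher(dir); // reopen on the replaced index
      version = current;
    }
    return searcher;
  }
}

With something like getSearcher() called before each query, the webapp would pick up the new total index after each iteration without restarting Tomcat.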