Posted to user@nutch.apache.org by Tomi NA <he...@gmail.com> on 2006/09/05 19:44:03 UTC

crawling frequently changing data on an intranet - how?

The task
-----------

I have less than 100GB of diverse documents (.doc, .pdf, .ppt, .txt,
.xls, etc.) to index. Dozens, or even hundreds or thousands, of
documents can be created, modified, or deleted every day.
The crawler will run on an HP DL380 G4 server - I don't know the
exact specs yet.
I'd like to keep the index no more than 20 minutes out of date (5-10
would be ideal).
I'm currently sticking to nutch 0.7.2 because of crawl (especially
fetch) speed considerations.

Current idea
-----------
From what I've read so far, Nutch relies on the date a document was
last crawled, rather than checking the live document's last
modification date (reasonable behavior on the Internet, but an
intranet could do better). That's why I can't simply run the wiki
recrawl script and let it find the documents that have changed since
the last index.
I'd therefore run a crawl overnight and use the produced index as a
"main index". During the day, however, I can traverse the whole
intranet web, see what's changed and crawl/index only the documents
that have changed, building a second, "helper index".
I'd set up the search application to use both of those indexes.
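To make the "see what's changed" step concrete, here is a minimal sketch
(plain Java, not Nutch code - the class and method names are my own
invention): diff a saved snapshot of per-URL last-modified timestamps
against a fresh listing, yielding the added/modified/deleted sets that
the daytime "helper" crawl would process.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ChangeDetector {

    /** Result of comparing the previous snapshot against the current listing. */
    public static class Changes {
        public final Set<String> added = new HashSet<String>();
        public final Set<String> modified = new HashSet<String>();
        public final Set<String> deleted = new HashSet<String>();
    }

    /**
     * Compare two url -> last-modified-timestamp maps. "previous" would be
     * the snapshot saved after the nightly crawl; "current" would come from
     * a fresh traversal (e.g. HTTP HEAD requests or a filesystem walk).
     */
    public static Changes diff(Map<String, Long> previous,
                               Map<String, Long> current) {
        Changes c = new Changes();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long old = previous.get(e.getKey());
            if (old == null) {
                c.added.add(e.getKey());        // new document
            } else if (!old.equals(e.getValue())) {
                c.modified.add(e.getKey());     // timestamp changed
            }
        }
        for (String url : previous.keySet()) {
            if (!current.containsKey(url)) {
                c.deleted.add(url);             // gone since last snapshot
            }
        }
        return c;
    }
}
```

Only the added/modified sets would be fetched and indexed into the
helper index; the deleted set would be masked out at search time.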

Problems
---------
I don't know how to tell the search interface to use 2 separate indices.
I'm really not sure how I'll make the search interface reload the
"helper index" every 10 or 20 minutes.

I'd welcome an opinion from anyone with more experience with
nutch...which basically means anyone. :)

TIA,
t.n.a.