Posted to dev@nutch.apache.org by ".: Abishek :." <ab...@gmail.com> on 2011/02/10 08:23:23 UTC

Decoupling crawling and indexing

Hi all,

 I am looking for a way to decouple crawling from indexing instead of tying
them together. I am crawling some huge sites and cannot afford to wait until
the whole crawl is finished before I can search the results. I am working on
some proofs of concept, so I can't wait long, and the target sites cannot be
replicated or faked. I know it's tough to do because of link inversion,
deduplication and so on.

 Is there a way I can at least crawl for a day or two, then run the rest of
the process (link inversion, deduplication and indexing), and then maybe come
back and resume the crawl from where it left off? In other words, a kind of
incremental process?

 Any suggestions or pointers would be a great help.

Thanks,
Abi