Posted to user@nutch.apache.org by Konstantin Ott <ot...@netropol.de> on 2005/03/16 12:55:11 UTC

How do I recrawl

Hi,
Having crawled a list of URLs from the urls file once, I would like to 
crawl again. I don't want to inject again, and I hope Nutch can recognise 
already indexed content by its digest, so that Nutch only indexes 
modified and new files.
I notice that every crawl adds more segment directories.
So what is the recommended way to recrawl the same urls file?

thanks
Konstantin Ott
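
The digest idea described in the question can be sketched in a few lines. This is only an illustration of the general technique, not Nutch code: the function names, the in-memory `seen` dictionary, and the choice of MD5 are all assumptions made for the example.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return a hex digest of a page body (MD5 chosen only for illustration)."""
    return hashlib.md5(content).hexdigest()


def pages_to_index(fetched: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return the URLs whose content is new or has changed.

    `fetched` maps URL -> freshly fetched body (hypothetical input).
    `seen` maps URL -> digest recorded by the previous crawl; it is
    updated in place so the next crawl compares against this one.
    """
    changed = []
    for url, content in fetched.items():
        digest = content_digest(content)
        if seen.get(url) != digest:
            # Either the URL was never seen or its body differs.
            changed.append(url)
            seen[url] = digest
    return changed
```

On a second pass over the same URLs, only pages whose body differs from the stored digest would be returned for indexing; unchanged pages are skipped.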

Re: How do I recrawl

Posted by Stefan Groschupf <sg...@media-style.com>.
Try the whole-web crawling section of the tutorial:
http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling

On 16.03.2005 at 12:55, Konstantin Ott wrote:

> Hi,
> Having crawled a list of URLs from the urls file once, I would like 
> to crawl again. I don't want to inject again, and I hope Nutch can 
> recognise already indexed content by its digest, so that Nutch only 
> indexes modified and new files.
> I notice that every crawl adds more segment directories.
> So what is the recommended way to recrawl the same urls file?
>
> thanks
> Konstantin Ott
>
>
-----------information technology-------------------
company: http://www.media-style.com
forum:   http://www.text-mining.org
blog:    http://www.find23.net