Posted to user@nutch.apache.org by rulesmm <ru...@gmail.com> on 2010/01/11 07:13:52 UTC

Maintaining website version with Nutch

Hi,
Is there a way we can maintain versions of a website with Nutch?
Example: I want to index testsite.com for a month and then generate a
timestamped dump of the contents added during that period.

Thanks
-- 
View this message in context: http://old.nabble.com/Maintaining-website-version-with-Nutch-tp27106277p27106277.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Maintaining website version with Nutch

Posted by rulesmm <ru...@gmail.com>.
OK, thanks. I also have to index and search the archived contents, and Nutch
does an excellent job of that :).




Ken Krugler wrote:
> 
> 
> On Jan 10, 2010, at 10:13pm, rulesmm wrote:
> 
>>
>> Hi,
>> Is there a way we can maintain versions of a website with Nutch?
>> Example: I want to index testsite.com for a month and then generate a
>> timestamped dump of the contents added during that period.
> 
> You should be able to do this with Nutch, yes. You'd need to tune the  
> config parameters to do the re-crawl at the target interval.
> 
> Though for pure site archiving, Heritrix is a more optimized solution,  
> especially when used with some of the add-on admin GUIs.
> 
> -- Ken
> 
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g



Re: Maintaining website version with Nutch

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 10, 2010, at 10:13pm, rulesmm wrote:

>
> Hi,
> Is there a way we can maintain versions of a website with Nutch?
> Example: I want to index testsite.com for a month and then generate a
> timestamped dump of the contents added during that period.

You should be able to do this with Nutch, yes. You'd need to tune the  
config parameters to do the re-crawl at the target interval.

Though for pure site archiving, Heritrix is a more optimized solution,  
especially when used with some of the add-on admin GUIs.
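
As a rough illustration of the kind of config tuning meant above: one parameter
that is usually involved is db.fetch.interval.default, which (as far as I know)
is given in seconds in nutch-default.xml. The small Python helper below is a
hypothetical sketch, not part of Nutch; it only builds the override you would
paste into conf/nutch-site.xml for a chosen re-crawl interval. Check the
property name and units against the nutch-default.xml shipped with your
version, and keep in mind that pages are only re-fetched when the crawl cycle
(generate, fetch, updatedb) is actually run that often.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def fetch_interval_property(days):
    """Build a nutch-site.xml <property> override for the re-fetch interval.

    Assumption: db.fetch.interval.default is expressed in seconds.
    This helper is hypothetical and not part of Nutch itself.
    """
    seconds = int(days * SECONDS_PER_DAY)
    return (
        "<property>\n"
        "  <name>db.fetch.interval.default</name>\n"
        f"  <value>{seconds}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # A daily re-crawl: prints a property with the value 86400.
    print(fetch_interval_property(1))
```

The printed element goes inside the <configuration> element of
conf/nutch-site.xml, where it overrides the default from nutch-default.xml.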

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g