Posted to user@nutch.apache.org by Annona Keene <an...@yahoo.com> on 2007/04/24 23:57:59 UTC

Nutch 0.9 recrawl

I've been using Nutch for a little while now, and the new release is great. I'm hoping someone can help me with what I'm trying to do.

One of the sites I crawl is basically an archive for a mailing list, so there's a lot of data that never changes, plus new pages every day. I'm not entirely clear on how recrawling works. Is there a way to have Nutch crawl just the new pages and skip the old ones I've already crawled, which are essentially static?

Any help would be greatly appreciated.

Thanks,
Ann



Re: Nutch 0.9 recrawl

Posted by Arun Kaundal <ar...@gmail.com>.
It would be better to use Heritrix + NutchWAX. I can help you set this up if
you like the idea.

