Posted to user@nutch.apache.org by al...@aim.com on 2012/05/22 22:25:44 UTC

using less resources

Hello,

As far as I understand, Nutch recrawls URLs once their fetch time has passed the current time, regardless of whether those URLs were modified.
Is there any initiative to restrict recrawls to only those URLs whose modified time (MT) is greater than the old MT?
In detail: if Nutch has crawled a URL with a next fetch time of 30 days, then on the second recrawl Nutch must visit this URL, retrieve its modified time, and compare it with the modified time stored in the crawldb. It should recrawl the URL if the new MT is greater than the old one and skip it otherwise.
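
To make the comparison concrete, here is a minimal sketch in Python. It is not Nutch code: the function name should_refetch and its parameters are hypothetical, and it assumes the stored MT is a timezone-aware datetime and that the new MT arrives as an HTTP Last-Modified header string.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def should_refetch(stored_mt: Optional[datetime],
                   last_modified_header: Optional[str]) -> bool:
    """Decide whether a URL needs a full recrawl (hypothetical helper).

    stored_mt is assumed to be timezone-aware (what the crawldb would hold);
    last_modified_header is the raw Last-Modified value, or None if absent.
    """
    # Without an old or a new MT there is nothing to compare: recrawl.
    if stored_mt is None or not last_modified_header:
        return True
    try:
        new_mt = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Unparseable header: fall back to recrawling.
        return True
    if new_mt.tzinfo is None:
        # Headers without a usable zone are treated as UTC (an assumption).
        new_mt = new_mt.replace(tzinfo=timezone.utc)
    # Recrawl only if the page reports a newer modification time.
    return new_mt > stored_mt
```

For example, with a stored MT of 1 May 2012 (UTC), a header of "Tue, 22 May 2012 22:25:44 GMT" would lead to a recrawl, while "Sun, 01 Apr 2012 00:00:00 GMT" would lead to a skip. Obtaining the header still costs one request per URL, but not a download of the full page.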

Thanks.
Alex.



Re: using less resources

Posted by alxsss <al...@aim.com>.
I was thinking of using the Last-Modified header, but it may be absent. In that
case we could use the signature of the URLs at indexing time. I took a look at
the code; it seems to be implemented but not working. I tested nutch-1.4 with
a single URL: solrindexer always sends the same number of documents to Solr
although none of the URLs has changed.
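
What I would expect from a signature check is roughly the following. This is only a sketch in Python, not the Nutch implementation: content_signature and select_changed are hypothetical names, and the MD5 digest of the raw content is an assumption about how the signature is computed.

```python
import hashlib
from typing import Dict, List


def content_signature(content: bytes) -> str:
    """Digest of the page content, used as its signature (assumed: MD5)."""
    return hashlib.md5(content).hexdigest()


def select_changed(pages: Dict[str, bytes], known: Dict[str, str]) -> List[str]:
    """Return the URLs whose signature differs from the stored one.

    pages maps URL -> freshly fetched content; known maps URL -> signature
    from the previous run and is updated in place.
    """
    changed = []
    for url, content in pages.items():
        sig = content_signature(content)
        if known.get(url) != sig:
            # New or modified page: it should be sent to the index again.
            changed.append(url)
            known[url] = sig
    return changed
```

With this logic, a second run over an unchanged URL would return an empty list, so nothing would be sent to the indexer.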

Thanks.
Alex.


Re: using less resources

Posted by remi tassing <ta...@gmail.com>.
I was wondering: how do you know whether the page has changed without actually
fetching it?
