Posted to user@nutch.apache.org by Max Dzyuba <ma...@comintelli.com> on 2012/08/24 17:15:57 UTC

recrawl a URL?

Hello everyone,

I run a crawl command every day, but I don't want Nutch to submit an update
to Solr if a particular page hasn't changed. How do I achieve that? Right
now, setting db.fetch.interval.default doesn't seem to prevent this: updates
are still submitted to Solr as if the page had changed, even though I know
for sure that it has not. This happens on every new crawl command.

Thanks in advance,

Max


RE: recrawl a URL?

Posted by Markus Jelsma <ma...@openindex.io>.
Hmm, I had to look it up, but it is supported in 1.5 and 1.5.1:

http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
 
 

RE: recrawl a URL?

Posted by Max Dzyuba <ma...@comintelli.com>.
Thank you for the reply. Does that mean it is not supported in the latest stable release of Nutch?




RE: recrawl a URL?

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Trunk has a feature for this: indexer.skip.notmodified

Cheers 
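
For reference, a minimal sketch of how this property could be set, assuming the usual approach of overriding defaults in conf/nutch-site.xml with Hadoop-style property entries. The file location and the value "true" are assumptions, not something stated in this thread; check the description and default of indexer.skip.notmodified in the nutch-default.xml shipped with your release.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>indexer.skip.notmodified</name>
  <!-- Assumption: "true" enables skipping of records the crawl db marks as not modified -->
  <value>true</value>
</property>
```

As far as I understand it, this only has an effect for pages whose crawl db status is actually "not modified" after a refetch, so it is worth checking the status of the page in the crawl db if updates still reach Solr.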
 