Posted to user@nutch.apache.org by Max Dzyuba <ma...@comintelli.com> on 2012/08/24 17:15:57 UTC
recrawl a URL?
Hello everyone,
I run a crawl command every day, but I don't want Nutch to submit an update
to Solr if a particular page hasn't changed. How do I achieve that? Right
now the value of db.fetch.interval.default doesn't seem to prevent this:
updates are submitted to Solr as if the page had changed, even though I
know for sure that it has not. This happens on every new crawl command.
Thanks in advance,
Max
RE: recrawl a URL?
Posted by Markus Jelsma <ma...@openindex.io>.
Hmm, I had to look it up, but it is supported in 1.5 and 1.5.1:
http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
-----Original message-----
> From:Max Dzyuba <ma...@comintelli.com>
> Sent: Fri 24-Aug-2012 17:35
> To: Markus Jelsma <ma...@openindex.io>; user@nutch.apache.org
> Subject: RE: recrawl a URL?
>
> Thank you for the reply. Does that mean it is not supported in the latest stable release of Nutch?
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
> Sent: 24 August 2012 17:21
> To: user@nutch.apache.org; Max Dzyuba
> Subject: RE: recrawl a URL?
>
> Hi,
>
> Trunk has a feature for this: indexer.skip.notmodified
>
> Cheers
>
RE: recrawl a URL?
Posted by Max Dzyuba <ma...@comintelli.com>.
Thank you for the reply. Does that mean it is not supported in the latest stable release of Nutch?
-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io]
Sent: 24 August 2012 17:21
To: user@nutch.apache.org; Max Dzyuba
Subject: RE: recrawl a URL?
Hi,
Trunk has a feature for this: indexer.skip.notmodified
Cheers
RE: recrawl a URL?
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
Trunk has a feature for this: indexer.skip.notmodified
Cheers
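For reference, a minimal sketch of how the indexer.skip.notmodified option mentioned in this thread could be switched on. It assumes the usual Nutch 1.x layout, where site-specific overrides go inside the <configuration> element of conf/nutch-site.xml, and it assumes the option is off by default. The description text is a paraphrase, not a quote from the Nutch sources. Whether a page actually ends up marked as not modified depends on how the crawl detects changes, which this thread does not cover.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside <configuration> -->
<property>
  <name>indexer.skip.notmodified</name>
  <value>true</value>
  <!-- Paraphrased intent: have the indexer skip records whose
       CrawlDb status says the page was not modified. -->
  <description>Skip unmodified pages when indexing.</description>
</property>
```

With this set, the next indexing run should leave unchanged pages out of the Solr update, provided Nutch has marked them as not modified in the CrawlDb.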