Posted to user@nutch.apache.org by Mike Pountney <Mi...@semantico.com> on 2010/09/03 11:17:41 UTC
Dynamically changing the URL retry interval
I'd like to refetch pages that I know change frequently more often.
Does anyone know of a way to set a lower retry interval on a set of pages matched by a regex?
Thanks in advance,
Mike
Re: Dynamically changing the URL retry interval
Posted by Julien Nioche <li...@gmail.com>.
Hi Mike,
Using the Adaptive Fetch Schedule is definitely a good option.
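For readers unfamiliar with the idea, here is a minimal sketch of how an adaptive schedule adjusts the interval: shrink it when a page was found changed, grow it when it was not, and keep it within bounds. The class name, the rates (0.4 and 0.2) and the bounds are assumptions chosen for illustration, not values taken from this thread, and this is not the Nutch implementation itself.

```java
public class AdaptiveIntervalSketch {
    // Illustrative assumptions: growth/shrink rates and bounds in seconds.
    static final double INC_RATE = 0.4;          // grow by 40% when unchanged
    static final double DEC_RATE = 0.2;          // shrink by 20% when changed
    static final long MIN_INTERVAL = 60L;        // one minute
    static final long MAX_INTERVAL = 31_536_000L; // 365 days

    // Returns the next fetch interval in seconds, clamped to the bounds.
    static long nextInterval(long currentSeconds, boolean modified) {
        double next = modified
                ? currentSeconds * (1.0 - DEC_RATE)
                : currentSeconds * (1.0 + INC_RATE);
        long rounded = Math.round(next);
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, rounded));
    }

    public static void main(String[] args) {
        long interval = 86_400L; // start at one day
        boolean[] observations = {true, true, false};
        for (boolean changed : observations) {
            interval = nextInterval(interval, changed);
            System.out.println((changed ? "changed" : "unchanged")
                    + " -> " + interval + "s");
        }
    }
}
```

With rates like these, a page that keeps changing drifts towards the minimum interval on its own, without any per-URL configuration.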
You can also specify a custom nutch.fetchInterval for each URL of the seed
list via its metadata, e.g.

http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

(\t stands for a tab character; the interval is in seconds, so 2592000 is 30 days)
but this works only for the URLs known at the injection step. Writing a
custom map-reduce job to set a custom fetch interval for the URLs matching a
regex would not be too difficult.
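The core of such a job is just a lookup from URL to interval. Below is a minimal, self-contained sketch of that lookup only; the class name RegexFetchInterval, its methods and the example patterns are hypothetical, and the Hadoop/Nutch wiring (reading the CrawlDb and writing the new interval back to each entry) is deliberately left out because it depends on the versions in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: maps a URL to a fetch interval (seconds) using
// an ordered list of regex rules; the first matching rule wins.
public class RegexFetchInterval {
    private static final class Rule {
        final Pattern pattern;
        final int intervalSeconds;
        Rule(Pattern pattern, int intervalSeconds) {
            this.pattern = pattern;
            this.intervalSeconds = intervalSeconds;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final int defaultIntervalSeconds;

    public RegexFetchInterval(int defaultIntervalSeconds) {
        this.defaultIntervalSeconds = defaultIntervalSeconds;
    }

    // Adds a rule; rules are checked in the order they were added.
    public RegexFetchInterval addRule(String regex, int intervalSeconds) {
        rules.add(new Rule(Pattern.compile(regex), intervalSeconds));
        return this;
    }

    // Returns the interval of the first rule whose regex is found in
    // the URL, or the default interval when no rule matches.
    public int intervalFor(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.intervalSeconds;
            }
        }
        return defaultIntervalSeconds;
    }

    public static void main(String[] args) {
        // Example patterns are made up: refetch /news/ pages hourly,
        // everything else every 30 days.
        RegexFetchInterval intervals = new RegexFetchInterval(2592000)
                .addRule("^https?://[^/]+/news/", 3600);
        System.out.println(intervals.intervalFor("http://www.example.com/news/today.html"));
        System.out.println(intervals.intervalFor("http://www.example.com/about.html"));
    }
}
```

In a map-reduce job this lookup would presumably be called once per CrawlDb entry in the map step, with the rules loaded from configuration rather than hard-coded.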
In Nutch 2.0 we should be able to do that on a host basis (see
https://issues.apache.org/jira/browse/NUTCH-882).
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
On 3 September 2010 10:52, Mike Pountney <Mi...@semantico.com> wrote:
> To partially answer my own question, I've just found the Adaptive Fetch
> Schedule blog post by Pascal Dimassimo:
>
> http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
>
> I think this covers what I need to do, but any other advice in this area
> would be appreciated.
>
> Cheers,
>
> Mike
Re: Dynamically changing the URL retry interval
Posted by Mike Pountney <Mi...@semantico.com>.
To partially answer my own question, I've just found the Adaptive Fetch Schedule blog post by Pascal Dimassimo:
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
I think this covers what I need to do, but any other advice in this area would be appreciated.
Cheers,
Mike
--
Mike Pountney
Information Systems Manager, Semantico Limited
<ma...@semantico.com> <tel:+44 1273 358 209>
Registered in England and Wales no. 03841410, VAT no. GB-744614334.
Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK.
Check out all our latest news and thinking on our blog;
- http://blogs.semantico.com/discovery-blog/
Follow Semantico on Twitter;
- http://twitter.com/semantico