Posted to user@nutch.apache.org by kemical <mi...@gmail.com> on 2013/02/05 16:04:54 UTC

invalidate fetch interval only for given urls

Hi,

I'd like to invalidate the fetch interval of specific URLs without waiting
for db.default.fetch.interval to expire.
-adddays seems to do the job, but only for the whole database.

I was thinking about using the freegen command (on my seed URL file), but how
can I be sure it will fetch URLs whose fetch interval has not yet expired?

A short explanation of why I need this:
The tool is meant to improve search on newly featured content (website
homepages), so almost every URL in my seed list needs to be refetched every
day (but I still want to keep 30 days for all the others).

I'm using Nutch 1.6 and, as far as possible, I'd rather not write plugins
since I'm not a Java dev (as soon as my crawler is clean, I'll focus on the
front end with my usual tools/languages).



--
View this message in context: http://lucene.472066.n3.nabble.com/invalidate-fetch-interval-only-for-given-urls-tp4038591.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: invalidate fetch interval only for given urls

Posted by Julien Nioche <li...@gmail.com>.

Hi Mickael,

No problem. Indeed db.injector.update is necessary if you want to modify
existing entries in a crawlDB but you won't need it if injecting into a
brand new one.

J.

On 8 February 2013 08:21, kemical <mi...@gmail.com> wrote:

> [...]



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: invalidate fetch interval only for given urls

Posted by kemical <mi...@gmail.com>.

Hi Julien and thanks,

First, I'm sorry for asking something that was in the documentation
(and easy to find).

I've re-injected the URLs with nutch.fetchInterval and nutch.score so they
can be fetched each day, and when I run readdb on one of those URLs, I see
that everything has been updated correctly.

A hint for anyone who finds this thread: you need to set this property in
your conf:
db.injector.update=true
otherwise existing URLs that are re-injected will be ignored.

Please correct me if I'm wrong (but from the tests I've done it seems
mandatory).
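
For readers who want to confirm the setting before re-injecting, here is a
minimal sketch in Python. It assumes the usual Hadoop-style layout of
conf/nutch-site.xml (property elements with name and value children) and
assumes that a missing property means updates are off. The function name and
the sample XML are made up for illustration and are not part of Nutch.

```python
import xml.etree.ElementTree as ET


def injector_update_enabled(xml_text):
    """Return True if db.injector.update is set to 'true' in the config XML."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "db.injector.update":
            value = (prop.findtext("value") or "").strip()
            return value.lower() == "true"
    # Property not present: assumed to mean updates are disabled.
    return False


# Hypothetical sample standing in for the contents of conf/nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>db.injector.update</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(injector_update_enabled(SAMPLE))
```

To check a real installation, the file contents could be read with
open("conf/nutch-site.xml").read() and passed to the function, assuming that
is where the local overrides live.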





Re: invalidate fetch interval only for given urls

Posted by Julien Nioche <li...@gmail.com>.

See http://wiki.apache.org/nutch/bin/nutch_inject => "*nutch.fetchInterval*:
allows to set a custom fetch interval for a specific URL"
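
A rough sketch of what that could look like in practice, written in Python
since the original poster is not a Java dev. It assumes the injector reads
per-URL metadata as tab-separated key=value pairs after the URL, and that
nutch.fetchInterval is expressed in seconds; both points are worth checking
against the wiki page above. The helper name and the example URLs are
hypothetical.

```python
# Sketch only: builds seed-list lines carrying per-URL injector metadata.
# seed_line() and the example URLs are illustrative, not part of Nutch.

SECONDS_PER_DAY = 24 * 60 * 60


def seed_line(url, fetch_interval_days=1, score=None):
    """Return one seed line: the URL, then tab-separated key=value metadata."""
    interval = fetch_interval_days * SECONDS_PER_DAY  # assumed unit: seconds
    fields = [url, "nutch.fetchInterval=%d" % interval]
    if score is not None:
        fields.append("nutch.score=%s" % score)
    return "\t".join(fields)


if __name__ == "__main__":
    for url in ["http://www.example.com/", "http://www.example.org/"]:
        print(seed_line(url, fetch_interval_days=1, score=10))
```

The printed lines can be redirected into the seed file that is then passed to
the inject command; URLs that are not listed keep the default interval.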


On 5 February 2013 15:04, kemical <mi...@gmail.com> wrote:

> [...]


