Posted to user@nutch.apache.org by chris sleeman <ch...@gmail.com> on 2007/10/13 08:56:12 UTC

Fetch schedule and unmodified content

Hi,

Can someone please explain how the fetcher behaves with respect to
modified/unmodified content, in the current trunk version?

My requirement is basically this: I have one page (the seed url) which
links to other urls. The links on this page change on a daily basis.
I want nutch to keep refetching this page, as it changes regularly,
but not refetch the outlinks on this page, since they are more or less
static.

I have set both "db.fetch.interval.default" and "db.fetch.interval.max" to a
high value of approx. 1 year and am using the DefaultFetchSchedule
class. Does this imply that even for pages which have been modified,
the next fetch would be after a year? Or do I need to use the
AdaptiveFetchSchedule?

I would be really thankful if someone could help me with my fetcher
settings.

Regards,
Chris

Re: Fetch schedule and unmodified content

Posted by chris sleeman <ch...@gmail.com>.
Thanks for your input. I will try it out.

-Chris

Re: Fetch schedule and unmodified content

Posted by Andrzej Bialecki <ab...@getopt.org>.
chris sleeman wrote:
> Hi Andrzej,
> 
> Thanks for your response. However, I still have a couple of doubts.
> 
>> In your case, I would recommend setting a very short interval for the
>> main page, and setting longer (default) intervals for other pages.
> 
> Isn't the fetch interval a system-wide setting? Or can we set it for
> individual urls?

That's true. A workaround for this would be to inject urls in batches - 
for the first batch you would set a certain fetch interval, and for the 
next batch you would set a different one (seems a bit ugly, I admit, but 
it works).

Perhaps we should add a command-line option to Injector to specify the 
fetch interval for urls?

Or we should use a line-oriented text format for Injector, which would
allow specifying the fetch interval and/or other metadata, something like this:

seedFile ::= {lineEOL} ;
lineEOL ::= line <EOL> ;
line ::= url [{"|" meta}] ;
url ::= ? valid url characters except pipe symbol ? ;
meta ::= type " " name " " value ;
type ::= "S" | "F" | "I" | "L" (* string, float, int, long *) ;
name ::= ? any string without whitespace or pipe symbol ? ;
value ::= ? any string except pipe symbol ? ;
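
To make the proposed format concrete, here is a minimal sketch of a parser for one such line. It is only an illustration of the grammar above: the class SeedLineParser, its SeedEntry type, and the metadata names used in the usage note are hypothetical and do not exist in Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the proposed seed-line format; not part of Nutch. */
final class SeedLineParser {

    /** One parsed seed line: the url plus its typed metadata, in order of appearance. */
    static final class SeedEntry {
        final String url;
        final Map<String, Object> meta = new LinkedHashMap<>();

        SeedEntry(String url) {
            this.url = url;
        }
    }

    /** Parses a single line of the form: url|TYPE name value|TYPE name value ... */
    static SeedEntry parse(String line) {
        String[] parts = line.split("\\|");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("missing url: " + line);
        }
        SeedEntry entry = new SeedEntry(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            // Split into exactly three fields so that the value may contain spaces.
            String[] fields = parts[i].split(" ", 3);
            if (fields.length != 3) {
                throw new IllegalArgumentException("malformed meta: " + parts[i]);
            }
            entry.meta.put(fields[1], convert(fields[0], fields[2]));
        }
        return entry;
    }

    /** Converts the value according to the one-letter type tag (S, F, I or L). */
    private static Object convert(String type, String value) {
        switch (type) {
            case "S": return value;
            case "F": return Float.valueOf(value);
            case "I": return Integer.valueOf(value);
            case "L": return Long.valueOf(value);
            default: throw new IllegalArgumentException("unknown type: " + type);
        }
    }
}
```

With this sketch, a line such as "http://example.com/index.html|I fetchInterval 86400" would yield the url plus one Integer entry named fetchInterval. That metadata name is an assumption; the proposal does not fix any names.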



> 
> What I would basically need is a different fetch interval for injected
> urls (seed urls) as compared to the other urls.
> Since this may not be available out of the box, I was thinking of just
> modifying the injector code to use a very different value for the fetch
> interval in this case. Would such an approach work? And will the same
> fetch value, set once per url, be used throughout?

The fetch interval value is set by calling 
FetchSchedule.initializeSchedule(), so you should probably modify the 
implementation of this method in your active FetchSchedule.
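
As a rough illustration of that idea, the sketch below picks the initial interval per url. It is a standalone model, not Nutch code: SeedAwareSchedule and initialInterval() are hypothetical names, and the real initializeSchedule() signature is deliberately not reproduced here. In an actual FetchSchedule the same decision would be made inside initializeSchedule().

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a fetch schedule that treats seed urls specially. */
final class SeedAwareSchedule {
    private final Set<String> seedUrls;
    private final int seedIntervalSec;
    private final int defaultIntervalSec;

    SeedAwareSchedule(Set<String> seedUrls, int seedIntervalSec, int defaultIntervalSec) {
        // Copy the set so that later changes by the caller do not affect the schedule.
        this.seedUrls = new HashSet<>(seedUrls);
        this.seedIntervalSec = seedIntervalSec;
        this.defaultIntervalSec = defaultIntervalSec;
    }

    /** Returns the initial fetch interval, in seconds, to store for a newly added url. */
    int initialInterval(String url) {
        return seedUrls.contains(url) ? seedIntervalSec : defaultIntervalSec;
    }
}
```

Whether the value stays the same afterwards depends on the schedule in use: a fixed schedule would keep it, while an adaptive one would change it over time.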


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetch schedule and unmodified content

Posted by chris sleeman <ch...@gmail.com>.
Hi Andrzej,

Thanks for your response. However, I still have a couple of doubts.

>In your case, I would recommend setting a very short interval for the
>main page, and setting longer (default) intervals for other pages.

Isn't the fetch interval a system-wide setting? Or can we set it for
individual urls?

What I would basically need is a different fetch interval for injected
urls (seed urls) as compared to the other urls.
Since this may not be available out of the box, I was thinking of just
modifying the injector code to use a very different value for the fetch
interval in this case. Would such an approach work? And will the same
fetch value, set once per url, be used throughout?

Thanks and Regards,
Chris

Re: Fetch schedule and unmodified content

Posted by Andrzej Bialecki <ab...@getopt.org>.
chris sleeman wrote:
> Hi,
> 
> Can someone please explain how the fetcher behaves with respect to
> modified/unmodified content, in the current trunk version?
> 
> My requirement is basically this: I have one page (the seed url) which
> links to other urls. The links on this page change on a daily basis.
> I want nutch to keep refetching this page, as it changes regularly,
> but not refetch the outlinks on this page, since they are more or less
> static.

Nutch will behave differently, depending on which fetch schedule you're 
using. With the DefaultFetchSchedule, the refetch period is fixed and 
doesn't change, no matter whether a page was modified or not. With 
AdaptiveFetchSchedule, Nutch will adjust the refetch interval to match the 
expected period of changes.
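
The difference between the two policies can be sketched as follows. This is a toy model only: the class and method names, the multiplicative update rule, and the rate parameters are assumptions made for illustration, not the actual Nutch implementation.

```java
/** Toy models of a fixed and an adaptive refetch policy; not Nutch code. */
final class SchedulePolicies {

    /** Fixed policy: the interval is returned unchanged whether or not the page changed. */
    static double fixedInterval(double intervalSec, boolean modified) {
        return intervalSec;
    }

    /**
     * Adaptive policy (assumed rule): shrink the interval after a change,
     * grow it otherwise, and clamp the result to [minSec, maxSec].
     */
    static double adaptiveInterval(double intervalSec, boolean modified,
                                   double decRate, double incRate,
                                   double minSec, double maxSec) {
        double next = modified
                ? intervalSec * (1.0 - decRate)
                : intervalSec * (1.0 + incRate);
        return Math.max(minSec, Math.min(maxSec, next));
    }
}
```

Under this model, a page that changes on every visit drifts towards the minimum interval, while a static page drifts towards the maximum.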

In any case, if a page is not modified, Nutch will try to avoid fetching 
it again (using If-Modified-Since headers).

> 
> I have set both "db.fetch.interval.default" and "db.fetch.interval.max" to a
> high value of approx. 1 year and am using the DefaultFetchSchedule
> class. Does this imply that even for pages which have been modified,
> the next fetch would be after a year?

Correct. Nutch doesn't know that a page has changed unless it actually 
tries to fetch it. Since you're using the DefaultFetchSchedule, and the 
fetch interval is 1 year, Nutch will check the page at 1-year intervals, 
and it will never adjust the interval, no matter what the status of the 
page is.

However, this is not strictly true. Even if you set a very high value for 
this interval, there is a hard limit (db.fetch.interval.max), and pages 
older than this limit will be scheduled for refetching, no matter 
what their fetch interval is.
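
The effect of the hard limit can be sketched as a simple check. This is a simplified model of the rule described above, with hypothetical names (RefetchCheck, isDue) and all times in seconds; it is not the actual Nutch logic.

```java
/** Simplified model of the refetch decision with a hard upper limit; not Nutch code. */
final class RefetchCheck {

    /**
     * A page is due when its own interval has elapsed or, failing that,
     * when it is older than the hard limit (db.fetch.interval.max).
     */
    static boolean isDue(long lastFetchSec, long intervalSec, long maxIntervalSec, long nowSec) {
        long age = nowSec - lastFetchSec;
        // Whichever of the two intervals is shorter decides when the page is due.
        return age >= Math.min(intervalSec, maxIntervalSec);
    }
}
```

In this model, a page with a 10-year interval and a 90-day limit becomes due once it is 90 days old. If both settings are about 1 year, as in the question, the limit makes no practical difference.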

In your case, I would recommend setting a very short interval for the 
main page, and setting longer (default) intervals for other pages. 
Additionally, you can use AdaptiveFetchSchedule to adjust these intervals.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com