Posted to user@nutch.apache.org by chris sleeman <ch...@gmail.com> on 2007/10/13 08:56:12 UTC
Fetch schedule and unmodified content
Hi,
Can someone please explain how the fetcher behaves with respect to
modified/unmodified content, in the current trunk version?
My requirement is basically this: I have one page (the seed url) which
links to other urls. The links on this page change on a daily basis.
I want nutch to keep refetching this page, as it changes regularly,
but not refetch the outlinks on this page since they are more or less
static.
I have set both "db.fetch.interval.default" and "db.fetch.interval.max" to a
high value of approximately 1 year and am using the DefaultFetchSchedule
class. Does this imply that even for pages which have been modified,
the next fetch would be after a year? Or do I need to use the
AdaptiveFetchSchedule?
I would be really thankful if someone could help me with my fetcher
settings.
Regards,
Chris
Re: Fetch schedule and unmodified content
Posted by chris sleeman <ch...@gmail.com>.
Thanks for your input. I will try it out.
-Chris
Re: Fetch schedule and unmodified content
Posted by Andrzej Bialecki <ab...@getopt.org>.
chris sleeman wrote:
> Hi Andrzej,
>
> Thanks for your response. However, I still have a couple of doubts.
>
>> In your case, I would recommend setting a very short interval for the
>> main page, and setting longer (default) intervals for other pages.
>
> Isn't the fetch interval a system-wide setting? Or can we set it for
> individual urls?
That's true, the fetch interval is a system-wide setting. A workaround
would be to inject urls in batches: for the first batch you would set a
certain fetch interval, and for the next batch you would set a different
one (seems a bit ugly, I admit, but it works).
Perhaps we should add a command-line option to Injector to specify the
fetch interval for urls?
Or we could use a line-oriented text format for Injector, which allows
specifying the fetch interval and/or other metadata, something like this:
seedFile ::= {lineEOL} ;
lineEOL ::= line <EOL> ;
line ::= url [{"|" meta}] ;
url ::= ? valid url characters except pipe symbol ? ;
meta ::= type " " name " " value ;
type ::= "S" | "F" | "I" | "L" (* string, float, int, long *) ;
name ::= ? any string without whitespace or pipe symbol ? ;
value ::= ? any string except pipe symbol ? ;
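To make the proposed grammar concrete, here is a minimal sketch of a parser for one such seed line. It is only an illustration of the format suggested above, not existing Injector code; the class name SeedLineParser, the Seed holder, and the metadata names in the example line are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the seed-file format proposed above; not part of Nutch.
final class SeedLineParser {

    /** One parsed seed line: the url plus its typed metadata, in file order. */
    static final class Seed {
        final String url;
        final Map<String, Object> meta = new LinkedHashMap<>();
        Seed(String url) { this.url = url; }
    }

    /** Parses a single line of the form: url|T name value|T name value ... */
    static Seed parseLine(String line) {
        String[] parts = line.split("\\|", -1);
        Seed seed = new Seed(parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            // meta ::= type " " name " " value; the value may itself contain spaces.
            String[] fields = parts[i].split(" ", 3);
            if (fields.length != 3 || fields[0].length() != 1) {
                throw new IllegalArgumentException("Malformed metadata: " + parts[i]);
            }
            seed.meta.put(fields[1], convert(fields[0].charAt(0), fields[2]));
        }
        return seed;
    }

    /** Converts the raw value according to the one-letter type code. */
    private static Object convert(char type, String value) {
        switch (type) {
            case 'S': return value;
            case 'F': return Float.valueOf(value);
            case 'I': return Integer.valueOf(value);
            case 'L': return Long.valueOf(value);
            default: throw new IllegalArgumentException("Unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        // Example line with made-up metadata names.
        Seed s = parseLine("http://example.com/index.html|I fetchInterval 86400|S label daily news");
        System.out.println(s.url + " " + s.meta);
    }
}
```

A seed with a one-day interval would then be written as a single line such as "http://example.com/index.html|I fetchInterval 86400", while urls without metadata stay plain.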
>
> What I would basically need is a different fetch interval for injected
> (seed) urls as compared to the other urls.
> Since this may not be available out of the box, I was thinking of just
> modifying the injector code and using a much different value for the
> fetch interval in this case. Would such an approach work? And will the
> same fetch value, set once per url, be used throughout?
The fetch interval value is set by calling
FetchSchedule.initializeSchedule(), so you should probably modify the
implementation of this method in your active FetchSchedule.
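As a rough illustration of the decision such a modified initializeSchedule() could make, the sketch below picks an initial interval depending on whether a url is a known seed. It deliberately uses no Nutch classes, so it runs on its own; the class name SeedAwareInterval, the seed set, and the interval values are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-in for the choice an overridden initializeSchedule() could make.
// It does not use Nutch classes; all names and values here are illustrative.
final class SeedAwareInterval {
    private final Set<String> seedUrls;
    private final int seedIntervalSeconds;
    private final int defaultIntervalSeconds;

    SeedAwareInterval(Set<String> seedUrls, int seedIntervalSeconds, int defaultIntervalSeconds) {
        this.seedUrls = new HashSet<>(seedUrls);
        this.seedIntervalSeconds = seedIntervalSeconds;
        this.defaultIntervalSeconds = defaultIntervalSeconds;
    }

    /** Returns the initial fetch interval: short for seeds, long for everything else. */
    int initialIntervalFor(String url) {
        return seedUrls.contains(url) ? seedIntervalSeconds : defaultIntervalSeconds;
    }

    public static void main(String[] args) {
        int oneDay = 24 * 60 * 60;
        int oneYear = 365 * oneDay;
        SeedAwareInterval chooser = new SeedAwareInterval(
                new HashSet<>(Arrays.asList("http://example.com/index.html")), oneDay, oneYear);
        System.out.println(chooser.initialIntervalFor("http://example.com/index.html"));
        System.out.println(chooser.initialIntervalFor("http://example.com/article/1"));
    }
}
```

In a real FetchSchedule implementation the same check would live inside initializeSchedule(), and the chosen value would be stored as the fetch interval of the entry being initialized.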
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Fetch schedule and unmodified content
Posted by chris sleeman <ch...@gmail.com>.
Hi Andrzej,
Thanks for your response. However, I still have a couple of doubts.
>In your case, I would recommend setting a very short interval for the
>main page, and setting longer (default) intervals for other pages.
Isn't the fetch interval a system-wide setting? Or can we set it for
individual urls?
What I would basically need is a different fetch interval for injected
(seed) urls as compared to the other urls.
Since this may not be available out of the box, I was thinking of just
modifying the injector code and using a much different value for the
fetch interval in this case. Would such an approach work? And will the
same fetch value, set once per url, be used throughout?
Thanks and Regards,
Chris
Re: Fetch schedule and unmodified content
Posted by Andrzej Bialecki <ab...@getopt.org>.
chris sleeman wrote:
> Hi,
>
> Can someone please explain how the fetcher behaves with respect to
> modified/unmodified content, in the current trunk version?
>
> My requirement is basically this: I have one page (the seed url) which
> links to other urls. The links on this page change on a daily basis.
> I want nutch to keep refetching this page, as it changes regularly,
> but not refetch the outlinks on this page since they are more or less
> static.
Nutch will behave differently, depending on which fetch schedule you're
using. With the DefaultFetchSchedule, the refetch period is fixed and
doesn't change, no matter whether a page is modified or not. With
AdaptiveFetchSchedule, Nutch will adjust the refetch interval to match the
expected period of changes.
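The difference between the two schedules can be sketched as follows. This is a simplified model of the idea, not the actual AdaptiveFetchSchedule code: the class name, the two rate constants, and the clamping bounds are assumptions chosen for illustration.

```java
// Simplified model of fixed vs. adaptive refetch intervals; not Nutch code.
final class AdaptiveIntervalSketch {
    // Illustrative rates (assumptions): grow by 40% when unmodified, shrink by 20% when modified.
    static final double INC_RATE = 0.4;
    static final double DEC_RATE = 0.2;

    /** Fixed schedule: the interval never changes, whatever the page status. */
    static long fixedNextInterval(long currentSeconds, boolean modified) {
        return currentSeconds;
    }

    /** Adaptive schedule: shorten the interval for changing pages, lengthen it for stable ones. */
    static long adaptiveNextInterval(long currentSeconds, boolean modified,
                                     long minSeconds, long maxSeconds) {
        double next = modified
                ? currentSeconds * (1.0 - DEC_RATE)
                : currentSeconds * (1.0 + INC_RATE);
        // Keep the result within the configured bounds.
        return Math.max(minSeconds, Math.min(maxSeconds, Math.round(next)));
    }

    public static void main(String[] args) {
        long interval = 86400; // start at one day
        boolean[] observations = {true, true, false, false, false};
        for (boolean modified : observations) {
            interval = adaptiveNextInterval(interval, modified, 3600, 31536000);
            System.out.println((modified ? "modified   -> " : "unmodified -> ") + interval);
        }
    }
}
```

With a rule like this, a page that changes daily drifts toward a short interval while static outlinks drift toward the maximum, which is the behaviour asked for in the original question.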
In any case, if a page is not modified, Nutch will try to avoid fetching
it again (using If-Modified-Since headers).
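For readers unfamiliar with conditional requests, the sketch below shows the two pieces involved: formatting the time of the previous fetch as an If-Modified-Since value, and treating a 304 (Not Modified) response as "unchanged". It is a standalone illustration using only the JDK, not the code Nutch's protocol plugins use; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Standalone illustration of a conditional GET; names are hypothetical, not Nutch API.
final class ConditionalFetchSketch {
    // HTTP dates are always expressed in GMT with a two-digit day of month.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    /** Value to send in the If-Modified-Since request header. */
    static String ifModifiedSinceValue(Instant lastFetch) {
        return HTTP_DATE.format(lastFetch);
    }

    /** A 304 status means the server considers the page unchanged since that time. */
    static boolean isUnmodified(int statusCode) {
        return statusCode == 304;
    }

    public static void main(String[] args) {
        Instant lastFetch = Instant.parse("2007-10-13T08:56:12Z");
        System.out.println("If-Modified-Since: " + ifModifiedSinceValue(lastFetch));
        System.out.println("304 unmodified? " + isUnmodified(304));
    }
}
```

Note that this only saves bandwidth when the server honours the header; the crawler still has to contact the server once per interval to find out.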
>
> I have set both "db.fetch.interval.default" and "db.fetch.interval.max" to a
> high value of approximately 1 year and am using the DefaultFetchSchedule
> class. Does this imply that even for pages which have been modified,
> the next fetch would be after a year?
Correct. Nutch doesn't know that a page has changed unless it actually
tries to fetch it. Since you're using the DefaultFetchSchedule, and the
fetch interval is 1 year, Nutch will check the page at 1-year intervals,
and it will never adjust the interval, no matter what the status of the
page is.
However, this is not strictly true. Even if you set a very high value for
this interval, there is a hard limit (db.fetch.interval.max), and pages
older than this limit will be scheduled for refetching, no matter
what their fetch interval is.
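The effect of the hard limit can be modelled roughly as below. This is a simplified reading of the rule just described, not Nutch's actual scheduling code; the class name, method name, and the use of plain epoch seconds are assumptions.

```java
// Simplified model of "refetch when the interval or the hard limit has elapsed"; not Nutch code.
final class RefetchDueSketch {

    /**
     * A page is due when the time since its last fetch reaches its own interval,
     * or the hard limit, whichever is smaller.
     */
    static boolean isDue(long lastFetchSeconds, long intervalSeconds,
                         long maxIntervalSeconds, long nowSeconds) {
        long effectiveInterval = Math.min(intervalSeconds, maxIntervalSeconds);
        return nowSeconds - lastFetchSeconds >= effectiveInterval;
    }

    public static void main(String[] args) {
        long oneDay = 24L * 60 * 60;
        long oneYear = 365 * oneDay;
        // Interval of two years, hard limit of one year: due after one year.
        System.out.println(isDue(0, 2 * oneYear, oneYear, oneYear));
        // Interval of one year, only one day elapsed: not due yet.
        System.out.println(isDue(0, oneYear, oneYear, oneDay));
    }
}
```

So with both settings at one year, as in the question, the seed page and its outlinks alike would only be revisited about once a year.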
In your case, I would recommend setting a very short interval for the
main page, and setting longer (default) intervals for other pages.
Additionally, you can use AdaptiveFetchSchedule to adjust these intervals.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com