Posted to user@nutch.apache.org by Mike Baranczak <mb...@gmail.com> on 2011/02/01 23:15:32 UTC
CrawlDatum.getFetchTime()
From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):
"Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time."
So is there any way to determine which of these two conditions is true, using just the information in CrawlDatum? My goal is to estimate the next time that the item will be fetched.
-MB
Re: CrawlDatum.getFetchTime()
Posted by Markus Jelsma <ma...@openindex.io>.
I should also mention that it is common to execute a series of jobs as a
complete crawl cycle: generate, fetch, parse, update crawldb, update linkdb,
and index (and optionally deduplicate and/or clean the index once that's
committed).
This means that after every complete crawl cycle, the timestamps all show the
next fetch time. That next fetch time is based on a default fetch interval and
is usually influenced by an algorithm that can increase or decrease the
interval depending on whether the page was modified since the previous fetch.
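To make the idea of an adjustable interval concrete, here is a minimal
sketch in plain Java. It is not Nutch code: the class name, the method names,
and the rate and interval constants are all hypothetical, chosen only to
illustrate "shrink the interval when the page changed, grow it when it did
not".

```java
public class FetchScheduleSketch {
    // Hypothetical values for illustration only; not taken from Nutch.
    static final long DEFAULT_INTERVAL_SEC = 30L * 24 * 3600; // assumed default interval
    static final double INC_RATE = 0.5; // assumed growth when the page is unchanged
    static final double DEC_RATE = 0.5; // assumed shrink when the page was modified

    // Shrink the interval if the page changed since the last fetch,
    // otherwise grow it.
    static long adjustInterval(long intervalSec, boolean modified) {
        double factor = modified ? (1.0 - DEC_RATE) : (1.0 + INC_RATE);
        return Math.round(intervalSec * factor);
    }

    // Next fetch time (ms since epoch) = last fetch time + interval.
    static long nextFetchTimeMs(long lastFetchMs, long intervalSec) {
        return lastFetchMs + intervalSec * 1000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long interval = adjustInterval(DEFAULT_INTERVAL_SEC, true);
        System.out.println("adjusted interval (s): " + interval);
        System.out.println("next fetch due (ms):   " + nextFetchTimeMs(now, interval));
    }
}
```

A real schedule would presumably also clamp the interval between a minimum
and a maximum; that is left out here to keep the sketch short.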
Re: CrawlDatum.getFetchTime()
Posted by Markus Jelsma <ma...@openindex.io>.
After the fetch job, the CrawlDatum will contain the time the page was
fetched. Once you run the updatedb job, it will contain the time the page is
next due to be fetched.
It makes sense because newly discovered URLs are added to the CrawlDB and
carry the time after which they are due to be fetched. Subsequent fetch and
update jobs complete the cycle.
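As a rough illustration of how one might estimate the next fetch from the
two cases above, here is a small self-contained Java sketch. It does not use
the Nutch API: the Stage enum, the class, and the method are hypothetical,
and the caller is assumed to already know which job last wrote the record.
In Nutch itself the datum's status (a db status versus a fetch status) may be
a usable hint for that, but that is an assumption, not something confirmed
in this thread.

```java
public class FetchTimeSketch {
    // Hypothetical marker for which job last wrote the fetch time.
    enum Stage { AFTER_FETCH, AFTER_UPDATEDB }

    // Estimate the next fetch time in ms since epoch.
    // - After updatedb, the stored time is already the next due time.
    // - After fetch, the stored time is the last fetch, so an interval
    //   (assumed to be given in seconds) is added to it.
    static long estimateNextFetchMs(Stage stage, long fetchTimeMs, long intervalSec) {
        if (stage == Stage.AFTER_UPDATEDB) {
            return fetchTimeMs;
        }
        return fetchTimeMs + intervalSec * 1000L;
    }

    public static void main(String[] args) {
        long stored = 1_000_000L; // made-up timestamp
        long interval = 60L;      // made-up interval in seconds
        System.out.println(estimateNextFetchMs(Stage.AFTER_FETCH, stored, interval));
        System.out.println(estimateNextFetchMs(Stage.AFTER_UPDATEDB, stored, interval));
    }
}
```

The estimate in the AFTER_FETCH case ignores any adjustment a schedule
algorithm might apply during the update, so it is only a first
approximation.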