Posted to user@nutch.apache.org by Mike Baranczak <mb...@gmail.com> on 2011/02/01 23:15:32 UTC

CrawlDatum.getFetchTime()

From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):

"Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time."

So is there any way to determine which of these two conditions is true, using just the information in CrawlDatum? My goal is to estimate the next time that the item will be fetched.

-MB

Re: CrawlDatum.getFetchTime()

Posted by Markus Jelsma <ma...@openindex.io>.

I should also mention that it is common to execute a series of jobs as a 
complete crawl cycle: generate, fetch, parse, update crawldb, update linkdb 
and index (and optionally deduplicate and/or clean the index once that's 
committed).

This means that after every complete crawl cycle, the timestamps all show 
the next fetch time. The next fetch time is derived from a default fetch 
interval, which is usually adjusted by an algorithm that can increase or 
decrease the interval depending on whether the page was modified since the 
previous fetch.
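
The interval adjustment described above can be sketched roughly as follows. 
This is only an illustration of the idea, not Nutch's actual implementation: 
the class name, method name, rates and bounds are all hypothetical. It 
shortens the interval when the page was modified and lengthens it when it 
was not, keeping the result within a minimum and a maximum.

```java
// Hypothetical sketch; none of these names come from Nutch.
final class AdaptiveIntervalSketch {

    /**
     * Returns a new fetch interval in seconds.
     * decRate and incRate are assumed fractions (e.g. 0.2 means 20%).
     */
    static long nextIntervalSeconds(long currentSeconds, boolean modified,
                                    double decRate, double incRate,
                                    long minSeconds, long maxSeconds) {
        // A modified page is revisited sooner, an unmodified page later.
        double next = modified
                ? currentSeconds * (1.0 - decRate)
                : currentSeconds * (1.0 + incRate);
        // Clamp so the interval never collapses to zero or grows without bound.
        return Math.max(minSeconds, Math.min(maxSeconds, Math.round(next)));
    }

    public static void main(String[] args) {
        long current = 1000L; // assumed starting interval in seconds
        System.out.println(nextIntervalSeconds(current, true, 0.2, 0.4, 60L, 10000L));
        System.out.println(nextIntervalSeconds(current, false, 0.2, 0.4, 60L, 10000L));
    }
}
```

The next fetch time would then be the fetch time plus this interval.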



> After the fetch job, it will contain the time it was fetched. If you run
> the updatedb job, it will contain the time it is due to be fetched.
> 
> It makes sense because newly discovered URLs are added to the CrawlDB and
> carry the time after which they are due to be fetched. Subsequent fetch and
> update jobs complete the cycle.
> 
> > From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):
> > 
> > "Returns either the time of the last fetch, or the next fetch time,
> > depending on whether Fetcher or CrawlDbReducer set the time."
> > 
> > So is there any way to determine which of these two conditions is true,
> > using just the information in CrawlDatum? My goal is to estimate the next
> > time that the item will be fetched.
> > 
> > -MB

Re: CrawlDatum.getFetchTime()

Posted by Markus Jelsma <ma...@openindex.io>.

After the fetch job, it will contain the time it was fetched. If you run the 
updatedb job, it will contain the time it is due to be fetched.

It makes sense because newly discovered URLs are added to the CrawlDB and 
carry the time after which they are due to be fetched. Subsequent fetch and 
update jobs complete the cycle.
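
To make the two meanings concrete, here is a minimal sketch of how you 
could estimate the next fetch time. It does not use the Nutch API: 
FetchTimeSketch, SetBy and estimateNextFetch are hypothetical names, and it 
assumes you already know which job wrote the record last and what the fetch 
interval is. In real Nutch the datum's status (db versus fetch statuses) may 
be one way to tell the two apart, but that is an assumption you should 
verify against the 1.1 source.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the relevant CrawlDatum fields; not Nutch's class.
final class FetchTimeSketch {

    // Which job wrote the timestamp last (assumed to be known by the caller).
    enum SetBy { FETCHER, UPDATEDB }

    static Instant estimateNextFetch(SetBy setBy, Instant fetchTime, Duration interval) {
        if (setBy == SetBy.UPDATEDB) {
            // After updatedb the stored time already is the due time.
            return fetchTime;
        }
        // After the fetch job only, the stored time is when the page was
        // fetched, so the estimate is that time plus the fetch interval.
        return fetchTime.plus(interval);
    }

    public static void main(String[] args) {
        Instant stored = Instant.parse("2011-02-01T00:00:00Z");
        Duration interval = Duration.ofDays(30); // assumed interval
        System.out.println(estimateNextFetch(SetBy.FETCHER, stored, interval));
        System.out.println(estimateNextFetch(SetBy.UPDATEDB, stored, interval));
    }
}
```

If you always run updatedb after fetch, only the second branch matters and 
the stored time can be read directly as the next fetch time.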

> From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):
> 
> "Returns either the time of the last fetch, or the next fetch time,
> depending on whether Fetcher or CrawlDbReducer set the time."
> 
> So is there any way to determine which of these two conditions is true,
> using just the information in CrawlDatum? My goal is to estimate the next
> time that the item will be fetched.
> 
> -MB