Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/08/03 01:08:53 UTC

[jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

    [ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517402 ] 

Andrzej Bialecki  commented on NUTCH-532:
-----------------------------------------

Float values were originally intended to express fractions of a day, when fetch interval was expressed in days, but after we changed the unit to seconds there is little purpose to it.

However, we need to be careful about the size of the data - long values are .. long ;), and for all operations that involve CrawlDatum this will have performance implications. Is it really useful to keep re-fetch interval in milliseconds? If we limit the resolution to a unit of seconds, as it is now, then I think an int value should be enough - which means that the sizeof(CrawlDatum) stays the same.
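
As a rough check of the range argument above, the following sketch computes how many whole days fit into an int when the interval is stored in seconds versus milliseconds. The class and method names are hypothetical and are not part of Nutch.

```java
// Hypothetical illustration; these names do not exist in Nutch.
final class IntervalRange {
    private static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Largest interval, in whole days, that fits in an int holding seconds. */
    static long maxDaysAsIntSeconds() {
        return Integer.MAX_VALUE / SECONDS_PER_DAY;
    }

    /** Largest interval, in whole days, that fits in an int holding milliseconds. */
    static long maxDaysAsIntMillis() {
        return Integer.MAX_VALUE / (SECONDS_PER_DAY * 1000);
    }

    public static void main(String[] args) {
        System.out.println(maxDaysAsIntSeconds()); // 24855 days, roughly 68 years
        System.out.println(maxDaysAsIntMillis());  // 24 days
    }
}
```

With second resolution an int covers about 68 years, so it should be enough for any realistic re-fetch interval; at millisecond resolution it would cover under 25 days.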

+1 on adding a getLastFetchTime, with a good javadoc that explains the formula and assumptions. Perhaps it should be called calculateLastFetchTime, to avoid misunderstandings, because in reality we don't keep that value. The method should be added to the FetchSchedule interface, and it should be implemented in AbstractFetchSchedule.
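
A minimal sketch of what such a method might compute, assuming the stored fetch time is the next scheduled fetch in epoch milliseconds and the interval is an int number of seconds. The class below is a hypothetical stand-in; the real method would take a CrawlDatum and live in FetchSchedule / AbstractFetchSchedule.

```java
// Hypothetical stand-in, not the actual Nutch implementation.
final class LastFetchTimeSketch {
    /**
     * Estimates when a page was last fetched.
     *
     * Assumptions: fetchTimeMillis is the NEXT scheduled fetch time
     * (epoch milliseconds) and fetchIntervalSeconds is the re-fetch
     * interval in whole seconds. The last fetch time itself is not
     * stored, so it is derived as next fetch time minus interval.
     */
    static long calculateLastFetchTime(long fetchTimeMillis, int fetchIntervalSeconds) {
        // Multiply by a long constant so the product is computed in long
        // arithmetic and cannot overflow int.
        return fetchTimeMillis - fetchIntervalSeconds * 1000L;
    }

    public static void main(String[] args) {
        int thirtyDays = 30 * 24 * 3600; // interval in seconds
        System.out.println(calculateLastFetchTime(1_000_000_000_000L, thirtyDays));
    }
}
```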

Re: datum.setFetchTime - IMHO it's a premature optimization; this expression is used just twice in the whole code base.


> CrawlDbMerger: wrong computation of last fetch time
> ---------------------------------------------------
>
>                 Key: NUTCH-532
>                 URL: https://issues.apache.org/jira/browse/NUTCH-532
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-532.patch
>
>
> CrawlDbMerger.reduce analyses the last fetch time of each record and keeps the more recent record.
> This comparison assumes a FetchInterval in days: resTime = res.getFetchTime() - Math.round(res.getFetchInterval() * 3600 * 24 * 1000);
> The problem was not really noticeable, as Math.round returns Integer.MAX_VALUE, i.e. about 25 days.
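
The saturation described in the issue can be reproduced in isolation. The sketch below assumes the interval is a float, as in the quoted expression; the class and method names are hypothetical. Math.round(float) returns an int and clamps any value at or above Integer.MAX_VALUE to Integer.MAX_VALUE.

```java
// Hypothetical demo class; only the arithmetic mirrors the quoted expression.
final class RoundSaturationDemo {
    /** Same arithmetic as the expression quoted in the issue description. */
    static int roundedOffsetMillis(float fetchInterval) {
        // float * int stays float, so the product does not wrap around,
        // but Math.round(float) clamps the result to the int range.
        return Math.round(fetchInterval * 3600 * 24 * 1000);
    }

    public static void main(String[] args) {
        // One day interpreted as days: fits in an int.
        System.out.println(roundedOffsetMillis(1f));
        // 30 days stored in seconds but treated as days: clamped.
        System.out.println(roundedOffsetMillis(2_592_000f) == Integer.MAX_VALUE);
    }
}
```

Integer.MAX_VALUE milliseconds is a little under 25 days, so every interval above that threshold produces the same offset.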
