Posted to user@nutch.apache.org by vivekvl <vi...@yahoo.com> on 2013/07/19 07:24:30 UTC

Issue in generating URLs for re-fetching once db.fetch.interval.max elapses

I found an issue in shouldFetch() of the AbstractFetchSchedule class (Nutch 2.1).

Even when (fetchTime - curTime > maxInterval * 1000L) holds, the method
returns false because of the following return statement:

"return fetchTime <= curTime" 

The local variable fetchTime is not reset within the if block, and hence the
method returns false. (The javadoc of this method says: "It will also check
that fetchTime is not too remote, in which case it lowers the interval and
returns true.")

Does this method need a fix (resetting fetchTime to curTime when interval
> maxInterval)? Is there any chance of such a fix leading to strange generate
and fetch behavior?



--
View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-generating-URLs-for-re-fetching-once-db-fetch-interval-max-elapses-tp4079039.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Issue in generating URLs for re-fetching once db.fetch.interval.max elapses

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

> I am expecting this method to return true when maxInterval elapses for a
> page, so that it could be included in the generate list.

Can you give an example where the fetchInterval gets larger than
maxInterval? Or a (next) fetch time more than maxInterval in the
future? If this happens, that's a bug.

Indeed, the 2.x code seems wrong compared to 1.x (or at least different).
The 1.x method ends like this:

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    ...
    if (datum.getFetchTime() > curTime) {
      return false; // not time yet
    }
    return true;
  }

Sebastian

On 07/21/2013 10:16 AM, vivekvl wrote:
> Hi Lewis,
> I am expecting this method to return true when maxInterval elapses for a
> page, so that it could be included in the generate list.
> 
> @Override
> public boolean shouldFetch(String url, WebPage page, long curTime) {
>   // pages are never truly GONE - we have to check them from time to time.
>   // pages with too long fetchInterval are adjusted so that they fit within
>   // maximum fetchInterval (segment retention period).
>   long fetchTime = page.getFetchTime();
>   if (fetchTime - curTime > maxInterval * 1000L) {
>     if (page.getFetchInterval() > maxInterval) {
>       page.setFetchInterval(Math.round(maxInterval * 0.9f));
>     }
>     page.setFetchTime(curTime);
>   }
>   return fetchTime <= curTime;
> }
> 
> 
> 


Re: Issue in generating URLs for re-fetching once db.fetch.interval.max elapses

Posted by vivekvl <vi...@yahoo.com>.
Hi Lewis,
I am expecting this method to return true when maxInterval elapses for a
page, so that it could be included in the generate list.

@Override
public boolean shouldFetch(String url, WebPage page, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum fetchInterval (segment retention period).
  long fetchTime = page.getFetchTime();
  if (fetchTime - curTime > maxInterval * 1000L) {
    if (page.getFetchInterval() > maxInterval) {
      page.setFetchInterval(Math.round(maxInterval * 0.9f));
    }
    page.setFetchTime(curTime);
  }
  return fetchTime <= curTime;
}
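
To make the behavior described above concrete, here is a minimal,
self-contained sketch. It is not the Nutch source: Page is a hypothetical
stand-in for WebPage, MAX_INTERVAL is an assumed value for
db.fetch.interval.max (90 days, in seconds), and shouldFetchReread is only
one possible way to return the reset value, similar in spirit to the 1.x
snippet that re-reads datum.getFetchTime() before returning.

```java
public class ShouldFetchSketch {

    /** Hypothetical stand-in for WebPage; only the fields used here. */
    static class Page {
        long fetchTime;     // next fetch time, epoch milliseconds
        int fetchInterval;  // re-fetch interval, seconds

        Page(long fetchTime, int fetchInterval) {
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    // Assumed value for db.fetch.interval.max: 90 days, in seconds.
    static final int MAX_INTERVAL = 90 * 24 * 3600;

    /** Same logic as the method posted above: compares the stale local. */
    static boolean shouldFetchAsPosted(Page page, long curTime) {
        long fetchTime = page.fetchTime;
        if (fetchTime - curTime > MAX_INTERVAL * 1000L) {
            if (page.fetchInterval > MAX_INTERVAL) {
                page.fetchInterval = Math.round(MAX_INTERVAL * 0.9f);
            }
            page.fetchTime = curTime;
        }
        // fetchTime still holds the value read before the reset.
        return fetchTime <= curTime;
    }

    /** Possible variant: re-read the fetch time after the reset. */
    static boolean shouldFetchReread(Page page, long curTime) {
        if (page.fetchTime - curTime > MAX_INTERVAL * 1000L) {
            if (page.fetchInterval > MAX_INTERVAL) {
                page.fetchInterval = Math.round(MAX_INTERVAL * 0.9f);
            }
            page.fetchTime = curTime;
        }
        return page.fetchTime <= curTime;
    }

    public static void main(String[] args) {
        long curTime = 1_374_000_000_000L;     // arbitrary "now" in ms
        long dayMs = 24L * 3600L * 1000L;

        // Next fetch is 100 days ahead, beyond the assumed 90-day maximum.
        Page a = new Page(curTime + 100 * dayMs, 100 * 24 * 3600);
        Page b = new Page(curTime + 100 * dayMs, 100 * 24 * 3600);

        System.out.println(shouldFetchAsPosted(a, curTime)); // false
        System.out.println(shouldFetchReread(b, curTime));   // true
    }
}
```

In both variants the page object itself is adjusted (fetch time reset to
curTime, interval lowered); they differ only in the value returned for the
current call.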




Re: Issue in generating URLs for re-fetching once db.fetch.interval.max elapses

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I am not looking at the code.
Can you explain what you're expecting to happen please?

On Thursday, July 18, 2013, vivekvl <vi...@yahoo.com> wrote:
> I found a issue in shouldFetch() of AbstractFetchSchedule class.  (Nutch 2.1)
>
> Here even when (fetchTime - curTime > maxInterval * 1000L), the method
> returns false with the below return statement.
>
> "return fetchTime <= curTime"
>
> The fetchTime is not reset within the if block, and hence the method returns
> false (The javadoc of this method says "It will also check that fetchTime is
> not too remote, in which case it lowers the interval and returns true.)
>
> Whether this method need fix? (resetting fetchTime to curTime when
> interval > maxinterval)  Any chance of this fix leading to strange generate
> and fetch behavior?
>
>
>

-- 
*Lewis*