Posted to user@nutch.apache.org by reinhard schwab <re...@aon.at> on 2009/12/02 00:30:40 UTC

crawl dates with fetch interval 0

i'm observing crawl datums which have a fetch interval of 0.
when i dump the segment, i see

Recno:: 33
URL::
http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Dec 01 23:41:15 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 1d63c4283a5e0c7b8eb8dee359adfabe
Metadata:

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Dec 01 23:38:48 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: null
Metadata:


this crawl datum is parsed/generated from a feed.
http://www.wachauclimbing.net/home/impressum-disclaimer/feed

when and where should the fetch interval be set?
when parsing or when updating the crawl db?

this is the code in ParseOutputFormat that i suspect generates the crawl datum:

    if (!parse.isCanonical()) {
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
      String timeString = parse.getData().getContentMeta().get(
          Nutch.FETCH_TIME_KEY);
      try {
        datum.setFetchTime(Long.parseLong(timeString));
      } catch (Exception e) {
        LOG.warn("Can't read fetch time for: " + key);
        datum.setFetchTime(System.currentTimeMillis());
      }
      crawlOut.append(key, datum);
    }
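if the interval were to be set at parse time instead, a minimal sketch of the idea could look like the following. this is only an illustration, not a tested patch: `Datum` is a stand-in for CrawlDatum with just the relevant fields, and `DEFAULT_INTERVAL` stands in for whatever db.fetch.interval.default resolves to.

```java
// stand-in for CrawlDatum with only the fields relevant here
class Datum {
    private int fetchInterval;   // seconds; stays 0 unless explicitly set
    private long fetchTime;      // epoch millis

    int getFetchInterval() { return fetchInterval; }
    void setFetchInterval(int seconds) { fetchInterval = seconds; }
    long getFetchTime() { return fetchTime; }
    void setFetchTime(long millis) { fetchTime = millis; }
}

public class ParseTimeIntervalSketch {
    // assumed default, analogous to db.fetch.interval.default (30 days)
    static final int DEFAULT_INTERVAL = 30 * 24 * 3600;

    static Datum newDatum(long fetchTime) {
        Datum datum = new Datum();
        datum.setFetchTime(fetchTime);
        // without this line the interval stays 0, and the url
        // is due for refetch immediately after being written
        datum.setFetchInterval(DEFAULT_INTERVAL);
        return datum;
    }

    public static void main(String[] args) {
        Datum d = newDatum(System.currentTimeMillis());
        System.out.println(d.getFetchInterval());
    }
}
```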

i assume the fetch interval should be set in CrawlDbReducer:

    // set the schedule
    result = schedule.setFetchSchedule((Text) key, result,
        prevFetchTime, prevModifiedTime, fetch.getFetchTime(),
        fetch.getModifiedTime(), modified);
    if (result.getFetchInterval() == 0) {
      LOG.warn("WARNING: FETCH INTERVAL is 0 for " + key);
    }

here i observe the 0.

i propose to check for a 0 fetch interval in DefaultFetchSchedule or in
AbstractFetchSchedule.

public class DefaultFetchSchedule extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
          long prevFetchTime, long prevModifiedTime,
          long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}
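the check could look roughly like the sketch below. this is only an illustration of the guard logic, not the real classes: `defaultInterval` stands in for the configured db.fetch.interval.default, and plain static methods replace the actual AbstractFetchSchedule/DefaultFetchSchedule call chain.

```java
public class FetchScheduleGuardSketch {
    // stand-in for the configured db.fetch.interval.default (30 days in seconds)
    static final int defaultInterval = 30 * 24 * 3600;

    // fall back to the default when the stored interval is 0 (or negative)
    static int sanitizeInterval(int storedInterval) {
        return storedInterval > 0 ? storedInterval : defaultInterval;
    }

    // next fetch time as DefaultFetchSchedule computes it:
    // fetchTime + interval * 1000, but with the interval sanitized first
    static long nextFetchTime(long fetchTime, int storedInterval) {
        return fetchTime + (long) sanitizeInterval(storedInterval) * 1000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // with the guard, an interval of 0 no longer yields nextFetchTime == now
        System.out.println(nextFetchTime(now, 0) - now);
    }
}
```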

regards
reinhard



Re: crawl dates with fetch interval 0

Posted by Andrzej Bialecki <ab...@getopt.org>.
reinhard schwab wrote:

> 
> this crawl datum will be fetched again and again with a 0-day retry
> interval.
> 
> i will open an issue in jira and attach a patch.

Thanks for catching this bug - please do so.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: crawl dates with fetch interval 0

Posted by reinhard schwab <re...@aon.at>.
i have tested this now with the current trunk of nutch.
Revision: 886112

the dump of the crawl db shows

http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 02 12:48:22 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0833334
Signature: db9ab2193924cd2d0b53113a500ca604
Metadata: _pst_: success(1), lastModified=0

http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
Version: 7
Status: 2 (db_fetched)
Fetch time: Sun Jan 31 12:44:52 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 1.0166667
Signature: c409d31ddf24f01b262c19ac2e301671
Metadata: _pst_: success(1), lastModified=0 _repr_:
http://www.wachauclimbing.net/home/impressum-disclaimer/feed

the other crawl datums have a 60-day retry interval.

this crawl datum will be fetched again and again with a 0-day retry
interval.
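the loop is easy to see from the due check the generator effectively makes (sketched here with plain longs; the real selection happens inside Generator):

```java
public class RefetchLoopSketch {
    // a url is due when its scheduled fetch time has passed
    static boolean due(long scheduledFetchTime, long now) {
        return scheduledFetchTime <= now;
    }

    public static void main(String[] args) {
        long lastFetch = 1259707275000L;              // roughly the fetch time in the dump above
        long zeroInterval = lastFetch + 0L * 1000;    // interval 0: schedule unchanged
        long sixtyDays = lastFetch + 5184000L * 1000; // interval 60 days

        long nextCycle = lastFetch + 3600_000L;       // one hour later
        System.out.println(due(zeroInterval, nextCycle)); // true: selected every cycle
        System.out.println(due(sixtyDays, nextCycle));    // false: waits 60 days
    }
}
```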

i will open an issue in jira and attach a patch.

regards
reinhard

