Posted to user@nutch.apache.org by reinhard schwab <re...@aon.at> on 2009/12/02 00:30:40 UTC
crawl dates with fetch interval 0
I'm observing crawl datums which have a fetch interval of 0.
When I dump the segment, I see:
Recno:: 33
URL::
http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Dec 01 23:41:15 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 1d63c4283a5e0c7b8eb8dee359adfabe
Metadata:
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Dec 01 23:38:48 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: null
Metadata:
This crawl datum is parsed/generated from a feed:
http://www.wachauclimbing.net/home/impressum-disclaimer/feed
When and where should the fetch interval be set?
When parsing, or when updating the crawl db?
This is the code in ParseOutputFormat that I suspect generates the crawl datum:
if (!parse.isCanonical()) {
  CrawlDatum datum = new CrawlDatum();
  datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
  String timeString =
      parse.getData().getContentMeta().get(Nutch.FETCH_TIME_KEY);
  try {
    datum.setFetchTime(Long.parseLong(timeString));
  } catch (Exception e) {
    LOG.warn("Can't read fetch time for: " + key);
    datum.setFetchTime(System.currentTimeMillis());
  }
  // note: the fetch interval is never set here, so it stays at 0
  crawlOut.append(key, datum);
}
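To illustrate the gap, here is a minimal standalone sketch (these are not the real Nutch classes; `FeedDatumSketch`, `chooseInterval`, and the 30-day constant are my own illustration, assuming the stock `db.fetch.interval.default` of 30 days): a freshly created datum should fall back to the configured default interval instead of keeping 0.

```java
// Standalone model of the missing step (hypothetical names, not Nutch code):
// a newly created datum with no known interval should get the configured
// default (db.fetch.interval.default, 30 days in a stock config), not 0.
public class FeedDatumSketch {
    // assumed default: 30 days expressed in seconds
    static final int DEFAULT_INTERVAL_SECS = 30 * 24 * 60 * 60;

    static int chooseInterval(int parsedInterval) {
        // fall back to the default when no positive interval is known
        return parsedInterval > 0 ? parsedInterval : DEFAULT_INTERVAL_SECS;
    }

    public static void main(String[] args) {
        System.out.println(chooseInterval(0));        // falls back to 2592000
        System.out.println(chooseInterval(5184000));  // keeps 5184000
    }
}
```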
I assume the fetch interval should be set in CrawlDbReducer:

// set the schedule
result = schedule.setFetchSchedule((Text) key, result,
    prevFetchTime, prevModifiedTime, fetch.getFetchTime(),
    fetch.getModifiedTime(), modified);
// my own debug check:
if (result.getFetchInterval() == 0) {
  LOG.warn("WARNING: FETCH INTERVAL is 0 for " + key);
}
Here I observe the 0.
I propose checking for a 0 fetch interval in DefaultFetchSchedule or in
AbstractFetchSchedule:
public class DefaultFetchSchedule extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime,
      long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime,
        prevModifiedTime, fetchTime, modifiedTime, state);
    datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}
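One possible shape of the proposed guard, as a standalone sketch (`ScheduleGuardSketch` and `nextFetchTime` are hypothetical names, not the actual patch): clamp a non-positive interval to the schedule's default before computing the next fetch time, so that `fetchTime + interval * 1000` actually moves forward.

```java
// Sketch of the proposed guard (assumed shape, not the actual patch):
// with an interval of 0, the next fetch time equals the current fetch
// time, so the URL is eligible again in every generate cycle.
public class ScheduleGuardSketch {
    static long nextFetchTime(long fetchTimeMs, int intervalSecs,
                              int defaultIntervalSecs) {
        if (intervalSecs <= 0) {
            intervalSecs = defaultIntervalSecs; // guard against the 0-interval datum
        }
        // cast to long first: intervalSecs * 1000 could overflow an int
        return fetchTimeMs + (long) intervalSecs * 1000L;
    }

    public static void main(String[] args) {
        long now = 1259706075000L; // roughly the Dec 01 2009 timestamps above
        // without the guard, an interval of 0 would leave next == now
        System.out.println(nextFetchTime(now, 0, 5184000) - now); // 5184000000
    }
}
```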
regards
reinhard
Re: crawl dates with fetch interval 0
Posted by Andrzej Bialecki <ab...@getopt.org>.
reinhard schwab wrote:
>
> this crawl date will be fetched and fetched again with 0 days retry
> interval.
>
> i will open an issue in jira and attach a patch.
Thanks for catching this bug - please do so.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: crawl dates with fetch interval 0
Posted by reinhard schwab <re...@aon.at>.
I have tested this now with the current trunk of Nutch.
Revision: 886112
The dump of the crawl db shows:
http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 02 12:48:22 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0833334
Signature: db9ab2193924cd2d0b53113a500ca604
Metadata: _pst_: success(1), lastModified=0
http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
Version: 7
Status: 2 (db_fetched)
Fetch time: Sun Jan 31 12:44:52 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 1.0166667
Signature: c409d31ddf24f01b262c19ac2e301671
Metadata: _pst_: success(1), lastModified=0
_repr_: http://www.wachauclimbing.net/home/impressum-disclaimer/feed
The other crawl datums have a 60-day retry interval.
This crawl datum will be fetched again and again with a 0-day retry
interval.
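A quick arithmetic check against the dump above (`IntervalCheck` is just my own throwaway class): 5184000 seconds is exactly 60 days, which is why the healthy feed datum's next fetch lands about two months out, while the 0-interval datum is due again immediately.

```java
// Verify the intervals seen in the crawldb dump: 5184000 s == 60 days,
// while an interval of 0 leaves the next fetch time unchanged.
public class IntervalCheck {
    public static void main(String[] args) {
        System.out.println(5184000L / 86400L); // 60 (days)
        long fetchTimeMs = 1259754502000L;     // ~Dec 02 2009, as in the dump
        // with interval 0, the "next" fetch time is the same instant
        System.out.println(fetchTimeMs + 0L * 1000L == fetchTimeMs); // true
    }
}
```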
I will open an issue in Jira and attach a patch.
regards
reinhard