Posted to user@nutch.apache.org by al...@aim.com on 2012/09/17 20:57:51 UTC

updatedb in nutch-2.0 increases fetch time of all pages

Hello,

updatedb in nutch-2.0 increases the fetch time of all pages, whether or not they have already been fetched.
For example, if updatedb is run at depth 1 and page A has been fetched with a fetchTime 30 days from now, then running updatedb at depth 2 pushes page A's fetch time to 60 days from now, and so on.
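The accumulation described above can be illustrated with a small sketch (plain Python, not Nutch code; the 30-day interval and the page record layout are this example's assumptions):

```python
from datetime import datetime, timedelta

FETCH_INTERVAL = timedelta(days=30)  # example interval from the report

def buggy_updatedb(pages):
    # The reported behavior: the interval is added to every page's
    # fetchTime on each updatedb run, whether or not the page was
    # actually fetched in that round.
    for page in pages.values():
        page["fetchTime"] += FETCH_INTERVAL

now = datetime(2012, 9, 17)
# Page A fetched at depth 1: due again in 30 days.
pages = {"A": {"fetchTime": now + FETCH_INTERVAL}}

buggy_updatedb(pages)  # updatedb at depth 2
print((pages["A"]["fetchTime"] - now).days)  # 60: pushed out another interval
```

Each further depth adds another 30 days, so an unfetched page's due date recedes indefinitely.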

Also, I wondered if it is possible to use updatedb to remove pages that do not pass the filters from the HBase datastore?

Thanks.
Alex.

Re: updatedb in nutch-2.0 increases fetch time of all pages

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

The increasing fetchtime is indeed a bug. There is already an issue for it:
https://issues.apache.org/jira/browse/NUTCH-1457

About removing urls, I'm not sure what the best solution is. It is
difficult to handle changes to normalizing/filtering rules over time. For
now it is best not to change the rules in an existing crawl; otherwise you
have to run a custom delete tool or something like that.
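Such a custom delete pass could look roughly like this (an illustrative Python sketch, not Nutch code: a dict stands in for the HBase-backed store, and the exclude patterns are made-up examples of filter rules; a real tool would iterate the store and delete rows through Nutch's storage layer):

```python
import re

# Hypothetical current filter rules: URLs matching any pattern are excluded.
EXCLUDE_PATTERNS = [re.compile(r"\.jpg$"), re.compile(r"^ftp://")]

def passes_filters(url):
    return not any(p.search(url) for p in EXCLUDE_PATTERNS)

def delete_filtered(store):
    # Re-apply the current rules to every stored key and delete rows
    # that no longer pass. Snapshot the keys before mutating the store.
    for url in [u for u in store if not passes_filters(u)]:
        del store[url]

store = {
    "http://example.com/page.html": {},
    "http://example.com/image.jpg": {},
    "ftp://example.com/file.txt": {},
}
delete_filtered(store)
print(sorted(store))  # only the URL that passes the filters remains
```

The point is simply that the filters are re-evaluated against what is already stored, which is exactly the step updatedb does not do for you.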

Ferdy.
