Posted to user@nutch.apache.org by al...@aim.com on 2012/09/17 20:57:51 UTC
updatedb in nutch-2.0 increases fetch time of all pages
Hello,
updatedb in nutch-2.0 increases the fetch time of all pages, regardless of whether they have already been fetched.
For example, if updatedb is run at depth 1 and page A is fetched with a fetchTime 30 days from now, then running updatedb again at depth 2 pushes page A's fetch time to 60 days from now, and so on.
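The compounding behaviour described above can be sketched like this (illustrative Python only, not Nutch code; the function names and the 30-day interval are assumptions made for the example):

```python
from datetime import datetime, timedelta

# Hypothetical model of the reported bug: every updatedb pass advances
# fetchTime for every known page, whether or not it was just fetched.
FETCH_INTERVAL = timedelta(days=30)

def buggy_updatedb(pages):
    # Bug: all pages are rescheduled on every pass.
    for page in pages.values():
        page["fetch_time"] += FETCH_INTERVAL

def expected_updatedb(pages, fetched_this_round):
    # Expected: only pages actually fetched this round are rescheduled.
    for url in fetched_this_round:
        pages[url]["fetch_time"] += FETCH_INTERVAL

now = datetime(2012, 9, 17)
pages = {"A": {"fetch_time": now}}
buggy_updatedb(pages)  # depth 1: A was fetched, +30 days is correct
buggy_updatedb(pages)  # depth 2: A untouched, yet +30 more days
print((pages["A"]["fetch_time"] - now).days)  # 60 instead of 30
```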
Also, I wondered whether updatedb can be used to remove pages that do not pass the filters from the HBase datastore?
Thanks.
Alex.
Re: updatedb in nutch-2.0 increases fetch time of all pages
Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,
The increasing fetch time is indeed a bug. There is already an issue for it:
https://issues.apache.org/jira/browse/NUTCH-1457
About removing URLs, I'm not sure what the best solution is. It is
difficult to handle changes to normalizing/filtering rules over time. For
now it is best not to change the rules in an existing crawl; otherwise you
have to run a custom delete tool or something like that.
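The "custom delete tool" idea above could look roughly like this: re-apply the current filter rules to every stored URL and delete the rejects. This is an illustrative Python sketch, not Nutch code; `passes_filters`, `purge_rejected`, the regex rule, and the dict standing in for the HBase-backed store are all assumptions made for the example:

```python
import re

def passes_filters(url):
    # Hypothetical stand-in for Nutch's URLFilters chain; here, a single
    # rule rejecting image URLs, similar to a regex-urlfilter entry.
    return not re.search(r"\.(jpg|gif)$", url)

def purge_rejected(store):
    # Delete every stored URL that the *current* rules reject, and
    # return the deleted URLs for logging.
    rejected = [url for url in store if not passes_filters(url)]
    for url in rejected:
        del store[url]
    return rejected

store = {
    "http://example.com/a.html": {},
    "http://example.com/b.jpg": {},
}
removed = purge_rejected(store)
print(removed)  # the URLs that no longer pass the filters
```

A real tool would iterate the webpage table through the datastore API and issue deletes instead of mutating a dict, but the control flow is the same.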
Ferdy.
On Mon, Sep 17, 2012 at 8:57 PM, <al...@aim.com> wrote:
> Hello,
>
> updatedb in nutch-2.0 increases the fetch time of all pages, regardless of
> whether they have already been fetched.
> For example, if updatedb is run at depth 1 and page A is fetched with a
> fetchTime 30 days from now, then running updatedb again at depth 2 pushes
> page A's fetch time to 60 days from now, and so on.
>
> Also, I wondered whether updatedb can be used to remove pages that do not
> pass the filters from the HBase datastore?
>
> Thanks.
> Alex.
>