Posted to user@nutch.apache.org by al...@aim.com on 2012/08/17 21:42:53 UTC

updatedb goes over all urls in nutch-2.0

Hi,

I noticed that the updatedb command goes over all URLs, even those that were already updated in previous generate, fetch, updatedb cycles.
As a result, updatedb takes a long time, depending on the number of rows in the datastore.
I thought this might be redundant and that updatedb should be restricted to only the URLs that have not been updated yet.

Thanks.
Alex.

Re: updatedb goes over all urls in nutch-2.0

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

The full pass is needed for scoring and inlink calculation. There are some
tricks to make it faster, though, such as not clearing the previous inlinks
map before writing the new one and not deleting any markers (because that is
slow in HBase). For now, you have to modify the code yourself to do that.
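
As a rough illustration of why inlink and score calculation depends on rows
other than the one being updated, here is a toy sketch in Python. It is not
Nutch code: the function names, the example URLs, and the equal-share scoring
rule are all assumptions made for the example. The point is only that a row's
inlinks and incoming score are derived from the outlinks of other rows.

```python
from collections import defaultdict


def invert_outlinks(outlinks):
    """Build an inlinks map (target -> set of sources) from an outlinks map."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].add(source)
    return dict(inlinks)


def distribute_scores(outlinks, scores):
    """Toy rule (an assumption): each source splits its score equally
    among its outlink targets."""
    incoming = defaultdict(float)
    for source, targets in outlinks.items():
        if not targets:
            continue  # a row without outlinks contributes nothing
        share = scores.get(source, 0.0) / len(targets)
        for target in targets:
            incoming[target] += share
    return dict(incoming)


# Hypothetical rows: URL -> outlinks found on that page.
outlinks = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}
scores = {"a.example/": 1.0, "b.example/": 1.0, "c.example/": 1.0}

inlinks = invert_outlinks(outlinks)
incoming = distribute_scores(outlinks, scores)
```

In this sketch, c.example/ has no outlinks of its own, yet its inlinks and
incoming score are determined entirely by the other two rows.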

Ferdy.

On Fri, Aug 17, 2012 at 9:42 PM, <al...@aim.com> wrote:

> Hi,
>
> I noticed that the updatedb command goes over all URLs, even those that
> were already updated in previous generate, fetch, updatedb cycles.
> As a result, updatedb takes a long time, depending on the number of rows
> in the datastore.
> I thought this might be redundant and that updatedb should be restricted
> to only the URLs that have not been updated yet.
>
> Thanks.
> Alex.
>