Posted to user@nutch.apache.org by Emmanuel <jo...@gmail.com> on 2007/08/02 14:14:15 UTC

Outlinks normalizer

I've got a simple question: why do we normalize every single outlink in the
constructor of the object? It involves the creation of many URLNormalizer
objects.

We could just add the normalizer in ParseOutputFormat, just before the filter,
which would limit the number of instantiations.
Don't you think? Or did I miss something?

Re: Outlinks normalizer

Posted by Doğacan Güney <do...@gmail.com>.
On 8/2/07, Emmanuel <jo...@gmail.com> wrote:
> I've got a simple question: why do we normalize every single outlink in the
> constructor of the object? It involves the creation of many URLNormalizer
> objects.
>
> We could just add the normalizer in ParseOutputFormat, just before the filter,
> which would limit the number of instantiations.
> Don't you think? Or did I miss something?
>


I am not sure, but I think the idea is to make the Outlink class useful
outside of ParseOutputFormat (so that if you use Outlink without
ParseOutputFormat, you still end up with a normalized URL).

However, this minor advantage is hugely offset by the fact that we are
recreating URLNormalizers for every outlink (and, if you have an
ordering on your normalizers, re-ordering them *every* *single* time),
so overall, moving normalization into ParseOutputFormat seems like a good
idea to me. While we are doing that, perhaps we can also stop creating
a ParseUtil instance for every ParseSegment.map call, even though that has
a smaller overhead.
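
To make the trade-off concrete, here is a rough sketch of the two
placements. It does not use the real Nutch classes: NormalizerChain,
EagerLink, PlainLink and LinkWriter are hypothetical stand-ins for
URLNormalizers, the current Outlink, a non-normalizing Outlink and
ParseOutputFormat, and the two normalization steps are only placeholders.
The counter just records how often the chain gets built.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Demo entry point: compares how many chains each placement builds.
class NormalizerPlacementDemo {
    public static void main(String[] args) {
        List<String> urls = List.of(
            " HTTP://Example.COM/a ", "http://example.com/b", "http://example.com/c");

        // Placement A: every link builds its own chain in the constructor.
        NormalizerChain.instancesCreated = 0;
        for (String url : urls) {
            new EagerLink(url);
        }
        System.out.println("chains built in constructors: " + NormalizerChain.instancesCreated);

        // Placement B: the writer builds one chain and reuses it for all links.
        NormalizerChain.instancesCreated = 0;
        List<PlainLink> links = new ArrayList<>();
        for (String url : urls) {
            links.add(new PlainLink(url));
        }
        new LinkWriter().write(links);
        System.out.println("chains built in writer: " + NormalizerChain.instancesCreated);
    }
}

// Hypothetical stand-in for a normalizer chain (NOT the Nutch URLNormalizers API).
final class NormalizerChain {
    static int instancesCreated = 0; // how many times a chain has been built

    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    NormalizerChain() {
        instancesCreated++;
        // In a plugin-based system, loading and ordering would happen here.
        steps.add(String::trim);
        steps.add(NormalizerChain::lowerCaseSchemeAndHost);
    }

    String normalize(String url) {
        String result = url;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    // Crude placeholder step: lower-cases everything before the path.
    private static String lowerCaseSchemeAndHost(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url;
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            pathStart = url.length();
        }
        return url.substring(0, pathStart).toLowerCase(Locale.ROOT) + url.substring(pathStart);
    }
}

// Placement A: the link normalizes itself, building a new chain each time.
final class EagerLink {
    final String toUrl;

    EagerLink(String toUrl) {
        this.toUrl = new NormalizerChain().normalize(toUrl);
    }
}

// Placement B: the link is a plain holder and does no normalization.
final class PlainLink {
    final String toUrl;

    PlainLink(String toUrl) {
        this.toUrl = toUrl;
    }
}

// Placement B: the writer owns a single chain and applies it before output.
final class LinkWriter {
    private final NormalizerChain chain = new NormalizerChain();

    List<String> write(List<PlainLink> links) {
        List<String> normalized = new ArrayList<>();
        for (PlainLink link : links) {
            normalized.add(chain.normalize(link.toUrl));
        }
        return normalized;
    }
}
```

In this sketch the constructor variant builds one chain per link, while
the writer variant builds one chain in total no matter how many links it
handles. The price is the one mentioned above: a PlainLink used outside
the writer is not normalized.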

-- 
Doğacan Güney