Posted to user@nutch.apache.org by Jesse Hires <jh...@gmail.com> on 2009/12/10 19:59:51 UTC
domain vs www.domain?
I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically, I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting? If
not, what would you recommend doing to prevent this?
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
Re: domain vs www.domain?
Posted by Jesse Hires <jh...@gmail.com>.
For the specific case I was running into (on a single known domain) using
regex-urlnormalizer did the trick. Thanks!
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2009-12-10 19:59, Jesse Hires wrote:
>
>> I'm seeing a lot of duplicates where a single site is getting recognized as
>> two different sites. Specifically, I am seeing www.domain.com and
>> domain.com being recognized as two different sites.
>>
>> I imagine there is a setting to prevent this. If so, what is the setting? If
>> not, what would you recommend doing to prevent this?
>>
>
> This is a surprisingly difficult problem to solve in the general case, because
> it's not always true that 'www.domain' equals 'domain'. If you do know this
> is true in your particular case, you can add a rule to regex-urlnormalizer
> that rewrites the matching URLs, e.g. to always drop the 'www.' part.
>
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
Re: domain vs www.domain?
Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2009-12-10 19:59, Jesse Hires wrote:
> I'm seeing a lot of duplicates where a single site is getting recognized as
> two different sites. Specifically, I am seeing www.domain.com and
> domain.com being recognized as two different sites.
> I imagine there is a setting to prevent this. If so, what is the setting? If
> not, what would you recommend doing to prevent this?
This is a surprisingly difficult problem to solve in the general case,
because it's not always true that 'www.domain' equals 'domain'. If you
do know this is true in your particular case, you can add a rule to
regex-urlnormalizer that rewrites the matching URLs, e.g. to always drop
the 'www.' part.
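
As a rough illustration of such a rule, here is a minimal Java sketch of the
rewrite for one known host. The class name, method name and the host
domain.com are placeholders, not anything from Nutch itself. In a Nutch
install the same pattern and substitution would typically go into a
pattern/substitution pair in conf/regex-normalize.xml rather than into code,
so treat this only as a way to try the regex out before adding it there.

```java
import java.util.regex.Pattern;

public class WwwNormalizerSketch {
    // Hypothetical rule for a single known site (domain.com is a placeholder).
    // Group 1 keeps the scheme, group 2 keeps the bare host. The lookahead
    // requires the host to end here, so a host such as
    // www.domain.com.example.org is left alone.
    private static final Pattern WWW_PREFIX =
            Pattern.compile("^(https?://)www\\.(domain\\.com)(?=[:/?#]|$)");

    // Substitution string: scheme followed by the host without 'www.'.
    private static final String SUBSTITUTION = "$1$2";

    // Returns the URL with the leading 'www.' dropped when the rule matches,
    // otherwise returns the URL unchanged.
    public static String normalize(String url) {
        return WWW_PREFIX.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.domain.com/page.html"));
        System.out.println(normalize("http://domain.com/page.html"));
        System.out.println(normalize("http://www.other.org/"));
    }
}
```

The rule is deliberately tied to one host: a blanket "strip every www."
rule would also rewrite sites where the two names really do serve different
content.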
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com