Posted to user@nutch.apache.org by Jesse Hires <jh...@gmail.com> on 2009/12/10 19:59:51 UTC

domain vs www.domain?

I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically, I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting? If
not, what would you recommend doing to prevent this?


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com

Re: domain vs www.domain?

Posted by Jesse Hires <jh...@gmail.com>.
For the specific case I was running into (on a single known domain) using
regex-urlnormalizer did the trick. Thanks!



Jesse




On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2009-12-10 19:59, Jesse Hires wrote:
>
>> I'm seeing a lot of duplicates where a single site is getting recognized
>> as two different sites. Specifically, I am seeing www.domain.com and
>> domain.com being recognized as two different sites.
>>
>> I imagine there is a setting to prevent this. If so, what is the setting?
>> If not, what would you recommend doing to prevent this?
>>
>
> This is a surprisingly difficult problem to solve in the general case, because
> it's not always true that 'www.domain' equals 'domain'. If you do know this
> is true in your particular case, you can add a rule to regex-urlnormalizer
> that rewrites the matching URLs, e.g. to always drop the 'www.' part.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: domain vs www.domain?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2009-12-10 19:59, Jesse Hires wrote:
> I'm seeing a lot of duplicates where a single site is getting recognized as
> two different sites. Specifically, I am seeing www.domain.com and
> domain.com being recognized as two different sites.
> I imagine there is a setting to prevent this. If so, what is the setting? If
> not, what would you recommend doing to prevent this?

This is a surprisingly difficult problem to solve in the general case,
because it's not always true that 'www.domain' equals 'domain'. If you
do know this is true in your particular case, you can add a rule to
regex-urlnormalizer that rewrites the matching URLs, e.g. to always drop
the 'www.' part.
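
To make the idea concrete, here is a rough sketch of the kind of rewrite
such a rule would perform, written as standalone Java so the regex can be
tried outside Nutch. The class name WwwNormalizer, the method name
normalize, and the host "domain.com" are placeholders made up for this
illustration; they are not part of Nutch. The pattern is deliberately
limited to one known host, since stripping 'www.' everywhere is not safe
in general.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: shows the pattern/substitution
// pair that a normalizer rule for a single known host might use.
final class WwwNormalizer {
    // Group 1 keeps the scheme, group 2 keeps the bare host.
    // The lookahead makes sure the host really ends here, so that
    // "www.domain.com.example.net" is left alone.
    private static final Pattern WWW_PREFIX = Pattern.compile(
        "^(https?://)www\\.(domain\\.com)(?=[:/?#]|$)",
        Pattern.CASE_INSENSITIVE);

    // Returns the URL with the leading "www." dropped for the known host;
    // any other URL is returned unchanged.
    static String normalize(String url) {
        return WWW_PREFIX.matcher(url).replaceFirst("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.domain.com/page.html"));
        System.out.println(normalize("http://domain.com/page.html"));
        System.out.println(normalize("http://www.other.org/"));
    }
}
```

In Nutch itself the same pattern and substitution would presumably go
into a rule in the regex normalizer's rules file rather than into code;
check the file shipped with your version for the exact element names.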


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com