Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/11/27 15:46:24 UTC

Handling duplicate sub domains

Hi,

How do you handle the issue of duplicate (sub) domains? We see a 
significant number of duplicate pages across sub domains. Badly configured 
websites, for example, do not enforce a single sub domain and accept any 
host name. With regex normalizers we can easily tackle a portion of the 
problem by normalizing www derivatives such as ww., wwww. or www.w.ww.www. 
to www. This still leaves a huge number of incorrect sub domains, leading 
to duplicates of _entire_ websites.
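
To illustrate, the kind of rule in question as a minimal Python sketch (in 
Nutch the equivalent pattern would go into regex-normalize.xml for the 
urlnormalizer-regex plugin; the exact pattern below is an assumption, not 
our production rule):

    import re

    # Collapse any leading run of www-like labels (ww., wwww.,
    # www.w.ww.www., ...) into a single "www." prefix.
    # Hypothetical pattern, for illustration only.
    WWW_RUN = re.compile(r'^(https?://)(?:w{1,4}\.)+', re.IGNORECASE)

    def normalize_www(url):
        return WWW_RUN.sub(r'\1www.', url)

    # normalize_www("http://www.w.ww.www.example.org/")
    #   -> "http://www.example.org/"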

We've built analysis jobs to detect and list duplicate pages within sub 
domains (they also work across domains), which we can then reduce with 
another job to a list of bad sub domains. Yet one sub domain for each 
domain must be kept, and I've still to figure out which sub domain should 
prevail.
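
The core of the detection step, condensed into a Python sketch (our real 
jobs are MapReduce; the names here are made up for illustration, and the 
content signature is the one Nutch already computes per page):

    from collections import defaultdict

    def duplicate_page_counts(pages):
        # pages: iterable of (host, content_signature) pairs, e.g. the
        # per-page signatures Nutch stores in the CrawlDb.
        hosts_by_sig = defaultdict(set)
        for host, sig in pages:
            hosts_by_sig[sig].add(host)

        # Count, per host, the pages whose signature also occurs on at
        # least one other host.
        dupes = defaultdict(int)
        for hosts in hosts_by_sig.values():
            if len(hosts) > 1:
                for host in hosts:
                    dupes[host] += 1
        return dupes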

Here's an example of one such site:
113425:188	example.org
114314:186	startpagina.example.org
114334:186	mobile.example.org
114339:186	massages.example.org
114340:186	massage.example.org
114362:186	http.www.example.org
114446:185	www.example.org
115280:184	m.example.org
115316:184	forum.example.org

In this case it may be simple to select www as the sub domain we want to 
keep, but it is not always so trivial.
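
One conceivable tie-breaker, sketched below (nothing settled; the ranking 
is an assumption on my part, shown only for illustration): prefer www, 
then the bare domain, then simply the sub domain with the most pages:

    def pick_survivor(hosts, page_counts, domain):
        # hosts: the duplicate sub domains of one domain.
        # page_counts: host -> number of fetched pages.
        def rank(host):
            return (host != 'www.' + domain,   # False sorts before True
                    host != domain,
                    -page_counts.get(host, 0))
        return min(hosts, key=rank)

    # pick_survivor(['m.example.org', 'www.example.org'],
    #               {'m.example.org': 184, 'www.example.org': 185},
    #               'example.org')  ->  'www.example.org'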

Can anyone share some insights on handling the edge cases that make up 
the bulk of the duplicates?

Thanks,
markus

Re: Handling duplicate sub domains

Posted by Markus Jelsma <ma...@openindex.io>.
Another analysis job just finished, and it indicates how serious this 
problem is in internet crawling: 10.01% of all domains we've visited so 
far contain one or more duplicates of themselves.

We're glad we select and limit the generator by domain name; this limits 
the problem of wasted resources, but it also prevents crawling of 
legitimate sub domains on sites such as Wikipedia. When limiting on 
domain, such a massive website will never be indexed in full, but 
limiting on host would mean we quickly download _all_ duplicates of a bad 
site.

We intend to filter these legitimate sites out by calculating the ratio 
of duplicate sub domains per domain and applying some threshold that is 
yet to be determined.
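
Something along these lines, as a sketch (the function name and the 50% 
default are placeholders; the actual threshold is exactly what is still 
open):

    def is_duplicate_farm(dup_subdomains, all_subdomains, threshold=0.5):
        # Flag a domain when the fraction of its sub domains that
        # duplicate another sub domain exceeds the threshold.
        if not all_subdomains:
            return False
        return len(dup_subdomains) / float(len(all_subdomains)) > threshold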




Re: Handling duplicate sub domains

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Mathijs,

I've already implemented several of your clues and get good results. The final 
problem is deciding when an entire sub domain is to be filtered out based on 
just a couple of duplicates. I've seen similar metrics for both good and bad 
sites.

I haven't yet decided on how to prevent false positives for these edge cases.
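
One direction I'm considering, sketched below (both cut-offs are 
placeholder assumptions, nothing is settled): only act on a sub domain's 
duplicate ratio once we have seen enough of its pages to judge it:

    def should_filter(duplicate_pages, total_pages,
                      min_pages=50, min_ratio=0.9):
        # Too few pages: not enough evidence either way.
        if total_pages < min_pages:
            return False
        # Nearly everything duplicates another host: filter it out.
        return duplicate_pages / float(total_pages) >= min_ratio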

Thanks
Markus

On Sunday 27 November 2011 20:12:48 Mathijs Homminga wrote:
> Hi Markus,
> 
> What is your definition of duplicate (sub) domains?
> 
> Reading your examples, I think you are looking for domains (or host
> IPs) that are interchangeable. That is, domains that give an identical
> response when combined with the same protocol, port, path and query (a
> URL).
> 
> You could indeed use heuristics (like normalizing wwww. to www.).
> 
> I guess that most of the time this happens when the domain has set a
> wildcard DNS record (catch-all). There is no guarantee, however, that
> wildcard domains act 'identically'. Although (sub) domains may point to
> the same canonical name or IP address, they may still give different
> responses because of domain/url-based dispatching on that host (think
> virtual hosts in Apache) or application-level logic. I guess this is
> why you can never be 100% sure that the domains are duplicates...
> 
> Clues I can think of (none of them are hard guarantees):
> 
> - Your heuristics using common patterns.
> - Do a DNS lookup of the domains... does it point to another domain or
>   an IP address which is shared among other domains?
> - Did we find duplicate URLs on different hosts?
> 	- Quick: if there are a lot of identical urls (paths+query of
> 	  substantial length) on different subdomains, then the domains
> 	  might be identical.
> 	- You might want to include a content check in the above.
> - Actively check a fingerprint of the main page of each subdomain (e.g.
>   title + some headers) and group domains based on this.
> 
> I'm currently working on the Host table (in nutchgora) and would like
> to include some of this in there too.
> 
> Mathijs

-- 
Markus Jelsma - CTO - Openindex

Re: Handling duplicate sub domains

Posted by Mathijs Homminga <ma...@kalooga.com>.
Hi Markus,

What is your definition of duplicate (sub) domains?

Reading your examples, I think you are looking for domains (or host IPs) that are interchangeable. 
That is, domains that give an identical response when combined with the same protocol, port, path and query (a URL).

You could indeed use heuristics (like normalizing wwww. to www.).

I guess that most of the time this happens when the domain has set a wildcard DNS record (catch-all).
There is no guarantee, however, that wildcard domains act 'identically'. Although (sub) domains may point to the same canonical name or IP address, they may still give different responses because of domain/url-based dispatching on that host (think virtual hosts in Apache) or application-level logic.
I guess this is why you can never be 100% sure that the domains are duplicates...

Clues I can think of (none of them are hard guarantees):

- Your heuristics using common patterns.
- Do a DNS lookup of the domains... does it point to another domain or an IP address which is shared among other domains?
- Did we find duplicate URLs on different hosts?
	- Quick: if there are a lot of identical urls (paths+query of substantial length) on different subdomains, then the domains might be identical. 
	- You might want to include a content check in the above.
- Actively check a fingerprint of the main page of each subdomain (e.g. title + some headers) and group domains based on this; a sketch follows below.
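
A rough sketch of that last clue (Python for brevity; which headers to 
hash, and the fingerprint itself, are assumptions rather than a 
recommendation):

    import hashlib
    import urllib.request
    from collections import defaultdict

    def fingerprint(host, timeout=10):
        # Fetch the main page and hash its title plus a few headers.
        resp = urllib.request.urlopen('http://%s/' % host, timeout=timeout)
        body = resp.read(65536).decode('utf-8', 'replace')
        title = ''
        if '<title>' in body:
            title = body.split('<title>', 1)[1].split('</title>', 1)[0]
        parts = (resp.headers.get('Server', ''),
                 resp.headers.get('Last-Modified', ''),
                 title)
        return hashlib.md5('|'.join(parts).encode('utf-8')).hexdigest()

    def group_by_fingerprint(hosts):
        groups = defaultdict(list)
        for host in hosts:
            groups[fingerprint(host)].append(host)
        return groups  # hosts sharing a fingerprint are duplicate candidates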

I'm currently working on the Host table (in nutchgora) and like to include some of this in there too.

Mathijs 

