Posted to user@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2006/03/07 19:24:48 UTC

Link Farms

We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.

I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.

Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?

-- 
Rod Taylor <rb...@sitesell.com>


Re: Link Farms

Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
I don't think it's a slam dunk either; even Google doesn't do a super 
job of detecting these.  I think a lot of it is still done manually.

I think you'd have to look at detecting closed or mostly closed 
networks (since a link farm would be relatively clustered from a link 
perspective).  As noted, that's not too easy to implement, which is why 
people working in SEO still use this technique to game the search engines.

Besides, it gets crazy fast trying to pin this stuff down.  I spoke to 
someone who was complaining about managing 400+ web hosting accounts.  
It's tough to nail folks going to that level.




Ken Krugler wrote:

>> We've managed to dig ourselves into a couple of link farms with tens of
>> thousands of sub-domains.
>>
>> I didn't notice until they blocked our DNS requests and the Nutch error
>> rates shot way up.
>>
>> Are there any methods for detecting these things (more than 100
>> sub-domains) or a master list somewhere that we can filter?
>
>
> I've read a paper on detecting link farms, but from what I remember, 
> it wasn't a slam-dunk to implement.
>
> So far we've relied on manually detecting these, and then pruning the 
> results from the crawldb and adding them to the regex-urlfilter file.
>
> -- Ken


Re: Link Farms

Posted by Matt Kangas <ka...@gmail.com>.
Hi folks,

Offhand, I'm not aware of any slam-dunk solution to link farms  
either. One thing that could help mitigate the problem is a pre-built  
blacklist of some sort. For example:

http://www.squidguard.org/blacklist/

That one is really meant for blocking user-access to porn, known  
warez providers, etc, but it may have some value for you.
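
For what it's worth, those blacklists are mostly flat files of domains 
(one per line), so checking a URL against one before fetching is only a 
few lines. A hypothetical sketch, not Nutch code (the file layout and 
class name are my own assumptions):

// Hypothetical sketch: load a squidGuard-style "domains" file (one domain per
// line) and reject any URL whose host falls under a listed domain.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class DomainBlacklist {
    private final Set<String> domains = new HashSet<String>();

    public DomainBlacklist(String domainsFile) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(domainsFile));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim().toLowerCase();
            if (line.length() > 0) domains.add(line);
        }
        in.close();
    }

    /** True if the URL's host, or any parent domain of it, is blacklisted. */
    public boolean isBlocked(String url) throws Exception {
        String host = new URL(url).getHost().toLowerCase();
        // check "a.b.example.com", then "b.example.com", then "example.com", ...
        while (host.indexOf('.') != -1) {
            if (domains.contains(host)) return true;
            host = host.substring(host.indexOf('.') + 1);
        }
        return domains.contains(host);
    }
}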

Another source of link farms is parked-domain providers. Many of  
these can be identified by their DNS server name. Some of the top  
offenders (afaik) include:
- dns(\d+).name-services.com
- ns(\d+).directnic.com
- ns(\d+).itsyourdomain.com
- park(\d+).secureserver.net
- ns.buydomains.com
- this-domain-for-sale.com
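
None of this is built into Nutch as far as I know, but checking a 
candidate domain's nameservers against that list is only a few lines of 
code. A rough sketch (the JNDI DNS lookup and class name are just my 
illustration; the patterns are the ones above):

// Hypothetical helper (not part of Nutch): look up a domain's NS records via
// the JNDI DNS provider and flag the domain if any nameserver matches one of
// the parked-domain patterns listed above.
import java.util.Hashtable;
import java.util.regex.Pattern;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;

public class ParkedDomainCheck {

    private static final Pattern[] PARKED_NS = {
        Pattern.compile("dns\\d+\\.name-services\\.com"),
        Pattern.compile("ns\\d+\\.directnic\\.com"),
        Pattern.compile("ns\\d+\\.itsyourdomain\\.com"),
        Pattern.compile("park\\d+\\.secureserver\\.net"),
        Pattern.compile("ns\\.buydomains\\.com"),
        Pattern.compile("(.*\\.)?this-domain-for-sale\\.com")
    };

    /** True if any NS record for the domain matches a known parked-domain provider. */
    public static boolean looksParked(String domain) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        env.put("java.naming.provider.url", "dns:");
        InitialDirContext ctx = new InitialDirContext(env);
        Attribute ns = ctx.getAttributes(domain, new String[] { "NS" }).get("NS");
        if (ns == null) return false;
        for (int i = 0; i < ns.size(); i++) {
            String server = ns.get(i).toString().toLowerCase();
            if (server.endsWith(".")) server = server.substring(0, server.length() - 1);
            for (Pattern p : PARKED_NS) {
                if (p.matcher(server).matches()) return true;
            }
        }
        return false;
    }
}

A domain that trips this check could then go straight into your URL 
filters instead of being fetched.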

A reasonable first pass at building that list can be achieved by getting  
the Verisign COM zone file, counting the domains per DNS server, and then  
checking the top 100 or so. (That's what I did, anyway! :)
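
The counting step itself is trivial once you have the delegations in a 
"domain NS nameserver" form; a hypothetical sketch (the field layout is 
an assumption, real zone files need more careful parsing):

// Hypothetical sketch: count how many delegation records point at each
// nameserver, given an input file of "domain NS nameserver" lines extracted
// from the zone file.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class NsCounter {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 3 || !"NS".equalsIgnoreCase(f[1])) continue;
            String ns = f[2].toLowerCase();
            Integer c = counts.get(ns);
            counts.put(ns, c == null ? 1 : c + 1);
        }
        in.close();
        // Print "count <tab> nameserver"; pipe through "sort -rn | head -100"
        // to see the busiest nameservers.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}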

Rod, does that help you? Or are you hitting a different type of link  
farm?

--Matt

On Mar 7, 2006, at 5:13 PM, Stefan Groschupf wrote:

> Hi,
>
> Is the content of the pages 'mostly' identical?
> Since we can now provide custom hash implementations to the  
> crawlDB, what do people think about locality-sensitive hashing?
>
> http://citeseer.ist.psu.edu/haveliwala00scalable.html
>
> As far as I understand the paper, we can implement the hashing in a  
> way that lets us treat 'similar' pages (differing by just one word)  
> as one page.
> My experience with link farms is that the pages are identical except  
> for one number, word, date, or something like that.
> In such a case, LSH could be an interesting way to attack the  
> problem.
>
> Any thoughts?
>
> Stefan
>
>
> On 07.03.2006 at 22:38, Ken Krugler wrote:
>
>>> We've managed to dig ourselves into a couple of link farms with  
>>> tens of
>>> thousands of sub-domains.
>>>
>>> I didn't notice until they blocked our DNS requests and the Nutch  
>>> error
>>> rates shot way up.
>>>
>>> Are there any methods for detecting these things (more than 100
>>> sub-domains) or a master list somewhere that we can filter?
>>
>> I've read a paper on detecting link farms, but from what I  
>> remember, it wasn't a slam-dunk to implement.
>>
>> So far we've relied on manually detecting these, and then pruning  
>> the results from the crawldb and adding them to the regex- 
>> urlfilter file.
>>
>> -- Ken
>
>

--
Matt Kangas / kangas@gmail.com



Re: Link Farms

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,

Is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the crawlDB,  
what do people think about locality-sensitive hashing?

http://citeseer.ist.psu.edu/haveliwala00scalable.html

As far as I understand the paper, we can implement the hashing in a way  
that lets us treat 'similar' pages (differing by just one word) as one page.
My experience with link farms is that the pages are identical except for  
one number, word, date, or something like that.
In such a case, LSH could be an interesting way to attack the problem.
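
To make that concrete: the kind of signature the paper describes can be 
approximated with a shingle/minhash sketch. A toy illustration (this is 
not the crawlDB hash hook itself, just the idea):

// Toy sketch of a shingle/minhash-style signature in the spirit of the paper
// above; just an illustration, not the crawlDB hash implementation.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MinHashSketch {

    private static final int NUM_SLOTS = 32;

    /** Build a small minhash signature from word 3-grams (shingles) of the page text. */
    public static int[] signature(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> shingles = new HashSet<String>();
        for (int i = 0; i + 2 < words.length; i++) {
            shingles.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        int[] sig = new int[NUM_SLOTS];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            int h = s.hashCode();
            for (int k = 0; k < NUM_SLOTS; k++) {
                // cheap per-slot re-hash; a real implementation would use
                // independent hash functions
                int hk = (h ^ (k * 0x9e3779b9)) * 31 + k;
                if (hk < sig[k]) sig[k] = hk;
            }
        }
        return sig;
    }

    /** Estimated similarity: the fraction of signature slots that agree. */
    public static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int k = 0; k < a.length; k++) {
            if (a[k] == b[k]) same++;
        }
        return (double) same / a.length;
    }
}

Two pages whose signatures agree in, say, 30 of 32 slots are almost 
certainly the same template with one word changed, and could be 
collapsed into a single entry.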

Any thoughts?

Stefan


On 07.03.2006 at 22:38, Ken Krugler wrote:

>> We've managed to dig ourselves into a couple of link farms with  
>> tens of
>> thousands of sub-domains.
>>
>> I didn't notice until they blocked our DNS requests and the Nutch  
>> error
>> rates shot way up.
>>
>> Are there any methods for detecting these things (more than 100
>> sub-domains) or a master list somewhere that we can filter?
>
> I've read a paper on detecting link farms, but from what I  
> remember, it wasn't a slam-dunk to implement.
>
> So far we've relied on manually detecting these, and then pruning  
> the results from the crawldb and adding them to the regex-urlfilter  
> file.
>
> -- Ken
> -- 
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: Link Farms

Posted by Ken Krugler <kk...@transpac.com>.
>We've managed to dig ourselves into a couple of link farms with tens of
>thousands of sub-domains.
>
>I didn't notice until they blocked our DNS requests and the Nutch error
>rates shot way up.
>
>Are there any methods for detecting these things (more than 100
>sub-domains) or a master list somewhere that we can filter?

I've read a paper on detecting link farms, but from what I remember, 
it wasn't a slam-dunk to implement.

So far we've relied on manually detecting these, and then pruning the 
results from the crawldb and adding them to the regex-urlfilter file.
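
For reference, once a farm has been identified, excluding it in 
conf/regex-urlfilter.txt looks roughly like this (the domain here is 
invented for the example):

# skip the discovered link farm and all of its sub-domains
-^http://([a-z0-9-]+\.)*link-farm-example\.com/

Rules are applied in order and the first match wins, so this goes above 
the usual catch-all accept rule at the end of the file.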

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"