Posted to user@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2006/03/07 19:24:48 UTC
Link Farms
We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.
I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.
Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?
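For the sub-domain-count test itself: given a dump of crawled hostnames, one per line, flagging parents with more than 100 sub-domains is a quick script. A minimal sketch, assuming a naive "last two labels" rule for the parent domain (which gets .co.uk-style suffixes wrong) and a hypothetical crawl_hosts.txt dump:

from collections import Counter

# Count distinct hostnames under each parent domain and flag parents
# that exceed the 100-sub-domain threshold from the question above.
def parent_domain(host):
    labels = host.lower().rstrip('.').split('.')
    # Naive rule: last two labels. Wrong for .co.uk-style suffixes.
    return '.'.join(labels[-2:]) if len(labels) >= 2 else host

counts = Counter()
seen = set()
with open('crawl_hosts.txt') as f:   # one hostname per line (assumed dump)
    for line in f:
        host = line.strip()
        if host and host not in seen:
            seen.add(host)
            counts[parent_domain(host)] += 1

for domain, n in counts.most_common():
    if n > 100:
        print(f'{domain}\t{n} sub-domains')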
--
Rod Taylor <rb...@sitesell.com>
Re: Link Farms
Posted by "Insurance Squared Inc." <gc...@insurancesquared.com>.
I don't think it is a slam dunk either; even Google doesn't do a super
job of detecting these. I think a lot of it is still done manually.
I think you'd have to look at detecting closed networks, or mostly
closed networks, since a link farm would be relatively clustered from a
link perspective. As noted, that's not easy to implement, which is why
people working in SEO still use this technique to game the search engines.
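To make the "mostly closed network" idea concrete: for a candidate group of hosts, score what fraction of their outlinks stay inside the group. A minimal sketch, assuming you've already extracted a host-level link graph (the inputs and host names here are hypothetical):

# Score a candidate group of hosts by how "closed" its link graph is:
# the fraction of outbound links that stay inside the group.
def closedness(group, outlinks):
    members = set(group)
    internal = external = 0
    for host in members:
        for target in outlinks.get(host, []):
            if target in members:
                internal += 1
            else:
                external += 1
    total = internal + external
    return internal / total if total else 0.0

# Tiny made-up example: three hosts linking mostly to each other.
outlinks = {
    'a.example': ['b.example', 'c.example'],
    'b.example': ['a.example'],
    'c.example': ['a.example', 'news.example.org'],
}
print(closedness(['a.example', 'b.example', 'c.example'], outlinks))  # 0.8
# A score near 1.0 over a large group is a link-farm smell, not proof.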
Besides, trying to pin this stuff down gets crazy fast. I spoke to
someone who was complaining about managing 400+ webhosting accounts.
It's tough to nail folks operating at that level.
Ken Krugler wrote:
>> We've managed to dig ourselves into a couple of link farms with tens of
>> thousands of sub-domains.
>>
>> I didn't notice until they blocked our DNS requests and the Nutch error
>> rates shot way up.
>>
>> Are there any methods for detecting these things (more than 100
>> sub-domains) or a master list somewhere that we can filter?
>
>
> I've read a paper on detecting link farms, but from what I remember,
> it wasn't a slam-dunk to implement.
>
> So far we've relied on manually detecting these, and then pruning the
> results from the crawldb and adding them to the regex-urlfilter file.
>
> -- Ken
Re: Link Farms
Posted by Matt Kangas <ka...@gmail.com>.
Hi folks,
Offhand, I'm not aware of any slam-dunk solution to link farms
either. One thing that could help mitigate the problem is a pre-built
blacklist of some sort. For example:
http://www.squidguard.org/blacklist/
That one is really meant for blocking user access to porn, known
warez providers, etc., but it may have some value for you.
Another source of link farms is parked-domain providers. Many of
these can be identified by their DNS server name. Some of the top
offenders (AFAIK) include (see the sketch after the list):
- dns(\d+).name-services.com
- ns(\d+).directnic.com
- ns(\d+).itsyourdomain.com
- park(\d+).secureserver.net
- ns.buydomains.com
- this-domain-for-sale.com
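A minimal sketch of that check, assuming the third-party dnspython package and using only the patterns listed above (with the dots escaped properly):

import re
import dns.resolver   # third-party: pip install dnspython

# The parked-domain nameserver patterns from the list above.
PARKED_NS = [re.compile(p) for p in (
    r'dns\d+\.name-services\.com$',
    r'ns\d+\.directnic\.com$',
    r'ns\d+\.itsyourdomain\.com$',
    r'park\d+\.secureserver\.net$',
    r'ns\.buydomains\.com$',
    r'this-domain-for-sale\.com$',
)]

def looks_parked(domain):
    try:
        answers = dns.resolver.resolve(domain, 'NS')
    except Exception:
        return False   # NXDOMAIN, timeout, etc.: no evidence either way
    for record in answers:
        ns = str(record.target).rstrip('.').lower()
        if any(p.search(ns) for p in PARKED_NS):
            return True
    return False

print(looks_parked('example.com'))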
A reasonable first pass at this list can be achieved by getting the
Verisign .COM zone file, counting the domains per DNS server, and
then checking the top 100 or so. (That's what I did, anyway! :)
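That pass is essentially one big group-by. A minimal sketch, assuming the zone data has already been flattened to "domain nameserver" pairs, one per line, in a hypothetical com_zone_pairs.txt (the raw zone file needs real parsing first):

from collections import Counter

# Count domains per nameserver; the heavy hitters are worth a look.
ns_counts = Counter()
with open('com_zone_pairs.txt') as f:   # "domain nameserver" per line
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            domain, ns = parts
            ns_counts[ns.lower()] += 1

for ns, n in ns_counts.most_common(100):
    print(f'{n:>8}  {ns}')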
Rod, does that help you? Or are you hitting a different type of link
farm?
--Matt
On Mar 7, 2006, at 5:13 PM, Stefan Groschupf wrote:
> Hi,
>
> Is the content of the pages 'mostly' identical?
> Since we can now provide custom hash implementations to the
> crawlDB, what do people think about locality-sensitive hashing?
>
> http://citeseer.ist.psu.edu/haveliwala00scalable.html
>
> As far as I understand the paper, we can implement the hashing in a
> way that treats 'similar' pages (differing by just one word) as one.
> My experience with link farms is that pages are identical except for
> one number, word, or date, or something like that.
> In such a case, LSH could be an interesting way to solve the problem.
>
> Any thoughts?
>
> Stefan
>
>
> Am 07.03.2006 um 22:38 schrieb Ken Krugler:
>
>>> We've managed to dig ourselves into a couple of link farms with
>>> tens of
>>> thousands of sub-domains.
>>>
>>> I didn't notice until they blocked our DNS requests and the Nutch
>>> error
>>> rates shot way up.
>>>
>>> Are there any methods for detecting these things (more than 100
>>> sub-domains) or a master list somewhere that we can filter?
>>
>> I've read a paper on detecting link farms, but from what I
>> remember, it wasn't a slam-dunk to implement.
>>
>> So far we've relied on manually detecting these, and then pruning
>> the results from the crawldb and adding them to the regex-
>> urlfilter file.
>>
>> -- Ken
>
>
--
Matt Kangas / kangas@gmail.com
Re: Link Farms
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
Is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the crawlDB,
what do people think about locality-sensitive hashing?
http://citeseer.ist.psu.edu/haveliwala00scalable.html
As far as I understand the paper, we can implement the hashing in a
way that treats 'similar' pages (differing by just one word) as one.
My experience with link farms is that pages are identical except for
one number, word, or date, or something like that.
In such a case, LSH could be an interesting way to solve the problem.
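To make that concrete, here is a toy min-hashing sketch over word shingles (an illustration of the general idea, not the exact scheme from the paper): pages that differ by a single word end up with signatures agreeing in most positions, so they can be bucketed together as near-duplicates.

import hashlib

# Toy MinHash: the fraction of matching signature positions estimates
# the Jaccard similarity of the two pages' shingle sets.
def shingles(text, k=4):
    words = text.lower().split()
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text, num_hashes=32):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f'{seed}:{s}'.encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = ('great deals on car insurance quotes compare rates online today '
     'save money fast easy free no obligation apply now')
b = ('great deals on home insurance quotes compare rates online today '
     'save money fast easy free no obligation apply now')
c = 'completely different page about gardening tips and tomato plants'

print(similarity(minhash(a), minhash(b)))   # one word changed: high score
print(similarity(minhash(a), minhash(c)))   # unrelated page: near zero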
Any thoughts?
Stefan
Am 07.03.2006 um 22:38 schrieb Ken Krugler:
>> We've managed to dig ourselves into a couple of link farms with
>> tens of
>> thousands of sub-domains.
>>
>> I didn't notice until they blocked our DNS requests and the Nutch
>> error
>> rates shot way up.
>>
>> Are there any methods for detecting these things (more than 100
>> sub-domains) or a master list somewhere that we can filter?
>
> I've read a paper on detecting link farms, but from what I
> remember, it wasn't a slam-dunk to implement.
>
> So far we've relied on manually detecting these, and then pruning
> the results from the crawldb and adding them to the regex-urlfilter
> file.
>
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com
Re: Link Farms
Posted by Ken Krugler <kk...@transpac.com>.
>We've managed to dig ourselves into a couple of link farms with tens of
>thousands of sub-domains.
>
>I didn't notice until they blocked our DNS requests and the Nutch error
>rates shot way up.
>
>Are there any methods for detecting these things (more than 100
>sub-domains) or a master list somewhere that we can filter?
I've read a paper on detecting link farms, but from what I remember,
it wasn't a slam-dunk to implement.
So far we've relied on manually detecting these, and then pruning the
results from the crawldb and adding them to the regex-urlfilter file.
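For anyone new to Nutch: the regex-urlfilter file takes one rule per
line, '+' to accept and '-' to reject, and the first matching rule wins.
A pruned farm ends up as an entry like the following (the domain names
here are placeholders), placed above the final catch-all '+.' rule:

# Reject a link farm and all of its sub-domains (hypothetical domains).
-^http://([a-z0-9-]+\.)*link-farm-example\.com/
-^http://([a-z0-9-]+\.)*parked-junk-example\.net/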
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"