Posted to dev@spamassassin.apache.org by Henry Stern <hs...@apache.org> on 2004/12/22 20:00:16 UTC
RFC: New subproject, BlogSpamAssassin
Hello all,
Considering the latest press on blog comment spam, I think that it's
time that we organize a cross-platform project to address the problem.
There are a considerable number of plugins implemented for various blog
software with the intent of reducing blog spam but many are ineffective
or require a tremendous amount of work to maintain (Jay's mt-blacklist
plugin is definitely the latter).
http://news.netcraft.com/archives/2004/12/17/hosts_disable_movable_type_as_comment_spam_slows_servers.html
http://it.slashdot.org/article.pl?sid=04/12/18/1827225&tid=111&tid=128
http://www.sixapart.com/log/2004/12/more_on_comment.shtml
I propose that we create a subproject of Apache SpamAssassin to
encourage collaborative research in the area of anti blog spam with the
goal of producing cross-platform standards and implementations of
workable comment spam solutions. SpamAssassin's expertise of anti-spam
in the e-mail domain will complement the knowledge of the weblogging
community.
Here are some of the ideas that I would like to explore further and see
incorporated into standard installations of blogging software:
* Proof-of-work: A legitimate user will take several seconds to minutes
to create each unique comment while a comment spammer sends them out as
fast as possible. Consider a proof-of-work algorithm executed within
the browser (e.g. JavaScript, Java, ActiveX) added to comment submission
forms. The weblog software can safely reject all comment submissions
that lack valid proof of work. Legitimate users will not be
inconvenienced by a short delay as they submit their comment while
spammers will not be able to easily submit comments in large volumes.
For example, if a typical comment spammer sends 1000000 comments per day
and the proof of work requires 2 seconds of compute time then they will
need to dedicate 24 machines to proof-of-work computation to maintain
their rate of transmission. The drawback of this method is that users
with less capable browsers or older, slow computers may not be able to
post comments.
There is a JavaScript implementation of Hashcash that can be combined
with SpamAssassin's hashcash verification and duplicate detection
algorithms to quickly produce a prototype.
* Collaborative filtering: IronPort maintains a database of e-mail
server traffic volumes called SenderBase. Mail servers can use
SenderBase to find "traffic spikes" and potentially block e-mail from
those servers. Something similar could be done for weblogs. As
comments come in, weblogs could report the urls in the comments to a
central server. If a URL is sent in too rapidly, it can be added to a
list of probable spam urls and weblogs can quarantine or delete comments
containing that url.
* DNS-based URI Blocklists: SpamAssassin has had great success using
Jeff Chan's Spam URI Realtime Blocklists. When an e-mail arrives,
SpamAssassin extracts the urls contained within and performs a few DNS
TXT queries to find whether the url has been reported in spam. These
blocklists can be used for weblogs too. Instead of Jay maintaining a
central blocklist that people download and install manually,
mt-blacklist could use a DNS-based blocklist that is effectively updated
in real time. This would significantly cut down on comment spam because
weblog owners would not need to actively maintain their blocklists. The
submission process could be streamlined so that it doesn't consume so
much of any one person's time.
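As a rough illustration of the collaborative filtering idea above, the central server could keep a sliding window of recent reports per URL and flag any URL that is reported too often. This is only a sketch under stated assumptions: the class name `UrlSpikeTracker`, the helper `extract_urls`, and the limits (20 reports per hour) are hypothetical and do not come from SenderBase or any existing tool.

```python
import re
from collections import defaultdict, deque

# Deliberately simple URL matcher; real comment parsing would need more care.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(comment_text):
    """Return the URLs found in a comment, lowercased and de-duplicated."""
    found = URL_RE.findall(comment_text)
    return sorted({u.lower().rstrip(".,;)") for u in found})


class UrlSpikeTracker:
    """Flag URLs reported more than `max_reports` times in `window` seconds."""

    def __init__(self, max_reports=20, window=3600):
        self.max_reports = max_reports  # hypothetical limit
        self.window = window            # hypothetical window, in seconds
        self.reports = defaultdict(deque)  # url -> timestamps of recent reports

    def report(self, url, now):
        """Record one sighting of `url` at time `now`.

        Returns True when the URL has exceeded the allowed rate.
        """
        times = self.reports[url]
        times.append(now)
        # Drop sightings that have fallen out of the sliding window.
        while times and times[0] <= now - self.window:
            times.popleft()
        return len(times) > self.max_reports
```

A weblog would call `extract_urls` on each incoming comment and send the results to the server, which calls `report` for each one and publishes the URLs that return True.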
I'm very interested to hear any comments that you may have on this idea
and encourage you to pass this information on to your developer lists as
well as to other weblog software developers that I have missed.
I look forward to collaborating with you in the future.
Best regards,
Henry Stern
Committer, SpamAssassin
Re: RFC: New subproject, BlogSpamAssassin
Posted by Matthew Mullenweg <m...@mullenweg.com>.
Michael Parker wrote:
> There seems to already be some movement in this area:
> http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html
There was a similar effort for WP:
http://wordpress.org/support/10/12268
It was too difficult for most people to set up so it didn't get far.
A Big List of other things people have been doing:
http://codex.wordpress.org/Combat_Comment_Spam
--
Matt Mullenweg
http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com
Re: RFC: New subproject, BlogSpamAssassin
Posted by Matthew Mullenweg <m...@mullenweg.com>.
Henry Stern wrote:
> Interesting plugin. However, I'm a bit skeptical of how well
> content-based filtering will work for blog spam. The main difference
> between e-mail spam and weblog spam is that e-mail spam is intended to
> be read by a person, whereas blog spam is intended to be read by a
> search engine's spider.
My experience has been content filtering can be very effective, because
no one wants to be first on Google for "v1agra". Therefore the obfuscation
techniques they can use are somewhat limited. WP had virtually no spam
until they found a bug in older versions where they could use lower
numeric entities (like &#101; for "e") to get past the *very* basic
moderation filters we had in place and still be read correctly by
Google. The WordPress plugin community has been very active in
addressing this problem, so let me take a moment to point out some of
the tools currently out there:
===
http://elliottback.com/wp/archives/2004/11/29/spam-stopgap-extreme/
http://dev.wp-plugins.org/browser/wp-hashcash/trunk/
This is a JS proof of work implementation that has been extremely (100%)
effective in blocking non-human spam thus far. This is the only
technique of this type that has worked for more than about a week; other
modifications, such as adding random fields, asking questions in the
comment form, and changing the URI of the comment post script, have been
bypassed by the bots within a few days.
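For readers unfamiliar with the technique, here is a minimal Hashcash-style sketch of what such a proof of work involves. It is not the wp-hashcash code and does not use the real Hashcash stamp format; the `challenge:nonce` string layout and the function names are assumptions made for illustration. The last helper reproduces the estimate from earlier in the thread (1,000,000 comments per day at 2 seconds each).

```python
import hashlib
import math
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge, nonce, bits):
    """Cheap server-side check: a single hash per submitted comment."""
    digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def mint(challenge, bits):
    """Expensive client-side search: about 2**bits hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce


def machines_needed(comments_per_day, seconds_per_proof):
    """Machines a spammer must dedicate to sustain a given daily volume."""
    seconds_per_day = 24 * 60 * 60
    return math.ceil(comments_per_day * seconds_per_proof / seconds_per_day)
```

In this sketch the blog would embed a fresh `challenge` in each comment form, the browser would run the equivalent of `mint`, and the server would call `verify` and refuse any challenge it has already accepted once. `machines_needed(1000000, 2)` gives 24, matching the figure quoted earlier.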
Things along this line will not be effective in the long run because
there is a commenting protocol popularized by Six Apart designed
specifically for no human involvement: TrackBack. This is an essential
feature for many bloggers.
http://www.movabletype.org/trackback/
Pingback is more robust and requires a link back, but can still be spoofed:
http://www.hixie.ch/specs/pingback/pingback
The approach we're taking to that is whitelisting URIs in the
WP-integrated blogroll and moderating the others. We also don't allow any
markup within these comments.
===
http://wordpress.org/development/2004/12/fight-spam/
http://mookitty.co.uk/devblog/category/kittens-spaminator/
http://www.unknowngenius.com/blog/static/spam-karma
These are the two plugins that combined about a dozen different efforts
that were going on. Both have a scoring system very much like
SpamAssassin in some ways that uses content characteristics, RBL
lookups, user agent characteristics (how long it was on the page
beforehand, whether it is coming through a proxy), and contextual
characteristics like the
age of the post. Spaminator has a "tar pit" which tries to delay bots
when one has been identified by inserting random delays before
responses. This seems to have pissed them off enough that several
of the bots now check for the Spaminator files before targeting a weblog.
Spam Karma is interesting because if your comment is borderline spam
(right on the threshold) you can get it through by filling out an image
CAPTCHA or responding to an email confirmation, thus it combines CAPTCHA
with an accessible alternative.
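The scoring approach described above can be sketched roughly as follows. This is not code from Spaminator or Spam Karma; the rule list, the weights, the threshold, and the `Comment` fields are all invented for illustration. The point is only the shape: each rule adds points, and comments near the threshold get a challenge instead of an outright rejection.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    body: str
    seconds_on_page: float  # time between loading the form and submitting
    via_proxy: bool
    post_age_days: int
    url_on_blocklist: bool


# (description, points, test). All weights are made up for illustration.
RULES = [
    ("URL is on a blocklist", 3.0, lambda c: c.url_on_blocklist),
    ("submitted within 2 seconds", 2.0, lambda c: c.seconds_on_page < 2),
    ("came through a proxy", 1.0, lambda c: c.via_proxy),
    ("post is over 30 days old", 1.0, lambda c: c.post_age_days > 30),
    ("three or more links", 1.5,
     lambda c: c.body.lower().count("http://") >= 3),
]


def score(comment):
    """Add up the points of every rule that matches the comment."""
    return sum(points for _, points, test in RULES if test(comment))


def decide(comment, threshold=4.0, margin=1.0):
    """Accept, reject, or (near the threshold) ask for a CAPTCHA or e-mail."""
    s = score(comment)
    if s < threshold - margin:
        return "accept"
    if s <= threshold + margin:
        return "challenge"
    return "reject"
```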
===
Others
I've seen some interesting talk of centralized/decentralized systems,
which operate much like razor or pyzor except the server is freely
available and easy to install as an add-on to WordPress. Submissions can
come from trusted sources with keys and then a web of trust can be
extended out by utilizing XFN metadata that WordPress supports in its
blogrolls.
http://gmpg.org/xfn/
This could be very interesting, as it would be hard to target in a
central fashion (there can be hundreds/thousands of "servers") and it
doesn't require much manual intervention by the person running the
plugin; only the person running the server has to be proactive. It could
also scale well. However, the code for this isn't ready for release yet;
it's undergoing a security audit and review.
===
Tool level
On the core WordPress level I've been focused on bugs that could allow
bypassing the content filters (like the numeric entity thing) and making
the attack surface as small as possible. WP has a nice moderation system
where you can say a comment needs to be approved manually before it will
show up on the site, so enabling this automatically for old or inactive
discussions is a great way to make the "open targets" fewer and still
not kill conversation on older entries. (Most bloggers *love* comments
and the thought of missing some is painful.)
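To make the numeric entity bypass concrete: a filter that matches the raw comment text misses a word whose letters are written as entities, while a browser or search engine decodes them. A minimal sketch of the fix, assuming a simple word list, is to decode entities before matching. The names `BANNED_WORDS`, `normalize`, and `needs_moderation` are hypothetical and this is not WordPress's actual filter code.

```python
import html
import re

BANNED_WORDS = ["poker"]  # stand-in word list for illustration


def normalize(text):
    """Decode HTML entities so the filter sees what a search engine sees."""
    return html.unescape(text).lower()


def needs_moderation(comment_body):
    """Return True if any banned word appears after entity decoding."""
    text = normalize(comment_body)
    return any(
        re.search(r"\b" + re.escape(word) + r"\b", text)
        for word in BANNED_WORDS
    )
```

Here `needs_moderation("play pok&#101;r now")` returns True, whereas a check against the raw string would not find the word.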
So, I hope that's a helpful overview to get the conversation started.
--
Matt Mullenweg
http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com
Re: RFC: New subproject, BlogSpamAssassin
Posted by Henry Stern <he...@stern.ca>.
Interesting plugin. However, I'm a bit skeptical of how well
content-based filtering will work for blog spam. The main difference
between e-mail spam and weblog spam is that e-mail spam is intended to
be read by a person, whereas blog spam is intended to be read by a
search engine's spider.
Rather than porting SpamAssassin to weblogs, I'm suggesting that we take
what we know from the spam e-mail domain and help to come up with a
permanent solution to weblog spam.
Henry
Michael Parker wrote:
> On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote:
>
>>I'm very interested to hear any comments that you may have on this idea
>>and encourage you to pass this information on to your developer lists as
>>well as to other weblog software developers that I have missed.
>>
>>I look forward to collaborating with you in the future.
>>
>
>
> There seems to already be some movement in this area:
> http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html
>
> I haven't looked at it, but have pointed people to it in the past.
>
> Michael
Re: RFC: New subproject, BlogSpamAssassin
Posted by Michael Parker <pa...@pobox.com>.
On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote:
>
> I'm very interested to hear any comments that you may have on this idea
> and encourage you to pass this information on to your developer lists as
> well as to other weblog software developers that I have missed.
>
> I look forward to collaborating with you in the future.
>
There seems to already be some movement in this area:
http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html
I haven't looked at it, but have pointed people to it in the past.
Michael
Re: RFC: New subproject, BlogSpamAssassin
Posted by Michael Parker <pa...@pobox.com>.
On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote:
>
> I propose that we create a subproject of Apache SpamAssassin to
> encourage collaborative research in the area of anti blog spam with the
> goal of producing cross-platform standards and implementations of
> workable comment spam solutions. SpamAssassin's expertise of anti-spam
> in the e-mail domain will complement the knowledge of the weblogging
> community.
>
My $.02. I agree that blogspam is bad and that some sort of system that
could leverage functionality similar to SA's could prove useful.
I do, however, have reservations about it being an SA sub-project. If
anything, perhaps a better project would be a spamd-like server that
you could feed blog posts/comments and it would return a score. It
could still make use of some of SA's strongest core functions with
small tweaks to accept a slightly different payload for checking.
Here are a couple of reasons:
1) The vast number of blogging software packages, in different
languages, for different platforms, and different APIs.
I counter this concern with my idea above of a spamd-like
server, which could interface with any system, so long as it spoke
the proper protocol.
2) Community/Expertise
I have no doubt that, if done right, something like this would
really take off and be used by a lot of people. It would have to
be dumb simple to use (drop in and go, with minimal config),
because that is how much of the blogging software is today. My
concern would be that while a significant user community exists,
what is the development community like, and how spread out is it
amongst the different packages?
To my knowledge, only one of the core SA developers blogs with any
frequency, and I'm pretty sure he doesn't use one of the mainstream
blogging software packages. I won't speak for other developers, but
it is likely that the expertise does not currently exist within the
developers now. This means that an initial developer base would
have to be established before any significant work could be done.
The SA development community is very small, almost too small, and
that was a concern of the board during our incubation (see board minutes
for when we were voted out of the Incubator). Going into this
without a good development community is likely to lead to failure. If
you look at the Incubator as an example, they do not like to take
on projects that do not already have an established set of
committers. I would like to see several developer types step up
and say, "Yes, this is a good idea and I am willing to put in the
time and effort to make it happen." I think those developer types
need to be familiar with one or more of the mainstream blogging
packages and/or willing to learn.
Anyway, those are my thoughts. Like I said, it could turn out to be
huge, but I'm not completely convinced that an SA subproject is the
way to go.
The separate spamd-like server appeals to me because it could be used
for a good number of other things, not just blog spam.
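A language-neutral scoring service of this kind might look something like the sketch below. It does not use spamd's real protocol: the JSON request and response shapes, the function names, and the placeholder scorer are all assumptions chosen only to show that any blogging package able to send a small text payload could use the same server.

```python
import json


def score_text(text):
    """Placeholder scorer: one point per link.

    A real server would run its rule set here instead.
    """
    lowered = text.lower()
    return float(lowered.count("http://") + lowered.count("https://"))


def handle_request(raw, threshold=3.0):
    """Turn one JSON request into one JSON response.

    The function is independent of transport, so it could sit behind a
    TCP socket, HTTP, or XML-RPC wrapper.
    """
    try:
        request = json.loads(raw)
        body = request["body"]
        if not isinstance(body, str):
            raise TypeError("body must be a string")
    except (ValueError, KeyError, TypeError):
        return json.dumps(
            {"error": "expected a JSON object with a string 'body'"}
        )
    s = score_text(body)
    return json.dumps(
        {"score": s, "threshold": threshold, "spam": s >= threshold}
    )
```

A blog plugin in any language would send `{"body": "..."}` and act on the `spam` flag or the raw `score` in the reply.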
Michael
Re: New subproject, BlogSpamAssassin
Posted by Loren Wilton <lw...@earthlink.net>.
Someone a few months back already implemented a way to integrate SA with at
least one of the blog tools, and I think reported that it helped a lot.
This was just using normal SA filtering, I believe along with a modified
rule base. I think this implementation was pre-3.0, or no later than the
early beta days. References to it should be found in the SA talk list. It
could make a good place to start; I believe it was released as free
software.
Re: RFC: New subproject, BlogSpamAssassin
Posted by Matthew Mullenweg <m...@mullenweg.com>.
Henry Stern wrote:
> * DNS-based URI Blocklists: SpamAssassin has had great success using
> Jeff Chan's Spam URI Realtime Blocklists. When an e-mail arrives,
> SpamAssassin extracts the urls contained within and performs a few DNS
> TXT queries to find whether the url has been reported in spam. These
> blocklists can be used for weblogs too.
The tools I know of for dealing with DNS in PHP are not particularly
robust with regard to timeouts and such, so they sometimes introduce
unacceptable delays into the normal operations of a website. However,
almost all weblogging tools have some sort of XML-RPC toolkit for dealing
with the various blogger APIs, so a web service equivalent to anything
that's set up would be very convenient. I'm not familiar with the
scaling issues these services might face, so I don't know if that's
practical or not.
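The timeout concern can be illustrated with a short sketch of a blocklist lookup that gives up after a fixed delay instead of stalling the page. It is written in Python rather than PHP, the zone name `blocklist.example` is a placeholder rather than a real list, and `domain` is assumed to be already reduced to the registered domain. The lookup runs in a worker thread so the caller never waits longer than `timeout`.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def is_listed(domain, zone="blocklist.example", timeout=1.0,
              resolve=socket.gethostbyname):
    """Check whether `domain` is listed in a DNS blocklist zone.

    Returns True if the name resolves, False if it does not, and None
    if no answer arrived within `timeout` seconds.
    """
    query = f"{domain}.{zone}"
    future = _pool.submit(resolve, query)
    try:
        future.result(timeout=timeout)
    except LookupTimeout:
        return None   # too slow: let the caller decide what to do
    except OSError:
        return False  # lookup failed or name does not exist: not listed
    return True       # any address record means the domain is listed
```

A caller would typically treat None like False and publish the comment, so that a slow DNS server degrades filtering rather than the website. The `resolve` parameter exists only so the function can be exercised without network access.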
--
Matt Mullenweg
http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com