Posted to dev@spamassassin.apache.org by Henry Stern <hs...@apache.org> on 2004/12/22 20:00:16 UTC

RFC: New subproject, BlogSpamAssassin

Hello all,

Considering the latest press on blog comment spam, I think that it's
time that we organize a cross-platform project to address the problem.
There are a considerable number of plugins implemented for various blog
software with the intent of reducing blog spam but many are ineffective
or require a tremendous amount of work to maintain (Jay's mt-blacklist
plugin is definitely the latter).

http://news.netcraft.com/archives/2004/12/17/hosts_disable_movable_type_as_comment_spam_slows_servers.html
http://it.slashdot.org/article.pl?sid=04/12/18/1827225&tid=111&tid=128
http://www.sixapart.com/log/2004/12/more_on_comment.shtml

I propose that we create a subproject of Apache SpamAssassin to
encourage collaborative research in the area of anti blog spam with the
goal of producing cross-platform standards and implementations of
workable comment spam solutions.  SpamAssassin's expertise of anti-spam
in the e-mail domain will complement the knowledge of the weblogging
community.

Here are some of the ideas that I would like to explore further and see
incorporated into standard installations of blogging software:

* Proof-of-work:  A legitimate user will take several seconds to minutes
to create each unique comment while a comment spammer sends them out as
fast as possible.  Consider a proof-of-work algorithm executed within
the browser (e.g. javascript, java, activex) added to comment submission
forms.  The weblog software can safely reject all comment submissions
that lack valid proof of work.  Legitimate users will not be
inconvenienced by a short delay as they submit their comment while
spammers will not be able to easily submit comments in large volumes.
For example, if a typical comment spammer sends 1000000 comments per day
and the proof of work requires 2 seconds of compute time then they will
need to dedicate 24 machines to proof-of-work computation to maintain
their rate of transmission.  The drawback of this method is that users
with less capable browsers or older, slow computers may not be able to
post comments.
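
The 24-machine figure can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes each machine computes one proof at a time, around the clock.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds of compute per machine per day


def machines_needed(comments_per_day: int, seconds_per_proof: float) -> int:
    """Machines required to keep up, assuming one proof at a time per machine."""
    cpu_seconds = comments_per_day * seconds_per_proof
    return math.ceil(cpu_seconds / SECONDS_PER_DAY)


# 1,000,000 comments * 2 s = 2,000,000 CPU-seconds; divided by 86,400 that is
# about 23.15 machine-days, which rounds up to 24 machines.
print(machines_needed(1_000_000, 2))
```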

There is a javascript implementation of Hashcash that can be combined
with SpamAssassin's hashcash verification and duplicate detection
algorithms to quickly produce a prototype.
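
To make the proof-of-work and duplicate-detection idea concrete, here is a minimal hashcash-style sketch. It is not the real Hashcash stamp format, nor SpamAssassin's verifier: the "challenge:nonce" layout, the CommentGate class and the bit counts are assumptions chosen for illustration. In a real deployment the minting step would run in the browser and only verification would run on the weblog.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Cheap check: one hash, however long the proof took to find."""
    digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits


def mint(challenge: str, bits: int) -> int:
    """Expensive search: about 2**bits hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce


class CommentGate:
    """Hypothetical server-side check with duplicate (double-spend) detection."""

    def __init__(self, bits: int):
        self.bits = bits
        self.spent = set()

    def accept(self, challenge: str, nonce: int) -> bool:
        stamp = (challenge, nonce)
        if stamp in self.spent or not verify(challenge, nonce, self.bits):
            return False
        self.spent.add(stamp)
        return True
```

The challenge would normally bind the proof to one comment on one post (for example a post identifier plus a hash of the comment body), so that a single proof cannot be reused across thousands of submissions.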

* Collaborative filtering:  IronPort maintains a database of e-mail
server traffic volumes called SenderBase.  Mail servers can use
SenderBase to find "traffic spikes" and potentially block e-mail from
those servers.  Something similar could be done for weblogs.  As
comments come in, weblogs could report the urls in the comments to a
central server.  If a URL is reported too rapidly, it can be added to a
list of probable spam urls and weblogs can quarantine or delete comments
containing that url.
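
A central server along these lines could start as a sliding-window counter per URL. The sketch below is an assumption about how that might look, not a description of SenderBase; the class name, the threshold and the window length are all made up.

```python
import time
from collections import defaultdict, deque


class UrlSpikeTracker:
    """Flags a URL once it has been reported too often within a time window."""

    def __init__(self, max_reports: int = 20, window_seconds: float = 3600.0):
        self.max_reports = max_reports
        self.window_seconds = window_seconds
        self._reports = defaultdict(deque)  # url -> timestamps of recent reports

    def report(self, url: str, now=None) -> bool:
        """Record one sighting; return True if the URL now looks like a spike."""
        now = time.time() if now is None else now
        times = self._reports[url]
        times.append(now)
        cutoff = now - self.window_seconds
        while times and times[0] <= cutoff:
            times.popleft()  # forget reports that fell out of the window
        return len(times) > self.max_reports
```

Weblogs would call report() for every URL found in an incoming comment and quarantine the comment when it returns True.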

* DNS-based URI Blocklists:  SpamAssassin has had great success using
Jeff Chan's Spam URI Realtime Blocklists.  When an e-mail arrives,
SpamAssassin extracts the urls contained within and performs a few DNS
TXT queries to find whether the url has been reported in spam.  These
blocklists can be used for weblogs too.  Instead of Jay maintaining a
central blocklist that people download and install manually,
mt-blacklist could use a DNS-based blocklist that is effectively updated
in real time.  This would significantly cut down on comment spam because
weblog owners would not need to actively maintain their blocklists.  The
submission process could be streamlined so that it doesn't consume so
much of any one person's time.
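
For weblog authors unfamiliar with DNS blocklists, the lookup itself is small. The sketch below makes several assumptions: it uses the zone name multi.surbl.org, it does an ordinary address (A record) lookup through the system resolver rather than a TXT query, and it naively takes the last two labels of the host name as the registered domain (which is wrong for names such as example.co.uk). The function names are illustrative.

```python
import socket
from urllib.parse import urlsplit


def blocklist_query_name(url: str, zone: str = "multi.surbl.org") -> str:
    """Build the DNS name to look up for the domain found in a comment URL."""
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"no host name in {url!r}")
    labels = host.lower().rstrip(".").split(".")
    domain = ".".join(labels[-2:])  # naive: ignores two-level TLDs
    return f"{domain}.{zone}"


def is_listed(url: str, zone: str = "multi.surbl.org",
              resolve=socket.gethostbyname) -> bool:
    """True if the blocklist zone returns an address for the URL's domain."""
    try:
        resolve(blocklist_query_name(url, zone))
    except socket.gaierror:
        # No such name means "not listed". Resolver failures also end up
        # here, so this sketch fails open.
        return False
    return True
```

Note that socket.gethostbyname blocks until the system resolver answers or gives up, so a production plugin would want its own timeout handling.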

I'm very interested to hear any comments that you may have on this idea
and encourage you to pass this information on to your developer lists as
well as to other weblog software developers that I have missed.

I look forward to collaborating with you in the future.

Best regards,

Henry Stern
Committer, SpamAssassin

Re: RFC: New subproject, BlogSpamAssassin

Posted by Matthew Mullenweg <m...@mullenweg.com>.
Michael Parker wrote:
> There seems to already be some movement in this area:
> http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html

There was a similar effort for WP:

http://wordpress.org/support/10/12268

It was too difficult for most people to set up so it didn't get far.


A Big List of other things people have been doing:

http://codex.wordpress.org/Combat_Comment_Spam

-- 
Matt Mullenweg
  http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com

Re: RFC: New subproject, BlogSpamAssassin

Posted by Matthew Mullenweg <m...@mullenweg.com>.
Henry Stern wrote:
> Interesting plugin.  However, I'm a bit skeptical of how well
> content-based filtering will work for blog spam.  The main difference
> between e-mail spam and weblog spam is that e-mail spam is intended to
> be read by a person, whereas blog spam is intended to be read by a
> search engine's spider.

My experience has been that content filtering can be very effective, because 
no one wants to be first on Google for "v1agra". Therefore the obfuscation 
techniques they can use are somewhat limited. WP had virtually no spam 
until they found a bug in older versions where they could use lower 
numeric entities (like &#101; for "e") to get past the *very* basic 
moderation filters we had in place and still be read correctly by 
Google. The WordPress plugin community has been very active in 
addressing this problem, so let me take a moment to point out some of 
the tools currently out there:

===
http://elliottback.com/wp/archives/2004/11/29/spam-stopgap-extreme/
http://dev.wp-plugins.org/browser/wp-hashcash/trunk/

This is a JS proof of work implementation that has been extremely (100%) 
effective in blocking non-human spam thus far. This is the only 
technique of this type that has worked for more than about a week; other 
modifications such as adding random fields, asking questions in the 
comment form, and changing the URI of the comment post script have been 
bypassed by the bots within a few days.

Things along this line will not be effective in the long run because 
there is a commenting protocol popularized by Six Apart designed 
specifically for no human involvement, TrackBack. This is an essential 
feature to many bloggers.

http://www.movabletype.org/trackback/

Pingback is more robust and requires a link back, but can still be spoofed:

http://www.hixie.ch/specs/pingback/pingback

The approach we're taking to that is whitelisting of URIs in the 
WP-integrated blogroll and moderation of others. We also don't allow any 
markup within these comments.

===
http://wordpress.org/development/2004/12/fight-spam/
http://mookitty.co.uk/devblog/category/kittens-spaminator/
http://www.unknowngenius.com/blog/static/spam-karma

These are the two plugins that combined about a dozen different efforts 
that were going on. Both have a scoring system very much like 
SpamAssassin in some ways that uses content characteristics, RBL 
lookups, user agent characteristics (how long it was on the page before, 
is it coming through a proxy) and contextual characteristics like the 
age of the post. Spaminator has a "tar pit" which tries to delay bots 
when one has been identified by inserting random delays before 
responses. This seems to have pissed them off enough because now several 
of the bots check for the Spaminator files before targeting a weblog. 
Spam Karma is interesting because if your comment is borderline spam 
(right on the threshold) you can get it through by filling out an image 
CAPTCHA or responding to an email confirmation, thus it combines CAPTCHA 
with an accessible alternative.
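
A scoring system of this kind can be sketched in a few lines. This is not code from Spaminator or Spam Karma: the rule names, weights, threshold and margin below are invented, and a real plugin would have many more rules.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    body: str
    seconds_on_page: float  # time between loading the form and submitting
    via_proxy: bool
    post_age_days: int
    url_blocklisted: bool


# (rule name, points, predicate); every weight here is a made-up example
RULES = [
    ("MANY_LINKS", 2.0, lambda c: c.body.count("http://") > 3),
    ("TOO_FAST", 2.0, lambda c: c.seconds_on_page < 5),
    ("VIA_PROXY", 1.0, lambda c: c.via_proxy),
    ("OLD_POST", 1.5, lambda c: c.post_age_days > 30),
    ("URL_BLOCKLISTED", 3.0, lambda c: c.url_blocklisted),
]

SPAM_THRESHOLD = 5.0
BORDERLINE_MARGIN = 1.0


def score(comment: Comment) -> float:
    """Sum the points of every rule that fires."""
    return sum(points for _, points, fires in RULES if fires(comment))


def classify(comment: Comment) -> str:
    """'accept', 'reject', or 'challenge' (CAPTCHA or e-mail confirmation)."""
    total = score(comment)
    if total >= SPAM_THRESHOLD + BORDERLINE_MARGIN:
        return "reject"
    if total >= SPAM_THRESHOLD - BORDERLINE_MARGIN:
        return "challenge"
    return "accept"
```

The "challenge" band is where the borderline behaviour described above would apply: the commenter gets a second chance instead of an outright rejection.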

===
Others

I've seen some interesting talk of centralized/decentralized systems, 
which operate much like razor or pyzor except the server is freely 
available and easy to install as an add-on to WordPress. Submissions can 
come from trusted sources with keys and then a web of trust can be 
extended out by utilizing XFN metadata that WordPress supports in its 
blogrolls.

http://gmpg.org/xfn/

This could be very interesting, as it would be hard to target in a 
central fashion (there can be hundreds/thousands of "servers") and it 
doesn't require much manual intervention by the person running the 
plugin, just the person running the server has to be proactive. It could 
also scale well. However, the code for this isn't ready for release yet; 
it's undergoing a security audit and review.

===
Tool level

On the core WordPress level I've been focused on bugs that could allow 
bypassing the content filters (like the numeric entity thing) and making 
the attack surface as small as possible. WP has a nice moderation system 
where you can say a comment needs to be approved manually before it will 
show up on the site, so enabling this automatically for old or inactive 
discussions is a great way to make the "open targets" fewer and still 
not kill conversation on older entries. (Most bloggers *love* comments 
and the thought of missing some is painful.)
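
The numeric entity bypass mentioned above comes down to filtering the raw text instead of what a browser or spider will actually see. A minimal sketch of the fix, assuming a simple word blocklist (the word list and function names are placeholders, not WordPress code):

```python
import html
import re

BLOCKED_WORDS = {"casino"}  # placeholder list for illustration


def normalise(text: str) -> str:
    """Decode character references such as &#97; or &#x61; before filtering."""
    return html.unescape(text).lower()


def needs_moderation(comment_body: str) -> bool:
    """Match blocked words against the decoded text, not the raw markup."""
    words = re.findall(r"[a-z0-9]+", normalise(comment_body))
    return any(word in BLOCKED_WORDS for word in words)
```

With this, "c&#97;sino" is caught because it is decoded to "casino" before the word list is consulted.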

So, I hope that's a helpful overview to get the conversation started.
-- 
Matt Mullenweg
  http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com

Re: RFC: New subproject, BlogSpamAssassin

Posted by Henry Stern <he...@stern.ca>.
Interesting plugin.  However, I'm a bit skeptical of how well
content-based filtering will work for blog spam.  The main difference
between e-mail spam and weblog spam is that e-mail spam is intended to
be read by a person, whereas blog spam is intended to be read by a
search engine's spider.

Rather than porting SpamAssassin to weblogs, I'm suggesting that we take
what we know from the spam e-mail domain and help to come up with a
permanent solution to weblog spam.

Henry

Michael Parker wrote:
> On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote:
>
>>I'm very interested to hear any comments that you may have on this idea
>>and encourage you to pass this information on to your developer lists as
>>well as to other weblog software developers that I have missed.
>>
>>I look forward to collaborating with you in the future.
>>
>
>
> There seems to already be some movement in this area:
> http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html
>
> I haven't looked at it, but have pointed people to it in the past.
>
> Michael

Re: RFC: New subproject, BlogSpamAssassin

Posted by Michael Parker <pa...@pobox.com>.
On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote:
> 
> I'm very interested to hear any comments that you may have on this idea
> and encourage you to pass this information on to your developer lists as
> well as to other weblog software developers that I have missed.
> 
> I look forward to collaborating with you in the future.
> 

There seems to already be some movement in this area:
http://www.hjackson.org/blog/archives/2004/11/moveable_type_s.html

I haven't looked at it, but have pointed people to it in the past.

Michael

Re: RFC: New subproject, BlogSpamAssassin

Posted by Michael Parker <pa...@pobox.com>.
On Wed, Dec 22, 2004 at 03:00:16PM -0400, Henry Stern wrote:
> 
> I propose that we create a subproject of Apache SpamAssassin to
> encourage collaborative research in the area of anti blog spam with the
> goal of producing cross-platform standards and implementations of
> workable comment spam solutions.  SpamAssassin's expertise of anti-spam
> in the e-mail domain will complement the knowledge of the weblogging
> community.
> 

My $.02.  I agree that blogspam is bad and some sort of system that
could leverage functionality similar to SA's could prove useful.

I do however have reservations about it being an SA sub-project.  If
anything, perhaps a better project would be a spamd-like server that
you could feed blogposts/comments and it would return a score.  It
could still make use of some of SA's strongest core functions with
small tweaks to accept a slightly different payload for checking.

Here are a couple of reasons:

1) The vast number of blogging software packages, in different
   languages, for different platforms, and different APIs.

   I counter this point with my idea above of a spamd-like
   server, which could interface with any system, so long as it spoke
   the proper protocol.

2) Community/Expertise

   I have no doubt, that if done right, something like this would
   really take off and be used by a lot of people.  It would have to
   be dumb simple to be used (drop in and go, with minimal config),
   because that is how much of the blogging software is today.  My
   concern would be that while a significant user community exists,
   what is the development community like, and how spread out is it
   amongst the different packages?

   To my knowledge, only one of the core SA developers blogs with any
   frequency, and I'm pretty sure he doesn't use one of the mainstream
   blogging software packages.  I won't speak for other developers, but
   it is likely that the expertise does not currently exist within the
   developers now.  This means that an initial developer base would
   have to be established before any significant work could be done.
   The SA development community is very small, almost too small, which
   was a concern of the board during our incubation (see board minutes
   for when we were voted out of the Incubator).  Going into this
   without a good development community is likely to lead to failure.  If
   you look at the Incubator as an example, they do not like to take
   on projects that do not already have an established set of
   committers.  I would like to see several developer types step up
   and say, "Yes, this is a good idea and I am willing to put in the
   time and effort to make it happen."  I think those developer types
   need to be familiar with one or more of the mainstream blogging
   packages and/or willing to learn.

Anyway, those are my thoughts.  Like I said, it could turn out to be
huge, but I'm not completely convinced that an SA subproject is the
way to go.

The separate spamd-like server appeals to me, because it could be used
for a good number of other things, not just blog spam.
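
To give a feel for how little a client would need to speak, here is a rough sketch of such a scoring server. It does not use the real spamd protocol: the one-JSON-object-per-line format, the port number and the link-counting scorer are all stand-ins invented for the example.

```python
import json
import socketserver


def score_text(text: str) -> float:
    """Placeholder scorer: one point per link. A real server would run rules."""
    return float(text.count("http://") + text.count("https://"))


def handle_line(line: bytes, threshold: float = 3.0) -> bytes:
    """Turn one request line into one response line."""
    try:
        text = json.loads(line)["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        return b'{"error": "bad request"}\n'
    total = score_text(text)
    reply = {"score": total, "spam": total >= threshold}
    return (json.dumps(reply) + "\n").encode()


class ScoreHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:  # one JSON request per line
            self.wfile.write(handle_line(line))


def serve(host: str = "127.0.0.1", port: int = 7830):
    """Run until interrupted; the port is an arbitrary choice."""
    with socketserver.ThreadingTCPServer((host, port), ScoreHandler) as server:
        server.serve_forever()
```

Any blogging package that can open a socket and write a line of JSON could use this, regardless of its implementation language.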

Michael


Re: New subproject, BlogSpamAssassin

Posted by Loren Wilton <lw...@earthlink.net>.
Someone a few months back already implemented a way to integrate SA with at
least one of the blog tools, and I think reported that it helped a lot.
This was just using normal SA filtering, I believe along with a modified
rule base.  I think this implementation was pre-3.0, or no later than the
early beta days.  References should be found to it in the SA talk list.  It
could make a good place to start; I believe it was released as free
software.



Re: RFC: New subproject, BlogSpamAssassin

Posted by Matthew Mullenweg <m...@mullenweg.com>.
Henry Stern wrote:
> * DNS-based URI Blocklists:  SpamAssassin has had great success using
> Jeff Chan's Spam URI Realtime Blocklists.  When an e-mail arrives,
> SpamAssassin extracts the urls contained within and performs a few DNS
> TXT queries to find whether the url has been reported in spam.  These
> blocklists can be used for weblogs too. 

The tools I know of for dealing with DNS in PHP are not particularly 
robust with regard to timeouts and such, so they sometimes introduce 
unacceptable delays into the normal operations of a website. However, 
almost all weblogging tools have some sort of XML-RPC toolkit for dealing 
with the various blogger APIs, so a web service equivalent to anything 
that's set up would be very convenient. I'm not familiar with the 
scaling issues these services might face, so I don't know if that's 
practical or not.
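
For illustration, a web service equivalent could be as small as the sketch below, written in Python rather than PHP. The method name blogspam.checkUrls, the port and the in-memory domain list are assumptions; a real service would sit in front of whatever blocklist data is actually used.

```python
from urllib.parse import urlsplit
from xmlrpc.server import SimpleXMLRPCServer

LISTED_DOMAINS = {"spam-example.com"}  # placeholder data for the example


def check_urls(urls):
    """Return the subset of the given URLs whose host is on the list."""
    listed = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in LISTED_DOMAINS):
            listed.append(url)
    return listed


def serve(host: str = "127.0.0.1", port: int = 8000):
    """Expose check_urls over XML-RPC until interrupted."""
    with SimpleXMLRPCServer((host, port), logRequests=False) as server:
        server.register_function(check_urls, "blogspam.checkUrls")
        server.serve_forever()


# A client would then call, for example:
#   import xmlrpc.client
#   proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:8000/")
#   proxy.blogspam.checkUrls(["http://www.spam-example.com/x"])
```

One call can carry every URL in a comment, which keeps the number of round trips per comment at one.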

-- 
Matt Mullenweg
  http://photomatt.net | http://wordpress.org
http://pingomatic.com | http://cnet.com