Posted to users@spamassassin.apache.org by Ole Kasper Olsen <ol...@opera.com> on 2006/01/11 14:35:37 UTC

Using SpamAssassin to fight comment spam?

Hi,

I am a developer on a fairly large community site (30-50,000 active users)  
with blogs, photo albums and forums.

I spent yesterday tinkering with a spam prevention system which runs each
new comment to a blog post or image in a photo album through SpamAssassin.
I take the provided comment, and assemble an RFC 822-compliant message based
on the user's IP address and the sender's and receiver's registered email
addresses, and then run it through Mail::SpamAssassin (the Perl module)
with default settings.
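
A minimal sketch of the message assembly described above, written in Python for illustration (the post itself uses the Perl module). The function name, the default subject and the Message-ID domain are hypothetical placeholders, not details from the post.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def comment_to_message(comment_text, sender_addr, recipient_addr,
                       subject="Blog comment"):
    """Wrap a web comment in a minimal mail message that a mail filter can parse."""
    msg = EmailMessage()
    msg["From"] = sender_addr        # commenter's registered address
    msg["To"] = recipient_addr       # blog or album owner's registered address
    msg["Subject"] = subject         # assumption: comments carry no subject of their own
    msg["Date"] = formatdate(localtime=False)
    # Hypothetical domain; a real site would use its own host name here.
    msg["Message-ID"] = make_msgid(domain="blog.example.invalid")
    msg.set_content(comment_text)
    return msg


# The serialized text is what would be handed to the filter.
raw = comment_to_message("Nice photo!", "alice@example.com",
                         "bob@example.com").as_string()
```

The serialized string can then be passed to whichever SpamAssassin front end is in use.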

This seems to work. At least it intercepts the test-message provided in  
the SpamAssassin documentation.

This system requires me to have a utility where people can mark spam as
ham in the case of SpamAssassin wrongly identifying a valid comment as
spam. I was planning on having this utility teach the Bayesian filter on a
community-wide basis, i.e. for all users. Therefore, people cannot mark
their own messages as ham. This is to guard against spammers teaching the
filter wrongly.
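
The rule that nobody may mark their own comment as ham could be sketched roughly as below. The function names and user-id parameters are hypothetical. The sketch assumes the comment has been saved to a file, so that sa-learn can be given a path with its --ham or --spam option.

```python
def may_mark_as_ham(marker_id, comment_author_id):
    """A user may rescue a comment only if somebody else wrote it."""
    return marker_id != comment_author_id


def sa_learn_command(message_path, as_spam):
    """Build the sa-learn command line for one stored message."""
    flag = "--spam" if as_spam else "--ham"
    return ["sa-learn", flag, message_path]


def ham_report_command(marker_id, comment_author_id, message_path):
    """Return the training command, or None when the report must be refused."""
    if not may_mark_as_ham(marker_id, comment_author_id):
        return None
    return sa_learn_command(message_path, as_spam=False)
```

The returned list is meant to be handed to something like subprocess.run; nothing is executed here.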

  - Is learning a good idea at all in this setting?
    - If so, what are the advantages and more importantly disadvantages of  
having community-wide learning?
    - Should I use autolearning?
  - Is there anything else I should be aware of when implementing  
SpamAssassin in this setting?
    - Settings
    - Thresholds
    - &c?


After testing this a bit on comments, I hope to expand to blog posts and
forum posts as well, so that moderators get a heads-up when people post
spam.

-- 
Ole Kasper Olsen
Information Systems Developer
Opera Software ASA

Re: Using SpamAssassin to fight comment spam?

Posted by jdow <jd...@earthlink.net>.
From: "jdow" <jd...@earthlink.net>

> From: "Ole Kasper Olsen" <ol...@opera.com>
> 
>> Hi,
>> 
>> I am a developer on a fairly large community site (30-50,000 active users)  
>> with blogs, photo albums and forums.
>> 
>> I spent yesterday tinkering with a spam prevention system which runs each
>> new comment to a blog post or image in a photo album through SpamAssassin.
>> I take the provided comment, and assemble an RFC 822-compliant message based
>> on the user's IP address and the sender's and receiver's registered email
>> addresses, and then run it through Mail::SpamAssassin (the Perl module)
>> with default settings.
> 
> First off I'd join the spamassassin-users list at spamassassin.apache.org.
> Then I'd post this message to the list.

Joanne sits here wibbling her lips - you've already done this step. Thought
this was a different list for some baffling reason.

{O,o}


Re: Using SpamAssassin to fight comment spam?

Posted by jdow <jd...@earthlink.net>.
From: "Ole Kasper Olsen" <ol...@opera.com>

> Hi,
> 
> I am a developer on a fairly large community site (30-50,000 active users)  
> with blogs, photo albums and forums.
> 
> I spent yesterday tinkering with a spam prevention system which runs each
> new comment to a blog post or image in a photo album through SpamAssassin.
> I take the provided comment, and assemble an RFC 822-compliant message based
> on the user's IP address and the sender's and receiver's registered email
> addresses, and then run it through Mail::SpamAssassin (the Perl module)
> with default settings.

First off I'd join the spamassassin-users list at spamassassin.apache.org.
Then I'd post this message to the list.

I think this is a good basic idea, although the tool is not really
designed for this sort of thing. I suspect you will have trouble with
ALL_TRUSTED and a few other things if you do not include proper
Received: headers that would track the "path" via which the message
was received. Since you do not know the poster's ISP's smarthost
in all cases you can end up falsely triggering a lot of rules that are
based on things like dialup addresses for the postings. I believe, but
am not sure, that this phenomenon intrudes on the Bayes operations, too.
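
One way to record the "path" is to add a Received: header naming the poster's IP address as the hop that handed the comment to the web server. This is only a sketch: the function name and host name are hypothetical, and whether SpamAssassin's Received parser interprets this exact layout as intended is an assumption that would need checking against its debug output.

```python
from email.message import EmailMessage
from email.utils import formatdate


def add_received_header(msg, client_ip, server_name):
    """Record the HTTP client as the hop that delivered the comment to the site."""
    # Layout is an assumption modelled on ordinary mail Received: lines.
    value = "from [{ip}] by {host} with HTTP; {date}".format(
        ip=client_ip, host=server_name, date=formatdate(localtime=False))
    msg["Received"] = value
    return msg


msg = EmailMessage()
msg["From"] = "alice@example.com"
msg.set_content("Nice photo!")
add_received_header(msg, "203.0.113.7", "blog.example.invalid")
```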

In the best of all possible worlds you'd need a very carefully pruned
set of rules and may end up having to manually train Bayes. (You might
want to make provisions for this in your scripting setup.) This can lead
to considerable processing time per message. SpamAssassin is hungry for
CPU cycles. (I run here with a very large number of rule sets on a 1.8GHz
Athlon system with a gigabyte of RAM. An average spam takes over 3 seconds
to get scanned. About half to three quarters of this time is CPU cycles.
This is highly dependent on rule sets chosen, of course.)

> This seems to work. At least it intercepts the test-message provided in  
> the SpamAssassin documentation.
> 
> This system requires me to have a utility where people can mark spam as
> ham in the case of SpamAssassin wrongly identifying a valid comment as
> spam. I was planning on having this utility teach the Bayesian filter on a
> community-wide basis, i.e. for all users. Therefore, people cannot mark
> their own messages as ham. This is to guard against spammers teaching the
> filter wrongly.
> 
>  - Is learning a good idea at all in this setting?

I'm shooting from the hip on this one and have a bias for carefully
manually trained Bayes, at least at first. Also in general the learning
thresholds for spam and ham both need to be adjusted carefully.

Of course, "learning" is required, not just a good idea. It's how you do
the learning that is at issue. Automated learning can be risky with bad
threshold values and inadequate initial training. Over time you could
probably move to automated training (and automated expires) safely
enough.

This leaves your potential real system poison, the auto-whitelist system.
Turn it off. You probably cannot afford its misfires. Manually whitelist
those who must be whitelisted.
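
As a rough illustration of the advice above, the helper below renders the corresponding configuration lines for a blog-specific local.cf. The option names are SpamAssassin's own. The function name and the threshold values are assumptions picked only to be wider than the stock 0.1 (ham) and 12.0 (spam), not recommendations.

```python
def render_learning_config(ham_threshold=-1.0, spam_threshold=15.0, autolearn=True):
    """Render local.cf lines for cautious Bayes autolearning with AWL disabled."""
    lines = [
        "use_bayes 1",
        "bayes_auto_learn {}".format(1 if autolearn else 0),
        # Learn as ham only below this score, as spam only above the next one.
        "bayes_auto_learn_threshold_nonspam {}".format(ham_threshold),
        "bayes_auto_learn_threshold_spam {}".format(spam_threshold),
        # Auto-whitelist switched off, per the advice above.
        "use_auto_whitelist 0",
    ]
    return "\n".join(lines) + "\n"


config_text = render_learning_config()
```

Calling it with autolearn=False gives the manual-training-only starting point.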

>    - If so, what are the advantages and more importantly disadvantages of  
> having community-wide learning?

For a blog I'd break with my other strong bias for per-user Bayes and
choose site-wide. All users should have the same "experience" in the
blogs. Otherwise you'll get a large "Hunh?" factor from people not
seeing the original post in a discussion chain.

>    - Should I use autolearning?

See above. If you use it, be very careful. Set thresholds wider than stock.
And do not even consider using auto-whitelisting.

>  - Is there anything else I should be aware of when implementing  
> SpamAssassin in this setting?
>    - Settings
>    - Thresholds
>    - &c?

Do not use the same SpamAssassin setup for both email and blog. If they
must run on the same machine, check the man pages and use alternate
configpath and siteconfigpath settings.

I'd be sure to use spamd/spamc rather than spamassassin itself. This
cuts down CPU requirements considerably. If mail runs on the same
machine run two spamd's with different pid storage and port numbers.
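
A sketch of how the web application might talk to a second, blog-only spamd through spamc. The port number passed in by the caller, the function names and the loopback host are hypothetical. The sketch relies on spamc's -c (check) mode, which prints "score/threshold" and exits with status 1 for spam.

```python
import subprocess


def spamc_check_command(port, host="127.0.0.1"):
    """Build a spamc command line aimed at the blog-specific spamd."""
    return ["spamc", "-d", host, "-p", str(port), "-c"]


def parse_check_output(text):
    """Turn 'score/threshold' into (score, threshold, is_spam)."""
    score_text, threshold_text = text.strip().split("/", 1)
    score, threshold = float(score_text), float(threshold_text)
    return score, threshold, score >= threshold


def check_comment(raw_message, port):
    """Send one serialized message to spamd and return the parsed verdict."""
    # Exit status 1 just means "spam", so the return code is not treated as an error.
    result = subprocess.run(spamc_check_command(port), input=raw_message,
                            capture_output=True, text=True)
    return parse_check_output(result.stdout)
```

check_comment needs a running spamd on the given port; the two helper functions can be exercised without one.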

> After testing this a bit on comments, I hope to expand to blog posts and
> forum posts as well, so that moderators get a heads-up when people post
> spam.

This may work. It's not what SpamAssassin is designed to do. But you may
find it a significant aid if you pervert its use to cover both blogs and
forums. In both cases sitewide rules and Bayes are required, IMAO. However,
you MAY want different learning and custom rules for SOME forums and
blogs if the machine is running multiple blogs. In that case you have an
interesting setup challenge facing you. I believe it can be done if each
"entity" for which different rules are needed is a "user" that has a
"/home/<entity>" directory into which its ".spamassassin" files can
be placed. (I'd start that process with the default shells for these users
as /bin/nologin or some such.)
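
The per-entity idea could be sketched as below: each blog or forum maps to a system user, and spamc's -u option tells spamd which user's files to use. The function names, the name pattern and the loopback host are hypothetical. The name check is an added precaution so that an entity name cannot be used to escape "/home".

```python
import posixpath
import re

# Assumption: entity names look like ordinary lowercase account names.
_ENTITY_NAME = re.compile(r"[a-z][a-z0-9_-]*")


def entity_prefs_dir(entity, home_root="/home"):
    """Return the directory holding this entity's .spamassassin files."""
    if not _ENTITY_NAME.fullmatch(entity):
        raise ValueError("unsafe entity name: {!r}".format(entity))
    return posixpath.join(home_root, entity, ".spamassassin")


def spamc_command_for_entity(entity, port):
    """Build a spamc command that asks spamd to scan as the entity's user."""
    entity_prefs_dir(entity)  # called only to reject unsafe names
    return ["spamc", "-u", entity, "-d", "127.0.0.1", "-p", str(port)]
```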

As mentioned above, spamassassin-users is probably your best shot for
some good thought and help. But you might also get the authors saying
"This can't be done!" Emphasize you have it partially working and need
to fine tune the concept, ideally without having to spawn off a special
blog version of spamassassin with a different name.

{^_^}   Joanne


RE: Using SpamAssassin to fight comment spam?

Posted by "Michele Neylon :: Blacknight Solutions" <mi...@blacknight.ie>.
Ole Kasper Olsen <ma...@opera.com> said on 11 January 2006 13:36:

> Hi,
> 
> I am a developer on a fairly large community site (30-50,000 active
> users) with blogs, photo albums and forums. 
> 
> I spent yesterday tinkering with a spam prevention system which runs
> each new comment to a blog post or image in a photo album through
> SpamAssassin.
> I take the provided comment, and assemble an RFC 822-compliant message
> based on the user's IP address and the sender's and receiver's registered
> email addresses, and then run it through Mail::SpamAssassin (the Perl
> module) with default settings.
> 
> This seems to work. At least it intercepts the test-message provided
> in the SpamAssassin documentation. 
> 
> This system requires me to have a utility where people can mark spam
> as ham in the case of SpamAssassin wrongly identifying a valid
> comment as spam. I was planning on having this utility teach the
> Bayesian filter on a community-wide basis, i.e. for all users.
> Therefore, people cannot mark their own messages as ham. This is to
> guard against spammers teaching the filter wrongly.
> 
>   - Is learning a good idea at all in this setting?
>     - If so, what are the advantages and more importantly
> disadvantages of having community-wide learning? 
>     - Should I use autolearning?
>   - Is there anything else I should be aware of when implementing
> SpamAssassin in this setting? 
>     - Settings
>     - Thresholds
>     - &c?
> 
> 
> After testing this a bit on comments, I hope to expand to blog posts
> and forum posts as well, so that moderators get a heads-up when
> people post spam.

Ole

Have you had a look at some of the existing plugins for WordPress?

Michele

Mr Michele Neylon
Blacknight Solutions
Hosting & Colocation, Brand Protection
http://www.blacknight.ie/
Tel. 1850 927 280
Intl. +353 (0) 59  9183072
UK: 0870 163 0607
Direct Dial: +353 (0)59 9183090
Fax. +353 (0) 59  9164239