Posted to users@spamassassin.apache.org by Andy Pieters <x_...@yahoo.fr> on 2005/11/28 15:01:44 UTC

Improving sa

Hi list

I have been using spamassassin for over a year now in combination with Kmail.

First, some observations

When manually applying the filters "Mark as SPAM" or "Mark as HAM", which pipe 
the message to the command sa-learn --spam or sa-learn --ham respectively, it 
takes up to a minute to process on a PIV 4.3Ghz HT with 1Gb of RAM, which 
seems like ages.

Second, it seems that spamassassin vs spam is nothing less than an arms race, 
with spamassassin perpetually running behind.

As more and more rules are added, doesn't it come to a point where deciding if 
a message is spam or ham takes longer and longer, up to a point where 
spamassassin alone can't handle it anymore?

Lastly, I am running spamassassin 3.1 out of the box; that is, I installed the 
rpm and that's it.

What can I do to increase effectiveness of spamassassin in differentiating spam 
from ham?  Right now, there's about 10% of all messages that come in on a day 
(4,500) that are wrongly marked as ham or spam (10% is not a lot, but still 
45 messages each day!)

With kind regards

Andy

-- 
Now listening to DJ Promo - Into The Light on amaroK
Geek code: www.vlaamse-kern.com/geek
Registered Linux User No 379093
If life was for sale, what would be its price?
www.vlaamse-kern.com/sas/ for free php utilities
--

Re: Improving sa

Posted by shane mullins <ts...@wise.k12.va.us>.
hey andy,

    sa is really a great piece of software.  i would highly recommend a
book and some serious reading.  sounds like you need to fine tune sa.
and that is one of the great things about sa, you can configure it to
your needs.

shane


Re: Improving sa

Posted by Andy Pieters <x_...@yahoo.fr>.
On Monday 28 November 2005 Andy Pieters wrote

> > Right now, there's about 10% of all messages that come in on a day
> > (4,500) that are wrongly marked as ham or spam (10% is not a lot, but
> > still 45 messages each day!)

On Monday 28 November 2005 17:31, Mike Jackson wrote:

> Uh, wouldn't 10% be 450 messages?  ;)
Right, my bad; it should read 1%.

> This is my prejudice showing, but personally I would compile SA from
> scratch rather than relying on an RPM. I rarely trust that precompiled
> packages are going to contain the options I want, or exclude the options
> I'll never use (but then, I'm also a FreeBSD user, and even when you
> install something from ports it's compiled from scratch and can be
> fine-tuned). Make sure you have all the SQL tools you need and use your
> favorite database backend for Bayes and AWL. This is purely anecdotal, but
> it seems much faster on the several servers where I've implemented it than
> the older database methods. I'd also look for other bottlenecks, because
> with a 4GHz processor and 1GB RAM, SA should kick booty. Either something
> else is consuming your resources, or something's rotten in Denmark.

Cheese, for one thing, is rotten in Denmark ;) Other than that, thank you for 
your suggestions.  Looks like I'm going to have to research sa in more detail 
then, because rpm -q --requires spamassassin shows huge dependencies on perl 
but none whatsoever on SQL or other databases.

With kind regards

Andy


Re: Improving sa

Posted by Mike Jackson <mj...@barking-dog.net>.
> When manually applying the filters "Mark as SPAM" or "Mark as HAM", which
> pipe the message to the command sa-learn --spam or sa-learn --ham
> respectively, it takes up to a minute to process on a PIV 4.3Ghz HT with
> 1Gb of RAM, which seems like ages.

I've noticed that the SQL backends to Bayes and AWL are quite a bit faster. 
If you're learning on a single-message basis, you might want to 
add --no-sync to your sa-learn invocation so that it doesn't sync the 
journal and the database with every single message. Do that as a cron job on 
an appropriate schedule, like once a day.
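
A minimal sketch of that suggestion, assuming the mail client hands each message to a small Python wrapper instead of calling sa-learn directly. The function names (build_learn_command, build_sync_command, learn_message) are hypothetical and only illustrate the idea; the only sa-learn options used are --spam, --ham, --no-sync and --sync.

```python
import subprocess
from typing import List


def build_learn_command(kind: str, sync: bool = False) -> List[str]:
    """Return the sa-learn argument list for learning one message from stdin.

    kind is "spam" or "ham". With sync=False (the default here) the
    --no-sync option is added so the journal is not synced per message.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    cmd = ["sa-learn"]
    if not sync:
        cmd.append("--no-sync")
    cmd.append("--" + kind)
    return cmd


def build_sync_command() -> List[str]:
    """Return the argument list for the deferred sync (run from cron)."""
    return ["sa-learn", "--sync"]


def learn_message(raw_message: bytes, kind: str) -> int:
    """Pipe one raw message into sa-learn and return its exit status.

    Assumes sa-learn is installed and on PATH.
    """
    result = subprocess.run(build_learn_command(kind), input=raw_message)
    return result.returncode
```

The deferred sync would then be a separate scheduled job that runs the command from build_sync_command(), i.e. sa-learn --sync, on whatever schedule suits the mailbox.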

> Second, it seems that spamassassin vs spam is nothing less than an
> arms race, with spamassassin perpetually running behind.

Well, of course. Any rules are going to be reactive to what they've seen, 
not proactive. The Bayesian filter gets much closer to being an "on the fly" 
reaction to the mail you see, but it still needs a historical record to go on, 
not intuition. Anything else would end up resembling a Douglas Adams novel 
:)

> As more and more rules are added, doesn't it come to a point where
> deciding if a message is spam or ham takes longer and longer, up to a
> point where spamassassin alone can't handle it anymore?

I'm not geeky enough to formulate this in fancier words, but it seems like 
there's an upper threshold to how complicated you can make a mail message, 
therefore there should be an upper limit to the rules to identify a message 
automatically based on certain characteristics. But, there may come a time 
when the "arms race" goes thermonuclear and the only way to deal with spam 
is to nuke SMTP as we know it and formulate a new system that better deals 
with the loopholes spammers exploit to send their ads.

> Lastly, I am running spamassassin 3.1 out of the box; that is, I installed
> the rpm and that's it.
>
> What can I do to increase effectiveness of spamassassin in differentiating
> spam from ham?  Right now, there's about 10% of all messages that come in
> on a day (4,500) that are wrongly marked as ham or spam (10% is not a lot,
> but still 45 messages each day!)

Uh, wouldn't 10% be 450 messages?  ;)
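
The arithmetic behind that remark, as a quick check using only the two figures from the original post (4,500 messages a day, 45 of them misclassified); the variable names are just for illustration:

```python
daily_messages = 4500   # messages per day, from the original post
misclassified = 45      # wrongly marked messages per day, from the original post

# What 10% of the daily volume would actually be.
ten_percent = daily_messages * 10 // 100

# The rate that 45 out of 4,500 really corresponds to, in percent.
actual_rate_percent = 100 * misclassified / daily_messages

print(ten_percent)           # 450
print(actual_rate_percent)   # 1.0
```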

This is my prejudice showing, but personally I would compile SA from scratch 
rather than relying on an RPM. I rarely trust that precompiled packages are 
going to contain the options I want, or exclude the options I'll never use 
(but then, I'm also a FreeBSD user, and even when you install something from 
ports it's compiled from scratch and can be fine-tuned). Make sure you have 
all the SQL tools you need and use your favorite database backend for Bayes 
and AWL. This is purely anecdotal, but it seems much faster on the several 
servers where I've implemented it than the older database methods. I'd also 
look for other bottlenecks, because with a 4GHz processor and 1GB RAM, SA 
should kick booty. Either something else is consuming your resources, or 
something's rotten in Denmark. 


Re: Improving sa

Posted by Kai Schaetzl <ma...@conactive.com>.
Andy Pieters wrote on Mon, 28 Nov 2005 15:01:44 +0100:

> What can I do to increase effectiveness of spamassassin in differentiating spam 
> from ham?

Use SARE rules / RulesDuJour and drop as much mail as you can at the MTA level (if 
you run an MTA). Your FP rate of 10% is not acceptable at all; it should be way 
under 1%. You may want to train your Bayes. (BTW: 10% would be 450, not 45, so 
maybe you have a 1% FP rate?)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org