Posted to users@spamassassin.apache.org by Andy Pieters <x_...@yahoo.fr> on 2005/11/28 15:01:44 UTC
Improving sa
Hi list
I have been using SpamAssassin for over a year now in combination with KMail.
First, some observations.
When I manually apply the filters "Mark as SPAM" or "Mark as HAM", which pipe
the message to sa-learn --spam or sa-learn --ham respectively, each message
takes up to a minute to process on a P4 4.3 GHz HT with 1 GB of RAM, which
seems like ages.
Second, it seems that SpamAssassin vs. spam is nothing less than an arms race,
with SpamAssassin perpetually running behind.
As more and more rules are added, won't deciding whether a message is spam or
ham take longer and longer, up to the point where SpamAssassin alone can't
handle it anymore?
Lastly, I am running SpamAssassin 3.1 out of the box: I installed the rpm and
that's it.
What can I do to make SpamAssassin more effective at differentiating spam
from ham? Right now, about 10% of all messages that come in on a day
(4,500) are unjustly marked as ham or spam (10% is not a lot, but still
45 messages each day!)
With kind regards
Andy
--
Now listening to DJ Promo - Into The Light on amaroK
Geek code: www.vlaamse-kern.com/geek
Registered Linux User No 379093
If life was for sale, what would be its price?
www.vlaamse-kern.com/sas/ for free php utilities
--
Re: Improving sa
Posted by shane mullins <ts...@wise.k12.va.us>.
hey andy,
sa is really a great piece of software. i would highly recommend a
book and some serious reading. sounds like you need to fine tune sa.
and that is one of the great things about sa, you can configure it to
your needs.
shane
----- Original Message -----
From: "Andy Pieters" <x_...@yahoo.fr>
To: <us...@spamassassin.apache.org>
Sent: Monday, November 28, 2005 9:01 AM
Subject: Improving sa
Re: Improving sa
Posted by Andy Pieters <x_...@yahoo.fr>.
On Monday 28 November 2005 Andy Pieters wrote:
> > Right now, there's about 10% of all messages that come in on a day
> > (4,500) that are unjustly marked as ham or spam (10% is not a lot, but
> > still 45 messages each day!)
On Monday 28 November 2005 17:31, Mike Jackson wrote:
> Uh, wouldn't 10% be 450 messages? ;)
Right, my bad. That should read 1%.
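The corrected figure is easy to check with shell integer arithmetic. This is only a sketch; the helper name misclassified_per_day is made up for illustration and the daily volume of 4500 is the one quoted in the thread:

```shell
#!/bin/sh
# Hypothetical helper: messages misclassified per day, given a daily
# volume ($1) and an error rate in whole percent ($2).
misclassified_per_day() {
    echo $(( $1 * $2 / 100 ))
}

misclassified_per_day 4500 10   # prints 450
misclassified_per_day 4500 1    # prints 45
```

So 45 messages a day corresponds to an error rate of 1%, not 10%.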
> This is my prejudice showing, but personally I would compile SA from
> scratch rather than relying on an RPM. I rarely trust that precompiled
> packages are going to contain the options I want, or exclude the options
> I'll never use (but then, I'm also a FreeBSD user, and even when you
> install something from ports it's compiled from scratch and can be
> fine-tuned). Make sure you have all the SQL tools you need and use your
> favorite database backend for Bayes and AWL. This is purely anecdotal, but
> it seems much faster on the several servers where I've implemented it than
> the older database methods. I'd also look for other bottlenecks, because
> with a 4GHz processor and 1GB RAM, SA should kick booty. Either something
> else is consuming your resources, or something's rotten in Denmark.
Cheese, for one thing, is rotten in Denmark ;) Other than that, thank you for
your suggestions. Looks like I'm going to have to research SA in more detail,
because rpm -q --requires spamassassin shows huge dependencies on perl
but none whatsoever on SQL or other databases.
With kind regards
Andy
Re: Improving sa
Posted by Mike Jackson <mj...@barking-dog.net>.
> When manually applying the filters "Mark as SPAM" or "Mark as HAM", which
> pipe the message to the command sa-learn --spam or sa-learn --ham
> respectively, it takes up to a minute to process on a PIV 4.3Ghz HT with
> 1Gb of RAM, which seems like ages.
I've noticed that the SQL backends to Bayes and AWL are quite a bit faster.
If you're learning on a single-message basis, you might want to
add --no-sync to your sa-learn invocation so that it doesn't sync the
journal and the database with every single message. Then run the sync
(sa-learn --sync) as a cron job on an appropriate schedule, like once a day.
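As a rough sketch of that setup: the wrapper below learns one message without syncing, and a separate function does the sync. The function names and the SA_LEARN variable are assumptions made for this example; only the --no-sync, --spam, --ham and --sync options come from sa-learn itself.

```shell
#!/bin/sh
# SA_LEARN is an assumption of this sketch: it lets you point at a
# different sa-learn binary (or at a stub for a dry run).
SA_LEARN=${SA_LEARN:-sa-learn}

# Learn one message from stdin as "spam" or "ham" without syncing
# the journal to the Bayes database.
learn_one() {
    "$SA_LEARN" --no-sync "--$1"
}

# Sync the journal into the database in one go.
sync_bayes() {
    "$SA_LEARN" --sync
}

# A wrapper script used as a mail client filter action could end with:
#   learn_one "$1"        # where $1 is spam or ham
# A crontab entry like this one (the 03:15 time is arbitrary) would
# then do the sync once a day:
#   15 3 * * * sa-learn --sync
```

Whether this brings the per-message time down noticeably will depend on how much of the minute is spent syncing and how much on starting Perl and loading the rules.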
> Second, it seems that spamassassin vs spam is nothing less than an
> arms-race, with spamassassin perpetually running behind.
Well, of course. Any rules are going to be reactive to what they've seen,
not proactive. The Bayesian filter gets much closer to being an "on the fly"
reaction to the mail you see, but it still needs a historical record to go on,
not intuition. Anything else would end up resembling a Douglas Adams novel
:)
> As more and more rules are added, doesn't it come to a point where
> deciding if a message is spam or ham takes longer and longer or up to a
> point where spamassassin alone can't handle it anymore?
I'm not geeky enough to formulate this in fancier words, but it seems like
there's an upper threshold to how complicated you can make a mail message,
therefore there should be an upper limit to the rules to identify a message
automatically based on certain characteristics. But, there may come a time
when the "arms race" goes thermonuclear and the only way to deal with spam
is to nuke SMTP as we know it and formulate a new system that better deals
with the loopholes spammers exploit to send their ads.
> Lastly, I am running spamassassin 3.1 out of the box, that is installed
> the rpm and that's it.
>
> What can I do to increase effectiveness of spamassassin in differentiating
> spam from ham? Right now, there's about 10% of all messages that come in
> on a day (4,500) that are unjustly marked as ham or spam (10% is not a
> lot, but still 45 messages each day!)
Uh, wouldn't 10% be 450 messages? ;)
This is my prejudice showing, but personally I would compile SA from scratch
rather than relying on an RPM. I rarely trust that precompiled packages are
going to contain the options I want, or exclude the options I'll never use
(but then, I'm also a FreeBSD user, and even when you install something from
ports it's compiled from scratch and can be fine-tuned). Make sure you have
all the SQL tools you need and use your favorite database backend for Bayes
and AWL. This is purely anecdotal, but it seems much faster on the several
servers where I've implemented it than the older database methods. I'd also
look for other bottlenecks, because with a 4GHz processor and 1GB RAM, SA
should kick booty. Either something else is consuming your resources, or
something's rotten in Denmark.
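For reference, a sketch of what pointing Bayes and AWL at MySQL might look like. The database name, user name and password are placeholders, and the local.cf path in the comment is only a common location, not necessarily yours. The tables themselves have to be created first from the schema files shipped in the sql/ directory of the SpamAssassin distribution.

```shell
#!/bin/sh
# Print a local.cf fragment that points Bayes and AWL at MySQL.
# Database name, user and password below are placeholders.
print_sql_backend_conf() {
    cat <<'EOF'
bayes_store_module     Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn          DBI:mysql:spamassassin:localhost
bayes_sql_username     sa_user
bayes_sql_password     change_me

auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn           DBI:mysql:spamassassin:localhost
user_awl_sql_username  sa_user
user_awl_sql_password  change_me
EOF
}

# Review the output first, then append it to your local.cf, e.g.:
#   print_sql_backend_conf >> /etc/mail/spamassassin/local.cf
```

This also needs the Perl DBI module and the matching DBD driver installed, which a stock package may not pull in.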
Re: Improving sa
Posted by Kai Schaetzl <ma...@conactive.com>.
Andy Pieters wrote on Mon, 28 Nov 2005 15:01:44 +0100:
> What can I do to increase effectiveness of spamassassin in differentiating spam
> from ham?
Use SARE rules / RulesDuJour and drop as much mail as you can at the MTA level
(if you run an MTA). Your FP rate of 10% is not acceptable at all; it should be
way under 1%. You may want to train your Bayes. (BTW: 10% would be 450, not 45,
so maybe you have a 1% FP rate?)
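A minimal sketch of bulk Bayes training, assuming you have hand-sorted spam and ham available as mbox files. The function names, the SA_LEARN variable and the example paths are made up for illustration; --spam, --ham, --mbox and --dump magic are sa-learn options. Note that by default the Bayes classifier is not used at all until it has learned at least 200 spam and 200 ham messages.

```shell
#!/bin/sh
# SA_LEARN is an assumption of this sketch: it lets you point at a
# different sa-learn binary (or at a stub for a dry run).
SA_LEARN=${SA_LEARN:-sa-learn}

# Train from two mbox files of hand-sorted mail:
# $1 holds spam, $2 holds ham.
train_bayes() {
    "$SA_LEARN" --spam --mbox "$1"
    "$SA_LEARN" --ham --mbox "$2"
}

# Show the database counters; the nspam and nham lines tell you
# how many messages of each kind have been learned so far.
bayes_stats() {
    "$SA_LEARN" --dump magic
}

# Example (hypothetical paths):
#   train_bayes ~/Mail/spam.mbox ~/Mail/ham.mbox && bayes_stats
```

If your mail client stores folders as maildir rather than mbox, drop --mbox and pass the directory instead.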
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org