Posted to users@spamassassin.apache.org by jdow <jd...@earthlink.net> on 2006/04/01 00:34:27 UTC
Re: Best Practices: SpamAssassin

From: "Ryan Kather" <RD...@Roushind.com>

>Performance... are you hunting for speed or accuracy?
>(perhaps you wrote it before and I missed it)

Accuracy is most important; speed is only as important as ensuring that messages don't
back up in the processing queue or overload the servers.

...
>After the initial setup, Bayes can live more or less its own life with
>broad enough autolearn thresholds. We do not let users submit stuff for
>training (80kusers!) but rather submit meaningful samples occasionally.

Interesting.  It would be nice to require zero user involvement in training.  Are there 
any caveats to autolearning I should be aware of?

<< jdow >> Both of the above would benefit VERY nicely if you take a
valid ham and spam corpus from your own captures and use them to perform
an initial training on Bayes before SpamAssassin is put into service.
You'll start out cleaner.
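
A minimal sketch of that initial training pass, assuming your captures are
already sorted into two directories. The directory names and the helper
functions are hypothetical; `sa-learn --ham` and `sa-learn --spam` are the
standard SpamAssassin training switches.

```python
import subprocess


def build_sa_learn_commands(ham_dir, spam_dir):
    """Return the two sa-learn command lines for an initial training pass."""
    return [
        ["sa-learn", "--ham", str(ham_dir)],
        ["sa-learn", "--spam", str(spam_dir)],
    ]


def train(ham_dir, spam_dir):
    """Run sa-learn over both corpora; raises if sa-learn exits non-zero."""
    for cmd in build_sa_learn_commands(ham_dir, spam_dir):
        subprocess.run(cmd, check=True)
```

Calling `train("corpus/ham", "corpus/spam")` as the user that owns the Bayes
database would feed both corpora before the filter goes into service.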

>We've also found that spammers are targeting common addresses such as
>info@, software@, john@, ... which were not used on some domains. So we
>transformed those into spamtraps (with LDAP's mailAcceptingGeneralId or
>mailAlternateAddress is pretty straightforward!), manually review and
>feed to an IMAP folder for autospamlearn. HAM learning is unfortunately
>underestimated and more rarely done, out of our own HAM messages.

Trap accounts are great, but I always worry they get different spam than real accounts and 
pollute the Bayesian database.  Has anyone experienced this?  Also, how does SpamAssassin 
deal with the Bayesian pollution attempts seen recently (spam emails with garbage in 
them)?

<< jdow >> Bayes is REALLY hard to poison. Items marked as spam on these
accounts are really nice Bayes food. Set up some 'easy to guess' usernames
as trap accounts, too. (Give each one a user name that violates the company
username policy. I've no such policy here. But I've used "jdow" so many
places for so many years "joanned" would be an excellent spam trap.) I
used to worry about this. But I've just been feeding Bayes with anything
that scores less than BAYES_99 that is spam that I run across in cleaning
my spam folder out. It's food. Bayes has very broad tastes. {^_-}
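
One way to automate that "anything under BAYES_99" pass is sketched below.
It assumes the messages carry the usual X-Spam-Status header with a
"tests=" field and that a human has already confirmed they are spam. The
function names are hypothetical.

```python
import re
from email import message_from_string

# Capture everything after "tests=" up to the next "key=" field or end of header.
TESTS_FIELD = re.compile(r"tests=(.*?)(?=\s+\w+=|$)", re.S)


def rules_hit(raw_message):
    """Return the set of rule names listed in the X-Spam-Status header."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = TESTS_FIELD.search(status)
    if not match:
        return set()
    # The header may be folded, so strip whitespace around each rule name.
    names = {name.strip() for name in match.group(1).split(",")}
    return {n for n in names if n and n != "none"}


def worth_feeding(raw_message):
    """True when a known-spam message did not already hit BAYES_99."""
    return "BAYES_99" not in rules_hit(raw_message)
```

Messages for which `worth_feeding` returns True are the ones to hand to
`sa-learn --spam`.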

>> use some kind of common database.  In the default configuration SA
>> uses one Bayesian database for all users.  Is there a reason to
>> change this?  What is the consensus on a shared ruleset versus
>> individual rulesets?

>If your users share common-type messages, I'd go for a common Bayes DB.
>We do have a common one for all our domains (actually one for old and
>another for new SA servers). Individual Bayes DBs get large and if they
>break you've got to troubleshoot each individually...

The individual DBs sound painful, but I'm not sure how consistent our users are.  I guess 
I will have to watch the Bayesian accuracy as it's built and make a decision later.

<< jdow >> Can be painful, indeed. But this opens the door to individualized
spam and ham directories for training. "Dump a few good ham messages into
the ham folder every month or so. They'll disappear overnight untouched by
human hands." (Of course, company usage policy should make it clear that
email CAN be monitored at any time to assure compliance with company
standards. That plus good standards go a long way towards keeping malware
out of your network. Clamd helps, too.)

<< jdow >> Also note that per user Bayes may allow you to take full
advantage of the accuracy that Bayes can generate over time. I've boosted
its BAYES_99 score almost to the spam threshold with an actual improvement
in the net SA performance.
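
As a rough illustration of "almost to the spam threshold", the sketch below
renders a `score` line for local.cf. The default threshold of 5.0 matches
SpamAssassin's stock `required_score`; the 0.5 margin and the function name
are assumptions for the example, not values from this thread.

```python
def bayes_99_score_line(required_score=5.0, margin=0.5):
    """Render a local.cf 'score' line placing BAYES_99 just under the threshold."""
    if margin <= 0:
        # A zero or negative margin would let BAYES_99 alone flag a message.
        raise ValueError("margin must be positive")
    return f"score BAYES_99 {required_score - margin:.1f}"
```

With the defaults this yields `score BAYES_99 4.5`, so a BAYES_99 hit still
needs at least one other rule to push a message over the threshold.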

>> What about an initial corpus to train the Bayesian database?  Will
>> this hurt my accuracy in the long term?  What corpuses are being
>> used?  Am I better off letting the Bayesian autolearn gradually
>> perform this function?

>You don't keep your spam, do you? :-) Train the DB with your *own*
>(company's) spam and ham corpus. It will not hurt. Don't use public
>corpuses.

It seems as if most of the recommendations advise against trying to force feed your 
Bayesian database.  I suppose there are no shortcuts if you want it to be accurate.

<< jdow >> Do *NOT* force feed SA with any public corpus. DO force feed
SA with a largish sample of YOUR COMPANY'S profile of messages, both ham
and spam.

>> SpamAssassin is typically represented as a magic dance of tweaking
>> rules.  Are the default rule thresholds good values to start at?  How
>> can I adequately decide which rules to tweak and how much to tweak
>> them by?  In other words, how do you manage your adjustments without
>> users noticing wide spam classifying variations?

>We do not adjust rules scoring. Not with SA 3.1, while we did it on SA
>2.6 Bayes scores. Since most of our traffic is non-English, this helped
>a bit.

>Default values are the most suitable for each rule.

A number of people have confirmed that SA 3.1 needs little rule weight adjustment.

<< jdow >> SARE rules do get adjusted from time to time. That is done
for you by the excellent Ninjas at the Rules Emporium. (Some beer money
would no doubt be appreciated by most of them. {^_-} But it's not in
any way expected of you. It just helps them feel wanted.)

>> Also, in regards to rules.  What is the preferred method for update?
>> Official rule releases, rulesdujour, custom?  All of the above?

>Test them and decide which apply to your case. Dunno how independent
>your current antispam solution is, with SA you need to invest some time
>to review false negatives/positives (if any) and review extra rulesets.

I am beginning to think I won't be able to select new rulesets until the system is online, 
and I have a present metric on it to go by.

<< jdow >> http://rulesemporium.com/rules.htm, please. You will like
yourself for doing it. You can make a rather good estimate of a basic
set of rules for starting out. Over time you may find you need to
incorporate some rule sets and remove others. The removal is usually
if you run into a tradeoff between machine speed and anti-spam efforts.
With a modest set of rules your initial SpamAssassin performance will
be significantly improved.

{^_^}   Joanne