Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2004/09/13 02:02:22 UTC
Re[2]: delivery to multiple mailboxes from single account

Hello Stewart,

Sunday, September 12, 2004, 4:42:13 PM, you wrote:

>> Adding custom rules is among the last things you want to do. I do them,
>> and I can help you with the process (provided you can run bash scripts
>> under cron), but there are things you want to do first.

SN> I had considered running SpamAssassin from a background job, but there
SN> seemed to be a bad interaction with IMAP (see below).

I don't think you should do that anyway, since SpamAssassin is being run
automatically by your host. I'd be concerned about such a system
corrupting your emails.

>> Step 1: If False Positives are your major problem,
>> a) identify which rules are causing the false positives and lower their
>> scores, or
>> b) raise your required_hits, or
>> c) both.  I use required_hits of 9.0, and have modified the scores of
>> several dozen rules.

SN> We don't have an FP problem at all.  Mail sent by individuals almost
SN> always gets a negative score, and our users know that they need to
SN> make a whitelist entry if they don't want to miss "Sex News Daily" ;)
SN> It's the dozen spams per user per day that leak through that is our
SN> problem.

Good. That's easier to deal with. Sorry for misreading your original
email.
 
>> Step 2: Having done step 1, you'll increase the amount of spam that comes
>> through. Identify which distribution rules hit that spam, and raise their
>> scores enough to score the spam, without causing false positives.

SN> Well, a typical false negative shows:
SN> X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_90_100,
SN>  HTML_IMAGE_ONLY_02,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_SBL 
SN> The only difference, unfortunately, between this and much commercial
SN> ham is the SBL, but that gets too polluted with ham sources to assign
SN> it a much bigger score.

So the best solution for those is Bayes.

>> Step 3: Bayes is your friend. Identify all email as guaranteed spam,
>> guaranteed not-spam, spam discussions, and uncertain. Feed the first two
>> into the Bayes system consistently and accurately, and that will help
>> enormously.
>> So enormously that some people will recommend doing step 3 before steps 1
>> and 2.

SN> Yes.  I made a big mistake here, naively thinking that the autolearn
SN> feature would do an adequate job.  I now suspect that the bayes_* files
SN> on my server are garbage.  Should I save and delete them before feeding
SN> the spam and ham corpora to sa-learn?  Is it necessary to run sa-learn
SN> on mail that SpamAssassin has already correctly classified?

Actually, unless you're getting spam flagged regularly as BAYES_00, or
non-spam as BAYES_99, then you don't yet have a problem. If spam is
sneaking through with BAYES_50 as above, then no, your Bayes files are
not garbage -- they just haven't learned about the questionable emails
yet.

Unless you have the 00/99 problem causing emails to be mis-classified, do
not delete your bayes files. Simply train them better.
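
The 00/99 check can be sketched as a simple count over manually verified mail. This is an illustration only: the function name and input shape are hypothetical, and how many misfires count as "regularly" is a judgment call that the code does not make.

```python
from typing import Dict, Iterable, List, Tuple


def bayes_misfires(messages: Iterable[Tuple[bool, List[str]]]) -> Dict[str, int]:
    """Count Bayes verdicts that contradict a manual classification.

    Each item is (is_spam, tests): is_spam is the human-verified verdict,
    tests is the list of rule names SpamAssassin reported for the message.
    """
    items = list(messages)
    return {
        # Verified spam that Bayes considered almost certainly ham.
        "spam_as_BAYES_00": sum(
            1 for is_spam, tests in items if is_spam and "BAYES_00" in tests
        ),
        # Verified ham that Bayes considered almost certainly spam.
        "ham_as_BAYES_99": sum(
            1 for is_spam, tests in items if not is_spam and "BAYES_99" in tests
        ),
    }
```

A leaked spam scoring BAYES_50, like the one quoted earlier, adds to neither count, which matches the point that such a database is undertrained rather than corrupt.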

It's not necessary to run sa-learn on mail that SpamAssassin has already
auto-learned, but it doesn't hurt.

If SpamAssassin correctly classified but did not auto-learn an email,
then it's not *necessary* to sa-learn it, but it helps. The more
correctly classified emails you feed to Bayes, the more accurately Bayes
will be able to score emails going forward.

I don't worry here about whether an email has been correctly or
incorrectly classified, nor whether it's been auto-learned. I sa-learn
EVERY email after manual classification.

>> Step 4: Your system does allow for whitelist and blacklist entries. Maybe
>> this should be in front of step 1 also: identify from your false
>> positives those sites that can be reliably whitelisted with
>> whitelist_from_rcvd (use the _rcvd version rather than just
>> whitelist_from whenever possible). Copy William Stearns' blacklist file
>> from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
>> your user_prefs.

SN> Many thanks for this link.  I manually checked some uncaught spam against
SN> it, and found hits on about 75% !  I'll be installing this right away.
SN> However, it is IMO unfortunate that we are forced to blacklist by name.
SN> Bill Waggoner alone accounts for about 1000 domains on Mr. Stearns' list.
SN> If we could say blacklist_from_rcvd 69.42.96.0/19, one line would do the
SN> job of 1000.  More importantly, it would last a lot longer, because
SN> this A-hole got his IPs directly from ARIN and they are unlikely to change
SN> any time soon.  OTOH, he registers a dozen new domains every day!

Agreed. That's why SARE has begun using our SARE_RECV_IP_* rules. The
best of those may eventually end up in the distribution set.
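
To illustrate why a single network block can stand in for a long list of domains, here is a small sketch using Python's standard ipaddress module. It only shows what the 69.42.96.0/19 range from the quoted text covers; it is not how the SARE_RECV_IP_* rules are written, and the function name is made up.

```python
import ipaddress

# The block mentioned in the quoted text above.
BLOCK = ipaddress.ip_network("69.42.96.0/19")


def in_block(addr: str) -> bool:
    """Return True if the relay address falls inside BLOCK."""
    return ipaddress.ip_address(addr) in BLOCK


# A /19 leaves 13 host bits, so it spans 2**13 addresses.
first, last = BLOCK[0], BLOCK[-1]
size = BLOCK.num_addresses
```

Here first and last are the two ends of the range, so any relay between them matches regardless of which domain name it is using that day.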

>> Bayes:  Do your people retrieve their email using POP3 (in which case
>> they probably get the inbox mail only), or do they use webmail? If the
>> latter, have them create two more folders: spam and notspam. Have them
>> move all spam into the spam folder. Have them copy (not move) all
>> non-spam into the notspam folder. Have a cron job which runs sa-learn
>> against these mbox files on a regular basis (mine runs hourly), deleting
>> the mbox files when done.

SN> We don't use POP3 at all; it's mostly IMAP and occasionally webmail.
SN> The good news is that the folders you describe are easily accessible;
SN> I'll try that in the next couple of days and let you know how it works.
SN> The bad news (I think) is that when users leave their Outlook open,
SN> then new mail appears on the desktop within seconds of when it is
SN> delivered to the server.  This would prevent a cron-based task from
SN> re-sorting the mail properly.

But you don't want to run sa-learn on un-verified emails. You want your
users to check the emails, and you want someone to manually put the spam
into a spam folder for sa-learn, and to manually copy the not-spam into a
not-spam folder for sa-learn. Automating this without manual verification
/will/ corrupt your Bayes files.
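
A periodic training job over the manually sorted folders might look roughly like the sketch below. It assumes the two folders are stored as mbox files named spam and notspam under one mail directory, which may not match a given IMAP server's layout. The function names and paths are hypothetical; --spam, --ham and --mbox are the standard sa-learn options.

```python
import subprocess
from pathlib import Path
from typing import List


def sa_learn_commands(mail_dir: str) -> List[List[str]]:
    """Build one sa-learn command per non-empty, hand-sorted mbox file."""
    folders = {
        "--spam": Path(mail_dir) / "spam",    # assumed folder name
        "--ham": Path(mail_dir) / "notspam",  # assumed folder name
    }
    return [
        ["sa-learn", flag, "--mbox", str(path)]
        for flag, path in folders.items()
        if path.is_file() and path.stat().st_size > 0
    ]


def learn(mail_dir: str) -> None:
    """Run the commands; meant to be called from cron, e.g. hourly."""
    for cmd in sa_learn_commands(mail_dir):
        subprocess.run(cmd, check=True)
```

The job only touches folders that a person has filled by hand, so the inbox is never learned from directly. Emptying the folders after a successful run is left out here.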

>> No, under your setup there's no way for each mailbox to have its own
>> user_prefs; there's one user_prefs for each master domain and that's it.
>> There's also no way for each mailbox to have its own bayes database --
>> there's one bayes database for the entire master domain.

SN> I realize that this is true for my present setup.  However, I hope that
SN> the new setup won't have those restrictions.  If it's possible to run
SN> SpamAssassin via cron or whatever, it should also be possible to run
SN> a private copy that is installed in my home directory.  I hope that by
SN> determining the recipient and setting up an appropriate environment
SN> prior to invoking SpamAssassin, independent bayes and prefs will work.
SN> If not, hey, SpamAssassin is made of this amazing stuff called open source
SN> -- you can change the code and make it do what you want.  Of course,
SN> it may take more effort than the improvement in performance would justify,
SN> so I'll first see how much improvement sa-learn gives.

Several people are making progress with SQL-based user_prefs and rules;
their systems might be adaptable to yours.

Bob Menschel