Posted to users@spamassassin.apache.org by Stewart Nelson <sn...@scgroup.com> on 2004/09/12 08:57:46 UTC

delivery to multiple mailboxes from single account

Hi,

I have one linux-based account on a shared server with a hosting
provider; they presently use cpanel, exim 4.42, SpamAssassin 2.64,
and imapd.  We are unhappy with the filtering performance
(> 10% false negatives, even with required_hits=5.0).

I'd like to add custom rules, but the provider won't enable
allow_user_rules, citing security concerns.  Also, he won't
be upgrading to 3.0.0 until it is released with cpanel.

It appears that installing SpamAssassin 3.0.0 in my home directory
would be a good solution.  However, we have several add-on domains,
each with several mailboxes, and I don't know a good way to deliver
the output to the proper box.  Mail for user@domain.name presently
gets delivered to
/home/myaccount/mail/domain.name/user/inbox or to
/home/myaccount/mail/domain.name/user/spam .
We also have many aliases "forwarders" that point to various boxes.
Ideally, there would be a way for each mailbox to have its own
user_prefs.
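
The delivery layout described above can be sketched as a small address-to-path mapping. This is only an illustration, assuming every mailbox follows the /home/myaccount/mail/domain.name/user/ pattern from the post; the function name and the safety checks are hypothetical, and aliases ("forwarders") would need to be resolved to a real mailbox before calling it.

```python
from pathlib import PurePosixPath

# Account root taken from the layout quoted in the post.
MAIL_ROOT = PurePosixPath("/home/myaccount/mail")


def mailbox_path(recipient, is_spam):
    """Map user@domain.name to that mailbox's inbox or spam file."""
    user, sep, domain = recipient.strip().lower().rpartition("@")
    if not sep or not user or not domain:
        raise ValueError(f"not a full address: {recipient!r}")
    # Refuse path separators so a crafted address cannot escape MAIL_ROOT.
    if "/" in user or "/" in domain or ".." in (user, domain):
        raise ValueError(f"unsafe address: {recipient!r}")
    return MAIL_ROOT / domain / user / ("spam" if is_spam else "inbox")


if __name__ == "__main__":
    print(mailbox_path("user@domain.name", is_spam=False))
    print(mailbox_path("user@domain.name", is_spam=True))
```

A delivery script would call this once per message, after the spam verdict is known, and append the message to the returned file.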

There is lots of detailed info at
http://wiki.apache.org/spamassassin/UsedViaProcmail and at
http://wiki.apache.org/spamassassin/SingleUserUnixInstall , but
I could not find any examples with multiple independent email
accounts.  Surely, hundreds of users have done this before, but
sorry, I was unable to find a solution with Google, or searching
the archives for this list.

Thanks,

Stewart

Re: delivery to multiple mailboxes from single account

Posted by jdow <jd...@earthlink.net>.

From: "Stewart Nelson" <sn...@scgroup.com>

> Hi Bob,
>
> Many thanks for taking the time to send such a detailed reply.
>
> > Your situation is similar to mine, but I'm still at SA 2.63. Last week's
> > performance stunk at 0 false positives and 20 false negatives (a rotten
> > 99.5% accuracy record; I'm not satisfied unless I hit 99.8%).
>
> That's awesome; I'd be happy with 97%, as long as there are almost no
> false positives on unicast mail.

I get about 1000 emails a day on the average. I get about 260 to 275
spams a day. In a couple of the last few weeks I received zero mis-
identified spams and maybe one mis-identified Linux Kernel Mailing
List mailings tagged as spam with low spam scores.

This last week has been hell (note the lower case "h" - call it say
minihell - where mini-me goes.) I've had as many as two missed spams
a day. And I have still misidentified a couple LKML items as spam.
Not a single one of those LKML items was 'critical' let alone important
to my needs.

It seems in the last week and a half a new generation of spam tricks
has been launched. Bayes seems to get them, though. So it's all
climbing back to sanity slowly. Those few weeks with 100% spam and
99.95% ham correctness were heaven, so much so that the random miss
is now annoying out of proportion with reality.

By the way, only 10% of these escaped spams made it all the way to
my real mailbox. I have a vicious and effective set of OE folder
filters, too. "Prodigy" is still a word that finds its way into my
spam box. So are "msn.com" and the like if none of the mailing list
filters catch the mail first. {^_-}

SARE rules, manual spam training, and patience get you there. BTW,
I don't do the black hole lists. So that may be why things are slipping
through. BigEvil is getting old. And one pattern I have been noticing
of late is that there is a domain registration proxy service out there
somewhere. If that turns up in email I figure that email should be
scored with a modest 10 or so as spam. (And from the looks of things
GoDaddy also stinks on ice lately.) I wonder how long before some
super aggressive black hole list uses such whois registrations to block
whole registrars. In my foul moods it seems like a good idea. Most of
the time rationality wins out over my foul moods, though.

Re[2]: delivery to multiple mailboxes from single account

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Stewart,

Sunday, September 12, 2004, 4:42:13 PM, you wrote:

>> Adding custom rules is among the last things you want to do. I do them,
>> and I can help you with the process (provided you can run bash scripts
>> under cron), but there are things you want to do first.

SN> I had considered running SpamAssassin from a background job, but there
SN> seemed to be a bad interaction with IMAP (see below).

I don't think you should do that anyway, since SpamAssassin is being run
automatically by your host. I'd be concerned about such a system
corrupting your emails.

>> Step 1: If False Positives are your major problem,
>> a) identify which rules are causing the false positives and lower their
>> scores, or
>> b) raise your required_hits, or
>> c) both.  I use required_hits of 9.0, and have modified the scores of
>> several dozen rules.

SN> We don't have an FP problem at all.  Mail sent by individuals almost
SN> always gets a negative score, and our users know that they need to
SN> make a whitelist entry if they don't want to miss "Sex News Daily" ;)
SN> It's the dozen spams per user per day that leak through that is our
SN> problem.

Good. That's easier to deal with. Sorry for misreading your original
email.

>> Step 2: Having done step 1, you'll increase the amount of spam that comes
>> through. Identify which distribution rules hit that spam, and raise their
>> scores enough to score the spam, without causing false positives.

SN> Well, a typical false negative shows:
SN> X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_90_100,
SN>  HTML_IMAGE_ONLY_02,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_SBL 
SN> The only difference, unfortunately, between this and much commercial
SN> ham is the SBL, but that gets too polluted with ham sources to assign
SN> it a much bigger score.

So the best solution for those is Bayes.
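
When comparing many false negatives like the one quoted above, it can help to break the X-Spam-Status value into its parts. The following is a minimal sketch, assuming the "hits=... required=... tests=..." layout shown in the quoted header; the function name is hypothetical, and "score=" is accepted as well because later releases print that instead of "hits=".

```python
import re


def parse_spam_status(header):
    """Parse an X-Spam-Status header (or just its value) into a dict."""
    # Rejoin folded lines; the fold whitespace is dropped so that the
    # comma-separated test list becomes one unbroken token.
    value = re.sub(r"\r?\n[ \t]+", "", header)
    if value.lower().startswith("x-spam-status:"):
        value = value.split(":", 1)[1]
    m = re.search(
        r"(Yes|No),\s*(?:hits|score)=(-?[\d.]+)\s+required=(-?[\d.]+)\s+tests=(\S*)",
        value,
    )
    if m is None:
        raise ValueError("unrecognised X-Spam-Status value")
    verdict, hits, required, tests = m.groups()
    return {
        "is_spam": verdict == "Yes",
        "hits": float(hits),
        "required": float(required),
        "tests": [t for t in tests.split(",") if t],
    }


if __name__ == "__main__":
    sample = (
        "X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_90_100,\n"
        " HTML_IMAGE_ONLY_02,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_SBL"
    )
    print(parse_spam_status(sample))
```

Tallying the "tests" lists over a batch of missed spam shows which rules fire most often on it, which is the information Step 2 asks for.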

>> Step 3: Bayes is your friend. Identify all email as guaranteed spam,
>> guaranteed not-spam, spam discussions, and uncertain. Feed the first two
>> into the Bayes system consistently and accurately, and that will help
>> enormously.
>> So enormously that some people will recommend doing step 3 before steps 1
>> and 2.

SN> Yes.  I made a big mistake here, naively thinking that the autolearn
SN> feature would do an adequate job.  I now suspect that the bayes_* files
SN> on my server are garbage.  Should I save and delete them before feeding
SN> the spam and ham corpora to sa-learn?  Is it necessary to run sa-learn
SN> on mail that SpamAssassin has already correctly classified?

Actually, unless you're getting spam flagged regularly as BAYES_00, or
non-spam as BAYES_99, then you don't yet have a problem. If spam is
sneaking through with BAYES_50 as above, then no, your Bayes files are
not garbage -- they just haven't learned about the questionable emails
yet.

Unless you have the 00/99 problem causing emails to be mis-classified, do
not delete your bayes files. Simply train them better.
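
The 00/99 check described above can be sketched as a simple count over hand-verified mail. This is an illustration only: the function name and the input shape (a list of rule names per message plus a human verdict) are assumptions, not part of SpamAssassin.

```python
def bayes_contradictions(messages):
    """Count messages where Bayes was confidently wrong.

    `messages` is an iterable of (tests, is_really_spam) pairs: `tests` is
    the list of rule names that hit, `is_really_spam` is the result of a
    manual check.
    """
    spam_as_00 = 0
    ham_as_99 = 0
    for tests, is_really_spam in messages:
        if is_really_spam and "BAYES_00" in tests:
            spam_as_00 += 1
        elif not is_really_spam and "BAYES_99" in tests:
            ham_as_99 += 1
    return {"spam_scored_BAYES_00": spam_as_00, "ham_scored_BAYES_99": ham_as_99}


if __name__ == "__main__":
    checked = [
        (["BAYES_50", "HTML_MESSAGE"], True),   # missed spam, Bayes undecided
        (["BAYES_00"], False),                  # ham, Bayes agrees
        (["BAYES_00", "MIME_HTML_ONLY"], True), # spam, Bayes confidently wrong
    ]
    print(bayes_contradictions(checked))
```

Counts near zero suggest the database only needs more training; regular hits in either bucket are the case where starting over may be worth considering.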

It's not necessary to run sa-learn on mail that SpamAssassin has already
auto-learned, but it doesn't hurt.

If SpamAssassin correctly classified but did not auto-learn an email,
then it's not *necessary* to sa-learn it, but it helps. The more emails
you feed to Bayes, correctly, the more correctly Bayes will be able to
score emails going forward.

I don't worry here about whether an email has been correctly or not
correctly classified, nor whether it's been auto-learned. I sa-learn
EVERY email after manual classification.

>> Step 4: Your system does allow for whitelist and blacklist entries. Maybe
>> this should be in front of step 1 also: identify from your false
>> positives those sites that can be reliably whitelisted with
>> whitelist_from_rcvd (use the _rcvd version rather than just
>> whitelist_from whenever possible). Copy William Stearns' blacklist file
>> from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
>> your user_prefs.

SN> Many thanks for this link.  I manually checked some uncaught spam against
SN> it, and found hits on about 75% !  I'll be installing this right away.
SN> However, it is IMO unfortunate that we are forced to blacklist by name.
SN> Bill Waggoner alone accounts for about 1000 domains on Mr. Stearns' list.
SN> If we could say blacklist_from_rcvd 69.42.96.0/19, one line would do the
SN> job of 1000.  More importantly, it would last a lot longer, because
SN> this A-hole got his IPs directly from ARIN and they are unlikely to change
SN> any time soon.  OTOH, he registers a dozen new domains every day!

Agreed. That's why SARE has begun using our SARE_RECV_IP_* rules. The
best of those may eventually end up in the distribution set.
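
To show why one network range can stand in for a thousand domain names, here is a minimal sketch of range matching against the bracketed IP addresses in a Received: header. It only illustrates the idea and is not how SpamAssassin or the SARE rules are implemented; the function names and the sample header are made up, and the only value taken from the thread is the 69.42.96.0/19 range.

```python
import ipaddress
import re

# The /19 quoted in the thread; further ranges would be added to this list.
BLOCKED_NETWORKS = [ipaddress.ip_network("69.42.96.0/19")]


def received_ips(received_header):
    """Return the bracketed IPv4 literals found in one Received: header."""
    found = []
    for text in re.findall(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", received_header):
        try:
            found.append(ipaddress.ip_address(text))
        except ValueError:
            pass  # looks like an address but is out of range
    return found


def relayed_via_blocked_network(received_header):
    """True if any relay IP in the header falls inside a blocked range."""
    return any(
        ip in net
        for ip in received_ips(received_header)
        for net in BLOCKED_NETWORKS
    )


if __name__ == "__main__":
    header = "from mail.example.com (unknown [69.42.100.7]) by mx.example.net"
    print(relayed_via_blocked_network(header))
```

A /19 covers 8192 addresses (here 69.42.96.0 through 69.42.127.255), so the single entry keeps matching no matter how many new domains are pointed at that space.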

>> Bayes:  Do your people retrieve their email using POP3 (in which case
>> they probably get the inbox mail only), or do they use webmail? If the
>> latter, have them create two more folders: spam and notspam. Have them
>> move all spam into the spam folder. Have them copy (not move) all
>> non-spam into the notspam folder. Have a cron job which runs sa-learn
>> against these mbox files on a regular basis (mine runs hourly), deleting
>> the mbox files when done.

SN> We don't use POP3 at all; it's mostly IMAP and occasionally webmail.
SN> The good news is that the folders you describe are easily accessible;
SN> I'll try that in the next couple of days and let you know how it works.
SN> The bad news (I think) is that when users leave their Outlook open,
SN> then new mail appears on the desktop within seconds of when it is
SN> delivered to the server.  This would prevent a cron-based task from
SN> resorting the mail properly.

But you don't want to run sa-learn on un-verified emails. You want your
users to check the emails, and you want someone to manually put the spam
into a spam folder for sa-learn, and to manually copy the not-spam into a
not-spam folder for sa-learn. Automating this without manual verification
/will/ corrupt your Bayes files.

>> No, under your setup there's no way for each mailbox to have its own
>> user_prefs; there's one user_prefs for each master domain and that's it.
>> There's also no way for each mailbox to have its own bayes database --
>> there's one bayes database for the entire master domain.

SN> I realize that this is true for my present setup.  However, I hope that
SN> the new setup won't have those restrictions.  If it's possible to run
SN> SpamAssassin via cron or whatever, it should also be possible to run
SN> a private copy that is installed in my home directory.  I hope that by
SN> determining the recipient and setting up an appropriate environment
SN> prior to invoking SpamAssassin, independent bayes and prefs will work.
SN> If not, hey, SpamAssassin is made of this amazing stuff called open source
SN> -- you can change the code and make it do what you want.  Of course,
SN> it may take more effort than the improvement in performance would justify,
SN> so I'll first see how much improvement sa-learn gives.

Several people are making progress with SQL-based user_prefs and rules;
their systems might be adaptable to yours.

Bob Menschel

Re: delivery to multiple mailboxes from single account

Posted by Stewart Nelson <sn...@scgroup.com>.

Hi Bob,

Many thanks for taking the time to send such a detailed reply.

> Your situation is similar to mine, but I'm still at SA 2.63. Last week's
> performance stunk at 0 false positives and 20 false negatives (a rotten
> 99.5% accuracy record; I'm not satisfied unless I hit 99.8%).

That's awesome; I'd be happy with 97%, as long as there are almost no
false positives on unicast mail.

> Adding custom rules is among the last things you want to do. I do them,
> and I can help you with the process (provided you can run bash scripts
> under cron), but there are things you want to do first.

I had considered running SpamAssassin from a background job, but there
seemed to be a bad interaction with IMAP (see below).

> Step 1: If False Positives are your major problem,
> a) identify which rules are causing the false positives and lower their
> scores, or
> b) raise your required_hits, or
> c) both.  I use required_hits of 9.0, and have modified the scores of
> several dozen rules.

We don't have an FP problem at all.  Mail sent by individuals almost
always gets a negative score, and our users know that they need to
make a whitelist entry if they don't want to miss "Sex News Daily" ;)
It's the dozen spams per user per day that leak through that is our
problem.
 
> Step 2: Having done step 1, you'll increase the amount of spam that comes
> through. Identify which distribution rules hit that spam, and raise their
> scores enough to score the spam, without causing false positives.

Well, a typical false negative shows:
X-Spam-Status: No, hits=3.7 required=5.0 tests=BAYES_50,HTML_90_100,
 HTML_IMAGE_ONLY_02,HTML_MESSAGE,MIME_HTML_ONLY,RCVD_IN_SBL 
The only difference, unfortunately, between this and much commercial
ham is the SBL, but that gets too polluted with ham sources to assign
it a much bigger score.

> Step 3: Bayes is your friend. Identify all email as guaranteed spam,
> guaranteed not-spam, spam discussions, and uncertain. Feed the first two
> into the Bayes system consistently and accurately, and that will help
> enormously.
> So enormously that some people will recommend doing step 3 before steps 1
> and 2.

Yes.  I made a big mistake here, naively thinking that the autolearn
feature would do an adequate job.  I now suspect that the bayes_* files
on my server are garbage.  Should I save and delete them before feeding
the spam and ham corpora to sa-learn?  Is it necessary to run sa-learn
on mail that SpamAssassin has already correctly classified?

> Step 4: Your system does allow for whitelist and blacklist entries. Maybe
> this should be in front of step 1 also: identify from your false
> positives those sites that can be reliably whitelisted with
> whitelist_from_rcvd (use the _rcvd version rather than just
> whitelist_from whenever possible). Copy William Stearns' blacklist file
> from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
> your user_prefs.

Many thanks for this link.  I manually checked some uncaught spam against
it, and found hits on about 75% !  I'll be installing this right away.
However, it is IMO unfortunate that we are forced to blacklist by name.
Bill Waggoner alone accounts for about 1000 domains on Mr. Stearns' list.
If we could say blacklist_from_rcvd 69.42.96.0/19, one line would do the
job of 1000.  More importantly, it would last a lot longer, because
this A-hole got his IPs directly from ARIN and they are unlikely to change
any time soon.  OTOH, he registers a dozen new domains every day!

> Bayes:  Do your people retrieve their email using POP3 (in which case
> they probably get the inbox mail only), or do they use webmail? If the
> latter, have them create two more folders: spam and notspam. Have them
> move all spam into the spam folder. Have them copy (not move) all
> non-spam into the notspam folder. Have a cron job which runs sa-learn
> against these mbox files on a regular basis (mine runs hourly), deleting
> the mbox files when done.

We don't use POP3 at all; it's mostly IMAP and occasionally webmail.
The good news is that the folders you describe are easily accessible;
I'll try that in the next couple of days and let you know how it works.
The bad news (I think) is that when users leave their Outlook open,
then new mail appears on the desktop within seconds of when it is
delivered to the server.  This would prevent a cron-based task from
resorting the mail properly.

> No, under your setup there's no way for each mailbox to have its own
> user_prefs; there's one user_prefs for each master domain and that's it.
> There's also no way for each mailbox to have its own bayes database --
> there's one bayes database for the entire master domain.

I realize that this is true for my present setup.  However, I hope that
the new setup won't have those restrictions.  If it's possible to run
SpamAssassin via cron or whatever, it should also be possible to run
a private copy that is installed in my home directory.  I hope that by
determining the recipient and setting up an appropriate environment
prior to invoking SpamAssassin, independent bayes and prefs will work.
If not, hey, SpamAssassin is made of this amazing stuff called open source
-- you can change the code and make it do what you want.  Of course,
it may take more effort than the improvement in performance would justify,
so I'll first see how much improvement sa-learn gives.
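
The "determine the recipient, then set up the environment" idea above might look roughly like this for the preferences file. It is a sketch under stated assumptions: the per-mailbox state directory is a hypothetical layout, the function name is made up, and it only selects a prefs file with the spamassassin -p option; keeping a separate Bayes database per mailbox is not covered here.

```python
from pathlib import PurePosixPath

# Hypothetical layout: one directory of SpamAssassin state per mailbox.
STATE_ROOT = PurePosixPath("/home/myaccount/sa-state")


def spamassassin_command(recipient):
    """Build a spamassassin call that reads the recipient's own user_prefs."""
    user, sep, domain = recipient.strip().lower().rpartition("@")
    if not sep or not user or not domain:
        raise ValueError(f"not a full address: {recipient!r}")
    # Keep crafted addresses from pointing outside STATE_ROOT.
    if "/" in user or "/" in domain or ".." in (user, domain):
        raise ValueError(f"unsafe address: {recipient!r}")
    prefs = STATE_ROOT / domain / user / "user_prefs"
    # -p selects the user preferences file for this run.
    return ["spamassassin", "-p", str(prefs)]


if __name__ == "__main__":
    print(spamassassin_command("user@domain.name"))
```

The message would be piped to the resulting command on standard input, and the filtered copy read back from standard output before delivery.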

> Once you've done the above three steps, then we can explore whether the
> method I use for implementing my own custom rules will work for you.

Thanks again,

Stewart

Re: delivery to multiple mailboxes from single account

Posted by jdow <jd...@earthlink.net>.

From: "Roger Taranto" <ro...@danybrooks.com>

> On Sun, 2004-09-12 at 12:20, Robert Menschel wrote:
> 
> 
> > Copy William Stearns' blacklist file
> > from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
> > your user_prefs.
> 
> 
> I was doing this, but this list is so large that it caused spamassassin
> to take about 20 seconds to initialize each time.  Is there a faster way
> to use this list?

Do you use spamd, Roger? If not, try it. If so - something's mal-
configured.
{^_^}

Re: delivery to multiple mailboxes from single account

Posted by Roger Taranto <ro...@danybrooks.com>.

On Sun, 2004-09-12 at 12:20, Robert Menschel wrote:

> Copy William Stearns' blacklist file
> from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
> your user_prefs.

I was doing this, but this list is so large that it caused spamassassin
to take about 20 seconds to initialize each time.  Is there a faster way
to use this list?

-Roger

Re: delivery to multiple mailboxes from single account

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Stewart,

Saturday, September 11, 2004, 11:57:46 PM, you wrote:

SN> Hi,

SN> I have one linux-based account on a shared server with a hosting
SN> provider; they presently use cpanel, exim 4.42, SpamAssassin 2.64,
SN> and imapd.  We are unhappy with the filtering performance
SN> (> 10% false negatives, even with required_hits=5.0).

Your situation is similar to mine, but I'm still at SA 2.63. Last week's
performance stunk at 0 false positives and 20 false negatives (a rotten
99.5% accuracy record; I'm not satisfied unless I hit 99.8%).
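
As a quick check of the arithmetic behind those percentages, the sketch below works backwards from the stated figures. The weekly total of about 4000 messages is not given in the post; it is only what 20 errors at 99.5% accuracy would imply, and the function names are made up for the illustration.

```python
def accuracy(total, false_positives, false_negatives):
    """Fraction of messages classified correctly."""
    return 1 - (false_positives + false_negatives) / total


def implied_total(errors, accuracy_rate):
    """Message count implied by an error count and an accuracy figure."""
    return round(errors / (1 - accuracy_rate))


if __name__ == "__main__":
    total = implied_total(20, 0.995)
    print(total, accuracy(total, 0, 20))
    # Errors allowed per week at the 99.8% target, for the same volume.
    print(round(total * (1 - 0.998)))
```

On that volume, the 99.8% target leaves room for roughly 8 misclassified messages per week instead of 20.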

SN> I'd like to add custom rules, but the provider won't enable
SN> allow_user_rules, citing security concerns.  Also, he won't
SN> be upgrading to 3.0.0 until it is released with cpanel.

Adding custom rules is among the last things you want to do. I do them,
and I can help you with the process (provided you can run bash scripts
under cron), but there are things you want to do first.

Step 1: If False Positives are your major problem,
a) identify which rules are causing the false positives and lower their
scores, or
b) raise your required_hits, or
c) both.  I use required_hits of 9.0, and have modified the scores of
several dozen rules.

Step 2: Having done step 1, you'll increase the amount of spam that comes
through. Identify which distribution rules hit that spam, and raise their
scores enough to score the spam, without causing false positives.

Step 3: Bayes is your friend. Identify all email as guaranteed spam,
guaranteed not-spam, spam discussions, and uncertain. Feed the first two
into the Bayes system consistently and accurately, and that will help
enormously.

So enormously that some people will recommend doing step 3 before steps 1
and 2.

Step 4: Your system does allow for whitelist and blacklist entries. Maybe
this should be in front of step 1 also: identify from your false
positives those sites that can be reliably whitelisted with
whitelist_from_rcvd (use the _rcvd version rather than just
whitelist_from whenever possible). Copy William Stearns' blacklist file
from http://www.stearns.org/sa-blacklist/sa-blacklist.current.cf into
your user_prefs.

Once you've done all four steps properly, you should have almost no false
positives, and a 95%-98% accuracy rate on spam.

SN> It appears that installing SpamAssassin 3.0.0 in my home directory
SN> would be a good solution.  However, we have several add-on domains,
SN> each with several mailboxes, and I don't know a good way to deliver
SN> the output to the proper box.  Mail for user@domain.name presently
SN> gets delivered to
SN> /home/myaccount/mail/domain.name/user/inbox or to
SN> /home/myaccount/mail/domain.name/user/spam .
SN> We also have many aliases "forwarders" that point to various boxes.
SN> Ideally, there would be a way for each mailbox to have its own
SN> user_prefs.

Bayes:  Do your people retrieve their email using POP3 (in which case
they probably get the inbox mail only), or do they use webmail? If the
latter, have them create two more folders: spam and notspam. Have them
move all spam into the spam folder. Have them copy (not move) all
non-spam into the notspam folder. Have a cron job which runs sa-learn
against these mbox files on a regular basis (mine runs hourly), deleting
the mbox files when done.
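
The training job described above can be sketched as a small script for cron to run. This is a minimal illustration, assuming the two folders are mbox files named spam and notspam inside one mailbox directory; the function names are hypothetical, and the only sa-learn options used are --spam, --ham and --mbox.

```python
import subprocess
from pathlib import Path


def sa_learn_command(mbox, is_spam):
    """Build the sa-learn invocation for one mbox file."""
    return ["sa-learn", "--spam" if is_spam else "--ham", "--mbox", str(mbox)]


def learn_and_clear(mailbox_dir, run=subprocess.run):
    """Train on <mailbox_dir>/spam and <mailbox_dir>/notspam, then delete them.

    `run` is injectable so the logic can be exercised without sa-learn
    installed. Returns the commands that were executed.
    """
    done = []
    for name, is_spam in (("spam", True), ("notspam", False)):
        mbox = Path(mailbox_dir) / name
        if not mbox.is_file() or mbox.stat().st_size == 0:
            continue  # nothing to learn from
        cmd = sa_learn_command(mbox, is_spam)
        run(cmd, check=True)  # raises on failure, so the file is kept
        mbox.unlink()
        done.append(cmd)
    return done


if __name__ == "__main__":
    import sys
    for directory in sys.argv[1:]:
        learn_and_clear(directory)
```

Run hourly with the mailbox directories as arguments. Note that a message dropped into a folder while sa-learn is reading it could be deleted unlearned; renaming the file before training would narrow that window.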

No, under your setup there's no way for each mailbox to have its own
user_prefs; there's one user_prefs for each master domain and that's it.
There's also no way for each mailbox to have its own bayes database --
there's one bayes database for the entire master domain.

SN> There is lots of detailed info at
SN> http://wiki.apache.org/spamassassin/UsedViaProcmail and at
SN> http://wiki.apache.org/spamassassin/SingleUserUnixInstall , but
SN> I could not find any examples with multiple independent email
SN> accounts.  Surely, hundreds of users have done this before, but
SN> sorry, I was unable to find a solution with Google, or searching
SN> the archives for this list.

Once you've done the above three steps, then we can explore whether the
method I use for implementing my own custom rules will work for you.

Bob Menschel