Posted to users@spamassassin.apache.org by "Gerald V. Livingston II" <ge...@sysmatrix.net> on 2005/04/08 03:58:55 UTC

BAYES...sitewide or per-user or not at all?

We are a small ISP. Our primary domain mail server currently has a little
over 6000 addresses, and inbound volume is between 100K and 200K messages/day,
with probably 95%+ of that being spam.

I'm going through the wiki now and should be able to have SA running on the
new test box soon with per-user preferences on virtual domains.

My question is, should I set up BAYES at all? I'm fairly certain domain
level BAYES would be a bad thing with our demographic. We have people with
family, friends, or business partners in APNIC countries, we have customers
who frequent spam havens (online porn gatherers), we have ultra religious
customers, and we have middle of the road customers.

I'm afraid domain-wide bayes would produce many FPs for the first two
groups or many FNs for the last two -- or the database would just stay
hosed up with customers shoving conflicting spam and ham into the learning
folders.

I'm not sure how resource efficient per-user BAYES would be. Will it kill
the machine as the user base grows or the spam volume increases?

For the first few weeks this system will be running everything on a single
machine. When it's operational and the customers have been moved from the
old server I will turn that server into a gateway box to handle the
scanning duties and leave the IMAP/WebMail and the MySQL database on the
main box.

Who's running a successful large user base/high volume site with per-user
BAYES?

Gerald

Re[2]: BAYES...sitewide or per-user or not at all?

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Gerald,

Saturday, April 9, 2005, 5:10:02 PM, you wrote:

GVLI> I'm looking at what scores I'll be able to let my users modify directly. If
GVLI> they can drop the bayes scores some for individual users it might not be so
GVLI> bad. I'm trying really hard not to ostracize any specific groups of people
GVLI> though. Our userbase leans MUCH more heavily to the "non-porn-hound" type
GVLI> (families and businesses) so that's what has me concerned about site-wide
GVLI> or domain-wide bayes.

Is there a generic ISP or email system whose userbase leans much more
to the adult than to the general audience?  My email host's customer
base includes several of the former, but they're drowned out by the
more common type of customer, and they don't have problems with
system-wide bayes.

GVLI> sa-learn -- anyone have a way to stat() all the SPAM folders and run
GVLI> sa-learn only on those that have new messages added by customers? I could
GVLI> find them using 'find' by searching on the mod date but I'd have to have
GVLI> some way for sa-learn to know the username to run as.

The method I've used is to
a) see if the missed-spam folder or not-spam folder has any contents.
If not, skip to the next user.
b) move the contents of that folder to a work folder.
c) learn from the work folder.
d) move on to the next user.

That way there are no old messages to worry about.

Make sure the users know to "copy" mails to the not-spam folder rather
than move them, if they want to keep the originals.
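
A minimal sketch of steps a) through d), assuming a Maildir-style layout of
<mail_root>/<user>/.missed-spam/{cur,new} and .not-spam/{cur,new}. The folder
names, the directory layout, and the idea that the directory name is the
Bayes username are all assumptions for illustration. The only sa-learn
options used are --spam, --ham, and -u (which overrides the username for
virtual-user setups).

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical folder names mapped to the sa-learn mode used for each.
LEARN_FOLDERS = {".missed-spam": "--spam", ".not-spam": "--ham"}


def stage_messages(user_dir, work_root):
    """Steps a) and b): move pending mail into a per-user work folder."""
    staged = []
    for folder, flag in LEARN_FOLDERS.items():
        pending = []
        for sub in ("cur", "new"):
            box = user_dir / folder / sub
            if box.is_dir():
                pending.extend(p for p in box.iterdir() if p.is_file())
        if not pending:
            continue  # a) nothing to learn from this folder
        work = work_root / user_dir.name / folder.lstrip(".")
        work.mkdir(parents=True, exist_ok=True)
        for msg in pending:
            shutil.move(str(msg), str(work / msg.name))  # b)
        staged.append((flag, work))
    return staged


def learn_all(mail_root, work_root, runner=subprocess.check_call):
    """Steps c) and d): learn from each work folder, then go to the next user."""
    for user_dir in sorted(p for p in mail_root.iterdir() if p.is_dir()):
        for flag, work in stage_messages(user_dir, work_root):
            # Assumption: the directory name is the username Bayes should use.
            runner(["sa-learn", "-u", user_dir.name, flag, str(work)])
            shutil.rmtree(work)  # learned copies are discarded
```

Because the messages are moved before learning, sa-learn is only ever started
for users who actually put something in a learning folder, and nothing is
learned twice. The runner argument is there so the loop can be tried with a
stand-in function before it is pointed at a live mail store.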

Bob Menschel

Re: BAYES...sitewide or per-user or not at all?

Posted by "Gerald V. Livingston II" <ge...@sysmatrix.net>.

On Sat, 9 Apr 2005 17:34:45 -0700 Joshua Tinnin wrote:

> On Saturday 09 April 2005 17:10, "Gerald V. Livingston II"
> <ge...@sysmatrix.net> wrote:

> > sa-learn -- anyone have a way to stat() all the SPAM folders and run 
> > sa-learn only on those that have new messages added by customers? I 
> > could find them using 'find' by searching on the mod date but I'd have
> > to have some way for sa-learn to know the username to run as.
> 
> Run it from each user's crontab.

Each user will not HAVE a crontab. This is a full virtual setup. No entries
in /etc/passwd, no local accounts at all.

Plus, I don't want to run it for every address on the system. Why run
sa-learn 7000 times if only 2000 users are actually moving spam to be
learned into the proper folders?
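
One way to narrow the run to the users who have actually filed something,
sketched under the assumption that mail is stored as Maildir under
<mail_root>/<user>/ and that the learning folder is called .Spam (both names
are hypothetical). Adding a message to a Maildir changes the modification
time of its cur or new directory, so comparing that against the time of the
last run gives the list of users worth learning from.

```python
from pathlib import Path


def users_with_new_mail(mail_root, since, folder=".Spam"):
    """Return users whose learning folder changed after `since` (epoch seconds)."""
    users = []
    for user_dir in sorted(p for p in mail_root.iterdir() if p.is_dir()):
        for sub in ("cur", "new"):
            box = user_dir / folder / sub
            if box.is_dir() and box.stat().st_mtime > since:
                users.append(user_dir.name)
                break  # one changed subdirectory is enough
    return users
```

The directory name that comes back can then be handed to sa-learn with -u, so
no local account or crontab is needed. The check can over-select, since
removing or renaming a message also touches the directory, but that only
costs an unnecessary sa-learn run.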

Gerald

Re: BAYES...sitewide or per-user or not at all?

Posted by Joshua Tinnin <kr...@spymac.com>.

On Saturday 09 April 2005 17:10, "Gerald V. Livingston II" <ge...@sysmatrix.net> wrote:
> sa-learn -- anyone have a way to stat() all the SPAM folders and run
> sa-learn only on those that have new messages added by customers? I could
> find them using 'find' by searching on the mod date but I'd have to have
> some way for sa-learn to know the username to run as.

Run it from each user's crontab.

- jt

Re: BAYES...sitewide or per-user or not at all?

Posted by "Gerald V. Livingston II" <ge...@sysmatrix.net>.

Thanks Bob,

On Fri, 8 Apr 2005 17:24:05 -0700 Robert Menschel wrote:

> Hello Gerald,
> 
> Thursday, April 7, 2005, 6:58:55 PM, you wrote:
> 
> GVLI> I'm afraid domain wide bayes would show up as many FPs for the
> GVLI> first two groups or many FNs for the last two -- or the database
> 
> It balances out.  Granny puts the porn into her spam box, and Ginger
> puts a graphic discussion of last night's wet dream into her ham box.
> Over time bayes learns which mails everyone thinks is spam, which
> mails everyone thinks is ham, and which mails are undeterminable.

I guess I need to read more on how bayes works.

I'm looking at what scores I'll be able to let my users modify directly. If
they can drop the bayes scores some for individual users it might not be so
bad. I'm trying really hard not to ostracize any specific groups of people
though. Our userbase leans MUCH more heavily to the "non-porn-hound" type
(families and businesses) so that's what has me concerned about site-wide
or domain-wide bayes.

> GVLI> I'm not sure how resource efficient per-user BAYES would be. Will
> GVLI> it kill the machine as the user base grows or the spam volume increases?
> 
> per-user Bayes lookups aren't bad -- don't worry about them. The
> question revolves around per-user Bayes database storage (do you have
> enough disk space), and how you manage the sa-learn process.

sa-learn -- anyone have a way to stat() all the SPAM folders and run
sa-learn only on those that have new messages added by customers? I could
find them using 'find' by searching on the mod date but I'd have to have
some way for sa-learn to know the username to run as.

Space I'm not worried about. The machine I'm building "everything" on now
has 250Gig of storage (2*250G drives in RAID1) and will be the primary
location for user mail stores and the SMTP/IMAP/POP3 server for customers
(IMAP only for the webmail interfaces, not direct). At 20M per mailbox, 7000
addresses use only 140G even if every customer stops using POP3 and lets their
online storage fill to max capacity.
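
The worst-case figure can be checked with a few lines. This assumes decimal
units (1G = 1000M) and treats the mirrored pair as 250G of usable space, as
stated above.

```python
MAILBOX_QUOTA_MB = 20   # per-mailbox limit
ADDRESSES = 7000        # planned number of addresses
DISK_GB = 250           # usable space on the RAID1 pair

# Every mailbox filled to its quota, converted from MB to GB.
worst_case_gb = MAILBOX_QUOTA_MB * ADDRESSES / 1000
headroom_gb = DISK_GB - worst_case_gb

print(f"worst case: {worst_case_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
# prints "worst case: 140 GB, headroom: 110 GB"
```

The remaining 110G is what is left for anything else kept on that machine.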

When I can take the other server down I will be moving scanning duties to a
dedicated gateway that will have 2*250G drives in RAID0 striping for speed
rather than data redundancy.

So, now I have to decide where to put the database(s) and how to split them
up. I'm thinking a single database with all required user information would
be best (login, SA prefs, Maildir info, everything) from a configuration
point of view. I'd be able to point all config items to a single database
and relate the tables within.

I'm worried about resources though. Will a single machine striping across 2
spindles be able to handle the I/O in a timely fashion? Should I put the
database(s) on the customer mail machine and just waste the extra space
available on the gateway drives? Should I split the system into multiple
databases with duplicate data for identification (bayes with username +
login with username and storage info), with one database on the gateway and
another on the mail server?

I'm also trying hard to work out how to let users modify just about anything
in their SA settings, just as if they had a login account and could create
their own user_prefs file -- except this is all going to be virtual with no
home directories -- all in MySQL. I have one customer who doesn't want
ANYTHING from overseas. No APNIC, RIPE, etc. He has some very carefully
crafted Perl regex filters on the current mail server that mostly do the
trick for him, and he's going to lose the ability to use those
when we move the server. I want him to be able to pinpoint RBL filters that
score on origination point and bump up those that come from the countries
he definitely wants blocked.
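
A rough sketch of what per-user score overrides could look like when stored in
a database. It assumes a three-column table of username, preference, and
value, which to my understanding is the general shape SpamAssassin's SQL
preference support expects. How a score override is split between the
preference and value columns is an assumption here, and the rule name
LOCAL_RELAY_OVERSEAS is hypothetical. sqlite3 stands in for MySQL only so the
example is self-contained.

```python
import sqlite3


def make_store():
    """Create an in-memory stand-in for the preferences table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE userpref ("
        " username TEXT NOT NULL,"
        " preference TEXT NOT NULL,"
        " value TEXT NOT NULL)"
    )
    return db


def set_pref(db, username, preference, value):
    """Store one preference row for one user."""
    db.execute(
        "INSERT INTO userpref (username, preference, value) VALUES (?, ?, ?)",
        (username, preference, value),
    )


def render_prefs(db, username):
    """Return the user's rows as config-style lines, in insertion order."""
    rows = db.execute(
        "SELECT preference, value FROM userpref"
        " WHERE username = ? ORDER BY rowid",
        (username,),
    )
    return [f"{pref} {val}" for pref, val in rows]


db = make_store()
# Hypothetical rule name; the customer raises its score for his mailbox only.
set_pref(db, "customer@example.com", "score", "LOCAL_RELAY_OVERSEAS 4.0")
print(render_prefs(db, "customer@example.com"))
```

With rows keyed by username, one customer can push a location-based rule up
without affecting anyone else's scores.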

I still have to yank Dovecot off the server and go with Courier. Courier is
more resource intensive but Dovecot isn't quite "ready for prime time" yet.
I need full quota support and it's not there in the stable versions. The
test versions tend to break something every time they fix something else.
I'll probably move back to it when it settles out.

Gerald

Re: BAYES...sitewide or per-user or not at all?

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Gerald,

Thursday, April 7, 2005, 6:58:55 PM, you wrote:

GVLI> My question is, should I set up BAYES at all?

Yes. User-specific if you can do it, domain-level or site-wide
otherwise.

GVLI> I'm fairly certain domain level BAYES would be a bad thing with
GVLI> our demographic. We have people with family, friends, or
GVLI> business partners in APNIC countries, we have customers who
GVLI> frequent spam havens (online porn gatherers), we have ultra
GVLI> religious customers, and we have middle of the road customers.

The email host for my family domain has a similar mix of customers, and
runs site-wide bayes.

GVLI> I'm afraid domain wide bayes would show up as many FPs for the
GVLI> first two groups or many FNs for the last two -- or the database
GVLI> would just stay hosed up with customers shoving conflicting spam
GVLI> and ham into the learning folders.

It balances out.  Granny puts the porn into her spam box, and Ginger
puts a graphic discussion of last night's wet dream into her ham box.
Over time bayes learns which mails everyone thinks are spam, which
mails everyone thinks are ham, and which mails are indeterminate.

GVLI> I'm not sure how resource efficient per-user BAYES would be. Will it kill
GVLI> the machine as the user base grows or the spam volume increases?

per-user Bayes lookups aren't bad -- don't worry about them. The
question revolves around per-user Bayes database storage (do you have
enough disk space), and how you manage the sa-learn process.

Bob Menschel