Posted to users@spamassassin.apache.org by Andrew Donkin <ar...@waikato.ac.nz> on 2006/01/12 22:56:37 UTC

Scaling SA for 100k/day: (was Purging the Spamassassin Database)

> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting
> rather large.  We process approximately 300,000 mails a day (or
> more).

We do only a third of that, Jason, but I'm still having problems with
capacity.  I have filled spamc with debugging to help me figure out
where the problem lies, and I think it has come down to database
contention, or slow DNS, DCC, and Razor lookups.  Whatever it is, I'm
running out of spamd children.

I have two boxes running spamd --max-children 60, and in bursty times
every child is busy and spam leaks through unchecked.  We receive far
more "leaked" spam than false negatives:  5,000 out of 171,000
attempted in the past 48 hours.
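
As a rough sketch of the arithmetic behind those figures: the numbers are the ones quoted above, while the variable names and the use of Little's law to relate children to scan time are illustrative assumptions, not output from any SpamAssassin tool.

```python
# Figures quoted above: 5,000 leaked out of 171,000 attempted in 48 hours,
# on two boxes each running spamd --max-children 60.
leaked = 5_000
attempted = 171_000
hours = 48
boxes = 2
children_per_box = 60

leak_rate = leaked / attempted               # fraction delivered unchecked
avg_per_second = attempted / (hours * 3600)  # mean arrival rate; ignores bursts
total_children = boxes * children_per_box

# Little's law: mean busy children = arrival rate * mean scan time.
# Solve for the mean scan time at which every child is busy on average.
saturating_scan_seconds = total_children / avg_per_second

print(f"leak rate: {leak_rate:.1%}")
print(f"average arrivals: {avg_per_second:.2f} msg/s")
print(f"all children busy at a mean scan time of {saturating_scan_seconds:.0f} s")
```

If this estimate is reasonable, average volume alone would not exhaust the children unless scans took on the order of minutes, which points toward bursts or slow lookups rather than raw message count.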

Could Jason, and others on the list who handle a large amount of
email, report back on their setups?  It might be quite a useful
resource to have in the archives.  I don't think it has been covered
on this list before, but please set me straight if it has.

In particular I am interested in:

- how many boxes running spamd?
- how many spamd children per box (spamd --max-children)
- if Bayes is SQL, is it on the same or separate server as spamd, and
  are you replicating it to balance read load?
- spamc timeout (spamc -t)
- rbl_timeout
- bayes_expiry_max_db_size
- bayes_journal_max_size

Until this morning, when something calamitous happened to it, I was
using Berkeley for Bayes.  That is why I am interested in the load on
MySQL, since my database is on a separate box and is already handling
the per-user configs (one select per message) and the statistics (one
update per message).

With autolearning on, and the default bayes_journal_max_size, the
journal filled and was flushed every couple of minutes.  Approximately
how often should the journal flush itself?  Is there any harm in
having it happen every few minutes, or should I tune it up to an hour
or so?

> The bayes_token database is over 1.8 Gig at the moment. (Actually,
> 1.8 Gig for the data, and 1.3 Gig for the index)

Yikes.  That is the kind of thing I need to avoid!

Many, many thanks in advance.

-- 
_________________________________________________________________________
Andrew Donkin                  Waikato University, Hamilton,  New Zealand

Re: Scaling SA for 100k/day: (was Purging the Spamassassin Database)

Posted by Rick Macdougall <ri...@ummm-beer.com>.
Hi,

In-line, we do about 500K a day.

Andrew Donkin wrote:
> In particular I am interested in:
> 
> - how many boxes running spamd?

2 currently but only because we are under a bounce back joe job. 
Normally one P4 3.2 GHz with 2 GB of RAM handles the load.

> - how many spamd children per box (spamd --max-children)

10 with max connections set at 250

> - if Bayes is SQL, is it on the same or separate server as spamd, and

Bayes, same server as spamd

>   are you replicating it to balance read load?

No, our temp second box is reading from the first.

> - spamc timeout (spamc -t)

The default

> - rbl_timeout

2 seconds

> - bayes_expiry_max_db_size

Default

> - bayes_journal_max_size

Default

> 
> With autolearning on, and the default bayes_journal_max_size, the

Our auto learn is off after an initial week of training.  I now manually
add spam and ham into it.

We also run a force expiry nightly.

> 
>> The bayes_token database is over 1.8 Gig at the moment. (Actually,
>> 1.8 Gig for the data, and 1.3 Gig for the index)

207744000 Nov 23 15:47 bayes_seen.MYI
163404544 Nov 23 15:47 bayes_seen.MYD
22894592 Nov 24 01:00 bayes_token.MYI
36 Jan 12 13:00 bayes_expire.MYD
2048 Jan 12 13:00 bayes_expire.MYI
68 Jan 12 17:08 bayes_vars.MYD
31961270 Jan 12 17:08 bayes_token.MYD

This is for approx 35k users, all in a global bayes.
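
To put that listing next to the 1.8 Gig + 1.3 Gig figure quoted above, the file sizes can be totalled as below. This is only a sketch: the byte counts are copied from the listing, and the dictionary and variable names are illustrative.

```python
# Byte sizes copied from the listing above (.MYD = data, .MYI = index).
sizes = {
    "bayes_seen.MYI": 207_744_000,
    "bayes_seen.MYD": 163_404_544,
    "bayes_token.MYI": 22_894_592,
    "bayes_expire.MYD": 36,
    "bayes_expire.MYI": 2_048,
    "bayes_vars.MYD": 68,
    "bayes_token.MYD": 31_961_270,
}

total_bytes = sum(sizes.values())
token_bytes = sizes["bayes_token.MYD"] + sizes["bayes_token.MYI"]
seen_bytes = sizes["bayes_seen.MYD"] + sizes["bayes_seen.MYI"]

mb = 1_000_000  # decimal megabytes
print(f"bayes_token (data + index): {token_bytes / mb:.1f} MB")
print(f"bayes_seen  (data + index): {seen_bytes / mb:.1f} MB")
print(f"all listed files:           {total_bytes / mb:.1f} MB")
```

On these numbers the token table is a small fraction of the total, with bayes_seen accounting for most of the space.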

Regards,

Rick

Re: Scaling SA for 100k/day: (was Purging the Spamassassin Database)

Posted by "Daniel J. Cody" <dc...@uwm.edu>.
Andrew Donkin wrote:
> Could Jason, and others on the list who handle a large amount of
> email, report back on their setups?  It might be quite a useful
> resource to have in the archives.  I don't think it has been covered
> on this list before, but please set me straight if it has.

We have approx. 8 servers that handle a combined 1mil+ messages a day 
running SA 3.1.0 with mimedefang and clamd for about 50k accounts.

Each of the mail servers gets its bayes info from a shared MySQL4
database which is its own server with 4G of RAM and lots of fast disk.
If you're running a Bayesian DB of any relative size, I'd highly
recommend offloading it to another dedicated database server.

If you want more details, feel free to ask!

Daniel Cody