Posted to users@spamassassin.apache.org by Ben Poliakoff <be...@reed.edu> on 2005/02/24 01:45:54 UTC

highly available sitewide bayes, local db vs. sql

What sort of experiences have people had managing a sitewide bayes db
that is used by spamassassin (spamd|amavisd) instances on multiple
machines?  I've got an environment with spamassassin/amavisd-new running
in parallel on a pool of two (but possibly more in the future) equally
weighted machines.  How have you avoided the dreaded Single Point of
Failure?

I've been experimenting (on a small scale) with an SQL-backed bayes db.
I can readily have multiple machines talk to a single MySQL instance, but
then I'm stuck trying to make that MySQL instance "highly available"
(and I *could* do that on an existing "clustered" server).

I could also have an instance of MySQL running on all of the machines,
with one master MySQL instance replicating to one or more MySQL slave
instances.  I've never set up MySQL replication (but it can't be much
harder than OpenLDAP replication!).  In such a setup I'd only enable
autolearning on the machine with the master MySQL db.
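
A minimal sketch of how the per-machine settings for that replicated
setup might be generated. The option names (bayes_store_module,
bayes_sql_dsn, bayes_sql_username, bayes_sql_password, bayes_auto_learn)
are standard SpamAssassin settings; the database name, user and password
are placeholders, and pointing each node at its own local MySQL instance
is an assumption. Whether a scanning-only node still issues writes (for
example token access-time updates) is worth checking before relying on
this.

```python
def bayes_sql_config(db_host, is_master, db_name="bayes",
                     user="sa_user", password="changeme"):
    """Render local.cf lines for one node of the replicated setup.

    db_name, user and password are hypothetical placeholders.
    """
    lines = [
        "bayes_store_module Mail::SpamAssassin::BayesStore::SQL",
        f"bayes_sql_dsn DBI:mysql:{db_name}:{db_host}",
        f"bayes_sql_username {user}",
        f"bayes_sql_password {password}",
        # Autolearning is enabled only on the node holding the master db.
        f"bayes_auto_learn {1 if is_master else 0}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bayes_sql_config("localhost", is_master=True))
    print(bayes_sql_config("localhost", is_master=False))
```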

I could also ditch the idea of using a MySQL-backed bayes and simply
rsync the bayes db files from the master to the slaves on a regular basis
(stopping and starting spamd|amavisd in the process).  In such an
environment I'd do training only on one "master" machine and enable
autolearning only on that machine.
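
The rsync approach could be sketched roughly as follows, run on each
slave. The file names bayes_toks and bayes_seen are what the default
DB_File backend uses; the host name, directories and the stop/start
commands are site-specific assumptions passed in by the caller. This
illustrates the order of operations and is not a tested deployment
script.

```python
import subprocess

# Files written by the default DB_File bayes backend.
BAYES_FILES = ("bayes_toks", "bayes_seen")


def sync_plan(master, remote_dir, local_dir, stop_cmd, start_cmd):
    """Return the commands to run, in order: stop, copy each file, start."""
    plan = [list(stop_cmd)]
    for name in BAYES_FILES:
        plan.append(["rsync", "-a",
                     f"{master}:{remote_dir}/{name}", f"{local_dir}/"])
    plan.append(list(start_cmd))
    return plan


def run_plan(plan):
    """Run the plan, restarting the scanner even if a copy fails."""
    stop, *copies, start = plan
    subprocess.run(stop, check=True)
    try:
        for cmd in copies:
            subprocess.run(cmd, check=True)
    finally:
        subprocess.run(start, check=True)


# Example (hypothetical host, paths and init script):
# run_plan(sync_plan("mx1", "/var/lib/bayes", "/var/lib/bayes",
#                    ["/etc/init.d/spamd", "stop"],
#                    ["/etc/init.d/spamd", "start"]))
```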

How are other people addressing this issue?

Ben

Re: highly available sitewide bayes, local db vs. sql

Posted by Paolo Cravero as2594 <pc...@as2594.net>.

Ben Poliakoff wrote:

Hi Ben

> What sort of experiences have people had managing a sitewide bayes db
> that is used by spamassassin (spamd|amavisd) instances on multiple
> machines?  I've got an environment with spamassassin/amavisd-new running
> in parallel on a pool of two (but possibly more in the future) equally
> weighted machines.  How have you avoided the dreaded Single Point of
> Failure?

We run two load-balanced servers with SA here. Each machine has its
own local Bayes and AWL DBs (no SPoF). Given the amount of incoming
traffic (100k msgs/server/workday) we can be statistically confident
that both servers see the same (spam) messages.

We have not noticed any difference in filtering efficiency between the
two instances in over 12 months.

Having two DBs also has one advantage: if Bayes on one machine gets
corrupted (wrong training, ...) you can restore it from the twin server
with a simple FTP transfer. We have done this at least once.

What does need to be done periodically is purging/resetting the AWL DB,
since it keeps growing and growing...

We were considering a MySQL DB on a third machine (with failover on the
other two), but the loss of Bayes history is not such a big issue IMHO.
A nightly backup is probably enough, as long as you have another
machine on which to restore the DB a few hours after a failure. In any
case, a good ham/spam collection will re-train your Bayesian filter in a
matter of minutes.
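
A rough sketch of the nightly backup and the two recovery paths
mentioned above (restore a dump, or re-train from a ham/spam
collection). It relies on sa-learn's --backup, --restore, --spam and
--ham options; the dump path and corpus directories are hypothetical,
and how the backup gets scheduled (cron or otherwise) is left open.

```python
import subprocess


def backup_bayes(dump_path):
    """Write a dump of the bayes db; sa-learn --backup prints to stdout."""
    with open(dump_path, "wb") as out:
        subprocess.run(["sa-learn", "--backup"], stdout=out, check=True)


def recovery_commands(dump_path=None, spam_dir=None, ham_dir=None):
    """Build the sa-learn commands for a restore and/or a re-train."""
    cmds = []
    if dump_path:
        cmds.append(["sa-learn", "--restore", dump_path])
    if spam_dir:
        cmds.append(["sa-learn", "--spam", spam_dir])
    if ham_dir:
        cmds.append(["sa-learn", "--ham", ham_dir])
    return cmds


# Example (hypothetical paths):
# backup_bayes("/var/backups/bayes.dump")
# for cmd in recovery_commands(spam_dir="/corpus/spam", ham_dir="/corpus/ham"):
#     subprocess.run(cmd, check=True)
```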

Our third machine will probably run a local mirror of SURBL, instead!

HTH,
Paolo