Posted to users@spamassassin.apache.org by Jason Frisvold <xe...@gmail.com> on 2006/11/27 23:01:40 UTC

Bayes - Optimizing the database

Greetings,

After struggling a bit with Bayes in general and trying to figure out
a way to make things run a bit faster, I've done some serious digging
and I want to clarify a few things before I make a mess of my Bayes
DB...

I have everything currently set up to use a MySQL database.  The
bayes_token table is about 3GB in size and tends to be the slowest
link in the system.  (AWL isn't too far behind, but I think I have a
viable strategy for dealing with that monster.)

First, some quick assumptions.  Please correct me if I'm wrong.

All of the bayes_ tables are directly related via the id field.
bayes_token contains the actual tokens for Bayesian processing, and
bayes_seen contains the message IDs of messages Bayes has already
processed for tokens, presumably to reduce CPU usage?  I *think*
bayes_vars merely contains the magic data used by Bayes, and I have no
idea what bayes_expire is for.  Am I correct thus far?

Now, given that, I can directly map my users to an entry in bayes_vars
and identify their "id".  With that, I can purge non-existent users
from the system.  Simple enough.
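
A rough sketch of that purge, to make the idea concrete.  It assumes
bayes_vars has id and username columns and that every other bayes_
table carries the same id, which is my reading of the schema and not
something confirmed here.  The helper names (find_orphaned_ids,
purge_ids) are made up for illustration, and sqlite3 stands in for
MySQL only so the sketch runs on its own; a MySQL driver would
typically use %s placeholders instead of ?.

```python
import sqlite3

# Child tables first, bayes_vars last, so the user row is removed only
# after the data that refers to its id. Table list is an assumption
# based on the tables named in this post.
BAYES_TABLES = ("bayes_token", "bayes_seen", "bayes_expire", "bayes_vars")


def find_orphaned_ids(conn, valid_users):
    """Return the ids in bayes_vars whose username is not a known user."""
    rows = conn.execute("SELECT id, username FROM bayes_vars").fetchall()
    return sorted(row_id for row_id, username in rows
                  if username not in valid_users)


def purge_ids(conn, ids):
    """Delete every row belonging to the given ids from all bayes_ tables."""
    params = [(row_id,) for row_id in ids]
    for table in BAYES_TABLES:
        # Table names come from the fixed tuple above, never from input.
        conn.executemany(f"DELETE FROM {table} WHERE id = ?", params)
    conn.commit()
```

Running find_orphaned_ids on its own first and eyeballing the result
before calling purge_ids seems like the safer order, since the delete
cannot be undone without a backup.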

Now, for other users, can I trust the last_expire field in bayes_vars
and formulate something to force-expire at periodic intervals based on
that value?  I realize that spamc/spamd already expire when necessary,
but I think I'd rather run this on a nightly basis during off-peak
hours, and serialize it so that only a single user is being expired at
a time.  Is that a reasonable move to reduce overall CPU usage on the
system?
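
The nightly job could look something like the sketch below.  It
assumes last_expire is a Unix timestamp and that (username,
last_expire) pairs have already been read from bayes_vars; both are
assumptions on my part.  It shells out to "sa-learn --force-expire -u
<user>" one user at a time, and the function names are hypothetical.
The run parameter exists only so the loop can be exercised without
actually invoking sa-learn.

```python
import subprocess


def users_due_for_expiry(rows, now, max_age_seconds):
    """Pick users whose last expire is at least max_age_seconds old.

    rows is an iterable of (username, last_expire) pairs, with
    last_expire assumed to be a Unix timestamp. The result is ordered
    so the user who has gone longest without an expire comes first.
    """
    due = [(last_expire, username) for username, last_expire in rows
           if now - last_expire >= max_age_seconds]
    return [username for _, username in sorted(due)]


def expire_command(username):
    """Build the sa-learn command line for a single user."""
    return ["sa-learn", "--force-expire", "-u", username]


def expire_serially(usernames, run=subprocess.run):
    """Expire one user at a time; return the users whose run failed."""
    failed = []
    for username in usernames:
        # Each call blocks until it finishes, so only one expire runs
        # at any moment.
        result = run(expire_command(username))
        if result.returncode != 0:
            failed.append(username)
    return failed
```

From cron this would be called with now=int(time.time()) and whatever
max_age_seconds suits the site, for example 86400 for a daily pass.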

Thanks!

-- 
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com