Posted to users@spamassassin.apache.org by Jason Frisvold <xe...@gmail.com> on 2006/01/12 14:44:39 UTC

Purging the Spamassassin Database

Hi all,

I've been investigating some recent slowness issues with our mail
servers and I noticed that the spamassassin database is getting rather
large.  We process approximately 300,000 mails a day (or more).  The
bayes_token database is over 1.8 Gig at the moment. (Actually, 1.8 Gig
for the data, and 1.3 Gig for the index)

I checked the first few entries in the bayes_token database and the
atime on those entries was from back in Jan 2005.  Is it safe to purge
this database and keep only "current" data?  (I'm not sure what the
definition of current would be for a bayes database)

How about the other databases?  Specifically, the awl database which
is approximately 350 Meg, or the bayes_seen database which is 150
Meg..

I think a lot of the slowdown I'm seeing right at this moment has to
do with spamassassin spending a good deal of time accessing the
database...


Any help would be appreciated!  Thanks!

--
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Purging the Spamassassin Database

Posted by Matthias Keller <li...@matthias-keller.ch>.
Jason Frisvold wrote:

>Hi all,
>
>I've been investigating some recent slowness issues with our mail
>servers and I noticed that the spamassassin database is getting rather
>large.  We process approximately 300,000 mails a day (or more).  The
>bayes_token database is over 1.8 Gig at the moment. (Actually, 1.8 Gig
>for the data, and 1.3 Gig for the index)
>
>I checked the first few entries in the bayes_token database and the
>atime on those entries was from back in Jan 2005.  Is it safe to purge
>this database and keep only "current" data?  (I'm not sure what the
>definition of current would be for a bayes database)
>
>How about the other databases?  Specifically, the awl database which
>is approximately 350 Meg, or the bayes_seen database which is 150
>Meg..
>
>I think a lot of the slowdown I'm seeing right at this moment has to
>do with spamassassin spending a good deal of time accessing the
>database...
>  
>
I'm by no means a Bayes specialist, but I don't think it's a good idea 
just to delete the oldest entries, since SA provides its own means of 
purging...

You might want to check your value for


Re: Gain an extra 25%! (was "Purging the Spamassassin Database")

Posted by Andrew Donkin <ar...@waikato.ac.nz>.
> ALTER TABLE bayes_vars   MODIFY id SMALLINT UNSIGNED NOT NULL;

Whoops - doing that will prevent any new users from being created in
your Bayes database.  Add auto_increment:

ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
_________________________________________________________________________
Andrew Donkin                  Waikato University, Hamilton,  New Zealand

Re: Gain an extra 25%! (was "Purging the Spamassassin Database")

Posted by "Chr. v. Stuckrad" <st...@mi.fu-berlin.de>.
On Mon, Jan 16, 2006 at 04:09:37PM +0100, M.S. Lucas wrote:
> Could this be made a default with the small size of the id columns and a 
> note in the installation file for the big users?
> There are more users of SA with less then 65k users then with more.

Does that mean '65k is the largest user number' (numerically), as with
UNIX UIDs, or really '65k different users in the database of setups and
tokens'?

The latter will be relatively rare.

Stucki

-- 
Christoph von Stuckrad      * * |nickname |<st...@mi.fu-berlin.de>  \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(days):+49 30 838-75 459|
Mathematik & Informatik EDV |\ *|if online|Tel(else):+49 30 77 39 6600|
Arnimallee 2-6/14195 Berlin * * |on IRCnet|Fax(alle):+49 30 838-75454/

Re: Gain an extra 25%! (was "Purging the Spamassassin Database")

Posted by "M.S. Lucas" <ms...@taos-it.nl>.
From: "Jason Frisvold" <xe...@gmail.com>
> On 1/15/06, Andrew Donkin <ar...@waikato.ac.nz> wrote:
> > If you have fewer than 65,000 accounts you could halve the size of the
> > id columns.
> >
> > That took 25% off my bayes_token data file and index.  Since MySQL
> > likes to keep indexes in VM, I think it was worthwhile.
>
> Excellent information, thanks!

Could the smaller id columns be made the default, with a note in the 
installation file for the big users?
There are more SA installations with fewer than 65k users than with more.

Maurice Lucas 


Re: Gain an extra 25%! (was "Purging the Spamassassin Database")

Posted by Jason Frisvold <xe...@gmail.com>.
On 1/15/06, Andrew Donkin <ar...@waikato.ac.nz> wrote:
> If you have fewer than 65,000 accounts you could halve the size of the
> id columns.
>
> That took 25% off my bayes_token data file and index.  Since MySQL
> likes to keep indexes in VM, I think it was worthwhile.

Excellent information, thanks!

> --
> Ard
>


--
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Gain an extra 25%! (was "Purging the Spamassassin Database")

Posted by Andrew Donkin <ar...@waikato.ac.nz>.
Jason Frisvold <xenophage0 at gmail> writes:

> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting rather
> large.  We process approximately 300,000 mails a day (or more).  The
> bayes_token database is over 1.8 Gig at the moment. (Actually, 1.8 Gig
> for the data, and 1.3 Gig for the index)

If you have fewer than 65,000 accounts you could halve the size of the
id columns.

Those folks *not* storing per-user Bayes statistics -- i.e. with only
one line in bayes_vars -- could take the id column down to one byte, a
TINYINT.

  ALTER TABLE bayes_token  MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_seen   MODIFY id SMALLINT UNSIGNED NOT NULL;
  ALTER TABLE bayes_vars   MODIFY id SMALLINT UNSIGNED NOT NULL;

The last three won't recover much space, but table indexes should be
kept the same type and size.

Also, try the following.  Since you are keeping Bayes per user, your
spam_count and ham_count probably don't get very high so you can take
those down to SMALLINTs.

  SELECT MAX(ham_count), MAX(spam_count) FROM bayes_token;

  ALTER TABLE bayes_token MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
                          MODIFY ham_count SMALLINT UNSIGNED NOT NULL;

That took 25% off my bayes_token data file and index.  Since MySQL
likes to keep indexes in VM, I think it was worthwhile.
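For a rough sense of where that 25% comes from, here is a back-of-envelope sketch. It assumes the stock bayes_token layout (id, spam_count, and ham_count as 4-byte INTs, a 5-byte token, a 4-byte atime); the row count below is a made-up illustration, not a figure from this thread.

```python
# Per-row savings when id, spam_count and ham_count go from INT (4 bytes)
# to SMALLINT (2 bytes). Column sizes assume the stock schema; the row
# count is purely illustrative.
INT, SMALLINT, TOKEN, ATIME = 4, 2, 5, 4

row_before = INT + TOKEN + INT + INT + ATIME              # id, token, counts, atime
row_after = SMALLINT + TOKEN + SMALLINT + SMALLINT + ATIME
saved = row_before - row_after                            # 6 bytes per row

rows = 50_000_000                                         # hypothetical token count
print(saved, "bytes/row,", round(saved / row_before * 100), "% of the row")
print(saved * rows // 2**20, "MiB saved on the data file")
```

The index shrinks for the same reason, since id is part of the primary key.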


-- 
Ard

Re: Scaling SA for 100k/day: (was Purging the Spamassassin Database)

Posted by Rick Macdougall <ri...@ummm-beer.com>.
Hi,

In-line, we do about 500K a day.

Andrew Donkin wrote:
> In particular I am interested in:
> 
> - how many boxes running spamd?

Two currently, but only because we are under a bounce-back joe job. 
Normally one P4 3.2 GHz with 2 gigs of RAM handles the load.

> - how many spamd children per box (spamd --max-children)

10 with max connections set at 250

> - if Bayes is SQL, is it on the same or separate server as spamd, and

Bayes, same server as spamd

>   are you replicating it to balance read load?

No, our temp second box is reading from the first.

> - spamc timeout (spamc -t)

The default

> - rbl_timeout

2 seconds

> - bayes_expiry_max_db_size

Default

> - bayes_journal_max_size

Default

> 
> With autolearning on, and the default bayes_journal_max_size, the

Our auto-learn is off after an initial week of training.  I now manually 
add spam and ham into it.

We also run a force expiry nightly.

> 
>> The bayes_token database is over 1.8 Gig at the moment. (Actually,
>> 1.8 Gig for the data, and 1.3 Gig for the index)

207744000 Nov 23 15:47 bayes_seen.MYI
163404544 Nov 23 15:47 bayes_seen.MYD
22894592 Nov 24 01:00 bayes_token.MYI
36 Jan 12 13:00 bayes_expire.MYD
2048 Jan 12 13:00 bayes_expire.MYI
68 Jan 12 17:08 bayes_vars.MYD
31961270 Jan 12 17:08 bayes_token.MYD

This is for approx 35k users, all in a global bayes.

Regards,

Rick

Re: Scaling SA for 100k/day: (was Purging the Spamassassin Database)

Posted by "Daniel J. Cody" <dc...@uwm.edu>.
Andrew Donkin wrote:
> Could Jason, and others on the list who handle a large amount of
> email, report back on their setups?  It might be quite a useful
> resource to have in the archives.  I don't think it has been covered
> on this list before, but please set me straight if it has.

We have approx. 8 servers that handle a combined 1mil+ messages a day 
running SA 3.1.0 with mimedefang and clamd for about 50k accounts.

Each of the mail servers gets its bayes info from a shared MySQL4 
database, which is its own server with 4G of RAM and lots of fast disk. 
If you're running a Bayesian DB of any relative size, I'd highly 
recommend offloading it to another dedicated database server.

If you want more details, feel free to ask!

Daniel Cody

Scaling SA for 100k/day: (was Purging the Spamassassin Database)

Posted by Andrew Donkin <ar...@waikato.ac.nz>.
> I've been investigating some recent slowness issues with our mail
> servers and I noticed that the spamassassin database is getting
> rather large.  We process approximately 300,000 mails a day (or
> more).

We do only a third of that, Jason, but I'm still having problems with
capacity.  I have filled spamc with debugging to help me figure out
where the problem lies, and I think it has come down to database
contention, or slow DNS, DCC, and Razor lookups.  Whatever it is, I'm
running out of spamd children.

I have two boxes running spamd --max-children 60, and in bursty times
every child is busy and spam leaks through unchecked.  We receive far
more "leaked" spam than false negatives:  5,000 out of 171,000
attempted in the past 48 hours.

Could Jason, and others on the list who handle a large amount of
email, report back on their setups?  It might be quite a useful
resource to have in the archives.  I don't think it has been covered
on this list before, but please set me straight if it has.

In particular I am interested in:

- how many boxes running spamd?
- how many spamd children per box (spamd --max-children)
- if Bayes is SQL, is it on the same or separate server as spamd, and
  are you replicating it to balance read load?
- spamc timeout (spamc -t)
- rbl_timeout
- bayes_expiry_max_db_size
- bayes_journal_max_size

Until this morning, when something calamitous happened to it, I was
using Berkeley for Bayes.  That is why I am interested in the load on
MySQL, since my database is on a separate box and is already handling
the per-user configs (one select per message) and the statistics (one
update per message).

With autolearning on, and the default bayes_journal_max_size, the
journal filled and was flushed every couple of minutes.  Approximately
how often should the journal flush itself?  Is there any harm in
having it happen every few minutes, or should I tune it up to an hour
or so?

> The bayes_token database is over 1.8 Gig at the moment. (Actually,
> 1.8 Gig for the data, and 1.3 Gig for the index)

Yikes.  That is the kind of thing I need to avoid!

Many, many thanks in advance.

-- 
_________________________________________________________________________
Andrew Donkin                  Waikato University, Hamilton,  New Zealand

Re: Purging the Spamassassin Database

Posted by jdow <jd...@earthlink.net>.
From: "Matthias Keller" <li...@matthias-keller.ch>

> If you reduce the value of the above directive and issue a
> 
> sa-learn --force-expire
> 
> it should expire all tokens not needed anymore until it reaches 
> (approximately) some value lower than  the max_db_size ....
> 
> Hope that helps

My understanding is that this process can be quite lengthy and might
affect delivery while the purge runs. This might be something to watch
out for; if there is a time of day when incoming mail is at its lightest,
I'd do it then.

{o.o}


Re: Purging the Spamassassin Database

Posted by Jason Frisvold <xe...@gmail.com>.
On 1/12/06, Michael Parker <pa...@pobox.com> wrote:
> You shouldn't check mail for non-existent users.

We don't.  :)  Old users who don't have accounts anymore, but whose
account data still exists in spamassassin.  This data is never
checked, but merely takes up space.

> The best way to clear this data is via sa-learn.
>
> sa-learn -u <nonexistent user> --clear

Ahh, excellent.  I was gonna create a huge select statement, but this
will work as well...

> You can get a list of usernames from the bayes_vars table, it is not the
> id column it is the username column.
>
> If you don't really have the ability to not check mail for non-existent
> users you might want to check out this plugin:
> http://wiki.apache.org/spamassassin/AuthzUserPlugin
>
> Michael

Thanks for the info!

--
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Purging the Spamassassin Database

Posted by Michael Parker <pa...@pobox.com>.
Jason Frisvold wrote:
> On 1/12/06, Matthias Keller <li...@matthias-keller.ch> wrote:
>> Sorry i accidently sent the previous (incomplete) message...
>>
>> I'm by no means a bayes specialist but i dont think it's a good idea
>> just to delete the oldest entries since SA provides its own mean of
>> purging...
> 
> Gotcha..  I kinda knew about it, and thought I had it enabled (I do)
> ..  Any recommendations on the number of tokens to keep?
> 
> I think I've determined part of the problem.  Non-existant users are
> still in the database!  ACK!
> 
> Given that, is it safe to walk through the database and delete all
> data for non-existant users?  It looks like the tables are linkes via
> a field called id...
> 

You shouldn't check mail for non-existent users.

The best way to clear this data is via sa-learn.

sa-learn -u <nonexistent user> --clear

You can get a list of usernames from the bayes_vars table; use the
username column, not the id column.
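A sketch of that cleanup in Python; the two lists stand in for the output of `SELECT username FROM bayes_vars` and the system account list, and the commented-out subprocess call is illustrative, not something from this thread.

```python
# Sketch: find usernames with Bayes data but no matching system account,
# then clear each one with "sa-learn -u <user> --clear".
import subprocess

def stale_users(bayes_users, system_users):
    """Usernames present in bayes_vars but absent from the system."""
    return sorted(set(bayes_users) - set(system_users))

# Stand-ins for "SELECT username FROM bayes_vars" and /etc/passwd:
bayes_users = ["alice", "bob", "carol"]
system_users = ["alice", "carol"]

for user in stale_users(bayes_users, system_users):
    print("would clear:", user)
    # Uncomment to actually clear this user's Bayes data:
    # subprocess.run(["sa-learn", "-u", user, "--clear"], check=True)
```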

If you really can't avoid checking mail for non-existent users, you
might want to check out this plugin:
http://wiki.apache.org/spamassassin/AuthzUserPlugin

Michael


Re: Purging the Spamassassin Database

Posted by Jason Frisvold <xe...@gmail.com>.
On 1/12/06, Matthias Keller <li...@matthias-keller.ch> wrote:
> Sorry i accidently sent the previous (incomplete) message...
>
> I'm by no means a bayes specialist but i dont think it's a good idea
> just to delete the oldest entries since SA provides its own mean of
> purging...

Gotcha.  I kinda knew about it, and thought I had it enabled (I do).
Any recommendations on the number of tokens to keep?

I think I've determined part of the problem.  Non-existent users are
still in the database!  ACK!

Given that, is it safe to walk through the database and delete all
data for non-existent users?  It looks like the tables are linked via
a field called id...

> Hope that helps

It does, I appreciate it..

> Matt

> ps: if you're using berkeleyDB - i've read lots of problems with big
> databases with that -- consider switching to sql

Already SQL..  :)

--
Jason 'XenoPhage' Frisvold
XenoPhage0@gmail.com

Re: Purging the Spamassassin Database

Posted by Matthias Keller <li...@matthias-keller.ch>.
Jason Frisvold wrote:

>Hi all,
>
>I've been investigating some recent slowness issues with our mail
>servers and I noticed that the spamassassin database is getting rather
>large.  We process approximately 300,000 mails a day (or more).  The
>bayes_token database is over 1.8 Gig at the moment. (Actually, 1.8 Gig
>for the data, and 1.3 Gig for the index)
>
>I checked the first few entries in the bayes_token database and the
>atime on those entries was from back in Jan 2005.  Is it safe to purge
>this database and keep only "current" data?  (I'm not sure what the
>definition of current would be for a bayes database)
>
>How about the other databases?  Specifically, the awl database which
>is approximately 350 Meg, or the bayes_seen database which is 150
>Meg..
>
>I think a lot of the slowdown I'm seeing right at this moment has to
>do with spamassassin spending a good deal of time accessing the
>database...
>  
>
Sorry, I accidentally sent the previous (incomplete) message...

I'm by no means a Bayes specialist, but I don't think it's a good idea
just to delete the oldest entries, since SA provides its own means of
purging...

You might want to check your value for

bayes_expiry_max_db_size

and

sa-learn --dump magic

should give you the current number of tokens

If you reduce the value of the above directive and issue a

sa-learn --force-expire

it should expire all tokens that are no longer needed, until the database
reaches (approximately) some value lower than the max_db_size.
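To make that concrete, the relevant knobs in local.cf look roughly like this (values are illustrative; 150,000 tokens is the stock default as far as I know):

```
# local.cf -- illustrative values, not a recommendation
bayes_expiry_max_db_size 150000   # token cap; lower it, then force-expire
bayes_auto_expire 1               # default: let SA expire opportunistically
```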

Hope that helps

Matt
PS: if you're using BerkeleyDB (I've read about lots of problems with big 
databases there), consider switching to SQL.