Posted to users@spamassassin.apache.org by "Chris St. Pierre" <st...@NebrWesleyan.edu> on 2007/03/08 23:33:03 UTC

Make Bayes more efficient?

We're sharing our Bayesian database (MySQL) between two MX nodes and
the database server has hit a wall.  It's underpowered and is no
longer able to keep up with the I/O demands of our two MXes.  During
the day, the load average on the machine plateaus at about 5-7, and
iowait rides at about 70-90%.  Our mail queue and time-to-scan
skyrocket.  Luckily, we've been able to catch up at night thus far,
but it's still not fun having 500 messages enqueued from 8-5 daily.

Until I get our DB box replaced this summer, is there anything I can
do to make things work more efficiently?  Options I'm currently aware
of are:

1.  Stop doing Bayesian filtering.

2.  Turn off autolearn.

3.  Throw more hardware at the problem (which is the plan -- eventually...)

I ran a manual 'sa-learn --force-expire', but that didn't have any
effect.

Database sizes (in rows) are as follows:

bayes_expire:      219
bayes_global_vars: 1
bayes_seen:        670102
bayes_token:       801496
bayes_vars:        117

To give an idea of the magnitude of this problem:  it took 34 minutes
and 19 seconds to count the bayes_seen rows.

Mail volume per MX is roughly 40K messages per day, <10K of which ever
make it to SpamAssassin.

Ideas?  Thanks!

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University
----------------------------
Never send mail to thobrux@nebrwesleyan.edu



Re: Make Bayes more efficient?

Posted by Dirk Bonengel <di...@bonengel.de>.
Looks to me like you either have some really ancient machine doing the
job or your MySQL setup is somewhat barfed.
I had problems with MySQL keeping up with the input for my iXhash lists.
I switched to InnoDB (better suited for concurrent access), googled some
info from http://linuxweblog.com/node/231 and
http://mysqluc.com/presentations/mysql05/zaitsev_asplund.pdf and (most
importantly, I think!) set the option 'innodb_flush_method = O_DSYNC'.
(This runs on Debian stable.)
For comparison, I have two tables with about 1,000,000 rows each on an
HP DL360 G1 (PIII @ 1.4 GHz, 2 GB RAM, 10K SCSI disks). mtop said that
at times it handles up to 2K queries per second. That should be enough
for you too...
You don't say what your hardware is, but tuning and putting in some more
RAM should help.

Dirk

R: R: Make Bayes more efficient?

Posted by Giampaolo Tomassoni <g....@libero.it>.
> -----Original Message-----
> From: Chris St. Pierre [mailto:stpierre@NebrWesleyan.edu]
> Sent: Friday, 9 March 2007 15:30
> To: Giampaolo Tomassoni
> Cc: users@spamassassin.apache.org
> Subject: Re: R: Make Bayes more efficient?
> 
> On Fri, 9 Mar 2007, Giampaolo Tomassoni wrote:
> 
> > Sorry, my English is probably not so good to understand the correct
> > meaning of the "ever make it to SpamAssassin". If you mean that only
> > 10K messages do reach a mailbox and the others are discarded by SA,
> > you could attempt reducing the flux by adopting some greylisting
> > technique.
> 
> Actually, I meant that, of the 40K messages I receive, about 30K are
> discarded by greylisting, RBLs, HELO restrictions, etc.

Ah, you already greylist incoming messages...

Ok. Thank you for the English lesson. :)


> I can't imagine trying to scan all the mail I get. :)



Re: R: Make Bayes more efficient?

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.
On Fri, 9 Mar 2007, Giampaolo Tomassoni wrote:

> Sorry, my English is probably not so good to understand the correct meaning
> of the "ever make it to SpamAssassin". If you mean that only 10K messages do
> reach a mailbox and the others are discarded by SA, you could attempt
> reducing the flux by adopting some greylisting technique.

Actually, I meant that, of the 40K messages I receive, about 30K are
discarded by greylisting, RBLs, HELO restrictions, etc.

I can't imagine trying to scan all the mail I get. :)

Chris St. Pierre


R: Make Bayes more efficient?

Posted by Giampaolo Tomassoni <g....@libero.it>.
> -----Original Message-----
> From: Chris St. Pierre [mailto:stpierre@NebrWesleyan.edu]
> Sent: Thursday, 8 March 2007 23:33
> To: users@spamassassin.apache.org
> Subject: Make Bayes more efficient?
>
> [...]
>
> Mail volume per MX is roughly 40K messages per day, <10K of which ever
> make it to SpamAssassin.

Sorry, my English is probably not so good to understand the correct meaning
of the "ever make it to SpamAssassin". If you mean that only 10K messages do
reach a mailbox and the others are discarded by SA, you could attempt
reducing the flux by adopting some greylisting technique.

Thanks to greylisting, I greatly reduced the number of messages that
reach amavis (and thereby SA). You may not even need a hardware upgrade
if this solution works well enough for you too.

Giampaolo



Re: Make Bayes more efficient?

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.
On Sat, 17 Mar 2007, Dirk Bonengel wrote:

> Just curious, Chris: how successful were you in optimizing your
> MySQL installation?  I take it as a given that many installations
> nowadays use MySQL as a data store, so hints on optimizing MySQL would
> be a welcome addition to the wiki, I think.

I was _very_ successful in tuning MySQL.  I/O wait times have
plummeted, and our average time-to-scan has dropped from several
minutes down to 5 seconds.

Here's what I added to /etc/my.cnf:

skip-external-locking
max_connections=50
innodb_flush_method=O_DSYNC
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=1024M
innodb_log_buffer_size=8M
innodb_buffer_pool_size=3072M
innodb_additional_mem_pool_size=20M
table_cache=96
query_cache_limit=5M
query_cache_type=2

Most of these were recommendations from the MySQL tuning script
someone else suggested, plus some stuff I found online.  I'm still
tweaking some of them, but this has made all the difference.  Note
that the memory values are sized relative to the amount of RAM in
the machine, so scale them to your own hardware.

I can't remember what all of these do at the moment -- some of them
are pretty nuanced -- but you can find all of them in the MySQL
manual.  One directive -- I believe innodb_flush_log_at_trx_commit=0 --
can cause data loss if your machine crashes, but I don't really care
because this is just Bayes data.
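
For reference, here is the same list again with a short comment on
each directive.  This is only an annotated sketch based on the MySQL
manual's descriptions for 5.x-era servers, not a recommendation.  The
[mysqld] section header is an assumption about where the lines sit in
/etc/my.cnf, and the values are the ones quoted above, sized for that
particular machine.

```ini
[mysqld]
# Disable external (filesystem-level) locking between processes.
skip-external-locking
# Upper limit on simultaneous client connections.
max_connections=50
# How InnoDB opens and flushes its data and log files on Unix-like systems.
innodb_flush_method=O_DSYNC
# Write and flush the redo log about once per second instead of at every
# commit.  Per the manual, a crash can lose roughly the last second of
# transactions; acceptable here only because the data is Bayes tokens.
innodb_flush_log_at_trx_commit=0
# Size of each redo log file.  On older servers, changing this needs a
# clean shutdown first; check the manual for your version.
innodb_log_file_size=1024M
# In-memory buffer for redo log writes before they reach disk.
innodb_log_buffer_size=8M
# Cache for InnoDB data and indexes; usually the most important setting.
innodb_buffer_pool_size=3072M
# Memory for InnoDB's data dictionary and internal structures
# (removed in MySQL 5.7).
innodb_additional_mem_pool_size=20M
# Number of open tables kept cached (renamed table_open_cache in 5.1).
table_cache=96
# Do not cache query results larger than this.
query_cache_limit=5M
# 2 = DEMAND: cache only queries that ask for it with SQL_CACHE
# (the query cache was removed in MySQL 8.0).
query_cache_type=2
```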

HTH.

Chris St. Pierre

Re: Make Bayes more efficient?

Posted by Dirk Bonengel <di...@bonengel.de>.
Chris, List,

Just curious, Chris: how successful were you in optimizing your MySQL
installation?
I take it as a given that many installations nowadays use MySQL as a
data store, so hints on optimizing MySQL would be a welcome addition to
the wiki, I think.

Dirk




RE: Make Bayes more efficient?

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.
On Sat, 17 Mar 2007, Michael Scheidell wrote:

> I haven't seen this asked yet, but since you mention sa-learn
> --force-expire: are you using
> bayes_auto_expire 0 in local.cf?

No, I'm not.

> Also, did someone let you know that while InnoDB is required for
> stability in Bayes, it WILL slow down writes?

I did not know that; we are using InnoDB.  Luckily, I've been able to
tune a sufficient amount of performance out of it.

> Also, in local.cf, are you using the enhanced SQL modules?
> user_awl_dsn                 DBI:mysql:mail:localhost
> bayes_sql_dsn               DBI:mysql:mail:localhost
>
> bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
>
> And:
>
> auto_whitelist_factory          Mail::SpamAssassin::SQLBasedAddrList

I currently only have Bayes stuff in MySQL, not AWL, but yes, I am
using the enhanced modules.

> Ever think of making the secondary MX have a separate READ ONLY bayes
> DB?
> Feed it from the primary? (you don't want a secondary MX to have a
> different bayes from the primary since it will have a VERY jaded view of
> the world.  Spammers go for the secondary first)

We don't have a secondary MX.  Our MySQL database is shared between
two MX nodes of equal priority.

Chris St. Pierre


RE: Make Bayes more efficient?

Posted by Michael Scheidell <sc...@secnap.net>.
 

> -----Original Message-----
> From: Chris St. Pierre [mailto:stpierre@NebrWesleyan.edu] 
> Sent: Thursday, March 08, 2007 5:33 PM
> To: users@spamassassin.apache.org
> Subject: Make Bayes more efficient?
> 
> We're sharing our Bayesian database (MySQL) between two MX 
> nodes and the database server has hit a wall.  It's 

I haven't seen this asked yet, but since you mention sa-learn
--force-expire: are you using
bayes_auto_expire 0 in local.cf?

Also, did someone let you know that while InnoDB is required for
stability in Bayes, it WILL slow down writes?

Also, in local.cf, are you using the enhanced SQL modules?
user_awl_dsn                 DBI:mysql:mail:localhost
bayes_sql_dsn               DBI:mysql:mail:localhost

bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL

And:

auto_whitelist_factory          Mail::SpamAssassin::SQLBasedAddrList


You also said you are sharing a SQL DB for Bayes between MXes.

Ever think of making the secondary MX have a separate READ ONLY bayes
DB?
Feed it from the primary? (you don't want a secondary MX to have a
different bayes from the primary since it will have a VERY jaded view of
the world.  Spammers go for the secondary first)

-- 
Michael Scheidell, CTO
SECNAP Network Security Corporation
HackerTrap Managed IPS: www.secnap.com/services
Privacy and Security Training: www.secnap.com/training

Re: Make Bayes more efficient?

Posted by Mike Jackson <mj...@barking-dog.net>.
> Thanks for everyone's suggestions.  I've taken most of them and done
> some other tuning; I'll have to wait and see how much things have
> improved.  If they haven't improved much, I'll be back on Monday. :)

I'm a little late to the party, and this is sorta off-topic, but you may 
want to check this out:

http://www.day32.com/MySQL/

It helps quite a bit for tuning your MySQL install.

Re: Make Bayes more efficient?

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.
On Thu, 8 Mar 2007, Gary V wrote:

> Yes, if you have not tuned MySQL you should. If you have the available RAM you 
> can increase performance significantly. In some ad hoc experiments I did I 
> increased throughput by a factor of 8 simply by increasing 
> innodb_buffer_pool_size. Look for sample files such as my-medium.cnf and 
> my-large.cnf on your system for examples. If your tables are InnoDB and you 
> are using the default InnoDB settings you are likely hurting performance (but 
> saving ram).

Thanks for everyone's suggestions.  I've taken most of them and done
some other tuning; I'll have to wait and see how much things have
improved.  If they haven't improved much, I'll be back on Monday. :)

Thanks again!

Chris St. Pierre


Re: Make Bayes more efficient?

Posted by Gary V <mr...@hotmail.com>.
>
>At 14:33 08-03-2007, Chris St. Pierre wrote:
>>We're sharing our Bayesian database (MySQL) between two MX nodes and
>>the database server has hit a wall.  It's underpowered and is no
>
>What engine are you using for MySQL?  InnoDB is better for Bayes.  Did you 
>look into MySQL optimization?
>
>Regards,
>-sm
>

Yes, if you have not tuned MySQL you should. If you have the available RAM 
you can increase performance significantly. In some ad hoc experiments I did 
I increased throughput by a factor of 8 simply by increasing 
innodb_buffer_pool_size. Look for sample files such as my-medium.cnf and 
my-large.cnf on your system for examples. If your tables are InnoDB and you 
are using the default InnoDB settings you are likely hurting performance 
(but saving ram).
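
As a rough illustration of the buffer pool advice above, a my.cnf
fragment might look like the following.  The figures are assumptions
for a hypothetical dedicated database host with 4 GB of RAM, not values
from my experiments; on a box that also runs other services, leave more
headroom.

```ini
[mysqld]
# MySQL 5.0 defaults to an 8M buffer pool, which is small next to a
# Bayes token table with hundreds of thousands of rows.
# Hypothetical value: 75% of RAM on a dedicated 4 GB host.
innodb_buffer_pool_size = 3072M
```

After restarting mysqld, SHOW VARIABLES LIKE 'innodb_buffer_pool_size'
should report the new value (in bytes).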

Gary V



Re: Make Bayes more efficient?

Posted by SM <sm...@resistor.net>.
At 14:33 08-03-2007, Chris St. Pierre wrote:
>We're sharing our Bayesian database (MySQL) between two MX nodes and
>the database server has hit a wall.  It's underpowered and is no

What engine are you using for MySQL?  InnoDB is better for 
Bayes.  Did you look into MySQL optimization?

Regards,
-sm