Posted to users@spamassassin.apache.org by email builder <em...@yahoo.com> on 2005/11/04 00:24:55 UTC

HUGE bayes DB (non-sitewide) advice?

Hi all,

  I'm wondering if anyone out there hosts a large number of users with
per-USER bayes (in MySQL)?  Our user base is varied enough that we do not
feel bayes would be effective if done site-wide.  Some people like their
spammy newsletters, some are geeks who would deeply resent someone training
newsletters to be ham.

  As a result of this, however, we are currently burdened with an 8GB (yes,
you read that right) bayes database (more than 20K users having mail
delivered).  We went to InnoDB when we upgraded to 3.1 per the upgrade doc's
recommendation, so things are also a bit slower.  Watching mytop, almost all
the activity we see is bayes inserts, which is not surprising, and is
probably why we get so much iowait trying to keep writing to an 8GB
tablespace...
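
For context, our wiring is essentially the stock SQL Bayes setup
(the hostname and credentials below are placeholders, not our real values):

    # local.cf -- per-user Bayes in MySQL (placeholder values)
    bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
    bayes_sql_dsn       DBI:mysql:bayes:dbhost.example.net
    bayes_sql_username  bayes
    bayes_sql_password  secret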

  Oh, and we let bayes do its token cleanup on the spot (the
bayes_auto_expire setting) rather than at night, since a small lag in
delivery is acceptable; figuring out how to run an absolutely huge cleanup
by cron every night in this scenario seems like it'd really kill the DB
(and we'd have to run sa-learn once for every single user, right... ugh).
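
Roughly, the two approaches look like this (a sketch only; the cron line
is illustrative, and running it once per user is exactly the part that
seems unworkable at our scale):

    # local.cf -- what we do now: expire tokens inline during scans
    bayes_auto_expire 1

    # the alternative: disable inline expiry...
    bayes_auto_expire 0
    # ...and force an expiry run per user from cron, something like:
    #   sa-learn --force-expire -u user@example.com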

  We've tuned the InnoDB some, but performance is still not all that good --
is there anyone out there who runs a system like this?  

  * What kinds of MySQL tuning are people using to help cope?
  * Are there any SA settings to help alleviate performance problems?
  * If we want to walk away from per-user bayes, is the only option to go
site-wide?  What other options are there?




		

Re: HUGE bayes DB (non-sitewide) advice?

Posted by Michael Parker <pa...@pobox.com>.
email builder wrote:
> Well, I know there have to be some admins out there who have a lot of users
> and do not use sitewide bayes...... RIGHT?  See original email snippet at
> bottom.

I believe there are a few people running Bayes in a similar configuration.
  It certainly is a tough problem.  I believe your tuning ideas are on
the mark, and they are certainly outside the scope of what SpamAssassin is
doing, unless of course there is something in the code that we could do
more efficiently.  For the tuning itself, I highly encourage you to read
the MySQL website; there is lots of great documentation there, and there
are several books that can help with tuning.  Don't be afraid to ask
the MySQL community for help; I'm sure they would gladly offer tuning
advice.  The key to just about any database tuning, once you've
exhausted all the various config params, is going to be hardware.  More
memory and more spindles will do wonders.

The other way to look at this (as someone else mentioned, I believe) is
the possibility that in the long run it just won't be possible to run
efficiently with such a large database.  In that case you can create a
custom storage module (the API is documented) that can handle it.  Of
course, since it's open source you can always do this yourself, or it is
possible that a few amongst us would be willing to contract to come up
with a storage module that better fits your needs.

The main thing is that these are all good discussions, and I myself
highly encourage them.  It is very hard to test these types of
deployments in development, so anytime there is one it is a learning
experience, at least for me.

Michael

Re: HUGE bayes DB (non-sitewide) advice?

Posted by email builder <em...@yahoo.com>.
> > 
> > I guess the relevant point for this thread is that I don't necessarily
> > think that this is the silver bullet as implied.  Even if you use a
> > high-availability clustering technology that can mirror writes and
> > reads, you are STILL dealing with the possibility of a database that is
> > just massive.  Processing this size of database will still be disk-bound
> > unless you have an unheard-of amount of memory; I don't think there's
> > any reason to think that clustering the problem will make it go away.
> > 
> > So I still wonder if anyone has any musings on my earlier questions?
> 
> A few SpamAssassin hacks could help.
> 1. Have multiple MySQL servers: split your users into A-J, K-S, T-Z or
> smaller units and distribute them over different servers, with some HA /
> failover mechanism (possibly drbd).
> 2. Have two levels of Bayes, one large global and the other smaller per
> user, if that's possible.  Of course SA would need to be changed to use
> both Bayes databases.  This way you could have two large servers for the
> global Bayes db and two for the per-user Bayes dbs.
> 
> Also see if this SQL failover patch can help you in any way.
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2197

Thanks for the good thoughts.  It sounds like the ultimate answer is that not
many people are using per-user Bayes, at least at this scale, and that any
"solutions" are yet to be realized in practice.  I don't think we have the
resources or time to contribute any SA patches, but the food for thought is
very much appreciated!
 
> Finally to speed up the database have a look at this, the people at 
> wikimedia / livejournal seem to be happy using it.
> http://www.danga.com/memcached/

That's very cool.  I'll *definitely* be keeping this one in mind.



	
		

Re: HUGE bayes DB (non-sitewide) advice?

Posted by Dhawal Doshy <dh...@netmagicsolutions.com>.
email builder wrote:
>>>In-memory storage:
>>>All data stored in each data node is kept in memory on the node's
>>>host computer. For each data node in the cluster, you must have
>>>available an amount of RAM equal to the size of the database times
>>>the number of replicas,
>>
>>This refers to the first line: "In-memory storage". Of course you can't 
>>do that with 160GB DBs. You can still cluster - look at DRBD 
>>http://www.drbd.org/
> 
> 
> I guess the relevant point for this thread is that I don't necessarily think
> that this is the silver bullet as implied.  Even if you use a
> high-availability clustering technology that can mirror writes and reads, you
> are STILL dealing with the possibility of a database that is just massive. 
> Processing this size of database will still be disk-bound unless you have an
> unheard-of amount of memory; I don't think there's any reason to think that
> clustering the problem will make it go away.
> 
> So I still wonder if anyone has any musings on my earlier questions?

A few SpamAssassin hacks could help.
1. Have multiple MySQL servers: split your users into A-J, K-S, T-Z or
smaller units and distribute them over different servers, with some HA /
failover mechanism (possibly drbd).  (A minimal config sketch follows
this list.)
2. Have two levels of Bayes, one large global and the other smaller per
user, if that's possible.  Of course SA would need to be changed to use
both Bayes databases.  This way you could have two large servers for the
global Bayes db and two for the per-user Bayes dbs.
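
To sketch option 1 (the hostnames and the A-J split are made up;
bayes_sql_dsn is global per SpamAssassin instance, so each shard would be
a separate MX / SA install pointing at its own database):

    # local.cf on the MX serving users a-j
    bayes_sql_dsn  DBI:mysql:bayes:db-aj.example.net

    # local.cf on the MX serving users k-s
    bayes_sql_dsn  DBI:mysql:bayes:db-ks.example.net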

Also see if this SQL failover patch can help you in any way.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2197

Finally, to speed up the database, have a look at this; the people at
wikimedia / livejournal seem to be happy using it.
http://www.danga.com/memcached/

Hope that helps,
- dhawal

Re: HUGE bayes DB (non-sitewide) advice?

Posted by email builder <em...@yahoo.com>.
> > In-memory storage:
> > All data stored in each data node is kept in memory on the node's
> > host computer. For each data node in the cluster, you must have
> > available an amount of RAM equal to the size of the database times
> > the number of replicas,
> 
> This refers to the first line: "In-memory storage". Of course you can't 
> do that with 160GB DBs. You can still cluster - look at DRBD 
> http://www.drbd.org/

I guess the relevant point for this thread is that I don't necessarily think
that this is the silver bullet as implied.  Even if you use a
high-availability clustering technology that can mirror writes and reads, you
are STILL dealing with the possibility of a database that is just massive. 
Processing this size of database will still be disk-bound unless you have an
unheard-of amount of memory; I don't think there's any reason to think that
clustering the problem will make it go away.

So I still wonder if anyone has any musings on my earlier questions?



		

Re: HUGE bayes DB (non-sitewide) advice?

Posted by Michael Monnerie <m....@zmi.at>.
On Tuesday, 8 November 2005 03:38, email builder wrote:
> In-memory storage:
> All data stored in each data node is kept in memory on the node's
> host computer. For each data node in the cluster, you must have
> available an amount of RAM equal to the size of the database times
> the number of replicas,

This refers to the first line: "In-memory storage". Of course you can't 
do that with 160GB DBs. You can still cluster - look at DRBD 
http://www.drbd.org/

mfg zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at           Tel: 0660/4156531          Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net                 Key-ID: 0x70545879

Re: HUGE bayes DB (non-sitewide) advice?

Posted by email builder <em...@yahoo.com>.
> > Well, I know there have to be some admins out there who have a lot of
> > users and do not use sitewide bayes...... RIGHT?  See original email
> > snippet at bottom.
> 
> <snip>
> 
> > * Other ideas:
> >     - increase system memory as much as possible
> >     - per-domain Bayes instead of per-user???
> 
> This might be our 2nd best choice (unless there is a good
> bayes_expiry_max_db_size solution), but I don't see anything in the manual
> about the syntax of bayes_sql_override_username.  The manual mentions
> "grouping", but gives no examples of how I could, for instance, group bayes
> data by domain (my usernames are in the form user@example.com).

Just a follow-up to my own brain-lapse:

If you define a custom user scores query like this:

user_scores_sql_custom_query    SELECT preference, value FROM
spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL' OR
username = CONCAT('@', _DOMAIN_) ORDER BY username ASC

Then you can easily decide to use bayes on a per-domain basis for one or more
of your domains (and still have per-user bayes for all other domains).  A
sample insert row into the settings table, then, would be:

INSERT INTO spamassassin_settings (username, preference, value) VALUES
('@example.com', 'bayes_sql_override_username', 'example.com');

So everyone in the example.com domain shares all Bayes information, which is
stored under the username "example.com".
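
For instance (values substituted by hand, purely to illustrate how the
_USERNAME_ and _DOMAIN_ macros expand), for a hypothetical user
joe@example.com the query effectively becomes:

    SELECT preference, value FROM spamassassin_settings
    WHERE username = 'joe@example.com'
       OR username = '!GLOBAL'
       OR username = CONCAT('@', 'example.com')
    ORDER BY username ASC;

which picks up the '@example.com' row above and applies the
bayes_sql_override_username preference to everyone in that domain.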


 
> >     - cluster Bayes DB???
> 
> This apparently is not an option, since clustered MySQL databases are kept
> entirely in memory.  We don't have any 10GB RAM machines sadly  :)
> 
> From the MySQL manual:
> 
> In-memory storage:
> 
> All data stored in each data node is kept in memory on the node's host
> computer. For each data node in the cluster, you must have available an
> amount of RAM equal to the size of the database times the number of
> replicas,
> divided by the number of data nodes. Thus, if the database takes up 1
> gigabyte of memory, and you wish to set up the cluster with 4 replicas
> and 8 data nodes, a minimum of 500 MB memory will be required per node.
> Note that
> this is in addition to any requirements for the operating system and any
> other applications that might be running on the host.
> 



		

Re: HUGE bayes DB (non-sitewide) advice?

Posted by email builder <em...@yahoo.com>.
> Well, I know there have to be some admins out there who have a lot of users
> and do not use sitewide bayes...... RIGHT?  See original email snippet at
> bottom.

<snip>

> * Other ideas:
>     - increase system memory as much as possible
>     - per-domain Bayes instead of per-user???

This might be our 2nd best choice (unless there is a good
bayes_expiry_max_db_size solution), but I don't see anything in the manual
about the syntax of bayes_sql_override_username.  The manual mentions
"grouping", but gives no examples of how I could, for instance, group bayes
data by domain (my usernames are in the form user@example.com).

>     - cluster Bayes DB???

This apparently is not an option, since clustered MySQL databases are kept
entirely in memory.  We don't have any 10GB RAM machines sadly  :)

From the MySQL manual:

In-memory storage:

All data stored in each data node is kept in memory on the node's host
computer. For each data node in the cluster, you must have available an
amount of RAM equal to the size of the database times the number of replicas,
divided by the number of data nodes. Thus, if the database takes up 1
gigabyte of memory, and you wish to set up the cluster with 4 replicas and 8
data nodes, a minimum of 500 MB memory will be required per node. Note that
this is in addition to any requirements for the operating system and any
other applications that might be running on the host.
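
Plugging our numbers into that formula, just to make the point concrete
(the replica and node counts here are hypothetical):

    RAM per node = DB size x replicas / data nodes
                 = 160 GB x 2 / 8
                 = 40 GB per node

which is why clustering looks like a non-starter for us.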




	
		

Re: HUGE bayes DB (non-sitewide) advice?

Posted by email builder <em...@yahoo.com>.
Well, I know there have to be some admins out there who have a lot of users
and do not use sitewide bayes...... RIGHT?  See original email snippet at
bottom.

I'll start the ball rolling with what few tweaks we've made, although they
are not enough; we desperately need more ideas to make this viable.

* bayes_auto_expire is turned on; cronning the expiry of 20K+ accounts every
night seems outrageous

* bayes_expiry_max_db_size is at its default value; if 20K accounts used the
maximum allowable space, then we'd have a 160GB bayes DB.  If 8MB is
considered sufficient for a whole domain by some people, then perhaps we can
reduce this size for per-user bayes...??  (A sketch follows this item.)
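
Something like the following is what I have in mind (untested, purely
illustrative numbers; the 150,000-token default is roughly where the
~8MB-per-user figure above comes from):

    # local.cf -- halve the per-user token ceiling (sketch only)
    bayes_expiry_max_db_size  75000
    bayes_auto_expire         1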

* MySQL tuning for InnoDB: pretty much straight from the MySQL manual (a
consolidated my.cnf sketch follows this list)...
    - multiple data files (approx 10GB each)
    - innodb_flush_log_at_trx_commit=0 because it's faster, and we care
little enough about the Bayes data that the risk of losing the last second
of writes is acceptable
    - innodb_buffer_pool_size as large as we can handle, but even if this
were 3 or more GB, it's only a fraction of a 160GB database
    - innodb_additional_mem_pool_size=20M because that's what we saw in
their "big" example, although I am wondering in particular about the value
of increasing this one
    - innodb_log_file_size 25% of innodb_buffer_pool_size
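
The consolidated sketch (sizes are examples for a box with ~4GB RAM, not
recommendations):

    # my.cnf fragment (illustrative values only)
    [mysqld]
    innodb_data_file_path           = ibdata1:10G;ibdata2:10G:autoextend
    innodb_flush_log_at_trx_commit  = 0
    innodb_buffer_pool_size         = 3072M
    innodb_additional_mem_pool_size = 20M
    innodb_log_file_size            = 768M   # ~25% of the buffer pool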

* Other ideas:
    - increase system memory as much as possible
    - per-domain Bayes instead of per-user???
    - cluster Bayes DB???
    - revert to MyISAM -- will this help THAT much?


>   I'm wondering if anyone out there hosts a large number of users with
> per-USER bayes (in MySQL)?  Our user base is varied enough that we do not
> feel bayes would be effective if done site-wide.  Some people like their
> spammy newsletters, some are geeks who would deeply resent someone training
> newsletters to be ham.
> 
>   As a result of this, however, we are currently burdened with an 8GB
> (yes, you read that right) bayes database (more than 20K users having mail
> delivered).  We went to InnoDB when we upgraded to 3.1 per the upgrade
> doc's recommendation, so things are also a bit slower.  Watching mytop,
> almost all the activity we see is bayes inserts, which is not surprising,
> and is probably why we get so much iowait trying to keep writing to an 8GB
> tablespace...
> 
>   We've tuned the InnoDB some, but performance is still not all that good
> -- is there anyone out there who runs a system like this?
> 
>   * What kinds of MySQL tuning are people using to help cope?
>   * Are there any SA settings to help alleviate performance problems?
>   * If we want to walk away from per-user bayes, is the only option to go
> site-wide?  What other options are there?



		