Posted to dev@spamassassin.apache.org by Tim Bishop <ti...@bishnet.net> on 2004/06/12 01:25:07 UTC

SQL Bayes - MySQL versus Postgresql

I've been deploying a new system that uses SpamAssassin with an SQL
database for user config, auto-whitelists, and bayesian databases.

I started off with Postgresql because that's the current database system
we use. I was disappointed with the performance - it sometimes ran into
minutes per message when autolearning. Normal processing was OK, but
when a message was autolearnt it went hideously slow.

At this point I tried MySQL. I was gobsmacked at the performance
difference. Admittedly I thought it might go a bit quicker, maybe even
twice as fast, but it was more like 10-100 times as fast.

With postgresql the database server was maxed out on CPU and IO, and the
spam server had a handful of spamd processes consuming about 5% CPU
each. With mysql this changed to the spam server being maxed out on CPU
and the database server practically idle.

This seems to scale the way I want it to - many spam servers with one
database.

I guess this email is just for information really, unless someone steps
up and finds a bug in SpamAssassin/Perl to attribute this problem to :-)
I'd certainly recommend people use MySQL over Postgresql at this stage.

Cheers,
Tim.

----- Stats -----

Here are some stats. They're not entirely accurate because of varying
test conditions and hardware, etc. But they put across my point...

Postgresql:

There were 2470 messages that took on average 54.8 seconds to process.
1337 messages (54.13%) were processed in less than 10 seconds.
2151 messages (87.09%) were processed in less than 120 seconds.

MySQL:

There were 7550 messages that took on average 4.2 seconds to process.
7418 messages (98.25%) were processed in less than 10 seconds.
7549 messages (99.99%) were processed in less than 120 seconds.

----- Stats ----
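
As a quick cross-check of the figures above, the percentages and the ratio of the two averages can be recomputed. This is only an arithmetic sketch in Python; the names `stats` and `summarize` are made up for illustration, and the input numbers are copied straight from the stats block.

```python
# Figures copied from the stats above:
# (total messages, mean seconds per message, count under 10 s, count under 120 s)
stats = {
    "Postgresql": (2470, 54.8, 1337, 2151),
    "MySQL": (7550, 4.2, 7418, 7549),
}


def summarize(total, under_10, under_120):
    """Return (percent under 10 s, percent under 120 s)."""
    return 100.0 * under_10 / total, 100.0 * under_120 / total


for name, (total, mean_s, u10, u120) in stats.items():
    p10, p120 = summarize(total, u10, u120)
    print(f"{name}: {p10:.2f}% < 10 s, {p120:.2f}% < 120 s, mean {mean_s} s")

# Ratio of the two mean processing times.
speedup = stats["Postgresql"][1] / stats["MySQL"][1]
print(f"mean speedup: {speedup:.1f}x")
```

The ratio of the means comes out at roughly 13x, which sits at the low end of the "10-100 times" estimate. Given the varying test conditions mentioned above, it should be read as a rough indication rather than a benchmark.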

-- 
Tim Bishop
http://www.bishnet.net/tim
PGP Key: 0x5AE7D984


Re: SQL Bayes - MySQL versus Postgresql

Posted by Michael Parker <pa...@pobox.com>.
On Sat, Jun 12, 2004 at 12:25:07AM +0100, Tim Bishop wrote:
> 
> I guess this email is just for information really, unless someone steps
> up and finds a bug in SpamAssassin/Perl to attribute this problem to :-)
> I'd certainly recommend people use MySQL over Postgresql at this stage.
> 

For what it's worth I've seen similar results, although with tuning I
was able to get PostgreSQL to behave much better.  For sure MySQL
works better out of the box than PostgreSQL.

Granted much of the Bayes/AWL code was developed using MySQL, but
with an eye to portability, so it is possible something code related
could be involved.  In fact, it might be the fact that we do so many
row updates.  For ACID databases this means that they must insert a
whole new row and then mark the old row deleted.  Interestingly, the
InnoDB table type under MySQL does this, but I've had much better
success (along with others) than with PostgreSQL.
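
To make the row-update point concrete, here is a toy Python model of "insert a new row version, mark the old one deleted". It is not how PostgreSQL or InnoDB actually store data; `VersionedTable` and its methods are hypothetical names used only to show how repeated updates to the same token leave dead versions behind until something cleans them up.

```python
class VersionedTable:
    """Toy append-only table: an update writes a new version and marks the old one dead."""

    def __init__(self):
        self.versions = []  # each entry is [key, value, is_live]
        self.live = {}      # key -> index of the live version in self.versions

    def get(self, key, default=None):
        idx = self.live.get(key)
        return default if idx is None else self.versions[idx][1]

    def upsert(self, key, value):
        old = self.live.get(key)
        if old is not None:
            self.versions[old][2] = False  # mark the previous version dead
        self.versions.append([key, value, True])
        self.live[key] = len(self.versions) - 1

    def dead_count(self):
        return sum(1 for _, _, is_live in self.versions if not is_live)

    def vacuum(self):
        """Drop dead versions and rebuild the index."""
        self.versions = [v for v in self.versions if v[2]]
        self.live = {v[0]: i for i, v in enumerate(self.versions)}


table = VersionedTable()
for _ in range(1000):  # 1000 count updates to the same token
    table.upsert("token", table.get("token", 0) + 1)

print(len(table.versions), table.dead_count())  # stored versions, dead versions
table.vacuum()
print(len(table.versions), table.dead_count())
```

In this model a single logical row that is updated many times grows into many stored versions, nearly all of them dead, which is the kind of overhead an update-heavy workload such as Bayes learning would generate.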

I'm sure we've got some PostgreSQL experts out there.  If any of y'all
can see a problem with the code as written or have some tips for
tuning PostgreSQL that we can document, feel free to speak up.

Michael

Re: SQL Bayes - MySQL versus Postgresql

Posted by Matt Sergeant <ms...@startechgroup.co.uk>.
On Sat, 12 Jun 2004, Tim Bishop wrote:

> I've been deploying a new system that uses SpamAssassin with an SQL
> database for user config, auto-whitelists, and bayesian databases.
> 
> I started off with Postgresql because that's the current database system
> we use. I was disappointed with the performance - it sometimes ran into
> minutes per message when autolearning. Normal processing was OK, but
> when a message was autolearnt it went hideously slow.
> 
> At this point I tried MySQL. I was gobsmacked at the performance
> difference. Admittedly I thought it might go a bit quicker, maybe even
> twice as fast, but it was more like 10-100 times as fast.
> 
> With postgresql the database server was maxed out on CPU and IO, and the
> spam server had a handful of spamd processes consuming about 5% CPU
> each. With mysql this changed to the spam server being maxed out on CPU
> and the database server practically idle.

It's quite a task to tune PostgreSQL, and out of the box it comes horribly 
badly tuned.  Also you have to make very good use of transactions with 
PostgreSQL in order to get good performance, which I suspect we're not 
doing. Also you have to use high concurrency to really see benefits (it 
scales much better with more concurrent clients than MySQL does).

All those things are probably working against your setup.
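
As an illustration of the transaction point, the sketch below groups all token updates for one message into a single transaction instead of committing each statement separately. It is written in Python with the standard `sqlite3` module purely as a stand-in so that it runs anywhere; it is not SpamAssassin's code (which is Perl), and the `bayes_token` table here is a simplified, assumed schema, as is the `learn_tokens` helper.

```python
import sqlite3


def learn_tokens(conn, tokens, is_spam):
    """Apply every token-count update for one message inside a single transaction."""
    # The column name comes from a fixed pair of values, never from user input.
    column = "spam_count" if is_spam else "ham_count"
    with conn:  # one commit on success, rollback if any statement fails
        for token in tokens:
            cur = conn.execute(
                f"UPDATE bayes_token SET {column} = {column} + 1 WHERE token = ?",
                (token,),
            )
            if cur.rowcount == 0:  # token not seen before, so create its row
                conn.execute(
                    f"INSERT INTO bayes_token (token, {column}) VALUES (?, 1)",
                    (token,),
                )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_token ("
    "token TEXT PRIMARY KEY, "
    "spam_count INTEGER NOT NULL DEFAULT 0, "
    "ham_count INTEGER NOT NULL DEFAULT 0)"
)

learn_tokens(conn, ["offer", "free", "offer"], is_spam=True)
learn_tokens(conn, ["free", "meeting"], is_spam=False)

rows = {
    token: (spam, ham)
    for token, spam, ham in conn.execute(
        "SELECT token, spam_count, ham_count FROM bayes_token"
    )
}
print(rows)
```

With one commit per message rather than one per token, the database has far fewer commits to make durable. Whether that accounts for the slowdown reported in this thread is untested here.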

Matt.