Posted to users@spamassassin.apache.org by Matt Kettler <mk...@verizon.net> on 2006/11/11 11:19:45 UTC

Re: "Distributed" Bayes DB? (SQL usage)

Matthias Leisi wrote:
> Matt Kettler wrote:
>
>   
>>> Do you see additional options? 
>>>       
>> Use a SQL server backend. If you must have a no-failure option for the
>> bayes DB, use a  cluster of SQL servers.
>> [..]
>>
>> Also see the SQL readme:
>>
>> http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes
>>     
>
> I already took a look at using SQL, but this quote:
>
> | NB:  This should be considered BETA, and the interface, schema, or
> | overall operation of SQL support may change at any time with future
> | releases of SA.
>
> stops me from using it. Unfortunately, I can not run software officially
> considered Beta on this system.
>   
I think that documentation line is obsolete, and has probably been
overlooked for a long time.

SQL support has been in SA since 2004, and was touted as a major feature
of SA 3.0.0.

http://mail-archives.apache.org/mod_mbox/spamassassin-announce/200409.mbox/browser

The 3.1.0 release announcement declared SQL to be THE preferred method
for bayes storage, even for single-box setups.

http://mail-archives.apache.org/mod_mbox/spamassassin-announce/200509.mbox/%3c20050914235232.814A45900BA@radish.jmason.org%3e

---------

- added PostgreSQL, MySQL 4.1+, and local SDBM file Bayes storage modules. SQL
  storage is now recommended for Bayes, instead of DB_File. NDBM_File support
  has been dropped due to a major bug in that module.
---------



That said, yes, they might change the schema or operation in a future
version. But the same goes for DB files; it's happened once already.

But this is not beta, it's the recommended configuration.
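
For reference, a minimal local.cf sketch for pointing Bayes at MySQL
might look like the following. It assumes SA 3.1.x and that the Bayes
schema shipped with SA has already been loaded into the database. The
database name, host, user, and password are placeholders of mine, not
values from this thread:

```
# local.cf (sketch; all values below are illustrative placeholders)
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:sa_bayes:db.example.com:3306
bayes_sql_username  sa_user
bayes_sql_password  changeme
```

With a clustered setup, the DSN would presumably point at whatever
address fronts the cluster rather than at a single node.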

>
>   
>> Use a SQL server backend. If you must have a no-failure option for the
>> bayes DB, use a  cluster of SQL servers.
>>
>> Example with mysql:
>>
>> http://www.howtoforge.com/loadbalanced_mysql_cluster_debian
>>     
>
> I suppose that every message passed through SpamAssassin will issue at
> least one query and one update statement to the DB. How does a MySQL
> cluster perform with 500'000 messages per day, considering that
> replication must also take place?
>   
*MUCH* faster than the default Berkeley DB does:

http://wiki.apache.org/spamassassin/BayesBenchmarkResults

MySQL with MyISAM tables completed the test in 56% of the time DBM took.

Admittedly that's over the loopback interface (lo), not the wire, but you
get the point. In general, SQL is more efficient and faster than the
default Berkeley DB.

SDBM is faster still, but it had some issues with the dump/restore
process last I checked, so conversion to SDBM is not very practical. I'd
consider SDBM neither well supported nor well tested, although I do use
it on my boxes.


>
>   
>>> What is the "best practice" in that
>>> regard with SpamAssassin? 
>>>       
>> Using SQL is by far the best practice here.
>>     
>
> I do not see many mentions of the SQL approach - either because it is
> not used much or because it works so well?
>
>   
Erm, really?  It seems to get talked about here a lot. And the official
recommendation in the release announcement is hard to overlook.