Posted to users@spamassassin.apache.org by Matthias Leisi <ma...@leisi.net> on 2006/11/11 09:02:07 UTC

"Distributed" Bayes DB?

Hello List,

How would you set up a "distributed" Bayes DB?

In this context, "distributed" means that I have four mailserver
machines in parallel (all with equal MX priority) where I want to run
SpamAssassin's Bayes filtering -- without introducing a single point of
failure (e.g. a central database).

All servers should thus run with local Bayes DBs. To keep them from
diverging too much, I see two options:

1) the files are copied from one machine to the others once a day (or
twice, ...).

2) the files are merged and re-distributed to all four machines once a
day (or twice, ...).

Do you see additional options? What is the "best practice" in that
regard with SpamAssassin? Is it even possible to merge Bayes DBs (and if
so, how)?

By the way, I would like a similar setup for the auto-whitelist (AWL),
where I think a simple file copy (i.e. option 1 above) is sufficient.

Thanks for your input,
-- Matthias

-- 
Blog: http://matthias.leisi.net/
DNS Whitelist: http://www.dnswl.org/

Re: "Distributed" Bayes DB?

Posted by Matthias Leisi <ma...@leisi.net>.
First, thank you all for the suggestions relating to SQL. It seems SQL
support is better than I expected and I will give it a try.

Alex Woick wrote:
> Don't overrate Bayes. 

The system has been running without Bayes for roughly 3 years (with
incremental SpamAssassin updates), and with good results until now.

However, without the Bayes check it handled the recent increase
in spam volumes with less success than other systems that do have Bayes
checks enabled.

> Don't focus solely on a bullet-proof highly
> available clustered or replicated database. If the Bayes database is
> gone, only one check is gone! All the others are still there.

That's a very good suggestion, since it seems like a bit of overkill
to have additional database server machines for this simple task.

Is it even necessary to have a consistent shared storage amongst "equal"
MXes or would it be sufficient to let them run independently?

> For Bayes, use a central SQL database on one server that is used by all
> your MTA's, and keep it simple. Make a disaster recovery concept for the
> database machine and for the rebuild of an empty SA Bayes database. This
> could be very fast. Don't backup the Bayes token data. You wrote that

I don't worry too much about disaster recovery, more about avoiding a
single point of failure, i.e. if one or two machines go up in smoke or
are taken offline for maintenance, the remaining machines should
continue just as before.

-- Matthias


Re: "Distributed" Bayes DB?

Posted by Alex Woick <al...@wombaz.de>.
Don't overrate Bayes. Don't focus solely on a bullet-proof highly 
available clustered or replicated database. If the Bayes database is 
gone, only one check is gone! All the others are still there.

For my mail content, the real filtering power today comes from the
network checks such as URL blocklists, content checksums (Razor/DCC) and
open-relay block lists. Focus on making these additional tests work.

For Bayes, use a central SQL database on one server that is used by all
your MTAs, and keep it simple. Have a disaster recovery plan for the
database machine and for rebuilding an empty SA Bayes database. This
could be very fast. Don't back up the Bayes token data. You wrote that
you expect 500,000 messages per day. If you use Bayes auto-learning, an
empty central Bayes database is refilled to a usable state from current 
messages in only a few hours. This is probably faster than a cumbersome 
restore process.

regards,
Alex

Re: "Distributed" Bayes DB? (SQL usage)

Posted by Matt Kettler <mk...@verizon.net>.
Matthias Leisi wrote:
> Matt Kettler wrote:
>
>   
>>> Do you see additional options? 
>>>       
>> Use a SQL server backend. If you must have a no-failure option for the
>> bayes DB, use a  cluster of SQL servers.
>> [..]
>>
>> Also see the SQL readme:
>>
>> http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes
>>     
>
> I already took a look at using SQL, but this quote:
>
> | NB:  This should be considered BETA, and the interface, schema, or
> | overall operation of SQL support may change at any time with future
> | releases of SA.
>
> stops me from using it. Unfortunately, I can not run software officially
> considered Beta on this system.
>   
I think that documentation line is obsolete, and has probably been
overlooked for a long time.

SQL support has been in SA since 2004, and was touted as a major feature
of SA 3.0.0.

http://mail-archives.apache.org/mod_mbox/spamassassin-announce/200409.mbox/browser

The 3.1.0 release announcement declared SQL to be THE preferred method
for bayes storage, even for single-box setups.

http://mail-archives.apache.org/mod_mbox/spamassassin-announce/200509.mbox/%3c20050914235232.814A45900BA@radish.jmason.org%3e

---------

- added PostgreSQL, MySQL 4.1+, and local SDBM file Bayes storage modules. SQL
  storage is now recommended for Bayes, instead of DB_File. NDBM_File support
  has been dropped due to a major bug in that module.
---------



That said, yes, they might change the schema or operation in a future
version.. But the same goes for DB files. It's happened once already..

But this is not beta, it's the recommended configuration.

>
>   
>> Use a SQL server backend. If you must have a no-failure option for the
>> bayes DB, use a  cluster of SQL servers.
>>
>> Example with mysql:
>>
>> http://www.howtoforge.com/loadbalanced_mysql_cluster_debian
>>     
>
> I suppose that every message passed through SpamAssassin will issue at
> least one query and one update statement to the DB. How does a MySQL
> cluster perform with 500'000 messages per day, considering that
> replication must also take place?
>   
*MUCH* faster than the default Berkeley DB does:

http://wiki.apache.org/spamassassin/BayesBenchmarkResults

MySQL with MyISAM tables completed the test in 56% of the time DBM took.

Admittedly that's over the loopback interface (lo), not the wire, but you
get the point. In general, SQL is more efficient and faster than the
default Berkeley DB.

SDBM is faster still, but it had some issues with the dump/restore
process last I checked, so conversion to SDBM is not very practical. I'd
consider SDBM neither well supported nor well tested, although I do use it
on my boxes.


>
>   
>>> What is the "best practice" in that
>>> regard with Spamassassin? 
>>>       
>> Using SQL is by far the best practice here.
>>     
>
> I do not see many mentions of the SQL approach - either because it is
> not used much or because it works so well?
>
>   
Erm, really?  It seems to get talked about here a lot. And the official
recommendation in the release announcement is hard to overlook.



Re: "Distributed" Bayes DB?

Posted by Charlie Clark <ch...@begeistert.org>.
Am 11.11.2006 um 11:47 schrieb Matt Kettler:

>> I suppose you could use something like NFS so that all systems share
>> the same DB, config files, etc.
> NFS would be HIGHLY not-recommended.
>
> http://article.gmane.org/gmane.mail.spam.spamassassin.general/72362/match=sql
>
> In fact, I personally would suggest never using NFS for anything at all,
> and I'm shocked that you'd even consider using it for any production
> purpose.

NFS or equivalent has its place and can be made safe enough if  
required but I think other issues like concurrent access suggest that  
the SQL approach is the way to go.

> Besides, the point here is to eliminate any single-point-of-failure. NFS
> would offer no redundancy at all. If the server hosting the NFS share
> went down, the bayes DB would be unavailable.

Agreed.

>>> I do not see many mentions of the SQL approach - either because it is
>>> not used much or because it works so well?
>>
>> Probably the former. And you're right not to use something like the
>> SQL backend for a large volume production system. Not because it's
>> unreliable but because it's still in development and keeping the
>> schema up to date could become a real headache.
> But it's not still in development.. It's the recommended configuration
> as of 3.1.0.
>
> SA's SQL support is solid. I personally don't use it, but many here do.

Yes, sorry I should have read all e-mails relating to the thread first.

Charlie
--
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D- 40215
Tel: +49-211-938-5360
GSM: +49-178-782-6226




Re: "Distributed" Bayes DB?

Posted by Matt Kettler <mk...@verizon.net>.
Charlie Clark wrote:
>
> Am 11.11.2006 um 10:48 schrieb Matthias Leisi:
>
>>
>> I already took a look at using SQL, but this quote:
>>
>> | NB:  This should be considered BETA, and the interface, schema, or
>> | overall operation of SQL support may change at any time with future
>> | releases of SA.
>>
>> stops me from using it. Unfortunately, I can not run software officially
>> considered Beta on this system.
>
> I suppose you could use something like NFS so that all systems share
> the same DB, config files, etc.
NFS would be HIGHLY not-recommended.

http://article.gmane.org/gmane.mail.spam.spamassassin.general/72362/match=sql

In fact, I personally would suggest never using NFS for anything at all,
and I'm shocked that you'd even consider using it for any production
purpose.

Besides, the point here is to eliminate any single-point-of-failure. NFS
would offer no redundancy at all. If the server hosting the NFS share
went down, the bayes DB would be unavailable.

>
>>
>> I do not see many mentions of the SQL approach - either because it is
>> not used much or because it works so well?
>
> Probably the former. And you're right not to use something like the
> SQL backend for a large volume production system. Not because it's
> unreliable but because it's still in development and keeping the
> schema up to date could become a real headache.
But it's not still in development.. It's the recommended configuration
as of 3.1.0.

SA's SQL support is solid. I personally don't use it, but many here do.


Re: "Distributed" Bayes DB?

Posted by Charlie Clark <ch...@begeistert.org>.
Am 11.11.2006 um 10:48 schrieb Matthias Leisi:

>
> I already took a look at using SQL, but this quote:
>
> | NB:  This should be considered BETA, and the interface, schema, or
> | overall operation of SQL support may change at any time with future
> | releases of SA.
>
> stops me from using it. Unfortunately, I can not run software officially
> considered Beta on this system.

I suppose you could use something like NFS so that all systems share  
the same DB, config files, etc.

>
>> Use a SQL server backend. If you must have a no-failure option for the
>> bayes DB, use a  cluster of SQL servers.
>>
>> Example with mysql:
>>
>> http://www.howtoforge.com/loadbalanced_mysql_cluster_debian
>
> I suppose that every message passed through SpamAssassin will issue at
> least one query and one update statement to the DB. How does a MySQL
> cluster perform with 500'000 messages per day, considering that
> replication must also take place?

How long is a piece of string? 500,000 queries per day shouldn't  
cause any problems for an RDBMS but the architecture of such a system  
should be given a bit of consideration - connection pooling et al.

There is in fact a mail system that uses PostgreSQL to store all the
mails, if you want more information on requirements, speed, etc. I'm
pretty sure you could run SpamAssassin on top of it.

>
>>> What is the "best practice" in that
>>> regard with Spamassassin?
>>
>> Using SQL is by far the best practice here.
>
> I do not see many mentions of the SQL approach - either because it is
> not used much or because it works so well?

Probably the former. And you're right not to use something like the  
SQL backend for a large volume production system. Not because it's  
unreliable but because it's still in development and keeping the  
schema up to date could become a real headache.

I suspect that at some point it might make sense to use something  
like SQLite for persistence (because it's relatively easy to  
distribute) which would make using alternative backends relatively easy.

Charlie
--
Charlie Clark
Helmholtzstr. 20
Düsseldorf
D- 40215
Tel: +49-211-938-5360
GSM: +49-178-782-6226




Re: "Distributed" Bayes DB?

Posted by Dhawal Doshy <dh...@netmagicsolutions.com>.
Dhawal Doshy wrote:
> Matthias Leisi wrote:
>> Matt Kettler wrote:
>>
>>>> Do you see additional options? 
>>> Use a SQL server backend. If you must have a no-failure option for the
>>> bayes DB, use a  cluster of SQL servers.
>>> [..]
>>>
>>> Also see the SQL readme:
>>>
>>> http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes
>>
>> I already took a look at using SQL, but this quote:
>>
>> | NB:  This should be considered BETA, and the interface, schema, or
>> | overall operation of SQL support may change at any time with future
>> | releases of SA.
>>
>> stops me from using it. Unfortunately, I can not run software officially
>> considered Beta on this system.
> 
> Like Matt mentioned.. this is an oops. I've been using global sql bayes 
> ever since the 3.0.0 release (about 2 years now).. same for awl (which i 
> later disabled for lack of janitor tools).
> 
> It's rock stable and quite fast (though on a dedicated server).. for 
> redundancy look at DRBL or something similar.
                      ^^^^ that should be DRBD

- dhawal

Re: "Distributed" Bayes DB?

Posted by Dhawal Doshy <dh...@netmagicsolutions.com>.
Matthias Leisi wrote:
> Matt Kettler wrote:
> 
>>> Do you see additional options? 
>> Use a SQL server backend. If you must have a no-failure option for the
>> bayes DB, use a  cluster of SQL servers.
>> [..]
>>
>> Also see the SQL readme:
>>
>> http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes
> 
> I already took a look at using SQL, but this quote:
> 
> | NB:  This should be considered BETA, and the interface, schema, or
> | overall operation of SQL support may change at any time with future
> | releases of SA.
> 
> stops me from using it. Unfortunately, I can not run software officially
> considered Beta on this system.

Like Matt mentioned.. this is an oops. I've been using global SQL Bayes
ever since the 3.0.0 release (about 2 years now).. same for AWL (which I
later disabled for lack of janitor tools).

It's rock stable and quite fast (though on a dedicated server).. for 
redundancy look at DRBL or something similar.

- dhawal

Re: "Distributed" Bayes DB?

Posted by Matthias Leisi <ma...@leisi.net>.
Matt Kettler wrote:

>> Do you see additional options? 
> Use a SQL server backend. If you must have a no-failure option for the
> bayes DB, use a  cluster of SQL servers.
> [..]
>
> Also see the SQL readme:
> 
> http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes

I already took a look at using SQL, but this quote:

| NB:  This should be considered BETA, and the interface, schema, or
| overall operation of SQL support may change at any time with future
| releases of SA.

stops me from using it. Unfortunately, I can not run software officially
considered Beta on this system.


> Use a SQL server backend. If you must have a no-failure option for the
> bayes DB, use a  cluster of SQL servers.
>
> Example with mysql:
>
> http://www.howtoforge.com/loadbalanced_mysql_cluster_debian

I suppose that every message passed through SpamAssassin will issue at
least one query and one update statement to the DB. How does a MySQL
cluster perform with 500'000 messages per day, considering that
replication must also take place?


>> What is the "best practice" in that
>> regard with Spamassassin? 
>
> Using SQL is by far the best practice here.

I do not see many mentions of the SQL approach - either because it is
not used much or because it works so well?

Thanks,
-- Matthias


Re: "Distributed" Bayes DB?

Posted by Matt Kettler <mk...@verizon.net>.
Matthias Leisi wrote:
> Hello List,
>
> How would you set up a "distributed" Bayes DB?
>
> In this context, "distributed" means that I have four mailserver
> machines in parallel (all with equal MX priority) where I want to run
> Spamassassins Bayes filtering -- without introducing a single point of
> failure (eg a central database).
>
> All servers should thus run with local Bayes DBs.
No they shouldn't.. there are better ways.
>  In order to avoid that
> they diverge too much,
>
> 1) the files are copied from one machine to the others once a day (or
> twice, ...).
>
> 2) the files are merged and re-distributed to all four machines once a
> day (or twice, ...).
>
> Do you see additional options? 
Use a SQL server backend. If you must have a no-failure option for the
bayes DB, use a  cluster of SQL servers.

Example with mysql:

http://www.howtoforge.com/loadbalanced_mysql_cluster_debian

SA 3.0.0 and higher supports generic SQL, as well as MySQL and Postgres
optimized backends for bayes storage. This is THE way to have multiple
servers share a bayes database, because it's what SQL was designed to
do. Anything else is a hack at best.

See bayes_store_module and the bayes_sql_* options in the conf manpage.

http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Conf.html

Also see the SQL readme:

http://wiki.apache.org/spamassassin/BetterDocumentation/SqlReadmeBayes
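
As a rough illustration of those options, a local.cf fragment for a shared
MySQL-backed Bayes store might look like the sketch below. The host name,
database name, user names and password are placeholders made up for this
example, not values from this thread, and the Bayes tables have to be
created first (the SQL readme above covers that). Check the option names
against the conf manpage for your SA version before relying on it.

```
# local.cf on each MX machine (sketch; all values are hypothetical)

# Use the MySQL-optimized Bayes backend instead of the default DB_File
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL

# DSN format: DBI:driver:database:hostname[:port]
bayes_sql_dsn       DBI:mysql:sa_bayes:db.example.net
bayes_sql_username  sa_user
bayes_sql_password  change-me

# Optional: make all scanning users share one token set
bayes_sql_override_username  sa_global
```

With every MX pointing at the same DSN, all four machines read and learn
into one token database, which is the whole point of the SQL approach.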

> What is the "best practice" in that
> regard with Spamassassin? 
Using SQL is by far the best practice here.
> Is it even possible to merge Bayes DBs (and if
> yes, how)?
>   
No.
> Btw., I would like a similar setup for the Autowhitelist/AWL where I
> think a simple filecopy (ie option 1 above) is sufficient.
>   
Ditto: use SQL here too. See auto_whitelist_factory in the AWL plugin
manpage (assuming SA 3.1.x)

http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin_AWL.html
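
For the AWL, a comparable local.cf sketch, assuming a MySQL server and SA
3.1.x, could look like this. Host, database, credentials and table name are
hypothetical placeholders, and the AWL table must already exist; verify the
option names against the AWL plugin manpage above.

```
# local.cf on each MX machine (sketch; all values are hypothetical)

# Store the auto-whitelist in SQL instead of a local DB file
auto_whitelist_factory  Mail::SpamAssassin::SQLBasedAddrList

# DSN format: DBI:databasetype:databasename:hostname:port
user_awl_dsn            DBI:mysql:sa_awl:db.example.net
user_awl_sql_username   sa_user
user_awl_sql_password   change-me
user_awl_sql_table      awl
```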
> Thanks for your input,
> -- Matthias
>
>