Posted to users@spamassassin.apache.org by Kris Deugau <kd...@vianet.ca> on 2008/04/09 18:12:43 UTC

Large-scale global Bayes tuning?

Anyone have any suggestions on tuning a large global Bayes db for 
stability and sanity?  I've got my fingers in the pie of a moderately 
large mail cluster, but I haven't yet found a Bayes configuration that's 
sane and stable for any extended period.  Wiping it completely about 
once a week seems to provide "acceptable" filtering performance (we have 
a number of addon rulesets), but I still see spam in my inbox with 
BAYES_00 - a sure sign of a mistuned Bayes database.

Past experience with (much) smaller systems has shown stable behaviour 
with bayes_expiry_max_db_size set to 1500000 (~40M BDB Bayes), daily 
expiry runs delete ~25-35K tokens;  mail volume ~3K/day.  However, the 
larger system (MySQL, currently set with max_db_size at 3000000, on-disk 
files running ~100M) only seems to be expiring that same 25-35K tokens 
even though autolearn is picking up ~1.5M+ from ~300K messages on a 
daily basis.  Reading through the docs on token expiry I would guess it 
should be far more aggressive than it is.  (Among other things, I really 
don't want to bump up max_db_size by two orders of magnitude;  up to ~5M 
should be fine, and I could see as high as 7.5M if really necessary.)

I'm not even really sure what questions to ask to get more detail; 
sa-learn -D doesn't really spit out *enough* detail about the expiry 
process to know for sure if something is going wrong there.
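[For reference, the expiry knobs in play look like this in local.cf; the sizes below just mirror the figures above, not a recommendation, and my understanding is that an expiry run tries to cut the database back toward roughly 75% of the ceiling:]

```
# local.cf sketch -- sizes from the message above, not recommendations
bayes_auto_expire        1         # default; opportunistic expiry at sync time
bayes_expiry_max_db_size 3000000   # token ceiling; expiry aims for ~75% of this
```

[`sa-learn --dump magic` reports the current token count and the last expiry atime delta, and `sa-learn --force-expire -D` forces a run with debug output, which gives somewhat more detail than a plain `sa-learn -D`.]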

-kgd

Re: Large-scale global Bayes tuning?

Posted by SM <sm...@resistor.net>.
Hi Kris,
At 09:12 09-04-2008, Kris Deugau wrote:
>Anyone have any suggestions on tuning a large global Bayes db for 
>stability and sanity?  I've got my fingers in the pie of a 
>moderately large mail cluster, but I haven't yet found a Bayes 
>configuration that's sane and stable for any extended 
>period.  Wiping it completely about once a week seems to provide 
>"acceptable" filtering performance (we have a number of addon 
>rulesets), but I still see spam in my inbox with BAYES_00 - a sure 
>sign of a mistuned Bayes database.

Spam hitting BAYES_00 points to the bayes database being 
polluted.  That can happen if the autolearn levels are not low 
enough.  Some manual learning can help to keep the Bayes database in 
tune.  A more aggressive expiry won't necessarily prevent 
mistuning.  You'll have to do some MySQL tuning for performance.  In 
a large setup, manual learning isn't always possible.  You can have 
some rules to identify some "good" and "bad" messages which are 
representative of the userbase.
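[The autolearn levels SM mentions are these local.cf options, shown here with the stock defaults; lowering the nonspam threshold (it can go negative) makes ham-learning stricter, which is the usual fix for BAYES_00 pollution:]

```
# local.cf sketch -- stock defaults shown; adjust cautiously
bayes_auto_learn                   1
bayes_auto_learn_threshold_nonspam 0.1    # learn as ham below this score
bayes_auto_learn_threshold_spam    12.0   # learn as spam above this score
```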

Regards,
-sm



Re: Large-scale global Bayes tuning?

Posted by Kris Deugau <kd...@vianet.ca>.
John Hardin wrote:
> How varied is the character of your message traffic? Is manual learning 
> an option, especially with larger autolearn thresholds?

What is this... "manual learning"...  you speak of?  <g>

Not really an option in the short term, although in the long term I'd 
*like* to have a system similar to what I've mostly trained users to do 
on the much smaller systems - forward misclassified mail to a suitable 
role account as an attachment for manual processing (whitelist, 
blacklist, feed to Bayes, write/adjust rules, etc).  Of course, that 
requires someone to *do* the manual processing....  :(

I've been taking my own FNs and feeding them back in;  that's really the 
only misclassified mail I have easy access to.  No FPs noticed so far....
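[Once automated, that feedback loop might look something like the cron sketch below; the user, schedule, and mbox paths are all assumptions, and mail forwarded as an attachment would need unpacking into these mboxes first:]

```
# /etc/cron.d/sa-mistake-train (sketch; user and paths are assumptions)
# false negatives forwarded by users -> learn as spam
30 3 * * * sa-filter sa-learn --spam --mbox /var/spool/sa-train/missed-spam.mbox
# false positives -> learn as ham
40 3 * * * sa-filter sa-learn --ham  --mbox /var/spool/sa-train/false-positives.mbox
```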

> Then at least you'd be able to reseed your bayes with a known-good corpus.

*nod*  I've thought about exporting the database from the smaller system 
and pulling it in to the cluster to see how the accuracy is.

"Tokens don't get expired according to my understanding of the expiry 
algorithm" about sums up the immediate problem;  overall filter accuracy 
is pretty good on the whole.

-kgd
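[To see whether expiry is firing at all on the MySQL backend, the bookkeeping row can be inspected directly. This assumes the stock schema shipped as bayes_mysql.sql, where bayes_vars tracks per-database token counts and the last expiry run:]

```sql
-- sketch against the stock bayes_vars table (column names assumed
-- from the distributed bayes_mysql.sql schema)
SELECT username,
       token_count,
       FROM_UNIXTIME(last_expire) AS last_expire_run,
       last_expire_reduce         AS tokens_removed_last_run
FROM   bayes_vars;
```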

Re: Large-scale global Bayes tuning?

Posted by John Hardin <jh...@impsec.org>.
On Wed, 9 Apr 2008, Kris Deugau wrote:

> John Hardin wrote:
>>  On Wed, 9 Apr 2008, Kris Deugau wrote:
>> 
>> >  autolearn is picking up ~1.5M+ from ~300K messages on a daily basis.
>>
>>  Push your autolearn thresholds out to reduce the overall volume of learned
>>  spam and ham?
>
> I've thought about that.  It makes it more difficult to get Bayes data 
> on the critical messages in that middle range though.  :(

How varied is the character of your message traffic? Is manual learning an 
option, especially with larger autolearn thresholds?

Then at least you'd be able to reseed your bayes with a known-good corpus.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   People seem to have this obsession with objects and tools as being
   dangerous in and of themselves, as though a weapon will act of its
   own accord to cause harm. A weapon is just a force multiplier. It's
   *humans* that are (or are not) dangerous.
-----------------------------------------------------------------------
  4 days until Thomas Jefferson's 265th Birthday

Re: Large-scale global Bayes tuning?

Posted by Kris Deugau <kd...@vianet.ca>.
John Hardin wrote:
> On Wed, 9 Apr 2008, Kris Deugau wrote:
> 
>> autolearn is picking up ~1.5M+ from ~300K messages on a daily basis.
> 
> Push your autolearn thresholds out to reduce the overall volume of 
> learned spam and ham?

I've thought about that.  It makes it more difficult to get Bayes data 
on the critical messages in that middle range though.  :(

-kgd

Re: Large-scale global Bayes tuning?

Posted by John Hardin <jh...@impsec.org>.
On Wed, 9 Apr 2008, Kris Deugau wrote:

> autolearn is picking up ~1.5M+ from ~300K messages on a daily basis.

Push your autolearn thresholds out to reduce the overall volume of learned 
spam and ham?


Re: Large-scale global Bayes tuning?

Posted by Michael Scheidell <sc...@secnap.net>.
> From: Kris Deugau <kd...@vianet.ca>
> Organization: ViaNet Internet Solutions
> Reply-To: <us...@spamassassin.apache.org>
> Date: Wed, 09 Apr 2008 12:36:56 -0400
> To: <us...@spamassassin.apache.org>
> Subject: Re: Large-scale global Bayes tuning?
> 
> Michael Scheidell wrote:
>> Bayes on a cluster raises the question: what if you didn't replicate the bayes
>> tables, and left them server specific?
> 
> It may yet take that.  :(  (If only for overall cluster reliability -
> any one of the current three machines could handle the current load
> without any trouble, but we're likely going to stuff ClamAV on them as
> well.)  Unfortunately that means doing mistake-training on *each*
> machine - autolearn on its own just doesn't cut it.
> 
> I'm dogfooding pretty much that exact scenario on one machine;  it's got
> its own local Bayes DB that I'm hand-training with my own mail.
> 
You could also take mysql off of one or several, have them load balance to
the other mysql servers, run a caching (global) dns server and clamav on one
of them.

What about DCC? I assume with those volumes you are running a local DCC
server, and having the other boxes talk to it?


>> Since (depending on configurations) some of the servers might get 'spam
>> only' (higher mx records), maybe just take one of the 'valid' bayes tables
>> and manually copy it (sa-learn backup, sa-learn clear, restore) every week
>> or so.
> 
> Mmmh.  Access is for both inbound and outbound mail, through a
> load-balancer;  the type of mail seen on any one system is pretty much
> identical over time.

Keep a couple for outbound only, won't need bayes too much on those.
We have an engineering spec for a 9x9 (9 nodes in a cluster, 9 clusters in a
group) to support up to 2MM users, and we do a lot of task and load
splitting like that.

-- 
Michael Scheidell, CTO
>|SECNAP Network Security
Winner 2008 Network Products Guide Hot Companies
FreeBSD SpamAssassin Ports maintainer
Charter member, ICSA labs anti-spam consortium

_________________________________________________________________________
This email has been scanned and certified safe by SpammerTrap(tm). 
For Information please see http://www.spammertrap.com
_________________________________________________________________________

Re: Large-scale global Bayes tuning?

Posted by Kris Deugau <kd...@vianet.ca>.
Michael Scheidell wrote:
> Bayes on a cluster raises the question: what if you didn't replicate the bayes
> tables, and left them server specific?

It may yet take that.  :(  (If only for overall cluster reliability - 
any one of the current three machines could handle the current load 
without any trouble, but we're likely going to stuff ClamAV on them as 
well.)  Unfortunately that means doing mistake-training on *each* 
machine - autolearn on its own just doesn't cut it.

I'm dogfooding pretty much that exact scenario on one machine;  it's got 
its own local Bayes DB that I'm hand-training with my own mail.
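[For the unreplicated, per-server variant, the Bayes storage just points at the local MySQL instance in local.cf; the database name and credentials below are placeholders:]

```
# local.cf sketch -- per-server Bayes against localhost (names are placeholders)
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn       DBI:mysql:sa_bayes:localhost
bayes_sql_username  sa_user
bayes_sql_password  sa_pass
```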

> Since (depending on configurations) some of the servers might get 'spam
> only' (higher mx records), maybe just take one of the 'valid' bayes tables
> and manually copy it (sa-learn backup, sa-learn clear, restore) every week
> or so.

Mmmh.  Access is for both inbound and outbound mail, through a 
load-balancer;  the type of mail seen on any one system is pretty much 
identical over time.

Re: Large-scale global Bayes tuning?

Posted by Michael Scheidell <sc...@secnap.net>.

> From: Kris Deugau <kd...@vianet.ca>
> Organization: ViaNet Internet Solutions
> Date: Wed, 09 Apr 2008 12:12:43 -0400
> To: <us...@spamassassin.apache.org>
> Subject: Large-scale global Bayes tuning?
> 
> Anyone have any suggestions on tuning a large global Bayes db for
> stability and sanity?  I've got my fingers in the pie of a moderately
> large mail cluster, but I haven't yet found a Bayes configuration that's
> sane and stable for any extended period.  Wiping it completely about
> once a week seems to provide "acceptable" filtering performance (we have
> a number of addon rulesets), but I still see spam in my inbox with
> BAYES_00 - a sure sign of a mistuned Bayes database.
> 
Bayes on a cluster raises the question: what if you didn't replicate the bayes
tables, and left them server specific?

Since (depending on configurations) some of the servers might get 'spam
only' (higher mx records), maybe just take one of the 'valid' bayes tables
and manually copy it (sa-learn backup, sa-learn clear, restore) every week
or so.

Only way I could get a cluster of 9 to work right.
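[The weekly copy described above might be sketched like this; the backup path is an assumption, and the sa-learn invocations are shown commented since they act on live databases:]

```shell
#!/bin/sh
# Sketch of a weekly Bayes reseed from one known-good server.
# The path is an assumption; the sa-learn lines are the actual mechanism.
set -eu
STAMP=$(date +%Y%m%d)
BACKUP="/var/tmp/bayes-${STAMP}.txt"
echo "reseeding from $BACKUP"
# On the known-good machine:
#   sa-learn --backup > "$BACKUP"
# Then, on each machine being reseeded:
#   sa-learn --clear
#   sa-learn --restore "$BACKUP"
```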
