Posted to users@spamassassin.apache.org by Wes <we...@msg.bt.com> on 2007/11/29 04:51:00 UTC

Mondo bayes_toks - millions of entries

I've searched and searched the archives, but no answers..  Sorry for the
lengthy email, but...


SpamAssassin 3.2.3-1
Smf-spamd 1.3.1 with spamd
Dual quad-core Xeon 5355 (Woodcrest) systems with 8GB memory.

Configuration:

    bayes_auto_learn 1
    bayes_expiry_max_db_size 150000
    lock_method flock
    rules compiled with sa-compile
    Auto-whitelist module is loaded
    Number of spamd children: 5

We are only using the spam/not spam verdict, not any of the message
rewriting features (this is handled by the MTA).

Per-user preferences are not feasible (per-user policy applications based on
the verdict are done at the MTA level).  Since this is not an end-user
server, each message has many recipients and it is not reasonable to scan
for each recipient to get unique Bayes, scores, etc.

We are processing a large volume of mail.  SpamAssassin is running after a
commercial scanner to minimize the volume and system load.

In 12 hours, the bayes_toks file grows to 160-320 MB, with a ballpark of
something over 7 million tokens.  Some time before this, performance drops
off a cliff and the queue starts backing up big time.  When this happens,
mail is taking 15-20 seconds per message to process, one spamd child is
using 100% of a CPU, and none of the other spamd children are using any
CPU - I assume because they can't get a lock on the DB, since it is taking
the other process so long to get what it needs.

Auto-expire doesn't work due to the volume, so I turned that off and am
doing a manual expire.  Of course, since bayes_expiry_period is 12 hours,
the minimum token age is 12 hours so the number of tokens is never going to
drop below about 7 million, regardless of how often I expire it.

As an immediate solution, I modified

    /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/Conf.pm

And set bayes_expiry_period to 21600 (6 hours) and run an expire every 3
hours (why isn't this a configuration file parameter??)
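
The "expire every 3 hours" is just a cron job running a forced expire,
something like this (the path and schedule are shown only as an illustration):

    0 */3 * * * /usr/bin/sa-learn --force-expire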

This seems to be enough to keep it away from the edge of the performance
cliff - the number of tokens varies from about 3.5-5 million and the DBM
file gets reorganized every 3 hours.  It's too early to tell for sure if
this will hold, but I may need to drop bayes_expiry_period down to 3 hours.

Tomorrow I'm going to set up a test on one of the servers using PostgreSQL
to hold the Bayes tokens and see if it scales better than the DBM file.
That would also allow our multiple servers to share information instead of
act independently.


On to the questions...

1. Setting the expiry period down that low doesn't seem to be an optimal
thing to do from an effectiveness standpoint.  Comments on this?  Am I
missing something?  Due to the type of user base, all-manual learning isn't
likely to work well.  Is auto-learning just a waste of resources in this
case?

2. If I set up manual learning where false positives and false negatives can
be manually sent in by users and added to the site-wide configuration, won't
they also be subject to the (short) expiration period, or is manual learning
kept permanently?

Thanks

Wes



RE: Mondo bayes_toks - millions of entries

Posted by "Randal, Phil" <pr...@herefordshire.gov.uk>.
Matt Kettler wrote:

> If you want to increase token lifespan, you'd increase 
> bayes_expiry_max_db_size  so that more tokens are kept at expire time.

And what, pray tell, are reasonable values for this and / or a
reasonable oldest token age?

Just wondering how to optimise the number of Bayes tokens to best
effect.

Cheers,

Phil


--
Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK

Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
On 11/30/07 12:57 PM, "Kevin Parris" <KP...@ed.sc.gov> wrote:

> If I have followed the discussion correctly so far, the explanation for
> manual-learn not being distinguished from auto-learn is this:  no matter what
> mode of learning caused a token to appear in the database, if there is ongoing
> mail traffic that "hits" on the token then said token will not expire out
> anyway.
> 
> In other words, tokens don't expire because of where or how they came to be
> listed, they expire because no more incoming mail traffic references them.  If
> you manually train a message that is the ONLY instance of that particular spam
> to slip through your other filter, and your Bayes never sees another message
> that matches the tokens it generated, then those tokens are irrelevant
> regardless of learn mode.

That makes sense, except that if that type of message shows up infrequently,
and your token database turns over several times a day because of the high
volume of auto-learn...  If I've taken the time to send it in for a
manual-learn, I'd expect it to be remembered for a while, even if the
message only shows up every couple of days.

I guess the flip side is that if a message is manually learned, and then you
continue to get messages in like that (at least more than the turnover
frequency), then the manually-learned information should stay active.
Correct?

Wes



Re: Mondo bayes_toks - millions of entries

Posted by Kevin Parris <KP...@ed.sc.gov>.
If I have followed the discussion correctly so far, the explanation for manual-learn not being distinguished from auto-learn is this:  no matter what mode of learning caused a token to appear in the database, if there is ongoing mail traffic that "hits" on the token then said token will not expire out anyway.

In other words, tokens don't expire because of where or how they came to be listed, they expire because no more incoming mail traffic references them.  If you manually train a message that is the ONLY instance of that particular spam to slip through your other filter, and your Bayes never sees another message that matches the tokens it generated, then those tokens are irrelevant regardless of learn mode.

>>> Wes <we...@msg.bt.com> 11/30/07 11:56 AM >>>
> 
> The whole reason bayes works is the fact that there's a *LOT* of tokens
> that are repeated over and over and over again for any given kind of
> mail. So the set of tokens acted on by one message are 95% the same as
> the ones in another, provided the general type of email is the same (and
> by general type, I'm thinking all email fits into maybe 20 types, I'm
> talking really broad categories like "conversation" "newsletter" "spam"
> "nonspam ad", etc..)

Guess I need to read up on Bayes some more.

I was thinking more along the lines of separate databases for auto and
manual learning that are combined for a result, giving more weight to manual
learning.  Maybe that just isn't reasonable, though.  I can't see (at least
here) that manual learning would get any kind of significant volume.
Someone's only going to send in a message for manual learning if it is a
leaked spam or a false positive, and then only if they bother to do it.  I'd
be surprised if the manual learning volume was 1 in 10,000 of the messages
going through the auto-learning.

Wes




Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
> Well, I was suggesting making the expiry period just under, not the
> force-expire.. Really you can do it either way as long as expiry_period
> < force-expire.

Ok, I misunderstood what you were saying.  I set bayes_expiry_period to 3
hours, and ran expires every 4 hours over night.

I still get the same results - which makes sense after thinking about it.
It calculates newdelta - the time point where it wants to expire back to.
If newdelta is below bayes_expiry_period, which it still is, it reverts to
the "can't use estimation method for expiry" mode.  I think the only way to
make this work as intended would be to set bayes_expiry_period much shorter,
short enough that there are fewer than bayes_expiry_max_db_size tokens
created (accessed?) in that period - or increase bayes_expiry_max_db_size
above the number created in bayes_expiry_period.

To make 600,000 work, I'd need to set bayes_expiry_period to less than an
hour.  Or, for bayes_expiry_period of 3 hours, set bayes_expiry_max_db_size
to something like 2 million.  Which of course is why my original comment
about bayes_expiry_period should be a config parameter instead of hard
coded.
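
Rough arithmetic using the numbers from my first post: something over 7
million tokens touched in 12 hours is on the order of 600,000 per hour, so
a 3-hour bayes_expiry_period leaves roughly 1.8 million "young" tokens that
can never be expired - hence the "something like 2 million" figure above,
and why the period would have to drop to under an hour before a 600,000
cap could actually be reached.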

> The problem is that doesn't make any physical sense. The tokens are the
> same.
> 
> It's not like there's 6 tokens generated for one message, and 5
> completely different ones for the next. Odds are you'd only have 6
> tokens total. SA  just tracks them as counters. So, it's not like it
> tracks "this instance of "hello" was learned on xyz date, and came from
> message-id 1234".. SA just tracks "hello was in 150 nonspams 120 spams,
> and was last present in an email on 11/29/2007"
> 
> Besides, let's say you've got some kind of flag that makes manually
> learned tokens be retained longer, and added it onto the end of the
> record. In very short order your entire database would have this flag if
> you have any regular manual training. Any token that got autolearned is
> likely to get flagged by a manual training in very short order, because
> even if the emails aren't the same, the tokens generally are.
> 
> The whole reason bayes works is the fact that there's a *LOT* of tokens
> that are repeated over and over and over again for any given kind of
> mail. So the set of tokens acted on by one message are 95% the same as
> the ones in another, provided the general type of email is the same (and
> by general type, I'm thinking all email fits into maybe 20 types, I'm
> talking really broad categories like "conversation" "newsletter" "spam"
> "nonspam ad", etc..)

Guess I need to read up on Bayes some more.

I was thinking more along the lines of separate databases for auto and
manual learning that are combined for a result, giving more weight to manual
learning.  Maybe that just isn't reasonable, though.  I can't see (at least
here) that manual learning would get any kind of significant volume.
Someone's only going to send in a message for manual learning if it is a
leaked spam or a false positive, and then only if they bother to do it.  I'd
be surprised if the manual learning volume was 1 in 10,000 of the messages
going through the auto-learning.

Wes



Re: Mondo bayes_toks - millions of entries

Posted by Matt Kettler <mk...@verizon.net>.
Wes wrote:
> On 11/29/07 7:45 PM, "Matt Kettler" <mk...@verizon.net> wrote:
>
>   
>> As a starting point I'd suggest:
>>    either disable your force-expire calls or disable bayes_auto_expire.
>>     
>
> I am doing only force-expires.  I disabled auto-expire when I started doing
> force expires.
>
>   
>> Doesn't matter to me which, but you really want to be expiring at the
>> bayes_expiry_period interval
>>     drop bayes_expiry_period to 3 hours, if you're still using
>> force-expires, make it just a tad under 3 hours.
>>     expand bayes_expiry_max_db_size to at least 300,000, maybe 600,000.
>>     
>
> Thanks.  I'll give this a try.
>
> If bayes_expiry_period is set to 3 hours, shouldn't the force-expire be just
> *over* 3 hours, not just under?  
Well, I was suggesting making the expiry period just under, not the
force-expire.. Really you can do it either way as long as expiry_period
< force-expire.

> Otherwise wouldn't the "can't use
> estimation method for expiry" always be triggered as it is now?
>   
Aye.
> I planned to have the PostgreSQL DB enabled on one live system tonight, but
> have to wait on a couple of missing RPM's to be installed.  I have great
> hopes for it...  I am running a nearly 2 billion record database under
> PostgreSQL with great performance.  A few million records should be
> nothing...  Guess it depends on what the update vs. read load is.
>   
Should be mostly read, except the atimes for the tokens in a message
will be updated every message that's scanned.
> I would think it would be extremely useful to be able to treat
> manually-learned rules separately from auto-learned rules.  In a high volume
> environment, you'd want to keep manually learned rules far longer than you
> could possibly keep auto-learned ones.  Manually learned rules should be
> more important.
>   
The problem is that doesn't make any physical sense. The tokens are the
same.

It's not like there's 6 tokens generated for one message, and 5
completely different ones for the next. Odds are you'd only have 6
tokens total. SA  just tracks them as counters. So, it's not like it
tracks "this instance of "hello" was learned on xyz date, and came from
message-id 1234".. SA just tracks "hello was in 150 nonspams 120 spams,
and was last present in an email on 11/29/2007"

Besides, let's say you've got some kind of flag that makes manually
learned tokens be retained longer, and added it onto the end of the
record. In very short order your entire database would have this flag if
you have any regular manual training. Any token that got autolearned is
likely to get flagged by a manual training in very short order, because
even if the emails aren't the same, the tokens generally are.

The whole reason bayes works is the fact that there's a *LOT* of tokens
that are repeated over and over and over again for any given kind of
mail. So the set of tokens acted on by one message are 95% the same as
the ones in another, provided the general type of email is the same (and
by general type, I'm thinking all email fits into maybe 20 types, I'm
talking really broad categories like "conversation" "newsletter" "spam"
"nonspam ad", etc..)



Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
On 11/29/07 7:45 PM, "Matt Kettler" <mk...@verizon.net> wrote:

> As a starting point I'd suggest:
>    either disable your force-expire calls or disable bayes_auto_expire.

I am doing only force-expires.  I disabled auto-expire when I started doing
force expires.

> Doesn't matter to me which, but you really want to be expiring at the
> bayes_expiry_period interval
>     drop bayes_expiry_period to 3 hours, if you're still using
> force-expires, make it just a tad under 3 hours.
>     expand bayes_expiry_max_db_size to at least 300,000, maybe 600,000.

Thanks.  I'll give this a try.

If bayes_expiry_period is set to 3 hours, shouldn't the force-expire be just
*over* 3 hours, not just under?  Otherwise wouldn't the "can't use
estimation method for expiry" always be triggered as it is now?

I planned to have the PostgreSQL DB enabled on one live system tonight, but
have to wait on a couple of missing RPM's to be installed.  I have great
hopes for it...  I am running a nearly 2 billion record database under
PostgreSQL with great performance.  A few million records should be
nothing...  Guess it depends on what the update vs. read load is.

I would think it would be extremely useful to be able to treat
manually-learned rules separately from auto-learned rules.  In a high volume
environment, you'd want to keep manually learned rules far longer than you
could possibly keep auto-learned ones.  Manually learned rules should be
more important.

Wes



Re: Mondo bayes_toks - millions of entries

Posted by Matt Kettler <mk...@verizon.net>.
Wes wrote:
> On 11/29/07 5:00 AM, "Matt Kettler" <mk...@verizon.net> wrote:
>
>   
> That's what I expected, but that's not what happens...
>
> I have no problem being proven wrong, and perhaps there's something else
> going on here, but based on digging through the documentation, the code
> (including running sa-learn under the perl debugger down into the expire
> module), and observations of actually running a forced expire, this is not
> true.  
>
> It does try to reduce down to 75% of bayes_expiry_max_db_size, but will not
> expire any tokens younger than bayes_expiry_period, even on a force expire.
> With "bayes_expiry_max_db_size 150000" set, if I run a force expire on a
> database that is less than bayes_expiry_period old, but with millions of
> tokens, *no* tokens are expired.  If I run it on one older than
> bayes_expiry_period, only tokens older than bayes_expiry_period are expired.
>
> At the bottom is a sample debug output from sa-learn forced expire.  You'll
> notice that the target is to reduce the number of tokens down to 112,500
> (token count: 4933086, final goal reduction size: 4820586).  However, look
> at the reduction table, and the final results: "3702653 entries kept,
> 1230354 deleted".
>   
Hmm, that is an indirect side effect of how SA does cutoff-time
estimation for expiry, and the way you're manually running things.

SA eliminates tokens based on how recently they've been accessed. (and
note that's accessed, not when they were added) So it shrinks the
database by selecting a "cutoff time" and ditching everything that
hasn't been used since before that cutoff time.

At present, it seems you're being burned by your manual running of
force-expires. Right now, SA assumes that if the last expire was less
than "bayes_expiry_period" second ago, something must be tragically
wrong, and it skips its usual estimation method and goes to a longer,
more involved estimate.  However, that method  assumes your database
covers a pretty broad range of time, so it does an estimation using
powers-of-two of the bayes_expiry_period. (ie: n*1, n*2, n*4). Since
you're running force-expire every 3 hours, and your bayes_expiry_period
is 6 hours, you're pretty much guaranteed to never use the short estimate.
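
Roughly, that longer estimate just walks a geometric series of candidate
cutoffs - a sketch of the idea in Perl, not the actual BayesStore.pm code:

    my $period  = 21600;                           # bayes_expiry_period (6h)
    my @cutoffs = map { $period * 2 ** $_ } 0 .. 9;
    print "@cutoffs\n";
    # 21600 43200 86400 ... 11059200 - the same atime deltas that show up
    # in the reduction table in the sa-learn debug output below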

Even without the problem of calling force-expire more often than
bayes_expiry_period, you'd still have had problems on this particular
run, as the ratio would be too high, but at least it might have a shot
at settling in to a more reasonable rhythm.

However, I would also point out that with your mail volume, a 150k token
bayes database is likely way too small. That's really more appropriate
for sites doing "small to mid-sized company" volumes of email ie:
100,000 a day or so. From the looks of things, you must be pumping
several million a day.

As a starting point I'd suggest:
   either disable your force-expire calls or disable bayes_auto_expire.
Doesn't matter to me which, but you really want to be expiring at the
bayes_expiry_period interval
    drop bayes_expiry_period to 3 hours, if you're still using
force-expires, make it just a tad under 3 hours.
    expand bayes_expiry_max_db_size to at least 300,000, maybe 600,000.
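
In local.cf terms that works out to something like this (bayes_expiry_period
itself still has to be changed in Conf.pm on this version, as you noted):

    bayes_auto_expire        0         # if you keep the cron force-expires
    bayes_expiry_max_db_size 600000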


In general I've been thinking for a long time the expiry algorithm needs
to be a lot smarter. Right now, it's rather complex code wise, but its
behavior is a bit on the dumb side.. It certainly could be a *LOT*
smarter than always running a fixed set of time-periods. (ie: once you
hit zero, there's no need to keep running them). Clearly it doesn't
perform well on busy sites.

And from the looks of it, lots of other folks have been thinking the
same thing:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3894



Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
On 11/29/07 5:00 AM, "Matt Kettler" <mk...@verizon.net> wrote:

> However,  it's important to note that the  bayes_expiry_period  does not
> dictate token life. It dictates how often expiry check will run
> automatically. Basically, SA looks at the database, finds out when the
> last expire ran, and if more than bayes_expiry_period has elapsed, it
> kicks off an auto-expire. Since you're manually expiring every 3 hours,
> your modified bayes_expiry_period never comes into effect.
>  
> When expiry (either due to the bayes_expiry_period or a manual
> force-expire) runs, it checks if the database has more than
> "bayes_expiry_max_db_size" tokens in it, SA will attempt to reduce the
> database to 75% of bayes_expiry_max_db_size, keeping the most recently
> used tokens.
> 
> In your case, you have a high learning volume, so this means that every
> 3 hours (due to your manual sa-learn --force-expire), your database is
> going to be reduced to the 112,500 or so most-recently used tokens.

That's what I expected, but that's not what happens...

I have no problem being proven wrong, and perhaps there's something else
going on here, but based on digging through the documentation, the code
(including running sa-learn under the perl debugger down into the expire
module), and observations of actually running a forced expire, this is not
true.  

It does try to reduce down to 75% of bayes_expiry_max_db_size, but will not
expire any tokens younger than bayes_expiry_period, even on a force expire.
With "bayes_expiry_max_db_size 150000" set, if I run a force expire on a
database that is less than bayes_expiry_period old, but with millions of
tokens, *no* tokens are expired.  If I run it on one older than
bayes_expiry_period, only tokens older than bayes_expiry_period are expired.

At the bottom is a sample debug output from sa-learn forced expire.  You'll
notice that the target is to reduce the number of tokens down to 112,500
(token count: 4933086, final goal reduction size: 4820586).  However, look
at the reduction table, and the final results: "3702653 entries kept,
1230354 deleted".

As I read it, the algorithm (per the documentation and the code, as best I
remember without looking at it right now) goes something like this:  For the first
pass, it calculates the number of tokens that would be expired for
bayes_expiry_period*1, bayes_expiry_period*2, *4, ... exponentially up to the
max exponent of 9.  It then picks the one that would expire closest to
bayes_expiry_max_db_size, without dropping below 75% of
bayes_expiry_max_db_size.  The smallest exponent is bayes_expiry_period*1 -
it expires entries older than bayes_expiry_period*1 because that is closest
to .75*150,000 - even though that leaves 3.5 million.  For subsequent
passes, if the estimated cutoff is less than bayes_expiry_period, it uses
the above algorithm again.

As further evidence of this, with a static database (spamd not running), and
bayes_expiry_period set to the default of 12 hours, I ran a force expire.
It expired only tokens older than 12 hours, leaving about 7 million.  I ran
it again.  No tokens were removed.  I then dropped bayes_expiry_period  down
to 6 hours and reran the expire.  It then expired another 3.5 million
tokens.  It never dropped to anything approaching 150,000.

All results were verified by comparing the sa-learn debug output with
"sa-learn --dump magic", and done using the DBM module.

One notable thing in the debug output is "bayes: can't use estimation
method for expiry, unexpected result, calculating optimal atime delta (first
pass)", which comes from BayesStore.pm around line 300
("$self->{expiry_period}" and $start are bayes_expiry_period):


  if ( (time() - $vars[4] > 86400*30) || ($vars[8] < $self->{expiry_period})
       || ($vars[9] < 1000)
       || ($newdelta < $self->{expiry_period}) || ($ratio > 1.5) ) {
    dbg("bayes: can't use estimation method for expiry, unexpected result,
calculating optimal atime delta (first pass)");

[snip]

    $newdelta = $start * $max_expire_mult;   <<<<<<<<<<<<<<<<<<<<<<<<<<<
    dbg("bayes: first pass decided on $newdelta for atime delta");
  }
  else { # use the estimation method
    dbg("bayes: can do estimation method for expiry, skipping first pass");
  }


This code is triggered because "newdelta: 10981", which is less than
bayes_expiry_period (BayesStore.pm line 281 calculates newdelta):

  # Estimate new atime delta based on the last atime delta
  my $newdelta = 0;
  if ( $vars[9] > 0 ) {
    # newdelta = olddelta * old / goal;
    # this may seem backwards, but since we're talking delta here,
    # not actual atime, we want smaller atimes to expire more tokens,
    # and visa versa.
    #
    $newdelta = int($vars[8] * $vars[9] / $goal_reduction);
  }
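
Plugging in the figures from the debug log at the bottom of this message
(atime delta 21600, count 1049110, goal reduction 4820586), that works out
to

    newdelta = int(21600 * 1049110 / 4820586) = 4700

which is well under bayes_expiry_period (21600), so the "can't use
estimation method" branch fires every time.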

Again, the code appears to say "if we're expiring anything younger than
bayes_expiry_period, then recalculate so nothing younger than
bayes_expiry_period  is expired".


Here's the log:

[21506] dbg: bayes: expiry starting
[21506] dbg: bayes: expiry check keep size, 0.75 * max: 112500
[21506] dbg: bayes: token count: 4933086, final goal reduction size: 4820586
[21506] dbg: bayes: first pass? current: 1196349002, Last: 1196338557,
atime: 21600, count: 1049110, newdelta: 4700, ratio: 4.59492903508688,
period: 21600
[21506] dbg: bayes: can't use estimation method for expiry, unexpected
result, calculating optimal atime delta (first pass)
[21506] dbg: bayes: expiry max exponent: 9
[21506] dbg: bayes: atime token reduction
[21506] dbg: bayes: ======== ===============
[21506] dbg: bayes: 21600 1230354
[21506] dbg: bayes: 43200 0
[21506] dbg: bayes: 86400 0
[21506] dbg: bayes: 172800 0
[21506] dbg: bayes: 345600 0
[21506] dbg: bayes: 691200 0
[21506] dbg: bayes: 1382400 0
[21506] dbg: bayes: 2764800 0
[21506] dbg: bayes: 5529600 0
[21506] dbg: bayes: 11059200 0
[21506] dbg: bayes: first pass decided on 21600 for atime delta
[21506] dbg: bayes: untie-ing
[21506] dbg: bayes: files locked, now unlocking lock
[21506] dbg: locker: safe_unlock: unlocked
/home/smfs/.spamassassin/bayes.mutex
[21506] dbg: bayes: expiry completed
bayes: synced databases from journal in 0 seconds: 927 unique entries (927
total entries)
expired old bayes database entries in 432 seconds
3702653 entries kept, 1230354 deleted
token frequency: 1-occurrence tokens: 83.22%
token frequency: less than 8 occurrences: 12.56%


Wes



Re: Mondo bayes_toks - millions of entries

Posted by Matt Kettler <mk...@verizon.net>.
Wes wrote:
> I've searched and searched the archives, but no answers..  Sorry for the
> lengthy email, but...
>
>
> SpamAssassin 3.2.3-1
> Smf-spamd 1.3.1 with spamd
> Dual quad-core Xeon 5355 (Woodcrest) systems with 8GB memory.
>
> Configuration:
>
>     bayes_auto_learn 1
>     bayes_expiry_max_db_size 150000
>     lock_method flock
>     rules compiled with sa-compile
>     Auto-whitelist module is loaded
>     Number of spamd children: 5
>
>   
<snip>
> As an immediate solution, I modified
>
>     /usr/lib/perl5/site_perl/5.8.5/Mail/SpamAssassin/Conf.pm
>
> And set bayes_expiry_period to 21600 (6 hours) and run an expire every 3
> hours (why isn't this a configuration file parameter??)
>
>   
Even if it was a setting, it's now irrelevant. The fact that you're
running a force-expire every 3 hours makes the bayes_expiry_period moot.
(unless you set it to less than 3 hours).

Read below to understand what this variable actually does. It does not
dictate token lifespan.
>
> On to the questions...
>
> 1. Setting the expiry period down that low doesn't seem to be an optimal
> thing to do from an effectiveness standpoint.  Comments on this?  Am I
> missing something?  Due to the type of user base, all-manual learning isn't
> likely to work well.  Is auto-learning just a waste of resources in this
> case?
>
> 2. If I set up manual learning where false positives and false negatives can
> be manually sent in by users and added to the site-wide configuration, won't
> they also be subject to the (short) expiration period, or is manual learning
> kept permanently?
>   
Manual learning is handled no differently than auto-learning.

However,  it's important to note that the  bayes_expiry_period  does not
dictate token life. It dictates how often expiry check will run
automatically. Basically, SA looks at the database, finds out when the
last expire ran, and if more than bayes_expiry_period has elapsed, it
kicks off an auto-expire. Since you're manually expiring every 3 hours,
your modified bayes_expiry_period never comes into effect.
 
When expiry (either due to the bayes_expiry_period or a manual
force-expire) runs, it checks if the database has more than
"bayes_expiry_max_db_size" tokens in it, SA will attempt to reduce the
database to 75% of bayes_expiry_max_db_size, keeping the most recently
used tokens.

In your case, you have a high learning volume, so this means that every
3 hours (due to your manual sa-learn --force-expire), your database is
going to be reduced to the 112,500 or so most-recently used tokens.

If you want to increase token lifespan, you'd increase 
bayes_expiry_max_db_size  so that more tokens are kept at expire time.



Re: Mondo bayes_toks - millions of entries

Posted by Michael Parker <pa...@pobox.com>.
On Nov 30, 2007, at 1:56 PM, Wes wrote:

> Well, spamd is apparently doing things far more efficiently than
> "sa-learn --restore".  Tokens are loading into the DB much faster than
> the restore, and postmaster is hardly ever a blip in 'top' (at least so
> far).  When running the restore, postmaster was sitting at about 60-80%
> CPU constantly.

Learning normally can take advantage of inserting/updating tokens in  
batches.  When doing a restore it has to insert each token separately.

BTW, while the best effort was put into the postgresql support, I'm  
sure it could use help so if anyone wants to hack on it and submit  
patches I'm certain that the developers would be more than happy to  
take a look.

Michael

Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
Well, spamd is apparently doing things far more efficiently than "sa-learn
--restore".  Tokens are loading into the DB much faster than the restore,
and postmaster is hardly ever a blip in 'top' (at least so far).  When
running the restore, postmaster was sitting at about 60-80% CPU constantly.

Wes



Re: Mondo bayes_toks - millions of entries

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Wes wrote:
> I'm doing the "sa-learn restore" to the PostgreSQL database now.
> Performance is not so good - about 300 tokens per second loaded.  It's going
> to take a while to reload the several million from the backup.
> 
> I am using Mail::SpamAssassin::BayesStore::PgSQL.
> 
> The PostgreSQL log shows it is doing a separate transaction per token loaded.

> I'm guessing this is because the restore is using the same modules as spamd,
> instead of doing a bulk load, which would take a few seconds?  Does it do
> the same thing when updating existing token access times and adding tokens
> from a message?  If so, this would seem to be a rather significant
> bottleneck as opposed to updating everything with one transaction.
> 
> Is this being done to avoid deadlocks?  Deadlocks can be avoided by sorting
> the keys to be updated so that they are always updated in the same order
> (and/or retrying should a deadlock be detected).

I have no idea.  I've never looked at the PgSQL storage engine code. 
Nor have I ever used PostgreSQL.  You could take a look at the code or 
query Michael Parker.

Daryl



Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
I'm doing the "sa-learn restore" to the PostgreSQL database now.
Performance is not so good - about 300 tokens per second loaded.  It's going
to take a while to reload the several million from the backup.

I am using Mail::SpamAssassin::BayesStore::PgSQL.

The PostgreSQL log shows it is doing a separate transaction per token loaded.

11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\353\\\\244\\\\114\\\\145\\\\321}', 0,1,1196373684)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\164\\\\223\\\\254\\\\212\\\\016}', 0,2,1196379608)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\264\\\\260\\\\042\\\\254\\\\337}', 0,1,1196374147)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\144\\\\207\\\\105\\\\341\\\\202}', 0,1,1196374214)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: begin
11-30-2007.18:38:52 postmaster-20565: LOG:  statement:
select put_tokens(2,'{\\\\167\\\\116\\\\332\\\\321\\\\265}', 0,1,1196374269)
11-30-2007.18:38:52 postmaster-20565: LOG:  statement: commit
1

I'm guessing this is because the restore is using the same modules as spamd,
instead of doing a bulk load, which would take a few seconds?  Does it do
the same thing when updating existing token access times and adding tokens
from a message?  If so, this would seem to be a rather significant
bottleneck as opposed to updating everything with one transaction.

Is this being done to avoid deadlocks?  Deadlocks can be avoided by sorting
the keys to be updated so that they are always updated in the same order
(and/or retrying should a deadlock be detected).
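
Something along these lines is what I have in mind - a sketch only, not the
shipped BayesStore::PgSQL code, assuming a DBI handle $dbh, the put_tokens()
function visible in the log above, and token values already escaped into
array-literal form:

    sub put_tokens_batched {
        my ($dbh, $userid, %tokens) = @_;  # token literal => [spam, ham, atime]
        $dbh->begin_work;                  # one transaction for the whole batch
        for my $tok (sort keys %tokens) {  # fixed key order, so no lock cycles
            my ($spam, $ham, $atime) = @{ $tokens{$tok} };
            $dbh->do("select put_tokens($userid, '{$tok}', $spam, $ham, $atime)");
        }
        $dbh->commit;
    }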

Wes



Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
The PostgreSQL experiment turned out not to be as stellar as I had hoped.
With our volume, the disk write load for bayes auto-learn is extremely high,
even with fsync disabled, shared memory increased, etc.  I also ran into some
severe concurrency issues - lots of waiting on locks, even with only one
system doing updates and the others reading.  Auto-vacuum set to 60 seconds
(every 3 minutes for the SpamAssassin table) appears to help tremendously.
I think we'd need a solid state disk, or a SAN with a large buffer, to safely
handle it with a larger number of tokens.  I'm also getting failed expires
due to 'deadlock detected'.

Regrouping, I was looking at benchmarks for QDBM and see it is on the "we
need volunteers" list.  Is this more than just changing the "tie" in the
Bayes DBM store module?

Wes 



Re: Mondo bayes_toks - millions of entries

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Wes wrote:
> One other question on the database...  What happens when the DB is down?

SA continues using scoreset 0 or 1 (instead of 2 or 3), depending on if 
you've got net tests enabled or not.

> Connection refused could be handled quickly if it fails the open and just says
> "ok, no bayes for now".  Waiting on a TCP Connect Abort timer for every
> query attempt would be devastating.

Again, I haven't looked at the code for years.  Connection refused 
should be detected quite quickly though.  If there's a firewall just 
dropping packets (so that you can't really detect a closed port) things 
are going to be less than ideal (I don't know how long it'll take to 
timeout).

Daryl


Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
One other question on the database...  What happens when the DB is down?

Connection refused could be handled quickly if it fails the open and just says
"ok, no bayes for now".  Waiting on a TCP Connect Abort timer for every
query attempt would be devastating.

Load performance has dropped dramatically.  With 163,000 loaded, it is down
to 100/second.  I decided to start with a clean DB and let auto-learn
repopulate it.

Wes



Re: Mondo bayes_toks - millions of entries

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Wes wrote:
> On 11/29/07 2:49 PM, "Daryl C. W. O'Shea" <sp...@dostech.ca> wrote:
> 
>> Even still though, 5 queries times, say, 50ms is a 1/4 of a second that
>> you're idle in that spamd child process.  That leaves you trying to make
>> up for it by runnning more child processes (you've freed up some CPU
>> time by having those children idle so you'll have some CPU time to run
>> more) but you'll never get it all back and you'll be lucky to get even
>> half of the lost throughput back.
>>
>> If you'd like to share a database between distributed MXes/spamd
>> machines you're best off to use replication and limit autolearning to
>> the machines that connect to the master database server.
> 
> Thanks for the details.  That gives me an idea what activity to expect.  One
> DB per location may end up being the way to go.  How well does it handle
> concurrency, if it has to update the last access time of tokens and learn
> new tokens?  Are there any numbers on concurrent servers when it starts to
> bog down?

Sorry, I have no concrete data on that.  Most of my high volume 
customers don't use bayes (usually because of memories of misguided 
configurations in the past or the fear of bayes taking off in the wrong 
direction as it occasionally has a habit of doing).

I would expect, though, that if your spamd and SQL machines are of similar
hardware, you may be able to support a few hundred spamd children per SQL
server.  I could be way off though... it's just a guess.

I'd imagine you'd naturally do this, but for others following along:
rather than switching everything over at once, I would switch one
machine (or a couple of machines, depending on how many you have) over
at a time to using the SQL database, track throughput stats for a day
(so you get a complete day's mail flow cycle and an expiry or two in) and
then add more.  Stop when the average throughput of the SQL-using spamd
machines falls too far below linear.

Selecting a storage engine that supports row level locking could help 
with concurrency... but not always... for MySQL, MyISAM is faster than 
InnoDB, probably due to its faster indexing (and no transaction support
overhead).

See  http://wiki.apache.org/spamassassin/BayesBenchmarkResults  for some 
small scale stats.  Note that I don't think that SDBM's performance will 
scale to really large databases.  Matt Kettler may have input on that 
though.

Also, be sure to read the sql/README.bayes documentation in the SA 
release tarball (make sure you use the PostgreSQL specific storage
module if you're going to use PostgreSQL).
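
For PostgreSQL that boils down to local.cf glue along these lines (the DSN
and credentials here are placeholders, not a recommendation):

    bayes_store_module  Mail::SpamAssassin::BayesStore::PgSQL
    bayes_sql_dsn       DBI:Pg:dbname=bayes;host=dbhost
    bayes_sql_username  sa_user
    bayes_sql_password  sa_pass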

Daryl


Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
On 11/29/07 2:49 PM, "Daryl C. W. O'Shea" <sp...@dostech.ca> wrote:

> Even still though, 5 queries times, say, 50ms is a 1/4 of a second that
> you're idle in that spamd child process.  That leaves you trying to make
> up for it by runnning more child processes (you've freed up some CPU
> time by having those children idle so you'll have some CPU time to run
> more) but you'll never get it all back and you'll be lucky to get even
> half of the lost throughput back.
> 
> If you'd like to share a database between distributed MXes/spamd
> machines you're best off to use replication and limit autolearning to
> the machines that connect to the master database server.

Thanks for the details.  That gives me an idea what activity to expect.  One
DB per location may end up being the way to go.  How well does it handle
concurrency, if it has to update the last access time of tokens and learn
new tokens?  Are there any numbers on concurrent servers when it starts to
bog down?

Wes



Re: Mondo bayes_toks - millions of entries

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Wes wrote:
> On 11/29/07 1:00 PM, "John D. Hardin" <jh...@impsec.org> wrote:
> 
>> Have you considered pushing your autolearn thresholds a bit further
> out, to reduce the number of messages that are eligible for autolearn
>> and thus reduce the growth of the token database?
> 
> I hadn't thought about that, but I'm not sure it would make sense here.  The
> man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold shows:
> 
>        bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
>            The score threshold below which a mail has to score, to be fed
> into SpamAssassin's learning systems automatically as a non-spam message.
> 
>        bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
>            The score threshold above which a mail has to score, to be fed
> into SpamAssassin's learning systems automatically as a spam message.
> 
> Since the mail has already been processed by a commercial scanner, the
> majority of the mail is now good - we're trying to catch leakage.  That
> means most of the auto-learning is good mail.  I'm thinking increasing
> bayes_auto_learn_threshold_nonspam  would be a bad thing, no?

It'd decrease your token count, but it'd decrease the usefulness of 
using bayes by a larger factor.

>> Do not waste any more time trying to get more performance out of DBM.
>> Just about any SQL based database will perform a lot better than DBM
>> will when your bayes database is large.
> 
> That's good feedback.  I was hoping that, but don't quite have the DB up and
> running yet - gotta get it working in the test environment before putting it
> somewhere with a load.  Then the question becomes how much network latency
> can be tolerated before there's a performance problem (e.g. between physical
> locations).

IIRC (it's been about three years since I looked at the code for this) 
tokens are pulled in a loop 100 at a time for a message.  So each 
message is probably going to have to poll the SQL server 5 times (+/- 
another 5/3?) just for tokens.  Add in a couple of other queries 
(especially if it's decided to autolearn the message) and latency starts 
to add up.

The last time I tried sharing a bayes database over the internet, it didn't
go too well at all past a few thousand messages a day (so it wasn't useful at
all).  However, there was a cable modem in use without any traffic
shaping in place to defeat the cable modem's huge buffer, so that could
have had an insane impact on it.

Even still though, 5 queries times, say, 50ms is a 1/4 of a second that 
you're idle in that spamd child process.  That leaves you trying to make 
up for it by running more child processes (you've freed up some CPU
time by having those children idle so you'll have some CPU time to run 
more) but you'll never get it all back and you'll be lucky to get even 
half of the lost throughput back.

If you'd like to share a database between distributed MXes/spamd 
machines you're best off to use replication and limit autolearning to 
the machines that connect to the master database server.

>> If you process a lot of mail and are using autolearn you are going to
>> have a large bayes database, period.  If the database isn't large enough
>> it is going to churn so fast that it'll defeat the purpose of even
>> having a bayes database.
> 
> I had pretty much come to that conclusion, but all the posts I found were
> talking about token databases in the low hundreds of thousands, and I've
> been seeing millions...  Wasn't sure I wasn't overlooking something big.

For a comparison, I've got a $10/month VPS with 128 MB of RAM serving a
MySQL-backed SA bayes database with 2.5 million tokens in the database.
It runs fine.

Daryl


Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.
On 11/29/07 1:00 PM, "John D. Hardin" <jh...@impsec.org> wrote:

> Have you considered pushing your autolearn thresholds a bit further
> out, to reduce the number of messages that are eligible for autolearn
> and thus reduce the growth of the token database?

I hadn't thought about that, but I'm not sure it would make sense here.  The
man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold shows:

       bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
           The score threshold below which a mail has to score, to be fed
into SpamAssassin's learning systems automatically as a non-spam message.

       bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
           The score threshold above which a mail has to score, to be fed
into SpamAssassin's learning systems automatically as a spam message.

Since the mail has already been processed by a commercial scanner, the
majority of the mail is now good - we're trying to catch leakage.  That
means most of the auto-learning is good mail.  I'm thinking increasing
bayes_auto_learn_threshold_nonspam  would be a bad thing, no?

> Do not waste any more time trying to get more performance out of DBM.
> Just about any SQL based database will perform a lot better than DBM
> will when your bayes database is large.

That's good feedback.  I was hoping that, but don't quite have the DB up and
running yet - gotta get it working in the test environment before putting it
somewhere with a load.  Then the question becomes how much network latency
can be tolerated before there's a performance problem (e.g. between physical
locations).

> If you process a lot of mail and are using autolearn you are going to
> have a large bayes database, period.  If the database isn't large enough
> it is going to churn so fast that it'll defeat the purpose of even
> having a bayes database.

I had pretty much come to that conclusion, but all the posts I found were
talking about token databases in the low hundreds of thousands, and I've
been seeing millions...  Wasn't sure I wasn't overlooking something big.
 
Wes



Re: Mondo bayes_toks - millions of entries

Posted by "John D. Hardin" <jh...@impsec.org>.
On Wed, 28 Nov 2007, Wes wrote:

> In 12 hours, the bayes_toks file grows to 160-320 MB, with a
> ballpark of something over 7 million tokens.

Have you considered pushing your autolearn thresholds a bit further 
out, to reduce the number of messages that are eligible for autolearn
and thus reduce the growth of the token database?
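
e.g. something like this in local.cf (values purely illustrative, pushed
out from the 0.1 / 12.0 defaults):

    bayes_auto_learn_threshold_nonspam -1.0
    bayes_auto_learn_threshold_spam    15.0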

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
  does quite what I want. I wish Christopher Robin was here."
				           -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
 26 days until Christmas