Posted to users@spamassassin.apache.org by "Rosenbaum, Larry M." <ro...@ornl.gov> on 2008/10/06 21:42:53 UTC

bayes_token table too big?

SpamAssassin version 3.2.5, running on Perl version 5.8.8, Solaris 9
Using MySQL for Bayes database.

I'm wondering if our Bayes token database is too big, and why.

Based on some posts to this list, I decided to try converting our Bayes and AWL databases to InnoDB to improve performance.  So I copied the database to a non-production MySQL server and tried to convert it there.  It has taken 4 days to convert!  I'm thinking something must be wrong.

Here is the output I'm getting from our Bayes expire job:

Tue Sep 30 00:12:00 EDT 2008 Forcing Bayes expiry run
expired old bayes database entries in 193 seconds
104999743 entries kept, 147355 deleted
token frequency: 1-occurrence tokens: 0.12%
token frequency: less than 8 occurrences: 0.05%
Tue Sep 30 00:15:28 EDT 2008 Done

Wed Oct 1 00:12:00 EDT 2008 Forcing Bayes expiry run
expired old bayes database entries in 210 seconds
105000814 entries kept, 242825 deleted
token frequency: 1-occurrence tokens: 0.11%
token frequency: less than 8 occurrences: 0.06%
Wed Oct 1 00:15:47 EDT 2008 Done

Thu Oct 2 00:12:00 EDT 2008 Forcing Bayes expiry run
expired old bayes database entries in 206 seconds
105032264 entries kept, 239214 deleted
token frequency: 1-occurrence tokens: 0.13%
token frequency: less than 8 occurrences: 0.06%
Thu Oct 2 00:15:39 EDT 2008 Done

And here is the information from the local.cf file:

bayes_expiry_max_db_size  500000

So the config file says 500 thousand tokens, but the database has 105 million entries.  Have I misunderstood something, or is expiry not working correctly?



Re: bayes_token table too big?

Posted by Kai Schaetzl <ma...@conactive.com>.
Larry M. Rosenbaum wrote on Mon, 06 Oct 2008 15:42:53 -0400:

> So I copied
> the database to a non-production MySQL server and tried to convert
> it there.  It has taken 4 days to convert!  I'm thinking something
> must be wrong.

Yes, converting a database with 100 million records will take that long 
or longer.

> So the config file says 500 thousand tokens, but the database has
> 105 million entries.  Have I misunderstood something, or is expiry
> not working correctly?

Maybe. Check the bayes_vars table for the token count and then check how 
many tokens the database actually contains. The expiry code just takes the 
token count from bayes_vars and doesn't check the real record count of 
bayes_token, so if there's a mismatch, things like this can happen.
It happened to me the other way around. After converting to SQL I removed 
all entries older than a year and then ran expiry without updating the 
token count in bayes_vars. Since expiry thought I still had several 
million tokens, it deleted almost the entire database and I had to import 
everything again.
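
The check Kai describes could look like this (a sketch only; table and
column names are assumed from the stock SpamAssassin 3.x bayes_* SQL
schema):

```sql
-- What expiry believes the token count is (per user id):
SELECT id, username, token_count FROM bayes_vars;

-- What the bayes_token table actually contains per user id:
SELECT id, COUNT(*) FROM bayes_token GROUP BY id;

-- If token_count and COUNT(*) disagree for a user, expiry is working
-- from a stale figure and may delete far too much or far too little.
```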

BTW: I'm not seeing output like this when I do an expire:
token frequency: 1-occurrence tokens: 0.13%
token frequency: less than 8 occurrences: 0.06%


Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




Re: bayes_token table too big?

Posted by Kris Deugau <kd...@vianet.ca>.
Rosenbaum, Larry M. wrote:
> 104999743 entries kept, 147355 deleted

> bayes_expiry_max_db_size  500000
> 
> So the config file says 500 thousand tokens, but the database has 105 million entries.  Have I misunderstood something, or is expiry not working correctly?

Check and make sure you haven't accidentally set up per-user Bayes 
databases instead of one global one.  >_<  (Been there, done that; filled 
the RAM disk. Oops...)

Of course, if you really *want* per-user Bayes, well....
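
If a single global database is what you want, the usual fix (a sketch;
the option name is from the SpamAssassin 3.2 documentation, and
"globalbayes" is an arbitrary placeholder username) is to force every
scan onto one SQL user in local.cf:

```
bayes_sql_override_username  globalbayes
```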

-kgd

Re: bayes_token table too big?

Posted by Kai Schaetzl <ma...@conactive.com>.
It's like Theo suggested: you likely have a distribution with many old 
tokens and fewer new ones. If you want to shrink your database, decide 
how many tokens you want to keep (for instance, one million), then 
determine the atime cutoff that fits (i.e. the date where the count of 
tokens with atime > 'your time' is around 1 million). Copy those one 
million records to a new table and swap it for the old table. Do not do 
it the other way around (deleting all records with atime < '....'), as 
that will take much, much longer. Then adjust the bayes_vars table data 
and flush bayes_seen. Make a backup of the new data and then try an 
expiry.
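
In SQL terms the procedure above might look like the following sketch
(assuming the stock bayes_token schema; the cutoff atime 1223000000 is a
hypothetical value you would first tune until step 1 returns roughly one
million rows, and userid 1 is taken from the debug output in this
thread):

```sql
-- 1. Tune the cutoff until roughly one million rows are newer than it:
SELECT COUNT(*) FROM bayes_token WHERE atime > 1223000000;

-- 2. Copy those rows into a fresh table and swap it in atomically:
CREATE TABLE bayes_token_new LIKE bayes_token;
INSERT INTO bayes_token_new
  SELECT * FROM bayes_token WHERE atime > 1223000000;
RENAME TABLE bayes_token TO bayes_token_old,
             bayes_token_new TO bayes_token;

-- 3. Tell expiry about the new count, then flush bayes_seen:
UPDATE bayes_vars
  SET token_count = (SELECT COUNT(*) FROM bayes_token)
  WHERE id = 1;
TRUNCATE TABLE bayes_seen;
```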

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




RE: bayes_token table too big?

Posted by "Rosenbaum, Larry M." <ro...@ornl.gov>.
> From: Theo Van Dinter [mailto:felicity@apache.org]
>
> On Mon, Oct 06, 2008 at 03:42:53PM -0400, Rosenbaum, Larry M. wrote:
> > And here is the information from the local.cf file:
> >
> > bayes_expiry_max_db_size  500000
> >
> > So the config file says 500 thousand tokens, but the database has 105
> million entries.  Have I misunderstood something, or is expiry not
> working correctly?
>
> Do an expire run w/ "-D bayes" and show the expiry details.

Mon Oct 6 16:11:00 EDT 2008 Forcing Bayes expiry run
[25080] dbg: bayes: using username: root
[25080] dbg: bayes: database connection established
[25080] dbg: bayes: found bayes db version 3
[25080] dbg: bayes: Using userid: 1
[25080] dbg: bayes: bayes journal sync starting
[25080] dbg: bayes: bayes journal sync completed
[25080] dbg: bayes: expiry starting
[25080] dbg: bayes: expiry check keep size, 0.75 * max: 375000
[25080] dbg: bayes: token count: 105095925, final goal reduction size: 104720925
[25080] dbg: bayes: first pass? current: 1223323871, Last: 1223266468, atime: 43200, count: 91425, newdelta: 37, ratio: 1145.42986054143, period: 43200
[25080] dbg: bayes: can't use estimation method for expiry, unexpected result, calculating optimal atime delta (first pass)
[25080] dbg: bayes: expiry max exponent: 9
[25080] dbg: bayes: atime token reduction
[25080] dbg: bayes: ======== ===============
[25080] dbg: bayes: 43200 69517
[25080] dbg: bayes: 86400 16821
[25080] dbg: bayes: 172800 6
[25080] dbg: bayes: 345600 6
[25080] dbg: bayes: 691200 6
[25080] dbg: bayes: 1382400 6
[25080] dbg: bayes: 2764800 6
[25080] dbg: bayes: 5529600 5
[25080] dbg: bayes: 11059200 3
[25080] dbg: bayes: 22118400 3
[25080] dbg: bayes: first pass decided on 43200 for atime delta
[25080] dbg: bayes: expiry completed
expired old bayes database entries in 118 seconds
105026416 entries kept, 69509 deleted
token frequency: 1-occurrence tokens: 0.15%
token frequency: less than 8 occurrences: 0.05%
Mon Oct 6 16:13:09 EDT 2008 Done

Re: bayes_token table too big?

Posted by Theo Van Dinter <fe...@apache.org>.
On Mon, Oct 06, 2008 at 03:42:53PM -0400, Rosenbaum, Larry M. wrote:
> And here is the information from the local.cf file:
> 
> bayes_expiry_max_db_size  500000
> 
> So the config file says 500 thousand tokens, but the database has 105 million entries.  Have I misunderstood something, or is expiry not working correctly?

Do an expire run w/ "-D bayes" and show the expiry details.

It's likely that your tokens are distributed such that there's no good 
expiry delta to use, so each run removes as many as it can without going 
over the target (it's like The Price Is Right...)
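
That "closest without going over" pass can be sketched as follows (a
hypothetical simplification, not SpamAssassin's actual expiry code; the
deltas, reduction counts, and ~104.7M goal are taken from the debug
output quoted in this thread):

```python
def pick_atime_delta(reductions, goal):
    """Pick the atime delta whose token reduction comes closest to the
    reduction goal without exceeding it. When no delta reaches the goal,
    this selects the one that deletes the most tokens."""
    best = None
    for delta, reduction in sorted(reductions.items()):
        if reduction <= goal and (best is None or reduction > reductions[best]):
            best = delta
    return best

# Deltas and token reductions from the "-D bayes" output above.
reductions = {43200: 69517, 86400: 16821, 172800: 6, 345600: 6}
print(pick_atime_delta(reductions, 104720925))  # -> 43200, as in the log
```

With a goal of ~104.7 million and at most ~70 thousand tokens removable
at any delta, every run falls hopelessly short of the target, which
matches the tiny nightly deletions Larry is seeing.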

-- 
Randomly Selected Tagline:
"... and on that side you have a 50kg kid, and that's a pretty good sized
  kid..."                  - Prof. Farr