Posted to users@spamassassin.apache.org by Kai Schaetzl <ma...@conactive.com> on 2004/10/08 17:59:09 UTC

Still "fishy" problems with bayes expiry in SA 3.0

The problem seems to exist on all of our Bayes databases, and I think the 
cause is not "bad" data but simply the way the SA expiry algorithm works. 
There are no negative atimes or atimes in the future. If the database 
contains tokens from a wide time range, SA is not able to calculate a 
reasonable expiry atime and quits. This typically happens when you 
set bayes_expiry_max_db_size to a high value and the db takes some time to 
fill up. When it finally hits the limit and wants to start its first 
expire, after maybe months of never expiring, it fails.

Can something be done about the problem? Shall I submit a bug on it? 
(I already submitted bug #3872, where I mention this problem, but it's not 
directly related to that bug.) SA could either do more iterations or try 
a completely different approach. For instance, if it is told to expire 
50,000 tokens, it should remove the oldest entries until 50,000 tokens are 
removed and then stop. I understand that this would take a bit longer, 
since the db needs to be sorted first, but it should be feasible.
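
To make the "remove the oldest N tokens" idea concrete, here is a minimal Python sketch. It is not SpamAssassin code: the dict of token -> atime is a hypothetical in-memory stand-in for the Bayes db, and expire_oldest is a name made up for illustration. It uses a partial sort (heapq.nsmallest), so the whole db does not have to be sorted.

```python
import heapq


def expire_oldest(tokens, n_expire):
    """Return a copy of `tokens` without the n_expire entries that have
    the smallest atime.

    tokens   -- dict mapping token -> atime (hypothetical stand-in for the db)
    n_expire -- number of tokens to remove
    """
    if n_expire <= 0:
        return dict(tokens)
    # Partial sort: only the n_expire oldest entries are extracted.
    oldest = heapq.nsmallest(n_expire, tokens.items(), key=lambda kv: kv[1])
    doomed = {tok for tok, _atime in oldest}
    return {tok: atime for tok, atime in tokens.items() if tok not in doomed}


if __name__ == "__main__":
    db = {"viagra": 1052059392, "meeting": 1097242496, "offer": 1074822619}
    print(expire_oldest(db, 2))  # keeps only the most recently seen token
```

With this approach the number of removed tokens always matches the request exactly, whatever the distribution of atimes looks like.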

If this problem isn't fixed, using "bayes_auto_expire 1" is a gamble.

Here are examples. (Each one is from a different database, since I don't 
have examples from the same db "before and after", but they are very 
similar in size and structure. Some are also version 2 and not 3.)

n9:/home/spamd/bayes # sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      19760          0  non-token data: nspam
0.000          0       5706          0  non-token data: nham
0.000          0     736251          0  non-token data: ntokens
0.000          0 1052059392          0  non-token data: oldest atime
0.000          0 1097242496          0  non-token data: newest atime
0.000          0 1097248297          0  non-token data: last journal sync atime
0.000          0 1097248490          0  non-token data: last expiry atime
0.000          0   29754654          0  non-token data: last expire atime delta
0.000          0         36          0  non-token data: last expire reduction count

This db contains tokens going back to March 2003 or so. It works quite 
fine and marks almost every spam message with BAYES_99. Size is about 20 
MB. max_db_size was set to 1,000,000, which made it skip any expire for 
some time (I don't know from when to when).
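
As a rough check of the "wide time range" theory, the following Python sketch compares the atime span of the dump above with the largest atime delta the expiry calculation tries. The delta series 43200 * 2^k for k = 0..9 is an assumption read off the debug output further down (first delta 43200, "expiry max exponent: 9"); it is not taken from the SA source.

```python
from datetime import datetime, timezone

# Values copied from the "sa-learn --dump magic" output above.
oldest_atime = 1052059392
newest_atime = 1097242496


def to_utc(ts):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Time range covered by the tokens in the db, in days.
span_days = (newest_atime - oldest_atime) / 86400

# Assumed delta series: 12 hours, doubled up to 9 times.
deltas = [43200 * 2 ** k for k in range(10)]
largest_delta_days = deltas[-1] / 86400

if __name__ == "__main__":
    print("oldest:", to_utc(oldest_atime).date())
    print("newest:", to_utc(newest_atime).date())
    print("span in days:", round(span_days, 1))
    print("largest delta in days:", largest_delta_days)
```

The db spans about 523 days, while the largest delta is 256 days, so a large share of the tokens can be older than every delta that gets tried.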

Here's the failed result for a forced expire (with max_db_size set to 
500,000).

debug: bayes: expiry check keep size, 0.75 * max: 375000
debug: bayes: token count: 736251, final goal reduction size: 361251
debug: bayes: First pass?  Current: 1097248298, Last: 1096983812, atime: 29754654, count: 36, newdelta: 2965, ratio: 10034.75, period: 43200
debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
debug: bayes: expiry max exponent: 9
debug: bayes: atime     token reduction
debug: bayes: ========  ===============
debug: bayes: 43200     735241
debug: bayes: 86400     734058
debug: bayes: 172800    733218
debug: bayes: 345600    731427
debug: bayes: 691200    728680
debug: bayes: 1382400   721684
debug: bayes: 2764800   712668
debug: bayes: 5529600   679017
debug: bayes: 11059200  668118
debug: bayes: 22118400  553162
debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
debug: Syncing complete.
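
The failure can be replayed with a few lines of Python. This is a sketch of the selection rule as the debug output suggests it, not the actual SA implementation: keep 75% of max_db_size (but at least 100,000 tokens, as a later log in this post shows), then pick the delta that removes the most tokens without exceeding the goal. The 1000-token minimum and the function names are assumptions made for illustration.

```python
# (atime delta, token reduction) pairs copied from the debug output above.
TABLE = [
    (43200, 735241), (86400, 734058), (172800, 733218), (345600, 731427),
    (691200, 728680), (1382400, 721684), (2764800, 712668),
    (5529600, 679017), (11059200, 668118), (22118400, 553162),
]


def keep_size(max_db_size, min_db_size=100_000):
    """Tokens to keep: 75% of the maximum, but never below the minimum."""
    return max(int(0.75 * max_db_size), min_db_size)


def pick_delta(table, goal, min_reduction=1000):
    """Return the delta that removes the most tokens without exceeding
    `goal`, or None if no delta qualifies (assumed rule, see lead-in)."""
    candidates = [(delta, red) for delta, red in table if red <= goal]
    if not candidates:
        return None  # every delta would expire more than the goal
    delta, reduction = min(candidates)  # smallest delta = largest reduction
    return delta if reduction >= min_reduction else None


if __name__ == "__main__":
    ntokens = 736251
    for max_db_size in (500_000, 100_000):
        goal = ntokens - keep_size(max_db_size)
        print(max_db_size, goal, pick_delta(TABLE, goal))
```

For max_db_size 500,000 the goal is 361251, but even the largest delta would remove 553162 tokens, so nothing qualifies. With max_db_size 100,000 the goal rises to 636251 and the 22118400 delta becomes usable.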


Finally, after setting bayes_expiry_max_db_size to 100,000, the expire 
works because the reduction goal is high enough, and it expires down to 
162,000 tokens. It's just that I didn't want to throw out more than 
500,000 tokens :-(

Here's the result after expiring so many tokens (remember, this is not the 
same db, it was some days ago on another machine!)

0.000          0          2          0  non-token data: bayes db version
0.000          0      19172          0  non-token data: nspam
0.000          0       5379          0  non-token data: nham
0.000          0     162010          0  non-token data: ntokens
0.000          0 1074822619          0  non-token data: oldest atime
0.000          0 1096936738          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1096992499          0  non-token data: last expiry atime
0.000          0   22118400          0  non-token data: last expire atime delta
0.000          0     553013          0  non-token data: last expire reduction count

But the problem hits again with the very next --force-expire:

debug: bayes: found bayes db version 2
debug: bayes: expiry check keep size, 75% of max: 75000
debug: bayes: expiry keep size too small, resetting to 100,000 tokens
debug: bayes: token count: 162010, final goal reduction size: 62010
debug: bayes: First pass?  Current: 1096992487, Last: 1096988477, atime: 22118400, count: 553013, newdelta: 197254680, ratio: 8.91812610869215
debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
debug: bayes: atime     token reduction
debug: bayes: ========  ===============
debug: bayes: 43200     162006
debug: bayes: 86400     162006
debug: bayes: 172800    162006
debug: bayes: 345600    162006
debug: bayes: 691200    162006
debug: bayes: 1382400   162006
debug: bayes: 2764800   161954
debug: bayes: 5529600   130225
debug: bayes: 11059200  119126
debug: bayes: 22118400  0
debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.

This was a few days ago. Today, finally, the expiry worked again and 
removed about a thousand tokens. And, again, the next forced expiry doesn't 
work. Maybe it will work again in three days. Here's the magic dump at the 
moment:

0.000          0          2          0  non-token data: bayes db version
0.000          0      19172          0  non-token data: nspam
0.000          0       5379          0  non-token data: nham
0.000          0     160600          0  non-token data: ntokens
0.000          0 1075078200          0  non-token data: oldest atime
0.000          0 1097195892          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1097250405          0  non-token data: last expiry atime
0.000          0   22118400          0  non-token data: last expire atime delta
0.000          0       1410          0  non-token data: last expire reduction count





Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org