Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/10/08 18:39:15 UTC
Re: Still "fishy" problems with bayes expiry in SA 3.0
Kai Schaetzl writes:
> The problem seems to exist on all of our Bayes databases, and I think the
> cause is not "bad" data but simply the way the SA expiry algorithm works.
> There are no negative atimes or atimes in the future. If the database
> contains tokens from a wide time range, it is not able to calculate a
> reasonable expiry atime and quits. This typically happens when you
> set bayes_expiry_max_db_size to a high value and it takes some time to
> fill up. When the db finally hits the limit and tries its first
> expire after maybe months of never expiring, it fails.
So you wind up with a very big, but unexpirable, db? I think
that would be worth a bug, yes.
in my opinion, expiry should always do *something* to get the db
below a target size, even if that *something* isn't strictly token
removal by atime.
--j.
> Can something be done about the problem; shall I submit a bug on it?
> (I already submitted bug #3872, where I mention this problem, but it's not
> directly related to bug #3872.) SA could either do more iterations or try
> a completely different approach. F.i. if it is told to expire 50.000
> tokens, it should remove the oldest entries until 50.000 tokens are
> removed and then stop. I understand that this would take a bit longer,
> since the db needs to be sorted first, but it should be feasible.
>
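A minimal sketch of the removal strategy suggested above: sort the tokens by atime and drop the oldest entries until the reduction goal is met. The in-memory dict standing in for the Bayes db is made up for illustration; SA's real store is a Berkeley DB or SQL backend.

```python
# Hypothetical sketch of "remove the oldest entries until N tokens are
# removed, then stop".  tokens maps token -> atime (seconds since epoch).
def expire_oldest(tokens, reduction_goal):
    # Sort once by atime, oldest first (the sort the mail says the db
    # would need before expiry), then drop the first reduction_goal entries.
    by_age = sorted(tokens.items(), key=lambda kv: kv[1])
    return dict(by_age[reduction_goal:])

kept = expire_oldest({"tokA": 100, "tokB": 300, "tokC": 200}, reduction_goal=2)
# Only the newest token ("tokB") survives.
```

The cost is one O(n log n) sort per expiry run, but the outcome is deterministic: exactly the requested number of tokens is removed, regardless of how the atimes are distributed.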
> If this problem isn't fixed, using "bayes_auto_expire 1" is a gamble.
>
> Here are examples. (Each one is from a different database, since I don't
> have "before and after" examples from the same db, but they are very
> similar in size and structure. Some are also version 2, not 3.)
>
> n9:/home/spamd/bayes # sa-learn --dump magic
> 0.000 0 3 0 non-token data: bayes db version
> 0.000 0 19760 0 non-token data: nspam
> 0.000 0 5706 0 non-token data: nham
> 0.000 0 736251 0 non-token data: ntokens
> 0.000 0 1052059392 0 non-token data: oldest atime
> 0.000 0 1097242496 0 non-token data: newest atime
> 0.000 0 1097248297 0 non-token data: last journal sync atime
> 0.000 0 1097248490 0 non-token data: last expiry atime
> 0.000 0 29754654 0 non-token data: last expire atime delta
> 0.000 0 36 0 non-token data: last expire reduction count
>
> This db contains tokens going back to March 2003 or so. It works quite
> well and marks almost every spam message with BAYES_99. Size is about 20
> MB; max_db_size was set to 1.000.000, which made it skip any expire for
> some time (I don't know from when to when).
>
> Here's the failed result for a forced expire (with max_db_size set to
> 500.000).
>
> debug: bayes: expiry check keep size, 0.75 * max: 375000
> debug: bayes: token count: 736251, final goal reduction size: 361251
> debug: bayes: First pass? Current: 1097248298, Last: 1096983812, atime: 29754654, count: 36, newdelta: 2965, ratio: 10034.75, period: 43200
> debug: bayes: Can't use estimation method for expiry, something fishy,
> calculating optimal atime delta (first pass)
> debug: bayes: expiry max exponent: 9
> debug: bayes: atime token reduction
> debug: bayes: ======== ===============
> debug: bayes: 43200 735241
> debug: bayes: 86400 734058
> debug: bayes: 172800 733218
> debug: bayes: 345600 731427
> debug: bayes: 691200 728680
> debug: bayes: 1382400 721684
> debug: bayes: 2764800 712668
> debug: bayes: 5529600 679017
> debug: bayes: 11059200 668118
> debug: bayes: 22118400 553162
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
> debug: Syncing complete.
>
> Finally, after setting bayes_expiry_max_db_size to 100.000, the expire
> works, because the reduction goal is high enough, and it expires down to
> 162.000. Except that I didn't want to throw out more than 500.000 tokens :-(
>
> Here's the result after expiring so many tokens (remember, this is not the
> same db, it was some days ago on another machine!)
>
> 0.000 0 2 0 non-token data: bayes db version
> 0.000 0 19172 0 non-token data: nspam
> 0.000 0 5379 0 non-token data: nham
> 0.000 0 162010 0 non-token data: ntokens
> 0.000 0 1074822619 0 non-token data: oldest atime
> 0.000 0 1096936738 0 non-token data: newest atime
> 0.000 0 0 0 non-token data: last journal sync atime
> 0.000 0 1096992499 0 non-token data: last expiry atime
> 0.000 0 22118400 0 non-token data: last expire atime delta
> 0.000 0 553013 0 non-token data: last expire reduction count
>
> but the problem already hits again with the next --force-expire:
>
> debug: bayes: found bayes db version 2
> debug: bayes: expiry check keep size, 75% of max: 75000
> debug: bayes: expiry keep size too small, resetting to 100,000 tokens
> debug: bayes: token count: 162010, final goal reduction size: 62010
> debug: bayes: First pass? Current: 1096992487, Last: 1096988477, atime: 22118400, count: 553013, newdelta: 197254680, ratio: 8.91812610869215
> debug: bayes: Can't use estimation method for expiry, something fishy,
> calculating optimal atime delta (first pass)
> debug: bayes: atime token reduction
> debug: bayes: ======== ===============
> debug: bayes: 43200 162006
> debug: bayes: 86400 162006
> debug: bayes: 172800 162006
> debug: bayes: 345600 162006
> debug: bayes: 691200 162006
> debug: bayes: 1382400 162006
> debug: bayes: 2764800 161954
> debug: bayes: 5529600 130225
> debug: bayes: 11059200 119126
> debug: bayes: 22118400 0
> debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
>
> This was a few days ago. Today the expiry finally worked again and
> removed about a thousand tokens. And, again, the next forced expiry
> doesn't work. Maybe it will work again in three days. Here's the magic
> dump at the moment:
>
> 0.000 0 2 0 non-token data: bayes db version
> 0.000 0 19172 0 non-token data: nspam
> 0.000 0 5379 0 non-token data: nham
> 0.000 0 160600 0 non-token data: ntokens
> 0.000 0 1075078200 0 non-token data: oldest atime
> 0.000 0 1097195892 0 non-token data: newest atime
> 0.000 0 0 0 non-token data: last journal sync atime
> 0.000 0 1097250405 0 non-token data: last expiry atime
> 0.000 0 22118400 0 non-token data: last expire atime delta
> 0.000 0 1410 0 non-token data: last expire reduction count
>
> Kai
Re: Still "fishy" problems with bayes expiry in SA 3.0
Posted by Kai Schaetzl <ma...@conactive.com>.
Justin Mason wrote on Fri, 08 Oct 2004 09:39:15 -0700:
> So you wind up with a very big, but unexpirable, db?
Yes. I can expire it with the trick mentioned, but then it blows away most
of the db. And the next expire fails again until I play other tricks or
wait long enough. F.i. I can dump the db, change the atime delta to a value
which would make sa-learn expire a few tokens, and then import the whole
dump again. I did this once in the past when I had a database which
contained negative values and values in the future. I stripped all those
wrong tokens, but the expire would still not work, because it uses the
last expire atime delta as a parameter in the expire calculation. So I had
to calculate a delta which would stop it thinking "fishy" (= produce a
good ratio, if I remember right), replace it in the dump, and then import
all of that again. Until a better method is found, it would help immensely
if we could provide sa-learn with "faked" values, so that it doesn't go
into the "fishy" iterations at all. Or just be able to give it an expire
atime delta to use, instead of it trying to calculate one itself.
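The dump-editing workaround described here can be sketched as a small filter over a `sa-learn --backup` dump: rewrite the value column of the "last expire atime delta" line before restoring. The field layout (value in the third column) is inferred from the magic dumps above; treat this as an illustration, not a supported interface.

```python
import re

# Hypothetical helper for the workaround: patch the "last expire atime
# delta" value in a dumped Bayes db so the next expiry's estimation step
# starts from a sane delta.  Column layout inferred from `--dump magic`.
def patch_expire_delta(dump_lines, new_delta):
    pat = re.compile(r"^(\S+\s+\S+\s+)\S+(\s+.*last expire atime delta\s*)$")
    out = []
    for line in dump_lines:
        m = pat.match(line)
        if m:
            # Replace only the value column, preserving the rest of the line.
            line = m.group(1) + str(new_delta) + m.group(2)
        out.append(line)
    return out
```

One would then feed the patched dump back with `sa-learn --restore`; whether the estimation pass accepts the new delta still depends on the ratio check the debug output calls "fishy".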
> I think that would be worth a bug, yes.
>
> in my opinion, expiry should always do *something* to get the db
> below a target size, even if that *something* isn't strictly token
> removal by atime.
Yes, at least I should be able to rely on the expiry. F.i. the problem now
is that with auto-expiry it suddenly hits the threshold and tries to
expire on every new pass - and can't. This is like a DoS. When I shut off
auto-expiry to avoid that, it doesn't do any good other than avoiding the
processing delay; I still can't expire.
Proposal:
I think a good way would be to use a removal percentage plus a *minimum*,
both configurable in local.cf. The db then needs to be sorted before
expiry, which I think we don't do now. This takes a bit longer but is much
more reliable. Auto-expiry could still use the current method *when it
works*, but stop doing any iterations when it detects something "fishy".
This would stop those massive time-outs. A forced expire would then use
the percentage method *only*. This way I could switch off auto-expire and
run a (f.i.) 1% expire each night until it reaches a minimum. If the bayes
db is at the minimum, it won't expire at all. So a given minimum would
avoid the small chance of slashing my db to almost nothing if it grows
too slowly.
Once the db is sorted you can start removing at the beginning of the file
and stop when you reach the reduction goal. F.i.: db of 1.000.000 tokens,
reduction percentage 1%, minimum 500.000 => reduction goal = 10.000.
Of course, if we stop right at the 10.000th removal we are likely to keep
some tokens with the same atime as ones we just removed. But does this
really matter? We could also take the atime of the last token removed
when we were about to stop and expire all remaining tokens with that
atime. So instead of removing exactly 10.000 tokens we may remove 10.312.
Again, I think it doesn't matter.
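The whole proposal fits in a few lines. A sketch, with hypothetical parameter names (reduction_pct and min_db_size are not real local.cf options), including the "expire all remaining tokens of the boundary atime" variant:

```python
# Sketch of the percentage-plus-minimum expiry proposed above.
# tokens maps token -> atime; parameter names are made up.
def percentage_expire(tokens, reduction_pct, min_db_size):
    ntokens = len(tokens)
    if ntokens <= min_db_size:
        return tokens                       # at the minimum: no expiry
    # Percentage goal, capped so we never shrink below the minimum.
    goal = min(ntokens * reduction_pct // 100, ntokens - min_db_size)
    by_age = sorted(tokens.items(), key=lambda kv: kv[1])
    # Extend the cut past the goal so all tokens sharing the boundary
    # atime go together (the "10.312 instead of 10.000" case).
    cut = goal
    boundary = by_age[goal - 1][1] if goal else None
    while cut < ntokens and by_age[cut][1] == boundary:
        cut += 1
    return dict(by_age[cut:])
```

With Kai's numbers (1.000.000 tokens, 1%, minimum 500.000) the goal comes out at 10.000, and a db sitting at or below 500.000 tokens is left untouched, so nightly forced expiries cannot slash the db toward zero.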
Does this proposal sound reasonable? I could then file a bug outlining it.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org