Posted to users@spamassassin.apache.org by polloxx <po...@gmail.com> on 2008/05/06 16:16:06 UTC

Bayes database

Hi,

Our Bayesian database has become dirty: too many ham mails get a
score of BAYES_99, particularly for one of our customer domains.
Is there a way to sanitize the database without clearing the whole thing?

What are the best practices to keep your Bayes database clean?

Thanks,
P.

Re: Bayes database

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
On 06.05.08 16:16, polloxx wrote:
> Our Bayesian database has become dirty: too many ham mails get a
> score of BAYES_99, particularly for one of our customer domains.
> Is there a way to sanitize the database without clearing the whole thing?

Do you keep all the mail you've used for learning? If so, re-check it.

> What are the best practices to keep your Bayes database clean?

I guess correct training should fix/prevent the problem. Autolearn might
cause problems, especially when its score thresholds are set too low. Use
network checks too; they may save you from mail that is not caught by other
rules but is listed in DCC/Razor etc.
Check all mails that were used for autolearn, and retrain every mail whose
BAYES score is not proper (probably all hams that do not get BAYES_00 and
all spams that do not get BAYES_99).
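
A minimal sketch of that check, assuming each saved message still carries
the X-Spam-Status header that SpamAssassin normally adds, with the rule
names in its tests= list. The function names bayes_rule and
needs_retraining are made up for illustration, and treating a message with
no BAYES rule at all as "not proper" is an assumption too.

```python
import email
import re


def bayes_rule(raw_message):
    """Return the BAYES_NN rule named in X-Spam-Status, or None if absent."""
    msg = email.message_from_string(raw_message)
    status = str(msg.get("X-Spam-Status", ""))
    # The header is usually folded over several lines; drop all whitespace
    # so a rule name split by folding is still found.
    status = re.sub(r"\s+", "", status)
    match = re.search(r"BAYES_\d+", status)
    return match.group(0) if match else None


def needs_retraining(raw_message, is_spam):
    """True when the Bayes verdict disagrees with the known classification.

    Hypothetical policy taken from the advice above: hams should hit
    BAYES_00 and spams should hit BAYES_99. A message with no BAYES rule
    is also flagged (assumption).
    """
    expected = "BAYES_99" if is_spam else "BAYES_00"
    return bayes_rule(raw_message) != expected
```

Messages flagged this way are the candidates to feed back to sa-learn with
the correct classification.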
 
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
M$ Win's are shit, do not use it !

Re: Bayes database

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, May 06, 2008 at 04:16:06PM +0200, polloxx wrote:
> Our Bayesian database has become dirty: too many ham mails get a
> score of BAYES_99, particularly for one of our customer domains.
>
> Is there a way to sanitize the database without clearing the whole thing?
> What are the best practices to keep your Bayes database clean?

The Bayes DB is simply as useful as what it's trained with.  If you
(or your customers, etc,) are training the DB for one thing, it's not
going to work for other things.

This is one of the reasons that site-wide DBs aren't as good as personal
ones -- your definition of ham/spam is at least somewhat different from
someone else's, and so the DB won't work as well for either of you.

It's worth noting that lots of people seem to treat "report spam" as
"delete" -- anything they don't want to see again is reported as spam,
instead of dealing with not having the mails sent in the first place.
(I've heard about everything from cronjob output to meeting notices to
mailing lists to ...)

As for sanitizing the DB ...  I guess it depends what that means.  If you know
there were inappropriate mails trained, one way or the other, and you still
have them, you can relearn them (or forget them) easily.  If you don't have
the mails, then you don't know what the tokens in question are, and so you
can't do anything short of restarting the DB and doing a better job w/
training the next time around.
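
As a rough sketch of the relearn/forget step, assuming the mistrained
mails are still on disk and sa-learn is on the PATH: the helper names
sa_learn_command and relearn are hypothetical, and the example file path
is a placeholder. Only the sa-learn options --ham, --spam and --forget are
used. By default nothing is executed, so the command can be inspected
first.

```python
import subprocess

# Map a training action to the corresponding sa-learn option.
SA_LEARN_FLAGS = {"ham": "--ham", "spam": "--spam", "forget": "--forget"}


def sa_learn_command(action, paths):
    """Build the sa-learn argument list for the given action and files."""
    if action not in SA_LEARN_FLAGS:
        raise ValueError("unknown action: %r" % (action,))
    return ["sa-learn", SA_LEARN_FLAGS[action], *paths]


def relearn(action, paths, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = sa_learn_command(action, paths)
    if not dry_run:
        # check=True raises CalledProcessError if sa-learn exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder path: a cron report that a user wrongly reported as spam.
    print(relearn("forget", ["mistrained/cron-report.eml"]))
```

Run it as the user that owns the Bayes DB, since that is the database
sa-learn will update.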

Hope this helps.

-- 
Randomly Selected Tagline:
"The only way you'll get me to talk is through slow painful torture, and I
 don't think you've got the grapes."     - Stewie on Family Guy