Posted to users@spamassassin.apache.org by Pieter Vanmeerbeek <pi...@able.be> on 2006/03/23 10:58:01 UTC

Bayesian filtering : the full story and question

Hi,

I'm investigating Bayesian filtering in SpamAssassin, and although a lot
of info is available on it, I could not find the full story in one place.
I've gathered what I found and written up my own version. Can anyone correct
me if I'm wrong? And can somebody answer the question at the bottom?

Thanks,
Pieter

Working principle for scoring mail with Bayesian filtering:
-----------------------------------------------------------
Bayesian filtering works by gathering tokens (i.e. alphanumerics and some
special chars) and recording the number of their occurrences in both spam and
ham mail. From these counts a spam probability can be assigned to each
token (close to 0.01 for a token seen almost only in ham, close to 0.99 for
one seen almost only in spam). This system is called naïve Bayesian
learning because it does not take relations between different tokens into
account.
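
As a rough illustration of the per-token step, here is a minimal Python sketch. It is not SpamAssassin's actual code: the function name, its parameters and the simple ratio formula are my own assumptions, and the clamp to 0.01 / 0.99 is taken from the figures quoted above.

```python
def token_spam_probability(spam_hits, ham_hits, total_spam, total_ham,
                           floor=0.01, ceiling=0.99):
    """Estimate how 'spammy' one token is from its counts in each corpus.

    Hypothetical helper for illustration only; the real filter uses a
    more refined estimate.
    """
    if total_spam <= 0 or total_ham <= 0:
        raise ValueError("need at least one trained message of each type")
    spam_ratio = spam_hits / total_spam   # share of spam mails with the token
    ham_ratio = ham_hits / total_ham      # share of ham mails with the token
    if spam_ratio + ham_ratio == 0:
        return 0.5                        # token never seen: no evidence
    prob = spam_ratio / (spam_ratio + ham_ratio)
    # Clamp so that a single token can never be fully decisive.
    return min(ceiling, max(floor, prob))
```

A token seen in 50 of 100 spam mails and in no ham mail would come out at the 0.99 ceiling under these assumptions.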

When a new e-mail arrives, the 15 tokens with the most extreme (highest or
lowest) probabilities are used to compute a spam probability for the whole
e-mail.
Depending on this value, one SpamAssassin rule matches and adds its score
to the total SpamAssassin score:

23_bayes.cf:

body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.05')
...
body BAYES_95 eval:check_bayes('0.95', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')
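
To make the step from token probabilities to a BAYES_xx rule concrete, here is a simplified Python sketch. It combines the most extreme tokens with a plain naive Bayes product, which is an assumption on my part and not necessarily the combiner SpamAssassin really uses. The rule table only contains the four ranges quoted above (the elided ones are left out), and the handling of the range boundaries is also assumed.

```python
import math

# Only the ranges quoted above; the elided rules in between are left out.
BAYES_RULES = [
    ("BAYES_00", 0.00, 0.01),
    ("BAYES_05", 0.01, 0.05),
    ("BAYES_95", 0.95, 0.99),
    ("BAYES_99", 0.99, 1.00),
]

def combine(token_probs, keep=15):
    """Naive Bayes combination of the `keep` probabilities furthest from 0.5.

    Assumes every probability is already clamped to [0.01, 0.99].
    """
    extreme = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    if not extreme:
        return 0.5                      # no tokens, no opinion
    # Sum logs instead of multiplying so long products do not underflow.
    log_spam = sum(math.log(p) for p in extreme)
    log_ham = sum(math.log(1.0 - p) for p in extreme)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def matching_rule(prob):
    """Return the name of the quoted rule whose range contains prob, if any."""
    for name, low, high in BAYES_RULES:
        # Assumed boundaries: low inclusive, high exclusive, except 1.00.
        if low <= prob < high or (high == 1.00 and prob == 1.00):
            return name
    return None                         # falls into one of the elided ranges
```

For example, `matching_rule(combine([0.99] * 15))` lands in BAYES_99, while a mail whose tokens are all neutral falls into a range not listed here.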

50_scores.cf:

# make the bayes scores unmutable (as discussed in bug 4505)
score BAYES_00 0.0001 0.0001 -2.312 -2.599
score BAYES_05 0.0001 0.0001 -1.110 -1.110
...
score BAYES_95 0.0001 0.0001 3.0 3.0
score BAYES_99 0.0001 0.0001 3.5 3.5

The last two score sets are the ones used when Bayes is enabled (the last one
is network checks + Bayes, the one before it is local checks + Bayes).
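
The choice between the four columns can be sketched like this in Python. The score values are copied from the lines quoted above; the function names are made up for the example, and the index arithmetic reflects my understanding of how the four score sets are ordered.

```python
# Four scores per rule, in the order they appear on the "score" lines above.
SCORES = {
    "BAYES_00": (0.0001, 0.0001, -2.312, -2.599),
    "BAYES_05": (0.0001, 0.0001, -1.110, -1.110),
    "BAYES_95": (0.0001, 0.0001, 3.0, 3.0),
    "BAYES_99": (0.0001, 0.0001, 3.5, 3.5),
}

def score_set_index(use_bayes, use_network):
    """0 = local only, 1 = network, 2 = local + Bayes, 3 = network + Bayes."""
    return (2 if use_bayes else 0) + (1 if use_network else 0)

def rule_score(rule, use_bayes, use_network):
    """Pick the score of `rule` from the column matching the active setup."""
    return SCORES[rule][score_set_index(use_bayes, use_network)]
```

With Bayes and network checks both enabled, `rule_score("BAYES_00", True, True)` picks the last column, -2.599.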

Restrictions:
------------
The scoring system described above only works once a certain number
of mails has been trained (see below for training). By default this is
200 spam and 200 ham mails.
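
As a tiny sketch of this restriction in Python (the function and parameter names are hypothetical, and treating exactly 200 as "enough" is an assumption):

```python
def bayes_scoring_active(spam_learned, ham_learned, min_spam=200, min_ham=200):
    """Bayes rules only fire once both corpora have reached the minimum size."""
    return spam_learned >= min_spam and ham_learned >= min_ham
```

So a database with 500 spam but only 150 ham mails would not yet contribute any BAYES_xx score.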

Training:
---------
A. Auto-learn:

By default SpamAssassin is set to auto-learn new mails, i.e. their tokens are
added to the Bayesian DB.

A newly arrived mail is scored using the SpamAssassin ruleset, but with
score set 0 (local checks) or 1 (network lookups) and not the last two,
which are only used to determine the total SpamAssassin score and not
for learning.
Rules with tflags noautolearn, userconf or learn are not used to calculate
this learn score.

A message is only accepted into the Bayes DB when the computed learn score
is beyond predefined (configurable) thresholds: below 0.1 for ham and
above 12 for spam (with at least 3 points from header rules and 3 from body
rules). When these constraints are met, the message is added to the Bayes DB
for further use.
PS: the thresholds are defined in 10_misc.cf.
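
The auto-learn decision described above can be sketched in Python as follows. This is only my reading of the rules as written in this post; the function name, the parameter names and the exact comparison operators are assumptions.

```python
def autolearn_decision(learn_score, header_points, body_points,
                       ham_threshold=0.1, spam_threshold=12.0,
                       min_part_points=3.0):
    """Return "ham", "spam" or None (do not auto-learn this message).

    learn_score is the score computed without Bayes and without the
    excluded rules; header_points / body_points are the parts of that
    score coming from header and body rules.
    """
    if learn_score < ham_threshold:
        return "ham"
    if (learn_score > spam_threshold
            and header_points >= min_part_points
            and body_points >= min_part_points):
        return "spam"
    return None  # in between: too uncertain to train on automatically
```

Under these assumptions a mail with learn score 6 is not auto-learned at all, and a mail scoring 15 purely on header rules is not learned as spam either.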

To prevent locking the Bayes DB, newly trained messages are added to a
journal, which is later synced to the DB.

B. Manual learning / correcting:

It is possible to learn a message manually by using sa-learn --ham or
--spam.

If the message was already learned with the same classification (e.g. as
spam), it will not be re-learned. If it was learned incorrectly, the previous
learning will be forgotten and it will be learned correctly.
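
The re-learning behaviour can be illustrated with a toy in-memory store in Python. This is not how sa-learn stores its data; the class, its fields and the return strings are invented for the example.

```python
from collections import Counter

class ToyBayesStore:
    """Toy stand-in for the Bayes DB: remembers how each message was learned."""

    def __init__(self):
        self.seen = {}                                  # message id -> "spam" / "ham"
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, msg_id, tokens, label):
        unique = set(tokens)                            # count each token once per mail
        previous = self.seen.get(msg_id)
        if previous == label:
            return "already learned"                    # same classification: skip
        if previous is not None:
            self.counts[previous].subtract(unique)      # forget the wrong learning
            outcome = "relearned"
        else:
            outcome = "learned"
        self.counts[label].update(unique)
        self.seen[msg_id] = label
        return outcome
```

Learning the same message twice as spam leaves the counts unchanged, while learning it as ham afterwards first removes its tokens from the spam counts.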

???
QUESTION: I could not find out whether manual learning still respects the
thresholds above. In other words, if a mail is delivered with score 6, can
it be added to the Bayes DB by manually learning it?
???


Re: Bayesian filtering : the full story and question

Posted by jdow <jd...@earthlink.net>.

From: "Pieter Vanmeerbeek" <pi...@able.be>

??? 
QUESTION: I could not find out whether manual learning still respects the
thresholds above. In other words, if a mail is delivered with score 6, can
it be added to the Bayes DB by manually learning it?
???

<< jdow >> Of course it can. Do it all the time. Learn it as ham or spam,
your choice. But I tend not to feed Bayes with things it already considers
to be BAYES_99.

{^_^}