Posted to users@spamassassin.apache.org by Ronan <r....@qub.ac.uk> on 2004/11/09 15:29:19 UTC
sa-learn
OK folks, a couple of questions regarding sa-learn.
I am about to enable the Bayesian filters via sa-learn on my mailhubs.
I currently have SURIBLS and a few choice rules from rulesemporium
running on 3.0.1 w/ exiscan.
1) Am I right in thinking that I can run sa-learn --spam on a folder which
contains spam, most of which has SpamAssassin headers indicating the same,
and that sa-learn knows to disregard the SpamAssassin headers, or all
headers for that matter?
2) How will the Bayesian checking affect the load? I have tweaked things so
that my servers are currently hitting 0-5% idle during peak, and anything
more will probably make them fall over.
3) How will Bayes affect the need for some of the rulesets I have?
No, strike that.
3b) How does Bayes affect rulesets from, say, exit0/rulesemporium?
Can any be done away with? Are any made practically obsolete by a
well-trained Bayes?
4) Anything else I should be looking into?
thanks all
ronan
--
Regards
Ronan McGlue
==============
Analyst/Programmer
Information Services
Queens University Belfast
BT7 1NN
Re: sa-learn
Posted by hi...@free.fr.
I want to integrate sa-learn to learn what is spam and what is not spam.
I use SA 2.64 and procmail.
Is it OK if I have two users who each move their SPAM and HAM to
local IMAP folders?
Like FALSE-SPAM and SPAM.
How do I tell sa-learn to look in a certain IMAP folder?
/Hitete
Re: sa-learn
Posted by Matt Kettler <mk...@evi-inc.com>.
At 10:08 AM 11/9/2004, Ronan wrote:
>That's quite comprehensive answering there, Matt - most appreciated... :D
>
>One more though: sa-learn ham. Is this to explicitly mark what should not
>be learnt as spam? So should you feed it the rest of your mailbox?
sa-learn --ham is used to teach SA what non-spam looks like. This has
nothing to do with adjusting what should or should not be learned in the
future. It is a direct, integral, and REQUIRED part of bayes training.
Bayes makes a judgement about how probable it is that a given mail is spam.
To do this, it needs to know what common words/phrases/tokens in spam look
like, and what they look like in nonspam.
When you train, sa-learn breaks a message into "tokens" (mostly words from
the body, though various headers get encoded as tokens too). It then puts
these in a database and tracks how many times each token was seen in spam,
and how many times in nonspam. Based on the count of spam/ham matches, SA
can calculate a probability (from 0% to 100%) that a given token appears in
a spam email.
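As a rough illustration of the counting described above, here is a minimal Python sketch. It is not SA's actual code: the tokenizer is a crude word splitter, and the names TokenDB, learn and token_probability are made up for this example. It only shows how per-token spam/ham counts can be turned into a per-token probability.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude stand-in for a real tokenizer: unique lowercase words."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TokenDB:
    """Hypothetical token store: counts messages containing each token."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0  # number of spam messages learned
        self.nham = 0   # number of ham messages learned

    def learn(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

    def token_probability(self, token):
        """Estimate P(spam | token), or None if the token was never seen."""
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return None
        # Normalise by how many messages of each class were learned.
        spam_rate = s / self.nspam if self.nspam else 0.0
        ham_rate = h / self.nham if self.nham else 0.0
        return spam_rate / (spam_rate + ham_rate)


db = TokenDB()
db.learn("cheap pills buy now", is_spam=True)
db.learn("buy cheap watches", is_spam=True)
db.learn("meeting agenda for monday", is_spam=False)
db.learn("please buy milk on monday", is_spam=False)

print(db.token_probability("cheap"))   # seen only in spam
print(db.token_probability("buy"))     # seen in both classes
print(db.token_probability("monday"))  # seen only in ham
```

Note that if no ham has been learned at all, every token seen so far comes out at 100% in this sketch, which is the reason ham training cannot be skipped.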
When new mail comes in, bayes looks for token matches against its existing
learning. It then arrives at a probability of spam for the whole message
by combining the probabilities of the tokens it matched.
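To make the "combining" step concrete, here is a small Python sketch using plain naive-Bayes combination. This is an assumption chosen for illustration; SA's real combining method is more elaborate and is not reproduced here. The function name combine and the floor parameter are hypothetical.

```python
import math


def combine(probabilities, floor=0.01):
    """Combine per-token spam probabilities into one message probability.

    Uses P = prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    Each p is clamped to [floor, 1 - floor] so that a single token cannot
    force the result to exactly 0 or 1.
    """
    if not probabilities:
        return 0.5  # no evidence either way
    log_spam = 0.0
    log_ham = 0.0
    for p in probabilities:
        p = min(max(p, floor), 1.0 - floor)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    # Guard against overflow in exp() for very long, one-sided messages.
    if diff > 700:
        return 0.0
    if diff < -700:
        return 1.0
    return 1.0 / (1.0 + math.exp(diff))


print(combine([0.9, 0.9]))       # two spammy tokens reinforce each other
print(combine([0.9, 0.1]))       # conflicting evidence lands near 0.5
print(combine([0.1, 0.1, 0.2]))  # hammy tokens pull the score down
```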
It's all a simple statistical word-frequency thing...
Without ham training, bayes will think that everything is spam.
(Fortunately, SA will flatly refuse to use bayes until 200 hams have been
trained, as well as 200 spams)
Re: sa-learn
Posted by Ronan <r....@qub.ac.uk>.
That's quite comprehensive answering there, Matt - most appreciated... :D
One more though: sa-learn ham. Is this to explicitly mark what should
not be learnt as spam? So should you feed it the rest of your mailbox?
I've just created the two folders and I'm opening them up for others (a
small trusted fraternity, i.e. the email group) to upload their spam to.
So is it simply a case of whatever isn't spam, put it in ham?
thanks
ronan
Re: sa-learn
Posted by Matt Kettler <mk...@comcast.net>.
At 02:29 PM 11/9/2004 +0000, Ronan wrote:
>1) Am I right in thinking that I can run sa-learn --spam on a folder which
>contains spam, most of which has SpamAssassin headers indicating the same,
>and that sa-learn knows to disregard the SpamAssassin headers, or all
>headers for that matter?
SA's bayes subsystem tracks which message IDs it has already learned from
and what they were learned as. It will not re-learn the same message unless
you tell SA to change what it was learned as.
SA can (and does) learn useful information from mail already tagged as
spam, so feeding tagged mail to sa-learn is good, not redundant. It will
only ignore those it already learned or autolearned.
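The bookkeeping described above can be pictured with a short Python sketch. It is only an illustration of the idea, not SA's implementation; the class LearnedLog and its record method are hypothetical names.

```python
class LearnedLog:
    """Hypothetical record of which message IDs were learned as what."""

    def __init__(self):
        self.learned_as = {}  # message ID -> "spam" or "ham"

    def record(self, msg_id, label):
        """Return the action a trainer would take for this message."""
        previous = self.learned_as.get(msg_id)
        if previous == label:
            return "skip"  # already learned (or autolearned) this way
        # Either a new message, or the same message with a changed label.
        # A real trainer would undo the old token counts before relearning.
        action = "learn" if previous is None else "relearn"
        self.learned_as[msg_id] = label
        return action


log = LearnedLog()
print(log.record("<1@example.com>", "spam"))  # first time seen
print(log.record("<1@example.com>", "spam"))  # same label again
print(log.record("<1@example.com>", "ham"))   # label changed
```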
sa-learn will automatically ignore headers generated by SA itself. You can
specify a bayes_ignore_header in your local.cf to make it ignore headers
added by other tools.
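For a feel of what "ignoring headers" means in practice, here is a Python sketch that drops headers before anything is tokenized. It rests on two assumptions that are not from SA's source: that SA's own headers are the ones prefixed X-Spam-, and that the extra names (what bayes_ignore_header would supply) arrive as a plain list. The function name headers_for_learning and the X-Bogosity header are just examples.

```python
from email import message_from_string


def headers_for_learning(raw_message, ignore_headers=()):
    """Return (name, value) header pairs that would be kept for training."""
    msg = message_from_string(raw_message)
    ignored = {name.lower() for name in ignore_headers}
    kept = []
    for name, value in msg.items():
        lname = name.lower()
        # Assumption: filter-generated headers start with "X-Spam-".
        if lname.startswith("x-spam-") or lname in ignored:
            continue
        kept.append((name, value))
    return kept


raw = (
    "Subject: hi\n"
    "X-Spam-Status: Yes, score=9.1\n"
    "X-Bogosity: Spam\n"
    "From: a@example.com\n"
    "\n"
    "body text\n"
)
for name, value in headers_for_learning(raw, ["X-Bogosity"]):
    print(name, "=", value)
```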
>2) How will the Bayesian checking affect the load? I have tweaked things so
>that my servers are currently hitting 0-5% idle during peak, and anything
>more will probably make them fall over.
Bayes adds quite a bit of load, but if you're using some insanely large
rulesets (i.e. anything over 256 KB) it's insignificant by comparison.
>3) How will Bayes affect the need for some of the rulesets I have?
>No, strike that.
>3b) How does Bayes affect rulesets from, say, exit0/rulesemporium?
>Can any be done away with? Are any made practically obsolete by a
>well-trained Bayes?
Theoretically, any and all rules can be made obsolete by a well-trained
bayes DB. The other rules exist to balance out the amount of work needed to
get good results. You can get great results from a bayes-only system, but
you've got to train it heavily and constantly.
SA's rules pick up the slack if you're not training 200 spams and 200 hams
a day, every day.
>4) Anything else I should be looking into?
Hardware upgrades so you can run some more CPU intensive stuff? :)