
Posted to users@spamassassin.apache.org by Ronan <r....@qub.ac.uk> on 2004/11/09 15:29:19 UTC

sa-learn

OK folks, a couple of questions regarding sa-learn.

I am about to enable the Bayesian filters via sa-learn on my mailhubs.
I currently have SURIBLS and a few choice rules from rulesemporium 
running on 3.0.1 w/ exiscan.

1) Am I right in thinking that I can run sa-learn --spam on a folder which 
contains spam, most of which has SpamAssassin headers indicating the same, 
and that sa-learn knows to disregard the SpamAssassin headers, or all 
headers for that matter?

2) How will the Bayesian checking affect the load? I have tweaked things so 
that my servers currently hit 0-5% idle during peak, and anything more 
will probably make them fall over.

3) How will the Bayesian filter affect the need for some of the rulesets I 
have? No, strike that.
3b) How does the Bayesian filter affect rulesets from, say, 
exit0/rulesemporium? Can any be done away with? Are any made practically 
obsolete by a well-trained Bayesian filter?

4) Anything else I should be looking into?

thanks all

ronan
-- 
Regards

Ronan McGlue
==============
Analyst/Programmer
Information Services
Queens University Belfast
BT7 1NN

IMAP folder with sa-learn

Posted by hi...@free.fr.

Quoting hitete@free.fr:

> I want to integrate sa-learn to learn what is spam and what is non-spam.
> I use SA 2.64 and procmail.
>
> Is it OK if I have two users who each move their spam and ham to
> local IMAP folders?
>
> Like FALSE-SPAM and SPAM.
> How do I tell sa-learn to go look in a certain IMAP folder?
>
> /Hitete

Re: sa-learn

Posted by hi...@free.fr.

I want to integrate sa-learn to learn what is spam and what is non-spam.
I use SA 2.64 and procmail.

Is it OK if I have two users who each move their spam and ham to
local IMAP folders?

Like FALSE-SPAM and SPAM.
How do I tell sa-learn to go look in a certain IMAP folder?

/Hitete

Re: sa-learn

Posted by Matt Kettler <mk...@evi-inc.com>.

At 10:08 AM 11/9/2004, Ronan wrote:
>That's quite a comprehensive answer there, Matt - most appreciated... :D
>
>One more though: sa-learn --ham. Is this to explicitly mark what should not 
>be learnt as spam? So should you feed it the rest of your mailbox?

sa-learn --ham is used to teach SA what non-spam looks like. This has 
nothing to do with adjusting what should or should not be learned in the 
future. It is a direct, integral, and REQUIRED part of bayes training.

Bayes makes a judgement about how probable it is that a given mail is spam. 
To do this, it needs to know what common words/phrases/tokens in spam look 
like, and what they look like in nonspam.

When you train, sa-learn breaks a message into "tokens" (mostly words from 
the body, though various headers get encoded as well). It then puts these 
in a database and tracks how many times each token was seen in spam and 
how many times in nonspam. Based on the count of spam/ham matches, SA can 
calculate the probability that a given token appears in a spam email (as a 
percentage, 0% to 100%).
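
To make the counting step concrete, here is a minimal Python sketch. It is 
not SpamAssassin's code: the tokenizer, the TokenCounts class and the 
simple ratio used for the per-token estimate are all assumptions made for 
illustration, and header tokens are left out entirely.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: the set of lowercase words in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class TokenCounts:
    """Tracks, per token, how many spam and ham messages contained it."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0  # number of spam messages learned
        self.nham = 0   # number of ham messages learned

    def learn(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spam_probability(self, token):
        """Estimate in [0, 1] that a mail containing the token is spam.

        Returns None for a token that was never seen in training.
        """
        s = self.spam[token] / self.nspam if self.nspam else 0.0
        h = self.ham[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return None
        return s / (s + h)
```

Multiply the result by 100 to get the 0% to 100% figure. Note that a token 
seen only in spam comes out at 1.0, which is why the ham side of the 
training matters.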

When new mail comes in, bayes looks for token matches against its existing 
learning. It then arrives at a probability of spam for the whole message 
by combining the probabilities of the tokens it matched.
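
One simple way to picture the combining step is a naive-Bayes-style sum of 
log-odds, sketched below in Python. This is an illustration only: the 
combine function and its floor parameter are hypothetical, and 
SpamAssassin's actual combining method is more involved than this.

```python
import math


def combine(token_probs, floor=0.01):
    """Combine per-token spam probabilities into one message probability.

    Each probability is clamped to [floor, 1 - floor] so that a single
    token at exactly 0.0 or 1.0 cannot decide the result on its own.
    """
    log_odds = 0.0
    for p in token_probs:
        p = min(max(p, floor), 1.0 - floor)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Numerically stable logistic function, safe for large |log_odds|.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With no matched tokens the function returns 0.5 (no evidence either way), 
and a spammy token and an equally hammy token cancel each other out.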

It's all a simple statistical word-frequency thing...

Without ham training, bayes will think that everything is spam. 
(Fortunately, SA will flatly refuse to use bayes until 200 hams have been 
trained, as well as 200 spams.)
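
The refusal described above amounts to a gate on the training counts. A 
rough Python sketch, with the 200/200 thresholds taken from this post and 
the function name bayes_ready made up for the example:

```python
MIN_SPAM = 200  # threshold quoted in the post above
MIN_HAM = 200


def bayes_ready(nspam, nham, min_spam=MIN_SPAM, min_ham=MIN_HAM):
    """True only when enough of BOTH classes have been learned."""
    return nspam >= min_spam and nham >= min_ham
```

So a database trained on thousands of spams and no hams is still not used, 
which protects you from the "everything is spam" failure mode.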

Re: sa-learn

Posted by Ronan <r....@qub.ac.uk>.

That's quite a comprehensive answer there, Matt - most appreciated... :D

One more though: sa-learn --ham. Is this to explicitly mark what should 
not be learnt as spam? So should you feed it the rest of your mailbox?

I've just created the two folders and I'm opening them up for others (a 
small trusted fraternity, i.e. the email group) to upload their spam to.

So is it simply a case of: whatever isn't spam, put it in ham?

thanks
ronan

Matt Kettler wrote:
> At 02:29 PM 11/9/2004 +0000, Ronan wrote:
> 
>> 1) Am I right in thinking that I can run sa-learn --spam on a folder 
>> which contains spam, most of which has SpamAssassin headers indicating 
>> the same, and that sa-learn knows to disregard the SpamAssassin 
>> headers, or all headers for that matter?
> 
> 
> SA's bayes subsystem tracks which message IDs it has already learned 
> from and what they were learned as. It will not re-learn the same 
> message unless you tell SA to change what it was learned as.
> 
> SA can (and does) learn useful information from mail already tagged as 
> spam, so feeding tagged mail to sa-learn is good, not redundant. It will 
> only ignore those it already learned or autolearned.
> 
> sa-learn will automatically ignore headers generated by SA itself. You 
> can specify a bayes_ignore_header in your local.cf to make it ignore 
> headers added by other tools.
> 
> 
> 
>> 2) How will the Bayesian checking affect the load? I have tweaked 
>> things so that my servers currently hit 0-5% idle during peak, and 
>> anything more will probably make them fall over.
> 
> 
> Bayes adds quite a bit of load, but if you're using some insanely large 
> rulesets (i.e. anything over 256 KB) it's insignificant by comparison.
> 
> 
>> 3) How will the Bayesian filter affect the need for some of the 
>> rulesets I have? No, strike that.
>> 3b) How does the Bayesian filter affect rulesets from, say, 
>> exit0/rulesemporium? Can any be done away with? Are any made 
>> practically obsolete by a well-trained Bayesian filter?
> 
> 
> Theoretically any and all rules can be obsoleted by a well-trained bayes 
> DB. The other rules exist to balance out the amount of work needed to 
> get good results. You can get great results from a bayes-only system, 
> but you've got to train it heavily and constantly.
> 
> SA's rules pick up the slack if you're not training 200 spams and 200 
> hams a day every day.
> 
> 
> 
>> 4) Anything else I should be looking into?
> 
> 
> Hardware upgrades so you can run some more CPU intensive stuff? :)
> 
> 


Re: sa-learn

Posted by Matt Kettler <mk...@comcast.net>.

At 02:29 PM 11/9/2004 +0000, Ronan wrote:
>1) Am I right in thinking that I can run sa-learn --spam on a folder which 
>contains spam, most of which has SpamAssassin headers indicating the same, 
>and that sa-learn knows to disregard the SpamAssassin headers, or all 
>headers for that matter?

SA's bayes subsystem tracks which message IDs it has already learned from 
and what they were learned as. It will not re-learn the same message unless 
you tell SA to change what it was learned as.
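
In rough terms the bookkeeping looks like the Python sketch below. The 
SeenStore class and its return strings are invented for illustration and 
say nothing about how SA actually stores this on disk.

```python
class SeenStore:
    """Remembers which message IDs were learned, and as what."""

    def __init__(self):
        self.seen = {}  # message ID -> "spam" or "ham"

    def learn(self, msgid, label):
        """Return 'skipped', 'learned' or 'relearned'."""
        previous = self.seen.get(msgid)
        if previous == label:
            # Same message, same classification: nothing to do.
            return "skipped"
        self.seen[msgid] = label
        # A changed label means the old learning is replaced.
        return "learned" if previous is None else "relearned"
```

Feeding the same spam folder to sa-learn twice is therefore harmless: the 
second pass is skipped message by message. Moving a message from the spam 
folder to the ham folder and training again is what changes its label.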

SA can (and does) learn useful information from mail already tagged as 
spam, so feeding tagged mail to sa-learn is good, not redundant. It will 
only ignore those it already learned or autolearned.

sa-learn will automatically ignore headers generated by SA itself. You can 
specify a bayes_ignore_header in your local.cf to make it ignore headers 
added by other tools.
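
As an illustration of the idea (not of SA's implementation), the Python 
sketch below drops headers before learning. Matching SA's own headers by an 
"X-Spam-" prefix is a simplifying assumption, and headers_for_learning and 
its extra_ignored argument are hypothetical names standing in for the 
bayes_ignore_header setting.

```python
from email import message_from_string

SA_PREFIX = "x-spam-"  # assumption: SA's own headers share this prefix


def headers_for_learning(raw_message, extra_ignored=()):
    """Return (name, value) header pairs that would be kept for training."""
    msg = message_from_string(raw_message)
    ignored = {name.lower() for name in extra_ignored}
    kept = []
    for name, value in msg.items():
        lname = name.lower()
        if lname.startswith(SA_PREFIX) or lname in ignored:
            continue  # skip SA's own markup and any user-listed headers
        kept.append((name, value))
    return kept
```

For example, passing extra_ignored=["X-Virus-Scanned"] also drops a header 
added by a virus scanner, so its fixed text cannot skew the token counts.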

>2) How will the Bayesian checking affect the load? I have tweaked things so 
>that my servers currently hit 0-5% idle during peak, and anything more 
>will probably make them fall over.

Bayes adds quite a bit of load, but if you're using some insanely large 
rulesets (i.e. anything over 256 KB) it's insignificant by comparison.

>3) How will the Bayesian filter affect the need for some of the rulesets I 
>have? No, strike that.
>3b) How does the Bayesian filter affect rulesets from, say, 
>exit0/rulesemporium? Can any be done away with? Are any made practically 
>obsolete by a well-trained Bayesian filter?

Theoretically any and all rules can be obsoleted by a well-trained bayes DB. 
The other rules exist to balance out the amount of work needed to get good 
results. You can get great results from a bayes-only system, but you've got 
to train it heavily and constantly.

SA's rules pick up the slack if you're not training 200 spams and 200 hams 
a day every day.

>4) Anything else I should be looking into?

Hardware upgrades so you can run some more CPU intensive stuff? :)