Posted to users@spamassassin.apache.org by poifgh <ab...@gmail.com> on 2009/09/21 21:05:20 UTC

Understanding SpamAssassin

I am trying to understand the inner workings of SpamAssassin, and it would be
great if someone could answer my questions. I have read the online
documentation, but some questions remain unanswered or I am not sure about
the answers.

As far as I understand, the default configuration of spamassassin processes
emails in this fashion

DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL 

[Is the sequence right? I know for sure AWL comes in last, what about
Bayesian learning and RAW Body tests' order? Did I miss any module?]

Why do we need Bayesian learning in presence of RAW body tests?

Mails which have a very high or very low score are fed to Bayesian learning.
Since we are already confident about them being HAM or SPAM, what do we want
to learn from them? The regex filters have identified that the mail is spam
(say); what additional benefit does Bayesian learning provide? Does it learn
other words in the spam mail (say, words surrounding an obfuscated term) in
the hope of matching them in future emails? Or am I understanding it
completely differently?

Thanks for the help.
-- 
View this message in context: http://www.nabble.com/Understanding-SpamAssassin-tp25530437p25530437.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: 3.3.0 and sa-compile

Posted by "tonio@starbridge.org" <to...@starbridge.org>.

McDonald, Dan a écrit :
> On Tue, 2009-09-29 at 08:19 +0200, tonio@starbridge.org wrote:
>>
>> tonio@starbridge.org a écrit :
>>> tonio@starbridge.org a écrit :
>>>> Benny Pedersen a écrit :
>>>>> On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org"
>>>>> wrote
>>>>>> I've tested with SA 3.2.5 and it's working fine with
>>>>>> Rule2XSBody active. I've tried to delete compiled rules
>>>>>> and compile again: same result.
>>>>> forget to sa-compile in 3.3 ?
>>>> sa-compile has been run correctly with no errors (even in
>>>> debug)
>>> has anyone encountered the same problem ?
>
> Someone posted a problem with perl 5.6. What version of perl are
> you running?
>
thx for your answer

perl v5.10.0



Re: 3.3.0 and sa-compile

Posted by "McDonald, Dan" <Da...@austinenergy.com>.
On Tue, 2009-09-29 at 08:19 +0200, tonio@starbridge.org wrote:
> 
> tonio@starbridge.org a écrit :
> > tonio@starbridge.org a écrit :
> >> Benny Pedersen a écrit :
> >>> On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org" wrote
> >>>> I've tested with SA 3.2.5 and it's working fine with
> >>>> Rule2XSBody active. I've tried to delete compiled rules and
> >>>> compile again: same result.
> >>> forget to sa-compile in 3.3 ?
> >> sa-compile has been run correctly with no errors (even in debug)
> > has anyone encountered the same problem ?

Someone posted a problem with perl 5.6. What version of perl are you
running?

-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281, CNX
www.austinenergy.com

Re: 3.3.0 and sa-compile

Posted by "tonio@starbridge.org" <to...@starbridge.org>.

tonio@starbridge.org a écrit :
> tonio@starbridge.org a écrit :
>> Benny Pedersen a écrit :
>>> On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org" wrote
>>>> I've tested with SA 3.2.5 and it's working fine with
>>>> Rule2XSBody active. I've tried to delete compiled rules and
>>>> compile again: same result.
>>> forget to sa-compile in 3.3 ?
>> sa-compile has been run correctly with no errors (even in debug)
> has anyone encountered the same problem ?
Nobody?
I really need help on this one.

thx
Regards
Tonio


Re: 3.3.0 and sa-compile

Posted by "tonio@starbridge.org" <to...@starbridge.org>.

tonio@starbridge.org a écrit :
> Benny Pedersen a écrit :
>> On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org" wrote
>
>>> I've tested with SA 3.2.5 and it's working fine with
>>> Rule2XSBody active. I've tried to delete compiled rules and
>>> compile again: same result.
>> forget to sa-compile in 3.3 ?
>
> sa-compile has been run correctly with no errors (even in debug)
has anyone encountered the same problem ?


Re: 3.3.0 and sa-compile

Posted by Matt Kettler <mk...@verizon.net>.
tonio@starbridge.org wrote:
> Benny Pedersen a écrit :
> > On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org" wrote
>
> >> I've tested with SA 3.2.5 and it's working fine with Rule2XSBody
> >> active. I've tried to delete compiled rules and compile again:
> >> same result.
> > forget to sa-compile in 3.3 ?
>
> sa-compile has been run correctly with no errors (even in debug)


Re: 3.3.0 and sa-compile

Posted by "tonio@starbridge.org" <to...@starbridge.org>.

Benny Pedersen a écrit :
> On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org" wrote
>
>> I've tested with SA 3.2.5 and it's working fine with Rule2XSBody
>> active. I've tried to delete compiled rules and compile again:
>> same result.
>
> forget to sa-compile in 3.3 ?
>
sa-compile has been run correctly with no errors (even in debug)


Re: 3.3.0 and sa-compile

Posted by Benny Pedersen <me...@junc.org>.
On fre 25 sep 2009 13:38:19 CEST, "tonio@starbridge.org" wrote

> I've tested with SA 3.2.5 and it's working fine with Rule2XSBody
> active. I've tried to delete compiled rules and compile again: same
> result.

forget to sa-compile in 3.3 ?

-- 
xpoint


3.3.0 and sa-compile

Posted by "tonio@starbridge.org" <to...@starbridge.org>.

Hi,
I'm running SA 3.3.0 (3.3.0-alpha3-r808953) and I have some problems with
compiled rules.

sa-compile runs without errors, and SA seems to work fine when restarted.
But some body rules are no longer detected.

Example of a simple body rule (for testing):

body TONIO_SPAM_TEST            /toniospam/i
describe TONIO_SPAM_TEST        Mentions Generic toniospamtest
score   TONIO_SPAM_TEST 5

If I comment out
loadplugin Mail::SpamAssassin::Plugin::Rule2XSBody
in v320.pre, the rule works again.

I've tested with SA 3.2.5 and it's working fine with Rule2XSBody active.
I've tried to delete compiled rules and compile again: same result.

Some info on my environment:
debian testing
xsubpp version 2.200401 (from debian perl package)
re2c version 0.13.5-1

Thanks for your help
Regards
Tonio





Re: Understanding SpamAssassin

Posted by Benny Pedersen <me...@junc.org>.
On tir 22 sep 2009 09:43:23 CEST, LuKreme wrote
> bayes learning from ham helps score messages as
> ham that might otherwise be tagged as ham.

ups :)

-- 
xpoint


Re: Understanding SpamAssassin

Posted by LuKreme <kr...@kreme.com>.
On 21-Sep-2009, at 13:05, poifgh wrote:
> Mails which have very high or very low score are fed to bayesian  
> learning.
> Since we are confident about them being HAM or SPAM what do we want  
> to learn
> from them - The regex filters have identified that the mail is a  
> spam (say),
> what additional does bayesian learning achieve? Does it learn other  
> words in
> the spam mail (say words surrounding obfuscated term) in hope of  
> matching
> them in future emails? Or am I understanding it completely different?

Bayes learning from spam helps score messages that would not otherwise score as  
spam. Similarly, bayes learning from ham helps score messages as ham  
that might otherwise be tagged as ham.

-- 
Heisenberg's only uncertainty was what pub to vomit in next and
	Jung fancied Freud's mother too. -- Jared Earle


Re: Understanding SpamAssassin

Posted by LuKreme <kr...@kreme.com>.
On 25-Sep-2009, at 03:56, Mark Martinec wrote:
LuKreme wrote:
>> Other surprises are that DKIM is pretty useless and SPF_PASS is
>> actually a slight spam indicator.
>
> Benny Pedersen wrote:
>> so without some whitelist_from_*, DKIM and SPF will not be helpful
>
> Indeed. Score points should be kept close to zero for rules
> DKIM_SIGNED, DKIM_VALID and DKIM_VALID_AU (or DKIM_VERIFIED in  
> pre-3.3).

As they are, and I never said anything differently. I don't know where  
Benny got the idea I was giving spammers a 'free ride.'

I meant to say "pretty useless on its own".




-- 
I think it's the duty of the comedian to find out where the line is
	drawn and cross it deliberately.


Re: Understanding SpamAssassin

Posted by Mark Martinec <Ma...@ijs.si>.
LuKreme wrote:
> Other surprises are that DKIM is pretty useless and SPF_PASS is
> actually a slight spam indicator.

Benny Pedersen wrote:
> so without some whitelist_from_*, DKIM and SPF will not be helpful

Indeed. Score points should be kept close to zero for rules
DKIM_SIGNED, DKIM_VALID and DKIM_VALID_AU (or DKIM_VERIFIED in pre-3.3).

The value of DKIM verification does not come from score points of these
informational rules directly, but from derived rules: from DKIM-based
whitelisting and from fraud protection (DKIM_ADSP_* rules with their
associated 'adsp_override' in 3.3.0, or hand written rules in pre-3.3).

  Mark

Re: Understanding SpamAssassin

Posted by Benny Pedersen <me...@junc.org>.
On fre 25 sep 2009 09:58:41 CEST, LuKreme wrote
> Other surprises are that DKIM is pretty useless and SPF_PASS is  
> actually a slight spam indicator.

You miss the point: there is no USER_IN_*

so without some whitelist_from_*, DKIM and SPF will not be helpful

If it were, you would have given spammers a free ride; is that what you wanted?

-- 
xpoint


Re: Understanding SpamAssassin

Posted by LuKreme <kr...@kreme.com>.
On Sep 24, 2009, at 7:44 PM, poifgh wrote:
> For 101st mail, if the regex MEDICINE is unable to match the  
> obfuscated
> text, then the mail would have a low score, but bayesian learner  
> would say,
> seeing the words surrounding obfuscated text, that this mail is spam.

Essentially this is how it works. Bayes looks for tokens in the
messages and categorizes them as spam or ham depending on two factors:
the overall score or the specific command-line flag. If the score is
high enough, then the message is learned as spam, which means all its
tokens are classified as spam. If the score is low enough, the message
is learned as ham and its tokens are likewise classified as ham.
Tokens that appear in both classes cancel out, and new messages are
examined for tokens. Depending on how many there are of each type and
(this is the clever bit) how strong an indicator each one is of
spamminess or hamminess, the final Bayes 'score' is weighted.
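The learning flow just described can be sketched in a few lines of Python. This is a toy illustration, not SpamAssassin's actual code: the thresholds, the tokenizer, and the message texts are all made up for the example.

```python
from collections import Counter

# Illustrative auto-learn thresholds; SA's defaults are configurable
# and differ from these numbers.
AUTOLEARN_SPAM = 12.0
AUTOLEARN_HAM = 0.1

spam_tokens = Counter()
ham_tokens = Counter()

def tokenize(text):
    # SA's real tokenizer is far richer (headers, character runs, etc.)
    return text.lower().split()

def autolearn(text, rule_score):
    """Learn tokens only from messages the rules scored decisively."""
    if rule_score >= AUTOLEARN_SPAM:
        spam_tokens.update(tokenize(text))
    elif rule_score <= AUTOLEARN_HAM:
        ham_tokens.update(tokenize(text))
    # Mid-range scores are skipped: that gap is what manual training fills.

autolearn("please buy m3d1c1ne x at store y for cheap", 15.2)  # learned as spam
autolearn("meeting notes attached for the store audit", -1.0)  # learned as ham
autolearn("ambiguous newsletter text", 4.0)                    # not learned
```

Tokens seen in decisively scored mail accumulate per class; a later message full of "spammy" tokens can then be flagged even if no regex rule fires on it.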

The reason manual training is useful is that there is a wide range of
scores in between auto-learn ham and auto-learn spam.

BAYES_50 is a neutral result, and this is generally given a zero-ish
weight. However, in my experience quite a lot of emails hitting
BAYES_50 are actually spam. Ham messages tend to score lower, assuming
your data set is sufficiently large.

score BAYES_99 5.0
score BAYES_95 4.5
score BAYES_80 2
score BAYES_60 1.00
score BAYES_50 0.25
score BAYES_40 -0.50
score BAYES_20 -2.50
score BAYES_05 -3.50
score BAYES_00 -5.00

So yes, for me BAYES_99 is a poison pill, and BAYES_95 is close enough. I
have very little hitting _80 or _60 or _40, so those scores are basically
WAGs.

TOP SPAM RULES FIRED
RANK	RULE NAME               	       %OFMAIL %OFSPAM  %OFHAM
    1	BAYES_99                		 57.12	 92.66	  1.84
    2	HTML_MESSAGE            		 78.17	 79.89	 75.51
    3	URIBL_BLACK             		 43.66	 70.76	  1.49
    4	RCVD_IN_JMF_BL          		 36.20	 57.45	  3.14
    5	SPF_PASS                		 37.14	 50.73	 15.99
    6	URIBL_JP_SURBL          		 28.99	 47.56	  0.10
    7	URIBL_OB_SURBL          		 21.01	 34.44	  0.13
    8	DKIM_SIGNED             		 31.58	 31.10	 32.33

TOP HAM RULES FIRED
RANK	RULE NAME               	       %OFMAIL %OFSPAM  %OFHAM
    1	AWL                     		 45.92	 19.29	 87.37
    2	HTML_MESSAGE            		 78.17	 79.89	 75.51
    3	BAYES_00                		 21.30	  0.08	 54.31
    4	RCVD_IN_JMF_W           		 16.63	  0.78	 41.29
    5	DKIM_SIGNED             		 31.58	 31.10	 32.33
    6	DKIM_VERIFIED           		 25.13	 23.44	 27.77
    7	BAYES_50                		 11.88	  1.94	 27.36
    8	SPF_PASS                		 37.14	 50.73	 15.99

Now, this is misleading here because this is looking at the spammed log,
and when it gets right down to searching, a large number of BAYES_50
messages will end up being classified as spam.

Other surprises are that DKIM is pretty useless and SPF_PASS is  
actually a slight spam indicator.

-- 
if you ever get that chimp of your back, if you ever find the thing
	you lack, ah but you know you're only having a laugh. Oh, oh
	here we go again -- until the end.


Re: Understanding SpamAssassin

Posted by Bowie Bailey <Bo...@BUC.com>.
poifgh wrote:
> Bowie Bailey wrote:
>   
>> For auto-learning, the high and low scoring messages are fed to Bayes. 
>> However, for an optimal setup, you should manually train Bayes on as
>> much of your (verified) ham and spam as possible.  The more of your mail
>> stream Bayes sees, the better the results will be.
>>
>> Your description of Bayes is pretty close.  It breaks down the message
>> into "tokens" (words and character sequences) and then keeps track of
>> how likely each of those tokens is to appear in either a ham or spam
>> message.  When a new message comes in, Bayes breaks it into tokens and
>> then scores it depending on which tokens were found in the message.
>>
>>     
>
> Suppose we do not have manual Bayesian training. We only do online training,
> in which high- and low-scoring mails are fed to the learner [is this a usual
> thing to do? How many people manually train their Bayesian filter?]
> A high-scoring spam is then fed to the learner. The spam is high scoring
> because a few rules [regexes] matched. Now the Bayesian learner would learn
> all the tokens from this mail. Next time a mail [say M] with similar tokens
> is seen, it would be flagged as spam [using Bayes' rule]. Why would Bayesian
> learning be needed for us to say M is spam? Since it contains words very
> similar to those in earlier high-scoring mails, shouldn't we expect the
> regex rules to work for M as well, since M is very similar to the mails we
> learned from?
>   

Look at it this way -- Bayes is learning what your spam looks like and
what your ham looks like.  Most of your spam will be caught by other
rules, but there are times when an email will come in that the main
rules do not catch.  Bayes is frequently able to catch these because it
is looking at the message as a whole rather than looking for particular
words or phrases as the main regex rules do.

Manual training is not strictly required for Bayes, but the more manual
training you do, the higher the accuracy and the more useful it
becomes.  At the least, you should manually train Bayes on all of your
false positives and false negatives.  This can be scripted to happen
automatically based on folders which are expected to contain hand-sorted
spam and ham.

> Here is how I think Bayesian learning is helpful [which could be entirely my
> misunderstanding]. Suppose a set of spam mails look like
>
> "Please buy M3d1C1NE X at store Y for cheap".
>
> Now spammers have obfuscated the word "medicine" in the mail. Spammers send,
> say, a thousand spams, each spelling "medicine" a different way, while all
> the other words around it remain nearly the same. Only some of the first 100
> of these mails would hit a MEDICINE rule [regex, if one exists]. Those
> particular mails would have high spam scores, and hence the Bayesian filter
> would learn that mails containing the words "Please", "buy", "at", "store",
> "for", "cheap" correspond to a high spam probability.
>
> For the 101st mail, if the MEDICINE regex is unable to match the obfuscated
> text, the mail would have a low score, but the Bayesian learner, seeing the
> words surrounding the obfuscated text, would say that this mail is spam.
>
> Does it work this way? Does it work only this way [if not manually trained]? 
>   

That is a pretty fair description of how it works regardless of how you
train it.  The advantage of manual training is that you allow it to
learn from the lower scoring spam (and higher scoring ham), which are
the kinds of messages that can most use the extra points from the Bayes
rules.

-- 
Bowie

Re: Understanding SpamAssassin

Posted by poifgh <ab...@gmail.com>.

Bowie Bailey wrote:
> 
> For auto-learning, the high and low scoring messages are fed to Bayes. 
> However, for an optimal setup, you should manually train Bayes on as
> much of your (verified) ham and spam as possible.  The more of your mail
> stream Bayes sees, the better the results will be.
> 
> Your description of Bayes is pretty close.  It breaks down the message
> into "tokens" (words and character sequences) and then keeps track of
> how likely each of those tokens is to appear in either a ham or spam
> message.  When a new message comes in, Bayes breaks it into tokens and
> then scores it depending on which tokens were found in the message.
> 

Suppose we do not have manual Bayesian training. We only do online training,
in which high- and low-scoring mails are fed to the learner [is this a usual
thing to do? How many people manually train their Bayesian filter?]
A high-scoring spam is then fed to the learner. The spam is high scoring
because a few rules [regexes] matched. Now the Bayesian learner would learn
all the tokens from this mail. Next time a mail [say M] with similar tokens
is seen, it would be flagged as spam [using Bayes' rule]. Why would Bayesian
learning be needed for us to say M is spam? Since it contains words very
similar to those in earlier high-scoring mails, shouldn't we expect the regex
rules to work for M as well, since M is very similar to the mails we learned
from?

Here is how I think Bayesian learning is helpful [which could be entirely my
misunderstanding]. Suppose a set of spam mails look like

"Please buy M3d1C1NE X at store Y for cheap".

Now spammers have obfuscated the word "medicine" in the mail. Spammers send,
say, a thousand spams, each spelling "medicine" a different way, while all
the other words around it remain nearly the same. Only some of the first 100
of these mails would hit a MEDICINE rule [regex, if one exists]. Those
particular mails would have high spam scores, and hence the Bayesian filter
would learn that mails containing the words "Please", "buy", "at", "store",
"for", "cheap" correspond to a high spam probability.

For the 101st mail, if the MEDICINE regex is unable to match the obfuscated
text, the mail would have a low score, but the Bayesian learner, seeing the
words surrounding the obfuscated text, would say that this mail is spam.

Does it work this way? Does it work only this way [if not manually trained]? 
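To make the obfuscation scenario concrete, here is a hypothetical MEDICINE rule written as a Python regex (SA body rules are Perl regexes, but the matching idea is the same; the pattern and test strings are invented for illustration):

```python
import re

# Hypothetical leetspeak-aware pattern standing in for the MEDICINE rule.
MEDICINE = re.compile(r"m[e3]d[i1!]c[i1!]n[e3]", re.IGNORECASE)

hits = [bool(MEDICINE.search(s)) for s in (
    "Please buy M3d1C1NE X at store Y for cheap",   # caught
    "Please buy MEDICINE X at store Y for cheap",   # caught
    "Please buy M-E-D-I-C-I-N-E X at store Y",      # slips past
)]
```

The hyphen-separated variant slips past the pattern; that is exactly the gap Bayes can close by weighting the unchanged surrounding words ("buy", "store", "cheap").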





-- 
View this message in context: http://www.nabble.com/Understanding-SpamAssassin-tp25549227p25605170.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: Understanding SpamAssassin

Posted by Bowie Bailey <Bo...@BUC.com>.
poifgh wrote:
> I am trying to understand the inner workings of SpamAssassin, and it would
> be great if someone could answer my questions. I have read the online
> documentation, but some questions remain unanswered or I am not sure about
> the answers.
>   

I'm not an expert, just a long-time user, but I can give you some basic
answers.

> As far as I understand, the default configuration of spamassassin processes
> emails in this fashion
>
> DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL 
>
> [Is the sequence right? I know for sure AWL comes in last, what about
> Bayesian learning and RAW Body tests' order? Did I miss any module?]
>   

As I understand it, quite a bit of this is done in parallel.  In
particular, the DNS-based tests are fired off first, and then other tests
are run while waiting for the responses.

In any case, unless you are playing with the short-circuit features, all
rules are run for every message, so does it really matter what order they
are in?
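The fire-DNS-first pattern can be sketched with Python's asyncio (SA itself is Perl and its internals differ; the zone names, delay, and rule names here are illustrative only): start the slow network queries, run the cheap local rules while they are in flight, then collect the answers.

```python
import asyncio

async def dnsbl_lookup(zone):
    # Stand-in for a real DNSBL query; the sleep fakes network latency.
    await asyncio.sleep(0.05)
    return (zone, "listed")

def run_body_rules(body):
    # Stand-in for local regex rule matching.
    return ["TEST_RULE"] if "spamword" in body else []

async def scan(body):
    # Kick off the DNS queries first...
    tasks = [asyncio.create_task(dnsbl_lookup(z))
             for z in ("zen.spamhaus.org", "bl.example.invalid")]
    # ...run CPU-bound local tests while they are pending...
    local_hits = run_body_rules(body)
    # ...then wait for the network answers.
    dns_hits = await asyncio.gather(*tasks)
    return local_hits, dns_hits

local, dns = asyncio.run(scan("a body containing spamword"))
```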

> Why do we need Bayesian learning in presence of RAW body tests?
>
> Mails which have a very high or very low score are fed to Bayesian learning.
> Since we are already confident about them being HAM or SPAM, what do we want
> to learn from them? The regex filters have identified that the mail is spam
> (say); what additional benefit does Bayesian learning provide? Does it learn
> other words in the spam mail (say, words surrounding an obfuscated term) in
> the hope of matching them in future emails? Or am I understanding it
> completely differently?
>   

For auto-learning, the high and low scoring messages are fed to Bayes. 
However, for an optimal setup, you should manually train Bayes on as
much of your (verified) ham and spam as possible.  The more of your mail
stream Bayes sees, the better the results will be.

Your description of Bayes is pretty close.  It breaks down the message
into "tokens" (words and character sequences) and then keeps track of
how likely each of those tokens is to appear in either a ham or spam
message.  When a new message comes in, Bayes breaks it into tokens and
then scores it depending on which tokens were found in the message.
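That token bookkeeping can be illustrated with Graham-style per-token probabilities combined in log-odds space. This is a toy sketch with invented counts; SA's real Bayes uses a chi-squared combiner and a much richer tokenizer.

```python
import math

# Invented per-token counts from a hypothetical training corpus.
spam_count = {"cheap": 40, "meeting": 1, "store": 20}
ham_count  = {"cheap": 2,  "meeting": 30, "store": 18}
n_spam, n_ham = 100, 100

def token_prob(tok):
    """P(spam | token): how spammy one token looks."""
    s = spam_count.get(tok, 0) / n_spam
    h = ham_count.get(tok, 0) / n_ham
    if s + h == 0:
        return 0.5  # an unseen token carries no evidence
    return s / (s + h)

def message_prob(tokens):
    """Combine token evidence in log-odds space to avoid underflow."""
    logodds = sum(math.log(p / (1 - p))
                  for p in map(token_prob, tokens) if 0 < p < 1)
    return 1 / (1 + math.exp(-logodds))
```

A message dominated by "cheap"-like tokens scores near 1 (spam), one dominated by "meeting"-like tokens near 0 (ham), and unknown tokens pull toward neither side.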

-- 
Bowie.