Posted to users@spamassassin.apache.org by Randy Ramsdell <rr...@livedatagroup.com> on 2008/02/28 15:21:00 UTC

AWL - BAYES_99/ general questions

Hi,

One thing I do not understand regarding AWL and BAYES. When a message is
reported to me as spam and was not marked as spam, I test it using debug
before and after sa-learn. Each time I do this, BAYES_99 does hit, but
the result will also include AWL.

1. Does anyone understand why this happens?
2. I also noticed that when using "spamassassin -D" on a message, I 
sometimes see a nice report like below (2nd example) but other times it 
doesn't show report formatted. Any ideas on this one?

Here is an example of two spam report headers for the same message.

Before sa-learn:

X-Spam-Status: No, score=3.982 tagged_above=-9999 required=5
 tests=[ADVANCE_FEE_1=0, BAYES_60=1, SUB_HELLO=2.141, UNDISC_RECIPS=0.841]
X-Spam-Score: 3.982
X-Spam-Level: ***

After sa-learn:

Content analysis details:   (5.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 2.1 SUB_HELLO              Subject starts with "Hello"
 0.8 UNDISC_RECIPS          Valid-looking To "undisclosed-recipients"
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.0000]
 0.0 ADVANCE_FEE_1          Appears to be advance fee fraud (Nigerian 419)
-1.2 AWL                    AWL: From: address is in the auto white-list

Thanks,
Randy Ramsdell

Re: AWL - BAYES_99/ general questions

Posted by Matt Kettler <mk...@verizon.net>.
Randy Ramsdell wrote:
> Karsten Bräckelmann wrote:
>> On Thu, 2008-02-28 at 09:21 -0500, Randy Ramsdell wrote:
>>  
>>> Hi,
>>>
>>> One thing I do not understand regarding AWL and BAYES. When a
>>> message is reported to me as spam and was not marked as spam, I test
>>> it using debug before and after sa-learn. Each time I do this,
>>> BAYES_99 does hit, but the result will also include AWL.
>>>
>>> 1. Does anyone understand why this happens?
>>>     
>>
>> AWL is a score averager. SA has seen that sender before.
>>   http://wiki.apache.org/spamassassin/AutoWhitelist
>>
>> Run it through SA again, and you will see the AWL score getting closer
>> to 0, since the score without AWL is constant. The AWL score is
>> negative, because previous scores have been lower.
>>
>>   guenther
>>
>>
>>   
> I understand that AWL is averaging what it has seen before and it
> must have seen the message as ham, but why would one have to sa-learn
> the message as spam multiple times?

The sa-learn doesn't count as having been seen.

However, it has been seen twice. It was seen once when it first arrived, 
and a second time when you manually invoked spamassassin on it (after 
sa-learning it).





Re: AWL - BAYES_99/ general questions

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2008-02-28 at 10:28 -0500, Randy Ramsdell wrote:
> Karsten Bräckelmann wrote:

> > AWL is a score averager. SA has seen that sender before.
> >   http://wiki.apache.org/spamassassin/AutoWhitelist
> >
> > Run it through SA again, and you will see the AWL score getting closer
> > to 0, since the score without AWL is constant. The AWL score is
> > negative, because previous scores have been lower.
> 
> I understand that AWL is averaging what it has seen before and it must
> have seen the message as ham, 

No. :)  AWL does not know the concept of spam or ham, it does not know
about your required_score spam threshold. It merely knows about the
previous scores.

> but why would one have to sa-learn the message as spam multiple times?

You do NOT have to, and I didn't say so. :)  AWL keeps track of all
*seen* messages, as opposed to learned ones. Given the initial score of
the message, it has not been learned automatically.

To observe the AWL score it is sufficient, as I said, to run the message
through spamassassin -- this does not require sa-learn. Note that my
comment regarding this was intended to demonstrate AWL, so you can see
for yourself. I did not mean to imply you have to do it regularly. Just
this one time, so you can see how AWL behaves...


Also please note that AWL in fact keeps track of a pair of sender and
IP address (space). IMHO, this kind of explains the confusing naming,
namely the "whitelist" part. It is most useful for legit senders -- if
they send a single spammy message once, AWL comes to the rescue and
lowers the score drastically.

The general spam on the other hand is really unlikely to ever be sent a
second time From: the same forged sender address and the same
originating network. Odds are, this particular AWL entry will never ever
be used again with new incoming spam.
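
To make the "pair of sender and IP address (space)" idea concrete, here is
a minimal sketch of how such a lookup key could be built. The function name
awl_key, the key layout, and the choice of the first two octets as the
"address space" are assumptions made for illustration, not SpamAssassin's
actual code.

```python
from typing import Optional


def awl_key(from_addr: str, origin_ip: Optional[str]) -> str:
    """Build a hypothetical AWL lookup key from sender and IP space.

    Assumption: the "IP address space" is approximated by the first two
    octets of an IPv4 address, so nearby hosts share one entry.
    """
    addr = from_addr.strip().lower()
    if not origin_ip:
        # No usable originating IP: fall back to a sender-only entry.
        return f"{addr}|ip=none"
    octets = origin_ip.split(".")
    return f"{addr}|ip={'.'.join(octets[:2])}"


# Same sender, same network range: one shared history.
same_a = awl_key("bob@example.com", "192.0.2.10")
same_b = awl_key("bob@example.com", "192.0.77.1")

# Same (forged) sender from an unrelated network: a separate history.
forged = awl_key("bob@example.com", "198.51.100.7")
```

Under these assumptions, a forged From: address arriving from a different
network ends up in a different entry than the legit sender's mail, which is
why a one-off spam run rarely hits the same entry twice.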


> This also means that a system wide
> approach to improving our SPAM effectiveness requires me to parse the AWL
> score after sa-learning the message to determine if I need to run it
> again. This would be a monumental task and very resource intensive.

No. See above. Also please note, that Bayes (which you train using
sa-learn) and AWL are entirely unrelated. (Bayes is a token-based
mechanism, about "words" in the message, and does not know about the
concept of email addresses, let alone sender.)


> Wouldn't a better approach be to set AWL to max positive if I manually
> learn the message as spam? Or is there a way to modify the DB to correct 
> the previous AWL hits on this message?

Again, see above. If you never will get spam forged to come from that
sender, it won't make a difference. Also, again, Bayes and AWL are
unrelated.

Besides, the A stands for Automatic. No need to correct anything. ;)

If you ever need to clear an AWL score (usually, because the learned
average for a *legit* sender is too high), if at all, you can do so
using 'spambuttbuttin'. Not sa-learn. See 'man spambuttbuttin-run'. [1]

  guenther


[1] See another recent post by Justin. ;-)

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: AWL - BAYES_99/ general questions

Posted by Randy Ramsdell <rr...@livedatagroup.com>.
Karsten Bräckelmann wrote:
> On Thu, 2008-02-28 at 09:21 -0500, Randy Ramsdell wrote:
>   
>> Hi,
>>
>> One thing I do not understand regarding AWL and BAYES. When a message is
>> reported to me as spam and was not marked as spam, I test it using debug
>> before and after sa-learn. Each time I do this, BAYES_99 does hit, but
>> the result will also include AWL.
>>
>> 1. Does anyone understand why this happens?
>>     
>
> AWL is a score averager. SA has seen that sender before.
>   http://wiki.apache.org/spamassassin/AutoWhitelist
>
> Run it through SA again, and you will see the AWL score getting closer
> to 0, since the score without AWL is constant. The AWL score is
> negative, because previous scores have been lower.
>
>   guenther
>
>
>   
I understand that AWL is averaging what it has seen before and it must
have seen the message as ham, but why would one have to sa-learn the
message as spam multiple times? This also means that a system wide
approach to improving our SPAM effectiveness requires me to parse the AWL
score after sa-learning the message to determine if I need to run it
again. This would be a monumental task and very resource intensive.
Wouldn't a better approach be to set AWL to max positive if I manually
learn the message as spam? Or is there a way to modify the DB to correct
the previous AWL hits on this message?

Re: AWL - BAYES_99/ general questions

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2008-02-28 at 09:21 -0500, Randy Ramsdell wrote:
> Hi,
> 
> One thing I do not understand regarding AWL and BAYES. When a message is
> reported to me as spam and was not marked as spam, I test it using debug
> before and after sa-learn. Each time I do this, BAYES_99 does hit, but
> the result will also include AWL.
> 
> 1. Does anyone understand why this happens?

AWL is a score averager. SA has seen that sender before.
  http://wiki.apache.org/spamassassin/AutoWhitelist

Run it through SA again, and you will see the AWL score getting closer
to 0, since the score without AWL is constant. The AWL score is
negative, because previous scores have been lower.

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: AWL - BAYES_99/ general questions

Posted by Randy Ramsdell <rr...@livedatagroup.com>.
Jari Fredriksson wrote:
>> Hi,
>>
>> One thing I do not understand regarding AWL and BAYES.
>> When a message is reported to me as spam and was not
>> marked as spam, I test it using debug before and after
>> sa-learn. Each time I do this, BAYES_99 does hit, but
>> the result will also include AWL.
>>
>> 1. Does anyone understand why this happens?
>> 2. I also noticed that when using "spamassassin -D" on a
>> message, I sometimes see a nice report like below (2nd
>> example) but other times it doesn't show report
>> formatted. Any ideas on this one? 
>>     
>
>
> If I understood you correctly..
>
> In your samples, the first run gets 3.9 points, which is less than needed to classify the post as spam. The second run (after the learning) gets 5.2 points, which is more than needed to classify the post as spam.
>
>   
No. What I wanted to know is why messages that are passed through
sa-learn include AWL as well as BAYES_99. Notice the message did not hit
AWL initially, but did so after the sa-learn process. An AWL score of
-1.2 and a BAYES score of 3.5 compete with each other over whether to
mark this message as spam.
> Your configuration prints the formatted report only for spam. There is no point in delivering reports to users for email which is  not spam.
>
>   
Sweet, thanks for this.

> The limit for spam is 5.0 points (as the report says, 5.0 required), which is the default and a pretty good value.
>
>
>
>
>   

>> Here is an example of two spam report headers for the
>> same message.
>>
>> Before sa-learn:
>>
>> X-Spam-Status: No, score=3.982 tagged_above=-9999 required=5
>>  tests=[ADVANCE_FEE_1=0, BAYES_60=1, SUB_HELLO=2.141, UNDISC_RECIPS=0.841]
>> X-Spam-Score: 3.982
>> X-Spam-Level: ***
>>
>> After sa-learn:
>>
>> Content analysis details:   (5.2 points, 5.0 required)
>>
>>  pts rule name              description
>> ---- ---------------------- --------------------------------------------------
>>  2.1 SUB_HELLO              Subject starts with "Hello"
>>  0.8 UNDISC_RECIPS          Valid-looking To "undisclosed-recipients"
>>  3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
>>                             [score: 1.0000]
>>  0.0 ADVANCE_FEE_1          Appears to be advance fee fraud (Nigerian 419)
>> -1.2 AWL                    AWL: From: address is in the auto white-list
>>
>> Thanks,
>> Randy Ramsdell
>>     


Re: AWL - BAYES_99/ general questions

Posted by Jari Fredriksson <ja...@iki.fi>.
> Hi,
> 
> One thing I do not understand regarding AWL and BAYES.
> When a message is reported to me as spam and was not
> marked as spam, I test it using debug before and after
> sa-learn. Each time I do this, BAYES_99 does hit, but
> the result will also include AWL.
> 
> 1. Does anyone understand why this happens?
> 2. I also noticed that when using "spamassassin -D" on a
> message, I sometimes see a nice report like below (2nd
> example) but other times it doesn't show report
> formatted. Any ideas on this one? 


If I understood you correctly..

In your samples, the first run gets 3.9 points, which is less than needed to classify the post as spam. The second run (after the learning) gets 5.2 points, which is more than needed to classify the post as spam.

Your configuration prints the formatted report only for spam. There is no point in delivering reports to users for email which is  not spam.

The limit for spam is 5.0 points (as the report says, 5.0 required), which is the default and a pretty good value.




> 
> Here is an example of two spam report headers for the
> same message.
>
> Before sa-learn:
>
> X-Spam-Status: No, score=3.982 tagged_above=-9999 required=5
>  tests=[ADVANCE_FEE_1=0, BAYES_60=1, SUB_HELLO=2.141, UNDISC_RECIPS=0.841]
> X-Spam-Score: 3.982
> X-Spam-Level: ***
>
> After sa-learn:
>
> Content analysis details:   (5.2 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
>  2.1 SUB_HELLO              Subject starts with "Hello"
>  0.8 UNDISC_RECIPS          Valid-looking To "undisclosed-recipients"
>  3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
>                             [score: 1.0000]
>  0.0 ADVANCE_FEE_1          Appears to be advance fee fraud (Nigerian 419)
> -1.2 AWL                    AWL: From: address is in the auto white-list
> 
> Thanks,
> Randy Ramsdell

Re: AWL - BAYES_99/ general questions

Posted by Matt Kettler <mk...@verizon.net>.
Randy Ramsdell wrote:
> Hi,
>
> One thing I do not understand regarding AWL and BAYES. When a message 
> is reported to me as spam and was not marked as spam, I test it using
> debug before and after sa-learn. Each time I do this, BAYES_99 does
> hit, but the result will also include AWL.
>
> 1. Does anyone understand why this happens?
I assume you're asking about why the AWL appears. That's normal. The
first thing to realize is the AWL is *NOT* a whitelist. It's a
sender-based score averager. It has both white and blacklist effects.

If the current message scores higher than the past average for a
sender, the AWL will take points off, trying to "split the difference"
between the past and current scores.

Since you just sa-learned a message from a sender that's probably never
sent to you before, the score now is almost guaranteed to be higher than
the first pass through, resulting in a negative AWL score.
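
As a rough illustration of that averaging, here is a small sketch (a model
for the arithmetic only, not SpamAssassin's actual code). The names AwlEntry
and adjust are made up for this example. It assumes the stored history is
the score before the AWL adjustment, and it uses a factor of 0.5, which is
the documented default of auto_whitelist_factor. The 6.482 figure is the sum
of the rule scores in the report above (2.141 + 0.841 + 3.5 + 0), assuming
the displayed 3.5 is the exact BAYES_99 score.

```python
from dataclasses import dataclass


@dataclass
class AwlEntry:
    """Hypothetical per-sender history: running total and scan count."""
    total: float = 0.0  # sum of pre-AWL scores seen so far
    count: int = 0      # how many times this sender has been scanned

    def adjust(self, score: float, factor: float = 0.5) -> float:
        """Return the AWL delta for this scan, then record the score."""
        delta = 0.0
        if self.count > 0:
            mean = self.total / self.count
            # Pull the current score part of the way toward the mean.
            delta = (mean - score) * factor
        self.total += score
        self.count += 1
        return delta


entry = AwlEntry()

# First delivery: no history yet, so no AWL adjustment.
first = entry.adjust(3.982)

# After sa-learn the same message scores 6.482 before AWL.
# Scanning it three more times shows the delta shrinking toward 0.
deltas = [entry.adjust(6.482) for _ in range(3)]
```

With these numbers the first rescan gives a delta of about -1.25, which is
consistent with the -1.2 AWL line and the 5.2 total in the report, and each
further rescan moves the delta closer to 0 as the higher score gets
averaged into the history.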

However, that's not a problem. Note this message, even with the AWL,
didn't fall below the spam tag threshold. The AWL doesn't work on a
"good vs bad" senders basis, so just because it scores negative, it
doesn't mean the AWL thinks the message is nonspam. In your example, it
just thought it was less spammy, but still spam.

You might want to read this wiki article for a better discussion of the 
AWL's behaviors:

http://wiki.apache.org/spamassassin/AwlWrongWay

> 2. I also noticed that when using "spamassassin -D" on a message, I 
> sometimes see a nice report like below (2nd example) but other times 
> it doesn't show report formatted. Any ideas on this one?
SA won't generate a formatted report for a message below the spam tag 
level. You can force it to do so by adding -t.

>
> Here is an example of two spam report headers for the same message.
>
> Before sa-learn:
>
> X-Spam-Status: No, score=3.982 tagged_above=-9999 required=5
> tests=[ADVANCE_FEE_1=0, BAYES_60=1, SUB_HELLO=2.141, UNDISC_RECIPS=0.841]
> X-Spam-Score: 3.982
> X-Spam-Level: ***
>
> After sa-learn:
>
> Content analysis details:   (5.2 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
>  2.1 SUB_HELLO              Subject starts with "Hello"
>  0.8 UNDISC_RECIPS          Valid-looking To "undisclosed-recipients"
>  3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
>                             [score: 1.0000]
>  0.0 ADVANCE_FEE_1          Appears to be advance fee fraud (Nigerian 419)
> -1.2 AWL                    AWL: From: address is in the auto white-list
>
> Thanks,
> Randy Ramsdell
>