Posted to users@spamassassin.apache.org by Jerry Pape <jp...@espt.com> on 2010/10/18 02:05:15 UTC

Seeking advice re: SA score discrepancies

All,

[Not sure if this is the right place to send this--please correct me if 
I am in error]

At some time in the not too distant past, my otherwise reliable SA 
system has broken in an odd way.

This example is characteristic of the problem:

Cheap Airline Tickets email received--clearly junk

x-spam-status reads: No, score=3.8 required=4.0 
tests=BAYES_40,HTML_IMAGE_RATIO_02,    
HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,RDNS_NONE,URIBL_BLACK    
autolearn=no version=3.2.5

Assessment of this header at http://www.futurequest.net/docs/SA/decode/ 
yields:


Test                   Score   Description
BAYES_40               0.000   Bayesian spam probability is 20 to 40%
HTML_IMAGE_RATIO_02    0.550   HTML has a low ratio of text to image area
HTML_MESSAGE           0.001   HTML included in message
HTML_MIME_NO_HTML_TAG  1.052   HTML-only message, but there is no HTML tag
MIME_HTML_ONLY         1.672   Message only has text/html MIME parts
RDNS_NONE              0.100   Delivered to trusted network by a host with no rDNS
URIBL_BLACK            1.961   Contains an URL listed in the URIBL blacklist
Total:                 5.336


Clearly 5.336 does not equal 3.8.

My SA is 3.2.5 with the latest updates, in a default config except that 
I have set the global required score to 4.0.

I have no idea how to regress and resolve this problem.

Any guidance would be greatly appreciated.

JP

Re: Seeking advice re: SA score discrepancies

Posted by John Hardin <jh...@impsec.org>.
On Sun, 17 Oct 2010, Jerry Pape wrote:

> Oops, further investigation indicates that Bayes is "on"--thought the 
> default was "off" for my config. I would be inclined to turn it off as I have 
> no decent way of teaching it beyond mass-config into the future--please 
> advise.

Training is critical. If your userbase is not uniform enough to allow for 
you to do global training (e.g. you're not a home or a company), and your 
users aren't willing or able to do individual training or don't trust you 
enough to send you private ham to train with, then you are probably best 
served by turning off Bayes.
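
If you do keep Bayes, training in practice means feeding sa-learn 
hand-sorted mail on a regular basis. A minimal sketch (the paths are 
placeholders for wherever your sorted ham and spam actually live):

   # placeholder paths -- point these at folders of individually
   # hand-sorted messages
   sa-learn --spam /path/to/sorted-spam/
   sa-learn --ham  /path/to/sorted-ham/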

> JP
>
> On 10/17/10 10:37 PM, Jerry Pape wrote:
>>
>>  Further, what are the "scoreset" indexes?

There are a couple of configurations that can greatly affect scoring - 
whether or not bayes is in use, and whether or not network checks like 
URIBL lookups are in use. This gives (at the moment) four general 
configuration cases. Four possibly different scores for common rules are 
needed to give the best performance in each case.

For example, URIBL checks are very good at detecting spam, so if they are 
enabled the scores on other non-network rules can be reduced a bit.
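
You can see the four sets in the stock rules files: a "score" line may 
carry four values, one per score set, in this order (the rule name and 
numbers here are made up, purely to show the layout):

   # the four values map to score sets 0..3:
   #   0 = Bayes off, net tests off    1 = Bayes off, net tests on
   #   2 = Bayes on,  net tests off    3 = Bayes on,  net tests on
   score EXAMPLE_RULE 1.2 0.9 1.1 0.7

SA picks the set that matches your own configuration, which is why the 
same list of rule names can add up differently on two systems.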

>>  I don't use Bayes because all of my clients are POP mail and they are
>>  neither smart|committed enough to mail back ham/spam to educate the
>>  system.

Then your decision will be based on how varied your users' mail traffic 
is. If you're an ISP, bayes probably won't be appropriate. If you're an 
organization (where the nature of ham you receive will be more consistent) 
then global bayes may be appropriate and useful.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Sheep have only two speeds: graze and stampede.     -- LTC Grossman
-----------------------------------------------------------------------
  60 days until TRON Legacy

Re: Seeking advice re: SA score discrepancies

Posted by Jerry Pape <jp...@espt.com>.
  Oops, further investigation indicates that Bayes is "on"--thought the 
default was "off" for my config. I would be inclined to turn it off as I 
have no decent way of teaching it beyond mass-config into the 
future--please advise.

JP

On 10/17/10 10:37 PM, Jerry Pape wrote:
>  Wow, I am grateful for the prompt answers, but I must say they have 
> confused me.
>
> Bayes should not be on in my config, and a subsequent check of the GUI 
> says it's not--this may be wrong.
>
> Further, what are the "scoreset" indexes?
>
> I don't use Bayes because all of my clients are POP mail and they are 
> neither smart|committed enough to mail back ham/spam to educate the 
> system.
>
> Additionally, when I used Bayes way back when (without manual 
> population) and simply allowed auto-population to occur, I ended up 
> with enormous
> .spamassassin sub-files that rapidly eclipsed 50% of the client's disk 
> quota.
>
> I am certain that I am missing critical configurational understanding 
> and optimizations, but
> until your lot kindly educates me--it is what it is and my initial 
> dilemma remains unresolved.
>
> JP
>
> On 10/17/10 7:01 PM, John Hardin wrote:
>> On Sun, 17 Oct 2010, Jerry Pape wrote:
>>
>>> [Not sure if this is the right place to send this--please correct me 
>>> if I am in error]
>>
>> This is the place.
>>
>>> Assessment of this header at 
>>> http://www.futurequest.net/docs/SA/decode/ yields:
>>>
>>> Test     Score     Description
>>> BAYES_40     0.000     Bayesian spam probability is 20 to 40%
>>> HTML_IMAGE_RATIO_02     0.550     HTML has a low ratio of text to 
>>> image area
>>> HTML_MESSAGE     0.001     HTML included in message
>>> HTML_MIME_NO_HTML_TAG     1.052     HTML-only message, but there is 
>>> no HTML tag
>>> MIME_HTML_ONLY     1.672     Message only has text/html MIME parts
>>> RDNS_NONE     0.100     Delivered to trusted network by a host with 
>>> no rDNS
>>> URIBL_BLACK     1.961     Contains an URL listed in the URIBL blacklist
>>> Total:     5.336
>>>
>>> Clearly 5.336 does not equal 3.8.
>>
>> There are four score sets to choose from based on what options you 
>> have enabled. The above is for scoreset 2, no BAYES + net tests. 
>> Scoreset 3, BAYES + net tests, gives:
>>
>>   HTML_MIME_NO_HTML_TAG  0.097
>>   MIME_HTML_ONLY_MULTI   0.001
>>   HTML_IMAGE_RATIO_02    0.383
>>   HTML_MESSAGE           0.001
>>   MIME_HTML_ONLY         1.457
>>   BAYES_40              -0.185
>>   URIBL_BLACK            1.955
>>   RDNS_NONE              0.1
>>                         -------
>>                          3.809
>>
>> These are all of the default scores, and match what you're seeing.
>>
>>> I have no idea how to regress and resolve this problem.
>>
>> First off, you need to review your Bayes training. An obviously 
>> spammy message shouldn't be hitting BAYES_40. Properly-trained Bayes, 
>> hitting BAYES_99, would have scored 7.494 on that message.
>>
>> For analysis in general...
>>
>> This will put the individual rule scores into the headers:
>>
>>  add_header all Status "_YESNO_, score=_SCORE_ required=_REQD_ 
>> tests=_TESTSSCORES_ autolearn=_AUTOLEARN_ version=_VERSION_"
>>
>> "spamassassin --debug area=rules <test_msg_file" is often helpful.
>>
>> However:
>>
>> The nature of spam changes over time. 3.2, which is only getting 
>> critical bug fixes now, will become steadily less effective the more 
>> time passes and the spammers evolve new tricks. It's getting to the 
>> point that you should really consider upgrading to the latest 3.3 
>> release.
>>
>
>

Re: Seeking advice re: SA score discrepancies

Posted by Jerry Pape <jp...@espt.com>.
  Wow, I am grateful for the prompt answers, but I must say they have 
confused me.

Bayes should not be on in my config, and a subsequent check of the GUI 
says it's not--this may be wrong.

Further, what are the "scoreset" indexes?

I don't use Bayes because all of my clients are POP mail and they are 
neither smart|committed enough to mail back ham/spam to educate the system.

Additionally, when I used Bayes way back when (without manual 
population) and simply allowed auto-population to occur, I ended up with 
enormous
.spamassassin sub-files that rapidly eclipsed 50% of the client's disk 
quota.

I am certain that I am missing critical configurational understanding 
and optimizations, but
until your lot kindly educates me--it is what it is and my initial 
dilemma remains unresolved.

JP

On 10/17/10 7:01 PM, John Hardin wrote:
> On Sun, 17 Oct 2010, Jerry Pape wrote:
>
>> [Not sure if this is the right place to send this--please correct me 
>> if I am in error]
>
> This is the place.
>
>> Assessment of this header at 
>> http://www.futurequest.net/docs/SA/decode/ yields:
>>
>> Test     Score     Description
>> BAYES_40     0.000     Bayesian spam probability is 20 to 40%
>> HTML_IMAGE_RATIO_02     0.550     HTML has a low ratio of text to 
>> image area
>> HTML_MESSAGE     0.001     HTML included in message
>> HTML_MIME_NO_HTML_TAG     1.052     HTML-only message, but there is 
>> no HTML tag
>> MIME_HTML_ONLY     1.672     Message only has text/html MIME parts
>> RDNS_NONE     0.100     Delivered to trusted network by a host with 
>> no rDNS
>> URIBL_BLACK     1.961     Contains an URL listed in the URIBL blacklist
>> Total:     5.336
>>
>> Clearly 5.336 does not equal 3.8.
>
> There are four score sets to choose from based on what options you 
> have enabled. The above is for scoreset 2, no BAYES + net tests. 
> Scoreset 3, BAYES + net tests, gives:
>
>   HTML_MIME_NO_HTML_TAG  0.097
>   MIME_HTML_ONLY_MULTI   0.001
>   HTML_IMAGE_RATIO_02    0.383
>   HTML_MESSAGE           0.001
>   MIME_HTML_ONLY         1.457
>   BAYES_40              -0.185
>   URIBL_BLACK            1.955
>   RDNS_NONE              0.1
>                         -------
>                          3.809
>
> These are all of the default scores, and match what you're seeing.
>
>> I have no idea how to regress and resolve this problem.
>
> First off, you need to review your Bayes training. An obviously spammy 
> message shouldn't be hitting BAYES_40. Properly-trained Bayes, hitting 
> BAYES_99, would have scored 7.494 on that message.
>
> For analysis in general...
>
> This will put the individual rule scores into the headers:
>
>  add_header all Status "_YESNO_, score=_SCORE_ required=_REQD_ 
> tests=_TESTSSCORES_ autolearn=_AUTOLEARN_ version=_VERSION_"
>
> "spamassassin --debug area=rules <test_msg_file" is often helpful.
>
> However:
>
> The nature of spam changes over time. 3.2, which is only getting 
> critical bug fixes now, will become steadily less effective the more 
> time passes and the spammers evolve new tricks. It's getting to the 
> point that you should really consider upgrading to the latest 3.3 
> release.
>

Re: Seeking advice re: SA score discrepancies

Posted by John Hardin <jh...@impsec.org>.
On Sun, 17 Oct 2010, John Hardin wrote:

> There are four score sets to choose from based on what options you have 
> enabled. The above is for scoreset 2, no BAYES + net tests.

Crap. That should be "scoreset 1". Sorry.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Men by their constitutions are naturally divided in to two parties:
   1. Those who fear and distrust the people and wish to draw all
   powers from them into the hands of the higher classes. 2. Those who
   identify themselves with the people, have confidence in them,
   cherish and consider them as the most honest and safe, although not
   the most wise, depository of the public interests.
                                                   -- Thomas Jefferson
-----------------------------------------------------------------------
  61 days until TRON Legacy

Re: Seeking advice re: SA score discrepancies

Posted by John Hardin <jh...@impsec.org>.
On Sun, 17 Oct 2010, Jerry Pape wrote:

> [Not sure if this is the right place to send this--please correct me if 
> I am in error]

This is the place.

> Assessment of this header at http://www.futurequest.net/docs/SA/decode/ 
> yields:
>
> Test 	Score 	Description
> BAYES_40 	0.000 	Bayesian spam probability is 20 to 40%
> HTML_IMAGE_RATIO_02 	0.550 	HTML has a low ratio of text to image area
> HTML_MESSAGE 	0.001 	HTML included in message
> HTML_MIME_NO_HTML_TAG 	1.052 	HTML-only message, but there is no HTML tag
> MIME_HTML_ONLY 	1.672 	Message only has text/html MIME parts
> RDNS_NONE 	0.100 	Delivered to trusted network by a host with no rDNS
> URIBL_BLACK 	1.961 	Contains an URL listed in the URIBL blacklist
> Total: 	5.336
>
> Clearly 5.336 does not equal 3.8.

There are four score sets to choose from based on what options you have 
enabled. The above is for scoreset 2, no BAYES + net tests. Scoreset 3, 
BAYES + net tests, gives:

   HTML_MIME_NO_HTML_TAG  0.097
   MIME_HTML_ONLY_MULTI   0.001
   HTML_IMAGE_RATIO_02    0.383
   HTML_MESSAGE           0.001
   MIME_HTML_ONLY         1.457
   BAYES_40              -0.185
   URIBL_BLACK            1.955
   RDNS_NONE              0.1
                         -------
                          3.809

These are all of the default scores, and match what you're seeing.

> I have no idea how to regress and resolve this problem.

First off, you need to review your Bayes training. An obviously spammy 
message shouldn't be hitting BAYES_40. Properly-trained Bayes, hitting 
BAYES_99, would have scored 7.494 on that message.
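
That 7.494 is just the 3.809 total above with the BAYES_40 score swapped 
out for BAYES_99 (3.5 being the stock score set 3 value for BAYES_99 in 
3.2.x):

   3.809 - (-0.185) + 3.5 = 7.494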

For analysis in general...

This will put the individual rule scores into the headers:

  add_header all Status "_YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTSSCORES_ autolearn=_AUTOLEARN_ version=_VERSION_"
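
With _TESTSSCORES_ in there, the rewritten header for the message above 
would come out looking roughly like this (scores are the score set 3 
values listed earlier; the exact wrapping will vary):

   X-Spam-Status: No, score=3.8 required=4.0
       tests=BAYES_40=-0.185,HTML_IMAGE_RATIO_02=0.383,HTML_MESSAGE=0.001,
       HTML_MIME_NO_HTML_TAG=0.097,MIME_HTML_ONLY=1.457,RDNS_NONE=0.1,
       URIBL_BLACK=1.955 autolearn=no version=3.2.5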

"spamassassin --debug area=rules <test_msg_file" is often helpful.

However:

The nature of spam changes over time. 3.2, which is only getting critical 
bug fixes now, will become steadily less effective the more time passes 
and the spammers evolve new tricks. It's getting to the point that you 
should really consider upgrading to the latest 3.3 release.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Individual liberties are always "loopholes" to absolute authority.
-----------------------------------------------------------------------
  61 days until TRON Legacy

Re: Seeking advice re: SA score discrepancies

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2010-10-17 at 17:05 -0700, Jerry Pape wrote:
> At some time in the not too distant past, my otherwise reliable SA
> system has broken in an odd way.
> 
> This example is characteristic of the problem:

Can't follow. It is broken because SA itself reports something
different from what an unrelated, third-party website claims?

If not, please feel free to explain what changed without pointing to
that source.

> x-spam-status reads: No, score=3.8 required=4.0
> tests=BAYES_40,HTML_IMAGE_RATIO_02,
> HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,RDNS_NONE,URIBL_BLACK    autolearn=no version=3.2.5
> 
> Assessment of this header at
> http://www.futurequest.net/docs/SA/decode/ yields:

> BAYES_40             0.000  Bayesian spam probability is 20 to 40%
> HTML_IMAGE_RATIO_02  0.550  HTML has a low ratio of text to image area

That site uses SA 3.2.x, score set 1, network tests enabled, Bayes
disabled, as evidenced by the above two scores and confirmed by the
other scores. You clearly use score set 3, both network tests and Bayes
enabled.

Given there *is* a BAYES_xx rule in there, the site is broken and does
not evaluate the score set correctly. No excuse for the site in this
case. (It would be different with "no network test hits", which, without
the scores, is indistinguishable from network tests being disabled.)


> Clearly 5.336 does not equal 3.8.

Clearly, that site neither knows nor correctly detects the score set
you are using.

> My SA is 3.2.5 with the latest updates, in a default config except
> that I have set the global required score to 4.0.

Yup, with Bayes enabled, the exact total score is 3.808.
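
That is simply the score set 3 values from John's breakdown, for the
seven rules that actually appear in your header, added up:

   -0.185 + 0.383 + 0.001 + 0.097 + 1.457 + 0.1 + 1.955 = 3.808
   (BAYES_40, HTML_IMAGE_RATIO_02, HTML_MESSAGE, HTML_MIME_NO_HTML_TAG,
    MIME_HTML_ONLY, RDNS_NONE, URIBL_BLACK)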

What is offsetting all this is that the Bayes classifier, based on its
training, believes the mail to be hammy-ish, almost neutral -- while
after appropriate training it should classify it as spammy, raising the
overall score.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Seeking advice re: SA score discrepancies

Posted by René Berber <r....@computer.org>.
On 10/17/2010 7:05 PM, Jerry Pape wrote:

[snip]
> x-spam-status reads: No, score=3.8 required=4.0
> tests=BAYES_40,HTML_IMAGE_RATIO_02,   
> HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,RDNS_NONE,URIBL_BLACK   
> autolearn=no version=3.2.5
> 
> Assessment of this header at http://www.futurequest.net/docs/SA/decode/
> yields:
> 
> Test 	Score 	Description
> BAYES_40 	0.000 	Bayesian spam probability is 20 to 40%
> HTML_IMAGE_RATIO_02 	0.550 	HTML has a low ratio of text to image area
> HTML_MESSAGE 	0.001 	HTML included in message
> HTML_MIME_NO_HTML_TAG 	1.052 	HTML-only message, but there is no HTML tag
> MIME_HTML_ONLY 	1.672 	Message only has text/html MIME parts
> RDNS_NONE 	0.100 	Delivered to trusted network by a host with no rDNS
> URIBL_BLACK 	1.961 	Contains an URL listed in the URIBL blacklist
> Total:     5.336
> 
> Clearly 5.336 does not equal 3.8.

There are several possible causes of the discrepancy:

* You are running an old version; the site you used is probably running
the latest, 3.3.1 (with the latest rule sets).

* The scores come from the rule sets, which are updated periodically...
do you use sa-update? (An example invocation follows this list.)

* You could have changed scores locally, but you said you are using
defaults, so I only mention it for reference.
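
A typical setup runs sa-update nightly from cron and restarts spamd only
when new rules were actually installed (sa-update exits 0 in that case).
The entry below is only an example -- adjust the time and the restart
command to whatever manages spamd on your system:

   # hypothetical crontab entry
   30 3 * * *   sa-update && /etc/init.d/spamassassin restart
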
-- 
René Berber