Posted to users@spamassassin.apache.org by Jerry Pape <jp...@espt.com> on 2010/10/18 02:05:15 UTC
Seeking advice re: SA score discrepancies
All,
[Not sure if this is the right place to send this--please correct me if
I am in error]
At some point in the not too distant past, my otherwise reliable SA
system broke in an odd way.
This example is characteristic of the problem:
Cheap Airline Tickets email received--clearly junk
x-spam-status reads: No, score=3.8 required=4.0
tests=BAYES_40,HTML_IMAGE_RATIO_02,
HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,RDNS_NONE,URIBL_BLACK
autolearn=no version=3.2.5
Assessment of this header at http://www.futurequest.net/docs/SA/decode/
yields:
Test                   Score  Description
BAYES_40               0.000  Bayesian spam probability is 20 to 40%
HTML_IMAGE_RATIO_02    0.550  HTML has a low ratio of text to image area
HTML_MESSAGE           0.001  HTML included in message
HTML_MIME_NO_HTML_TAG  1.052  HTML-only message, but there is no HTML tag
MIME_HTML_ONLY         1.672  Message only has text/html MIME parts
RDNS_NONE              0.100  Delivered to trusted network by a host with no rDNS
URIBL_BLACK            1.961  Contains an URL listed in the URIBL blacklist
Total:                 5.336
Clearly 5.336 does not equal 3.8.
My SA is 3.2.5 with the latest updates, in a default config except that
I have set the global required score to 4.0.
I have no idea how to regress and resolve this problem.
Any guidance would be greatly appreciated.
JP
Re: Seeking advice re: SA score discrepancies
Posted by John Hardin <jh...@impsec.org>.
On Sun, 17 Oct 2010, Jerry Pape wrote:
> Oops, further investigation indicates that Bayes is "on"--thought the
> default was "off" for my config. I would be inclined to turn it off as I have
> no decent way of teaching it beyond mass-config into the future--please
> advise.
Training is critical. If your userbase is not uniform enough to allow for
you to do global training (e.g. you're not a home or a company), and your
users aren't willing or able to do individual training or don't trust you
enough to send you private ham to train with, then you are probably best
served by turning off Bayes.
> JP
>
> On 10/17/10 10:37 PM, Jerry Pape wrote:
>>
>> Further, what are the "scoreset" indexes?
There are a couple of configurations that can greatly affect scoring -
whether or not bayes is in use, and whether or not network checks like
URIBL lookups are in use. This gives (at the moment) four general
configuration cases. Four possibly different scores for common rules are
needed to give the best performance in each case.
For example, URIBL checks are very good at detecting spam, so if they are
enabled the scores on other non-network rules can be reduced a bit.
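As a rough illustration of the four cases, here is a small Python sketch of how a score set index can be derived from the two options. The numbering (0 = neither, 1 = net tests only, 2 = Bayes only, 3 = both) follows the SpamAssassin configuration documentation; the function names are made up for this example, and the MIME_HTML_ONLY values are the two quoted elsewhere in this thread (the set 0 and set 2 values are not quoted here, so they are left as None).

```python
def scoreset_index(bayes_enabled: bool, net_tests_enabled: bool) -> int:
    """Return the score set index (0-3) for a given configuration."""
    return (2 if bayes_enabled else 0) + (1 if net_tests_enabled else 0)


def pick_score(per_set_scores, bayes_enabled, net_tests_enabled):
    """Select the score that applies from a rule's four per-set scores."""
    return per_set_scores[scoreset_index(bayes_enabled, net_tests_enabled)]


# Hypothetical table entry: only sets 1 and 3 are known from this thread.
MIME_HTML_ONLY = (None, 1.672, None, 1.457)

print(scoreset_index(bayes_enabled=False, net_tests_enabled=True))  # 1
print(scoreset_index(bayes_enabled=True, net_tests_enabled=True))   # 3
print(pick_score(MIME_HTML_ONLY, True, True))                       # 1.457
```

The same rule therefore contributes a different amount depending on which of the four cases applies.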
>> I don't use Bayes because all of my clients are POP mail and they are
>> neither smart|committed enough to mail back ham/spam to educate the
>> system.
Then your decision will be based on how varied your users' mail traffic
is. If you're an ISP, bayes probably won't be appropriate. If you're an
organization (where the nature of ham you receive will be more consistent)
then global bayes may be appropriate and useful.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Sheep have only two speeds: graze and stampede. -- LTC Grossman
-----------------------------------------------------------------------
60 days until TRON Legacy
Re: Seeking advice re: SA score discrepancies
Posted by Jerry Pape <jp...@espt.com>.
Oops, further investigation indicates that Bayes is "on"--thought the
default was "off" for my config. I would be inclined to turn it off as I
have no decent way of teaching it beyond mass-config into the
future--please advise.
JP
Re: Seeking advice re: SA score discrepancies
Posted by Jerry Pape <jp...@espt.com>.
Wow, I am grateful for the prompt answers, but I must say they have
confused me.
Bayes should not be on in my config and a subsequent check of the GUI says
it's not--this may be wrong.
Further, what are the "scoreset" indexes?
I don't use Bayes because all of my clients are POP mail and they are
neither smart nor committed enough to mail back ham/spam to educate the
system.
Additionally, when I used Bayes way back when (without manual
population) and simply allowed auto-population to occur, I ended up with
enormous .spamassassin sub-files that rapidly eclipsed 50% of the
client's disk quota.
I am certain that I am missing critical configurational understanding
and optimizations, but until your lot kindly educates me--it is what it
is and my initial dilemma remains unresolved.
JP
Re: Seeking advice re: SA score discrepancies
Posted by John Hardin <jh...@impsec.org>.
On Sun, 17 Oct 2010, John Hardin wrote:
> There are four score sets to choose from based on what options you have
> enabled. The above is for scoreset 2, no BAYES + net tests.
Crap. That should be "scoreset 1". Sorry.
Re: Seeking advice re: SA score discrepancies
Posted by John Hardin <jh...@impsec.org>.
On Sun, 17 Oct 2010, Jerry Pape wrote:
> [Not sure if this is the right place to send this--please correct me if
> I am in error]
This is the place.
> Assessment of this header at http://www.futurequest.net/docs/SA/decode/
> yields:
>
> Test Score Description
> BAYES_40 0.000 Bayesian spam probability is 20 to 40%
> HTML_IMAGE_RATIO_02 0.550 HTML has a low ratio of text to image area
> HTML_MESSAGE 0.001 HTML included in message
> HTML_MIME_NO_HTML_TAG 1.052 HTML-only message, but there is no HTML tag
> MIME_HTML_ONLY 1.672 Message only has text/html MIME parts
> RDNS_NONE 0.100 Delivered to trusted network by a host with no rDNS
> URIBL_BLACK 1.961 Contains an URL listed in the URIBL blacklist
> Total: 5.336
>
> Clearly 5.336 does not equal 3.8.
There are four score sets to choose from based on what options you have
enabled. The above is for scoreset 2, no BAYES + net tests. Scoreset 3,
BAYES + net tests, gives:
HTML_MIME_NO_HTML_TAG   0.097
MIME_HTML_ONLY_MULTI    0.001
HTML_IMAGE_RATIO_02     0.383
HTML_MESSAGE            0.001
MIME_HTML_ONLY          1.457
BAYES_40               -0.185
URIBL_BLACK             1.955
RDNS_NONE               0.1
                       -------
                        3.809
These are all of the default scores, and match what you're seeing.
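For anyone who wants to check the arithmetic, a minimal Python sketch that adds up both lists follows. The scores are copied from this thread; the dictionary and function names are only illustrative. Decimal is used so the sums come out exact rather than with floating-point noise.

```python
from decimal import Decimal

# Scores reported by the decoder site (net tests, no Bayes).
DECODER_SCORES = {
    "BAYES_40": "0.000",
    "HTML_IMAGE_RATIO_02": "0.550",
    "HTML_MESSAGE": "0.001",
    "HTML_MIME_NO_HTML_TAG": "1.052",
    "MIME_HTML_ONLY": "1.672",
    "RDNS_NONE": "0.100",
    "URIBL_BLACK": "1.961",
}

# Scores for Bayes + net tests, as listed above.
BAYES_NET_SCORES = {
    "HTML_MIME_NO_HTML_TAG": "0.097",
    "MIME_HTML_ONLY_MULTI": "0.001",
    "HTML_IMAGE_RATIO_02": "0.383",
    "HTML_MESSAGE": "0.001",
    "MIME_HTML_ONLY": "1.457",
    "BAYES_40": "-0.185",
    "URIBL_BLACK": "1.955",
    "RDNS_NONE": "0.1",
}


def total(scores):
    """Sum per-rule scores exactly."""
    return sum(Decimal(value) for value in scores.values())


decoder_total = total(DECODER_SCORES)
bayes_net_total = total(BAYES_NET_SCORES)
print(decoder_total)                              # 5.336
print(bayes_net_total)                            # 3.809
print(bayes_net_total.quantize(Decimal("0.1")))   # 3.8, as in the header
```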
> I have no idea how to regress and resolve this problem.
First off, you need to review your Bayes training. An obviously spammy
message shouldn't be hitting BAYES_40. Properly-trained Bayes, hitting
BAYES_99, would have scored 7.494 on that message.
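The 7.494 figure can be unpacked with a little arithmetic. The sketch below uses only the numbers quoted in this message; the BAYES_99 score it prints is inferred from them rather than read from the rule files, so treat it as an assumption.

```python
from decimal import Decimal

required = Decimal("4.0")          # the poster's required score
actual_total = Decimal("3.809")    # total with BAYES_40
bayes_40 = Decimal("-0.185")       # BAYES_40 score in this score set
bayes_99_total = Decimal("7.494")  # total quoted for a BAYES_99 hit

# Contribution of every rule except the Bayes one.
non_bayes_total = actual_total - bayes_40
# Score BAYES_99 must carry for the quoted total to hold (inferred).
implied_bayes_99 = bayes_99_total - non_bayes_total

print(non_bayes_total)             # 3.994
print(implied_bayes_99)            # 3.500
print(non_bayes_total >= required) # False
print(bayes_99_total >= required)  # True
```

With these numbers the non-Bayes rules alone come to 3.994, just under the 4.0 threshold, so the Bayes result is what decides this message either way.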
For analysis in general...
This will put the individual rule scores into the headers:
add_header all Status "_YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTSSCORES_ autolearn=_AUTOLEARN_ version=_VERSION_"
"spamassassin --debug area=rules <test_msg_file" is often helpful.
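Once the per-rule scores are in the header they are easy to pull apart with a script. The Python sketch below assumes _TESTSSCORES_ expands to comma-separated NAME=score pairs; the sample header is hypothetical, built from the scores listed above, and the function name is made up for the example.

```python
import re


def parse_tests_scores(status_header: str) -> dict:
    """Extract {rule_name: score} from an X-Spam-Status value."""
    # Folded headers may break after a comma; rejoin the list first.
    joined = re.sub(r",\s+", ",", status_header)
    match = re.search(r"tests=(\S*)", joined)
    if match is None:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        name, sep, value = item.partition("=")
        if sep:  # skip entries without a score, e.g. "none"
            scores[name] = float(value)
    return scores


# Hypothetical header value, assuming the add_header template above.
SAMPLE = """No, score=3.8 required=4.0
 tests=BAYES_40=-0.185,HTML_IMAGE_RATIO_02=0.383,
 HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0.097,MIME_HTML_ONLY=1.457,
 RDNS_NONE=0.1,URIBL_BLACK=1.955 autolearn=no version=3.2.5"""

parsed = parse_tests_scores(SAMPLE)
for rule, score in sorted(parsed.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{rule:24s}{score:7.3f}")
```

Sorting by score makes it obvious which rules carry the result, with no need for an external decoder.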
However:
The nature of spam changes over time. 3.2, which is only getting critical
bug fixes now, will become steadily less effective the more time passes
and the spammers evolve new tricks. It's getting to the point that you
should really consider upgrading to the latest 3.3 release.
Re: Seeking advice re: SA score discrepancies
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2010-10-17 at 17:05 -0700, Jerry Pape wrote:
> At some time in the not too distant past, my otherwise reliable SA
> system has broken in an odd way.
>
> This example is characteristic of the problem:
Can't follow. It is broken, because SA itself reports something
different from an unrelated, third-party, stranger website?
If not, please feel free to explain what changed without pointing to
that source.
> x-spam-status reads: No, score=3.8 required=4.0
> tests=BAYES_40,HTML_IMAGE_RATIO_02,
> HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,RDNS_NONE,URIBL_BLACK autolearn=no version=3.2.5
>
> Assessment of this header at
> http://www.futurequest.net/docs/SA/decode/ yields:
> BAYES_40 0.000 Bayesian spam probability is 20 to 40%
> HTML_IMAGE_RATIO_02 0.550 HTML has a low ratio of text to image area
That site uses SA 3.2.x, score set 1, network tests enabled, Bayes
disabled, as evidenced by the above two scores and confirmed by the
other scores. You clearly use score set 3, both network tests and Bayes
enabled.
Given there *is* a BAYES_xx rule in there, the site is broken and does
not evaluate correctly. No excuse for the site in this case. (It would
be different with "no network test hits", which is indistinguishable
from being disabled, without the scores.)
> Clearly 5.336 does not equal 3.8.
Clearly, that site neither knows nor correctly detects the score set you
use.
> My SA is 3.2.5 in a default config except that I have set global score
> required to 4.0 with latest updates.
Yup, with Bayes enabled, the exact total score is 3.808.
What's throwing all this off is that the Bayes classifier, based on its
training, believes the mail to be hammy-ish, almost neutral, while it
should, after appropriate training, classify it as spammy, raising the
overall score.
--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: Seeking advice re: SA score discrepancies
Posted by René Berber <r....@computer.org>.
On 10/17/2010 7:05 PM, Jerry Pape wrote:
[snip]
> x-spam-status reads: No, score=3.8 required=4.0
> tests=BAYES_40,HTML_IMAGE_RATIO_02,
> HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_ONLY,RDNS_NONE,URIBL_BLACK
> autolearn=no version=3.2.5
>
> Assessment of this header at http://www.futurequest.net/docs/SA/decode/
> yields:
>
> Test Score Description
> BAYES_40 0.000 Bayesian spam probability is 20 to 40%
> HTML_IMAGE_RATIO_02 0.550 HTML has a low ratio of text to image area
> HTML_MESSAGE 0.001 HTML included in message
> HTML_MIME_NO_HTML_TAG 1.052 HTML-only message, but there is no HTML tag
> MIME_HTML_ONLY 1.672 Message only has text/html MIME parts
> RDNS_NONE 0.100 Delivered to trusted network by a host with no rDNS
> URIBL_BLACK 1.961 Contains an URL listed in the URIBL blacklist
> Total:
> 5.336
>
> Clearly 5.336 does not equal 3.8.
There are several possible causes of the discrepancy:
* You are running an old version; the site you used is probably using
the latest, 3.3.1 (with the latest rule sets).
* The scores come from the rule sets, which are updated periodically...
do you use sa-update?
* You could have changed scores locally, but you said you are using
defaults, so I only mention it for reference.
--
René Berber