Posted to users@spamassassin.apache.org by Kai Schaetzl <ma...@conactive.com> on 2004/08/29 16:32:03 UTC

autolearning score requirements

I see that the header and body requirement of three score points each 
leads to many messages with high scores not getting autolearned. The 
reason is that URIDNSBL isn't counted (isn't that a body hit?), and any 
custom rules (my own, SARE, etc.) are also not counted. However, it's 
often those rules which make a message score high, and for good reason. 
Isn't there a way to mark these rules as header or body, so they count?
I think it's particularly those messages which should get learned into 
bayes, since the built-in rules fail for them. But exactly the opposite 
happens.

BTW: I'm surprised that BAYES_99 dropped from 5.4 to 1.89 in SA 3.0. Are 
we so unimpressed by the bayes categorization? Was there a significant 
change in how it works now? I have the feeling that it (3.0 RC1 and RC2) 
shows fewer "99"s than 2.63 did, but that's only an impression from a few 
days of a test run. Still, BAYES_99 is quite accurate when it shows up. 
I know I can change that score, but I'm curious why it is so low now.


Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org




Re: autolearning score requirements

Posted by Bill Landry <bi...@pointshare.com>.
----- Original Message ----- 
From: "Theo Van Dinter" <fe...@kluge.net>

> On Sun, Aug 29, 2004 at 08:47:59AM -0700, Bill Landry wrote:
> > Is this correct, that custom rule scores and URIDNSBL scores are
> > ignored by bayes auto-learning?  If so, what's the rationale behind
> > this?  Is this true for both SA 2.6x and SA 3.0?
>
> I'm not sure where he came up with that idea actually.  The "which rules
> to skip" decision is very straightforward:
>
> - skip rules with tflags set as "noautolearn", "userconf", or "learn"
> - "skip" rules with a 0 score in set 0/1
>
> There's nothing internally that differentiates between "custom" (locally
> configured) and "standard" (comes with SA) rules/scores/etc.

Thanks, Theo, I very much appreciate you taking the time to clarify and
explain this.

Bill


Re: autolearning score requirements

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sun, Aug 29, 2004 at 08:57:50PM +0200, Kai Schaetzl wrote:
> debug: auto-learn: message score: 26.353, computed score for autolearn: 23.74
> debug: auto-learn? ham=0.1, spam=8, body-points=14.687, head-points=2.2, learned-points=1.886
> 
> There is an overall score of 26 - body score of 14 = 12 - head points =
> 10 - BAYES_99 = 8 (roughly). So,

As I said in the last message, the body/head accounting is not
straightforward; it's slightly complex.  You can't just assume the
equation 'body + head = score' is valid, because it usually isn't.

>         *  1.2 HTML_MESSAGE HTML message
>         *  3.1 STRONG_BUY BODY: Tells you about a strong buy
>         *  2.7 NOT_ADVISOR BODY: Not registered investment advisor
>         *  0.2 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
>         *  0.2 HTML_10_20 BODY: Message is 10% to 20% HTML
these all look like straight body rules, so they'd be added to body-points.

>         *  0.0 MIME_QP_LONG_LINE RAW: Quoted-printable line longer than 76 chars
head + body, but 0 score, so ignored.

>         *  2.3 LONGWORDS Long string of long words
meta and no "net" tflag, so ignored (it could be either, so it's skipped;
this differs from my last mail, I think, but I reread the code to verify...)

>         *  4.1 RATWARE_ZERO_TZ Bulk email fingerprint (+0000) found
meta and no "net" tflag, so ignored (see above).

I have no input on the non-standard rules since they're ... well,
non-standard, so I have no idea how they're configured or what they do.
A quick look at the original list shows that SARE_MULT_RATW_02 seems to
be the only header rule, and it's 2.2 points, which would explain the 2.2
head-points in the debug output. :)

-- 
Randomly Generated Tagline:
Know what I like about Windows? Not a damn thing.

Re: autolearning score requirements

Posted by Kai Schaetzl <ma...@conactive.com>.
Theo Van Dinter wrote on Sun, 29 Aug 2004 13:04:35 -0400:

> I'm not sure where he came up with that idea actually.
>

Yeah, with later tests I saw that the header and body computations *must* be taking custom rules into 
account. I based my assumption on earlier tests today; I probably misinterpreted them. Anyway, here's a 
typical example where I'm not sure why *exactly* it did not get autolearned:

debug: auto-learn: message score: 26.353, computed score for autolearn: 23.74
debug: auto-learn? ham=0.1, spam=8, body-points=14.687, head-points=2.2, learned-points=1.886

Ok, that's clear, too few head-points, but "where" did I "lose" them? Below is a breakdown of the hits. 
Doesn't it look on first and second glance like it should have been autolearned? Only digging deep into 
it shows some possible reasons. Let's see:
There is an overall score of 26; minus the body score of 14 leaves 12; minus the head points leaves 
10; minus BAYES_99 leaves 8 (roughly). So around 8 score points didn't count. Which rules did they come 
from? There's also a recomputed score which is used for autolearn, and that's only 24 (possibly minus 
bayes). So I'm still missing 8 score points which counted toward neither header nor body. (Actually, it 
might be interesting to use the SARE bayes-poison rules as a basis for NOT autolearning, but that's 
obviously not the case here.)
I assume the missing points belong to those rules with "noautolearn" etc., or to rules which are neither 
header nor body. But how to determine that? E.g., LONGWORDS should count as a body test (no noautolearn 
from what I can see), but it is not identified as a BODY test in the list, so I don't know whether it 
counted. On the other hand, RATWARE_ZERO_TZ is a header test but wasn't used. Looking further, I see 
that it's actually a meta test based on header tests, and LONGWORDS is a meta test as well, but based on 
body tests. Looks like I found the answer: meta tests don't count at all? I'd rather count them. As you 
can see, this removed enough score points to keep this message from being autolearned, although it 
scored a whopping 26 overall. I suppose the reason for not using meta tests at all is that determining 
their nature from the subtests would need even more processing, and some meta tests are not clearly body 
or header tests. The problem is that there are many meta tests, and it seems none of them count.
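My arithmetic above, spelled out with the exact values from the two debug lines (just a back-of-the-envelope check, not anything SpamAssassin itself computes):

```python
# Values copied from the debug lines above.
total   = 26.353   # message score
body    = 14.687   # body-points
head    = 2.2      # head-points
learned = 1.886    # learned-points (BAYES_99)

# Points that counted toward the score but toward neither head nor body.
unattributed = total - body - head - learned
print(round(unattributed, 3))  # ~7.58, the "around 8 points" I mentioned
```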


        *  1.0 S_FREE_6 S_FREE_6
        *  1.2 HTML_MESSAGE HTML message
        *  0.1 TW_JS BODY: Odd Letter Triples with JS
        *  0.1 TW_ZF BODY: Odd Letter Triples with ZF
        *  0.6 J_CHICKENPOX_42 BODY: 4alpha-pock-2alpha
        *  0.1 TW_UW BODY: Odd Letter Triples with UW
        *  0.1 TW_XV BODY: Odd Letter Triples with XV
        *  0.1 TW_FG BODY: Odd Letter Triples with FG
        *  2.0 SPAM_BUY_8 BODY: SPAM_BUY_8
        *  0.1 TW_HW BODY: Odd Letter Triples with HW
        *  0.1 TW_FH BODY: Odd Letter Triples with FH
        *  0.1 TW_VT BODY: Odd Letter Triples with VT
        *  0.1 TW_QL BODY: Odd Letter Triples with QL
        *  0.1 TW_YY BODY: Odd Letter Triples with YY
        *  0.1 TW_UQ BODY: Odd Letter Triples with UQ
        *  3.1 STRONG_BUY BODY: Tells you about a strong buy
        *  0.1 TW_QH BODY: Odd Letter Triples with QH
        *  0.1 TW_TC BODY: Odd Letter Triples with TC
        *  0.1 TW_DP BODY: Odd Letter Triples with DP
        *  0.1 TW_VQ BODY: Odd Letter Triples with VQ
        *  0.1 TW_HJ BODY: Odd Letter Triples with HJ
        *  0.8 SARE_BAYES_7x5 BODY: Bayes poison 7x5
        *  0.1 TW_CB BODY: Odd Letter Triples with CB
        *  2.7 NOT_ADVISOR BODY: Not registered investment advisor
        *  1.7 SARE_FWDLOOK BODY: Forward looking statements about stocks
        *  0.1 TW_YD BODY: Odd Letter Triples with YD
        *  0.8 SARE_BAYES_8x5 BODY: Bayes poison 8x5
        *  0.1 TW_IU BODY: Odd Letter Triples with IU
        *  0.1 TW_YQ BODY: Odd Letter Triples with YQ
        *  0.2 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
        *  1.9 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
        *      [score: 1.0000]
        *  0.2 HTML_10_20 BODY: Message is 10% to 20% HTML
        *  0.0 MIME_QP_LONG_LINE RAW: Quoted-printable line longer than 76 chars
        *  2.3 LONGWORDS Long string of long words
        *  4.1 RATWARE_ZERO_TZ Bulk email fingerprint (+0000) found
        *  2.2 SARE_MULT_RATW_02 Spammer sign in headers



Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org




Re: autolearning score requirements

Posted by Jeff Chan <je...@surbl.org>.
On Sunday, August 29, 2004, 10:04:35 AM, Theo Dinter wrote:
> On Sun, Aug 29, 2004 at 08:47:59AM -0700, Bill Landry wrote:
>> Is this correct, that custom rule scores and URIDNSBL scores are ignored by
> bayes auto-learning?  If so, what's the rationale behind this?  Is this true
>> for both SA 2.6x and SA 3.0?
[...]

> As for head vs body ...

> The details of head/body can get into a long discussion (see below)
> since it's slightly complex.  In short, URIBL rules are considered
> header tests and are added appropriately.  (I'm not 100% sure why they're
> written as "header:eval" instead of "body:eval", but that's a different
> discussion... actually, I've opened ticket 3734 about that.)
[...]

> The URIBL_* rules are are "header" rules, and they're not considered
> "RBL", so they should be considered definite "head" rules.

To be honest, I'm not really familiar with SA architecture, but
it naively seems that URI tests should be considered "body" rules.

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/


Re: autolearning score requirements

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sun, Aug 29, 2004 at 08:47:59AM -0700, Bill Landry wrote:
> Is this correct, that custom rule scores and URIDNSBL scores are ignored by
> bayes auto-learning?  If so, what's the rationale behind this?  Is this true
> for both SA 2.6x and SA 3.0?

I'm not sure where he came up with that idea actually.  The "which rules
to skip" decision is very straightforward:

- skip rules with tflags set as "noautolearn", "userconf", or "learn"
- "skip" rules with a 0 score in set 0/1

There's nothing internally that differentiates between "custom" (locally
configured) and "standard" (comes with SA) rules/scores/etc.
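The two-bullet decision above could be sketched like this (illustrative Python, not SpamAssassin's actual Perl; the function name is made up):

```python
# tflags that exclude a rule's score from the autolearn computation.
SKIP_TFLAGS = {"noautolearn", "userconf", "learn"}

def counts_for_autolearn(tflags, score):
    """True if a rule's score contributes to the autolearn total."""
    if SKIP_TFLAGS & set(tflags):
        return False   # explicitly flagged to be skipped
    if score == 0:
        return False   # zero score in scoreset 0/1
    return True

print(counts_for_autolearn(["net"], 2.2))          # True
print(counts_for_autolearn(["noautolearn"], 5.0))  # False
```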


As for head vs body ...

The details of head/body can get into a long discussion (see below)
since it's slightly complex.  In short, URIBL rules are considered
header tests and are added appropriately.  (I'm not 100% sure why they're
written as "header:eval" instead of "body:eval", but that's a different
discussion... actually, I've opened ticket 3734 about that.)


The less short version:

"head" rules are usually "header" or "header ... eval" rules, or "meta"
rules that don't have the "net" tflag. Things like "header ...
eval:check_rbl" are considered "RBL" rules, though, not "head" rules.

"body" rules are "body", "body ... eval", or "uri" rules, or "meta"
rules that don't have the "net" tflag.

The URIBL_* rules are "header" rules, and they're not considered
"RBL", so they should be considered definite "head" rules.
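As an illustrative sketch of that classification (the function name and type strings are made up; the real logic lives in SpamAssassin's Perl internals, and my treatment of meta rules here gets revised elsewhere in this thread):

```python
def head_body_class(rule_type, tflags=(), eval_fn=None):
    """Classify a rule for autolearn head/body accounting (sketch only)."""
    if rule_type == "header" and eval_fn == "check_rbl":
        return "RBL"            # RBL rules: neither head nor body
    if rule_type in ("header", "header-eval"):
        return "head"
    if rule_type in ("body", "body-eval", "uri"):
        return "body"
    if rule_type == "meta" and "net" not in tflags:
        return "head+body"      # could be either; see the meta discussion
    return None                 # e.g. meta rules with the "net" tflag

print(head_body_class("uri"))                          # body
print(head_body_class("header", eval_fn="check_rbl"))  # RBL
```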

-D shows the different points as used for autolearning calculations, so a
random sample shows:

[...]
debug: auto-learn: currently using scoreset 1.
debug: auto-learn: message score: 17.258, computed score for autolearn: 17.259
debug: auto-learn? ham=0.1, spam=12, body-points=6.307, head-points=17.259, learned-points=0
debug: auto-learn? yes, spam (17.259 > 12)
[...]
debug: tests=DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_POST,MIME_BOUND_DD_DIGITS,RCVD_IN_DSBL,RCVD_IN_NJABL_DUL,RCVD_IN_SORBS_DUL,SPF_HELO_PASS,URIBL_OB_SURBL,URIBL_WS_SURBL,X_MESSAGE_INFO
[...]

Scores are:

DNS_FROM_RFC_ABUSE   0.374
DNS_FROM_RFC_POST    1.376
MIME_BOUND_DD_DIGITS 4.230
RCVD_IN_DSBL         2.765
RCVD_IN_NJABL_DUL    1.655
RCVD_IN_SORBS_DUL    0.137
SPF_HELO_PASS        -0.001
URIBL_OB_SURBL       1.996
URIBL_WS_SURBL       0.539
X_MESSAGE_INFO       4.187

A quick debug addition shows where points get added per rule:

>> DNS_FROM_RFC_ABUSE = body
>> DNS_FROM_RFC_ABUSE = head
>> DNS_FROM_RFC_POST = body
>> DNS_FROM_RFC_POST = head
>> MIME_BOUND_DD_DIGITS = head
>> RCVD_IN_DSBL = body
>> RCVD_IN_DSBL = head
>> RCVD_IN_NJABL_DUL = body
>> RCVD_IN_NJABL_DUL = head
>> RCVD_IN_SORBS_DUL = body
>> RCVD_IN_SORBS_DUL = head
>> URIBL_OB_SURBL = head
>> URIBL_WS_SURBL = head
>> X_MESSAGE_INFO = head

It's a little confusing, but if a rule isn't considered "head" by the
above rules, its points are added to body, and vice versa, so some rules
get added to both.  Since most of the above are "RBL" rules, they're
considered neither "head" nor "body", and are therefore added to both.
In this case, it just so happens that everything adds to "head" and only
the "RBL" ones get added to "body".
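That accumulation can be checked against the scores above (a hand-rolled illustration, not SpamAssassin's code; the head/RBL categories are inferred from the per-rule debug lines, and SPF_HELO_PASS is left out because it doesn't appear there):

```python
# Each rule: (score, is_head, is_rbl), inferred from the debug output.
rules = {
    "DNS_FROM_RFC_ABUSE":   (0.374, False, True),
    "DNS_FROM_RFC_POST":    (1.376, False, True),
    "MIME_BOUND_DD_DIGITS": (4.230, True,  False),
    "RCVD_IN_DSBL":         (2.765, False, True),
    "RCVD_IN_NJABL_DUL":    (1.655, False, True),
    "RCVD_IN_SORBS_DUL":    (0.137, False, True),
    "URIBL_OB_SURBL":       (1.996, True,  False),
    "URIBL_WS_SURBL":       (0.539, True,  False),
    "X_MESSAGE_INFO":       (4.187, True,  False),
}

head_points = body_points = 0.0
for name, (score, is_head, is_rbl) in rules.items():
    # A rule not considered "body" adds to head, and vice versa,
    # so RBL rules (neither head nor body) add to both.
    if is_head or is_rbl:
        head_points += score
    if not is_head or is_rbl:
        body_points += score

print(round(head_points, 3))  # 17.259, matching the debug output
print(round(body_points, 3))  # 6.307
```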

-- 
Randomly Generated Tagline:
"To have a right to do a thing is not at all the same as to be right in
 doing it."                      - G.K. Chesterton

Re: autolearning score requirements

Posted by Bill Landry <bi...@pointshare.com>.
----- Original Message ----- 
From: "Theo Van Dinter" <fe...@kluge.net>

> On Sun, Aug 29, 2004 at 04:32:03PM +0200, Kai Schaetzl wrote:
> > change in how it works now? I have the feeling that it (3.0 RC1 and
> > RC2) shows less "99" than with 2.63, but that's only a feeling from a
> > few days of a test run. But still the BAYES_99 is quite accurate when
> > it shows up. I know I can change that score, but I'm curious why it
> > is so low now.
>
> You can read the wiki for why scores get generated the way they do.
> As for BAYES_00 and BAYES_99, they're by far my most hit rules for
> ham/spam respectively:
>
>    1    BAYES_00                         6953   93.35%
>    1    BAYES_99                        58233   87.46%
>
> I don't have stats for 2.6x, but there's nothing wrong with those
> percentages IMO.

Theo, I am most interested in Kai's question about auto-learning:
=====
"I see that the header and body requirement of three score points each
leads to many messages with high scores not getting autolearned. The
reason is that URIDNSBL isn't counted in (isn't that a body hit?) and any
custom rules (my own, SARE etc.)"
=====

Is this correct, that custom rule scores and URIDNSBL scores are ignored by
bayes auto-learning?  If so, what's the rationale behind this?  Is this true
for both SA 2.6x and SA 3.0?

Bill


Re: autolearning score requirements

Posted by Kai Schaetzl <ma...@conactive.com>.
Theo Van Dinter wrote on Sun, 29 Aug 2004 13:49:11 -0400:

> I don't know what the exit0.us one is.   The page I was referring to is:
> 
> http://wiki.apache.org/spamassassin/HowScoresAreAssigned

Ah, thanks, explains the low Bayes score.

The wiki says: "FrequentlyAskedQuestions: General questions about 
SpamAssassin". I followed the link under "SpamAssassin", assuming it 
would lead to the FAQ, but actually it's the first link that leads there.
 
> 50 is different in 3.0, so you can't do a one to one comparison.
> For instance, in 2.6 if a probability couldn't be calculated, there'd
> just be no BAYES_* hit, in 3.0 it'll hit BAYES_50.
>

I forgot about that. Nevertheless, I saw very few hits other than 00 and 99 
with 2.63; I see many more now with 3.0. But only somewhat over a thousand 
messages have gone through this new install during the last few days, and 
many of them were whitelisted anyway. I'll wait and see :-) Thanks for the 
explanations!


Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org




Re: autolearning score requirements

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sun, Aug 29, 2004 at 07:32:30PM +0200, Kai Schaetzl wrote:
> The wiki at apache.org or the one at exit0.us? I suppose the first and 

I don't know what the exit0.us one is.   The page I was referring to is:

http://wiki.apache.org/spamassassin/HowScoresAreAssigned

> actually at 100%, but 99 is far off (it marks only about 50% of the spam) 
> and I often see 50 (actually more often than 99) which usually I had never 
> seen. I use a db which got upgraded from 2 to 3. It looks to me that SA 

50 is different in 3.0, so you can't do a one to one comparison.
For instance, in 2.6 if a probability couldn't be calculated, there'd
just be no BAYES_* hit, in 3.0 it'll hit BAYES_50.

-- 
Randomly Generated Tagline:
"But you have to allow a little for the desire to evangelize when you
 think you have good news."         - Larry Wall

Re: autolearning score requirements

Posted by Kai Schaetzl <ma...@conactive.com>.
Theo Van Dinter wrote on Sun, 29 Aug 2004 11:31:40 -0400:

> You can read the wiki for why scores get generated the way they do.

The wiki at apache.org or the one at exit0.us? I suppose the first and 
checked it but didn't find it. Ah, wait, DevelopmentStuff/4. Stuff about 
scoring. Is it that what you refer to? I'm not sure if it answers my 
question, though.

> As for BAYES_00 and BAYES_99, they're by far my most hit rules for
> ham/spam respectively:
> 
>    1    BAYES_00                         6953   93.35%
>    1    BAYES_99                        58233   87.46%
>

That looks very much like my 2.63 results: 00 and 99 were indeed the most 
successful rules, hitting on almost every message and being almost 100% 
accurate. Now with 3.0 I still have 00 scoring quite often (correctly), 
actually at 100%, but 99 is far off (it marks only about 50% of the spam), 
and I often see 50 (actually more often than 99), which I had hardly ever 
seen before. I use a db which got upgraded from 2 to 3. It looks to me that 
SA 3.0 assigns BAYES_99 somewhat differently from SA 2.6x, so the "old" 
database doesn't provide the same basis for assigning 99's as it did for 
2.6x. That should correct itself over time, but it nonetheless made me 
wonder.
Maybe it's not caused by SA at all but by MailScanner? I just ran a 
message that scored BAYES_50 through SA again, and it scores BAYES_99 now. 
It's unlikely that the database changed so much within 40 minutes (and no 
spam messages arrived in that time, so no spam tokens were learned), isn't it?


Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org




Re: autolearning score requirements

Posted by Theo Van Dinter <fe...@kluge.net>.
On Sun, Aug 29, 2004 at 04:32:03PM +0200, Kai Schaetzl wrote:
> change in how it works now? I have the feeling that it (3.0 RC1 and RC2) 
> shows less "99" than with 2.63, but that's only a feeling from a few days 
> of a test run. But still the BAYES_99 is quite accurate when it shows up. 
> I know I can change that score, but I'm curious why it is so low now.

You can read the wiki for why scores get generated the way they do.
As for BAYES_00 and BAYES_99, they're by far my most hit rules for
ham/spam respectively:

   1    BAYES_00                         6953   93.35%
   1    BAYES_99                        58233   87.46%

I don't have stats for 2.6x, but there's nothing wrong with those
percentages IMO.

-- 
Randomly Generated Tagline:
*** The previous line contains the naughty word "$&".\n
 if /(ibm|apple|awk)/;      # :-)
              -- Larry Wall in the perl man page