Posted to users@spamassassin.apache.org by Jim Gottlieb <ji...@nccom.com> on 2005/08/17 22:22:27 UTC

learning or not?

Since I switched to using a site-wide Bayes database, I've been seeing
some strange behavior.

When I run sa-learn against a message, it always says it learned from
it, even if I've run it through sa-learn previously.  This doesn't seem
right, as I thought SA keeps track of what messages it's seen.

% sa-learn --mbox --spam mail/spam
Learned from 2 message(s) (2 message(s) examined).
% sa-learn --mbox --spam mail/spam
Learned from 2 message(s) (2 message(s) examined).

And each time the nspam count goes up by 2:

% sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0      49931          0  non-token data: nspam
0.000          0      36433          0  non-token data: nham
0.000          0     127533          0  non-token data: ntokens
0.000          0 1123771562          0  non-token data: oldest atime
0.000          0 1124309893          0  non-token data: newest atime
0.000          0 1124309898          0  non-token data: last journal sync atime
0.000          0 1124307009          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0     176152          0  non-token data: last expire reduction co

What might this indicate is wrong?

Running Spamassassin 3.0.4 under Solaris 8 (SPARC).  SA is running as spamd.

Thanks.

Re: learning or not?

Posted by Matt Kettler <mk...@evi-inc.com>.

Jim Gottlieb wrote:
> On 2005-08-17 at 16:52, Matt Kettler (mkettler@evi-inc.com) wrote:
> 
> 
>>In your version (3.0.4), yes.. Although there's some thoughts about removing
>>this in 3.1.0, or at least having an option to disable it, as it's proving less
>>useful than one might think.
> 
> 
> I still wonder if there's a problem with my install, though, as it
> should be working on mine.

Possibly.. check your bayes_seen file.. does it exist? does it grow?

> 
> Another symptom is that spam never autolearns even when it's above the
> threshold:
> 
> X-Spam-Status: Yes, score=12.8 required=3.9
> 	tests=BAYES_99,HTML_FONT_BIG,                      
>         HTML_FONT_LOW_CONTRAST,HTML_MESSAGE,MIME_HTML_ONLY,URIBL_JP_SURBL,                     
>         URIBL_OB_SURBL autolearn=no version=3.0.4                  
> 
> Or is it 12 _above_ your required score?
> 

SA's autolearner is not as simple as "is the total score over 12".

That message did not meet at least 2 of the criteria required to autolearn as
spam. It didn't meet the threshold, and it didn't have enough header points.

First off, SA does not use the total message score for deciding on autolearning,
so don't even think about the number present in X-Spam-Status when considering
autolearning. SA uses the score the message would have gotten if bayes was
disabled. This is to prevent bayes from having a "self feedback" effect.

So you have to re-calculate the score without BAYES_99's score in there, and also
look up every rule's score in a different scoreset.

score HTML_FONT_BIG 0 0.232 0 0.142
score HTML_FONT_LOW_CONTRAST 1.011 0.955 1.017 0.788
score HTML_MESSAGE 0.001
score MIME_HTML_ONLY 1.204 1.158 1.156 0.177
score URIBL_JP_SURBL 0 1.539 0 2.462
score URIBL_OB_SURBL 0 1.996 0 3.213

Adding up the set 1 scores (the second number on each line, if more than one is
present), I get 5.881.. WAY below the threshold of 12.
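
For anyone who wants to check that sum, here is a small Python sketch. The
score values are copied from the "score" lines above; the function names are
made up for illustration, and the assumption is that a rule with only one
score listed uses that score in every scoreset.

```python
# Scores copied from the "score" lines quoted in this message.
SCORES = {
    "HTML_FONT_BIG": [0, 0.232, 0, 0.142],
    "HTML_FONT_LOW_CONTRAST": [1.011, 0.955, 1.017, 0.788],
    "HTML_MESSAGE": [0.001],
    "MIME_HTML_ONLY": [1.204, 1.158, 1.156, 0.177],
    "URIBL_JP_SURBL": [0, 1.539, 0, 2.462],
    "URIBL_OB_SURBL": [0, 1.996, 0, 3.213],
}


def score_in_set(values, scoreset):
    """Pick one rule's score for a scoreset (sets are numbered from 0)."""
    # ASSUMPTION: a single listed score applies to every scoreset.
    return values[scoreset] if len(values) > 1 else values[0]


def total_for_set(scores, scoreset):
    """Sum every rule's score for the given scoreset, rounded to 3 places."""
    return round(sum(score_in_set(v, scoreset) for v in scores.values()), 3)


# BAYES_99 is deliberately left out of SCORES, as described above.
print(total_for_set(SCORES, 1))  # prints 5.881
```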

On top of that, to learn as spam SA requires at least 3.0 points of score from
header rules AND 3.0 points of score for body rules. This is a hard-coded limit,
so it's not easily changed.

You have 0 points worth of header rules matching here, so it's disqualified for
that reason as well. No matter how low you adjusted the threshold, that message
would not autolearn as spam.
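
The criteria described above can be sketched roughly like this in Python. This
is only an illustration of the rules as stated in this message, not SA's actual
code: the function name and parameters are made up, and treating all 5.881
non-Bayes points as body points is an assumption.

```python
def autolearn_spam_blockers(score_without_bayes, header_points, body_points,
                            threshold=12.0, min_header=3.0, min_body=3.0):
    """List the criteria (as described in this thread) that a message fails."""
    blockers = []
    if score_without_bayes < threshold:
        blockers.append("score below threshold")
    if header_points < min_header:
        blockers.append("not enough header points")
    if body_points < min_body:
        blockers.append("not enough body points")
    return blockers


# Figures from the message discussed above: 5.881 points without Bayes and
# 0 header points. ASSUMPTION: all 5.881 points count as body points.
print(autolearn_spam_blockers(5.881, header_points=0.0, body_points=5.881))
# prints ['score below threshold', 'not enough header points']
```

Passing a lower threshold (say threshold=1.0) still leaves the header-points
blocker in the list, which is the point made above.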

Re: learning or not?

Posted by Jim Gottlieb <ji...@nccom.com>.

On 2005-08-17 at 16:52, Matt Kettler (mkettler@evi-inc.com) wrote:

> In your version (3.0.4), yes.. Although there's some thoughts about removing
> this in 3.1.0, or at least having an option to disable it, as it's proving less
> useful than one might think.

I still wonder if there's a problem with my install, though, as it
should be working on mine.

Another symptom is that spam never autolearns even when it's above the
threshold:

X-Spam-Status: Yes, score=12.8 required=3.9
	tests=BAYES_99,HTML_FONT_BIG,                      
        HTML_FONT_LOW_CONTRAST,HTML_MESSAGE,MIME_HTML_ONLY,URIBL_JP_SURBL,                     
        URIBL_OB_SURBL autolearn=no version=3.0.4                  

Or is it 12 _above_ your required score?

Re: learning or not?

Posted by Matt Kettler <mk...@evi-inc.com>.

Jim Gottlieb wrote:
> Since I switched to using a site-wide Bayes database, I've been seeing
> some strange behavior.
> 
> When I run sa-learn against a message, it always says it learned from
> it, even if I've run it through sa-learn previously.  This doesn't seem
> right, as I thought SA keeps track of what messages it's seen.

In your version (3.0.4), yes.. Although there's some thoughts about removing
this in 3.1.0, or at least having an option to disable it, as it's proving less
useful than one might think.

In particular, a tricky spammer can send multiple different spams with different
content, but trick SA into thinking they are the same message and cause it to
only learn one of them.

(No, I'm not going to detail how, I don't want to make life easier on the spammers.)

> 
> % sa-learn --mbox --spam mail/spam
> Learned from 2 message(s) (2 message(s) examined).
> % sa-learn --mbox --spam mail/spam
> Learned from 2 message(s) (2 message(s) examined).
> 
> And each time the nspam count goes up by 2:

Do you mean each time it goes up by 2, for a total of 4?

> 
> % sa-learn --dump magic

That output isn't very useful, since it's only one dump, not a before/after dump.
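
A before/after comparison could be done with something like the Python sketch
below. It assumes the column layout shown in the dump in the original post (the
value is the third column). The function names are made up, the BEFORE lines
are taken from that dump, and the AFTER lines are hypothetical, invented only
to show what the comparison reports.

```python
def parse_magic(dump_text):
    """Map each 'non-token data' name to its value (third column)."""
    counters = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, name = line.split("non-token data:", 1)
        counters[name.strip()] = int(fields.split()[2])
    return counters


def changed_counters(before, after):
    """Return {name: difference} for counters whose value changed."""
    return {name: after[name] - before[name]
            for name in before
            if name in after and after[name] != before[name]}


# Two lines from the dump in the original post.
BEFORE = """\
0.000          0      49931          0  non-token data: nspam
0.000          0      36433          0  non-token data: nham
"""

# HYPOTHETICAL second dump, made up purely for illustration.
AFTER = """\
0.000          0      49933          0  non-token data: nspam
0.000          0      36433          0  non-token data: nham
"""

print(changed_counters(parse_magic(BEFORE), parse_magic(AFTER)))
# prints {'nspam': 2}
```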

> 
> What might this indicate is wrong?

Do the messages in mail/spam have Message-ID headers? SA normally tracks
duplicate learning by Message-ID, but if none exists it will "create" a fake one.

This could cause what you're seeing.
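
One quick way to check for missing Message-ID headers is a short Python script
using the standard mailbox module. This is just a sketch: the function name is
made up, and it assumes the file is in ordinary mbox format.

```python
import mailbox


def count_missing_message_ids(mbox_path):
    """Return (total, missing) message counts for an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    total = 0
    missing = 0
    try:
        for msg in box:
            total += 1
            # Count absent or empty Message-ID headers as missing.
            if not (msg.get("Message-ID") or "").strip():
                missing += 1
    finally:
        box.close()
    return total, missing
```

Calling count_missing_message_ids("mail/spam") would then report how many of
the messages in that mailbox lack the header.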