Posted to users@spamassassin.apache.org by gr...@amnesiak.com on 2005/03/11 18:00:03 UTC

Bayes Autolearn Threshold - different scoring?

Hello all,
Let me start out by saying I've been searching for a couple of days on the
web on this subject but to no avail, so I would appreciate any help.

I have been using SA for more than a year and right now I'm running 3.0.1
on linux (bayes corpus size: nspam = 19482, nham = 3249). My filter
behaves very well, I only get about one false positive a month and 2-3
false negatives (averaging about 100 spams a day,
http://www.amnesiak.com/spam/ if you're curious). I'm invoking SA through
procmailrc with | spamassassin -p /home/greg/.spamassassin/user_prefs .

My problem is this: I'm using squirrelmail, and to keep an eye on false
negatives (I define those as real mails that get shuttled to spam, just to
keep things clear) I have a 'spam' folder. As anyone that uses sqmail
knows, it gets very slow when any folder contains more than a few hundred
messages. But, since my filter is trained very well, I'd like to send
autolearned spams to /mail/Trash (ultimately to /dev/null) so I don't have
to deal with those. I figured just setting bayes_auto_learn_threshold_spam
6 would work great. It really does not do much of anything. I've decreased
it to 3, and to 1, but it really doesn't make a difference. I found these
relevant lines in a debug:

debug: running full-text regexp tests; score so far=4.648
debug: auto-learn: currently using scoreset 3, recomputing score based on
scoreset 1.
debug: auto-learn: message score: 4.648, computed score for autolearn: 3.987
debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
learned-points=1.886
debug: auto-learn? no: scored as spam but too few body points (0 < 3)
debug: is spam? score=4.648 required=1

What, exactly, is going on here? The head points I can explain (this is a
spam I saved that had already come to me) but the body points - I don't
understand. It also wasn't clear to me until this debug that the autolearn
had its own scoring system.

Any help or clarification would be great!

Thanks,
-Greg

Re: Bayes Autolearn Threshold - different scoring?

Posted by Greg Daly <gr...@amnesiak.com>.

Kris, thanks for your help and insight. From what I can see, the settings
are in PerMsgStatus.pm, line 308/309 (my version of course).

    my $required_body_points = 3;
    my $required_head_points = 3;

I'll try changing those around, and update my status to this list in a while.

Again, thanks!
-g

> greg@amnesiak.com wrote:
>> I'm sure that's the problem. Here's a different sample spam, minus
>> the bayes score (which isn't counted on the autolearn body tests,
>> correct?)
>
> Correct.  But keep in mind that the autolearn process actually uses
> different scores.
>
>>  2.2 RCVD_HELO_IP_MISMATCH  Received: HELO and IP do not match, but
>> should
>
> From scoreset 3 (2.178);  autolearn will use set 1 (score: 0.618)
>
>>  3.0 DATE_IN_FUTURE_12_24   Date: is 12 to 24 hours after Received:
>> date
>
> Set 1 score is 2.329.
>
>>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for
>> HELO
>
> Set 1 score is 1.531.
>
>>  2.7 FORGED_YAHOO_RCVD      'From' yahoo.com does not match
>> 'Received' headers
>
> Set 1 score is 2.174.
>
> All together, that's well over the minimum 3 points from headers...  but
> no body score.
>
>> No body hits there... So basically, I'm getting what I want from the
>> headers, and from what bayes already knows. How do I tweak the
>> thresholds that the autolearner uses, for example, either setting the
>> body threshold to 0 or eliminating that check entirely?
>
> Hack the code.  There's no option I've heard of, and nothing noted in
> the man page IIRC to allow that.
>
>> I realize this might produce
>> unwanted results, so I'd probably give it a week or so initial
>> experiment.
>
> I don't know how the current setup was decided on, but I'd imagine that
> other methods have been tried - for general use, the 3+3 minimum in the
> distributed SA is probably ideal.  For some specific mail streams
> (yours, perhaps?)  this may not be optimal and may need to be tweaked.
>
> -kgd
> --
> Get your mouse off of there!  You don't know where that email has been!
>

Re: Bayes Autolearn Threshold - different scoring?

Posted by Kris Deugau <kd...@vianet.ca>.

greg@amnesiak.com wrote:
> I'm sure that's the problem. Here's a different sample spam, minus
> the bayes score (which isn't counted on the autolearn body tests,
> correct?)

Correct.  But keep in mind that the autolearn process actually uses
different scores.

>  2.2 RCVD_HELO_IP_MISMATCH  Received: HELO and IP do not match, but
> should

From scoreset 3 (2.178);  autolearn will use set 1 (score: 0.618)

>  3.0 DATE_IN_FUTURE_12_24   Date: is 12 to 24 hours after Received:
> date

Set 1 score is 2.329.

>  1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for
> HELO

Set 1 score is 1.531.

>  2.7 FORGED_YAHOO_RCVD      'From' yahoo.com does not match
> 'Received' headers

Set 1 score is 2.174.

All together, that's well over the minimum 3 points from headers...  but
no body score.
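
To make the arithmetic explicit, here is a small Python sketch that adds up
the set 1 scores quoted above and compares the totals against the 3-point
minimums. The variable names are made up for illustration, and treating all
four rules as header tests follows the description in this message rather
than a check of the rule definitions.

```python
# Set 1 scores for the four rules that hit, as quoted in this message.
set1_header_scores = {
    "RCVD_HELO_IP_MISMATCH": 0.618,
    "DATE_IN_FUTURE_12_24": 2.329,
    "RCVD_NUMERIC_HELO": 1.531,
    "FORGED_YAHOO_RCVD": 2.174,
}

REQUIRED_POINTS = 3  # the 3+3 minimum discussed in this thread

# Round to three decimals to avoid floating-point noise in the sum.
head_points = round(sum(set1_header_scores.values()), 3)
body_points = 0  # no body rules hit on this message

print(head_points)                     # 6.652
print(head_points >= REQUIRED_POINTS)  # True
print(body_points >= REQUIRED_POINTS)  # False
```

So the header side passes comfortably and the message is held back only by
the body side.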

> No body hits there... So basically, I'm getting what I want from the
> headers, and from what bayes already knows. How do I tweak the
> thresholds that the autolearner uses, for example, either setting the
> body threshold to 0 or eliminating that check entirely?

Hack the code.  There's no option I've heard of, and nothing noted in
the man page IIRC to allow that.

> I realize this might produce
> unwanted results, so I'd probably give it a week or so initial
> experiment.

I don't know how the current setup was decided on, but I'd imagine that
other methods have been tried - for general use, the 3+3 minimum in the
distributed SA is probably ideal.  For some specific mail streams
(yours, perhaps?)  this may not be optimal and may need to be tweaked.

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!

Re: Bayes Autolearn Threshold - different scoring?

Posted by gr...@amnesiak.com.

> As your only email access?
pretty much, yes.

> <g>  Try several thousand, as a number of customers have reported to
> me...

oh, I've been there - I'm just trying to avoid going there again. :)

> Mmm.  Dangerous - I've seen FPs get autolearned as spam once or twice.
> :(

I realize that. With my system tuned to my mail the way it is now, my spam
threshold is set to 1, and I have not seen an FP score >= 3.0 in several
months. So I know there's a risk, but on my mail it looks like a small one.

> What I do on my accounts is set up a "big-spam" folder, and rely on the
> X-Spam-Level header to move mail there.  Anything scoring 15 or higher
> gets 15 or more stars in X-Spam-Level, and I have this:
>
> :0:
> * ^X-Spam-Level:.\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
> /home/kdeugau/mail/bigspam
>
> before the check that files spam in my "main" spam folder.
>
> With the well-tuned 2.64+SURBL systems I have, ~80% of the spam usually
> ends up in the "big-spam" folder.

If I did that with a threshold of 3.0 on my system I would have had 84% of
the total 'spams' I've gotten in the last week end up in the big-spam
folder, with no FPs.

> [snip]
>> debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
>> learned-points=1.886
>> debug: auto-learn? no: scored as spam but too few body points (0 < 3)
>
> These two entries are the critical ones;  note the body-points and
> head-points.  To be autolearned as spam, a message must hit tests worth
> a total of 3 points or more on header tests, and a total of 3 points or
> more on body tests.

I'm sure that's the problem. Here's a different sample spam, minus the
bayes score (which isn't counted on the autolearn body tests, correct?)
 2.2 RCVD_HELO_IP_MISMATCH  Received: HELO and IP do not match, but should
 3.0 DATE_IN_FUTURE_12_24   Date: is 12 to 24 hours after Received: date
 1.2 RCVD_NUMERIC_HELO      Received: contains an IP address used for HELO
 2.7 FORGED_YAHOO_RCVD      'From' yahoo.com does not match 'Received'
headers

No body hits there... So basically, I'm getting what I want from the
headers, and from what bayes already knows. How do I tweak the thresholds
that the autolearner uses, for example, either setting the body threshold
to 0 or eliminating that check entirely? I realize this might produce
unwanted results, so I'd probably give it a week or so initial experiment.

> I notice you're still using the default autolearn-as-ham setting;  this
> is dangerous as very low-scoring spam can get autolearned incorrectly.
> I've dropped it to -0.01 on my systems to prevent this.

That's a good tip, I'll implement that.

Thanks!

Re: Bayes Autolearn Threshold - different scoring?

Posted by Kris Deugau <kd...@vianet.ca>.

greg@amnesiak.com wrote:
> My problem is this: I'm using squirrelmail,

As your only email access?

> and to keep an eye on false negatives (I define those as real mails
> that get shuttled to spam, just to keep things clear) I have a 'spam'
> folder. As anyone that uses sqmail knows, it gets very slow when any
> folder contains more than a few hundred messages.

<g>  Try several thousand, as a number of customers have reported to
me...

Actually, it's only spewed out error messages in a very few cases.

> But, since my
> filter is trained very well, I'd like to send autolearned spams to
> /mail/Trash (ultimately to /dev/null) so I don't have to deal with
> those.

Mmm.  Dangerous - I've seen FPs get autolearned as spam once or twice. 
:(

What I do on my accounts is set up a "big-spam" folder, and rely on the
X-Spam-Level header to move mail there.  Anything scoring 15 or higher
gets 15 or more stars in X-Spam-Level, and I have this:

:0:
* ^X-Spam-Level:.\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
/home/kdeugau/mail/bigspam

before the check that files spam in my "main" spam folder.

With the well-tuned 2.64+SURBL systems I have, ~80% of the spam usually
ends up in the "big-spam" folder.
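
For anyone who wants to try the same test outside procmail, the condition in
that recipe can be approximated in Python as below. This is only a sketch:
the function name and the min_stars parameter are made up for illustration,
and unlike procmail the match here is case-sensitive.

```python
import re


def is_big_spam(raw_headers, min_stars=15):
    """Return True if an X-Spam-Level header carries at least min_stars stars.

    Mirrors the recipe's pattern: header name, one arbitrary character
    (normally the space after the colon), then a run of literal asterisks.
    """
    pattern = re.compile(r"^X-Spam-Level:.\*{%d}" % min_stars, re.MULTILINE)
    return bool(pattern.search(raw_headers))


sample = "Subject: hello\nX-Spam-Level: " + "*" * 17 + "\n"
print(is_big_spam(sample))  # True
```

A message with 14 stars or fewer falls through to the ordinary spam check.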

> I figured just setting bayes_auto_learn_threshold_spam 6 would
> work great. It really does not do much of anything. I've decreased
> it to 3, and to 1, but it really doesn't make a difference. I found
> these relevant lines in a debug:

[snip]
> debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
> learned-points=1.886
> debug: auto-learn? no: scored as spam but too few body points (0 < 3)

These two entries are the critical ones;  note the body-points and
head-points.  To be autolearned as spam, a message must hit tests worth
a total of 3 points or more on header tests, and a total of 3 points or
more on body tests.
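
As a rough model of that rule, here is a simplified Python sketch. It is not
SpamAssassin's actual code (that is Perl, in PerMsgStatus.pm): the function
and argument names are made up, the order of the checks is inferred from the
debug output quoted above, and the real logic may apply further conditions
that are not shown here.

```python
def autolearn_decision(score, body_points, head_points,
                       ham_threshold, spam_threshold,
                       required_body_points=3, required_head_points=3):
    """Return (decision, reason) for one message.

    score is the score recomputed with a non-Bayes score set.
    decision is "ham", "spam", or "no" (do not autolearn).
    """
    if score < ham_threshold:
        return ("ham", "score below ham threshold")
    if score >= spam_threshold:
        # Assumption: body points are checked before head points, as the
        # debug line complains about body points although head points are
        # also too low.
        if body_points < required_body_points:
            return ("no", "scored as spam but too few body points "
                          f"({body_points} < {required_body_points})")
        if head_points < required_head_points:
            return ("no", "scored as spam but too few head points "
                          f"({head_points} < {required_head_points})")
        return ("spam", "score and point minimums met")
    return ("no", "score between ham and spam thresholds")


# Values taken from the debug output quoted above.
print(autolearn_decision(3.987, body_points=0, head_points=-2.82,
                         ham_threshold=0.1, spam_threshold=1))
```

With those values the sketch declines to learn for the same reason the debug
line gives, which is why lowering bayes_auto_learn_threshold_spam alone
changes nothing.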

I notice you're still using the default autolearn-as-ham setting;  this
is dangerous as very low-scoring spam can get autolearned incorrectly.
I've dropped it to -0.01 on my systems to prevent this.

> What, exactly, is going on here? The head points I can explain (this
> is a spam I saved that had already come to me) but the body points -
> I don't understand. It also wasn't clear to me until this debug that
> the autolearn had its own scoring system.

Not entirely;  to decide whether to autolearn a message one of the
"no-Bayes" score sets is used to calculate the scores, depending on
whether you've got network tests disabled or not.
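
A sketch of that selection in Python, assuming the usual numbering of the
four score sets (0 = no Bayes, no network; 1 = no Bayes, network; 2 = Bayes,
no network; 3 = Bayes and network). The constant and function names are made
up for illustration; the only thing taken from the debug output is that
scoreset 3 is recomputed as scoreset 1.

```python
# Assumption: bit 0 of the score set number means "network tests enabled"
# and bit 1 means "Bayes enabled".
NET_BIT = 1
BAYES_BIT = 2


def autolearn_scoreset(current_scoreset):
    """Return the matching non-Bayes score set by clearing the Bayes bit."""
    return current_scoreset & ~BAYES_BIT


for current in range(4):
    print(current, "->", autolearn_scoreset(current))
```

Under that numbering, a Bayes-plus-network setup (set 3) is rescored with
set 1, which matches the "recomputing score based on scoreset 1" debug line.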

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!