Posted to users@spamassassin.apache.org by gr...@amnesiak.com on 2005/03/11 18:00:03 UTC
Bayes Autolearn Threshold - different scoring?
Hello all,
Let me start out by saying I've been searching for a couple of days on the
web on this subject but to no avail, so I would appreciate any help.
I have been using SA for more than a year and right now I'm running 3.0.1
on linux (bayes corpus size: nspam = 19482, nham = 3249). My filter
behaves very well, I only get about one false positive a month and 2-3
false negatives (averaging about 100 spams a day,
http://www.amnesiak.com/spam/ if you're curious). I'm invoking SA through
procmailrc with | spamassassin -p /home/greg/.spamassassin/user_prefs .
My problem is this: I'm using squirrelmail, and to keep an eye on false
negatives (I define those as real mails that get shuttled to spam, just to
keep things clear) I have a 'spam' folder. As anyone that uses sqmail
knows, it gets very slow when any folder contains more than a few hundred
messages. But, since my filter is trained very well, I'd like to send
autolearned spams to /mail/Trash (ultimately to /dev/null) so I don't have
to deal with those. I figured just setting bayes_auto_learn_threshold_spam
6 would work great. It really does not do much of anything. I've decreased
it to 3, and to 1, but it really doesn't make a difference. I found these
relevant lines in a debug:
debug: running full-text regexp tests; score so far=4.648
debug: auto-learn: currently using scoreset 3, recomputing score based on
scoreset 1.
debug: auto-learn: message score: 4.648, computed score for autolearn: 3.987
debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
learned-points=1.886
debug: auto-learn? no: scored as spam but too few body points (0 < 3)
debug: is spam? score=4.648 required=1
What, exactly, is going on here? The head points I can explain (this is a
spam I saved that had already come to me) but the body points - I don't
understand. It also wasn't clear to me until this debug that the autolearn
had its own scoring system.
Any help or clarification would be great!
Thanks,
-Greg
Re: Bayes Autolearn Threshold - different scoring?
Posted by Greg Daly <gr...@amnesiak.com>.
Kris, thanks for your help and insight. From what I can see, the settings
are in PerMsgStatus.pm, line 308/309 (my version of course).
my $required_body_points = 3;
my $required_head_points = 3;
I'll try changing those around, and update my status to this list in a while.
Again, thanks!
-g
> greg@amnesiak.com wrote:
>> I'm sure that's the problem. Here's a different sample spam, minus
>> the bayes score (which isn't counted on the autolearn body tests,
>> correct?)
>
> Correct. But keep in mind that the autolearn process actually uses
> different scores.
>
>> 2.2 RCVD_HELO_IP_MISMATCH Received: HELO and IP do not match, but
>> should
>
> From scoreset 3 (2.178); autolearn will use set 1 (score: 0.618)
>
>> 3.0 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received:
>> date
>
> Set 1 score is 2.329.
>
>> 1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
>> HELO
>
> Set 1 score is 1.531.
>
>> 2.7 FORGED_YAHOO_RCVD 'From' yahoo.com does not match
>> 'Received' headers
>
> Set 1 score is 2.174.
>
> All together, that's well over the minimum 3 points from headers... but
> no body score.
>
>> No body hits there... So basically, I'm getting what I want from the
>> headers, and from what bayes already knows. How do I tweak the
>> thresholds that the autolearner uses, for example, either setting the
>> body threshold to 0 or eliminating that check entirely?
>
> Hack the code. There's no option I've heard of, and nothing noted in
> the man page IIRC to allow that.
>
>> I realize this might produce
>> unwanted results, so I'd probably give it a week or so initial
>> experiment.
>
> I don't know how the current setup was decided on, but I'd imagine that
> other methods have been tried - for general use, the 3+3 minimum in the
> distributed SA is probably ideal. For some specific mail streams
> (yours, perhaps?) this may not be optimal and may need to be tweaked.
>
> -kgd
> --
> Get your mouse off of there! You don't know where that email has been!
>
Re: Bayes Autolearn Threshold - different scoring?
Posted by Kris Deugau <kd...@vianet.ca>.
greg@amnesiak.com wrote:
> I'm sure that's the problem. Here's a different sample spam, minus
> the bayes score (which isn't counted on the autolearn body tests,
> correct?)
Correct. But keep in mind that the autolearn process actually uses
different scores.
> 2.2 RCVD_HELO_IP_MISMATCH Received: HELO and IP do not match, but
> should
From scoreset 3 (2.178); autolearn will use set 1 (score: 0.618)
> 3.0 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received:
> date
Set 1 score is 2.329.
> 1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
> HELO
Set 1 score is 1.531.
> 2.7 FORGED_YAHOO_RCVD 'From' yahoo.com does not match
> 'Received' headers
Set 1 score is 2.174.
All together, that's well over the minimum 3 points from headers... but
no body score.
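
To make the arithmetic explicit, here is a small Python sketch that adds up the set 1 scores quoted above. The rule names and numbers come from this thread; treating all four as header tests follows the reading given here, and the variable names are only illustrative.

```python
# Set 1 (no-Bayes, network tests on) scores quoted in this thread.
# All four rules are treated as header tests, per the discussion above.
set1_header_scores = {
    "RCVD_HELO_IP_MISMATCH": 0.618,
    "DATE_IN_FUTURE_12_24": 2.329,
    "RCVD_NUMERIC_HELO": 1.531,
    "FORGED_YAHOO_RCVD": 2.174,
}

head_points = sum(set1_header_scores.values())
body_points = 0.0  # no body rules hit on this message

print(f"head-points={head_points:.3f}, body-points={body_points:.3f}")
# prints: head-points=6.652, body-points=0.000
```

So the header side clears the 3-point minimum comfortably, and the message is held back only by the missing body points.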
> No body hits there... So basically, I'm getting what I want from the
> headers, and from what bayes already knows. How do I tweak the
> thresholds that the autolearner uses, for example, either setting the
> body threshold to 0 or eliminating that check entirely?
Hack the code. There's no option I've heard of, and nothing noted in
the man page IIRC to allow that.
> I realize this might produce
> unwanted results, so I'd probably give it a week or so initial
> experiment.
I don't know how the current setup was decided on, but I'd imagine that
other methods have been tried - for general use, the 3+3 minimum in the
distributed SA is probably ideal. For some specific mail streams
(yours, perhaps?) this may not be optimal and may need to be tweaked.
-kgd
--
Get your mouse off of there! You don't know where that email has been!
Re: Bayes Autolearn Threshold - different scoring?
Posted by gr...@amnesiak.com.
> As your only email access?
pretty much, yes.
> <g> Try several thousand, as a number of customers have reported to
> me...
oh, I've been there - I'm just trying to avoid going there again. :)
> Mmm. Dangerous - I've seen FPs get autolearned as spam once or twice.
> :(
I realize that. With my system on my spam the way it is now, my spam
threshold is set to one. I have not seen a FP >=3.0 in several months. So,
I know there's a risk.
> What I do on my accounts is set up a "big-spam" folder, and rely on the
> X-Spam-Level header to move mail there. Anything scoring 15 or higher
> gets 15 or more stars in X-Spam-Level, and I have this:
>
> :0:
> * ^X-Spam-Level:.\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
> /home/kdeugau/mail/bigspam
>
> before the check that files spam in my "main" spam folder.
>
> With the well-tuned 2.64+SURBL systems I have, ~80% of the spam usually
> ends up in the "big-spam" folder.
If I did that with a threshold of 3.0 on my system I would have had 84% of
the total 'spams' I've gotten in the last week end up in the big-spam
folder, with no FPs.
> [snip]
>> debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
>> learned-points=1.886
>> debug: auto-learn? no: scored as spam but too few body points (0 < 3)
>
> These two entries are the critical ones; note the body-points and
> head-points. To be autolearned as spam, a message must hit tests worth
> a total of 3 points or more on header tests, and a total of 3 points or
> more on body tests.
I'm sure that's the problem. Here's a different sample spam, minus the
bayes score (which isn't counted on the autolearn body tests, correct?)
2.2 RCVD_HELO_IP_MISMATCH Received: HELO and IP do not match, but should
3.0 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received: date
1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for HELO
2.7 FORGED_YAHOO_RCVD 'From' yahoo.com does not match 'Received'
headers
No body hits there... So basically, I'm getting what I want from the
headers, and from what bayes already knows. How do I tweak the thresholds
that the autolearner uses, for example, either setting the body threshold
to 0 or eliminating that check entirely? I realize this might produce
unwanted results, so I'd probably give it a week or so initial experiment.
> I notice you're still using the default autolearn-as-ham setting; this
> is dangerous as very low-scoring spam can get autolearned incorrectly.
> I've dropped it to -0.01 on my systems to prevent this.
That's a good tip, I'll implement that.
Thanks!
Re: Bayes Autolearn Threshold - different scoring?
Posted by Kris Deugau <kd...@vianet.ca>.
greg@amnesiak.com wrote:
> My problem is this: I'm using squirrelmail,
As your only email access?
> and to keep an eye on false negatives (I define those as real mails
> that get shuttled to spam, just to keep things clear) I have a 'spam'
> folder. As anyone that uses sqmail knows, it gets very slow when any
> folder contains more than a few hundred messages.
<g> Try several thousand, as a number of customers have reported to
me...
Actually, it's only spewed out error messages in a very few cases.
> But, since my
> filter is trained very well, I'd like to send autolearned spams to
> /mail/Trash (ultimately to /dev/null) so I don't have to deal with
> those.
Mmm. Dangerous - I've seen FPs get autolearned as spam once or twice.
:(
What I do on my accounts is set up a "big-spam" folder, and rely on the
X-Spam-Level header to move mail there. Anything scoring 15 or higher
gets 15 or more stars in X-Spam-Level, and I have this:
:0:
* ^X-Spam-Level:.\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
/home/kdeugau/mail/bigspam
before the check that files spam in my "main" spam folder.
With the well-tuned 2.64+SURBL systems I have, ~80% of the spam usually
ends up in the "big-spam" folder.
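
For anyone who wants to try the same star-counting idea outside procmail, the following Python sketch mirrors the recipe above. It is only an illustration: the function name `pick_folder` and the folder names "spam" and "inbox" are made up for the example, and the 15-star cutoff is simply the one used in the recipe.

```python
import re

BIG_SPAM_STARS = 15  # same cutoff as the procmail recipe above

# Like the recipe, allow any single character after the colon, then
# capture the run of stars. procmail matches case-insensitively by
# default, so this pattern does as well.
_LEVEL_RE = re.compile(r"^X-Spam-Level:.(\*+)", re.MULTILINE | re.IGNORECASE)


def pick_folder(headers: str, is_spam: bool) -> str:
    """Return a (hypothetical) folder name for a message.

    headers: the raw header block of the message as one string.
    is_spam: whether the message was already flagged as spam.
    """
    match = _LEVEL_RE.search(headers)
    stars = len(match.group(1)) if match else 0
    if stars >= BIG_SPAM_STARS:
        return "bigspam"
    return "spam" if is_spam else "inbox"


sample = "Subject: hello\nX-Spam-Level: " + "*" * 17 + "\n"
print(pick_folder(sample, is_spam=True))  # prints: bigspam
```

As in the procmail setup, the big-spam check runs first, so only lower-scoring spam falls through to the main spam folder.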
> I figured just setting bayes_auto_learn_threshold_spam 6 would
> work great. It really does not do much of anything. I've decreased
> it to 3, and to 1, but it really doesn't make a difference. I found
> these relevant lines in a debug:
[snip]
> debug: auto-learn? ham=0.1, spam=1, body-points=0, head-points=-2.82,
> learned-points=1.886
> debug: auto-learn? no: scored as spam but too few body points (0 < 3)
These two entries are the critical ones; note the body-points and
head-points. To be autolearned as spam, a message must hit tests worth
a total of 3 points or more on header tests, and a total of 3 points or
more on body tests.
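
Roughly, the check described above can be sketched in Python like this. This is not the code from PerMsgStatus.pm; the function name, parameter names, and exact comparison operators are assumptions made for illustration, and only the thresholds and the 3+3 minimum come from the debug output.

```python
def autolearn_decision(score, body_points, head_points,
                       ham_threshold, spam_threshold,
                       required_body_points=3.0,
                       required_head_points=3.0):
    """Sketch of the autolearn decision (hypothetical helper).

    score is the score recomputed for autolearn (the no-Bayes score
    set), not the final message score.
    """
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        # Scoring high enough is not sufficient: the points must come
        # from both body and header tests.
        if body_points < required_body_points:
            return "no: too few body points"
        if head_points < required_head_points:
            return "no: too few head points"
        return "spam"
    return "no: score between thresholds"


# Values taken from the debug lines quoted above.
print(autolearn_decision(3.987, 0, -2.82, ham_threshold=0.1, spam_threshold=1))
# prints: no: too few body points

# Dropping the body requirement to 0 would not help for this message,
# because its head points are also below the minimum.
print(autolearn_decision(3.987, 0, -2.82, ham_threshold=0.1, spam_threshold=1,
                         required_body_points=0))
# prints: no: too few head points
```

This is why lowering bayes_auto_learn_threshold_spam alone changes nothing here: the message already passes the threshold and is rejected by the per-category minimums.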
I notice you're still using the default autolearn-as-ham setting; this
is dangerous as very low-scoring spam can get autolearned incorrectly.
I've dropped it to -0.01 on my systems to prevent this.
> What, exactly, is going on here? The head points I can explain (this
> is a spam I saved that had already come to me) but the body points -
> I don't understand. It also wasn't clear to me until this debug that
> the autolearn had its own scoring system.
Not entirely; to decide whether to autolearn a message one of the
"no-Bayes" score sets is used to calculate the scores, depending on
whether you've got network tests disabled or not.
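
As a rough illustration of that score set choice, here is a Python sketch. It assumes the usual numbering of the four score sets (0 = no Bayes, no network tests; 1 = no Bayes, network tests; 2 = Bayes, no network tests; 3 = Bayes and network tests), which is consistent with the "scoreset 3 ... scoreset 1" debug line earlier in the thread. The function name is hypothetical and this is not the actual SpamAssassin code.

```python
BAYES_BIT = 2  # assumed: sets 2 and 3 are the Bayes-enabled score sets


def autolearn_scoreset(current_scoreset: int) -> int:
    """Return the matching no-Bayes score set (hypothetical helper).

    Clearing the Bayes bit keeps the network-test setting unchanged,
    so set 3 maps to set 1 and set 2 maps to set 0.
    """
    return current_scoreset & ~BAYES_BIT


for scoreset in range(4):
    print(scoreset, "->", autolearn_scoreset(scoreset))
```

With network tests and Bayes both enabled (set 3), the autolearn score is therefore computed from set 1, which is what the debug output shows.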
-kgd
--
Get your mouse off of there! You don't know where that email has been!