
Posted to users@spamassassin.apache.org by Magnus Holmgren <ho...@lysator.liu.se> on 2005/07/28 09:06:20 UTC

Forcing autolearn

Is there a way to say to SA that "if this custom rule of mine triggers,
then the mail *is* spam and you have to autolearn it as such.", except
looking for MY_CUSTOM_RULE in X-Spam-Status afterwards and feeding the
mail to sa-learn if found?

In other words, is there a way to bypass the 3 points minimum for header
and body? (Why isn't that limit configurable, by the way?)

Thanks.
-- 
Magnus Holmgren
holmgren@lysator.liu.se

Re: Forcing autolearn

Posted by Matt Kettler <mk...@evi-inc.com>.

Magnus Holmgren wrote:
> Matt Kettler wrote:
> 
>>Yes, bayes poison should be trained without worry. However, bayes poison is not
>>the topic of discussion here. We are talking about mis-learning, something
>>COMPLETELY different.
> 
> 
> Kai Schaetzl talked about "prevent[ing] you from accidentally poisoning
> your Bayes db", so I assumed we were talking about bayes poisoning.

This gets into a subtlety in how the words are used.

"bayes poison" is a noun, and unless otherwise stated means text inserted by
spammers in an attempt to make the message look like nonspam.

"bayes poisoning" is a verb, and refers to the act of successfully imbalancing a
bayes database. Most bayes poison, despite its name, is very ineffective at
causing this, although it does try.

There are two sources of bayes poisoning, but only one is commonly called "bayes
poison", and it's mostly harmless. Mislearning is just called mislearning,
although it's a much more potent cause of bayes poisoning.

So this thread is about bayes poisoning, but it's about poisoning as a result of
mislearning, not poisoning as a result of bayes poison.

(Isn't the clarity of human language wonderful?)

>>Are you sure your conclusions are based on accurate perceptions of the consequences?
>>
> 
> I am sure that there will be no mislearning, even if I lower the body
> and/or header limits a bit, and that any mislearning that nevertheless
> may occur can be rectified by relearning. The mail volumes are low.

In that case, it doesn't matter a whole lot if the message gets autolearned or
not. You'll manually train it correctly one way or the other.

> 
> What I still would like to know is the theory behind the hardcoded 3
> point limits. Can someone give as an example a message that would be
> mislearnt if it weren't for those limits?
> 

A whole lot of messages posted to this list will score very high in body points
due to spam quotations, but 0 or near 0 header points. Many messages sent by
persons on shady ISPs will score high in the header points but low in the body.

Ideally SA wants to take the approach of autolearning as spam when it's quite
sure of itself. Anything that doesn't get autolearned can always be manually
trained to compensate.
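
To make the interaction of the limits concrete, here is a minimal Python sketch of the spam-side autolearn test as it is described in this thread: at least 12 points in total, of which at least 3 must come from header rules and at least 3 from body rules. The names (Scores, would_autolearn_as_spam) and the sample scores are hypothetical, and the sketch simply treats the total as header plus body. It is an illustration of the idea, not SA's actual code in PerMsgStatus.pm.

```python
from dataclasses import dataclass

# Values as discussed in the thread (assumed defaults, not read from SA).
SPAM_THRESHOLD = 12.0
MIN_HEADER_POINTS = 3.0
MIN_BODY_POINTS = 3.0


@dataclass
class Scores:
    """Hypothetical per-message split of rule points."""
    header: float
    body: float

    @property
    def total(self) -> float:
        return self.header + self.body


def would_autolearn_as_spam(s: Scores) -> bool:
    """Require a high total AND evidence from both header and body."""
    return (
        s.total >= SPAM_THRESHOLD
        and s.header >= MIN_HEADER_POINTS
        and s.body >= MIN_BODY_POINTS
    )


# Hypothetical examples matching the two cases described above.
list_post = Scores(header=0.0, body=14.0)   # ham quoting spam on a mailing list
shady_isp = Scores(header=13.0, body=0.5)   # ham sent through a badly listed ISP
real_spam = Scores(header=5.0, body=9.0)    # spammy in both header and body

for name, s in [("list_post", list_post), ("shady_isp", shady_isp), ("real_spam", real_spam)]:
    print(name, would_autolearn_as_spam(s))
```

Under these assumptions only real_spam qualifies. The other two clear the 12-point total but are held back by the per-section minimums, which is the kind of mislearning the limits are meant to prevent.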

Basically you have two paths:

 1) aggressively autolearn and try to fix any errors with manual training. Risk
of FPs is slightly increased in the interim.

 2) autolearn normally and manually train anything that didn't autolearn. Risk
of FNs is slightly increased in the interim.

So, it boils down to which is worse for you, FPs or FNs.

SA in general takes the standpoint that FPs are much worse than FNs. Thus, it
is natural for SA to be very conservative about spam learning, and liberal about
ham learning. Such a learning pattern fits SA's general design.

Re: Forcing autolearn

Posted by Magnus Holmgren <ho...@lysator.liu.se>.

Matt Kettler wrote:
> 
> Yes, bayes poison should be trained without worry. However, bayes poison is not
> the topic of discussion here. We are talking about mis-learning, something
> COMPLETELY different.

Kai Schaetzl talked about "prevent[ing] you from accidentally poisoning
your Bayes db", so I assumed we were talking about bayes poisoning.

> Mis-learning a ham message as spam is always bad, and can have a minor or severe
> impact depending on the circumstances. There is no question that mis-learning
> should be avoided whenever possible.

I agree.

> Learning bayes poison as spam isn't a matter of "oh, it doesn't matter because
> it's in the random noise"; it's a matter of accurate training. You WANT SA to
> learn about common tokens that are used by both categories. This is important to
> SA's accuracy, as it's a fact of reality.

I agree.

> Mis-learning is not random noise, it doesn't reflect reality, and it is not the
> same thing as bayes poison. Not at ALL the same. It's just bad.
>
>>>In conclusion, I feel confident in letting SA learn from every message
>>>that I am certain that it can be certain is spam.
> 
> Are you sure your conclusions are based on accurate perceptions of the consequences?
> 
I am sure that there will be no mislearning, even if I lower the body
and/or header limits a bit, and that any mislearning that nevertheless
may occur can be rectified by relearning. The mail volumes are low.

What I still would like to know is the theory behind the hardcoded 3
point limits. Can someone give as an example a message that would be
mislearnt if it weren't for those limits?

-- 
Magnus Holmgren
holmgren@lysator.liu.se

Re: Forcing autolearn

Posted by Matt Kettler <mk...@evi-inc.com>.

Magnus Holmgren wrote:

>>>>
>>>>DISCLAIMER: I *really* think it's a bad idea to adjust this. But if you insist,
>>>>it is possible.
>>>>
>>>>I want there to still be some difficulty to intimidate you from changing this
>>>>without some consideration. (it shouldn't be hard to find the setting knowing
>>>>what file it's in, so this isn't much of a hurdle)
>
>>
>>
>> You can always hack the source, and yes, it was easy to find. :-)
>>
>> Now for the consideration part:
>>
>> First, we don't want to learn anything as spam that isn't. With a
>> default lower limit of 12 points that's very unlikely and as already
>> mentioned I haven't yet noticed a single false positive in my case.
>> Second, we don't want bayes poisoning, i.e. "hammy" words recorded as
>> "spammy". I guess the reasoning is that if the header scores lots of
>> points while the body scores low or even zero, then the body isn't
>> spammy enough and shouldn't be learnt from. Conversely, if the header is
>> clean then any (at least 9!) body points are probably just coincidence.
>> Right?
>>
>> Now, whether bayes poisoning really is an issue is debated. Someone
>> pointed out that the random words hidden by spammers in the message in
>> various ways aren't likely to resemble typical legit correspondence;
>> indeed they are just random noise that doesn't contribute in any
>> direction. In my case most real messages are in Swedish, meaning less
>> problem with those (but slightly more with English ones). Also, many
>> body points doesn't mean there is no bayes poison. Finally, when spam
>> slips through, the user would want to feed it to sa-learn regardless of
>> any bayes poison.

Yes, bayes poison should be trained without worry. However, bayes poison is not
the topic of discussion here. We are talking about mis-learning, something
COMPLETELY different.

Mis-learning a ham message as spam is always bad, and can have a minor or severe
impact depending on the circumstances. There is no question that mis-learning
should be avoided whenever possible.

Learning bayes poison as spam isn't a matter of "oh, it doesn't matter because
it's in the random noise"; it's a matter of accurate training. You WANT SA to
learn about common tokens that are used by both categories. This is important to
SA's accuracy, as it's a fact of reality.

Mis-learning is not random noise, it doesn't reflect reality, and it is not the
same thing as bayes poison. Not at ALL the same. It's just bad.
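
The difference can be shown with a toy calculation. The Python sketch below uses a deliberately simplified per-token estimate (the token's rate in spam divided by the sum of its rates in spam and ham). This is an assumption made for illustration and not the formula SA's Bayes implementation uses; the function name, the token and the counts are all hypothetical.

```python
def token_spam_probability(spam_hits: int, ham_hits: int,
                           n_spam: int, n_ham: int) -> float:
    """Toy estimate of how spammy a token looks, between 0.0 and 1.0.

    spam_hits / ham_hits: learned spam / ham messages containing the token.
    n_spam / n_ham: total learned spam / ham messages.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Baseline: "invoice" appears in 20 of 100 ham messages and in no spam.
baseline = token_spam_probability(0, 20, 100, 100)

# Bayes poison learned correctly: spammers really do put "invoice" in
# 20 of 100 spam messages, so the token becomes neutral. That is accurate.
poison_learned = token_spam_probability(20, 20, 100, 100)

# Mislearning: 10 ham messages containing "invoice" are learned as spam.
# The database now claims something about spam that is not true.
mislearned = token_spam_probability(10, 20, 110, 100)

print(baseline, poison_learned, round(mislearned, 2))
```

In this toy model, learning the poison moves the token to neutral because it genuinely occurs in both categories, while the mislearned ham pushes a purely hammy token toward spam without any real spam containing it.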

>>
>> In conclusion, I feel confident in letting SA learn from every message
>> that I am certain that it can be certain is spam.

Are you sure your conclusions are based on accurate perceptions of the consequences?

Re: Forcing autolearn

Posted by Magnus Holmgren <ho...@lysator.liu.se>.

Matt Kettler wrote:
>Magnus Holmgren wrote:
>>Kai Schaetzl wrote:
>>>Magnus Holmgren wrote on Thu, 28 Jul 2005 09:06:20 +0200:
>>>
>>>>In other words, is there a way to bypass the 3 points minimum for header 
>>>>and body? (Why isn't that limit configurable, by the way?)
>>>
>>>It's trying to prevent you from accidentally poisoning your Bayes db.
>>
>>That explains the limit but not its non-configurability, IMHO. Hey, why
>>can't I shoot myself in my foot if I really want to! There is always a
>>possibility to re-learn (provided you save the learnt-from messages).
>>
> It is reconfigurable. It's just harder than most SA options as you have to hack
> the source code to change it. :)
> 
> DISCLAIMER: I *really* think it's a bad idea to adjust this. But if you insist,
> it is possible.
> 
> I want there to still be some difficulty to intimidate you from changing this
> without some consideration. (it shouldn't be hard to find the setting knowing
> what file it's in, so this isn't much of a hurdle)

You can always hack the source, and yes, it was easy to find. :-)

Now for the consideration part:

First, we don't want to learn anything as spam that isn't. With a
default lower limit of 12 points that's very unlikely and as already
mentioned I haven't yet noticed a single false positive in my case.
Second, we don't want bayes poisoning, i.e. "hammy" words recorded as
"spammy". I guess the reasoning is that if the header scores lots of
points while the body scores low or even zero, then the body isn't
spammy enough and shouldn't be learnt from. Conversely, if the header is
clean then any (at least 9!) body points are probably just coincidence.
Right?

Now, whether bayes poisoning really is an issue is debated. Someone
pointed out that the random words hidden by spammers in the message in
various ways aren't likely to resemble typical legit correspondence;
indeed they are just random noise that doesn't contribute in any
direction. In my case most real messages are in Swedish, meaning less
problem with those (but slightly more with English ones). Also, many
body points doesn't mean there is no bayes poison. Finally, when spam
slips through, the user would want to feed it to sa-learn regardless of
any bayes poison.

In conclusion, I feel confident in letting SA learn from every message
that I am certain that it can be certain is spam.

-- 
Magnus Holmgren
holmgren@lysator.liu.se

Re: Forcing autolearn

Posted by Matt Kettler <mk...@evi-inc.com>.

Magnus Holmgren wrote:
> Kai Schaetzl wrote:
> 
>>Magnus Holmgren wrote on Thu, 28 Jul 2005 09:06:20 +0200:
>>
>>
>>
>>>In other words, is there a way to bypass the 3 points minimum for header 
>>>and body? (Why isn't that limit configurable, by the way?)
>>
>>
>>It's trying to prevent you from accidentally poisoning your Bayes db.
>>
>>Kai
>>
> 
> 
> That explains the limit but not its non-configurability, IMHO. Hey, why
> can't I shoot myself in my foot if I really want to! There is always a
> possibility to re-learn (provided you save the learnt-from messages).
> 

It is reconfigurable. It's just harder than most SA options as you have to hack
the source code to change it. :)

IMO, that's a good level of hurdle to jump over if you really insist on shooting
yourself in the foot by changing a setting that shouldn't be changed. It shows a
certain level of determination and effort, and that hopefully reflects some
prior thought.

DISCLAIMER: I *really* think it's a bad idea to adjust this. But if you insist,
it is possible.

Disclaimer aside, I will give you a hint: The setting is in PerMsgStatus.pm, but
I'll leave finding it as an exercise of determination.

I want there to still be some difficulty to intimidate you from changing this
without some consideration. (it shouldn't be hard to find the setting knowing
what file it's in, so this isn't much of a hurdle)

Re: Forcing autolearn

Posted by Magnus Holmgren <ho...@lysator.liu.se>.

Kai Schaetzl wrote:
> Magnus Holmgren wrote on Thu, 28 Jul 2005 09:06:20 +0200:
> 
> 
>>In other words, is there a way to bypass the 3 points minimum for header 
>>and body? (Why isn't that limit configurable, by the way?)
> 
> 
> It's trying to prevent you from accidentally poisoning your Bayes db.
> 
> Kai
> 

That explains the limit but not its non-configurability, IMHO. Hey, why
can't I shoot myself in the foot if I really want to? There is always a
possibility to re-learn (provided you save the learnt-from messages).

-- 
Magnus Holmgren
holmgren@lysator.liu.se

Re: Forcing autolearn

Posted by Kai Schaetzl <ma...@conactive.com>.

Magnus Holmgren wrote on Thu, 28 Jul 2005 09:06:20 +0200:

> In other words, is there a way to bypass the 3 points minimum for header 
> and body? (Why isn't that limit configurable, by the way?)

It's trying to prevent you from accidentally poisoning your Bayes db.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org

Re: Forcing autolearn

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Magnus,

Thursday, July 28, 2005, 12:06:20 AM, you wrote:

MH> Is there a way to say to SA that "if this custom rule of mine triggers,
MH> then the mail *is* spam and you have to autolearn it as such.", except
MH> looking for MY_CUSTOM_RULE in X-Spam-Status afterwards and feeding the
MH> mail to sa-learn if found?

No. That "afterwards" would be the method you would need to use.
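
A minimal sketch of that "afterwards" method in Python, assuming the messages have already been scanned and saved as individual files, that sa-learn is on the PATH, and that it runs as the user who owns the Bayes database. The function name rule_triggered is hypothetical; MY_CUSTOM_RULE is the rule name from the question.

```python
import email
import re
import subprocess
import sys

RULE_NAME = "MY_CUSTOM_RULE"  # the custom rule name used in the question


def rule_triggered(raw_message: bytes, rule: str = RULE_NAME) -> bool:
    """Return True if `rule` is listed in the tests= field of X-Spam-Status."""
    msg = email.message_from_bytes(raw_message)
    status = str(msg.get("X-Spam-Status", ""))
    # The tests= list may be folded across lines after a comma; rejoin it.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S*)", status)
    if match is None:
        return False
    # Compare whole rule names so MY_CUSTOM_RULE_2 does not match.
    return rule in match.group(1).split(",")


if __name__ == "__main__":
    # Usage (hypothetical script name): python learn_if_rule.py msg1 msg2 ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            if rule_triggered(handle.read()):
                # Train the message as spam, regardless of the autolearn limits.
                subprocess.run(["sa-learn", "--spam", path], check=True)
```

Since this bypasses the autolearn limits entirely, it only makes sense if the custom rule never fires on ham. Keeping the learnt-from messages around makes it possible to re-learn them if it ever does.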

MH> In other words, is there a way to bypass the 3 points minimum for header
MH> and body? (Why isn't that limit configurable, by the way?)

No.

Sorry.

But you do have the answer.

Bob Menschel