Posted to users@spamassassin.apache.org by David Bürgin <db...@gluet.ch> on 2023/01/17 12:33:03 UTC

Auto-learning ‘considered harmful’: not so much when rejecting spam?

I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.

But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.

Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?
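
For reference, the settings this question turns on can be sketched as
follows. This is a minimal sketch, assuming the stock
AutoLearnThreshold plugin is loaded and the values are the shipped
defaults; the file this would live in (for example local.cf) is an
assumption, not something stated in the thread.

```
# Bayes auto-learning switch (enabled in a stock configuration)
bayes_auto_learn 1

# Messages scoring above this are auto-learned as spam (default 12.0)
bayes_auto_learn_threshold_spam 12.0

# Messages scoring below this are auto-learned as ham (default 0.1)
bayes_auto_learn_threshold_nonspam 0.1
```

Note that the score used for the auto-learn decision is not the same as
the final score a milter acts on: it is computed without the Bayes rules
themselves and without rules flagged noautolearn, and learning as spam
additionally requires at least 3 points from header rules and 3 points
from body rules. A message rejected at score > 10.0 is therefore not
necessarily one that was auto-learned.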

Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On 1/17/2023 7:33 AM, David Bürgin wrote:
>>I have heard it said many times on this list that auto-learning is
>>discouraged, so I decided to finally look into disabling it.
>>
>>But then I realised that I do have a use for auto-learning: In my setup,
>>I use a milter to reject certain spam (score > 10.0). Now, if I turn off
>>auto-learning I lose something. Because, as far as I understand the
>>default spam auto-learning threshold of 12.0 causes incoming
>>high-probability spam to be learned as spam, even though the message is
>>then rejected and not available locally later.
>>
>>Is my understanding correct? Auto-learning of spam can be useful if spam
>>is rejected during the SMTP conversation but after it has been seen
>>– and learned – by SpamAssassin?

On 17.01.23 09:37, Kevin A. McGrail wrote:
>The problem with auto learning I've seen is that it slowly spirals 
>miscategorization errors.

mostly because there are no really useful indicators of hamminess, and if 
there are, spammers use them to spread their junk.

after long manual training being occasionally spoiled by autolearn, 
I have manually set all rules that have negative scores to noautolearn:

tflags  RCVD_IN_RP_CERTIFIED            noautolearn net nice
tflags  RCVD_IN_VALIDITY_CERTIFIED      noautolearn net nice
tflags  RCVD_IN_RP_SAFE                 noautolearn net nice
tflags  RCVD_IN_VALIDITY_SAFE           noautolearn net nice
tflags  RCVD_IN_DNSWL_LOW       noautolearn net nice
tflags  RCVD_IN_DNSWL_MED       noautolearn net nice
tflags  RCVD_IN_DNSWL_HI        noautolearn net nice
tflags  RCVD_IN_MSPIKE_H2       noautolearn net nice
tflags  RCVD_IN_MSPIKE_H3       noautolearn net nice
tflags  RCVD_IN_MSPIKE_H4       noautolearn net nice
tflags  RCVD_IN_MSPIKE_H5       noautolearn net nice
tflags  RCVD_IN_MSPIKE_WL       noautolearn net nice
tflags  RCVD_IN_IADB_DK         noautolearn net nice
tflags  RCVD_IN_IADB_DOPTIN     noautolearn net nice
tflags  RCVD_IN_IADB_LISTED     noautolearn net nice
tflags  RCVD_IN_IADB_MI_CPR_MAT noautolearn net nice
tflags  RCVD_IN_IADB_ML_DOPTIN  noautolearn net nice
tflags  RCVD_IN_IADB_OPTIN      noautolearn net nice
tflags  RCVD_IN_IADB_OPTIN_GT50 noautolearn net nice
tflags  RCVD_IN_IADB_RDNS       noautolearn net nice
tflags  RCVD_IN_IADB_SENDERID   noautolearn net nice
tflags  RCVD_IN_IADB_SPF        noautolearn net nice
tflags  RCVD_IN_IADB_UT_CPR_MAT noautolearn net nice
tflags  RCVD_IN_IADB_VOUCHED    noautolearn net nice
tflags  DKIMWL_WL_HIGH          noautolearn net nice
tflags  DKIMWL_WL_MEDHI         noautolearn net nice
tflags  DKIMWL_WL_MED           noautolearn net nice
tflags  DKIM_VALID              noautolearn net nice
tflags  DKIM_VALID_EF           noautolearn net nice

still needs some training.

and, in some places, you may need to dump the database and re-train from 
scratch.
That's why manual training is great and why you need to keep some spam, but 
mostly ham.
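
A coarser variation on the tflags list above, as a sketch only: keep
auto-learning of high-scoring spam, but push the ham threshold so low
that ham is effectively never auto-learned and has to come from manual
training. The value -100.0 is an arbitrary illustrative choice, not
something taken from this thread, and it assumes no message ever scores
that low.

```
# Keep auto-learning enabled so high-scoring spam is still learned
bayes_auto_learn 1
bayes_auto_learn_threshold_spam 12.0

# Illustrative value (assumption): low enough that no message is
# expected to fall below it, so nothing is auto-learned as ham
bayes_auto_learn_threshold_nonspam -100.0
```

This trades the per-rule control of noautolearn for a single switch; the
Bayes database then still needs hand-picked ham to stay balanced.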


> The technical term is that it reinforces a 
>bias.  A Bayes database should be carefully maintained.  It's not very 
>much of a fire and forget technology.
>
>And, for example, letting users control it becomes a question of 
>"what is spam?"  For example, users might get a very legit mail BUT 
>they are tired of seeing it in their inbox.  So they want to train it 
>as spam.  If you have per-user implementations, that can be good BUT 
>you need a few hundred samples of good email and bad email to activate 
>Bayes.
>
>In short, I don't have a good solution for training Bayes that isn't a 
>lot of work but auto-learning is usually a bad solution.
>
>One case where it might be good is if you had a system set up that you 
>fed already-classified emails to.  It would then use that good feed 
>for auto-learning and give you a way of learning without using the 
>command line.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
It's now safe to throw off your computer.

Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?

Posted by "Kevin A. McGrail" <km...@apache.org>.
On 1/17/2023 7:33 AM, David Bürgin wrote:
> I have heard it said many times on this list that auto-learning is
> discouraged, so I decided to finally look into disabling it.
>
> But then I realised that I do have a use for auto-learning: In my setup,
> I use a milter to reject certain spam (score > 10.0). Now, if I turn off
> auto-learning I lose something. Because, as far as I understand the
> default spam auto-learning threshold of 12.0 causes incoming
> high-probability spam to be learned as spam, even though the message is
> then rejected and not available locally later.
>
> Is my understanding correct? Auto-learning of spam can be useful if spam
> is rejected during the SMTP conversation but after it has been seen
> – and learned – by SpamAssassin?

The problem with auto learning I've seen is that it slowly spirals 
miscategorization errors.  The technical term is that it reinforces a 
bias.  A Bayes database should be carefully maintained.  It's not very 
much of a fire and forget technology.

And, for example, letting users control it becomes a question of "what 
is spam?"  For example, users might get a very legit mail BUT they are 
tired of seeing it in their inbox.  So they want to train it as spam.  
If you have per-user implementations, that can be good BUT you need a 
few hundred samples of good email and bad email to activate Bayes.

In short, I don't have a good solution for training Bayes that isn't a 
lot of work but auto-learning is usually a bad solution.

One case where it might be good is if you had a system set up that you 
fed already-classified emails to.  It would then use that good feed for 
auto-learning and give you a way of learning without using the 
command line.

Regards,
KAM

-- 
Kevin A. McGrail
KMcGrail@Apache.org

Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171