Posted to users@spamassassin.apache.org by Daniele Duca <du...@staff.spin.it> on 2018/07/25 17:49:04 UTC

Bayes overtraining

Hi,

I'm evaluating incorporating CRM114 in my current setup and I was 
reading the FAQs about training the filter here: 
http://crm114.sourceforge.net/src/FAQ.txt

What made me rethink my current strategy were the following lines:

...

If you train in only on an error, that's close to the minimal change
necessary to obtain correct behavior from the filter.

If you train in something that would have been classified correctly
anyway, you have now set up a prejudice (an inappropriately strong
reaction) to that particular text.

Now, that prejudice will make it _harder_ to re-learn correct behavior on
the next piece of text that isn't right.  Instead of just learning
the correct behavior, we first have to unlearn the prejudice, and
_then_ learn the correct behavior.
...
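The "prejudice" effect described in the FAQ can be sketched with a toy token counter (a hypothetical illustration of the general idea, not CRM114's actual classifier):

```python
from collections import Counter

class ToyBayes:
    """Minimal token-count classifier to illustrate training effects."""
    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(tokens)

    def spamminess(self, token):
        # fraction of this token's occurrences seen in spam
        s, h = self.spam[token], self.ham[token]
        return s / (s + h) if s + h else 0.5

bayes = ToyBayes()
bayes.train(["viagra", "deal"], is_spam=True)

# Repeatedly training a message that was already classified correctly
# builds a "prejudice": the counts for its tokens become lopsided...
for _ in range(100):
    bayes.train(["deal"], is_spam=True)

# ...so a single later correction barely moves the estimate.
bayes.train(["deal"], is_spam=False)
print(round(bayes.spamminess("deal"), 2))  # prints 0.99
```

With train-on-error only, "deal" would have been counted once per side, and the single ham correction would have mattered.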

In my current SA setup I use bayes_auto_learn along with some custom 
poison pills (autolearn_force on some rules), and I'm currently 
wondering if overtraining SA's bayes could lead to the same "prejudice" 
problem as CRM114.

I'm thinking that maybe it would be better to use 
"bayes_auto_learn_on_error 1"

What is your preferred strategy? Train everything you can or train only 
errors?

Daniele


Re: Bayes overtraining

Posted by David Jones <dj...@ena.com>.
On 07/25/2018 12:49 PM, Daniele Duca wrote:
> Hi,
> 
> I'm evaluating incorporating CRM114 in my current setup and I was 
> reading the FAQs about training the filter here: 
> http://crm114.sourceforge.net/src/FAQ.txt
> 
> What made me rethink my actual strategy were the following lines:
> 
> ...
> 
> If you train in only on an error, that's close to the minimal change
> necessary to obtain correct behavior from the filter.
> 
> If you train in something that would have been classified correctly
> anyway, you have now set up a prejudice (an inappropriately strong
> reaction) to that particular text.
> 
> Now, that prejudice will make it _harder_ to re-learn correct behavior on
> the next piece of text that isn't right.  Instead of just learning
> the correct behavior, we first have to unlearn the prejudice, and
> _then_ learn the correct behavior.
> ...
> 
> In my current SA setup I use bayes_auto_learn along with some custom 
> poison pills (autolearn_force on some rules) , and I'm currently 
> wondering if over training SA's bayes could lead to the same "prejudice" 
> problem as CRM114.
> 
> I'm thinking that maybe it would be better to use 
> "bayes_auto_learn_on_error 1"
> 
> What is your preferred strategy? Train everything you can or train only 
> errors?
> 
> Daniele
> 

I personally found in our customer mail flow that CRM114 and Bogofilter 
didn't help that much.  A well-trained Bayes DB with good meta rules, 
RBLs (Invaluement is a must), and MTA checks/blocks has worked out to be 
spot on for my mail flow.

-- 
David Jones

Re: Bayes overtraining

Posted by "Anne P. Mitchell, Esq." <am...@isipp.com>.

 
> 
>>> There are spams hitting negative scoring rules e.g.  MAILING_LIST_MULTI,
>>> RCVD_IN_RP_*, RCVD_IN_IADB_* and they are constantly trained as ham.

Just a reminder: if you ever receive spam that is tagged as RCVD_IN_IADB (or *any* flavour of IADB tag), *please* forward it to me personally and I will personally make sure that whoever is sending it is soundly whacked.

We do *not* have a sense of humour about anyone sending anything that is not 100% true opt-in (if not confirmed opt-in) - and we do *not* certify anyone who is doing anything less - and if we find that someone's practices have slipped and they are being sloppy with permission, we fire them.  Our definition of spam is the definition that Paul (Vixie) and I put forward years ago:

“An electronic message is “spam” IF: (1) the recipient’s personal identity and context are
irrelevant because the message is equally applicable to many other potential recipients;
AND (2) the recipient has not verifiably granted deliberate, explicit, and still-revocable
permission for it to be sent; AND (3) the transmission and reception of the message
appears to the recipient to give a disproportionate benefit to the sender.”

Anything less is grounds for immediate termination.

So, again, if you ever find anything that triggers an IADB rule that is not something for which you/your user affirmatively opted in, we want to know about it.

The buck stops right here:

Anne

Anne P. Mitchell, 
Attorney at Law
CEO/President, 
SuretyMail Email Reputation Certification and Inbox Delivery Assistance
GDPR & CCPA Compliance Consultant
GDPR & CCPA Compliance Certification
http://www.SuretyMail.com/
http://www.SuretyMail.eu/

Attorney at Law / Legislative Consultant
Author: Section 6 of the CAN-SPAM Act of 2003 (the Federal anti-spam law)
Author: The Email Deliverability Handbook
Legal Counsel: The CyberGreen Institute
Legal Counsel: The Earth Law Center
Member, California Bar Cyberspace Law Committee
Member, Colorado Cybersecurity Consortium
Member, Board of Directors, Asilomar Microcomputer Workshop
Former Chair, Asilomar Microcomputer Workshop
Ret. Professor of Law, Lincoln Law School of San Jose



Re: Bayes overtraining

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> >On 08/08/2018 15:04, Matus UHLAR - fantomas wrote:
>> >>...of last 40 mail in my spambox, 14 matches MAILING_LIST_MULTI
>> >>...of last 100 mail in spambox, 27 matches MAILING_LIST_MULTI
>>
>> On 09.08.18 08:54, Daniele Duca wrote:
>> >I practically zeroed MAILING_LIST_MULTI the day it came in the
>> >ruleset.

On 09.08.18 23:52, RW wrote:
>MAILING_LIST_MULTI has the default "nice" score of -1.0 rather than an
>explicit score. I'm wondering if this is deliberate.

I would guess so.
... and so I had to raise the score (-1 => -0.1) on another host.
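In local.cf that kind of override is a one-liner (score value taken from the message above):

```
# weaken the default -1.0 bonus for detected mailing-list mail
score MAILING_LIST_MULTI -0.1
```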

It seems more and more mailing lists are being abused (or deliberately used)
to spread spam.

>> >>but not possible to put:
>> >>
>> >>tflags BAYES_99 learn dothefuckingautolearn

>> >Personally I'll never trust BAYES_* with autolearn_force. I saw some
>> >FPs sometimes and I fear that autolearning would quickly lead to
>> >poisoning

>I would advise against using auto-training where it's possible to
>train manually. It's not just a matter of mistraining, autolearning may
>also bias the database in favour of types of spam that are easily
>caught, thereby diluting the frequencies of tokens needed to catch the
>difficult spam.

The same applies to ham, however.

>> with autolearn_force yes, it could apparently lead to poisoning.
>>
>> However, if "learn" only did its job (whatever it is) and only
>> "noautolearn" would ignore the score, it would be just enough.
>>
>> Currently, as docs say, "learn" in fact implicates "noautolearn".

>As does userconf.

So, both "learn" and "userconf" explicitly imply "noautolearn"? 
I wonder why we have them at all.

>> I just don't understand why. Simply use both flags and that's it.

>If you really must do this just create a new rule without tflags and
>then score it something like this:
>
>    3.0  3.0  0.001 0.001
>
>i.e so it's scored in the non-Bayes  score sets. You can just modify
>the scores and tflags of an original rule, but that's less flexible.

I have just listed all rules with negative scores and, surprise, I haven't
found any reliable rule with a negative score.
(MAILING_LIST_MULTI added manually as it doesn't have a score set explicitly.)

It seems that I will need to whitelist and use the hack you have proposed
above.

- unreliable rules
ALL_TRUSTED -1.000
ENCRYPTED_MESSAGE                     -1.000 -1.000 -1.000 -1.000
ENV_AND_HDR_SPF_MATCH -0.5
DKIM_VALID -0.1
DKIM_VALID_AU -0.1
DKIM_VALID_EF -0.1
HASHCASH_20 -0.5
HASHCASH_21 -0.7
HASHCASH_22 -1.0
HASHCASH_23 -2.0
HASHCASH_24 -3.0
HASHCASH_25 -4.0
HASHCASH_HIGH -5.0
MAILING_LIST_MULTI -1.000

- not used for autolearning
BAYES_00  0  0 -1.5   -1.9
BAYES_05  0  0 -0.3   -0.5

- not available everywhere
DCC_REPUT_00_12  0 -0.8   0 -0.4
DCC_REPUT_13_19  0 -0.1   0 -0.1

- DNS whitelists
RCVD_IN_DNSWL_HI 0 -5 0 -5
RCVD_IN_DNSWL_LOW 0 -0.7 0 -0.7
RCVD_IN_DNSWL_MED 0 -2.3 0 -2.3
RCVD_IN_IADB_DK 0 -0.223 0 -0.095 # n=0 n=1 n=2
RCVD_IN_IADB_DOPTIN 0 -4 0 -4
RCVD_IN_IADB_LISTED 0 -0.380 0 -0.001 # n=0 n=2
RCVD_IN_IADB_MI_CPR_MAT 0 -0.332 0 -0.000 # n=0 n=1 n=2
RCVD_IN_IADB_ML_DOPTIN 0 -6 0 -6
RCVD_IN_IADB_OPTIN 0 -2.057 0 -1.470 # n=0 n=1 n=2
RCVD_IN_IADB_OPTIN_GT50 0 -1.208 0 -0.007 # n=0 n=2
RCVD_IN_IADB_RDNS 0 -0.167 0 -0.235 # n=0 n=1 n=2
RCVD_IN_IADB_VOUCHED 0 -2.2 0 -2.2
RCVD_IN_RP_CERTIFIED 0.0 -3.0 0.0 -3.0
RCVD_IN_RP_SAFE 0.0 -2.0 0.0 -2.0
DKIMDOMAIN_IN_DWL 0 -3.5 0 -3.5

- local whitelists:
HEADER_HOST_IN_WHITELIST -100.0
SUBJECT_IN_WHITELIST -100
URI_HOST_IN_WHITELIST -100.0
USER_IN_ALL_SPAM_TO -100.000
USER_IN_DEF_DKIM_WL -7.500
USER_IN_DEF_SPF_WL -7.500
USER_IN_DEF_WHITELIST -15.000
USER_IN_DKIM_WHITELIST -100.000
USER_IN_MORE_SPAM_TO -20.000
USER_IN_SPF_WHITELIST -100.000
USER_IN_WHITELIST -100.000
USER_IN_WHITELIST_TO -6.000
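A list like the one above can be pulled out of the ruleset with a short pipeline. This sketch runs on an inline sample so it is self-contained; in practice, point the awk at the *.cf files under your sa-update directory (the path varies by distribution):

```shell
# demo input; substitute your real rules directory in practice
cat > /tmp/sample_rules.cf <<'EOF'
score MAILING_LIST_MULTI -1.0
score BAYES_99 3.5 3.5 3.8 3.8
score RCVD_IN_DNSWL_HI 0 -5 0 -5
EOF

# print the name of every rule whose score line has a negative value
awk '/^score/ { for (i = 3; i <= NF; i++) if ($i + 0 < 0) { print $2; next } }' \
    /tmp/sample_rules.cf
```

On the sample input this prints MAILING_LIST_MULTI and RCVD_IN_DNSWL_HI, skipping BAYES_99.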

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors

Re: Bayes overtraining

Posted by RW <rw...@googlemail.com>.
On Thu, 9 Aug 2018 13:35:21 +0200
Matus UHLAR - fantomas wrote:

> >On 08/08/2018 15:04, Matus UHLAR - fantomas wrote:  
> >>...of last 40 mail in my spambox, 14 matches MAILING_LIST_MULTI
> >>...of last 100 mail in spambox, 27 matches MAILING_LIST_MULTI  
> 
> On 09.08.18 08:54, Daniele Duca wrote:
> >I practically zeroed MAILING_LIST_MULTI the day it came in the
> >ruleset.  

MAILING_LIST_MULTI has the default "nice" score of -1.0 rather than an
explicit score. I'm wondering if this is deliberate.


> >>but not possible to put:
> >>
> >>tflags BAYES_99 learn dothefuckingautolearn  
> 
> >Wouldn't
> >
> >tflags BAYES_99 autolearn_force
> >
> >do what you want? Or did I misunderstood completely what you meant? 

I think you have probably misunderstood autolearn_force. All it does is
turn off the check that requires that at least 3 points come from both
body and header rules when autolearning as spam.
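So the typical use is on a high-confidence local rule, e.g. (hypothetical rule name and pattern):

```
# hypothetical poison-pill rule: autolearn_force only waives the
# "at least 3 points from header rules and 3 from body rules"
# requirement; the autolearn spam threshold must still be reached
header PP_KNOWN_BAD_FROM From =~ /\@known-bad\.example$/
score  PP_KNOWN_BAD_FROM 10.0
tflags PP_KNOWN_BAD_FROM autolearn_force
```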


> >Personally I'll never trust BAYES_* with autolearn_force. I saw some 
> >FPs sometimes and I fear that autolearning would quickly lead to 
> >poisoning  


I would advise against using auto-training where it's possible to
train manually. It's not just a matter of mistraining, autolearning may
also bias the database in favour of types of spam that are easily
caught, thereby diluting the frequencies of tokens needed to catch the
difficult spam. 



> with autolearn_force yes, it could apparently lead to poisoning.
>
> However, if "learn" only did its job (whatever it is) and only
> "noautolearn" would ignore the score, it would be just enough.
> 
> Currently, as docs say, "learn" in fact implicates "noautolearn". 


As does userconf.


> I just don't understand why. Simply use both flags and that's it.


If you really must do this just create a new rule without tflags and
then score it something like this:

    3.0  3.0  0.001 0.001 

i.e. so it's scored in the non-Bayes score sets. You can just modify
the scores and tflags of an original rule, but that's less flexible.
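In local.cf that workaround might look like this (BAYES_99_PTS is a hypothetical name; the meta rule simply mirrors BAYES_99 without inheriting its "learn" tflag):

```
# scoring-only shadow of BAYES_99: having no "learn" tflag, its
# points are counted when the autolearn discipline adds up scores
meta  BAYES_99_PTS BAYES_99
score BAYES_99_PTS 3.0 3.0 0.001 0.001
```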



Re: Bayes overtraining

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On 08/08/2018 15:04, Matus UHLAR - fantomas wrote:
>>...of last 40 mail in my spambox, 14 matches MAILING_LIST_MULTI
>>...of last 100 mail in spambox, 27 matches MAILING_LIST_MULTI

On 09.08.18 08:54, Daniele Duca wrote:
>I practically zeroed MAILING_LIST_MULTI the day it came in the ruleset.


>>I mean, since there's tflag "noautolearn" designed for this, the flag
>>"learn" should not be ignored.
>>
>>It's easy to put:
>>
>>tflags BAYES_99 learn noautolearn
>>
>>but not possible to put:
>>
>>tflags BAYES_99 learn dothefuckingautolearn

>Wouldn't
>
>tflags BAYES_99 autolearn_force
>
>do what you want? Or did I misunderstood completely what you meant? 
>Personally I'll never trust BAYES_* with autolearn_force. I saw some 
>FPs sometimes and I fear that autolearning would quickly lead to 
>poisoning

with autolearn_force yes, it could apparently lead to poisoning.

However, if "learn" only did its job (whatever that is) and only "noautolearn"
caused the score to be ignored, that would be enough.

Currently, as the docs say, "learn" in fact implies "noautolearn".
I just don't understand why. Simply use both flags and that's it.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.

Re: Bayes overtraining

Posted by Daniele Duca <du...@staff.spin.it>.
On 08/08/2018 15:04, Matus UHLAR - fantomas wrote:

>
>
> ...of last 40 mail in my spambox, 14 matches MAILING_LIST_MULTI
> ...of last 100 mail in spambox, 27 matches MAILING_LIST_MULTI
>
I practically zeroed MAILING_LIST_MULTI the day it came in the ruleset.
>
>
> I mean, since there's tflag "noautolearn" designed for this, the flag
> "learn" should not be ignored.
>
> It's easy to put:
>
> tflags BAYES_99 learn noautolearn
>
> but not possible to put:
>
> tflags BAYES_99 learn dothefuckingautolearn
>
>
>

Wouldn't

tflags BAYES_99 autolearn_force

do what you want? Or did I misunderstand completely what you meant? 
Personally I'll never trust BAYES_* with autolearn_force. I have seen some 
FPs and I fear that autolearning would quickly lead to poisoning.

Daniele

Re: Bayes overtraining

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>> >On Wed, 25 Jul 2018 19:49:04 +0200
>> >Daniele Duca wrote:
>> >> In my current SA setup I use bayes_auto_learn along with some
>> >> custom poison pills (autolearn_force on some rules) , and I'm
>> >> currently wondering if over training SA's bayes could lead to the
>> >> same "prejudice" problem as CRM114.
>> >>
>> >> I'm thinking that maybe it would be better to use
>> >> "bayes_auto_learn_on_error 1"
>>
>> On 26.07.18 15:48, RW wrote:
>> >On a busy server using auto-learning it's probably a good idea to set
>> >this just to increase the token retention, and reduce writes into the
>> >database.

>On Thu, 26 Jul 2018 17:36:19 +0200 Matus UHLAR - fantomas wrote:
>> well, I have a bit different experience.

On 26.07.18 21:25, RW wrote:
>I didn't say auto-training itself is a good idea.

I mean, if I set bayes_auto_learn_on_error 1, mails that confirm the BAYES
decision would never be trained, even if the decision was correct.

That could result in BAYES scores drifting in the wrong direction.

I believe that after I train BAYES enough, autolearn should be able to do
the rest of the work and collect further tokens, especially when BAYES_00 or
BAYES_99 is in effect.

Re-training a few mismatched mails once in a while should be better than
being pushed back from _00 and _99 because only mails pointing in the
opposite direction are trained.


>> There are spams hitting negative scoring rules e.g.  MAILING_LIST_MULTI,
>> RCVD_IN_RP_*, RCVD_IN_IADB_* and they are constantly trained as ham.

>You should be able to work around that by adding noautolearn to the
>tflags.

Well, since I tend to trust those rules less and less...

Especially because in the meantime I personally get many spams via mailing
lists I have never subscribed to and for which I have never seen a
subscription confirmation.

...of last 40 mail in my spambox, 14 matches MAILING_LIST_MULTI
...of last 100 mail in spambox, 27 matches MAILING_LIST_MULTI

>> I would like to prevent re-training when bayes disagrees with score
>> soming from other rules.

>I don't know what you mean by 'prevent re-training', but auto-learning
>is not supposed to happen if Bayes generates  1 point or more  in the
>opposite direction.

Either this is new to me, or I have already forgotten, but I have a
different feeling about this. I will try to remember and watch.

(I often watch what kind of mail was tagged autolearn=ham)

>> I quite wonder why "learn" tflag causes score being ignored.
>> Only the "noautolearn" flag should be used for this so at least
>> BAYES_99 and BAYES_00 could be takein into account when learning.

>It's to prevent  mistraining from running away in a vicious circle.

I mean, since there's a "noautolearn" tflag designed for this, the "learn"
flag should not be ignored.

It's easy to put:

tflags BAYES_99 learn noautolearn

but not possible to put:

tflags BAYES_99 learn dothefuckingautolearn



-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The early bird may get the worm, but the second mouse gets the cheese. 

Re: Bayes overtraining

Posted by RW <rw...@googlemail.com>.
On Thu, 26 Jul 2018 17:36:19 +0200
Matus UHLAR - fantomas wrote:

> >On Wed, 25 Jul 2018 19:49:04 +0200
> >Daniele Duca wrote:  
> >> In my current SA setup I use bayes_auto_learn along with some
> >> custom poison pills (autolearn_force on some rules) , and I'm
> >> currently wondering if over training SA's bayes could lead to the
> >> same "prejudice" problem as CRM114.
> >>
> >> I'm thinking that maybe it would be better to use
> >> "bayes_auto_learn_on_error 1"  
> 
> On 26.07.18 15:48, RW wrote:
> >On a busy server using auto-learning it's probably a good idea to set
> >this just to increase the token retention, and reduce writes into the
> >database.  
> 
> well, I have a bit different experience. 


I didn't say auto-training itself is a good idea.


> There are spams hitting
> negative scoring rules e.g.  MAILING_LIST_MULTI, RCVD_IN_RP_*,
> RCVD_IN_IADB_* and they are constantly trained as ham.


You should be able to work around that by adding noautolearn to the
tflags.
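For example (note that a tflags line in local.cf replaces the stock flags rather than appending to them, so repeat the original flags; check your ruleset for the current sets):

```
# keep these rules out of the autolearn points calculation
tflags MAILING_LIST_MULTI   nice noautolearn
tflags RCVD_IN_RP_CERTIFIED nice net noautolearn
```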


> I would like to prevent re-training when bayes disagrees with score
> soming from other rules.


I don't know what you mean by 'prevent re-training', but auto-learning
is not supposed to happen if Bayes generates 1 point or more in the
opposite direction.

 
> I quite wonder why "learn" tflag causes score being ignored.
> Only the "noautolearn" flag should be used for this so at least
> BAYES_99 and BAYES_00 could be takein into account when learning.


It's to prevent mistraining from running away in a vicious circle.


Re: Bayes overtraining

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Wed, 25 Jul 2018 19:49:04 +0200
>Daniele Duca wrote:
>> In my current SA setup I use bayes_auto_learn along with some custom
>> poison pills (autolearn_force on some rules) , and I'm currently
>> wondering if over training SA's bayes could lead to the same
>> "prejudice" problem as CRM114.
>>
>> I'm thinking that maybe it would be better to use
>> "bayes_auto_learn_on_error 1"

On 26.07.18 15:48, RW wrote:
>On a busy server using auto-learning it's probably a good idea to set
>this just to increase the token retention, and reduce writes into the
>database.

Well, I have a bit of a different experience. There are spams hitting
negative-scoring rules, e.g. MAILING_LIST_MULTI, RCVD_IN_RP_*,
RCVD_IN_IADB_*, and they are constantly trained as ham.

I would like to prevent re-training when bayes disagrees with the score
coming from other rules.

I quite wonder why the "learn" tflag causes the score to be ignored.
Only the "noautolearn" flag should be used for this, so at least BAYES_99
and BAYES_00 could be taken into account when learning.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
- Have you got anything without Spam in it?
- Well, there's Spam egg sausage and Spam, that's not got much Spam in it.

Re: Bayes overtraining

Posted by RW <rw...@googlemail.com>.
On Wed, 25 Jul 2018 19:49:04 +0200
Daniele Duca wrote:

> In my current SA setup I use bayes_auto_learn along with some custom 
> poison pills (autolearn_force on some rules) , and I'm currently 
> wondering if over training SA's bayes could lead to the same
> "prejudice" problem as CRM114.
> 
> I'm thinking that maybe it would be better to use 
> "bayes_auto_learn_on_error 1"

On a busy server using auto-learning, it's probably a good idea to set
this just to increase token retention and reduce writes to the database.