Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2012/08/16 00:18:40 UTC

[Bug 6828] New: Adjust default autolearn ham threshold to reduce mistraining under default configuration

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6828

          Priority: P2
            Bug ID: 6828
          Assignee: dev@spamassassin.apache.org
           Summary: Adjust default autolearn ham threshold to reduce
                    mistraining under default configuration
          Severity: normal
    Classification: Unclassified
                OS: All
          Reporter: jhardin@impsec.org
          Hardware: PC
            Status: NEW
           Version: SVN Trunk (Latest Devel Version)
         Component: Learner
           Product: Spamassassin

Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3

If autolearning is enabled by default (which is a good idea) then the system
should have very conservative defaults to reduce the possibility that spams
will be learned as hams. It's better to take longer to get a corpus sufficient
to enable Bayes analysis than it is to autolearn messages improperly.

See users list 2012-08-15 "Very spammy messages yield BAYES_00"
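
The effect of the proposed change can be sketched as follows. This is only
a minimal illustration, not SpamAssassin's code: the function name
autolearn_decision is hypothetical, the sample score is invented, and the
real autolearner applies further conditions (for example, which rules count
toward the autolearn score) that are left out here.

```python
# Hypothetical sketch of the autolearn threshold check; not SpamAssassin code.
# Assumption: a message is learned as ham when its autolearn score is below
# bayes_auto_learn_threshold_nonspam, and as spam when it reaches
# bayes_auto_learn_threshold_spam. All other conditions are omitted.

def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam', or None (not learned) for an autolearn score."""
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None

# Invented example: a spam that slips through with a score of -1.5.
missed_spam_score = -1.5
print(autolearn_decision(missed_spam_score))                      # ham
print(autolearn_decision(missed_spam_score, ham_threshold=-3.0))  # None
```

With the current default the missed spam is learned as ham; with -3 it is
simply not learned at all.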

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

--- Comment #8 from RW <rw...@googlemail.com> ---
(In reply to comment #7)

> FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
> (I don't trust third parties/keys for WLing)
> Autolearning ham has never been an issue on a mixed language system.
> (in the last 8 years, I have never fed Bayes manually )

Then how do you get to -4?

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

Darxus <Da...@ChaosReigns.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Darxus@ChaosReigns.com

--- Comment #1 from Darxus <Da...@ChaosReigns.com> ---
Has anyone ever actually done any testing on autolearning to verify it helps or
determine optimal thresholds?

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

--- Comment #5 from AXB <ax...@gmail.com> ---
(In reply to comment #1)
> Has anyone ever actually done any testing on autolearning to verify it helps
> or determine optimal thresholds?

Tested and using -4 with autolearn only (no manual training) on a very mixed
user base, and site-wide Bayes has been very reliable.

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

--- Comment #7 from AXB <ax...@gmail.com> ---
(In reply to comment #6)
> (In reply to comment #5)
> 
> > tested and using -4 and autolearn only (no manual training) on a very mixed
> > user base and site wide Bayes has been very reliable..
> 
> The trouble with making ham autolearning dependent on DNS whitelists is that
> the training can change dramatically with the scores of those rules. If you
> started training a while ago when RCVD_IN_DNSWL_MED scored -4, then you
> will have trained on a much wider selection than you would if you start over now.
> Currently you'll be reliant on RCVD_IN_DNSWL_HI and combinations like
> RCVD_IN_DNSWL_MED+RCVD_IN_RP_CERTIFIED, which will mean mostly autogenerated
> mail from companies like Amazon, and direct marketing mail, but probably
> almost no person to person mail. 
> 
> Also if someone turns off DNS whitelists they won't learn any ham at all.

FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
(I don't trust third parties/keys for WLing)
Autolearning ham has never been an issue on a mixed language system.
(in the last 8 years, I have never fed Bayes manually )

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

--- Comment #4 from AXB <ax...@gmail.com> ---
(In reply to comment #0)
> Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3
> 
> If autolearning is enabled by default (which is a good idea) then the system
> should have very conservative defaults to reduce the possibility that spams
> will be learned as hams. It's better to take longer to get a corpus
> sufficient to enable Bayes analysis than it is to autolearn messages
> improperly.
> 
> See users list 2012-08-15 "Very spammy messages yield BAYES_00"

+1 on this.

[Bug 6828] Adjust default autolearn settings to reduce Bayesian mistraining under default configuration

--- Comment #13 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to comment #12)
> > Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3
> 
> -1, I do not agree.
> 
> In 2007 we had to bump the ham threshold from -1 to 0.1
> to widen a too narrow view on ham.
> 
> See Bug 5497 (and its predecessor Bug 5257).

Agreed. As mentioned above, "none of our tweaked system data and configuration
are relevant to this discussion."

I think bug 5497 remains open, and this should really be marked as a duplicate.

But perhaps we could use some additional information in the wiki to help
admins?  John, what do you think of that?

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

RW <rw...@googlemail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rwmaillists@googlemail.com

--- Comment #3 from RW <rw...@googlemail.com> ---
Bear in mind that hardly any default nice rules contribute to autolearning. All
the contributing rules with non-negligible scores are DNS whitelists, the very
thing that created the problem in the users list thread in the first place.

See also bug 6344

[Bug 6828] Adjust default autolearn settings to reduce Bayesian mistraining under default configuration

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com
   Target Milestone|Undefined                   |3.4.0
            Summary|Adjust default autolearn    |Adjust default autolearn
                   |ham threshold to reduce     |settings to reduce Bayesian
                   |mistraining under default   |mistraining under default
                   |configuration               |configuration

--- Comment #10 from Kevin A. McGrail <km...@pccc.com> ---
It does seem that lowering the threshold for learning as ham makes sense to
try and avoid any FNs slipping through based on anecdotal complaints.  I think
this is also being extrapolated to a spam threshold change as well.

Anyone have suggestions on a testing protocol that might help decide the
defaults?  If I am thinking correctly, if we used masscheck data, the scoring
is designed not to mark spam as ham and ham as spam.  So the minimum threshold
should be the spam threshold.  This means that 12.0 is chosen at random as an
experienced guess for a number inflated for real-world safety.

Going further, my system is configured for 6.0 instead of 5.0 with a lot of
single-fire rules and things that focus on scoring ham.  So it doesn't make it
a good source of project-wide data concerning auto-learning thresholds.

In fact, I'm wondering a bit if a default setup can score below a zero very
often and if we are now going to skew bayes towards only certain
classifications of ham.

And in the end, none of our tweaked system data and configuration are relevant
to this discussion.


Looking at the thresholds, we really need a scientific approach based on the
DEFAULT configurations to continue this discussion.

bayes_auto_learn_threshold_nonspam n.nn   (default: 0.1)
bayes_auto_learn_threshold_spam n.nn      (default: 12.0)
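
A testing protocol along those lines might start from something like the
sketch below: score a hand-labeled corpus with the default configuration,
then count what each candidate threshold would have autolearned, and how
much of it wrongly. Everything here is an assumption for illustration: the
function name mislearn_counts is hypothetical, the sample scores are
invented, and the comparison rules are a simplified model of autolearning.

```python
# Hypothetical threshold sweep; the corpus below is invented sample data,
# not masscheck output. Each entry is (autolearn_score, is_spam).

def mislearn_counts(corpus, ham_threshold, spam_threshold):
    """Count messages autolearned under the given thresholds, and errors."""
    counts = {"ham_learned": 0, "spam_learned": 0,
              "spam_as_ham": 0, "ham_as_spam": 0}
    for score, is_spam in corpus:
        if score < ham_threshold:          # would be learned as ham
            counts["ham_learned"] += 1
            if is_spam:
                counts["spam_as_ham"] += 1
        elif score >= spam_threshold:      # would be learned as spam
            counts["spam_learned"] += 1
            if not is_spam:
                counts["ham_as_spam"] += 1
    return counts

sample = [(-5.2, False), (-1.0, False), (-0.5, True), (0.0, False),
          (3.1, False), (7.5, True), (13.0, True), (25.0, True)]

# Compare the current default ham threshold with two stricter candidates.
for ham_t in (0.1, -1.0, -3.0):
    print(ham_t, mislearn_counts(sample, ham_t, 12.0))
```

On this invented sample the 0.1 threshold learns four messages as ham, one
of them a spam, while -1.0 and -3.0 each learn a single ham and no spam.
That is the trade-off under discussion: fewer mistrained messages, but a
much narrower view of ham.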

And, in the end, I wonder also if we are missing turning on
bayes_auto_learn_on_error as a default.  I think for 3.4.0 turning this setting
on and losing the backwards compatibility makes sense.

Regards,
KAM

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

--- Comment #9 from AXB <ax...@gmail.com> ---
(In reply to comment #8)
> (In reply to comment #7)
> 
> > FWIW: On my systems all DNS whitelist/certifiers/SPF/DKIM are disabled.
> > (I don't trust third parties/keys for WLing)
> > Autolearning ham has never been an issue on a mixed language system.
> > (in the last 8 years, I have never fed Bayes manually )
> 
> Then how do you get to -4?

from production settings:

use_bayes 1
bayes_auto_learn  1
bayes_auto_expire  0


bayes_min_ham_num  200
bayes_min_spam_num 200

bayes_auto_learn_threshold_nonspam -3.0
bayes_auto_learn_threshold_spam 20.0

"it just works"

[Bug 6828] Adjust default autolearn settings to reduce Bayesian mistraining under default configuration

--- Comment #11 from John Hardin <jh...@impsec.org> ---
(In reply to comment #6)
> 
> Also if someone turns off DNS whitelists they won't learn any ham at all.

I'd point out the object of this exercise is to keep an unconfigured or
minimally-configured SA install from going off the rails. If the admin is
involved enough to be disabling DNSWL lookups, they are likely involved enough
to look at and tune the autolearn settings, especially if given guidance in the
wiki.

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

--- Comment #6 from RW <rw...@googlemail.com> ---
(In reply to comment #5)

> tested and using -4 and autolearn only (no manual training) on a very mixed
> user base and site wide Bayes has been very reliable..

The trouble with making ham autolearning dependent on DNS whitelists is that
the training can change dramatically with the scores of those rules. If you
started training a while ago when RCVD_IN_DNSWL_MED scored -4, then you will
have trained on a much wider selection than you would if you start over now. Currently
you'll be reliant on RCVD_IN_DNSWL_HI and combinations like
RCVD_IN_DNSWL_MED+RCVD_IN_RP_CERTIFIED, which will mean mostly autogenerated
mail from companies like Amazon, and direct marketing mail, but probably almost
no person to person mail. 

Also if someone turns off DNS whitelists they won't learn any ham at all.
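
The dependence on whitelist scores can be illustrated with a small sketch.
Only the old RCVD_IN_DNSWL_MED score of -4 comes from this thread; the
other rule scores are assumed values chosen for illustration, the helper
learned_as_ham is hypothetical, and the threshold is the -3 proposed in
this bug.

```python
# Illustrative only: apart from the old -4 for RCVD_IN_DNSWL_MED, the rule
# scores below are assumed values, not taken from any SpamAssassin release.
HAM_THRESHOLD = -3.0  # value proposed in this bug

def learned_as_ham(hits, scores, threshold=HAM_THRESHOLD):
    """True if the summed scores of the rules hit fall below the threshold."""
    return sum(scores[rule] for rule in hits) < threshold

old_scores = {"RCVD_IN_DNSWL_MED": -4.0}
new_scores = {
    "RCVD_IN_DNSWL_MED": -2.3,     # assumed
    "RCVD_IN_DNSWL_HI": -5.0,      # assumed
    "RCVD_IN_RP_CERTIFIED": -3.0,  # assumed
}

# MED alone qualified under the old score but not under the assumed new one.
print(learned_as_ham(["RCVD_IN_DNSWL_MED"], old_scores))   # True
print(learned_as_ham(["RCVD_IN_DNSWL_MED"], new_scores))   # False
# HI alone, or MED combined with a certifier rule, still qualifies.
print(learned_as_ham(["RCVD_IN_DNSWL_HI"], new_scores))    # True
print(learned_as_ham(["RCVD_IN_DNSWL_MED", "RCVD_IN_RP_CERTIFIED"],
                     new_scores))                          # True
```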

[Bug 6828] Adjust default autolearn ham threshold to reduce mistraining under default configuration

John Hardin <jh...@impsec.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhardin@impsec.org

--- Comment #2 from John Hardin <jh...@impsec.org> ---
(In reply to comment #1)
> Has anyone ever actually done any testing on autolearning to verify it helps
> or determine optimal thresholds?

No idea. -3 was a WAG.

[Bug 6828] Adjust default autolearn settings to reduce Bayesian mistraining under default configuration

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.4.0                       |3.4.1

--- Comment #14 from Kevin A. McGrail <km...@pccc.com> ---
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target

[Bug 6828] Adjust default autolearn settings to reduce Bayesian mistraining under default configuration

--- Comment #12 from Mark Martinec <Ma...@ijs.si> ---
> Reduce the default Bayes autolearning score threshold for ham from 0.1 to -3

-1, I do not agree.

In 2007 we had to bump the ham threshold from -1 to 0.1
to widen a too narrow view on ham.

See Bug 5497 (and its predecessor Bug 5257).
