
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/07/07 19:56:56 UTC

[Bug 4467] New: investigate setting BAYES_ scores manually instead of via perceptron

http://bugzilla.spamassassin.org/show_bug.cgi?id=4467

           Summary: investigate setting BAYES_ scores manually instead of
                    via perceptron
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Score Generation
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org


It's been a pretty solid FAQ during the SpamAssassin 3.0.0 release timeframe
that BAYES_99 was scored too low, e.g.:

  http://permalink.gmane.org/gmane.mail.spam.spamassassin.general/60217
  http://readlist.com/lists/incubator.apache.org/spamassassin-users/0/1500.html

On top of that, the scores for the BAYES_* rules are wholly dependent on
external factors that cannot be measured effectively through mass-checks to
match all environments.  For example, these setups have radically different
amounts of accurate training:

  - a site-wide autolearning system
  - a personalised, extensively hand-trained system with over 10000 mails of
    each type
  - a system that has received the bare minimum "200 of each" training, with a
    little autolearning on top
  - mass-check, with the new sampling method

As a result, I suspect that the Perceptron is going to generate scores that are
over-optimized for mass-check only, and under-optimized for the other end-user
setups.  To avoid this, I suggest that we set the BAYES_* scores manually, by
setting them as "userconf" rules.

Comments/votes please.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4467] investigate setting BAYES_ scores manually instead of via perceptron

Posted by bu...@bugzilla.spamassassin.org.

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4467


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From jm@jmason.org  2006-12-04 10:36 -------
'I was thinking the other day, what if we used reuse for BAYES_ rules?  This
assumes that mass-checkers are running bayes of course.'

I'm not keen on that -- each mass-checker would have differing levels of
reliability for their training data.  For example, I haven't trained bayes
(apart from via autolearning) in 2 years... I wouldn't really want the accuracy
of my neglected db to dictate scores for someone who's put in the work to train
theirs.

I'm just going to mark this as FIXED for 3.2.0, since the bayes scores in
50_scores.cf *are* marked as immutable anyway since bug 4505.




[Bug 4467] investigate setting BAYES_ scores manually instead of via perceptron

Posted by bu...@bugzilla.spamassassin.org.

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4467





------- Additional Comments From jm@jmason.org  2006-11-03 10:39 -------
'Is the Perceptron smart enough to take fixed scores into account and
redistribute the score amongst the other non-fixed rules that hit, or does
it just ignore fixed scores?'

Yes, it redistributes.

'If it takes fixed scores into account, it might be interesting to do several
scoring runs with different Bayes scores and see what effect this has on a
few of the other more interesting rules, unless this would be a huge pain to
attempt.'

BTW, I reread the bug for the 3.1.x perceptron runs; we actually did this on
the last perceptron run for 3.1.x, since we replaced some extreme
perceptron-generated BAYES scores with saner ones.  It made little difference
(which was good).
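
To illustrate what "redistributes" means here, below is a minimal sketch of a
perceptron-style update in which some scores are held fixed and only the
remaining rules absorb the corrections.  It is not SpamAssassin's actual
perceptron code; the function name, the learning rate, the epoch count and the
example rule RULE_A are all assumptions made for illustration.

```python
def train_with_fixed(messages, scores, fixed, threshold=5.0, rate=0.1, epochs=50):
    """Simplified perceptron-style scoring run (illustrative only).

    messages: list of (set_of_rule_names_hit, is_spam) pairs
    scores:   dict of rule name -> starting score
    fixed:    set of rule names whose scores must not change
    """
    scores = dict(scores)  # work on a copy, leave the caller's dict alone
    for _ in range(epochs):
        for hits, is_spam in messages:
            total = sum(scores[rule] for rule in hits)
            predicted_spam = total >= threshold
            if predicted_spam == is_spam:
                continue  # classified correctly, nothing to adjust
            # Push the total up for missed spam, down for a false positive.
            step = rate if is_spam else -rate
            for rule in hits:
                if rule not in fixed:
                    # Only the non-fixed rules absorb the correction.
                    scores[rule] += step
    return scores


# Hypothetical data: one spam message hitting BAYES_99 and a made-up RULE_A.
start = {"BAYES_99": 3.5, "RULE_A": 0.0}
result = train_with_fixed([({"BAYES_99", "RULE_A"}, True)], start, fixed={"BAYES_99"})
```

In this toy run BAYES_99 stays at its fixed value while RULE_A is raised until
the message crosses the threshold, which is the behaviour the question above
was asking about.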




[Bug 4467] investigate setting BAYES_ scores manually instead of via perceptron

Posted by bu...@bugzilla.spamassassin.org.

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4467


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.2.0







[Bug 4467] investigate setting BAYES_ scores manually instead of via perceptron

Posted by bu...@bugzilla.spamassassin.org.

http://bugzilla.spamassassin.org/show_bug.cgi?id=4467





------- Additional Comments From Bob@Menschel.net  2005-07-07 18:47 -------
I'm in full agreement with this idea.  And following Loren's comment, it might
be worthwhile to 1) let the perceptron suggest initial BAYES values, then
2) adjust the BAYES_9* rules up towards 5.0 by 25%, 50%, and 75%, and rescore
at those levels.  I'd be interested in seeing not only how other rules' scores
change, but also the overall FN and FP rates.
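
The arithmetic in step 2 can be sketched as follows, reading "up towards 5.0
by 25%" as moving a score 25% of the way from its perceptron-suggested value
to 5.0.  That reading, the helper names, and the starting scores are
assumptions for illustration; the starting values are not real perceptron
output.

```python
def nudge_toward(score, target=5.0, fraction=0.5):
    """Move `score` the given fraction of the way toward `target`."""
    return score + fraction * (target - score)


def candidate_score_sets(scores, target=5.0, fractions=(0.25, 0.50, 0.75)):
    """Build one adjusted score set per fraction, ready for a rescoring run."""
    return {
        fraction: {
            rule: round(nudge_toward(score, target, fraction), 3)
            for rule, score in scores.items()
        }
        for fraction in fractions
    }


# Hypothetical perceptron-suggested starting points (made-up values).
suggested = {"BAYES_95": 3.0, "BAYES_99": 3.5}
score_sets = candidate_score_sets(suggested)
# score_sets[0.25] -> {"BAYES_95": 3.5, "BAYES_99": 3.875}
```

Each of the three resulting sets would then be held fixed for a separate
scoring run, so the FN and FP rates can be compared across levels.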

Given Justin's four categories of Bayes systems, perhaps it might be worth
having two or three Bayes score sets, for Bayes with low confidence (new
systems, still feeling their way), Bayes with good confidence, and Bayes with
high confidence. 

I'd have no problem with the "low confidence" scores file being the default,
with instructions on how to apply the higher confidence scores files being
included in the INSTALL file. 




[Bug 4467] investigate setting BAYES_ scores manually instead of via perceptron

Posted by bu...@bugzilla.spamassassin.org.

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4467





------- Additional Comments From parkerm@pobox.com  2006-11-03 11:33 -------
I was thinking the other day, what if we used reuse for BAYES_ rules?  This
assumes that mass-checkers are running bayes of course.




[Bug 4467] investigate setting BAYES_ scores manually instead of via perceptron

Posted by bu...@bugzilla.spamassassin.org.

http://bugzilla.spamassassin.org/show_bug.cgi?id=4467





------- Additional Comments From lwilton@earthlink.net  2005-07-07 17:59 -------
Subject: Re:   New: investigate setting BAYES_ scores manually instead of via perceptron

> comments/votes please.

I don't get a vote, but if I did I'd sure be in favor of this!

I seem to recall quite a good deal of discussion back in the 3.0 timeframe
on ranges and possible score assignments for the Bayes tests; or at least I
think I do.  Perhaps there were some useful potential scores in there.

Is the Perceptron smart enough to take fixed scores into account and
redistribute the score amongst the other non-fixed rules that hit, or does
it just ignore fixed scores?

If it takes fixed scores into account, it might be interesting to do several
scoring runs with different Bayes scores and see what effect this has on a
few of the other more interesting rules, unless this would be a huge pain to
attempt.




