Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/12/14 12:45:05 UTC

[Bug 4031] New: bayesian scores lower for higher probability

http://bugzilla.spamassassin.org/show_bug.cgi?id=4031

           Summary: bayesian scores lower for higher probability
           Product: Spamassassin
           Version: 3.0.1
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P3
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: mail@webthatworks.it


 2.1 BAYES_95               BODY: Bayesian spam probability is 95 to 99%
                            [score: 0.9673]
 1.9 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.0000]
 
23_bayes.cf was not modified (it is the same as in 3.0.1 and the CVS snapshot).
The user and global configs are unmodified, apart from some cosmetic changes to
reporting.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031





------- Additional Comments From felicity@kluge.net  2004-12-15 08:58 -------
Subject: Re:  bayesian scores lower for higher probability

On Wed, Dec 15, 2004 at 08:21:52AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> So if I understand it right, the perceptron's scores are optimised for
> a rather inaccurate Bayes database ?  Or is it that there was no spam
> that only hit Bayes, so the perceptron thought it could safely reduce
> the Bayes scores ?

You're thinking about this too much in terms of black and white.
Even a 100% "correctly" trained Bayes system can (and likely will)
produce incorrect results.  Current score generation uses autolearning
only, which generally has errors.  The new version for 3.1, not based on
autolearn, is currently planned to introduce a small amount of learning
error to simulate real-world conditions.

Anyway, I'm going to close this ticket again, because there is no
bug here.  If you want to discuss score generation etc. further, please feel
free to start a conversation on the dev mailing list.






[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031

mail@webthatworks.it changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |



------- Additional Comments From mail@webthatworks.it  2004-12-15 03:22 -------
It is not a problem of "perceived" inconsistency. If the system were working I
wouldn't have bothered to check it. Catch rates are lower compared to 2.6 and
there wasn't any change in false positives.

It could be a setup/install problem but... I wiped everything and reinstalled.
Then I fed the learning engine with fresh spam, ham, and false
positives/negatives. The results are the same: 1.9 BAYES_99, 2.1 BAYES_95.

90% of the false negatives I have had since switching to 3.X could have been
caught if the score were linear.
All the false positives I've had in the past (the last false positive is dated
Sept. 2004) were marked as spam because the sender was listed in many SBLs. Just
5% of these false positives had a Bayesian spam probability higher than 40%.
90% of this 5% were alert messages sent on the same day by one of my servers in
error.
These are rough estimates; I could do a more precise analysis, but I don't think
the conclusions would be qualitatively very different.

I'm wondering if it is anything related to the environment (libraries, Perl
version 5.8.0, compilation... whatever).
 
 




[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031





------- Additional Comments From felicity@kluge.net  2004-12-15 07:02 -------
Subject: Re:  bayesian scores lower for higher probability

On Wed, Dec 15, 2004 at 03:22:17AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> It could be a setup/install problem but... I wiped everything and reinstalled.     
> Then I fed the learning engine with fresh spam, ham, false     
> positives/negatives. Results are the same. 1.9 BAYES_99,  2.1 BAYES_95. 

Of course they didn't change.  The scores aren't dynamic; they're generated
once before a release.

> 90% of the false negatives I have had since switching to 3.X could have been
> caught if the score were linear.

The "problem" here is that BAYES_99 causes FPs, which the perceptron handles by
lowering the score since a billion other rules usually hit on the spam.

This isn't a scoring issue, it's a learning/bayes issue, if anything.






[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031

henry@stern.ca changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|INVALID                     |DUPLICATE



------- Additional Comments From henry@stern.ca  2004-12-15 09:07 -------
Nick,

I did the score optimization for SpamAssassin 3.0.

> So if I understand it right, the perceptron's scores are optimised for
> a rather inaccurate Bayes database ?  Or is it that there was no spam
> that only hit Bayes, so the perceptron thought it could safely reduce
> the Bayes scores ?

The latter is more the case than the former.  We're learning scores by
minimizing an error function that is sort of like:

Err(Msg,Score,Threshold,Class) = { 0 if Class=Spam and Score>=Threshold
                                 { 0 if Class=Ham and Score<Threshold
                                 { abs(Score-Threshold) otherwise

I don't have time to give you any hard numbers right now, but one can safely
assume that messages with BAYES_99 usually have a lot of other rule hits as
well.  Because of this, most messages with BAYES_99 already have high scores and
the value of the error function will almost always be 0.
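
As a rough illustration only (not the actual score-generation code), the error
function above can be sketched in Python.  The threshold of 5.0, the rule names
OTHER_RULE_A and OTHER_RULE_B, and every score other than the 1.9 quoted in this
bug are made-up assumptions; message_score() is a hypothetical helper that
simply sums the scores of the rules that hit.

```python
from typing import Dict, Iterable


def message_score(hits: Iterable[str], scores: Dict[str, float]) -> float:
    """Hypothetical helper: sum the scores of every rule that hit."""
    return sum(scores[rule] for rule in hits)


def err(score: float, threshold: float, is_spam: bool) -> float:
    """Piecewise error as described above: zero when the message lands on
    the correct side of the threshold, otherwise its distance from it."""
    if is_spam and score >= threshold:
        return 0.0
    if not is_spam and score < threshold:
        return 0.0
    return abs(score - threshold)


# Assumed values, chosen only for illustration.
THRESHOLD = 5.0
scores = {"BAYES_99": 1.9, "OTHER_RULE_A": 2.5, "OTHER_RULE_B": 3.0}

# Spam that hits BAYES_99 along with other rules is already over the
# threshold, so its error is zero whatever BAYES_99 itself is worth.
busy_spam = ["BAYES_99", "OTHER_RULE_A", "OTHER_RULE_B"]

# Spam that hits only BAYES_99 stays under the threshold and is the only
# kind of message that would push the BAYES_99 score upward.
lonely_spam = ["BAYES_99"]

busy_err = err(message_score(busy_spam, scores), THRESHOLD, is_spam=True)
lonely_err = err(message_score(lonely_spam, scores), THRESHOLD, is_spam=True)

print(f"busy spam error:   {busy_err:.2f}")
print(f"lonely spam error: {lonely_err:.2f}")
```

If nearly all BAYES_99 spam in the training set looks like busy_spam, the
optimizer sees almost no error to reduce by raising the BAYES_99 score.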

Another reason why scores for rules like BAYES_99 are smaller than they were
before is that the URIBL rules are too "loud" in the dataset:  Because there are
so many of these rules with very high hit rates, they tend to occupy a
proportionate (or disproportionate, depending on how you look at it) amount of
"mass" of the scores.  

It is a lot easier to make an accurate perceptron with lower scores.  As the
scores are forced higher, the false positive rate goes up.  However, this method
is also vulnerable to attacks from an adversary because the scores for the most
accurate rules tend to be much larger than those of the median rules.  This is
the subject of my master's thesis (which is almost done).  If you want a more
in-depth answer, ping me for a copy of my thesis in January.

The short answer is:  For now, the scores are as high as I can make them without
SpamAssassin making false positives all the time.

This bug is a duplicate of Bug 3821.

Henry

*** This bug has been marked as a duplicate of 3821 ***





[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031

felicity@kluge.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID



------- Additional Comments From felicity@kluge.net  2004-12-14 09:01 -------
I'm not sure what the issue is here, but I'm guessing that what you're complaining about is already 
covered in the wiki:

http://wiki.apache.org/spamassassin/HowScoresAreAssigned




[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031

felicity@kluge.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |INVALID



------- Additional Comments From felicity@kluge.net  2004-12-15 09:00 -------
closing again.




[Bug 4031] bayesian scores lower for higher probability

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4031





------- Additional Comments From nj@leverton.org  2004-12-15 08:21 -------
Subject: Re:  bayesian scores lower for higher probability

So if I understand it right, the perceptron's scores are optimised for
a rather inaccurate Bayes database ?  Or is it that there was no spam
that only hit Bayes, so the perceptron thought it could safely reduce
the Bayes scores ?

In use against fresh spam, we hope the former won't be true, and we
know the latter won't be - the whole point of Bayes is to catch spam
that the rules don't !

Nick




