Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/07/11 19:02:23 UTC

Re: update on floating dividing score between spam and ham messages

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


There's another thing worth noting -- the SpamAssassin score distribution
for hams and spams isn't even.

If you draw a graph of hams and spams, plotting the number of mails in
each category on the vertical axis and the score they get on the
horizontal axis, you don't get a simple pair of intersecting straight
lines.

Instead, since we have many more mark-as-spam rules than mark-as-ham
rules, and due to how the perceptron attempts to optimise for the 5.0
threshold, you get two lines with quite different shapes.

The ham line is a sigmoid curve that starts high in the negative area
and curves down to almost 0 at the 5.0 threshold mark.  The spam line, by
contrast, is a straight line.
http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
this, or take a look at
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
for real-world graphs of this data from 2002 -- although those graph
the inverse.
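
To make the two shapes concrete, here is a rough Python sketch of the
description above. It is only an illustration: the function names and
every parameter (peak, midpoint, steepness, slope, intercept) are made-up
assumptions, not values measured from any corpus, and the rising
direction of the spam line is assumed as well.

```python
import math

def ham_count(score, peak=1000.0, midpoint=0.0, steepness=1.0):
    """Falling sigmoid: many hams at negative scores, close to zero by 5.0.

    All parameters are illustrative assumptions.
    """
    return peak / (1.0 + math.exp(steepness * (score - midpoint)))

def spam_count(score, slope=40.0, intercept=100.0):
    """Straight line (assumed to rise with score), clamped at zero."""
    return max(0.0, slope * score + intercept)

if __name__ == "__main__":
    # Print both modelled curves side by side for scores -5 .. 10.
    for score in range(-5, 11):
        print(f"{score:>4}  ham={ham_count(score):8.1f}  "
              f"spam={spam_count(score):8.1f}")
```

With these made-up numbers the ham count at 5.0 is under 1% of its peak,
while the spam count changes by the same amount for every point of score.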

Very interesting approach though!

- --j.

Joe Flowers writes:
> Matt Kettler wrote:
> 
> >The only problem I see with this approach is that it treats false positives and
> >false negatives as being equally bad.
> >  
> >
> 
> We do get many more false negatives than false positives, even though we 
> don't get false positives very often - they are rare.
> We certainly don't get 1 fp for every fn.
> 
> >In general, you're adjusting the score bias so the number of FP's and FNs are
> >approximately equal. 
> >
> 
> This is not what we are seeing in practice. It's not even close to 50-50.
> 
> >Although STATISTICS*.txt would suggest that this boundary
> >occurs somewhere near 2.0, your own local biases could change this considerably.
> >
> >
> >SA's normal scoreset is evolved with the concept that it's better to have 99
> >false negatives than 1 false positive. 
> >
> 
> We are very glad and happy about this concept and implementation.
> 
> >The concept here is most people use
> >scripts to move their spam into a separate folder, or auto delete it. With that
> >going on, a FP is potentially lost valid email, whereas a FN is a minor
> >inconvenience.
> >  
> >
> 
> Yes.... We work hard to inform our users and to actively solicit their 
> feedback on how the system is working and to look out for the system 
> misplacing emails, especially valid ones. I know it's still not perfect....
> 
> >For any site that considers FPs to be "not too bad" because all mail is manually
> >examined anyway, lowering the score threshold may be a workable thing.
> >
> >However, other sites that auto-delete such messages may have considerable
> >problems if they lower the threshold
> >  
> >
> 
> YES!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC0qYfMJF5cimLx9ARAp+YAJ0X7eoijcnMOE+3WkOlfQQEzasjwgCfZp9B
TdyM6BfLga48fgif1AzBW7U=
=qdan
-----END PGP SIGNATURE-----


Re: update on floating dividing score between spam and ham messages

Posted by Joe Flowers <fl...@social.chass.ncsu.edu>.
Thanks Jason!

That's good, new info for me. That'll help me *at the very least* 
visualize what I am trying to do a little better. I've been very curious 
to know what the rough shapes of those graphs look like.

Joe



Justin Mason wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>
>There's another thing worth noting -- the SpamAssassin score distribution
>for hams and spams isn't even.
>
>If you draw a graph of hams and spams, plotting the number of mails in
>each category as the vertical axis and the score they get as the
>horizontal axis, you don't get a simple pair of intersecting straight
>lines.
>
>Instead, since we have many more mark-as-spam rules than mark-as-ham,
>and due to how the perceptron attempts to optimise for the 5.0
>threshold, what happens is that you have two different lines.
>
>The ham line is a sigmoid curve, that starts high in the negative area,
>and curves down to almost 0 at the 5.0 threshold mark.  The spam line, by
>contrast, is a straight line.
>http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate
>this, or take a look at
>http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
>for real-world graphs of this data from 2002 -- although graphing
>the inverse.
>
>Very interesting approach though!
>
>- --j.
>  
>



Re: update on floating dividing score between spam and ham messages

Posted by Loren Wilton <lw...@earthlink.net>.
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.

I don't see that those particular curve shapes in any way invalidate
this method, although they do bias it somewhat.
The two curves are essentially smooth, with no major dips or bumps in
them, so it is possible to select a ratio without getting inversions in the
ratio as the selector moves from left to right.  You may have to be careful
when calculating the ratio, given that ham goes to effectively zero above a
certain value.  But n:0 and 3.45n:0 are still perfectly valid ratios to deal
with, even if one of the terms is zero.
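
As a rough Python sketch of that care with zero: the helper names, the
bucket layout and the sample counts below are all hypothetical, and this
is one possible way to handle the ratio, not the method actually in use.
A zero ham count is treated as an infinite spam:ham ratio, and an empty
bucket is skipped.

```python
import math

# Hypothetical per-score buckets: (score, spam_count, ham_count).
SAMPLE_BUCKETS = [
    (0, 10, 90), (1, 20, 60), (2, 30, 30),
    (3, 40, 10), (4, 50, 2), (5, 60, 0),
]

def spam_to_ham_ratio(spam, ham):
    """Return spam:ham as a float; n:0 becomes infinity, 0:0 becomes NaN."""
    if ham == 0:
        return math.inf if spam > 0 else math.nan
    return spam / ham

def first_score_reaching(buckets, target):
    """Lowest score whose ratio meets the target, moving left to right."""
    for score, spam, ham in buckets:
        # NaN compares False, so empty buckets are skipped automatically.
        if spam_to_ham_ratio(spam, ham) >= target:
            return score
    return None

def has_no_inversions(buckets):
    """True if the ratio never decreases as the score increases."""
    ratios = [spam_to_ham_ratio(s, h) for _, s, h in buckets]
    ratios = [r for r in ratios if not math.isnan(r)]
    return all(a <= b for a, b in zip(ratios, ratios[1:]))

if __name__ == "__main__":
    print(first_score_reaching(SAMPLE_BUCKETS, 3.45))
    print(has_no_inversions(SAMPLE_BUCKETS))
```

Because infinity compares greater than any finite target, a bucket with
no ham at all still selects cleanly instead of raising a division error.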

        Loren