Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/11/07 11:55:28 UTC

Re: Default SpamAssassin scores don't make sense

Matt Kettler writes:
> Adam Katz wrote:
> > Theo Van Dinter wrote:
> >   
> >> http://wiki.apache.org/spamassassin/HowScoresAreAssigned
> >>     
> >
> > Thanks, that's what I was looking for.
> >
> >   
> >> The short version is that as far as SA and the perceptron (that which
> >> generates the scores) are concerned, rules are independent.  There is no
> >> "increase in severity", either a rule hits or it doesn't
> >>     
> >
> > Bayes is a perfect example of this, and is mentioned as such on the very
> > page you referenced.  Several filters, including those that I listed at
> > the top of this thread, are indeed incremental, increasing in severity.
> >  I am shocked to hear that there is nobody moderating the automated
> > scores (an Alan Greenspan of the anti-spam world, per se).
> >   
> 
> 
> Nobody said that nobody moderates the scores. I myself spend a
> considerable amount of time studying them.
> 
> However, none of us is so rash as to make adjustments just to make the
> results look better. 99% of the time, investigations into "illogical"
> scores turn up real-world evidence that explains them.
> Let's take a brief look at your SPF example.
> 
> You'd expect SPF_FAIL to have a higher score than SPF_SOFTFAIL. However,
> the real world shows otherwise.
> 
> Let's rip the results out of STATISTICS-set3.txt:
> 
> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
> 
>   3.437   4.8942   0.0396    0.992   0.80    1.38  SPF_SOFTFAIL
>   2.550   3.5717   0.1676    0.955   0.53    1.14  SPF_FAIL
> 
> Look at the S/O for each. This is the fraction of the mail matched by
> the rule that is actually spam, where 1.00 means 100% of the matching
> messages were spam.
> 
> Notice how the S/O of SPF_FAIL is actually LOWER than SOFTFAIL?
> 
> Why? Probably because there are more "aggressive" admins publishing
> records with -all without thinking about their whole network. The more
> cautious folks, who have spent a lot of time thinking about their
> network, are more likely to realize they might have missed something and
> use ~all (softfail).
> 
> Human behavior is in no way linear, and SPF here is a result of the
> behavior of the admin publishing the records. My explanation is a guess,
> but it makes sense if you think about the general behavior of a cautious
> admin compared to a "rabid" one.
> 
> Now let's look at DATE_IN_FUTURE..
> 
>   1.605   2.2815   0.0264    0.989   0.75    1.96  DATE_IN_FUTURE_03_06
>   0.926   1.2926   0.0716    0.948   0.56    1.67  DATE_IN_FUTURE_06_12
>   1.986   2.8309   0.0151    0.995   0.81    2.77  DATE_IN_FUTURE_12_24
>   0.260   0.3676   0.0075    0.980   0.53    2.69  DATE_IN_FUTURE_24_48
>   0.089   0.1252   0.0038    0.971   0.40    2.10  DATE_IN_FUTURE_48_96
>   0.245   0.3474   0.0075    0.979   0.52    2.40  DATE_IN_FUTURE_96_XX
> 
> Here again we see non-linearity in the S/O performance of the real world
> data. Note that 06_12 has the lowest S/O of the lot, and, imagine that,
> it got the lowest score too.
> 
> There's some degree of "non-fit" here, as DATE_IN_FUTURE_96_XX has a
> higher score than 03_06, but a lower S/O. A study of the actual corpus
> itself would likely show that this rule is more likely to match spam
> that has very few other rules matching, hence the higher score. This is
> a case of that "interaction with other rules" thing in my last message.
> 
> HTML_OBFUSCATE is a bit more complicated:
> 
> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>   0.637   0.9048   0.0132    0.986   0.66    1.45  HTML_OBFUSCATE_05_10
>   0.921   1.3128   0.0075    0.994   0.74    1.77  HTML_OBFUSCATE_10_20
>   0.671   0.9582   0.0000    1.000   0.70    3.40  HTML_OBFUSCATE_20_30
>   0.406   0.5801   0.0000    1.000   0.63    2.86  HTML_OBFUSCATE_30_40
>   0.198   0.2836   0.0000    1.000   0.51    2.64  HTML_OBFUSCATE_40_50
>   0.242   0.3458   0.0000    1.000   0.54    2.03  HTML_OBFUSCATE_50_60
>   0.081   0.1155   0.0000    1.000   0.40    1.65  HTML_OBFUSCATE_60_70
>   0.055   0.0784   0.0000    1.000   0.38    1.47  HTML_OBFUSCATE_70_80
>   0.012   0.0178   0.0000    1.000   0.31    0.98  HTML_OBFUSCATE_80_90
>   0.004   0.0057   0.0000    1.000   0.29    0.00  HTML_OBFUSCATE_90_100
> 
> Here the S/O's have a clear up-swing trend. However, the hit-rates at
> the upper end are very low. That's probably what's suppressing the
> scores of 60_70 and higher. They just don't hit enough mail to be relevant.
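
For anyone who wants to check the S/O column against the other two, here is a
minimal Python sketch. It assumes S/O is simply SPAM% / (SPAM% + HAM%); that
formula is an assumption inferred from the quoted figures (it reproduces them
to three decimals), not something taken from the mass-check tools themselves.
The function name is made up for illustration.

```python
# Rows copied from the STATISTICS-set3.txt excerpt quoted above:
# (rule name, SPAM%, HAM%, S/O as reported in the table)
ROWS = [
    ("SPF_SOFTFAIL", 4.8942, 0.0396, 0.992),
    ("SPF_FAIL", 3.5717, 0.1676, 0.955),
    ("DATE_IN_FUTURE_03_06", 2.2815, 0.0264, 0.989),
    ("DATE_IN_FUTURE_06_12", 1.2926, 0.0716, 0.948),
    ("HTML_OBFUSCATE_05_10", 0.9048, 0.0132, 0.986),
]


def spam_over_overall(spam_pct, ham_pct):
    """Assumed definition: share of a rule's hits that land on spam."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


if __name__ == "__main__":
    for name, spam_pct, ham_pct, reported in ROWS:
        computed = spam_over_overall(spam_pct, ham_pct)
        print(f"{name:24s} computed={computed:.3f} reported={reported:.3f}")
```

Under this reading, SPF_FAIL comes out lower than SPF_SOFTFAIL purely because
its HAM% is about four times larger, even though both hit a similar share of
spam.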

Yep.  It may also be that they hit only spam that is *already* scoring
over 10 points  -- at that stage, there's no point in adding to the score,
so whatever value the perceptron assigns to it would have no real effect.
Therefore the perceptron is free to assign low scores.
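
A toy illustration of that last point, in Python. This is a textbook
mistake-driven perceptron, not the actual SpamAssassin score generator; the
rule names, the 5.0 threshold, the learning rate and the three-message corpus
are all made up for the sketch. The only thing it is meant to show is that a
rule which fires solely on mail that is already classified correctly never
triggers an update, so its score stays wherever it started.

```python
THRESHOLD = 5.0      # illustrative spam threshold (assumption)
LEARNING_RATE = 0.5  # illustrative step size (assumption)


def score(weights, hits):
    """Sum the scores of the rules that hit a message."""
    return sum(weights[rule] for rule in hits)


def train(initial_weights, corpus, epochs=50):
    """Adjust rule scores only when a message is misclassified."""
    weights = dict(initial_weights)
    for _ in range(epochs):
        changed = False
        for hits, is_spam in corpus:
            predicted_spam = score(weights, hits) >= THRESHOLD
            if predicted_spam == is_spam:
                continue  # already correct: no rule on this message moves
            step = LEARNING_RATE if is_spam else -LEARNING_RATE
            for rule in hits:
                weights[rule] += step
            changed = True
        if not changed:
            break
    return weights


# Hypothetical rules: RARE_RULE only ever hits spam that already scores 12.
INITIAL = {"STRONG_A": 6.0, "STRONG_B": 6.0, "RARE_RULE": 0.0, "LONE_RULE": 0.0}
CORPUS = [
    ({"STRONG_A", "STRONG_B", "RARE_RULE"}, True),
    ({"STRONG_A", "STRONG_B"}, True),
    ({"LONE_RULE"}, True),
]

trained = train(INITIAL, CORPUS)

if __name__ == "__main__":
    for rule, value in sorted(trained.items()):
        print(f"{rule:10s} {value:.2f}")
```

In this toy run LONE_RULE gets pushed up until its message crosses the
threshold, while RARE_RULE is never touched, even though both hit only spam.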

--j.