Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/10/09 00:49:08 UTC

statistics help needed

Hey all --

I've been considering possible improvements to how we figure out what
rules are effective.

Currently we use the S/O ratio and hit-rate of each individual rule, in
other words, if a rule hits a lot of spam, and little nonspam, we detect
that and consider it "good".

However, that doesn't take into account the situation where multiple rules
are hitting mostly the same mail; for example, like this:

             S1  S2  S3  S4  S5  H1  H2  H3  H4  H5
    RULE1:   x   x   x   x                       
    RULE2:   x   x   x   x                       
    RULE3:               x   x                   x
    RULE4:                   x                    

(S1-S5 = 5 spam mails; H1-H5 = 5 ham/nonspam mails.  "x" means a "hit"
by a rule, " " means no hit -- our rules are boolean.)
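To make the arithmetic concrete, here's a quick sketch of the current metrics over that table (the set layout and message indices are mine, purely for illustration):

```python
# Toy data from the table above: per-rule spam/ham hit-sets.
# Indices 0-4 stand for S1-S5 (and H1-H5 for ham hits).
rules = {
    "RULE1": {"spam": {0, 1, 2, 3}, "ham": set()},
    "RULE2": {"spam": {0, 1, 2, 3}, "ham": set()},
    "RULE3": {"spam": {3, 4},       "ham": {4}},
    "RULE4": {"spam": {4},          "ham": set()},
}
N_SPAM = 5

for name, h in rules.items():
    s, fp = len(h["spam"]), len(h["ham"])
    so = s / (s + fp)          # S/O ratio: spam hits / all hits
    rate = s / N_SPAM          # hit-rate over the spam corpus
    print(f"{name}: S/O={so:.2f}  spam-hit-rate={rate:.0%}")
```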

obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one
should be removed, or (b) both should share half the score as equal
contributors.  (b) is what the perceptron currently does.

RULE3, by contrast, would be considered a lousy rule under our current
scheme, because it hits ham 33% of the time; however in this case, it's
actually quite informative, because it's hitting spam that the others
cannot hit.

RULE4 is even better than RULE3, because it's hitting the mail that
RULE1 and RULE2 miss, yet it doesn't appear that good because:

    - it has a hit-rate half that of RULE3
    - it has a hit-rate one-quarter that of RULE1 and RULE2

This is the kind of effect we do see now -- a lot of our rules are
actually firing in combination, and some rules that hit e.g. 0.5% of
spam are in effect more useful than some rules that hit 20%, because
they're hitting the 0.5% of spam that *gets past* the other rules.
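As a toy illustration of that effect (data from the table above; the "residual utility" metric is just mine for this example), here's what each rule catches that the strongest rule lets through:

```python
# Residual utility: spam each rule hits that RULE1 (the strongest
# rule on the toy corpus) misses.  Indices 0-4 stand for S1-S5.
rule_spam = {
    "RULE1": {0, 1, 2, 3},
    "RULE2": {0, 1, 2, 3},
    "RULE3": {3, 4},
    "RULE4": {4},
}
missed_by_top = set(range(5)) - rule_spam["RULE1"]   # just {4} (S5)

for name, hits in rule_spam.items():
    print(f"{name} catches {len(hits & missed_by_top)} of the "
          f"{len(missed_by_top)} spam RULE1 misses")
```

RULE2, with four times RULE4's hit-rate, adds nothing; RULE4 covers everything RULE1 misses.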

So, what I'm looking for is a statistical method to measure this effect,
and report

    - (a) that RULE1 and RULE2 overlap almost entirely
    - (b) that RULE3 is worthwhile, because it can hit that 20% of the
      messages the other rules cannot
    - (c) that RULE4 is better than RULE3 because it has a lower
      false-positive rate

The perceptron rescoring system *does* do this already, but for
rule QA and rule selection, being able to do this at a "human"
level -- and quickly -- would be essential.    We also have an
overlap-measurement tool, but that's only useful for measuring (a)
and is extremely RAM-hungry.
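For what it's worth, a less RAM-hungry overlap measurement might look like this sketch (Python ints as per-rule bitmasks over the toy corpus; a real version would stream mass-check logs into the masks):

```python
# One int per rule, bit i set iff the rule hit message i
# (bits 0-4 = S1-S5, bits 5-9 = H1-H5).  Jaccard overlap then
# needs only bitwise AND/OR and popcounts -- no per-message lists.
from itertools import combinations

masks = {
    "RULE1": 0b0000001111,   # S1-S4
    "RULE2": 0b0000001111,   # S1-S4
    "RULE3": 0b1000011000,   # S4, S5, H5
    "RULE4": 0b0000010000,   # S5
}

def jaccard(a, b):
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

for r1, r2 in combinations(masks, 2):
    print(f"{r1}/{r2}: overlap={jaccard(masks[r1], masks[r2]):.2f}")
```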

So -- statisticians?  any tips? ;)   (if anyone can fwd this on
to their resident stats guy, that would be appreciated, too.)

(Henry, you may be too busy to respond if you're writing up of course ;)

--j.

Re: statistics help needed

Posted by Joe Emenaker <jo...@emenaker.com>.
Justin Mason wrote:

>             S1  S2  S3  S4  S5  H1  H2  H3  H4  H5
>    RULE1:   x   x   x   x                       
>    RULE2:   x   x   x   x                       
>    RULE3:               x   x                   x
>    RULE4:                   x                    
>
>(S1-S5 = 5 spam mails; H1-H5 = 5 ham/nonspam mails.  "x" means a "hit"
>by a rule, " " means no hit -- our rules are boolean.)
>  
>
...

>So, what I'm looking for is a statistical method to measure this effect,
>and report
>
>    - (a) that RULE1 and RULE2 overlap almost entirely
>    - (b) that RULE3 is worthwhile, because it can hit that 20% of the
>      messages the other rules cannot
>    - (c) that RULE4 is better than RULE3 because it has a lower
>      false-positive rate
>  
>
Well, it should be fairly straightforward to calculate correlation 
values between all of the rules, but I'm not sure how far that will take 
you.
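For instance, a quick correlation sketch using the phi coefficient (Pearson correlation specialized to 0/1 variables) over the toy hit matrix from Justin's mail -- the data layout is mine:

```python
# Phi coefficient between rule pairs.  One 0/1 entry per message,
# in order S1-S5, H1-H5.
from math import sqrt

hits = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}

def phi(x, y):
    n = len(x)
    n11 = sum(a & b for a, b in zip(x, y))   # both rules hit
    nx, ny = sum(x), sum(y)
    den = sqrt(nx * (n - nx) * ny * (n - ny))
    return (n * n11 - nx * ny) / den if den else 0.0

print(phi(hits["RULE1"], hits["RULE2"]))   # 1.0: total overlap
print(phi(hits["RULE3"], hits["RULE4"]))   # partial overlap
```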

I've got an idea of something which will give you a ruleset which:
  1 - Maximizes the amount of detected spam,
  2 - Minimizes false-positives, and
  3 - Contains the fewest rules possible.

But I'm not sure that's what you want either, because the algorithm 
would gravitate toward giving you a ruleset wherein each spam would be 
matched by a single rule in the set... which makes me uneasy.
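A sketch of the sort of thing I mean (greedy set cover; the details here are illustrative, on the toy matrix from Justin's mail, not a finished algorithm):

```python
# Greedy set cover: at each step, add the rule with no ham hits that
# covers the most still-unmatched spam; stop when nothing helps.
rule_spam = {
    "RULE1": {0, 1, 2, 3}, "RULE2": {0, 1, 2, 3},
    "RULE3": {3, 4},       "RULE4": {4},
}
rule_ham = {"RULE1": set(), "RULE2": set(),
            "RULE3": {4},   "RULE4": set()}

candidates = [r for r in rule_spam if not rule_ham[r]]   # FP-free rules
chosen, covered = [], set()
while True:
    best = max(candidates, key=lambda r: len(rule_spam[r] - covered))
    if not rule_spam[best] - covered:
        break                      # no remaining rule adds coverage
    chosen.append(best)
    covered |= rule_spam[best]
print(chosen)                      # small ruleset, maximal coverage
```

Note how it exhibits exactly the behavior I'm uneasy about: each spam ends up matched by essentially one chosen rule.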

Frankly, I'd rather have a set of rules which hit on every spam I receive 
(provided that they don't increase my false positives), because doing so 
only sends the spam score of the spam messages higher...  which widens 
the numerical gap between my ham and spam scores... which I regard as a 
good thing.

- Joe


Re: statistics help needed

Posted by Scott A Crosby <sc...@cs.rice.edu>.
On Fri, 08 Oct 2004 15:49:08 -0700, jm@jmason.org (Justin Mason) writes:

> However, that doesn't take into account the situation where multiple rules
> are hitting mostly the same mail; for example, like this:
>
>              S1  S2  S3  S4  S5  H1  H2  H3  H4  H5
>     RULE1:   x   x   x   x                       
>     RULE2:   x   x   x   x                       
>     RULE3:               x   x                   x
>     RULE4:                   x                    
>

> obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one
> should be removed, or (b) both should share half the score as equal
> contributors.  (b) is what the perceptron currently does.
>
> RULE3, by contrast, would be considered a lousy rule under our current
> scheme, because it hits ham 33% of the time; however in this case, it's
> actually quite informative, because it's hitting spam that the others
> cannot hit.
>
> RULE4 is even better than RULE3, because it's hitting the mail that
> RULE1 and RULE2 miss, yet it doesn't appear that good because:
>
>     - it has a hit-rate half that of RULE3
>     - it has a hit-rate one-quarter that of RULE1 and RULE2
>
> This is the kind of effect we do see now -- a lot of our rules are
> actually firing in combination, and some rules that hit e.g. 0.5% of
> spam are in effect more useful than some rules that hit 20%, because
> they're hitting the 0.5% of spam that *gets past* the other rules.
>
> So, what I'm looking for is a statistical method to measure this effect,
> and report
>
>     - (a) that RULE1 and RULE2 overlap almost entirely

Cross Entropy // Information Gain between the two rules. 

Cross entropy can also identify if one rule is redundant with respect
to, e.g., two other rules. I think it may be possible to create a
formula akin to CE / IG, but biased toward avoiding FPs.
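A toy sketch of what I mean, as mutual information between two rules' hit indicators (hit vectors taken from Justin's example; zero bits means independent, higher means the rules carry overlapping information):

```python
from math import log2

def mutual_info(x, y):
    """MI in bits between two 0/1 vectors of equal length."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(u == a and v == b for u, v in zip(x, y)) / n
            p_a = sum(u == a for u in x) / n
            p_b = sum(v == b for v in y) / n
            if p_ab:
                mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi

r1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # RULE1 over S1-S5, H1-H5
r2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # RULE2
r4 = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # RULE4
print(mutual_info(r1, r2))   # high: a redundant pair
print(mutual_info(r1, r4))   # near zero: nearly independent
```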

>     - (b) that RULE3 is worthwhile, because it can hit that 20% of the
>       messages the other rules cannot

Information gain of RULE3 over the set of email that the other rules miss.
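Sketched on the toy data (the residual set being the messages RULE1/RULE2 miss -- S5 plus H1-H5 -- and the split being whether RULE3 fires):

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    out = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        out -= p * log2(p)
    return out

# (is_spam, rule3_hit) for the residual messages S5, H1..H5
residual = [(1, 1), (0, 0), (0, 0), (0, 0), (0, 0), (0, 1)]
labels = [s for s, _ in residual]
on  = [s for s, h in residual if h]        # label mix where RULE3 fires
off = [s for s, h in residual if not h]    # ...and where it doesn't
gain = entropy(labels) \
     - len(on) / len(residual) * entropy(on) \
     - len(off) / len(residual) * entropy(off)
print(f"IG of RULE3 on the residual set: {gain:.2f} bits")
```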

>     - (c) that RULE4 is better than RULE3 because it has a lower
>       false-positive rate


> So -- statisticians?  any tips? ;)   (if anyone can fwd this on
> to their resident stats guy, that would be appreciated, too.)

A google for 'cross entropy' and 'information gain' found this thread
which looks to have a few citations.

  http://www.mail-archive.com/perl-ai@perl.org/msg00127.html

This question also makes me think very strongly of decision tree
algorithms --- if a rule doesn't have a prominent place in the
decision tree, it probably doesn't contribute much.


Scott

Re: statistics help needed

Posted by Michael Barnes <mb...@compsci.wm.edu>.
On Fri, Oct 08, 2004 at 03:49:08PM -0700, Justin Mason wrote:
> I've been considering possible improvements to how we figure out what
> rules are effective.
> 
> Currently we use the S/O ratio and hit-rate of each individual rule, in
> other words, if a rule hits a lot of spam, and little nonspam, we detect
> that and consider it "good".
> 
> However, that doesn't take into account the situation where multiple rules
> are hitting mostly the same mail; for example, like this:
> 
>              S1  S2  S3  S4  S5  H1  H2  H3  H4  H5
>     RULE1:   x   x   x   x                       
>     RULE2:   x   x   x   x                       
>     RULE3:               x   x                   x
>     RULE4:                   x                    

I've thought about this as well.  In the social sciences, a common
statistical technique for describing things like "personality" is
factor analysis.  It's been years, but from what I remember, factor
analysis groups correlated variables into "factors" that describe a
common construct or idea.
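As a rough stand-in for the idea (true factor analysis adds rotation and per-variable uniquenesses; this just pulls the top principal component of the rule-by-rule correlation matrix via power iteration, on the toy data from Justin's post):

```python
# Rules that load heavily on the same component are firing together.
from math import sqrt

cols = {   # hit vector per rule over S1-S5, H1-H5
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}

def phi(x, y):   # Pearson correlation for 0/1 vectors
    n, n11 = len(x), sum(a & b for a, b in zip(x, y))
    nx, ny = sum(x), sum(y)
    den = sqrt(nx * (n - nx) * ny * (n - ny))
    return (n * n11 - nx * ny) / den if den else 0.0

names = list(cols)
corr = [[phi(cols[a], cols[b]) for b in names] for a in names]

v = [1.0] * len(names)            # power iteration for the top eigenvector
for _ in range(200):
    w = [sum(row[j] * v[j] for j in range(len(v))) for row in corr]
    norm = sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
print(dict(zip(names, (round(x, 2) for x in v))))
```

RULE1 and RULE2 load identically on the top component, which is the "these fire together" signal.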

I'm not familiar with the genetic or whatever algorithms that are
currently used in SA, but I would be interested in looking into
alternate algorithms like factor analysis for identifying spam.

One interesting thing about factor analysis is that it could describe
spam at a finer grain than just "SPAM" or "HAM".  It could categorize mails
as, say, "nigerian spam", "porn spam", "mortgage spam", etc.

Mike

-- 
/-----------------------------------------\
| Michael Barnes <mb...@compsci.wm.edu> |
| UNIX Systems Administrator              |
| College of William and Mary             |
| Phone: (757) 879-3930                   |
\-----------------------------------------/