Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/10/11 08:04:39 UTC
Re: statistics help needed

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Scott A Crosby writes:
> On Fri, 08 Oct 2004 15:49:08 -0700, jm@jmason.org (Justin Mason) writes:
> 
> > However, that doesn't take into account the situation where multiple rules
> > are hitting mostly the same mail; for example, like this:
> >
> >              S1  S2  S3  S4  S5  H1  H2  H3  H4  H5
> >     RULE1:   x   x   x   x                       
> >     RULE2:   x   x   x   x                       
> >     RULE3:               x   x                   x
> >     RULE4:                   x                    
> >
> 
> > obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one
> > should be removed, or (b) both should share half the score as equal
> > contributors.  (b) is what the perceptron currently does.
> >
> > RULE3, by contrast, would be considered a lousy rule under our current
> > scheme, because it hits ham 33% of the time; however in this case, it's
> > actually quite informative, because it's hitting
> > spam that the others cannot hit.
> >
> > RULE4 is even better than RULE3, because it's hitting the mail that
> > RULE1 and RULE2 miss, yet it doesn't appear that good because:
> >
> >     - it has a hit-rate half that of RULE3
> >     - it has a hit-rate 4 times lower than RULE1 and RULE2
> >
> > This is the kind of effect we do see now -- a lot of our rules are
> > actually firing in combination, and some rules that hit e.g. 0.5% of
> > spam are in effect more useful than some rules that hit 20%, because
> > they're hitting the 0.5% of spam that *gets past* the other rules.
> >
> > So, what I'm looking for is a statistical method to measure this effect,
> > and report
> >
> >     - (a) that RULE1 and RULE2 overlap almost entirely
> 
> Cross Entropy // Information Gain between the two rules. 
> 
> Cross entropy can also identify if one rule is redundant with respect
> to, eg, two different rules. I think it may be possible to create a
> formula akin to CE / IG, but biased toward avoiding FP's.

We already use IG to measure each rule's effectiveness in 3.0.0,
considering each rule individually.   Maybe pairwise IG would work;
something like:

    input: ruleA
    max = 0
    foreach ruleB in ruleset, ruleB != ruleA
        overlap = overlap between ruleA and ruleB
        if (overlap > max)
            max = overlap
    overlap figure for ruleA = max

(since we only want one number for each rule indicating how much
overlap it has with the existing ruleset, and this would report
the maximum overlap of a rule with any other in the set.)
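
A minimal sketch of that loop in Python, run against the toy table quoted
above. The overlap measure used here is the Jaccard index of the two rules'
hit sets; that is an assumption swapped in for illustration, not pairwise IG,
and the names `hits`, `jaccard` and `max_overlap` are hypothetical.

```python
# Toy hit sets from the example table: the messages each rule fires on.
hits = {
    "RULE1": {"S1", "S2", "S3", "S4"},
    "RULE2": {"S1", "S2", "S3", "S4"},
    "RULE3": {"S4", "S5", "H5"},
    "RULE4": {"S5"},
}

def jaccard(a, b):
    """Overlap of two hit sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def max_overlap(rule, hits):
    """Largest overlap between `rule` and any *other* rule in the set."""
    return max(
        (jaccard(hits[rule], other_hits)
         for name, other_hits in hits.items() if name != rule),
        default=0.0,
    )

# One figure per rule, as a "freqs"-style column would need.
for rule in sorted(hits):
    print(rule, round(max_overlap(rule, hits), 2))
```

Any symmetric overlap measure could be dropped in place of `jaccard`; the
loop structure stays the same.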

> >     - (b) that RULE3 is worthwhile, because it can hit that 20% of the
> >       messages the other rules cannot
> 
> Information gain of RULE3 over the set of email that the other rules miss.

The problem there is that almost every email hits one or two "noise"
rules, so we'd first need a way to decide which rules count as useful;
otherwise the "set of email that the other rules miss" is very
small.
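
To make the idea concrete, here is a rough Python sketch of information gain
computed only over the mail that a chosen set of rules misses, again using
the toy table. The `useful` list passed to `residual` is hand-picked, which
is exactly the open question above; `entropy`, `info_gain` and `residual`
are hypothetical helper names, not anything in the SpamAssassin tools.

```python
from math import log2

spam = {"S1", "S2", "S3", "S4", "S5"}
ham = {"H1", "H2", "H3", "H4", "H5"}
hits = {
    "RULE1": {"S1", "S2", "S3", "S4"},
    "RULE2": {"S1", "S2", "S3", "S4"},
    "RULE3": {"S4", "S5", "H5"},
    "RULE4": {"S5"},
}

def entropy(msgs):
    """Shannon entropy (bits) of the spam/ham label over `msgs`."""
    if not msgs:
        return 0.0
    p = len(msgs & spam) / len(msgs)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(rule_hits, msgs):
    """Entropy reduction from splitting `msgs` on whether the rule fired."""
    if not msgs:
        return 0.0
    hit, miss = msgs & rule_hits, msgs - rule_hits
    cond = (len(hit) * entropy(hit) + len(miss) * entropy(miss)) / len(msgs)
    return entropy(msgs) - cond

def residual(useful):
    """Messages not hit by any rule in `useful` (an assumed, hand-picked list)."""
    covered = set().union(*(hits[r] for r in useful))
    return (spam | ham) - covered

rest = residual(["RULE1", "RULE2"])
for rule in ("RULE3", "RULE4"):
    print(rule, round(info_gain(hits[rule], rest), 3))
```

On this toy data RULE4 scores higher than RULE3 over the residual set, since
RULE3's ham hit costs it purity there.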

> >     - (c) that RULE4 is better than RULE3 because it has a lower
> >       false-positive rate
> 
> > So -- statisticians?  any tips? ;)   (if anyone can fwd this on
> > to their resident stats guy, that would be appreciated, too.)
> 
> A google for 'cross entropy' and 'information gain' found this thread
> which looks to have a few citations.
> 
>   http://www.mail-archive.com/perl-ai@perl.org/msg00127.html
> 
> This question also makes me think very strongly of decision tree
> algorithms --- if a rule doesn't have a prominent place in the
> decision tree, it probably doesn't contribute much.

It certainly seems to be a hard problem, and not one really suited to
producing a single number we can use in the "freqs" report table.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBaiJ2MJF5cimLx9ARAmWAAJ98nFLYrqHzenvkCVyeVhxh7je74ACgtc7Q
UG7mC9J7ZNlSYtUPPuY5c3M=
=CaMh
-----END PGP SIGNATURE-----