Posted to commits@spamassassin.apache.org by co...@spamassassin.apache.org on 2004/10/08 05:57:14 UTC

[SpamAssassin Wiki] Updated: HitFrequencies

   Date: 2004-10-07T20:57:14
   Editor: JustinMason <jm...@jmason.org>
   Wiki: SpamAssassin Wiki
   Page: HitFrequencies
   URL: http://wiki.apache.org/spamassassin/HitFrequencies

   note bioinformatics parallel

Change Log:

------------------------------------------------------------------------------
@@ -35,12 +35,15 @@
 
 The first two lines list the number of messages in the corpora, and the percentage makeup of the corpus as ham vs. spam (so in this example, the corpus is 41.38% spam vs 58.61% ham).
 
+"freqs" is the best way to determine a rule's usefulness, since it immediately shows up any false-positive issues.  The development team run a nightly mass-check and freqs report from the rules in CVS to test them, with several people scanning their corpora, and the results are put up at: http://www.pathname.com/~corpus/ .
+
+= The S/O Ratio =
+
 S/O needs more explanation, as it's a key figure.  A rule with S/O 1.0 is very very accurate at hitting spam without hitting ham; a rule with S/O 0.0 hits only ham, no spam; but a rule with 0.5 hits both evenly (and is therefore pretty useless).
 
-A ''good'' rule has a very extreme S/O (near as possible to 1.0 or 0.0) and a high percentage of hits in the correct category.  In other words,  
-RCVD_IN_OPM_HTTP is a very good rule, because it hits 5.2028% of all spam mails without hitting any ham at all (no false positives).
+A ''good'' rule has a very extreme S/O (near as possible to 1.0 or 0.0) and a high percentage of hits in the correct category.  In other words,  RCVD_IN_OPM_HTTP is a very good rule in the example above, because it hits 5.2028% of all spam mails without hitting any ham at all (no false positives).
 
-"freqs" is the best way to determine a rule's usefulness, since it immediately shows up any false-positive issues.  The development team run a nightly mass-check and freqs report from the rules in CVS to test them, with several people scanning their corpora, and the results are put up at: http://www.pathname.com/~corpus/ .
+S/O stands for "spam / overall": the proportion of a rule's total hits that were spam messages.  As such, it corresponds to the Bayesian probability that a message is spam given that the rule hit, which is known as 'positive predictive value' in pattern discovery in bioinformatics.
 
 = Measuring Rule Overlap =
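
As a rough illustration of the S/O definition in the change above ("the proportion of the total hits that were spam messages"), here is a minimal Python sketch. It is not taken from the freqs tool: the function name so_ratio and the hit counts are hypothetical, and computing the ratio from raw hit counts is an assumption based on that wording.

```python
def so_ratio(spam_hits, ham_hits):
    """Return spam / overall for one rule, or None if the rule never hit.

    Hypothetical helper: assumes S/O is computed from raw hit counts,
    following the wording "proportion of the total hits that were spam".
    """
    total_hits = spam_hits + ham_hits
    if total_hits == 0:
        return None  # no hits at all, so the ratio is undefined
    return spam_hits / total_hits


# Hypothetical counts, chosen only to show the three cases in the text.
print(so_ratio(52, 0))   # hits only spam -> 1.0
print(so_ratio(0, 17))   # hits only ham  -> 0.0
print(so_ratio(30, 30))  # hits both evenly -> 0.5
```

A rule that hits spam and ham evenly lands at 0.5, which matches the description of such a rule as "pretty useless".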