Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2005/08/09 21:34:55 UTC

[Spamassassin Wiki] Update of "HitFrequencies" by JustinMason

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/HitFrequencies

------------------------------------------------------------------------------
- = Using hit-frequencies =
+ = hit-frequencies =
  
  Once you've run MassCheck, you have a "ham.log" and a "spam.log" file.  To turn those into a useful summary, you run "hit-frequencies" to generate a "freqs report".  Here's how -- run:
  
  {{{
+     ./hit-frequencies -x -p -a > freqs
-     make clean
-     make freqs
  }}}
  
+ (Add the -s switch to use alternate scoresets; for example {{{-s 3}}} will measure rule scores with scoreset 3.)
+ 
- That will take "ham.log" and "spam.log" and generate a "freqs" file from the data.  This gives you the frequencies that each rule hits and details of its accuracy in hitting spam vs. ham.  Its format looks like this:
+ That will take "ham.log" and "spam.log" and generate a "freqs" file from the data.  This gives you the frequencies that each rule hits and details of its accuracy in hitting spam vs. ham.
+ 
+ == The Format ==
+ 
+ HitFrequencies output looks like this:
  
  {{{
  OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
@@ -30, +35 @@

    * OVERALL%: the percentage of mail overall that the test hits
    * SPAM%: the percentage of spam mails hit by the rule
    * HAM%: the percentage of ham mails hit by the rule
-   * S/O: "spam over overall" -- the Bayesian probability that, when the rule fires, it hits on a spam message
+   * S/O: "spam over overall ratio" -- the Bayesian probability that, when the rule fires, it hits on a spam message
    * RANK: an artificial number indicating how "good" the rule is
    * SCORE: the score listed in the "../rules/50_scores.cf" file for that rule
    * NAME: the rule's name
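
As a rough illustration of how the three percentage columns relate to raw hit counts, here is a minimal Python sketch. It is not taken from the "hit-frequencies" script: the function name `freqs_row`, its parameters, and the sample counts are all hypothetical, and the arithmetic simply follows the column descriptions above.

```python
def freqs_row(spam_hits, ham_hits, total_spam, total_ham):
    """Return (overall_pct, spam_pct, ham_pct) for a single rule.

    Hypothetical helper: spam_hits/ham_hits are how many spam/ham
    messages the rule fired on; total_spam/total_ham are corpus sizes.
    """
    # OVERALL%: share of all mail (spam + ham) that the rule hits
    overall_pct = 100.0 * (spam_hits + ham_hits) / (total_spam + total_ham)
    # SPAM%: share of spam messages that the rule hits
    spam_pct = 100.0 * spam_hits / total_spam
    # HAM%: share of ham messages that the rule hits
    ham_pct = 100.0 * ham_hits / total_ham
    return overall_pct, spam_pct, ham_pct


if __name__ == "__main__":
    # Made-up counts, for illustration only
    overall, spam, ham = freqs_row(spam_hits=300, ham_hits=10,
                                   total_spam=1000, total_ham=1000)
    print(f"{overall:8.3f} {spam:8.3f} {ham:8.3f}")
```

With equal-sized corpora, as in the made-up numbers here, OVERALL% is just the average of SPAM% and HAM%; with unequal corpora it leans towards the larger one.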
@@ -39, +44 @@

  
  "freqs" is the best way to determine a rule's usefulness, since it immediately shows up any false-positive issues.  The development team run a nightly mass-check and freqs report from the rules in CVS to test them, with several people scanning their corpora, and the results are put up at: http://www.pathname.com/~corpus/ .
  
- = The S/O Ratio =
+ == The S/O Ratio ==
  
  S/O needs more explanation, as it's a key figure.  A rule with S/O 1.0 hits only spam and never ham; a rule with S/O 0.0 hits only ham, no spam; and a rule with S/O 0.5 hits both evenly (and is therefore pretty useless).
  
@@ -47, +52 @@

  
  S/O stands for "spam / overall", in other words, the proportion of the total hits that were spam messages.  As such, it is the probability that a message is spam given that the rule fired on it, the same quantity known as 'positive predictive value' in pattern discovery in bioinformatics.
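
The definition above can be sketched in a few lines of Python. This is only an illustration of "spam hits divided by total hits" as worded here; the function name `spam_over_overall` is hypothetical, and the actual "hit-frequencies" script may compute the figure differently (for instance, by weighting hits by corpus size).

```python
def spam_over_overall(spam_hits, ham_hits):
    """Return the fraction of a rule's hits that were spam, or None.

    Hypothetical helper following the wording of the text: S/O is
    spam hits divided by all hits (spam + ham).
    """
    total_hits = spam_hits + ham_hits
    if total_hits == 0:
        # The rule never fired, so the ratio is undefined
        return None
    return spam_hits / total_hits


if __name__ == "__main__":
    # Made-up counts matching the three cases described in the text
    print(spam_over_overall(50, 0))   # hits only spam
    print(spam_over_overall(0, 50))   # hits only ham
    print(spam_over_overall(25, 25))  # hits both evenly
```

Returning `None` for a rule that never fired is a design choice of this sketch, made to avoid a division by zero; it says nothing about how the real tool reports such rules.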
  
- = Measuring Rule Overlap =
+ == Measuring Rule Overlap ==
  
  There's one more tool, "overlap", which determines how much two rules overlap with each other.  This is occasionally useful if you suspect that two rules are redundant, checking the same data or hitting exactly the same messages as each other.  Take a look at the comments at the top of the "masses/overlap" script for details on how to run this against one or more "mass-check" output log files.
+ 
+ Alternatively, "hit-frequencies" has the {{{-o}}} switch to measure overlap.  Be warned, however, that this can be quite a bit slower and more RAM-hungry than running without it, as it then needs to track a lot more data internally.
+ 
  ----
  CategorySoftware