Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2005/08/09 21:34:55 UTC
[Spamassassin Wiki] Update of "HitFrequencies" by JustinMason
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/HitFrequencies
------------------------------------------------------------------------------
- = Using hit-frequencies =
+ = hit-frequencies =
Once you've run MassCheck, you have a "ham.log" and a "spam.log" file. To turn those into a useful summary, you run "hit-frequencies" to generate a "freqs report". Here's how -- run:
{{{
+ ./hit-frequencies -x -p -a > freqs
- make clean
- make freqs
}}}
+ (Add the -s switch to use alternate scoresets; for example {{{-s 3}}} will measure rule scores with scoreset 3.)
+
- That will take "ham.log" and "spam.log" and generate a "freqs" file from the data. This gives you the frequencies that each rule hits and details of its accuracy in hitting spam vs. ham. Its format looks like this:
+ That will take "ham.log" and "spam.log" and generate a "freqs" file from the data. This gives you the frequencies that each rule hits and details of its accuracy in hitting spam vs. ham.
+
+ == The Format ==
+
+ HitFrequencies output looks like this:
{{{
OVERALL% SPAM% HAM% S/O RANK SCORE NAME
@@ -30, +35 @@
* OVERALL%: the percentage of mail overall that the test hits
* SPAM%: the percentage of spam mails hit by the rule
* HAM%: the percentage of ham mails hit by the rule
- * S/O: "spam over overall" -- the Bayesian probability that, when the rule fires, it hits on a spam message
+ * S/O: "spam over overall ratio" -- the Bayesian probability that, when the rule fires, it hits on a spam message
* RANK: an artificial number indicating how "good" the rule is
* SCORE: the score listed in the "../rules/50_scores.cf" file for that rule
* NAME: the rule's name
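As a rough illustration of the columns listed above, here is a minimal Python sketch that reads one data line of a "freqs" file into named fields. It assumes each data line holds exactly seven whitespace-separated columns in the same order as the header; the `FreqsRow` and `parse_freqs_line` names, and the sample values, are made up for this example and are not part of hit-frequencies itself.

```python
from typing import NamedTuple, Optional


class FreqsRow(NamedTuple):
    """One row of a freqs report, in header order (assumed layout)."""
    overall_pct: float
    spam_pct: float
    ham_pct: float
    s_o: float
    rank: float
    score: float
    name: str


def parse_freqs_line(line: str) -> Optional[FreqsRow]:
    """Parse a single line; return None for headers and other non-data lines."""
    parts = line.split()
    if len(parts) != 7:
        return None
    try:
        numbers = [float(p) for p in parts[:6]]
    except ValueError:
        # e.g. the "OVERALL% SPAM% HAM% ..." header line
        return None
    return FreqsRow(*numbers, parts[6])


# Hypothetical sample line; the numbers are invented for illustration only.
row = parse_freqs_line("5.000 9.000 1.000 0.900 0.75 1.50 HYPOTHETICAL_RULE")
header = parse_freqs_line("OVERALL% SPAM% HAM% S/O RANK SCORE NAME")
```

If the real output contains extra columns or annotations, the length check above would need adjusting.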
@@ -39, +44 @@
"freqs" is the best way to determine a rule's usefulness, since it immediately shows up any false-positive issues. The development team run a nightly mass-check and freqs report from the rules in CVS to test them, with several people scanning their corpora, and the results are put up at: http://www.pathname.com/~corpus/ .
- = The S/O Ratio =
+ == The S/O Ratio ==
S/O needs more explanation, as it's a key figure. A rule with S/O 1.0 hits only spam and never ham; a rule with S/O 0.0 hits only ham and never spam; and a rule with S/O 0.5 hits both evenly (and is therefore pretty useless).
@@ -47, +52 @@
S/O stands for "spam / overall", in other words, the proportion of the total hits that were spam messages. As such, it is equivalent to Bayesian probability, or 'positive predictive value' in pattern discovery in bioinformatics.
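To make the "spam / overall" definition concrete, here is a small Python sketch that follows the wording above literally, using raw hit counts. The function name `s_o_ratio` is invented for this example, and the sketch does not claim to reproduce the exact arithmetic inside the hit-frequencies script.

```python
def s_o_ratio(spam_hits, ham_hits):
    """Return the fraction of a rule's hits that landed on spam messages.

    Returns None when the rule never fired, since the ratio is undefined.
    """
    total = spam_hits + ham_hits
    if total == 0:
        return None
    return spam_hits / total


# A rule that fired on 90 spam and 10 ham messages:
print(s_o_ratio(90, 10))  # 0.9
```

A rule that fires on spam and ham equally often comes out at 0.5, matching the "pretty useless" case described earlier.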
- = Measuring Rule Overlap =
+ == Measuring Rule Overlap ==
There's one more tool, "overlap", which determines how much two rules overlap with each other. This is occasionally useful if you suspect that two rules are redundant, checking the same data or hitting exactly the same messages as each other. Take a look at the comments at the top of the "masses/overlap" script for details on how to run this against one or more "mass-check" output log files.
+
+ Alternatively, "hit-frequencies" has the {{{-o}}} switch to measure overlap. Be warned, however, that this can be considerably slower and more RAM-hungry than running without it, as it then needs to track a lot more data internally.
+
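The idea of two rules "hitting exactly the same messages" can be sketched in Python as a set comparison. This is only an illustration of the concept, assuming you already have the set of message identifiers each rule hit; the `rule_overlap` function and the message IDs are hypothetical, and the real "overlap" script may define and report overlap differently.

```python
def rule_overlap(hits_a, hits_b):
    """Return (fraction of A's hits also hit by B, fraction of B's hits also hit by A)."""
    both = hits_a & hits_b
    frac_a = len(both) / len(hits_a) if hits_a else 0.0
    frac_b = len(both) / len(hits_b) if hits_b else 0.0
    return frac_a, frac_b


# Hypothetical message IDs hit by two rules.
rule_a_hits = {"msg1", "msg2", "msg3", "msg4"}
rule_b_hits = {"msg3", "msg4"}

fractions = rule_overlap(rule_a_hits, rule_b_hits)
```

Here every message hit by the second rule is also hit by the first, which is the kind of pattern that suggests one rule may be redundant.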
----
CategorySoftware