Posted to commits@spamassassin.apache.org by sp...@incubator.apache.org on 2004/05/07 04:45:14 UTC
[SpamAssassin Wiki] New: UsingOverlap
Date: 2004-05-06T19:45:14
Editor: JustinMason <jm...@jmason.org>
Wiki: SpamAssassin Wiki
Page: UsingOverlap
URL: http://wiki.apache.org/spamassassin/UsingOverlap
no comment
New Page:
= Using 'overlap' =
The SpamAssassin "masses" directory contains a tool called 'overlap', which measures how much the rules in the ruleset overlap with each other.
For example, let's say I have a log file in ''spam.log'', and want to examine how much the rules that start with ''T_DRUG'' overlap with each other. I run overlap like so:
{{{
./overlap spam.log > ov
pcregrep "\sT_DRUG.*,T_DRUG" ov | sort -r -n -k 2
}}}
In this case, that produces the following output:
{{{
87 1.000 0.690 T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP
87 1.000 0.084 T_DRUGS_SLEEP_EREC,T_DRUGS_ERECTILE
703 1.000 0.679 T_DRUGS_ERECTILE_OBFU,T_DRUGS_ERECTILE
328 1.000 0.715 T_DRUGS_ANXIETY_EREC,T_DRUGS_ANXIETY
328 1.000 0.317 T_DRUGS_ANXIETY_EREC,T_DRUGS_ERECTILE
315 1.000 0.887 T_DRUGS_DIET_EREC,T_DRUGS_DIET
315 1.000 0.304 T_DRUGS_DIET_EREC,T_DRUGS_ERECTILE
311 1.000 0.523 T_DRUGS_PAIN_EREC,T_DRUGS_PAIN
311 1.000 0.300 T_DRUGS_PAIN_EREC,T_DRUGS_ERECTILE
289 1.000 0.630 T_DRUGS_ANXIETY_OBFU,T_DRUGS_ANXIETY
}}}
The columns are as follows: the first number is how many mails hit both rules; the second is the fraction of the first rule's hits that also hit the second rule; the third is the fraction of the second rule's hits that also hit the first rule.
So in the case of this line:
{{{
87 1.000 0.690 T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP
}}}
87 mails hit both rules; all of the mails that hit T_DRUGS_SLEEP_EREC also hit T_DRUGS_SLEEP; and 69% of the mails that hit T_DRUGS_SLEEP also hit T_DRUGS_SLEEP_EREC.
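As a rough illustration of that reading, the following sketch uses awk to turn one line into a plain-English summary. It assumes the four whitespace-separated columns shown in the sample output; the variable names ''line'' and ''summary'' are made up for this example and are not part of the 'overlap' tool.

```shell
# Sample line copied from the output above; in practice, lines would come from the 'ov' file.
line='87 1.000 0.690 T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP'

# Split the rule pair on the comma and print each fraction as a percentage.
# Assumption: columns are count, fraction of first rule, fraction of second rule, rule pair.
summary=$(printf '%s\n' "$line" | awk '{
    split($4, rule, ",")
    printf "%d mails hit both; %.0f%% of %s hits also hit %s; %.0f%% of %s hits also hit %s\n",
        $1, $2 * 100, rule[1], rule[2], $3 * 100, rule[2], rule[1]
}')
echo "$summary"
```

The same awk program can be pointed at a whole file instead of a single line, provided every line in it has this four-column layout.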
'overlap' is most useful when you suspect that several rules are all hitting the same spam messages.
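One way to look for such rules is to keep only the pairs where both fractions are high, meaning each rule hits most of the mails the other one hits. This is a minimal sketch, not something the tool does itself: the 0.8 threshold is an arbitrary choice, the names ''sample'', ''min'' and ''redundant'' are hypothetical, and the data is three lines copied from the sample output.

```shell
# Three lines copied from the sample output above; in practice this would be the 'ov' file.
sample='87 1.000 0.690 T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP
315 1.000 0.887 T_DRUGS_DIET_EREC,T_DRUGS_DIET
315 1.000 0.304 T_DRUGS_DIET_EREC,T_DRUGS_ERECTILE'

# Keep only pairs where BOTH fractions reach the threshold (0.8 is an arbitrary cut-off).
redundant=$(printf '%s\n' "$sample" | awk -v min=0.8 '$2 >= min && $3 >= min { print $4 }')
echo "$redundant"
```

With this sample, only the T_DRUGS_DIET_EREC,T_DRUGS_DIET pair passes, since its fractions are 1.000 and 0.887. A pair with one high and one low fraction, such as T_DRUGS_DIET_EREC,T_DRUGS_ERECTILE, means the first rule is close to a subset of the second rather than a duplicate of it.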