Posted to commits@spamassassin.apache.org by sp...@incubator.apache.org on 2004/05/07 04:45:14 UTC

[SpamAssassin Wiki] New: UsingOverlap

   Date: 2004-05-06T19:45:14
   Editor: JustinMason <jm...@jmason.org>
   Wiki: SpamAssassin Wiki
   Page: UsingOverlap
   URL: http://wiki.apache.org/spamassassin/UsingOverlap

   no comment

New Page:

= Using 'overlap' =

In the SpamAssassin "masses" directory, there's a tool called 'overlap', which is used to determine how much the rules in the ruleset overlap with each other.

For example, let's say I have a log file in ''spam.log'', and want to examine how much the rules that start with ''T_DRUG'' overlap with each other.  I run overlap like so:
 
{{{
  ./overlap spam.log  > ov
  pcregrep "\sT_DRUG.*,T_DRUG" ov | sort -r -n -k 2
}}}

Which in this case produces this output:

{{{
87      1.000   0.690   T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP
87      1.000   0.084   T_DRUGS_SLEEP_EREC,T_DRUGS_ERECTILE
703     1.000   0.679   T_DRUGS_ERECTILE_OBFU,T_DRUGS_ERECTILE
328     1.000   0.715   T_DRUGS_ANXIETY_EREC,T_DRUGS_ANXIETY
328     1.000   0.317   T_DRUGS_ANXIETY_EREC,T_DRUGS_ERECTILE
315     1.000   0.887   T_DRUGS_DIET_EREC,T_DRUGS_DIET
315     1.000   0.304   T_DRUGS_DIET_EREC,T_DRUGS_ERECTILE
311     1.000   0.523   T_DRUGS_PAIN_EREC,T_DRUGS_PAIN
311     1.000   0.300   T_DRUGS_PAIN_EREC,T_DRUGS_ERECTILE
289     1.000   0.630   T_DRUGS_ANXIETY_OBFU,T_DRUGS_ANXIETY
}}}

Explanation of the columns: the first number is how many mails hit both rules; the second is the fraction of the first rule's hits that also hit the second rule; the third is the fraction of the second rule's hits that also hit the first. The last column names the two rules being compared.

So in the case of this line:

{{{
87      1.000   0.690   T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP
}}}

87 mails hit both rules; all of the mails that hit T_DRUGS_SLEEP_EREC also hit T_DRUGS_SLEEP; and 69% of the mails that hit T_DRUGS_SLEEP also hit T_DRUGS_SLEEP_EREC.
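The same arithmetic can be run backwards: dividing the shared count by each fraction gives a rough total hit count for each rule. The sketch below assumes the four-column layout shown above; ''estimate_totals'' is a hypothetical helper, not part of the 'overlap' tool, and because the fractions are rounded to three decimal places the results are only approximate.

```shell
# Hypothetical helper (not part of SpamAssassin): estimate each rule's total
# hit count from overlap output lines of the form
#   COUNT  FRAC1  FRAC2  RULE1,RULE2
# Totals are approximate because the fractions are rounded.
estimate_totals() {
  awk '$2 > 0 && $3 > 0 {
    split($4, rule, ",")             # rule[1] = first rule, rule[2] = second
    printf "%s %.0f\n", rule[1], $1 / $2
    printf "%s %.0f\n", rule[2], $1 / $3
  }'
}

# Feed it the example line discussed above.
printf '87      1.000   0.690   T_DRUGS_SLEEP_EREC,T_DRUGS_SLEEP\n' | estimate_totals
```

For the example line this suggests T_DRUGS_SLEEP_EREC hit about 87 mails in total and T_DRUGS_SLEEP about 126, which is consistent with 87 being 69% of the latter.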

Overlap is very useful if you suspect that several rules are all hitting the same spam messages.
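One way to look for such rules is to keep only the pairs where both fractions are high, meaning each rule hits most of the mail the other one hits. This is a minimal sketch, assuming the column layout shown above; ''mutual_overlap'' and its default threshold of 0.9 are illustrative choices, not something the 'overlap' tool provides.

```shell
# Hypothetical filter (not part of SpamAssassin): print only the overlap lines
# where BOTH fractions reach the given threshold (default 0.9, an arbitrary choice).
mutual_overlap() {
  awk -v min="${1:-0.9}" '$2 >= min && $3 >= min'
}

# Three lines taken from the sample output above, filtered at 0.85.
mutual_overlap 0.85 <<'EOF'
315     1.000   0.887   T_DRUGS_DIET_EREC,T_DRUGS_DIET
315     1.000   0.304   T_DRUGS_DIET_EREC,T_DRUGS_ERECTILE
311     1.000   0.523   T_DRUGS_PAIN_EREC,T_DRUGS_PAIN
EOF
```

With these three lines, only the T_DRUGS_DIET_EREC,T_DRUGS_DIET pair passes. To apply the filter to the full output saved earlier, something like ''mutual_overlap 0.9 < ov'' should work.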