Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2008/11/04 12:23:11 UTC
[Bug 6010] New: RuleQA: default corpus for QA measurements should ignore "high scoring" spam

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6010

           Summary: RuleQA: default corpus for QA measurements should ignore
                    "high scoring" spam
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: RuleQA
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org


Currently, our ruleqa measurements include details of how the rules
perform against _all_ mail in the spam corpora, including the stuff
that's hitting every single rule we have.  This means that great rules like:

    http://ruleqa.spamassassin.org/20081103-r710024-n/PQRTW_4/detail

are hidden.

This is clearly demonstrated by my two corpora, "jm" and "bb-jm".  "bb-jm"
is my high-scoring spam; on this corpus, PQRTW_4 hits only 0.251% of
spam.  But on my low-scoring spam ("jm"), it hits 13.6118%.
Overall it hits 0.7921% of spam, but as you can see it's really good
against the low-scoring stuff.

Now, that's where we _need_ good rules... so in my opinion we should fix
the ruleqa app to highlight those rules by default.  We don't need lots of
rules that hit the spam we're already catching.

I suggest extending the ruleqa scripts to track a new subset of logs,
alongside the current set, containing only the mass-check lines under some
score threshold (10 points?). The output would look something like this:


  set 0, low-scoring spam

  MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME 
  0.00000  10.6118   0.0000   1.000    0.86    1.00  PQRTW_4  

  set 0, in aggregate

  MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME 
  0.00000   0.7921   0.0000   1.000    0.86    1.00  PQRTW_4  

  set 0, broken down by message age in weeks

  MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME WHO/AGE
  0.00000   0.3335   0.0000   1.000    0.66    1.00  PQRTW_4 0-1 
  0.00000   0.2991   0.0000   1.000    0.63    1.00  PQRTW_4 1-2 
  0.00000   0.0000   0.0000   0.500    0.45    1.00  PQRTW_4 2-3 
  0.00000   0.7855   0.0000   1.000    0.80    1.00  PQRTW_4 3-6 

  set 0, broken down by contributor
  [etc.]


This should be easy enough to do.
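
As a rough sketch of the filtering step (not the actual ruleqa code), the
Python below computes SPAM% per rule for both the full set and the
low-scoring subset. It assumes each mass-check log line has
whitespace-separated fields in the order: result flag, score, message path,
comma-separated rule names, then optional key=value metadata. The function
names and the threshold constant are hypothetical.

```python
from collections import Counter

# Hypothetical cutoff, taken from the "10 points?" suggestion above.
SCORE_THRESHOLD = 10.0


def parse_line(line):
    """Return (score, set_of_rules) for one log line, or None if unparseable.

    Assumed layout: <flag> <score> <path> <RULE_A,RULE_B,...> [key=value,...]
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blank lines and comments
    fields = line.split()
    if len(fields) < 3:
        return None
    try:
        score = float(fields[1])
    except ValueError:
        return None
    rules = set()
    # A fourth field containing "=" is treated as metadata, not rule names.
    if len(fields) > 3 and "=" not in fields[3]:
        rules = {name for name in fields[3].split(",") if name}
    return score, rules


def hit_rates(lines, threshold=None):
    """Return {rule: percent of messages hit}.

    If threshold is given, only messages scoring below it are counted,
    which gives the "low-scoring" subset.
    """
    total = 0
    hits = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        score, rules = parsed
        if threshold is not None and score >= threshold:
            continue
        total += 1
        hits.update(rules)
    if total == 0:
        return {}
    return {rule: 100.0 * count / total for rule, count in hits.items()}
```

Calling hit_rates(spam_lines) would give the aggregate figures, and
hit_rates(spam_lines, threshold=SCORE_THRESHOLD) the low-scoring ones, so
both tables could be generated from the same spam log in one run.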

I don't think this needs to dictate the promotion criteria; rules like this
would still be promoted, since their SPAM% is over the very low threshold
(what is it? 0.1%? I can't recall).

