Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2005/08/13 23:44:11 UTC

[Spamassassin Wiki] Update of "RulesProjStreamlining" by JustinMason

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

The comment on the change is:
updating

------------------------------------------------------------------------------
  
  First off, the sandboxes idea greatly increases the number of people who can check rules into SVN.  Secondly, the barriers to entry for getting a sandbox account are much lower.
  
+ = Rule Promotion =
- Some bulletpoints from discussion, needs expanding:
- 
- sandbox:
  
    * each user gets their own sandbox as discussed on RulesProjMoreInput
    * checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
@@ -26, +24 @@

    * S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
    * > 0.25% of target type hit (e.g. spam for non-nice rules)
   * < 1.00% of non-target type hit (e.g. ham for non-nice rules); see the worked sketch below this list
-   * not too slow ;)
-   * TODO: criteria for overlap with existing rules? BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules, was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (ie: keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is: perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out
  
+ Future criteria:
- A ruleset in the "extra" set would have different criteria.
-  * DanielQuinlan suggested: The second, a collection that do not qualify for rules/core.  For example, SpamAssassin intentionally doesn't filter virus bounces (yet, at least), but there is a good virus bounce ruleset out there.
-  * BobMenschel: Similarly, an "extra" rules set might include rules that positively identify spam from spamware, but hit <0.25% of spam. Or an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham.
-  * ChrisSanterre: Seeing this breakdown of dirs, gave me an idea. Why not set the "aggresiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggresive (with the knowledge that more aggresive *could* possibly cause more FPs). 
  
- We can also vote for extraordinary stuff that doesn't fit into those criteria...
+   * not too slow ;)  (TODO: we need an automated way to measure rule speed)
+   * TODO: criteria for overlap with existing rules? See 'Overlap Criteria' below.
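+ 
+ As a worked illustration of the numeric criteria above, here is a minimal Perl sketch. The hit counts are invented and the script is hypothetical (it is not part of masses/ or anything in SVN); it only shows the arithmetic behind the thresholds.
+ 
+ {{{
+ #!/usr/bin/perl -w
+ # Hypothetical sketch: check one rule's mass-check counts against
+ # the promotion criteria for non-nice rules (for nice rules, the
+ # S/O test inverts to 0.05 or less).
+ use strict;
+ 
+ my ($spam_hits, $ham_hits)   = (312, 2);            # messages hit
+ my ($spam_total, $ham_total) = (100_000, 50_000);   # corpus sizes
+ 
+ my $spam_pct = 100 * $spam_hits / $spam_total;   # % of spam hit
+ my $ham_pct  = 100 * $ham_hits  / $ham_total;    # % of ham hit
+ my $so       = $spam_hits / ($spam_hits + $ham_hits);  # S/O ratio
+ 
+ my $ok = ($so >= 0.95)        # S/O ratio of 0.95 or greater
+       && ($spam_pct > 0.25)   # > 0.25% of target type hit
+       && ($ham_pct < 1.00);   # < 1.00% of non-target type hit
+ 
+ printf "S/O=%.3f spam%%=%.3f ham%%=%.3f => %s\n",
+        $so, $spam_pct, $ham_pct, $ok ? "promotable" : "not yet";
+ }}}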
  
- private list for mass-checks:
+ We can also vote for rules that don't pass those criteria but which we think should go into core anyway.
  
+ A ruleset in the "extra" set would have different criteria; e.g.
-   * archives delayed 1 month?
-   * moderated signups
-   * automated mass-checks of attachments in specific file format
-   * rules considered suitable for use are checked into the "sandbox" area for a quick nightly-mass-check, for release
  
+  * the virus bounce ruleset
+  * rules that positively identify spam from spamware, but hit <0.25% of spam
+  * an "aggressive" ruleset might include rules that hit with an S/O of only 0.89 but push a lot of spam over the 5.0 threshold without significantly impacting ham
+ 
+ (ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).
+ 
+ JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable.  For example, aggressiveness is not the only criterion for choosing which rulesets to use; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)
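+ 
+ (To make the 'load by name' idea concrete: a user would opt in from their own config file. The filenames below are invented for illustration and no such rulesets ship today, but 'include' is an existing SpamAssassin config directive.)
+ 
+ {{{
+ # hypothetical local.cf fragment: the filenames are invented,
+ # but 'include' is a real SpamAssassin config directive
+ include extra/70_virus_bounce.cf
+ include extra/70_aggressive.cf
+ }}}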
+ 
+ == Overlap Criteria ==
+ 
+ BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:
+ 
+ {{{
+ perl ./overlap ../rules/tested/$testfile.ham.log \
+      ../rules/tested/$testfile.spam.log \
+   | grep -v mid= \
+   | awk 'NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print }' \
+   > ../rules/tested/$testfile.overlap.out
+ }}}
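+ 
+ (Reading the filter, and assuming columns 2 and 3 of the overlap output hold the two directional overlap fractions: the awk keeps the header row plus pairs where one rule hit 100% of the messages the other hit, and the reverse direction covered at least 50%. That matches the explanation below.)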
+ 
+ DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'
+ 
+ BobMenschel: 'By "throw away anything where the overlap is less than 50%" I
+ meant to discard (exclude from the final file) anything where the overlap was
+ (IMO) insignificant.
+ This would leave those overlaps where RULE_A hit all the emails that
+ RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
+ of the emails that RULE_A hit.'
+ 
+ JustinMason: Like Daniel, I'm confused here.  As far as I can see, you want to
+ keep the rules that do NOT have a high degree of overlap with other rules, and
+ throw out the rules that do (because they're redundant).  In other words, you
+ want to throw away rules when the mutual overlap is greater than some high
+ value (like 95%, at a guess).
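+ 
+ To make the two directions being argued about concrete, here is a minimal hypothetical sketch (the hit sets are invented; this is not the masses/overlap script itself):
+ 
+ {{{
+ #!/usr/bin/perl -w
+ # Hypothetical sketch of the two directional overlap fractions
+ # under discussion; this is not the masses/overlap script.
+ use strict;
+ 
+ my %hits_a = map { $_ => 1 } qw(m1 m2 m3 m4);  # mails RULE_A hit
+ my %hits_b = map { $_ => 1 } qw(m1 m2);        # mails RULE_B hit
+ 
+ my $both = grep { $hits_b{$_} } keys %hits_a;  # mails hit by both
+ 
+ my $a_covers_b = $both / keys %hits_b;  # 1.000: A hit all of B's mails
+ my $b_covers_a = $both / keys %hits_a;  # 0.500: B hit half of A's mails
+ 
+ printf "A covers B: %.3f   B covers A: %.3f\n",
+        $a_covers_b, $b_covers_a;
+ # Under the awk filter above ($2 == 1.000 && $3 >= 0.500), this
+ # pair would be kept for manual review as a "meaningful" overlap.
+ }}}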
+