Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/02/12 18:45:17 UTC
Re: Find the Ham: A Prototype Config
Have you looked into decision trees? This sounds a bit like that...
http://en.wikipedia.org/wiki/Decision_tree_learning
--j.
Dan writes:
> Yesterday I described an unorthodox approach to email filtering and
> generated both interest and confusion. Hopefully by describing it
> further, I can create understanding. Below is my design and at the
> bottom a question, but first, a summary of points:
>
>
> 1) I created confusion by starting with a big picture description.
> Find the Ham does NOT rely on rules that target ham, it relies ONLY
> on rules that find spam. It applies spam fighting rules we already
> know and love, just in a new way.
>
> 2) This configuration sits between your rules and your MTA, REPLACING
> what weights currently do. It's more accurate than weighted
> configurations, so this works out.
>
> 3) When well tuned, FP rates lower than 1 in 10,000 (0.01%) are easy,
> and captured messages can be organized by likelihood: the more rules
> a message hits, the less likely it is to be ham. Finding the few ham
> among the few spam requires less work than deleting the uncaught spam
> many weighted approaches currently leave in a user's inbox. FP
> reviewing remains the same: isolate and examine the "least likely to
> be spam" captures.
>
> 4) Whitelist entries slow accuracy training but can STILL be used.
>
>
>
> Here is Find the Ham's architecture, built inside SA. It appears
> more complicated than it is because of the brute force engineering
> needed to work in a system that wasn't designed to support it. This
> is essentially a hack of the current weighting/meta systems but is
> NOT a weighted approach. For clarity, I have replaced 'weight' with
> 'value.' Here are the parts:
>
>
> 1) Give every rule a value of 1, regardless of how 'strong' or 'weak'
> it is. 1 is easy to count and requires no score lines in SA. The
> point here is not to assign a weight, but rather to create a profile. A
> message's profile is determined by which specific rules (by name!) a
> message fails (or should not fail), NOT the weight of those rules -
> no more plus and minus.
>
>
>
> 2) HAM: Messages with a value of 0 or 1 get through, so any message
> hitting no rules or only one rule always counts as ham. Rules good
> enough to never capture ham by themselves should be handled separately.
>
> RESULT1 = values of 0-1 get through
>
>
>
> 3) SPAM: Rule hits that are NOT covered by a meta (more below) add up
> to something above 1 point (1+1+1+1+1+1+1 = 7). This catches
> behavior patterns shown by spammers that have NOT also been shown to
> be ham. And since all such patterns are assumed to be spam until
> proven otherwise, you catch millions of permutations without having
> to map any of them:
>
> RESULT2 = values of 2-999 get caught
>
>
>
> 4a) HAM: Exceptions = 1000 points. This value is arbitrary as any
> will do, but 1000 is easy to work with and provides scalability for
> the number of rules hitting on the same message. So for example, a
> ham hits on rules A, B, and C and an exception is needed to prevent
> all future messages that hit only the same 3 rules from being caught
> again. First make entries that cover every combination of 2 and 3
> hits among the three rules. To cover every permutation, every hit
> count down to 2 must be entered, so 3 hits requires 3,2; 4 hits
> requires 4,3,2; etc. (the checksum names have no meaning and simply
> ensure that each name is unique), like so:
>
>
> meta __M002_68BEF298 (ruleA + ruleB + ruleC == 2)
> meta __M003_68BEF298 (ruleA + ruleB + ruleC == 3)
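
The two meta lines above follow a fixed pattern, so they can be generated rather than typed by hand. The Python sketch below is only an illustration and is not part of the original config: the function name exception_metas and its tag argument are hypothetical, and the tag is passed in because the post says the checksum has no meaning beyond making each name unique.

```python
def exception_metas(rules, tag):
    """Return one SpamAssassin meta line per hit count, from 2 up to len(rules).

    rules: names of the rules the ham message hit (e.g. ruleA, ruleB, ruleC)
    tag:   any unique string; the post uses a checksum-like value
    """
    expr = " + ".join(rules)
    return [
        f"meta __M{k:03d}_{tag} ({expr} == {k})"
        for k in range(2, len(rules) + 1)
    ]


lines = exception_metas(["ruleA", "ruleB", "ruleC"], "68BEF298")
print("\n".join(lines))
```

For a four-rule exception the same call would emit three lines (== 2, == 3, == 4), matching the "4 hits requires 4,3,2" description.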
>
>
>
> 4b) Then add each combination's name to a master meta entry so a
> given exception is counted only once. The point here is to prevent
> multiple 2's, 3's, 4's etc from throwing off the final count. Each
> entry also includes ! so a given lower value meta only fires when a
> higher value meta has not already fired. All this mutual exclusivity
> is key to getting an accurate score:
>
>
> meta META_002 (__M002_B2BF879D || __M002_300DB130 || __M002_1CDAD8FA
> || __M002_188F3E40 || __M002_F9594F8D || __M002_185A6A58 ||
> __M002_68BEF298) && !(META_003 || META_004 || META_005 || META_006 ||
> META_007 || META_008 || META_009 || META_010 || META_011 || META_012
> || META_013 || META_014)
>
> meta META_003 (__M003_B2BF879D || __M003_300DB130 || __M003_1CDAD8FA
> || __M003_188F3E40 || __M003_F9594F8D || __M003_185A6A58 ||
> __M003_68BEF298) && !(META_004 || META_005 || META_006 || META_007 ||
> META_008 || META_009 || META_010 || META_011 || META_012 || META_013
> || META_014)
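
The master entries above are also mechanical: OR together every sub-meta for one hit count, then exclude every higher master. A hedged Python sketch follows. The name master_meta is hypothetical, and the default ceiling of 14 is an assumption taken from the highest META_014 shown in the post.

```python
def master_meta(k, tags, max_k=14):
    """Build the master meta line for hit count k.

    tags:  the unique suffixes of all exceptions that have a sub-meta for k
    max_k: highest hit count that has a master meta (assumed 14, as in the post)
    """
    fired = " || ".join(f"__M{k:03d}_{t}" for t in tags)
    higher = " || ".join(f"META_{j:03d}" for j in range(k + 1, max_k + 1))
    line = f"meta META_{k:03d} ({fired})"
    if higher:
        # Only fire when no master for a larger hit count has fired.
        line += f" && !({higher})"
    return line


print(master_meta(2, ["B2BF879D", "68BEF298"], max_k=4))
```

The top-level master (k equal to max_k) has nothing above it, so the sketch omits the negated clause there.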
>
>
>
> 5) The master meta then adds a precise cancellation value to the
> existing total (997 + 1 + 1 + 1 = 1000):
>
> score META_002 998
> score META_003 997
> score META_004 996
> score META_005 995
> score META_006 994
>
>
> RESULT3 = values of 1000 get through
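
The score lines above all satisfy one relation: the master meta for k hits scores 1000 - k, so k hits plus the master always total 1000. A small Python sketch of that arithmetic, with BASE and score_lines as hypothetical names:

```python
BASE = 1000  # the arbitrary exception value chosen in step 4a


def score_lines(max_k):
    """Return 'score META_00k <value>' lines for k = 2..max_k."""
    return [f"score META_{k:03d} {BASE - k}" for k in range(2, max_k + 1)]


for line in score_lines(6):
    print(line)
```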
>
>
>
> 6) When a multitude of rules hit and only some have meta exceptions
> (a spam or new FP), the score is bumped above 1000:
>
> score=1003.0 tests=ruleA,ruleB,ruleC,ruleG,ruleH,ruleI,META_003
>
> Such that G, H, & I, in this example, add 3 to what would have been
> 1000 (997 + 1 + 1 + 1 + 1 + 1 + 1 = 1003), catching the message.
>
> RESULT4 = values of 1001+ get caught
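
Steps 5 and 6 can be checked with a short simulation. The Python sketch below is a simplified model of the scheme as described, not SpamAssassin itself. It assumes each rule hit adds 1, each exception is a set of rule names, and only the master meta for the largest overlap fires (the mutual exclusion from step 4b). The name message_value is hypothetical.

```python
BASE = 1000  # exception value from step 4a


def message_value(hits, exceptions):
    """Return the total value for a message under the Find the Ham scheme.

    hits:       names of the rules the message hit
    exceptions: list of rule-name collections, one per known-ham exception
    """
    hits = set(hits)
    total = len(hits)  # every rule hit is worth exactly 1
    # Largest number of hits falling inside any single exception.
    best = max((len(hits & set(e)) for e in exceptions), default=0)
    if best >= 2:
        # META_<best> fires and adds its cancellation value.
        total += BASE - best
    return total


abc = [["ruleA", "ruleB", "ruleC"]]
print(message_value(["ruleA", "ruleB", "ruleC"], abc))
print(message_value(["ruleA", "ruleB", "ruleC", "ruleG", "ruleH", "ruleI"], abc))
```

The first call models the ham that prompted the exception, the second the six-hit message from step 6.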
>
>
>
>
> Here are actual header entries:
>
>
> HAM with 1 rule hit:
> X-Spam-Assassin-DANS: score=1.0 tests=440_GenericHelo
>
>
> SPAM with 7 rule hits, none of which are in a meta:
> X-Spam-Assassin-DANS: score=7.0 tests=100_NoTo,
> 110_Sniffer_53,419_Four19_Multi,440_GenericHelo,
> 714_MSGidPack2,714_MissingTo
>
>
> HAM with 4 rule hits, all of which are in a meta:
> X-Spam-Assassin-DANS: score=1000.0 tests=310_ClickHereAA,
> 714_MimePack,716_HtmlFontPack,716_MostlyHtml,META_004
>
>
> SPAM with 7 rule hits, 2 of which are in a meta:
> X-Spam-Assassin-DANS: score=1005.0
> tests=105_MailFrom_AORMX,110_Sniffer_61,130_NJABL_DUL,
> 130_SORBS_DUL,230_WestEurope,440_ImageCode,
> 714_EXTRA_MPART_TYPE,META_002
>
>
>
> Here's a value summary:
>
> HAM = values of 0-1 get through
> SPAM = values of 2-999 get caught
> HAM = values of 1000 get through
> SPAM = values of 1001+ get caught
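
The four bands in the summary above reduce to a small decision function. This Python sketch shows one way the final value could be mapped to a verdict. The name classify is hypothetical, and it takes a number because the headers above report scores such as 1000.0.

```python
def classify(value):
    """Map a Find the Ham value to 'ham' or 'spam' using the four bands."""
    if value <= 1:
        return "ham"   # 0-1: no rule or a single rule hit
    if value < 1000:
        return "spam"  # 2-999: hits not cancelled by an exception
    if value == 1000:
        return "ham"   # exactly 1000: every hit covered by an exception
    return "spam"      # 1001+: exception fired but extra rules also hit


for score in (1.0, 7.0, 1000.0, 1005.0):
    print(score, classify(score))
```

The four sample scores are the ones from the header entries quoted earlier in the post.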
>
>
>
> The only obstacle I haven't yet resolved is how to get the four value
> results to an MTA. So once I've explained this well enough, the
> question is: how do we add this functionality to SA in a way that is
> easy to implement and use?
>
>
> Dan