Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/02/12 18:45:17 UTC
Re: Find the Ham: A Prototype Config

Have you looked into decision trees?  This sounds a bit like that...

  http://en.wikipedia.org/wiki/Decision_tree_learning

--j.

Dan writes:
> Yesterday I described an unorthodox approach to email filtering and  
> generated both interest and confusion.  Hopefully by describing it  
> further, I can create understanding.  Below is my design and at the  
> bottom a question, but first, a summary of points:
> 
> 
> 1) I created confusion by starting with a big picture description.
> Find the Ham does NOT rely on rules that target ham; it relies ONLY
> on rules that find spam.  It applies the spam fighting rules we
> already know and love, just in a new way.
> 
> 2) This configuration sits between your rules and MTA, REPLACING what
> weights currently do.  It's more accurate than weighted
> configurations, so this works out.
> 
> 3) When well tuned, FP rates lower than 1 in 10,000 (0.01%) are easy
> to reach, and captured messages can be organized by likelihood:  the
> more rules a message hits, the less likely it is to be ham.  Finding
> the few ham among the few spam requires less work than deleting the
> uncaught spam many weighted approaches currently leave in a user's
> inbox.  FP reviewing remains the same:  isolate and examine the
> "least likely to be spam" captures.
> 
> 4) Whitelist entries slow accuracy training but can STILL be used.
> 
> 
> 
> Here is Find the Ham's architecture, built inside SA.  It appears  
> more complicated than it is because of the brute force engineering  
> needed to work in a system that wasn't designed to support it.  This  
> is essentially a hack of the current weighting/meta systems but is  
> NOT a weighted approach.  For clarity, I have replaced 'weight' with  
> 'value.'  Here are the parts:
> 
> 
> 1) Give every rule a value of 1, regardless of how 'strong' or 'weak'
> it is.  1 is easy to count and requires no score lines in SA.  The
> point here is not to assign a weight, but rather to create a profile.
> A message's profile is determined by which specific rules (by name!)
> a message fails (or should not fail), NOT the weight of those rules -
> no more plus and minus.
> 
> 
> 
> 2) HAM:  Messages with a value of 0 or 1 get through, so any message
> hitting no rules or any single rule always counts as ham.  Rules good
> enough to never capture ham by themselves should be handled separately.
> 
> 	RESULT1  =  values of 0-1 get through
> 
> 
> 
> 3) SPAM: Rules that DON'T hit in combination with a meta (more below)  
> add up to something above 1 point (1+1+1+1+1+1+1 =  7).  This catches  
> behavior patterns shown by spammers that have NOT also been shown to  
> be ham.  And since all such patterns are assumed to be spam until  
> proven otherwise, you catch millions of permutations without having  
> to map any of them:
> 
> 	RESULT2  =  values of 2-999 get caught
> 
> 
> 
> 4a) HAM: Exceptions = 1000 points.  This value is arbitrary, as any
> will do, but 1000 is easy to work with and leaves room for the number
> of rules hitting on the same message.  So for example, a ham hits on
> rules A, B, and C, and an exception is needed to prevent all future
> messages that hit only the same 3 rules from being caught again.
> First make entries that cover every combination of 2 and 3 hits
> among those three rules.  To cover every permutation, every hit count
> down to 2 must be entered, so 3 rules require entries for 3 and 2,
> 4 rules require 4, 3, and 2, etc. (the checksum names have no meaning
> and simply ensure that each name is unique), like so:
> 
> 
> 	meta __M002_68BEF298 (ruleA + ruleB + ruleC == 2)
> 	meta __M003_68BEF298 (ruleA + ruleB + ruleC == 3)
> 
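A minimal Python sketch of step 4a, assuming one entry per hit count
from 2 up to the number of rules in the exception.  The function name
and the CRC32-based suffix are hypothetical; the suffix only stands in
for the checksum names, which the post says carry no meaning beyond
being unique.

```python
import zlib


def exception_metas(rules):
    """Return one SpamAssassin-style meta line per hit count, from 2 to len(rules)."""
    rules = sorted(rules)
    # Hypothetical naming scheme: CRC32 of the rule names, used only for uniqueness.
    tag = format(zlib.crc32("+".join(rules).encode("ascii")), "08X")
    total = " + ".join(rules)
    return [
        f"meta __M{count:03d}_{tag} ({total} == {count})"
        for count in range(2, len(rules) + 1)
    ]


lines = exception_metas(["ruleA", "ruleB", "ruleC"])
for line in lines:
    print(line)
```

For three rules this yields two entries (== 2 and == 3); for four
rules it would yield three.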
> 
> 
> 4b) Then add each combination's name to a master meta entry so a
> given exception is counted only once.  The point here is to prevent
> multiple 2's, 3's, 4's etc. from throwing off the final count.  Each
> entry also includes a ! clause so a given lower value meta only fires
> when a higher value meta has not already fired.  All this mutual
> exclusivity is key to getting an accurate score:
> 
> 
> meta META_002 (__M002_B2BF879D || __M002_300DB130 || __M002_1CDAD8FA  
> || __M002_188F3E40 || __M002_F9594F8D || __M002_185A6A58 ||  
> __M002_68BEF298) && !(META_003 || META_004 || META_005 || META_006 ||  
> META_007 || META_008 || META_009 || META_010 || META_011 || META_012  
> || META_013 || META_014)
> 
> meta META_003 (__M003_B2BF879D || __M003_300DB130 || __M003_1CDAD8FA  
> || __M003_188F3E40 || __M003_F9594F8D || __M003_185A6A58 ||  
> __M003_68BEF298) && !(META_004 || META_005 || META_006 || META_007 ||  
> META_008 || META_009 || META_010 || META_011 || META_012 || META_013  
> || META_014)
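
A minimal Python sketch of how the master entries in step 4b could be
generated, assuming the naming pattern shown above (META_NNN).  The
function name and its parameters are hypothetical, and the example
uses a shorter list than the real configuration.

```python
def master_meta(count, sub_metas, max_count):
    """Build the master meta line for messages matching `count` excepted rules.

    sub_metas: names of the per-exception metas for this hit count.
    max_count: highest hit count that has a master meta (14 in the post).
    """
    name = f"META_{count:03d}"
    line = f"meta {name} (" + " || ".join(sub_metas) + ")"
    higher = [f"META_{c:03d}" for c in range(count + 1, max_count + 1)]
    if higher:
        # Suppress this meta whenever a higher-count master meta has fired.
        line += " && !(" + " || ".join(higher) + ")"
    return line


line = master_meta(3, ["__M003_B2BF879D", "__M003_68BEF298"], 5)
print(line)
```

With max_count set to 5 the META_003 entry excludes only META_004 and
META_005; the highest master meta gets no exclusion clause at all.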
> 
> 
> 
> 5) The master meta then 'scores' a precise cancellation value on top
> of the existing total, so that an exact match comes to 1000
> (997 + 1 + 1 + 1 = 1000):
> 
> score META_002 998
> score META_003 997
> score META_004 996
> score META_005 995
> score META_006 994
> 
> 
> 	RESULT3  =  values of 1000 get through
> 
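The score values in step 5 follow one pattern: the master meta for N
hits scores 1000 minus N.  A minimal Python sketch of that arithmetic
(the constant and function names are hypothetical):

```python
EXCEPTION_TOTAL = 1000  # the arbitrary exception value chosen in step 4a


def master_scores(max_count):
    """Map each master meta name to the score that tops its hits up to 1000."""
    return {
        f"META_{count:03d}": EXCEPTION_TOTAL - count
        for count in range(2, max_count + 1)
    }


scores = master_scores(6)
for name, value in scores.items():
    print(f"score {name} {value}")
```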
> 
> 
> 6) When a multitude of rules hit and only some have meta exceptions
> (a spam or new FP), the score is bumped above 1000:
> 
> 	score=1003.0 tests=ruleA,ruleB,ruleC,ruleG,ruleH,ruleI,META_003
> 
> In this example G, H, and I add 3 to what would have been 1000
> (997 + 1 + 1 + 1 + 1 + 1 + 1 = 1003), catching the message.
> 
> 	RESULT4  =  values of 1001+ get caught
> 
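A minimal Python sketch of how the total in step 6 comes about,
assuming every rule is worth 1 and only the highest-count master meta
fires (the mutual exclusivity from step 4b).  The function name and
the way exceptions are passed in are hypothetical.

```python
def total_score(hits, exceptions):
    """Return the value a message would get under the scheme described above.

    hits: names of the rules the message hit.
    exceptions: iterable of rule-name groups that have meta exceptions.
    """
    hits = set(hits)
    # Largest number of hit rules covered by any single exception.
    best = max((len(hits & set(group)) for group in exceptions), default=0)
    score = len(hits)  # every rule counts as 1
    if best >= 2:
        # Only the master meta for the highest count fires, scoring 1000 - best.
        score += 1000 - best
    return score


exceptions = [("ruleA", "ruleB", "ruleC")]
exact = total_score(["ruleA", "ruleB", "ruleC"], exceptions)
extra = total_score(
    ["ruleA", "ruleB", "ruleC", "ruleG", "ruleH", "ruleI"], exceptions
)
print(exact, extra)
```

The first message matches the exception exactly and lands on 1000;
the second carries three extra hits and lands on 1003.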
> 
> 
> 
> Here are actual header entries:
> 
> 
> HAM with 1 rule hit:
> 	X-Spam-Assassin-DANS: score=1.0 tests=440_GenericHelo
> 
> 
> SPAM with 7 rule hits, none of which are in a meta:
> 	X-Spam-Assassin-DANS: score=7.0 tests=100_NoTo,
> 	110_Sniffer_53,419_Four19_Multi,440_GenericHelo,
> 	714_MSGidPack2,714_MissingTo
> 
> 
> HAM with 4 rule hits, all of which are in a meta:
> 	X-Spam-Assassin-DANS: score=1000.0 tests=310_ClickHereAA,
> 	714_MimePack,716_HtmlFontPack,716_MostlyHtml,META_004
> 
> 
> SPAM with 7 rule hits, 2 of which are in a meta (998 + 7 = 1005):
> 	X-Spam-Assassin-DANS: score=1005.0
> 	tests=105_MailFrom_AORMX,110_Sniffer_61,130_NJABL_DUL,
> 	130_SORBS_DUL,230_WestEurope,440_ImageCode,
> 	714_EXTRA_MPART_TYPE,META_002
> 
> 
> 
> Here's a value summary:
> 
> 	HAM  =  values of 0-1 get through
> 	SPAM  =  values of 2-999 get caught
> 	HAM  =  values of 1000 get through
> 	SPAM  =  values of 1001+ get caught
> 
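The four value bands in the summary can be written as a single
decision function.  This is a minimal Python sketch; the function
name and the "ham"/"spam" return strings are hypothetical, and it
assumes the 1000-point exception value used throughout the post.

```python
def verdict(score):
    """Map a Find the Ham value to 'ham' or 'spam' using the four bands."""
    if score <= 1:
        return "ham"   # 0-1: no rule or a single rule hit
    if score < 1000:
        return "spam"  # 2-999: several hits, no matching exception
    if score == 1000:
        return "ham"   # exactly 1000: hits fully covered by an exception
    return "spam"      # 1001+: exception matched, but extra rules hit


for value in (1.0, 7.0, 1000.0, 1005.0):
    print(value, verdict(value))
```

The four values in the loop are the scores from the header examples
above.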
> 
> 
> The only obstacle I haven't yet resolved is how to get the 4 value
> results to an MTA.  So once I've explained this well enough, the
> question is, how do we add this functionality to SA in a way that is
> easy to implement and use?
> 
> 
> Dan