Posted to dev@spamassassin.apache.org by Matt Hampton <ma...@coders.co.uk> on 2008/06/20 09:32:05 UTC

Creating auto-generated rule sets.....

Justin Mason wrote:

> Well, it'd be worth cc'ing the dev list, if that's ok.   With any luck
> there'll be future people trying similar stuff and it'll be handy to have
> a thread URL to point at ;)

Quick intro - I have been working on automatically generating rules from
the Sane Security ClamAV signatures.  With a fair bit of help from
Justin I have something up and running, so I wanted to share what I have
done so far and get some feedback.

I have a small Perl script that extracts the signatures from the scam.ndb
and phish.ndb files and generates 2 mammoth rulesets (60000 rules!).
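
A rough sketch of what the extraction step might look like. This is not the
actual Perl script; it is written in Python purely for illustration. ClamAV
.ndb lines have the form MalwareName:TargetType:Offset:HexSignature. The
sketch assumes the rule name is "SANE_" plus the MD5 of the hex signature
(the post does not say what is hashed), it skips any signature containing
ClamAV wildcards, and the function names are hypothetical.

```python
import hashlib
import string

SAFE_CHARS = set(string.ascii_letters + string.digits)


def hex_to_regex(hex_sig):
    """Turn a plain hex signature into a regex body, or None if not plain hex."""
    try:
        raw = bytes.fromhex(hex_sig)
    except ValueError:
        # Signature uses ClamAV wildcards (??, *, {n}, ...): skipped in this sketch.
        return None
    out = []
    for b in raw:
        ch = chr(b)
        if ch in SAFE_CHARS:
            out.append(ch)
        elif 0x20 <= b < 0x7F:
            out.append("\\" + ch)        # escape printable punctuation, including "/"
        else:
            out.append("\\x%02x" % b)    # non-printable bytes as \xNN
    return "".join(out)


def ndb_line_to_rule(line):
    """Convert one .ndb line to SpamAssassin rule text, or None if unusable."""
    fields = line.strip().split(":")
    if len(fields) < 4:
        return None
    name, _target, _offset, hex_sig = fields[:4]
    pattern = hex_to_regex(hex_sig)
    if pattern is None:
        return None
    # Assumption: rule name is derived from the MD5 of the hex signature.
    rule = "SANE_" + hashlib.md5(hex_sig.encode("ascii")).hexdigest()
    return "\n".join([
        "body %s /%s/" % (rule, pattern),
        "describe %s %s" % (rule, name),
        "score %s 0.01" % rule,
    ])


# Made-up sample line; 2e636e2f is the hex encoding of ".cn/".
print(ndb_line_to_rule("Email.Malware.Sanesecurity.08022207u:4:*:2e636e2f"))
```

Skipping wildcard signatures keeps the sketch short; a real converter would
need to translate them into regex constructs or drop them deliberately.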

I then run a mass-check, followed by hit-frequencies.


Then the selection of rules to import is based on Justin's suggestion:
> More or less -- I'd keep it even simpler.  Select if column 2 ("SPAM %
> hit") > 0.5, and discard if column 3 ("HAM % hit") > 0.
>
> The reason is, this is an automatically generated ruleset -- avoiding FPs
> in auto-generated stuff is critical in my opinion.  Some of those are
> pretty bad: an 8.8% false positive rate, ouch!!
>
> The rule of thumb for false positives is that you will only see a fraction
> of the "real-world" false positive rate in any measurement, since the
> degree of variation between people's ham collections can be very large.
>
>   
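
The selection rule above (keep if SPAM% > 0.5, discard if HAM% > 0) could be
sketched as follows. The layout is an assumption: whitespace-separated
columns with SPAM% second, HAM% third and the rule name last. The function
name and the sample rows are made up for illustration.

```python
def select_rules(freq_lines, min_spam_pct=0.5):
    """Return names of rules with SPAM% > min_spam_pct and no ham hits at all."""
    selected = []
    for line in freq_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            spam_pct = float(fields[1])   # column 2: SPAM % hit
            ham_pct = float(fields[2])    # column 3: HAM % hit
        except ValueError:
            continue                      # header or other non-data line
        if spam_pct > min_spam_pct and ham_pct == 0:
            selected.append(fields[-1])   # assumed: rule name is the last column
    return selected


# Hypothetical rows, not real mass-check results.
sample = [
    "OVERALL%  SPAM%  HAM%   S/O    RANK  SCORE  NAME",
    "1.200     2.400  0.000  1.000  0.90  0.01   SANE_aaa",
    "5.000     9.000  8.800  0.506  0.40  0.01   SANE_bbb",
    "0.100     0.200  0.000  1.000  0.50  0.01   SANE_ccc",
]
print(select_rules(sample))
```

Any ham hit at all disqualifies a rule here, which matches the point about
measured false positives understating the real-world rate.
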
Finally I run mkrules (it took a while to work out where all the
files had to be - either that or I can't read documentation ;-))


And I have a first stab at a ruleset available:

http://www.coders.co.uk/80_sane.cf

I am concerned about the results of some of the rules, e.g.

##{ SANE_f48d6d7bf39ebd0b4e830b808d5b45bd
body SANE_f48d6d7bf39ebd0b4e830b808d5b45bd /\.cn\//
describe SANE_f48d6d7bf39ebd0b4e830b808d5b45bd Email.Malware.Sanesecurity.08022207u
score SANE_f48d6d7bf39ebd0b4e830b808d5b45bd 0.01
##} SANE_f48d6d7bf39ebd0b4e830b808d5b45bd

Sorry the rule names are long - I haven't truncated the hash yet!
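
One possible way to shorten the names, sketched under the assumption that
every name is a prefix, an underscore and a hex digest. It keeps only the
first few hex characters and refuses to continue if two rules would end up
with the same short name. The function name and the default length of 8 are
arbitrary choices for illustration.

```python
def shorten_rule_names(full_names, hash_len=8):
    """Map each long rule name to PREFIX_<first hash_len chars of its digest>."""
    short = {}
    seen = {}  # short name -> long name that claimed it
    for name in full_names:
        prefix, _, digest = name.partition("_")
        candidate = "%s_%s" % (prefix, digest[:hash_len])
        if candidate in seen and seen[candidate] != name:
            raise ValueError("collision: %s and %s" % (seen[candidate], name))
        seen[candidate] = name
        short[name] = candidate
    return short


print(shorten_rule_names(["SANE_f48d6d7bf39ebd0b4e830b808d5b45bd"]))
```

With tens of thousands of rules, a short prefix makes collisions plausible,
so the check matters more than the exact length chosen.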

It isn't automatically updating at the moment, and all of the scores are
set to 0.01.

matt