Posted to dev@spamassassin.apache.org by Matt Hampton <ma...@coders.co.uk> on 2008/06/20 09:32:05 UTC
Creating auto-generated rule sets.....
Justin Mason wrote:
> Well, it'd be worth cc'ing the dev list, if that's ok. With any luck
> there'll be future people trying similar stuff and it'll be handy to have
> a thread URL to point at ;)
Quick intro - I have been working on automatically generating rules from
the Sane Security ClamAV signatures. With a fair bit of help from
Justin I have something up and running so I wanted to share what I have
done so far to see what people think and for some feedback.
I have a small perl script that extracts the rules from the scam.ndb and
phish.ndb files and generates 2 MAMMOTH rulesets (60000 rules!).
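
To give an idea of what that extraction step involves, here is a minimal sketch in Python (not the actual perl script). It assumes the usual ClamAV .ndb layout of Name:TargetType:Offset:HexSignature, handles only plain hex signatures (anything with wildcards is skipped), and assumes the rule name is "SANE_" plus an MD5 of the hex signature; that naming scheme is a guess based on the rule names further down. The function names and the sample line are made up for illustration.

```python
import hashlib
import re
import string

# Only signatures made of plain hex byte pairs are handled in this sketch;
# ClamAV wildcards such as "??", "*" or "{n}" are deliberately skipped.
PLAIN_HEX = re.compile(r"(?:[0-9a-fA-F]{2})+")
ALNUM = set(string.ascii_letters + string.digits)


def escape_for_body_rule(text):
    """Escape decoded signature text for use between /.../ in a body rule."""
    out = []
    for ch in text:
        if ch in ALNUM:
            out.append(ch)
        elif 0x20 <= ord(ch) < 0x7F:
            out.append("\\" + ch)            # printable punctuation, incl. "/"
        else:
            out.append("\\x%02x" % ord(ch))  # control and high bytes
    return "".join(out)


def ndb_line_to_rule(line):
    """Turn one .ndb line into a rule block, or None if it is not handled."""
    parts = line.strip().split(":")
    if len(parts) < 4:
        return None
    sig_name, hexsig = parts[0], parts[3]
    if not PLAIN_HEX.fullmatch(hexsig):
        return None
    text = bytes.fromhex(hexsig).decode("latin-1")
    # Assumption: rule name is derived from an MD5 of the hex signature.
    rule = "SANE_" + hashlib.md5(hexsig.lower().encode("ascii")).hexdigest()
    return "\n".join([
        "##{ " + rule,
        "body " + rule + " /" + escape_for_body_rule(text) + "/",
        "describe " + rule + " " + sig_name,
        "score " + rule + " 0.01",
        "##} " + rule,
    ])


if __name__ == "__main__":
    # Hypothetical input line, constructed for illustration only.
    print(ndb_line_to_rule("Email.Malware.Sanesecurity.08022207u:4:*:2e636e2f"))
```

With the sample line above, the hex 2e636e2f decodes to ".cn/", so the generated body pattern is /\.cn\//.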
I then run a mass-check followed by hit-frequencies.
Then the selection of rules to import is based on Justin's suggestion:
> More or less -- I'd keep it even simpler. Select if column 2 ("SPAM %
> hit") > 0.5, and discard if column 3 ("HAM % hit") > 0.
>
> The reason is, this is an automatically generated ruleset -- avoiding FPs
> in auto-generated stuff is critical in my opinion. Some of those are
> pretty bad: an 8.8% false positive rate, ouch!!
>
> The rule of thumb for false positives is that you will only see a fraction
> of the "real-world" false positive rate in any measurement, since the
> degree of variation between people's ham collections can be very large.
>
>
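
The selection rule quoted above could be sketched roughly like this in Python. It assumes the hit-frequencies output is whitespace-separated, with SPAM% in column 2, HAM% in column 3 and the rule name in the last column; header lines and anything unparseable are skipped. The function name, the threshold parameter and the sample rows are made up for illustration and are not real results.

```python
def select_rules(lines, min_spam_pct=0.5):
    """Return names of rules with SPAM% > min_spam_pct and no ham hits."""
    selected = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            spam_pct = float(fields[1])  # column 2: "SPAM % hit"
            ham_pct = float(fields[2])   # column 3: "HAM % hit"
        except ValueError:
            continue                     # header or malformed line
        if ham_pct > 0:
            continue                     # any ham hit at all: discard
        if spam_pct > min_spam_pct:
            selected.append(fields[-1])
    return selected


if __name__ == "__main__":
    # Invented sample rows in the assumed column layout.
    sample = [
        "OVERALL%  SPAM%  HAM%  S/O  RANK  SCORE  NAME",
        "1.200  2.400  0.000  1.000  0.90  0.01  SANE_aaaa",
        "5.000  1.200  8.800  0.120  0.10  0.01  SANE_bbbb",
        "0.100  0.200  0.000  1.000  0.50  0.01  SANE_cccc",
    ]
    for name in select_rules(sample):
        print(name)
```

Treating any non-zero HAM% as a discard is deliberately strict, in line with the point above that measured false positives understate the real-world rate.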
Finally I run mkrules (it took a while to work out where all the
files had to be - either that or I can't read documentation ;-))
and have a first stab at a ruleset available:
http://www.coders.co.uk/80_sane.cf
I am concerned about the results of some of the rules, e.g.
##{ SANE_f48d6d7bf39ebd0b4e830b808d5b45bd
body SANE_f48d6d7bf39ebd0b4e830b808d5b45bd /\.cn\//
describe SANE_f48d6d7bf39ebd0b4e830b808d5b45bd Email.Malware.Sanesecurity.08022207u
score SANE_f48d6d7bf39ebd0b4e830b808d5b45bd 0.01
##} SANE_f48d6d7bf39ebd0b4e830b808d5b45bd
Sorry the rule names are long - I haven't truncated the hash yet!
It isn't automatically updating at the moment, and all of the scores are
set to 0.01.
matt