Posted to users@spamassassin.apache.org by poifgh <ab...@gmail.com> on 2009/10/06 20:08:28 UTC

SpamAssassin Ruleset Generation

I have a question about how rulesets are generated for
SpamAssassin.

For example - consider the rule in 20_drugs.cf : 
header SUBJECT_DRUG_GAP_C       Subject =~ /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i
describe SUBJECT_DRUG_GAP_C     Subject contains a gappy version of 'cialis'

Who generated the regular expression
"/\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i"

a. Is it done manually with people writing regex to see how efficiently they
capture spams?
b. Is there an algorithm that analyzes a large corpus of spam and then comes
up with these regexes on its own?
c. Is it a combination of (a), (b)?

I know scores for rules are generated using "a neural network trained with
error back propagation"
http://wiki.apache.org/spamassassin/HowScoresAreAssigned

But how are the rules generated themselves? 

Thnx
-- 
View this message in context: http://www.nabble.com/SpamAssassin-Ruleset-Generation-tp25773508p25773508.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: SpamAssassin Ruleset Generation

Posted by John Hardin <jh...@impsec.org>.
On Tue, 6 Oct 2009, poifgh wrote:

> Other than the sought rules, are all the rules manually generated? Are 
> there any statistics on how frequently new rules/regexes are adopted by 
> SpamAssassin? Who are the people who write them? Any details related to 
> it?

Most of the rules are manually written by contributors such as myself. 
Some meta rules are generated by various means from existing rules - for 
example, the ADVANCE_FEE rules are generated using genetic algorithms to 
find effective combinations of simpler subrules that were manually 
generated.

New rules are added whenever a contributor works on them, and this is 
generally based on when they have time to do so, when they have new ideas, 
and when new forms of spam appear. Indirect contributors will post rules 
to the users list and a contributor may add them to the rules sandbox for 
testing and eventual inclusion in the base ruleset.

The CREDITS file in the sources should list all of the contributors. Some 
contributors may not have added their names to that file, though.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  5 days since a sunspot last seen - EPA blames CO2 emissions

Re: SpamAssassin Ruleset Generation

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2009-10-06 at 13:50 -0700, an anonymous Nabble user wrote:
> Other than the sought rules, are all the rules manually generated?

Actually, as has been said, I believe all stock rules are manually
written. There are some third-party rule sets out there that are
auto-generated -- not limited to Sought.

> Are there any statistics on how frequently new rules/regexes are adopted by
> SpamAssassin? Who are the people who write them? Any details related to it?

Somehow this begs the question -- why?

Why are you asking? Why and what are you ultimately interested in?

And of course, did you even consider digging through the SVN repo, reading
some docs on the wiki, and asking Google? Most of this should be pretty easy
to find out if you're willing to read some.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: SpamAssassin Ruleset Generation

Posted by MySQL Student <my...@gmail.com>.
Hi,

> Other than the sought rules, are all the rules manually generated? Are there
> any statistics on how frequently new rules/regexes are adopted by
> SpamAssassin? Who are the people who write them? Any details related to

Information on Justin Mason's SOUGHT rules is here:

http://taint.org/2007/08/15/004348a.html

Use sa-update to update your SA rules once or twice per day with the
new stuff. His ongoing development work is here:

http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/jm/?sortby=date

HTH,
Alex

Re: SpamAssassin Ruleset Generation

Posted by poifgh <ab...@gmail.com>.


poifgh wrote:
> Bowie Bailey wrote:
>> http://www.google.com/search?q=spamassassin+sought
>
> :-D - Thnx

Other than the sought rules, are all the rules manually generated? Are there
any statistics on how frequently new rules/regexes are adopted by
SpamAssassin? Who are the people who write them? Any details related to it?

thnx


Re: SpamAssassin Ruleset Generation

Posted by poifgh <ab...@gmail.com>.


Bowie Bailey wrote:
> http://www.google.com/search?q=spamassassin+sought
>
:-D - Thnx



Re: SpamAssassin Ruleset Generation

Posted by Bowie Bailey <Bo...@BUC.com>.
poifgh wrote:
>
> RW-15 wrote:
>   
>> On Tue, 6 Oct 2009 11:08:28 -0700 (PDT)
>> poifgh <ab...@gmail.com> wrote:
>>
>>     
>>> I have a question about how rulesets are generated for
>>> ...
>>> a. Is it done manually with people writing regex to see how
>>> efficiently they capture spams?
>>> b. Is there an algorithm that analyzes a large corpus of spam and then
>>> comes up with these regexes on its own?
>>> c. Is it a combination of (a), (b)?
>>>       
>> The optional sought rules are autogenerated, the rest are manual.
>>     
>
> Thnx - What are optional sought rules?
>   

http://www.google.com/search?q=spamassassin+sought

-- 
Bowie

Re: SpamAssassin Ruleset Generation

Posted by poifgh <ab...@gmail.com>.


RW-15 wrote:
> 
> On Tue, 6 Oct 2009 11:08:28 -0700 (PDT)
> poifgh <ab...@gmail.com> wrote:
> 
>> 
>> I have a question about how rulesets are generated for
>> ...
>> a. Is it done manually with people writing regex to see how
>> efficiently they capture spams?
>> b. Is there an algorithm that analyzes a large corpus of spam and then
>> comes up with these regexes on its own?
>> c. Is it a combination of (a), (b)?
> 
> The optional sought rules are autogenerated, the rest are manual.
> 
> 

Thnx - What are optional sought rules?



Re: SpamAssassin Ruleset Generation

Posted by RW <rw...@googlemail.com>.
On Tue, 6 Oct 2009 11:08:28 -0700 (PDT)
poifgh <ab...@gmail.com> wrote:

> 
> I have a question about how rulesets are generated for
> ...
> a. Is it done manually with people writing regex to see how
> efficiently they capture spams?
> b. Is there an algorithm that analyzes a large corpus of spam and then
> comes up with these regexes on its own?
> c. Is it a combination of (a), (b)?

The optional sought rules are autogenerated, the rest are manual.

Re: SpamAssassin Ruleset Generation

Posted by Matt Kettler <mk...@verizon.net>.
poifgh wrote:
> I have a question about how rulesets are generated for
> SpamAssassin.
>
> For example - consider the rule in 20_drugs.cf : 
> header SUBJECT_DRUG_GAP_C       Subject =~ /\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i
> describe SUBJECT_DRUG_GAP_C     Subject contains a gappy version of 'cialis'
>
> Who generated the regular expression
> "/\bc.{0,2}i.{0,2}a.{0,2}l.{0,2}i.{0,2}s\b/i"
>   
Man, that's a good question. I wrote a large chunk of the rules in
20_drugs.cf, but not that one. (I wrote the stuff near the bottom that
uses meta rules, i.e. __DRUGS_ERECTILE1 through DRUGS_MANYKINDS,
originally distributed as a separate set called antidrug.cf.) As I
recall, there were 2 other people making drug rules, but it's been a
LONG time, and I forget who they were. Those rules were written in the
2004-2006 time frame when pharmacy spams were just hammering the heck
outa everyone.

> a. Is it done manually with people writing regex to see how efficiently they
> capture spams?
>   
Yes. Many hours of reading spams, studying them, testing various regex
tweaks, checking for false positives, etc, etc.

mass-check is your friend for this kind of stuff.

One post from when I was developing this as a stand-alone set:

http://mail-archives.apache.org/mod_mbox/spamassassin-users/200404.mbox/%3C6.0.0.22.0.20040428132346.029d96e0@opal.evi-inc.com%3E

Note: the Comcast link mentioned in that message should be considered
DEAD. The antidrug set is no longer maintained separately from the
mainline ruleset, and hasn't been for years.


If you want to break the rules down a bit, here are some tips:

The rules are in general designed to detect common methods of obscuring
text: inserting spaces, punctuation, etc. between letters, and possibly
replacing some of the letters with similar-looking characters
(W4R3Z style, etc).

The simplest way to think of it is in groupings. You end up using
a repeating pattern of (some representation of a character)(some kind of
"gap" sequence)(character)(gap)...etc.

.{0,2} is a "gap sequence", although not one I prefer. I prefer
[_\W]{0,3} in most cases because it's a bit less prone to false positives
(FPs), but it risks missing spam that uses lower-case letters as the gap.
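
To make that trade-off concrete, here is a rough Python sketch (not
SpamAssassin code). It assumes Python's `re` module behaves like Perl
for these simple constructs, and the names `dot_gap` and `nonword_gap`
are made up for illustration. It applies both gap sequences to the
letters of 'cialis':

```python
import re

# The letters of "cialis" joined by two different gap sequences.
LETTERS = list("cialis")
dot_gap = re.compile(r"\b" + r".{0,2}".join(LETTERS) + r"\b", re.IGNORECASE)
nonword_gap = re.compile(r"\b" + r"[_\W]{0,3}".join(LETTERS) + r"\b", re.IGNORECASE)

samples = [
    "c.i.a.l.i.s",   # punctuation used as gaps
    "c_i_a_l_i_s",   # underscores used as gaps
    "cxixaxlxixs",   # lower-case letters used as gaps
    "capitalists",   # innocent word, nothing to do with the drug
]

for text in samples:
    print(f"{text!r:16} dot_gap={bool(dot_gap.search(text))} "
          f"nonword_gap={bool(nonword_gap.search(text))}")
```

The '.{0,2}' version hits the innocent word "capitalists" (a false
positive), while the '[_\W]{0,3}' version does not; in exchange, the
'[_\W]{0,3}' version misses "cxixaxlxixs", where letters fill the gaps.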

You also get replacements for characters in some of those, like [A4]
instead of just A, or, more elaborately, [a4\xE0-\xE6@].
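
The (character)(gap)(character) idea, with optional look-alike classes,
can be sketched in Python roughly as follows. This is only an
illustration: `build_obfuscation_pattern` and `LOOKALIKES` are
hypothetical names, not part of SpamAssassin, and Python matches
\xE0-\xE6 as code points in a str, whereas a mail filter may be
looking at raw bytes.

```python
import re

# Look-alike classes copied from the rules quoted in this thread.
LOOKALIKES = {
    "a": r"[a4\xE0-\xE6@]",
    "i": r"[ij1!|l\xEC\xED\xEE\xEF]",
}

def build_obfuscation_pattern(word, gap=r"[_\W]{0,3}", lookalikes=None):
    """Hypothetical helper: (character)(gap)(character)... between word boundaries."""
    lookalikes = lookalikes or {}
    # Use the look-alike class for a letter when one is given, else the letter itself.
    elements = [lookalikes.get(ch.lower(), re.escape(ch)) for ch in word]
    return r"\b" + gap.join(elements) + r"\b"

# A plain '.{0,2}' gap and no substitutions gives the SUBJECT_DRUG_GAP_C regex.
plain = build_obfuscation_pattern("cialis", gap=r".{0,2}")
print(plain)

# With look-alike classes the same word also catches substituted letters.
fuzzy = re.compile(build_obfuscation_pattern("cialis", lookalikes=LOOKALIKES),
                   re.IGNORECASE)
for text in ["C-I-A-L-I-S", "c1@l1s", "special offer"]:
    print(text, bool(fuzzy.search(text)))
```

Keeping the gap and the per-letter classes as separate parameters makes
it easy to try a different gap sequence without retyping the whole
expression.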

So this mess:

body __DRUGS_ERECTILE1	/(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a40\xE0-\xE6@][_\W]{0,3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,3}[a40\xE0-\xE6@][_\W]{0,3}x?[_\W]{0,3}(?:\b|\s)/i


Could be broken down:

(?:\b|\s)                 - preamble, detecting space or word boundary.
[_\W]{0,3}                - gap
(?:\\\/|V)                - V
[_\W]{0,3}                - gap
[ij1!|l\xEC\xED\xEE\xEF]  - I
[_\W]{0,3}                - gap
[a40\xE0-\xE6@]           - A
[_\W]{0,3}                - gap
[xyz]?[gj]                - G (with optional extra garbage before it)
[_\W]{0,3}                - gap
r                         - just R :-)
[_\W]{0,3}                - gap
[a40\xE0-\xE6@]           - A
[_\W]{0,3}                - gap
x?                        - optional garbage
[_\W]{0,3}                - gap
(?:\b|\s)                 - suffix, detecting space or word boundary.

Which detects weird spacings and substitutions in the word Viagra.
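
For anyone who wants to poke at it, the breakdown can be reassembled
and tried out in Python. This is a sketch under assumptions: Python's
`re` stands in for Perl's regex engine, the sample strings are made up,
and SpamAssassin applies body rules to its own rendering of the
message, which this does not imitate.

```python
import re

GAP = r"[_\W]{0,3}"       # up to three non-word characters or underscores
EDGE = r"(?:\b|\s)"       # word boundary or whitespace
A = r"[a40\xE0-\xE6@]"    # 'a' and look-alikes

pieces = [
    EDGE, GAP,
    r"(?:\\\/|V)", GAP,                  # V, or "\/" drawn with two slashes
    r"[ij1!|l\xEC\xED\xEE\xEF]", GAP,    # I
    A, GAP,                              # A
    r"[xyz]?[gj]", GAP,                  # G, with an optional junk letter first
    r"r", GAP,                           # R
    A, GAP,                              # A
    r"x?", GAP,                          # optional trailing junk
    EDGE,
]
DRUGS_ERECTILE1 = re.compile("".join(pieces), re.IGNORECASE)

samples = ["Viagra", "V 1 @ g r a", "buy \\/!agra now", "Niagara Falls", "viagram"]
for text in samples:
    print(f"{text!r:20} {bool(DRUGS_ERECTILE1.search(text))}")
```

The first three samples match; "Niagara Falls" has no V-like opener, and
"viagram" fails the closing boundary check.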


> But how are the rules generated themselves? 
>   
Mostly meatware, except the sought rules others have mentioned.
> Thnx
>