Posted to users@spamassassin.apache.org by NFN Smith <wo...@sacbeemail.com> on 2006/12/05 18:30:39 UTC

Need regexp tip

I'm working on a series of rules to find obfuscated words in subject 
lines that have been misspelled by adding an extra character (often a 
repeated letter) to a word.  For certain words, it seems to be 
appropriate to assume that if they're misspelled in that way, it's 
deliberate.

I've got the syntax for a regular expression mostly working (including 
words with trailing punctuation), but it doesn't identify words where 
the last letter is doubled.  For example, with a regexp like:

  /\b(?!badword)(?:b.?a.?d.?w.?o.?r.?d.?)(\b|\!|\.|\,|\;|\:|\?)/i

I'm getting hits on things like 'baddword' and 'badwoord', and even 
'badworrd!', but I'm not getting a hit on 'badwordd'.
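
The behaviour can be reproduced outside SpamAssassin with a minimal 
sketch. It uses Python's re module as a stand-in for the Perl regex 
engine (an assumption; lookahead and \b should behave the same way for 
this pattern), and the helper name hits() is purely illustrative:

```python
import re

# The pattern from above, unchanged; re.IGNORECASE stands in for the /i flag.
PATTERN = re.compile(
    r"\b(?!badword)(?:b.?a.?d.?w.?o.?r.?d.?)(\b|\!|\.|\,|\;|\:|\?)",
    re.IGNORECASE,
)


def hits(subject: str) -> bool:
    """Return True if the pattern matches anywhere in the subject line."""
    return PATTERN.search(subject) is not None


# Report hit / no hit for each of the sample spellings.
for word in ["baddword", "badwoord", "badworrd!", "badwordd", "badword"]:
    print(f"{word!r}: {'hit' if hits(word) else 'no hit'}")
```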

I've tried a number of variants, but still am not quite getting it. 
What am I missing?

Smith


Re: Need regexp tip

Posted by "John D. Hardin" <jh...@impsec.org>.
On Tue, 5 Dec 2006, NFN Smith wrote:

> I'm working on a series of rules to find obfuscated words 
> 
>   /\b(?!badword)(?:b.?a.?d.?w.?o.?r.?d.?)(\b|\!|\.|\,|\;|\:|\?)/i

I have a tool that does this (for double letters as well as other
obfuscations) automatically.

http://www.impsec.org/~jhardin/antispam/
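
For the pattern as quoted, the likely culprit is the negative lookahead: 
'badwordd' itself begins with 'badword', so (?!badword) rejects it at 
the only word boundary where a match could start. One possible 
adjustment, offered as a sketch and not as a description of what the 
tool above does, is to end the lookahead with \b so that it excludes 
only the exact word. Python's re module is used here as a stand-in for 
Perl, and the names ORIGINAL, ADJUSTED, and matches() are hypothetical:

```python
import re

TAIL = r"(?:b.?a.?d.?w.?o.?r.?d.?)(\b|\!|\.|\,|\;|\:|\?)"

# Lookahead as posted: rejects anything that merely starts with "badword".
ORIGINAL = re.compile(r"\b(?!badword)" + TAIL, re.IGNORECASE)

# Assumed adjustment: reject only "badword" followed by a word boundary.
ADJUSTED = re.compile(r"\b(?!badword\b)" + TAIL, re.IGNORECASE)


def matches(pattern: re.Pattern, subject: str) -> bool:
    """Return True if the pattern matches anywhere in the subject line."""
    return pattern.search(subject) is not None


# Compare both patterns on the same sample spellings.
for word in ["badword", "badword!", "baddword", "badworrd!", "badwordd"]:
    print(word, matches(ORIGINAL, word), matches(ADJUSTED, word))
```

Note that the adjusted form would also hit ordinary suffixes such as 
'badwords', so it only suits words where any extra trailing letter is 
suspicious.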

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  The question of whether people should be allowed to harm themselves
  is simple. They *must*.                           -- Charles Murray
-----------------------------------------------------------------------
 10 days until Bill of Rights day