Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/06/22 04:56:22 UTC

Re: Normalized text ruletype

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


One problem is that we've already added something for those mails in 3.1.0 --
but from the other direction ;)

Namely, Theo wrote a plugin which allows rules to be written which
are then translated into more complex rules, that match the variety
of obfuscations observed.  The two modes kind of clash... but
we should compare one against the other.

FWIW, I quite like the idea of massively normalising as you do there --
lowercasing, dropping spaces, etc.   I can see one problem with doing it
that way though.  If you approach it from the normalization angle, there
are issues with some kinds of obfuscation, e.g. the ones where a char in a
string has been replaced by multiple chars:

    the quick brown fox jumped
    the quick brow|\| fox jumped

coming from the other angle, by munging the rule strings, you *can*
match that.
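
To make the difference concrete, here is a rough sketch in Python (not SpamAssassin code). The one-for-one character table, the `normalize` helper, and the loosened pattern are all assumptions made up for illustration. It shows a per-character normalizer missing the `|\|` form, while a pattern loosened on the rule side still matches it.

```python
import re

# Hypothetical one-character normalization table (illustration only;
# not taken from any SpamAssassin code).
CHAR_MAP = str.maketrans({"0": "o", "1": "l", "@": "a", "$": "s", "|": "l"})


def normalize(text: str) -> str:
    """Lowercase, map look-alike characters one-for-one, and drop spaces."""
    return text.lower().translate(CHAR_MAP).replace(" ", "")


# Rule-side approach: loosen the pattern so that "n" also accepts the
# three-character sequence |\| (pipe, backslash, pipe).
N = r"(?:n|\|\\\|)"
loosened = re.compile(r"brow" + N, re.I)

plain = "the quick brown fox jumped"
garbled = r"the quick brow|\| fox jumped"

print("brown" in normalize(plain))     # True
print("brown" in normalize(garbled))   # False: "|" became "l", "\" was left alone
print(bool(loosened.search(plain)))    # True
print(bool(loosened.search(garbled)))  # True
```

A normalizer could of course grow multi-character rewrites as well, but it would then have to pick a single reading for characters like `|`, which is the ambiguity described above.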

anyway, I'll let Theo comment...

- --j.

Loren Wilton writes:
> Wow, neat!  I've been looking at something like this for quite some time.
> 
> Adding in pipes and some of the other characters known to be used for
> obfuscations could well drastically increase your hit ratios; they
> are really common.
> 
> I think this is quite possibly a good start on a new rule type.
> 
>         Loren
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFCuNNWMJF5cimLx9ARAhIXAJ9JdpBxQDWyc8AxRsXHkr9z6Db3lQCfRjhb
7+t77dN8g1uaS0n+lJSqwz8=
=QeQ0
-----END PGP SIGNATURE-----


Re: Normalized text ruletype

Posted by Daniel Quinlan <qu...@pathname.com>.
jm@jmason.org (Justin Mason) writes:

> Namely, Theo wrote a plugin which allows rules to be written which

*cough*

Actually, it was written by Felix Bauer and I extended/rewrote it some.

I had a ticket in bugzilla way back when to do what the normalized text
does, except on a per-rule basis.  However, it would have been expensive.
The normalized text idea is not nearly as expensive, but it is somewhat
less flexible, and flexibility is important so you can tune each rule to
eliminate false positives.  Also, as Justin mentioned, you can't
transform spam garble to a standard format because lots of characters
are used in more than one way (for example, '|' can be 1, i, l, or a
building block for multi-character representations of letters).

However, you can loosen a regexp up such that it will match most garbled
text and that's what the new ReplaceTags plugin does (in 3.1.0-pre1).
Further, it does all the replacements at start-up time, so they're cheap
in spamd (still an expensive regexp, but it's no worse than any complex
body rule) *and* you can use different replacements for different rules.

Here's the usage:

Mail::SpamAssassin::Plugin::ReplaceTags - tags for SpamAssassin rules

The plugin allows rules to contain regular expression tags to be used in
regular expression rules.  The tags make it much easier to maintain
complicated rules.

Warning: This plugin relies on data structures specific to this version of
SpamAssassin; it is not guaranteed to work with other versions of SpamAssassin.

  loadplugin    Mail::SpamAssassin::Plugin::ReplaceTags

  replace_start <
  replace_end   >

  replace_tag   A       [a@]
  replace_tag   G       [gk]
  replace_tag   I       [il|!1y\?\xcc\xcd\xce\xcf\xec\xed\xee\xef]
  replace_tag   R       [r3]
  replace_tag   V       (?:[vu]|\\\/)
  replace_tag   SP      [\s~_-]

  body          VIAGRA_OBFU     /(?!viagra)<V>+<SP>*<I>+<SP>*<A>+<SP>*<G>+<SP>*<R>+<SP>*<A>+/i
  describe      VIAGRA_OBFU     Attempt to obfuscate "viagra"

  replace_rules VIAGRA_OBFU
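
For anyone who wants to see what the rule looks like once the tags are substituted, here is a small Python sketch of the idea. It is not the plugin's Perl code: the `expand` helper is hypothetical, and Python's `re` stands in for Perl regexps on the assumption that it treats these particular character classes the same way. The expansion happens once, up front, and only the finished pattern is used for matching.

```python
import re

# Tag table copied from the replace_tag lines above.
TAGS = {
    "A": r"[a@]",
    "G": r"[gk]",
    "I": r"[il|!1y\?\xcc\xcd\xce\xcf\xec\xed\xee\xef]",
    "R": r"[r3]",
    "V": r"(?:[vu]|\\\/)",
    "SP": r"[\s~_-]",
}

RULE = r"(?!viagra)<V>+<SP>*<I>+<SP>*<A>+<SP>*<G>+<SP>*<R>+<SP>*<A>+"


def expand(rule, tags, start="<", end=">"):
    """Hypothetical helper: swap each <NAME> for its expression.

    Unknown names are left untouched.
    """
    pattern = re.escape(start) + r"(\w+)" + re.escape(end)
    return re.sub(pattern, lambda m: tags.get(m.group(1), m.group(0)), rule)


# Expand once, then reuse the compiled pattern for every message.
viagra_obfu = re.compile(expand(RULE, TAGS), re.I)

for text in ["v1agra", "V I A G R A", r"\/iagra", "viagra"]:
    print(repr(text), bool(viagra_obfu.search(text)))
```

The first three strings match and the last one, plain "viagra", does not.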

But, wait, there's more!

You can also define "pre", "post", and "inter" tags, which are
automatically placed before each, after each, and between adjacent
tags, respectively.  So, if you wanted, you could define the above
VIAGRA rule like this:

  replace_post RE       +
  replace_inter SP      [\s~_-]*

  body          VIAGRA_OBFU     /<inter SP><post RE>(?!viagra)<V><I><A><G><R><A>/i
  describe      VIAGRA_OBFU     Attempt to obfuscate "viagra"

  replace_rules VIAGRA_OBFU
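
Here is a Python sketch of how "post" and "inter" could be applied, going only by the description above (after each tag, and between adjacent tags). It is a guess at the behaviour, not the plugin's actual algorithm, and `expand_short` is a hypothetical helper. It also assumes the inter tag in the rule is the SP defined just above and that the rule ends in `<A>`, as in the long form. Under those assumptions the short form expands to exactly the same regexp as the long one.

```python
import re

# Plain tags, as in the first example.
TAGS = {
    "A": r"[a@]",
    "G": r"[gk]",
    "I": r"[il|!1y\?\xcc\xcd\xce\xcf\xec\xed\xee\xef]",
    "R": r"[r3]",
    "V": r"(?:[vu]|\\\/)",
    "SP": r"[\s~_-]",
}
# Meta tag tables, from the replace_post / replace_inter lines.
META_TABLES = {
    "pre": {},
    "post": {"RE": "+"},
    "inter": {"SP": r"[\s~_-]*"},
}

META = re.compile(r"<(pre|post|inter) (\w+)>")
TAG_RUN = re.compile(r"(?:<\w+>)+")  # one or more directly adjacent tags


def expand_short(rule):
    """Assumed semantics: pre/post wrap each tag, inter joins neighbours."""
    extras = {"pre": "", "post": "", "inter": ""}

    def take_meta(m):
        kind, name = m.group(1), m.group(2)
        extras[kind] = META_TABLES[kind][name]
        return ""  # the meta tag itself disappears from the rule

    rule = META.sub(take_meta, rule)

    def expand_run(m):
        names = re.findall(r"<(\w+)>", m.group(0))
        parts = [extras["pre"] + TAGS[n] + extras["post"] for n in names]
        return extras["inter"].join(parts)

    return TAG_RUN.sub(expand_run, rule)


LONG = r"(?!viagra)<V>+<SP>*<I>+<SP>*<A>+<SP>*<G>+<SP>*<R>+<SP>*<A>+"
SHORT = r"<inter SP><post RE>(?!viagra)<V><I><A><G><R><A>"

long_expanded = re.sub(r"<(\w+)>", lambda m: TAGS[m.group(1)], LONG)
short_expanded = expand_short(SHORT)
print(short_expanded == long_expanded)
```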

In case you're not familiar with Perl regexps, the (?!viagra) just
means: don't match if you see plain-text "viagra".  It only matches
obfuscated "viagra".
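
A quick way to see the lookahead at work, again as a Python sketch. The pattern is a cut-down stand-in for the real rule, invented for illustration, and Python's `(?!...)` behaves like Perl's here.

```python
import re

# Simplified stand-in for the expanded rule (an assumption, not the real one).
rule = re.compile(r"(?!viagra)v[i1!|]agra", re.I)


def hits(text):
    """True if the rule matches anywhere in text."""
    return rule.search(text) is not None


print(hits("cheap viagra"))  # False: the plain spelling is excluded
print(hits("cheap v1agra"))  # True
print(hits("cheap VIAGRA"))  # False: /i applies to the lookahead too
```

The lookahead is only tested at the position where a match would start, so it excludes the plain spelling without affecting the obfuscated ones.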

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/