Posted to dev@spamassassin.apache.org by jm...@jmason.org on 2004/04/17 00:05:12 UTC

Unicode decomposition and spam (fwd)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi David --

thanks for the tip.  I've fwded it on to the developers list, let's
see if anyone there is interested...

- --j.

- ------- Forwarded Message

Date:    Fri, 16 Apr 2004 16:04:13 -0500
From:    David Nesting <da...@fastolfe.net>
To:      jm@jmason.org
Subject: Unicode decomposition and spam

Hi Justin,

I had a thought that may interest you in your development of
SpamAssassin.

I see spam subject lines that say things like "Low Cost Term Life
Insurance!" with the vowels replaced with some international character,
like ô instead of o.

People understand the two are roughly the same, but pattern-matching
software may not.  Fortunately, Unicode has an algorithm for "decomposing"
those letters into their constituent parts, namely the base letter itself
followed by the marks applied to it.  So ô decomposes to the letter o
plus the combining circumflex (the little ^ symbol above it).

This decomposition process may make pattern matching a little better in
e-mails that use this trick to hide words.  It won't catch things like
v|@gra, though.
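
As a rough sketch of what decomposition does at the code point level,
the snippet below prints a precomposed character before and after NFD,
and shows that ASCII look-alikes pass through untouched.  The helper
name codepoints() is made up for illustration and is not part of
Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of Unicode::Normalize): render a string
# as a space-separated list of U+XXXX code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $accented = "\x{F4}";                 # o with circumflex, precomposed
print codepoints($accented), "\n";       # U+00F4
print codepoints(NFD($accented)), "\n";  # U+006F U+0302 (o + combining circumflex)

# Plain ASCII substitutions have no decomposition, so NFD leaves them alone.
my $leet = 'v|@gra';
print NFD($leet) eq $leet ? "unchanged\n" : "changed\n";   # unchanged
```

Note that only characters with a canonical decomposition are affected;
anything else would need a separate look-alike mapping.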

The good news is that there's an implementation of this in Perl already:

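use strict;
use warnings;
use utf8;				# source contains literal UTF-8 characters
use Unicode::Normalize;			# exports NFD by default

binmode(STDOUT, ':encoding(UTF-8)');	# emit UTF-8 rather than Latin-1 bytes

my $input = 'Lôw Côst Term Life ins';
print "$input\n";			# 'Lôw Côst Term Life ins'
my $output = NFD($input);		# equivalent to $input but decomposed
$output =~ s/\p{M}//g;			# strip out all marks
print "$output\n";			# 'Low Cost Term Life ins'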

One drawback to an approach like this is that it causes "correctly
spelled" words in other languages to assume an incorrect spelling, just
for the sake of pattern matching.  Maybe the spam really is in French?
How does this hurt the algorithm?
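
To illustrate that drawback, here is a small sketch using a few French
words.  The helper name strip_marks() is hypothetical; it just wraps
the decompose-and-strip steps so the effect on legitimate text is easy
to see.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: decompose, then drop every combining mark.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{M}//g;
    return $decomposed;
}

# Distinct French words collapse to the same string once the accents
# are gone: both "pâté" and "pâte" come out as "pate".
for my $word ('pâté', 'pâte', 'côte', 'côté') {
    printf "%s -> %s\n", $word, strip_marks($word);
}
```

Any pattern written against the stripped text can no longer tell those
words apart, which is the cost of normalizing this way.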

Anyway, I thought this might be interesting to you.

David

- -- 
 == David Nesting WL7RO Fastolfe david@fastolfe.net http://fastolfe.net/ ==
 fastolfe.net/me/pgp-key A054 47B1 6D4C E97A D882  C41F 3065 57D9 832F AB01



- ------- End of Forwarded Message

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAgFiYQTcbUG5Y7woRAiJ5AJ9twt+QC+l7HUltpOmvl5myxc/kpQCfZ1XD
MZQR3x5UGD4Oo8GEvlkMGY8=
=N/Ue
-----END PGP SIGNATURE-----