Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/08/03 05:46:41 UTC
[Bug 4512] New: Different approach to obfuscation rules
http://bugzilla.spamassassin.org/show_bug.cgi?id=4512
Summary: Different approach to obfuscation rules
Product: Spamassassin
Version: unspecified
Platform: Other
OS/Version: other
Status: NEW
Severity: enhancement
Priority: P5
Component: Rules
AssignedTo: dev@spamassassin.apache.org
ReportedBy: rcasha@waldonet.net.mt
I've been thinking of a very different approach to obfuscated text rules than
that mentioned in bug 4094.
To implement this rule, SA would, on receiving a message, create a *new string*
from the subject line and/or message body, replacing each character with its
nearest ASCII equivalent based on visual appearance. Thus for instance the characters
¢, ©, ç, ċ, č (cent, copyright, c with cedilla etc) - in upper or lowercase -
would all be replaced by a small letter c. All spacing, commas and periods would
be removed. Thus the string "Büy ç.i.a.l.í.s" would become "buycialis". Rules
can then test against this resulting string, simplifying the rules in general.
Of course this only works for Latin scripts, but I think that accounts for quite
a large chunk of spam.
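
A minimal sketch of the normalization step described above, in Python rather
than SA's own Perl. The function name normalize, the LOOKALIKES table and the
sample input are illustrative assumptions, not part of SpamAssassin. Accented
letters are reduced with Unicode decomposition; symbols that have no
decomposition (cent, copyright) need an explicit table.

```python
import unicodedata

# Hypothetical lookalike table for symbols that Unicode does not decompose
# into a base letter. A real table would be far larger.
LOOKALIKES = {
    "\u00a2": "c",  # cent sign
    "\u00a9": "c",  # copyright sign
    "\u00b5": "u",  # micro sign
}

# Separators the proposal removes before matching.
SEPARATORS = {",", "."}


def normalize(text: str) -> str:
    """Return a lowercase string with lookalikes folded and separators dropped."""
    out = []
    for ch in text:
        if ch.isspace() or ch in SEPARATORS:
            continue
        ch = LOOKALIKES.get(ch, ch)
        # NFKD splits an accented letter into base letter + combining marks;
        # the combining marks are discarded.
        for part in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out).lower()


# Illustrative input: u with diaeresis, c with cedilla, i with acute.
sample = "B\u00fcy \u00e7.i.a.l.\u00ed.s"
cleaned = normalize(sample)
```

A rule would then test a plain pattern such as "cialis" against the cleaned
string instead of enumerating every obfuscated spelling.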
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
[Bug 4512] Different approach to obfuscation rules
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4512
------- Additional Comments From lwilton@earthlink.net 2005-08-02 21:39 -------
Subject: Re: New: Different approach to obfuscation rules
This is very similar to an idea that has been floated before, perhaps in a
different bz entry than the one you mentioned.
I personally believe the concept has potential merit, but is lacking a
practical demonstration to indicate one way or the other if the time to do
the substitution will pay off. Of course, this would also require rewriting
quite a few rules to take advantage of it; but that is mostly a nasty
bookkeeping issue.
It is necessary to keep the original text around for easy observation. It
is pretty easy to determine that something is spam simply by observing the
pseudo-encryption the spammers use to try to get around character matches.
So you need to have essentially a 4th body-type classification. The current
'body' would be 'unencrypted' or some such, and the new type would be
'clean' or some such. SA already has much of this concept present, but it
doesn't extend far enough that the rules can get to it.
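
The idea of keeping both renditions available to rules can be sketched as
follows. This is only an illustration under assumed names: Rule, squash,
run_rules and the "raw"/"clean" labels are hypothetical and do not correspond
to SpamAssassin's actual rule types or API.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    target: str   # which rendition to test: "raw" or "clean" (hypothetical labels)
    pattern: str  # regular expression applied to that rendition


def squash(text: str) -> str:
    """Stand-in cleaner: drop whitespace, commas and periods, then lowercase."""
    return re.sub(r"[\s.,]+", "", text).lower()


def run_rules(body: str, rules: list) -> list:
    """Return the names of the rules that hit, each against its own rendition."""
    renditions = {"raw": body, "clean": squash(body)}
    return [r.name for r in rules if re.search(r.pattern, renditions[r.target])]


rules = [
    # Content rule: simple pattern against the cleaned text.
    Rule("DRUG_NAME", "clean", r"cialis"),
    # Obfuscation rule: four or more single letters each followed by a period,
    # which is only visible in the original text.
    Rule("DOTTED_LETTERS", "raw", r"(?:\b\w\.){4,}"),
]

hits_obfuscated = run_rules("Buy c.i.a.l.i.s now", rules)
hits_plain = run_rules("Buy cialis now", rules)
```

The obfuscated message hits both rules while the plain one hits only the
content rule, which is why the original text has to stay available.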
Loren
[Bug 4512] Different approach to obfuscation rules
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4512
quinlan@pathname.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |LATER
   Target Milestone|Undefined                   |Future
------- Additional Comments From quinlan@pathname.com 2005-08-03 11:13 -------
It's a good idea, but not really workable since the mapping is not
one-to-one. Some characters can mean multiple things in Latin-alphabet
obfuscations. In addition, it is very locale-specific.
There is some research being done to use Markov models to change obfuscations
accurately using a dictionary of words, but it is very slow (I think he said 160
characters per second), so I'm closing this as LATER.
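
To illustrate why a mapping that is not one-to-one is a problem, here is a
small sketch. It is not the Markov-model research mentioned above; it simply
enumerates every reading of a token and keeps those found in a word list. The
AMBIGUOUS table, the WORDS set and the function names are assumptions made up
for this example.

```python
from itertools import product

# Hypothetical table: each obfuscating character and the letters it may stand for.
AMBIGUOUS = {
    "1": "il",
    "|": "il",
    "!": "il",
    "0": "o",
    "@": "a",
    "$": "s",
}

# Tiny stand-in dictionary; a real one would hold many thousands of words.
WORDS = {"cialis", "pills", "viagra"}


def candidates(token: str) -> set:
    """Every string the token could decode to under the AMBIGUOUS table."""
    options = [AMBIGUOUS.get(ch, ch) for ch in token.lower()]
    return {"".join(combo) for combo in product(*options)}


def decode(token: str) -> list:
    """Candidate readings that are dictionary words, in sorted order."""
    return sorted(candidates(token) & WORDS)


readings = candidates("c1a11s")  # three ambiguous characters -> 2**3 readings
matches = decode("c1a11s")
```

A fixed table would have to map "1" to either i or l everywhere, and "c1a11s"
needs both. The number of readings doubles with each two-way character, which
hints at why dictionary-driven decoding is slow.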