Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/08/03 05:46:41 UTC

[Bug 4512] New: Different approach to obfuscation rules

http://bugzilla.spamassassin.org/show_bug.cgi?id=4512

           Summary: Different approach to obfuscation rules
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: rcasha@waldonet.net.mt


I've been thinking of a very different approach to obfuscated text rules than
that mentioned in bug 4094.

To implement this rule, SA would, on receiving a message, create a *new string*
from the subject line and/or message body, replacing each character with its
visually nearest ASCII equivalent. Thus, for instance, the characters
¢, ©, ç, ċ, č (cent, copyright, c with cedilla etc) - in upper or lowercase -
would all be replaced by a small letter c. All spacing, commas and periods would
be removed. Thus the string "Bùy ©.i.a.l.ì.s" would become "buycialis". Rules
can then test against this resulting string, simplifying the rules in general.
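
A minimal sketch of the normalization described above, in Python. The
lookalike table, the function name normalize, and the use of Unicode
decomposition to strip accents are assumptions for illustration, not anything
SpamAssassin provides. The sketch also drops all punctuation, not only
spaces, commas and periods.

```python
import unicodedata

# Hypothetical lookalike table: symbols that resemble a letter but do not
# decompose to it under Unicode normalization. A real table would be larger.
LOOKALIKES = {
    "¢": "c",
    "©": "c",
    "µ": "u",
    "¡": "i",
}

def normalize(text):
    """Return a lowercase ASCII rendering of text with separators removed."""
    out = []
    for ch in text:
        ch = LOOKALIKES.get(ch, ch)
        # NFKD splits an accented letter into its base letter plus combining
        # marks, e.g. "ç" becomes "c" followed by U+0327.
        for part in unicodedata.normalize("NFKD", ch):
            # Keep ASCII letters and digits only; this discards combining
            # marks, whitespace and punctuation in one step.
            if part.isascii() and part.isalnum():
                out.append(part.lower())
    return "".join(out)

print(normalize("Bùy ©.i.a.l.ì.s"))  # buycialis
```

A rule could then match the plain substring "cialis" against the result
instead of enumerating every obfuscated spelling.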

Of course this only works for Latin scripts, but I think that's quite a large
chunk of spam.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4512] Different approach to obfuscation rules

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4512





------- Additional Comments From lwilton@earthlink.net  2005-08-02 21:39 -------
Subject: Re:   New: Different approach to obfuscation rules

This is very similar to an idea that has been floated before, perhaps in a
different bz entry than the one you mentioned.

I personally believe the concept has potential merit, but it lacks a
practical demonstration showing whether the time spent on the substitution
will pay off.  Of course, this would also require rewriting quite a few
rules to take advantage of it, but that is mostly a nasty bookkeeping
issue.

It is necessary to keep the original text around for easy observation.  It
is pretty easy to determine that something is spam simply by observing the
pseudo-encryption the spammers use to try to get around character matches.
So you need to have essentially a 4th body-type classification.  The current
'body' would be 'unencrypted' or some such, and the new type would be
'clean' or some such.  SA already has much of this concept present, but it
doesn't extend far enough that the rules can get to it.

        Loren
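
One way to picture keeping both renderings side by side, so that each rule
says which one it tests, is the Python sketch below. The names Renderings,
Rule and hits, the rule names, and the patterns are all hypothetical; none of
this corresponds to SpamAssassin's actual internals or rule syntax.

```python
import re
from dataclasses import dataclass

@dataclass
class Renderings:
    body: str   # text as it is presented to rules today
    clean: str  # hypothetical normalized text with obfuscation removed

@dataclass
class Rule:
    name: str
    target: str   # which rendering to test: "body" or "clean"
    pattern: str  # regular expression applied to that rendering

    def matches(self, msg):
        return re.search(self.pattern, getattr(msg, self.target)) is not None

def hits(rules, msg):
    """Return the names of all rules that match the message."""
    return [rule.name for rule in rules if rule.matches(msg)]

rules = [
    # Tests the cleaned text for the word itself.
    Rule("DRUG_NAME_CLEAN", "clean", r"cialis"),
    # Tests the original text for the obfuscation pattern itself:
    # four or more single characters each followed by "." or "-".
    Rule("GAPPY_TEXT", "body", r"(?:\w[.\-]){4,}"),
]

msg = Renderings(body="B.u.y c-i-a-l-i-s", clean="buycialis")
print(hits(rules, msg))  # ['DRUG_NAME_CLEAN', 'GAPPY_TEXT']
```

Keeping the original rendering means the obfuscation can still be scored as
evidence in its own right, while word rules stay simple.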






[Bug 4512] Different approach to obfuscation rules

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4512


quinlan@pathname.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |LATER
   Target Milestone|Undefined                   |Future




------- Additional Comments From quinlan@pathname.com  2005-08-03 11:13 -------
It's a good idea, but not really workable since the mapping is not
one-to-one.  Some characters can stand for more than one letter in
Latin-alphabet obfuscations.  In addition, it is very locale-specific.
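
To illustrate why a one-to-many mapping is awkward, here is a small Python
sketch. The table CANDIDATES and the functions decodings and resolve are
assumptions made up for this example. Each ambiguous character expands to
every letter it might stand for, and a word list is used to pick among the
results.

```python
from itertools import product

# Hypothetical one-to-many table: each obfuscating character maps to every
# letter it might plausibly stand for.
CANDIDATES = {
    "1": "il",
    "|": "il",
    "!": "il",
    "0": "o",
    "@": "a",
    "$": "s",
}

def decodings(token):
    """Return every possible plain-letter reading of token."""
    options = [CANDIDATES.get(ch, ch) for ch in token.lower()]
    return {"".join(combo) for combo in product(*options)}

def resolve(token, words):
    """Keep only the readings that appear in the given word list."""
    return decodings(token) & words

print(len(decodings("c1a11s")))                   # 8
print(resolve("c1a11s", {"cialis", "class"}))     # {'cialis'}
```

The number of readings doubles with every two-way ambiguous character, so a
simple replace-and-match pass is not enough; some dictionary or language
model is needed to choose among the candidates.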

There is some research being done to use Markov models to change obfuscations
accurately using a dictionary of words, but it is very slow (I think he said 160
characters per second), so I'm closing this as LATER.



