Posted to users@spamassassin.apache.org by Richard Doyle <rd...@islandnetworks.com> on 2006/10/04 23:57:38 UTC

double letter porn

I've been getting lots of porn site spam containing words with doubled
letters, like this one:

================================================
Orrgy pornn parrties! Lotts of
sttupid bitchees gangbangged by queue of guyss.
 annal_nailing and cum__swallowing orgiees.
 archiive of group_ssex materiall!
http://www.teens229mx.com/?lcajuryrpdbejn
================================================

Most of these hit razor2, and domains like www.teens???mx.com sooner or
later show up on the SURBL and URIBL lists, but nothing seems to catch
the misspelled words.

Can anybody suggest a rule or ruleset to catch these double-letter
obfuscations? I'm using SpamAssassin 3.1.4.

Re: double letter porn

Posted by jdow <jd...@earthlink.net>.

Adding a Soundex module to check whether words "sound" like the
words that already carry spam scores in SpamAssassin might be an
interesting trick.
{^_^}
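
A rough sketch of the Soundex idea, written in Python rather than as a SpamAssassin plug-in. The `soundex` function and the sample words are illustrative assumptions; this is not an existing SpamAssassin module:

```python
# Letter-to-digit table for classic American Soundex.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of `word` (empty string if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Skip vowels and skip a consonant that repeats the previous code.
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate two consonants with the same code.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]


for plain, doubled in [("porn", "pornn"), ("orgy", "orrgy"), ("parties", "parrties")]:
    print(plain, soundex(plain), doubled, soundex(doubled))
```

Doubled letters collapse to the same code, so the doubled spellings compare equal to the plain ones. The code is coarse, though: in this sketch "prawn" gets the same code as "porn", so false positives are a real concern.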
----- Original Message ----- 
From: "Chris St. Pierre" <st...@NebrWesleyan.edu>


> One thing I've wondered/thought about is using the Levenshtein
> distance between the words in an email and a list of spam words
> (ideally pulled from the bayes db).  In this case, all of the
> misspelled words in that sample have an L-distance of 1 from the real
> word -- in other words, they're *very* close.
> 
> I think the problem would be that this would consume tons of
> resources.  Anything else, though, would be susceptible to other typo
> attacks.  For instance, say you took each email, and replaced all
> doubled letters with single letters, it wouldn't be long before you
> were getting spam advertising "analr bictches" or the like.
> 
> Chris St. Pierre
> Unix Systems Administrator
> Nebraska Wesleyan University
> 
> On Wed, 4 Oct 2006, Eric A. Hall wrote:
> 
>>
>>On 10/4/2006 5:57 PM, Richard Doyle wrote:
>>> I've been getting lots of porn site spam containing words with doubled
>>> letters, like this one:
>>
>>> Can anybody suggest a rule or ruleset to catch these double-letter
>>> obfuscations? I'm using Spamassassin 3.1.4.
>>
>>You'd probably need to write a plug-in that used some kind of
>>typo-matching logic to find porno words.
>>
>>Would be a good plug-in actually. Get busy :)
>>
>>-- 
>>Eric A. Hall                                        http://www.ehsco.com/
>>Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/
>>

Re: double letter porn

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.

If anyone's curious, I did some followup research on the ideas below
and found them to be, on the whole, infeasible.

I downloaded the TREC corpus and generated a list of words that
commonly appeared in spam.  I used the top 1000 most common words of
greater than four letters in the TREC spam that were NOT in the top
1000 most common >4 letter words in the TREC ham.

I then did two sets of tests on a few sample hams and spams, and the
results convinced me that it was not even necessary to run the tests
on the whole corpus.

For each message, I compared each word of more than four letters
with each word in my spam wordlist using a slightly modified
Levenshtein distance, computed with the Wagner-Fischer algorithm
(W-F below).  With W-F, I was able to give
greater weight to letter replacements, so "viagna" would be further
from "viagra" than, say, "viagrra."  I also compared the Metaphone
representation of each word of >4 letters with the Metaphone codes of
each word in my spam wordlist, again with Wagner-Fischer.  I discarded
those distances that were too high and then computed a score for each
message with the following formula:

<metaphone_length> ^ 2 / (<metaphone_distance> + 1) + 
     <word_length> ^ 2 / (<distance> + 1)

I ran this on the first ten spams and hams in the corpus.  The mean
score for spams was 365.7 and the median was 12.5; the mean score for
hams was 3715.565 and the median was 1103.6.  More than anything, the
results seem to indicate the length of the message rather than the
spamminess.
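
The weighted distance and the plain-text half of the score described above could be sketched like this in Python. The cost values, the `max_distance` cutoff, and the choice of the message word for `<word_length>` are assumptions (the post does not give them), and the Metaphone term is left out because it needs a third-party library:

```python
def weighted_distance(a, b, sub_cost=2, indel_cost=1):
    """Wagner-Fischer dynamic programming with a heavier substitution cost.

    sub_cost and indel_cost are assumed values, not the ones used in the post.
    """
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + indel_cost,                          # delete ca
                cur[j - 1] + indel_cost,                       # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),   # match / replace
            ))
        prev = cur
    return prev[-1]


def word_term(word, spam_word, max_distance=2):
    """<word_length> ^ 2 / (<distance> + 1), or 0.0 if the words are too far apart."""
    d = weighted_distance(word, spam_word)
    if d > max_distance:
        return 0.0
    return len(word) ** 2 / (d + 1)


print(weighted_distance("viagrra", "viagra"))  # doubled letter: one deletion
print(weighted_distance("viagna", "viagra"))   # replaced letter: costs more
print(word_term("viagrra", "viagra"))
```

Each comparison fills a table of len(a) x len(b) cells, and it is repeated for every long word in the message against every word in the list, so the work grows quickly with message length.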

Processor time was also a problem; the largest message scanned took
over 23 minutes to process.  The quickest was under 3 seconds, but the
average was around 45 seconds, with ham taking much longer to process
than spam.

Running either test individually -- the plain text W-F distance or the
metaphone W-F distance -- did not show an appreciable improvement in
the accuracy of the algorithm, although the processing time improved.

It's too bad this won't work, although if someone else wants to take a
crack at it, I'd be happy to share my code, word lists, etc.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

On Thu, 5 Oct 2006, Chris St. Pierre wrote:

>One thing I've wondered/thought about is using the Levenshtein
>distance between the words in an email and a list of spam words
>(ideally pulled from the bayes db).  In this case, all of the
>misspelled words in that sample have an L-distance of 1 from the real
>word -- in other words, they're *very* close.
>
>I think the problem would be that this would consume tons of
>resources.  Anything else, though, would be susceptible to other typo
>attacks.  For instance, say you took each email, and replaced all
>doubled letters with single letters, it wouldn't be long before you
>were getting spam advertising "analr bictches" or the like.
>
>Chris St. Pierre
>Unix Systems Administrator
>Nebraska Wesleyan University
>
>On Wed, 4 Oct 2006, Eric A. Hall wrote:
>
>>
>>On 10/4/2006 5:57 PM, Richard Doyle wrote:
>>> I've been getting lots of porn site spam containing words with doubled
>>> letters, like this one:
>>
>>> Can anybody suggest a rule or ruleset to catch these double-letter
>>> obfuscations? I'm using Spamassassin 3.1.4.
>>
>>You'd probably need to write a plug-in that used some kind of
>>typo-matching logic to find porno words.
>>
>>Would be a good plug-in actually. Get busy :)
>>
>>-- 
>>Eric A. Hall                                        http://www.ehsco.com/
>>Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/
>>
>

Re: double letter porn

Posted by Evan Platt <ev...@espphotography.com>.

At 01:22 PM 10/5/2006, you wrote:
>I think the problem would be that this would consume tons of
>resources.  Anything else, though, would be susceptible to other typo
>attacks.  For instance, say you took each email, and replaced all
>doubled letters with single letters, it wouldn't be long before you
>were getting spam advertising "analr bictches" or the like.

Fortunately, we've gotten to the point where spammers have to 
misspell everything for their spam to get through, or basically 
include NOTHING but a hyperlink, or both. For example, I see quite a 
bit of crap with
"Hi.

ViiiIaaaAgGGRAAA
CIIIAAALLLIS

http://www.blahblah.com"

Of course, I disable HTML and all that crap, so the spam could look 
'normal' in HTML.

Then of course, there's the totally misspelled crap, "We offar tawp 
kwality roooleks washes".

Re: double letter porn

Posted by "Chris St. Pierre" <st...@NebrWesleyan.edu>.

One thing I've wondered/thought about is using the Levenshtein
distance between the words in an email and a list of spam words
(ideally pulled from the bayes db).  In this case, all of the
misspelled words in that sample have an L-distance of 1 from the real
word -- in other words, they're *very* close.

I think the problem would be that this would consume tons of
resources.  Anything else, though, would be susceptible to other typo
attacks.  For instance, if you took each email and replaced all
doubled letters with single letters, it wouldn't be long before you
were getting spam advertising "analr bictches" or the like.
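
To make the "replace doubled letters with single letters" idea concrete, here is a minimal Python sketch. The word list and the function names are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical word list, for illustration only.
SPAM_WORDS = {"orgy", "porn", "viagra", "cialis"}


def collapse_repeats(text):
    """Replace any run of the same letter with a single copy of that letter."""
    return re.sub(r"([a-z])\1+", r"\1", text, flags=re.IGNORECASE)


# Collapse the list too, so listed words that legitimately contain
# doubled letters would still compare equal after collapsing.
COLLAPSED_WORDS = {collapse_repeats(w) for w in SPAM_WORDS}


def is_repeat_obfuscated(word):
    """True if `word` is not a listed word as written but becomes one once repeats are collapsed."""
    lowered = word.lower()
    return lowered not in SPAM_WORDS and collapse_repeats(lowered) in COLLAPSED_WORDS


for word in ["pornn", "Orrgy", "porn", "pcorn"]:
    print(word, is_repeat_obfuscated(word))
```

The last sample word shows the weakness mentioned above: a stray inserted letter is not a repeat, so collapsing does nothing and the word slips past.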

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

On Wed, 4 Oct 2006, Eric A. Hall wrote:

>
>On 10/4/2006 5:57 PM, Richard Doyle wrote:
>> I've been getting lots of porn site spam containing words with doubled
>> letters, like this one:
>
>> Can anybody suggest a rule or ruleset to catch these double-letter
>> obfuscations? I'm using Spamassassin 3.1.4.
>
>You'd probably need to write a plug-in that used some kind of
>typo-matching logic to find porno words.
>
>Would be a good plug-in actually. Get busy :)
>
>-- 
>Eric A. Hall                                        http://www.ehsco.com/
>Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/
>

Re: double letter porn

Posted by Ken <ka...@pacific.net>.

John D. Hardin wrote:
> On Wed, 4 Oct 2006, Eric A. Hall wrote:
>
>   
>> On 10/4/2006 5:57 PM, Richard Doyle wrote:
>>     
>>> I've been getting lots of porn site spam containing words with doubled
>>> letters, like this one:
>>>       
>>> Can anybody suggest a rule or ruleset to catch these double-letter
>>> obfuscations? I'm using Spamassassin 3.1.4.
>>>       
>> You'd probably need to write a plug-in that used some kind of
>> typo-matching logic to find porno words.
>>     
>
> /\bss?ee?xx?\b/i
> /\boo?rr?gg?yy?\b/i
> /\boo?rr?gg?ii?ee?ss?\b/i
>
>   

Seeing the same here: some targeted porn spam with doubled-up letters in 
the subject, usually scoring 2-3 due to various SA tests on Received: lines, 
with very short (2-line) bodies and URLs that are not yet listed on SURBL, 
URIBL, or DOB (day old bread). Typically they also include somewhat odd 
adjectives, like audacious, immaculate, etc. I've just been reacting 
with rules similar to what is suggested above, with some success, but it's 
got me wondering if there isn't another list that I can find these on.
Ken  Anderson
> etc...
>
> --
>  John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
>  jhardin@impsec.org    FALaholic #11174    pgpk -a jhardin@impsec.org
>  key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
>  [Small arms] are fundamentally dangerous and their removal from the
>  equation either by control, neutralisation or removal is essential.
>  The first step is to gain information on their numbers and
>  whereabouts.         -- the UN, who "doesn't want to confiscate guns"
> -----------------------------------------------------------------------
>
>

Re: double letter porn

Posted by "John D. Hardin" <jh...@impsec.org>.

On Wed, 4 Oct 2006, Eric A. Hall wrote:

> On 10/4/2006 5:57 PM, Richard Doyle wrote:
> > I've been getting lots of porn site spam containing words with doubled
> > letters, like this one:
> 
> > Can anybody suggest a rule or ruleset to catch these double-letter
> > obfuscations? I'm using Spamassassin 3.1.4.
> 
> You'd probably need to write a plug-in that used some kind of
> typo-matching logic to find porno words.

/\bss?ee?xx?\b/i
/\boo?rr?gg?yy?\b/i
/\boo?rr?gg?ii?ee?ss?\b/i

etc...
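
Patterns of this shape can be generated from a word list instead of being typed by hand. A minimal Python sketch follows; `doubled_letter_pattern` and the word list are hypothetical names, not part of SpamAssassin:

```python
import re


def doubled_letter_pattern(word):
    """Build a case-insensitive regex where each letter may appear once or twice."""
    body = "".join(re.escape(ch) + re.escape(ch) + "?" for ch in word)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


# Hypothetical word list, for illustration only.
WORDS = ["sex", "orgy", "orgies"]
PATTERNS = {w: doubled_letter_pattern(w) for w in WORDS}

sample = "Orrgy pornn parrties! archiive of group_ssex materiall!"
for word, pattern in PATTERNS.items():
    match = pattern.search(sample)
    print(word, pattern.pattern, match.group(0) if match else None)
```

Two things to keep in mind with this form: it also matches the correctly spelled word, and because "_" counts as a word character, `\b` does not match between "_" and "s", so "group_ssex" in the sample is not hit by the "sex" pattern.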

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174    pgpk -a jhardin@impsec.org
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 [Small arms] are fundamentally dangerous and their removal from the
 equation either by control, neutralisation or removal is essential.
 The first step is to gain information on their numbers and
 whereabouts.         -- the UN, who "doesn't want to confiscate guns"
-----------------------------------------------------------------------

Re: double letter porn

Posted by "Eric A. Hall" <eh...@ehsco.com>.

On 10/4/2006 5:57 PM, Richard Doyle wrote:
> I've been getting lots of porn site spam containing words with doubled
> letters, like this one:

> Can anybody suggest a rule or ruleset to catch these double-letter
> obfuscations? I'm using Spamassassin 3.1.4.

You'd probably need to write a plug-in that used some kind of
typo-matching logic to find porno words.

Would be a good plug-in actually. Get busy :)

-- 
Eric A. Hall                                        http://www.ehsco.com/
Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/

Re: double letter porn

Posted by "John D. Hardin" <jh...@impsec.org>.

On Wed, 18 Oct 2006, NFN Smith wrote:

> Can anybody who has more experience in this area tell me of potential 
> problems to this approach?

It sounds terribly inefficient and overly complex. You should probably
be using negative lookahead assertions. For example, I have an
obfuscated-word-rule generator that generates tests like this:

# cialis @              3.0
describe        OBFU_WRD_021    obfuscated "cialis"
body    OBFU_WRD_021    /\b(?!cialis)(?:[c\xA2\xA9\xAB\xC7\xE7]|&\#(?:67|99);)(?:[i!l1\|\/\xA1\xCC-\xCF\xEC-\xEF]|&i[a-z]+;)(?:[a4\@\xC0-\xC6\xE0-\xE6]|\/\\|&a[a-z]+;)(?:[l1i!\|\xCC-\xCF]|(\|_)|&\#(?:76|108);)(?:[i!l1\|\/\xA1\xCC-\xCF\xEC-\xEF]|&i[a-z]+;)(?:[s5z\$\xA6\xA7\xA8]|&\#(?:83|115);)/i
score   OBFU_WRD_021                    3.0

Note the (?!cialis) bit? That means "don't try the rest if it matches
'cialis'".

n.b.: I am refining this tool to include double-letter obfuscation.
I'll publish a link when that's done. It's a perl script that works
against a word+score file to generate these rules.
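
For readers who want to see the negative lookahead combined with repeated letters, here is a small Python sketch. It is not the generator described above; `obfuscated_only_pattern` is a hypothetical helper, and it only handles repeated letters, not character substitutions:

```python
import re


def obfuscated_only_pattern(word):
    """Match `word` with any letter repeated, but not the plain spelling itself."""
    # Each letter may occur one or more times.
    body = "".join(re.escape(ch) + "+" for ch in word)
    # The lookahead rejects the position if the correctly spelled word starts here.
    return re.compile(
        r"\b(?!" + re.escape(word) + r"\b)" + body + r"\b",
        re.IGNORECASE,
    )


pattern = obfuscated_only_pattern("cialis")
for text in ["buy cialis now", "buy Cialis now", "buy ciaalis now", "CIIIAAALLLIS"]:
    match = pattern.search(text)
    print(repr(text), match.group(0) if match else None)
```

Because the correct spelling is excluded inside the one pattern, no second rule is needed to cancel the score for legitimate uses of the word.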

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174    pgpk -a jhardin@impsec.org
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  ...the Fates notice those who buy chainsaws...
                                              -- www.darwinawards.com
-----------------------------------------------------------------------
 13 days until Halloween

Re: double letter porn

Posted by NFN Smith <wo...@sacbeemail.com>.

Richard Doyle wrote:
> I've been getting lots of porn site spam containing words with doubled
> letters, like this one:
> 

I was looking at this one yesterday, and thought of a different 
approach.  It may be a little kludgy, but it seems to work on some basic 
tests.

For this, I'm starting with a list of words that are commonly misspelled 
with double characters.

I start with a rule that looks for these words, with correct spelling, 
and score a hit with 0.01 points.  Call this the strict rule.

I then do a second rule that looks for the same words, but with regexp 
wildcarding that looks for the pattern characters in the word and still 
matches if there's other stuff between them -- either b*a*d*w*o*r*d or 
b.?a.?d.?w.?o.?r.?d.  A hit on this rule generates a very high score, 
say 100 points.  Call this the loose rule.

Finally, I create a meta rule that includes both the strict rule and the 
loose rule.  If I get a hit there (that is, where I have hits on both 
the other rules), it means that the word is correctly spelled, and a hit 
on the meta rule generates the negative of whatever score was applied to 
the loose rule, canceling it out.

If only the loose rule is hit, then the word is misspelled (presumably 
deliberately), and the high score is retained.
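
The strict/loose/meta arrangement can be simulated outside SpamAssassin to see how the scores combine. This Python sketch uses the scores mentioned above and the b.?a.?d form of the loose pattern; the function names and the test word are assumptions:

```python
import re

STRICT_SCORE = 0.01   # correct spelling
LOOSE_SCORE = 100.0   # letters present, possibly with extra characters between them


def build_rules(word):
    """Return (strict, loose) compiled patterns for one word."""
    strict = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    loose_body = ".?".join(re.escape(ch) for ch in word)
    loose = re.compile(r"\b" + loose_body + r"\b", re.IGNORECASE)
    return strict, loose


def score(text, word):
    """Add up the strict, loose, and meta contributions for one word."""
    strict, loose = build_rules(word)
    strict_hit = strict.search(text) is not None
    loose_hit = loose.search(text) is not None
    total = 0.0
    if strict_hit:
        total += STRICT_SCORE
    if loose_hit:
        total += LOOSE_SCORE
    if strict_hit and loose_hit:
        total -= LOOSE_SCORE  # meta rule cancels the loose score
    return total


for text in ["an orgy of colour", "Orrgy parties", "orgy and orrgy"]:
    print(repr(text), round(score(text, "orgy"), 2))
```

In this sketch, the last sample scores the same as a clean message: one correctly spelled occurrence anywhere in the text triggers the meta rule and cancels the loose score, even though a misspelled copy is also present.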

I haven't yet tested how this approach works on messages that may have 
multiple words that are deliberately misspelled, but with just a single 
word and basic testing, I'm pleased with the initial results.  In 
particular, this seems to allow me to accept words that are often 
legitimate when correctly spelled, but have high probability of spam 
(and likely offensive) if misspelled.

Can anybody who has more experience in this area tell me of potential 
problems with this approach?

Smith

RE: double letter porn

Posted by Bret Miller <br...@wcg.org>.

> I've been getting lots of porn site spam containing words with doubled
> letters, like this one:
> 
> ================================================
> Orrgy pornn parrties! Lotts of
> sttupid bitchees gangbangged by queue of guyss.
>  annal_nailing and cum__swallowing orgiees.
>  archiive of group_ssex materiall!
> http://www.teens229mx.com/?lcajuryrpdbejn
> ================================================
> 
> Most of these hit razor2, and www.teens???mx.com sooner-or-later show up
> on the SURBL and URIBL lists, but nothing seems to catch the misspelled
> words.
> 
> Can anybody suggest a rule or ruleset to catch these double-letter
> obfuscations? I'm using Spamassassin 3.1.4.

Network tests...

That hit URIBL_BLACK and the SURBL JP and OB tests.

I'm sure a rule *could* be written, but those are common double-letter
combinations, so it would be a bit more difficult than it seems.

Bret