Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/10/10 18:09:44 UTC

Re: double letter porn

hi Chris --
Sorry to hear it didn't work out -- but thanks for the great analysis!

--j.

Chris St. Pierre writes:
> If anyone's curious, I did some followup research on the ideas below
> and found them to be, generally, totally infeasible.
> 
> I downloaded the TREC corpus and generated a list of words that
> commonly appeared in spam.  I used the top 1000 most common words of
> greater than four letters in the TREC spam that were NOT in the top
> 1000 most common >4 letter words in the TREC ham.
> 
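The wordlist step described above could be sketched roughly as follows. This is an illustration only, not Chris's actual code: the function names, the `[a-z]+` tokenizer, and the reading that the result is "top-n spam words minus top-n ham words" (so possibly fewer than n entries) are all assumptions.

```python
import re
from collections import Counter

def top_words(texts, n=1000, min_len=5):
    """Return the n most common words with at least min_len letters.

    min_len=5 corresponds to "greater than four letters".
    The [a-z]+ tokenizer is an assumption; the post does not say
    how words were split.
    """
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) >= min_len)
    return [w for w, _ in counts.most_common(n)]

def spam_wordlist(spam_texts, ham_texts, n=1000):
    """Top-n spam words that do not appear among the top-n ham words."""
    ham_top = set(top_words(ham_texts, n))
    return [w for w in top_words(spam_texts, n) if w not in ham_top]
```

Here `spam_texts` and `ham_texts` stand for any iterables of message bodies; loading the TREC corpus itself is left out.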
> I then did two sets of tests on a few sample hams and spams, and the
> results convinced me that it was not even necessary to run the tests
> on the whole corpus.
> 
> For each message, I compared each word of greater than four letters
> with each word in my spam wordlist with the Wagner-Fischer distance, a
> slightly modified Levenshtein distance.  With W-F, I was able to give
> greater weight to letter replacements, so "viagna" would be further
> from "viagra" than, say, "viagrra."  I also compared the Metaphone
> representation of each word of >4 letters with the Metaphone hashes of
> each word in my spam wordlist, again with Wagner-Fischer.  I discarded
> those distances that were too high and then computed a score for each
> message with the following formula:
> 
> <metaphone_length> ^ 2 / (<metaphone_distance> + 1) + 
>      <word_length> ^ 2 / (<distance> + 1)
> 
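For readers who want to experiment, here is a minimal sketch of a weighted Wagner-Fischer distance and the per-pair score above. The cost values (1.5 for a replacement, 1.0 for an insertion or deletion) are assumptions chosen only so that replacements weigh more, as described; the post does not give the real weights. Metaphone encoding is not implemented here, so the Metaphone length and distance are passed in by the caller.

```python
def weighted_distance(a, b, sub_cost=1.5, indel_cost=1.0):
    """Wagner-Fischer edit distance with separate costs for
    substitutions and for insertions/deletions (assumed values)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                         # delete ca
                curr[j - 1] + indel_cost,                     # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]

def pair_score(word_length, distance, metaphone_length, metaphone_distance):
    """Score for one (message word, spam word) pair, per the formula above."""
    return (metaphone_length ** 2 / (metaphone_distance + 1)
            + word_length ** 2 / (distance + 1))
```

With these assumed costs, `weighted_distance("viagna", "viagra")` is 1.5 and `weighted_distance("viagrra", "viagra")` is 1.0, matching the ordering described. The post does not say how the per-pair scores were combined into a message score or which word's length was used, so neither is shown here.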
> I ran this on the first ten spams and hams in the corpus.  The mean
> score for spams was 365.7 and the median was 12.5; the mean score for
> hams was 3715.565 and the median was 1103.6.  More than anything, the
> results seem to indicate the length of the message rather than the
> spamminess.
> 
> Processor time was also a problem; the largest message scanned took
> over 23 minutes to process.  The quickest was under 3 seconds, but the
> average was around 45 seconds, with ham taking much longer to process
> than spam.
> 
> Running either test individually -- the plain text W-F distance or the
> metaphone W-F distance -- did not show an appreciable improvement in
> the accuracy of the algorithm, although the processing time improved.
> 
> It's too bad this won't work, although if someone else wants to take a
> crack at it, I'd be happy to share my code, word lists, etc.
> 
> Chris St. Pierre
> Unix Systems Administrator
> Nebraska Wesleyan University
> 
> On Thu, 5 Oct 2006, Chris St. Pierre wrote:
> 
> >One thing I've wondered/thought about is using the Levenshtein
> >distance between the words in an email and a list of spam words
> >(ideally pulled from the bayes db).  In this case, all of the
> >misspelled words in that sample have a L-distance of 1 from the real
> >word -- in other words, they're *very* close.
> >
> >I think the problem would be that this would consume tons of
> >resources.  Anything else, though, would be susceptible to other typo
> >attacks.  For instance, if you took each email and replaced all
> >doubled letters with single letters, it wouldn't be long before you
> >were getting spam advertising "analr bictches" or the like.
> >
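The doubled-letter replacement mentioned above is easy to sketch, which also makes its weakness easy to see. The function name is hypothetical and the regex approach is just one way to do it.

```python
import re

def collapse_doubles(text):
    """Replace any run of a repeated word character with a single one."""
    return re.sub(r"(\w)\1+", r"\1", text)
```

This turns "viagrra" into "viagra", but it also turns the legitimate "pills" into "pils", so any wordlist would have to be collapsed the same way. A word with an inserted stray letter, such as "viagrxa", passes through unchanged, which is the kind of follow-on typo attack described.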
> >Chris St. Pierre
> >Unix Systems Administrator
> >Nebraska Wesleyan University
> >
> >On Wed, 4 Oct 2006, Eric A. Hall wrote:
> >
> >>
> >>On 10/4/2006 5:57 PM, Richard Doyle wrote:
> >>> I've been getting lots of porn site spam containing words with doubled
> >>> letters, like this one:
> >>
> >>> Can anybody suggest a rule or ruleset to catch these double-letter
> >>> obfuscations? I'm using SpamAssassin 3.1.4.
> >>
> >>You'd probably need to write a plug-in that used some kind of
> >>typo-matching logic to find porno words.
> >>
> >>Would be a good plug-in actually. Get busy :)
> >>
> >>-- 
> >>Eric A. Hall                                        http://www.ehsco.com/
> >>Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/
> >>
> >