Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/08 12:35:34 UTC

[Bug 2908] New: Use bayes translation to decrease effectiveness of intentional misspellings

http://bugzilla.spamassassin.org/show_bug.cgi?id=2908

           Summary: Use bayes translation to decrease effectiveness of
                    intentional misspellings
           Product: Spamassassin
           Version: 2.61
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: spamassassin
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: cmt-spamassassin@someone.dhs.org


The latest crop of spam I receive contains misspellings of spam-sign words, such
as generic, viagra, paris, hilton.  Some simple examples of permutations I
receive are geenric vvvaigraa ppariis hilllton.  To counteract this, I have
written a simple modification to sub tokenize_line in Bayes.pm.

pseudocode:

(For each non-header token)
  Strip sk: prefix from token if it was added previously
  Remove all non-alpha characters
  Force token to lowercase (I have no idea if this is a good idea)
  Sort the characters in the string (bananas => aaabnns)
  Prepend sk: to string if we stripped it
  Add new token to bayes token list
  Strip any repeated characters (aaabnns => abns)
  Add new token to bayes token list
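
For readers who want to try the idea outside of SpamAssassin, the steps above
can be sketched in Python. This is not the actual Perl change to
tokenize_line in Bayes.pm; the function name translate_token is made up for
illustration, and it returns the two new tokens rather than adding them to a
token list.

```python
import re

def translate_token(token):
    """Return (sorted_form, deduplicated_form) for one non-header token.

    Hypothetical helper that mirrors the pseudocode; it is not part of
    SpamAssassin.
    """
    # Strip the "sk:" prefix if it was added previously, and remember it.
    had_prefix = token.startswith("sk:")
    if had_prefix:
        token = token[len("sk:"):]

    # Remove all non-alpha characters, then force lowercase.
    letters = re.sub(r"[^A-Za-z]", "", token).lower()

    # Sort the characters in the string (bananas => aaabnns).
    sorted_form = "".join(sorted(letters))

    # Strip repeated characters (aaabnns => abns). A sorted set of the
    # letters gives the same result as collapsing runs in the sorted string.
    deduplicated_form = "".join(sorted(set(letters)))

    # Prepend "sk:" again if we stripped it.
    prefix = "sk:" if had_prefix else ""
    return prefix + sorted_form, prefix + deduplicated_form


if __name__ == "__main__":
    for word in ["generic", "geenric", "viagra", "vvvaigraa"]:
        print(word, translate_token(word))
```

Because the second token ignores both the order and the repetition of letters,
any two words built from the same set of letters collapse to the same token,
which is the point of the scheme but may also merge unrelated words.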

This has the effect that the words translate as such:

generic, viagra, paris, hilton
debug: BAYES TRANSLATE: generic: ceeginr, ceginr
debug: BAYES TRANSLATE: viagra: aagirv, agirv
debug: BAYES TRANSLATE: paris: aiprs, aiprs
debug: BAYES TRANSLATE: hilton: hilnot, hilnot

geenric vvvaigraa ppariis hilllton
debug: BAYES TRANSLATE: geenric: ceeginr, ceginr
debug: BAYES TRANSLATE: vvvaigraa: aaagirvvv, agirv
debug: BAYES TRANSLATE: ppariis: aiipprs, aiprs
debug: BAYES TRANSLATE: hilllton: hilllnot, hilnot

In my Bayes database, agirv, aiprs, and hilnot all score very high; ceginr
scores neutrally.




Re: [Bug 2908] New: Use bayes translation to decrease effectiveness of intentional misspellings

Posted by Marc Perkel <ma...@perkel.com>.
Here's something I'm doing to catch misspellings.

I have a list of about 100 words that are commonly misspelled on purpose. I
first remove every correctly spelled occurrence of the words on this list.
Then I translate characters (@ to a, 0 to o, 1 to i, etc.) and remove all
punctuation and space characters. Finally, I check for the listed words
again in the spell-corrected text, and if there's a match, it's spam.
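
A rough Python sketch of that procedure follows, under several assumptions:
the two-word WATCH_WORDS list stands in for the real list of about 100 words,
only the three substitutions named above are included (the "etc." is left
unfilled), and "spell correcting" is taken to mean nothing more than the
character translation plus stripping. The name looks_obfuscated is made up
for illustration.

```python
import re

# Hypothetical stand-in for the list of about 100 words.
WATCH_WORDS = ["viagra", "generic"]

# Only the substitutions named in the post: @ -> a, 0 -> o, 1 -> i.
CHAR_MAP = str.maketrans({"@": "a", "0": "o", "1": "i"})

def looks_obfuscated(text):
    """Return True if a watched word appears only in disguised form."""
    text = text.lower()

    # Step 1: remove occurrences that are already correctly spelled.
    for word in WATCH_WORDS:
        text = re.sub(r"\b" + re.escape(word) + r"\b", " ", text)

    # Step 2: translate look-alike characters back to letters.
    text = text.translate(CHAR_MAP)

    # Step 3: remove all punctuation and space characters.
    text = re.sub(r"[\W_]+", "", text)

    # Step 4: check for the watched words again in what is left.
    return any(word in text for word in WATCH_WORDS)


if __name__ == "__main__":
    print(looks_obfuscated("buy v.1.@.g.r.a now"))
    print(looks_obfuscated("generic viagra"))
```

Since step 3 joins neighbouring words together, a watched word can also be
formed by accident across a word boundary, so a sketch like this would need
tuning against real mail before being trusted.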