Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/08 12:35:34 UTC
[Bug 2908] New: Use bayes translation to decrease effectiveness of intentional misspellings
http://bugzilla.spamassassin.org/show_bug.cgi?id=2908
Summary: Use bayes translation to decrease effectiveness of
intentional misspellings
Product: Spamassassin
Version: 2.61
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: spamassassin
AssignedTo: spamassassin-dev@incubator.apache.org
ReportedBy: cmt-spamassassin@someone.dhs.org
The latest crop of spam I receive contains misspellings of spam-sign words such
as generic, viagra, paris, and hilton. Some simple examples of the permutations
I receive are geenric, vvvaigraa, ppariis, and hilllton. To counteract this, I
have written a simple modification to sub tokenize_line in Bayes.pm.
Pseudocode:
  For each non-header token:
    Strip the sk: prefix from the token if it was added previously
    Remove all non-alpha characters
    Force the token to lowercase (I have no idea if this is a good idea)
    Sort the characters in the string (bananas => aaabnns)
    Prepend sk: to the string if we stripped it
    Add the new token to the bayes token list
    Strip any repeated characters (aaabnns => abns)
    Add the new token to the bayes token list
With this change, the words translate as follows:
generic, viagra, paris, hilton
debug: BAYES TRANSLATE: generic: ceeginr, ceginr
debug: BAYES TRANSLATE: viagra: aagirv, agirv
debug: BAYES TRANSLATE: paris: aiprs, aiprs
debug: BAYES TRANSLATE: hilton: hilnot, hilnot
geenric vvvaigraa ppariis hilllton
debug: BAYES TRANSLATE: geenric: ceeginr, ceginr
debug: BAYES TRANSLATE: vvvaigraa: aaagirvvv, agirv
debug: BAYES TRANSLATE: ppariis: aiipprs, aiprs
debug: BAYES TRANSLATE: hilllton: hilllnot, hilnot
In my bayes database, agirv, aiprs, and hilnot all score very high, while
ceginr scores neutrally.
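
The translation described above can be sketched in a few lines. This is an
illustrative Python version only, not the actual Perl change to tokenize_line
in Bayes.pm. The function name translate_token is hypothetical, and putting the
sk: prefix back on both derived tokens is an assumption about what the
pseudocode intends.

```python
import re

def translate_token(token):
    """Return (sorted_form, deduplicated_form) for one body token.

    Hypothetical helper, written only to illustrate the pseudocode.
    """
    # Remember and strip the "sk:" prefix if an earlier step added it.
    prefix = ""
    if token.startswith("sk:"):
        prefix = "sk:"
        token = token[3:]

    # Keep ASCII letters only, then force lowercase.
    letters = re.sub(r"[^A-Za-z]", "", token).lower()

    # Sort the characters: "bananas" becomes "aaabnns".
    sorted_form = "".join(sorted(letters))

    # Drop repeated characters: "aaabnns" becomes "abns".
    # dict.fromkeys keeps the first occurrence of each character in order.
    deduped = "".join(dict.fromkeys(sorted_form))

    # Assumption: both new tokens get the prefix back.
    return prefix + sorted_form, prefix + deduped


if __name__ == "__main__":
    for word in ("viagra", "vvvaigraa", "hilton", "hilllton"):
        print(word, translate_token(word))
```

One consequence of sorting is that anagrams collapse to the same token, so
paris and pairs both become aiprs. Whether that hurts accuracy in practice
would have to be measured against a real corpus.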
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
Re: [Bug 2908] New: Use bayes translation to decrease effectiveness
of intentional misspellings
Posted by Marc Perkel <ma...@perkel.com>.
Here's something I'm doing to catch misspellings.
I have a list of about 100 words that are commonly misspelled on purpose. I
first remove all the words that are spelled correctly, based on this
list. Then I translate characters: @ to a, 0 to o, 1 to i, etc. I then remove
all punctuation and space characters. Then I check for the misspelled words
again after spell-correcting them, and if there's a match, it's spam.
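
A rough sketch of these steps, in Python purely for illustration. The word
list, the character map, and the function name looks_obfuscated are all
hypothetical, and the final "spell correcting" step is interpreted here as a
plain substring search over the normalized text, which may not be exactly what
the real filter does.

```python
import re

# Hypothetical stand-in for the list of about 100 watched words.
WATCHED = {"viagra", "generic"}

# Character translations named in the message; the real table is longer.
SUBSTITUTIONS = str.maketrans({"@": "a", "0": "o", "1": "i"})

def looks_obfuscated(text, watched=WATCHED):
    """Return True if a watched word appears only in disguised form."""
    words = text.lower().split()

    # Step 1: drop words that are already spelled correctly.
    remaining = [w for w in words if w.strip(".,!?;:") not in watched]

    # Step 2: translate look-alike characters back to letters.
    translated = " ".join(remaining).translate(SUBSTITUTIONS)

    # Step 3: remove all punctuation and space characters.
    squeezed = re.sub(r"[\W_]+", "", translated)

    # Step 4: check for the watched words again in the squeezed text.
    return any(word in squeezed for word in watched)


if __name__ == "__main__":
    print(looks_obfuscated("buy v.i.@.g.r.a now"))
    print(looks_obfuscated("generic viagra"))
```

Because the spaces are removed before matching, a watched word can also be
formed across the boundary between two innocent words, so a check like this
is probably better used as one signal among several than as a sole verdict.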