Posted to dev@spamassassin.apache.org by Erick Calder <e...@arix.com> on 2004/04/05 22:34:49 UTC
accented characters
hei everyone, I have antidrug.cf installed but am getting stuff sneaking
through on account of accented characters like: Vìgêl, vïágra and Cìális.
it seems to me that it would be useful to run an accent stripper (which can
be written in 2-3 lines of perl) on the contents before running the rules...
my question is thus: I can write the perl, but how do I integrate it?
thanks - ekkis
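An accent stripper of the kind described above might look roughly like this. It is a sketch in Python rather than Perl, relies only on the standard unicodedata module, and the function name strip_accents is made up for illustration. It does not answer how to hook such a step into SpamAssassin.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Vìgêl, vïágra and Cìális"))  # Vigel, viagra and Cialis
```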
Re: accented characters and other misspellings
Posted by Marc Perkel <ma...@perkel.com>.
If you look at the filter and the list of words at the bottom you'll see
what I'm doing. This is a trick I'm using to detect deliberately
misspelled words. I'm using Exim rules - but it would be better if coded
in SA.
The idea is this: at the bottom is a list of words spammers deliberately
misspell. What I do is grab the subject and the beginning of the
message. I remove all the words from the list that are spelled
correctly. Then I get rid of spaces, translate characters, and remove
white and gappy characters. This process "fixes" misspelled words. I
then look again for the words, and if I find one, it's spam!
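The steps just described can be sketched in a few lines of Python. This is only an illustration of the idea, not the Exim filter itself: the short WORDS list, the LOOKALIKES table and the function name looks_obfuscated are assumptions, and the sketch leaves out the "hard space" handling the filter uses to avoid joining unrelated words.

```python
import re

# Hypothetical short word list; the real list is much longer.
WORDS = ["viagra", "cialis", "young", "girl"]
WORD_RE = re.compile("(" + "|".join(re.escape(w) for w in WORDS) + ")")

# Assumed subset of look-alike substitutions (accents, digits, symbols).
LOOKALIKES = str.maketrans({
    "@": "a", "1": "i", "!": "i", "0": "o", "3": "e", "$": "s",
    "à": "a", "á": "a", "ä": "a", "è": "e", "é": "e", "ê": "e",
    "ì": "i", "í": "i", "ï": "i", "ò": "o", "ó": "o", "ö": "o",
})


def looks_obfuscated(text: str) -> bool:
    """True if a listed word appears only after the text is 'fixed'."""
    text = text.lower()
    # 1. Remove correctly spelled words, leaving "Z" as a separator so the
    #    neighbouring text cannot run together into another listed word.
    remainder = WORD_RE.sub("Z", text)
    # 2. Translate look-alike characters back to plain letters.
    remainder = remainder.translate(LOOKALIKES)
    # 3. Delete spaces, punctuation and anything else that is not a letter.
    remainder = re.sub(r"[^a-zZ]", "", remainder)
    # 4. Look again: a listed word that shows up now was disguised.
    return WORD_RE.search(remainder) is not None


for sample in ["y0ung g!rls", "young girls", "v i a g r a", "vïágra"]:
    print(sample, looks_obfuscated(sample))
```

Correctly spelled words are removed before the clean-up on purpose, so only disguised spellings trigger the flag.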
What I have here works REALLY well - but it would be better if it were
part of SA.
I'm not a programmer - so - have at it!
###################################################
# This filter tests for misspelled words using punctuation
# y0ung g!rls - but not young girls
# First I try to separate real words by changing the spaces into X so that
# when I remove spaces - prohibited words aren't created by joining
# unrelated words. It keeps phrases like "this alert" from
# becoming "thi[sale]rt". Any space after 4 characters from a-z
# is considered to be a hard space as opposed to gappy text.
headers add "X-Temp0: ${sg{${lc:$h_subject:${substr_0_180:$message_body}}}\
{\\N([a-z]{4,}) \\N}{\\N$1X\\N}}"
headers add "X-Temp1: ${sg{$h_X-Temp0:}\
{\\N ([a-z]{4,})\\N}{\\NQ$1\\N}}"
headers remove "X-Temp0:"
# Then we remove all properly spelled words from the subject and store it
# in X-Temp2 leaving only deliberately misspelled words.
# I use Z as a word separator when removing a word so that words
# running together don't form other words in the list.
headers add "X-Temp2: ${sg{$h_X-Temp1:}\
{\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/misspell}{|}}}{\\\\|+}{|}}}{#.*?\\\\|}{}}}{\\\\|\\$}{}}\x29}{Z}}"
# Then we translate characters into other characters the way spammers do
# (0-o 1-i !-i); spaces and punctuation are deleted, correcting the spelling
headers add "X-Temp3: ${sg{${tr{$h_X-Temp2:}\
{àáâãäåèéëìíîïòóôõöùúûüýÿñ×@1!03\\$#-:_*=,.%^~`;|/}\
{aaaaaaeeeiiiiooooouuuuyynxaiioes }}}{ |<.*>}{}}"
# We then test it again to see if the prohibited words reappear after
# character translation and removal of junk characters. If so - it's spam.
# The new header is the flag indicating a positive match which is
# passed on to Spam Assassin for scoring.
if "$h_X-Temp3:" matches
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/misspell}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29
then
headers add "X-Temp-Misspell: YES"
endif
###################################################
# Tests SUBJECT for PHRASES - Low Points
# (lesbo|paris hilton|teensluts)
if "$h_X-Temp3:" matches
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/subjectphrase1}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29
then
headers add "X-Temp-Subjectphrase1: YES"
endif
###################################################
# Tests SUBJECT for PHRASES - High Points
# (lesbo|paris hilton|teensluts)
if "$h_X-Temp3:" matches
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/subjectphrase2}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29
then
headers add "X-Temp-Subjectphrase2: YES"
endif
###################################################
# Tests Deliberately Misspelled Words
# (v1agra|v i a g r a)
if "$h_X-Temp3:$message_body" matches
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/blockspelling}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29
then
headers add "X-Temp-Spelling: YES"
endif
# Finally - we get rid of headers used for temporary variables.
headers remove "X-Temp1:"
headers remove "X-Temp2:"
headers remove "X-Temp3:"
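
The "hard space" step at the top of the filter is easiest to see in isolation. The Python sketch below mirrors the two substitutions that build X-Temp0 and X-Temp1, as far as they can be read from the filter: a space after four or more letters becomes X, and a space before four or more letters becomes Q. The function name mark_hard_spaces is hypothetical.

```python
import re


def mark_hard_spaces(text: str) -> str:
    """Replace spaces next to real-looking words with marker letters."""
    text = text.lower()
    # A space that follows 4+ letters is a hard space: mark it with X.
    text = re.sub(r"([a-z]{4,}) ", r"\1X", text)
    # A space that precedes 4+ letters is also hard: mark it with Q.
    text = re.sub(r" ([a-z]{4,})", r"Q\1", text)
    return text


# Without marking, deleting the space creates "sale" inside "thisalert".
print("sale" in "this alert".replace(" ", ""))
# With marking, the uppercase X keeps the two words apart.
print(mark_hard_spaces("this alert"))
# Gappy text has no 4-letter runs, so its spaces stay and are deleted later.
print(mark_hard_spaces("v i a g r a"))
```

Because the text is lowercased first, the uppercase markers can never be mistaken for part of a listed word.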
adult
adv
assistence
attract
auction
banned
best
bitch
blowing
business
cable
cards
cartriges
cash
casino
celeb
cheap
cialis
click
confirmation
credit
cunt
debt
dick
digital
diploma
discount
doctor
dollar
domain
drug
earn
enlarge
extra
fast
feel
finance
free
fuck
generic
girl
grant
guaranteed
holiday
home
horny
hosting
housewives
incest
income
increase
inkjet
interest
judicial
lender
length
lesbian
link
loan
lolita
look
losing
lowest
market
married
medic
meds
money
month
more
mortgage
need
offer
online
opportunity
order
orgasm
paris
payment
penis
perscription
pharmacy
phentermine
pill
porn
price
program
quotes
rape
rate
regist
remove
sale
scream
shocked
size
spam
special
stock
super
suplies
today
track
university
urgent
vacation
valium
viagra
vicodin
visa
vitamin
voyeur
wholesale
winn
would
xanex
young
your
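
The filter turns a list file like the one above into a single regular expression: one entry per line, joined with "|", with blank lines and "#" comments dropped. A rough Python equivalent is sketched below. The function name build_pattern is made up, and the comment handling is only approximately what the nested Exim substitutions do. As in the filter, entries are used as regex fragments and are not escaped, so an entry such as "regist" also matches "register".

```python
import re


def build_pattern(list_text: str) -> re.Pattern:
    """Compile a word list (one entry per line) into one alternation."""
    entries = []
    for line in list_text.splitlines():
        # Drop anything after "#", then surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if entry:  # skip blank lines and comment-only lines
            entries.append(entry)
    # Entries are deliberately left unescaped, as in the filter.
    return re.compile("(" + "|".join(entries) + ")")


sample_list = "viagra\n\n# prefixes\nregist\ncialis\n"
pattern = build_pattern(sample_list)
print(pattern.pattern)
print(bool(pattern.search("please register today")))
```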