Posted to dev@spamassassin.apache.org by Erick Calder <e...@arix.com> on 2004/04/05 22:34:49 UTC

accented characters

hi everyone, I have antidrug.cf installed but spam is still sneaking
through on account of accented characters like: Vìgêl, vïágra and Cìális.

it seems to me that it would be useful to run an accent stripper (which can
be written in 2-3 lines of perl) on the contents before running the rules...
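
A minimal sketch of such an accent stripper, written in Python rather than
perl purely for illustration. The function name strip_accents is hypothetical,
and this is not code from antidrug.cf or SpamAssassin; it only shows the
decompose-then-drop-marks idea.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Vìgêl, vïágra and Cìális"))  # prints: Vigel, viagra and Cialis
```

This only handles letters that decompose into a base letter plus an accent;
digit or punctuation substitutions (0 for o, ! for i) need a separate mapping.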

my question is thus: I can write the perl, but how do I integrate it?

thanks - ekkis


Re: accented characters and other misspellings

Posted by Marc Perkel <ma...@perkel.com>.
If you look at the filter and the list of words at the bottom you'll see 
what I'm doing. This is a trick I'm using to detect deliberately 
misspelled words. I'm using Exim rules - but it would be better if coded 
in SA.

The idea is - at the bottom is a list of words spammers deliberately
misspell. What I do is grab the subject and the beginning of the
message. I remove all the words from the list that are spelled
correctly. Then - I get rid of spaces - translate characters - remove
white and gappy characters. This process "fixes" misspelled words. I
then look again for the words and if I find one - it's spam!
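
The steps just described can be sketched roughly as follows. This is a
simplified Python illustration, not the Exim filter itself: the word list
WORDS, the LOOKALIKES table and the function name are all assumptions made
for the example, and the hard-space marking that the filter performs is
left out.

```python
import re

# Hypothetical short word list; the real one is the list at the bottom.
WORDS = ["viagra", "cialis", "young", "girl"]

# Assumed small subset of look-alike characters mapped back to plain letters.
LOOKALIKES = str.maketrans("àáâäìíîïèéêë@1!03$", "aaaaiiiieeeeaiioes")

def looks_deliberately_misspelled(text: str) -> bool:
    pattern = re.compile("|".join(map(re.escape, WORDS)))
    text = text.lower()
    # 1. Remove listed words that are spelled correctly, leaving Z as a separator.
    text = pattern.sub("Z", text)
    # 2. Translate look-alike characters back to the letters they imitate.
    text = text.translate(LOOKALIKES)
    # 3. Delete spaces, punctuation and anything else that is not a letter.
    text = re.sub(r"[^a-zA-Z]", "", text)
    # 4. A listed word that shows up only now was hidden by misspelling.
    return pattern.search(text) is not None

print(looks_deliberately_misspelled("y0ung g!rls"))  # True
print(looks_deliberately_misspelled("young girls"))  # False
```

Removing the correctly spelled words first is what lets "young girls" pass
while "y0ung g!rls" is caught.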

What I have here works REALLY well - but it would be better if it were 
part of SA.

I'm not a programmer - so - have at it!



###################################################
# This filter tests for misspelled words using punctuation
# y0ung g!rls - but not young girls

# First I try to separate real words by changing the spaces into X so that
# when I remove spaces - prohibited words aren't created by joining
# unrelated words. It keeps phrases like "this alert" from
# becoming "thi[sale]rt". Any space after 4 or more characters from a-z
# is considered to be a hard space as opposed to gappy text.

headers add "X-Temp0: ${sg{${lc:$h_subject:${substr_0_180:$message_body}}}\
{\\N([a-z]{4,}) \\N}{\\N$1X\\N}}"

headers add "X-Temp1: ${sg{$h_X-Temp0:}\
{\\N ([a-z]{4,})\\N}{\\NQ$1\\N}}"

headers remove "X-Temp0:"

# Then we remove all properly spelled words from the subject and store it
# in X-Temp2 leaving only deliberately misspelled words.
# I use Z as a word separator when removing a word so that words
# running together don't form other words in the list.

headers add "X-Temp2: ${sg{$h_X-Temp1:}\
{\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/misspell}{|}}}{\\\\|+}{|}}}{#.*?\\\\|}{}}}{\\\\|\\$}{}}\x29}{Z}}" 


# Then we translate characters into other characters the way spammers do
# (0-o 1-i !-i), and spaces and punctuation are deleted, correcting the spelling

headers add "X-Temp3: ${sg{${tr{$h_X-Temp2:}\
{àáâãäåèéëìíîïòóôõöùúûüýÿñ×@1!03\\$#-:_*=,.%^~`;|/}\
{aaaaaaeeeiiiiooooouuuuyynxaiioes               }}}{ |<.*>}{}}"

# We then test it again to see if the prohibited words reappear after
# character translation and removal of junk characters. If so - it's spam.
# The new header is the flag indicating a positive match which is
# passed on to Spam Assassin for scoring.

if "$h_X-Temp3:" matches 
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/misspell}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29 

then
  headers add "X-Temp-Misspell: YES"
endif


###################################################
# Tests SUBJECT for PHRASES - Low Points
# (lesbo|paris hilton|teensluts)

if "$h_X-Temp3:" matches 
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/subjectphrase1}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29 

then
  headers add "X-Temp-Subjectphrase1: YES"
endif


###################################################
# Tests SUBJECT for PHRASES - High Points
# (lesbo|paris hilton|teensluts)

if "$h_X-Temp3:" matches 
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/subjectphrase2}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29 

then
  headers add "X-Temp-Subjectphrase2: YES"
endif


###################################################
# Tests Deliberately Misspelled Words
# (v1agra|v i a g r a)

if "$h_X-Temp3:$message_body" matches 
\x28${sg{${sg{${sg{${readfile{/etc/exim/lists/blockspelling}{|}}}{\\|+}{|}}}{#.*?\\|}{}}}{\\|\$}{}}\x29 

then
  headers add "X-Temp-Spelling: YES"
endif

# Finally - we get rid of headers used for temporary variables.

headers remove "X-Temp1:"
headers remove "X-Temp2:"
headers remove "X-Temp3:"
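
For readers who don't read Exim expansions easily, the hard-space marking
done in X-Temp0 and X-Temp1 at the top of the filter can be approximated in
Python as below. This is a sketch under the assumption that the two sg
substitutions behave like global regex replacements on lowercased text; the
function name mark_hard_spaces is made up for the example.

```python
import re

def mark_hard_spaces(text: str) -> str:
    """Replace spaces next to real words (4+ letters) with marker letters."""
    text = text.lower()
    # First pass (X-Temp0): a space that follows 4 or more letters becomes X.
    text = re.sub(r"([a-z]{4,}) ", r"\1X", text)
    # Second pass (X-Temp1): a space that precedes 4 or more letters becomes Q.
    text = re.sub(r" ([a-z]{4,})", r"Q\1", text)
    return text

print(mark_hard_spaces("this alert"))   # thisXalert
print(mark_hard_spaces("v i a g r a"))  # v i a g r a
```

With the marker in place, deleting the remaining spaces later cannot turn
"this alert" into something containing "sale", while gappy text such as
"v i a g r a" is left alone and still collapses to the hidden word.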


adult
adv
assistence
attract
auction
banned
best
bitch
blowing
business
cable
cards
cartriges
cash
casino
celeb
cheap
cialis
click
confirmation
credit
cunt
debt
dick
digital
diploma
discount
doctor
dollar
domain
drug
earn
enlarge
extra
fast
feel
finance
free
fuck
generic
girl
grant
guaranteed
holiday
home
horny
hosting
housewives
incest
income
increase
inkjet
interest
judicial
lender
length
lesbian
link
loan
lolita
look
losing
lowest
market
married
medic
meds
money
month
more
mortgage
need
offer
online
opportunity
order
orgasm
paris
payment
penis
perscription
pharmacy
phentermine
pill
porn
price
program
quotes
rape
rate
regist
remove
sale
scream
shocked
size
spam
special
stock
super
suplies
today
track
university
urgent
vacation
valium
viagra
vicodin
visa
vitamin
voyeur
wholesale
winn
would
xanex
young
your
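
The filter turns a list file like the one above into a single regular
expression using nested readfile and sg expansions. A rough Python
equivalent of that chain is sketched here; build_pattern is a hypothetical
name, and the sample input (including its comment line) is invented for
the example.

```python
import re

def build_pattern(list_text: str) -> str:
    """Turn a one-word-per-line list into a (a|b|c) alternation."""
    # readfile{...}{|} replaces every newline with a pipe.
    joined = list_text.replace("\n", "|")
    # Collapse runs of pipes left behind by blank lines.
    joined = re.sub(r"\|+", "|", joined)
    # Drop comments: everything from # up to and including the next pipe.
    joined = re.sub(r"#.*?\|", "", joined)
    # Drop the trailing pipe left by the final newline.
    joined = re.sub(r"\|$", "", joined)
    # The filter wraps the result in parentheses (\x28 and \x29).
    return "(" + joined + ")"

print(build_pattern("# drugs\nviagra\n\ncialis\n"))  # (viagra|cialis)
```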