Posted to users@spamassassin.apache.org by ma...@swetech.se on 2011/03/30 00:58:05 UTC

Spam

Recently I've been getting a lot of these mails with subjects like the one
below; they contain a link to some scam/Chinese crap factory.

I run the latest SpamAssassin along with amavis, but these mails keep
getting through. Any ideas?

Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat


Re: Spam

Posted by Per Jessen <pe...@computer.org>.
Adam Katz wrote:

> The multi-lingual dictionary that I use for this kind of purpose has
> 132 words that are 29+ characters.  Its longest word is 58 characters:
> Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is a large
> village on the Welsh island of Anglesey, see
> http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll for more.  Wikipedia
> also notes a hill in New Zealand (short name Taumata) with an even
> longer name.  The next longest word is
> pneumonoultramicroscopicsilicovolcanoconiosis with 45 letters.  German
> words, which I would have expected to take the cake, seem to be
> limited to 35 or so letters.

From:
http://german.about.com/library/blwort_long.htm

Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft


/Per Jessen, Zürich


Re: Spam

Posted by John Hardin <jh...@impsec.org>.
On Wed, 30 Mar 2011, RW wrote:

>>> On Wed, 2011-03-30 at 00:58 +0200, martin@swetech.se wrote:
>>>>
>>>> Re:
>>>> YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat
>
> The subjects have two separate characteristics: the length and the
> number of lower to upper case transitions. I score them separately and
> use:
>
> header SUBJ_LONG_WORD Subject =~ /\b[^[:space:][:punct:]]{30}/
> header SUBJ_ODD_CASE  Subject =~ /(?:[[:lower:]][[:upper:]].{0,15}){3}/

How about:

header SUBJ_RUNON Subject =~ /(?:[[:upper:]][[:lower:]]{2,15}[!:,'"]?){10}/

?
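
A quick way to sanity-check a rule like this before deploying it is to run
the pattern against the sample subject from this thread. The sketch below is
only an approximation in Python: it assumes the second class was meant to be
[[:lower:]], it swaps the POSIX classes for the ASCII ranges [A-Z] and [a-z]
(Python's re module has no POSIX classes), and the ham subject is made up.

```python
import re

# ASCII stand-ins for the POSIX classes (an assumption of this sketch):
# [[:upper:]] -> [A-Z], [[:lower:]] -> [a-z].
SUBJ_RUNON = re.compile(r"""(?:[A-Z][a-z]{2,15}[!:,'"]?){10}""")

# Sample subject quoted in the thread.
SPAM_SUBJECT = "Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat"
# Hypothetical ham subject, invented for illustration.
HAM_SUBJECT = "Re: Quarterly budget review moved to Thursday"


def is_runon(subject: str) -> bool:
    """Return True if ten capitalised words are run together without spaces."""
    return SUBJ_RUNON.search(subject) is not None


if __name__ == "__main__":
    print("spam:", is_runon(SPAM_SUBJECT))
    print("ham: ", is_runon(HAM_SUBJECT))
```

Spaces are not in the optional punctuation class, so ordinary Title Case
subjects with spaces between the words should not reach ten repetitions.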


Re: Spam

Posted by Adam Katz <an...@khopis.com>.
On 03/30/2011 01:23 PM, RW wrote:
> A lot of these long words are rarely used in the wild - other than
> to say how long they are.
> 
> The subjects have two separate characteristics: the length and the 
> number of lower to upper case transitions. I score them separately
> and use:
> 
> header SUBJ_LONG_WORD Subject =~ /\b[^[:space:][:punct:]]{30}/
> header SUBJ_ODD_CASE  Subject =~ /(?:[[:lower:]][[:upper:]].{0,15}){3}/

(Personally, I'd prefer to limit it to letters rather than also
including numbers, underscores, and special characters.)

There's also exaggerated text like aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaarg,
hahahahahahahahahahahahahahaha, lollllllllllllllllllllll!11111one,
intentional strings like goodluckwiththat, and suffixes like
"somethingorother" (as in "Mr. Rosensomethingorother").

I think my rule was a little more efficient at accomplishing something
similar.  John's was better named and is preferable except for the fact
that it still takes a while to parse (though at least it's limited to
just one line of each message).


Re: Spam

Posted by RW <rw...@googlemail.com>.
On Wed, 30 Mar 2011 09:16:09 -0700
Adam Katz <an...@khopis.com> wrote:

> On 03/29/2011 04:57 PM, Martin Gregorie wrote:
> > On Wed, 2011-03-30 at 00:58 +0200, martin@swetech.se wrote:
> >> Recently I've been getting a lot of these mails with subjects like
> >> the one below; they contain a link to some scam/Chinese crap factory.
> >>
> >> I run the latest SpamAssassin along with amavis, but these mails
> >> keep getting through. Any ideas?
> >>
> >> Re:
> >> YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat
> > 
> > Since the longest (English) word I know has 28 letters
> > (antidisestablishmentarianism), a private rule like:
> > 
> > header VERY_LONG_WORD  Subject =~ /Re:\s+\S{29}/
> > 
> > should catch that spam.
> 
> The multi-lingual dictionary that I use for this kind of purpose has
> 132 words that are 29+ characters.  Its longest word is 58 characters:
> Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is a large
> village on the Welsh island of Anglesey,   ...

A lot of these long words are rarely used in the wild - other than to
say how long they are. 

The subjects have two separate characteristics: the length and the
number of lower to upper case transitions. I score them separately and
use:

header SUBJ_LONG_WORD Subject =~ /\b[^[:space:][:punct:]]{30}/
header SUBJ_ODD_CASE  Subject =~ /(?:[[:lower:]][[:upper:]].{0,15}){3}/
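
To illustrate why the two characteristics are scored separately, here is a
rough Python translation of both patterns. It is a sketch under assumptions:
[:punct:] is approximated with Python's string.punctuation (ASCII only),
[:lower:] and [:upper:] with [a-z] and [A-Z], and the function name
rule_hits is made up for this example.

```python
import re
import string

# ASCII approximation of [^[:space:][:punct:]]: anything that is neither
# whitespace nor ASCII punctuation (an assumption of this sketch).
NOT_SPACE_OR_PUNCT = r"[^\s" + re.escape(string.punctuation) + "]"

SUBJ_LONG_WORD = re.compile(r"\b" + NOT_SPACE_OR_PUNCT + r"{30}")
SUBJ_ODD_CASE = re.compile(r"(?:[a-z][A-Z].{0,15}){3}")

# Sample subject quoted in the thread.
SPAM_SUBJECT = "Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat"
# A genuine 45-letter word mentioned in the thread, all lower case.
LONG_PLAIN_WORD = "Re: pneumonoultramicroscopicsilicovolcanoconiosis"


def rule_hits(subject: str) -> dict:
    """Report which of the two (translated) rules fire on a subject."""
    return {
        "SUBJ_LONG_WORD": SUBJ_LONG_WORD.search(subject) is not None,
        "SUBJ_ODD_CASE": SUBJ_ODD_CASE.search(subject) is not None,
    }


if __name__ == "__main__":
    print(rule_hits(SPAM_SUBJECT))
    print(rule_hits(LONG_PLAIN_WORD))
```

The spam subject trips both rules, while a real long word trips only the
length rule, so a legitimate long word collects just one of the two scores.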

Re: Spam

Posted by Adam Katz <an...@khopis.com>.
On 03/29/2011 04:57 PM, Martin Gregorie wrote:
> On Wed, 2011-03-30 at 00:58 +0200, martin@swetech.se wrote:
>> Recently I've been getting a lot of these mails with subjects like the one
>> below; they contain a link to some scam/Chinese crap factory.
>>
>> I run the latest SpamAssassin along with amavis, but these mails keep
>> getting through. Any ideas?
>>
>> Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat
> 
> Since the longest (English) word I know has 28 letters
> (antidisestablishmentarianism), a private rule like:
> 
> header VERY_LONG_WORD  Subject =~ /Re:\s+\S{29}/
> 
> should catch that spam.

The multi-lingual dictionary that I use for this kind of purpose has 132
words that are 29+ characters.  Its longest word is 58 characters:
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is a large
village on the Welsh island of Anglesey, see
http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll for more.  Wikipedia
also notes a hill in New Zealand (short name Taumata) with an even
longer name.  The next longest word is
pneumonoultramicroscopicsilicovolcanoconiosis with 45 letters.  German
words, which I would have expected to take the cake, seem to be limited
to 35 or so letters.

Maybe try this instead:

header VERY_LONG_WORD  Subject =~ /Re:\s+\w(?![a-z]{40})[A-Za-z]{40}/
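
The negative lookahead is what spares ordinary long words: a word whose
letters after the first are all lower case is skipped, while a mixed-case
run of 41+ letters still hits. A minimal Python check of that behaviour
follows; the pattern is copied as-is, and the function name hits is only
for illustration.

```python
import re

# Same pattern as the rule above, without the surrounding slashes.
VERY_LONG_WORD = re.compile(r"Re:\s+\w(?![a-z]{40})[A-Za-z]{40}")

# Sample subject quoted in the thread.
SPAM_SUBJECT = "Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat"
# The 58-character Welsh village name, capitalised only at the start.
WELSH_SUBJECT = "Re: Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"


def hits(subject: str) -> bool:
    """Return True if the subject starts a long mixed-case word after 'Re:'."""
    return VERY_LONG_WORD.search(subject) is not None


if __name__ == "__main__":
    print("spam: ", hits(SPAM_SUBJECT))
    print("welsh:", hits(WELSH_SUBJECT))
```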


If anybody is interested in the dictionary I use, this should be enough
to replicate it:

$ ls -lGg |sed 's/^.* 1 //; s/ ... .. ..... / /'
total 18M
 17M all
  32 american-english -> /usr/share/dict/american-english
  37 american-english-huge -> /usr/share/dict/american-english-huge
  39 american-english-insane -> /usr/share/dict/american-english-insane
 86K beale.wordlist.asc
  25 brazilian -> /usr/share/dict/brazilian
  36 british-english-huge -> /usr/share/dict/british-english-huge
  37 canadian-english-huge -> /usr/share/dict/canadian-english-huge
 86K diceware.wordlist.asc
1.6K expurgated
  22 french -> /usr/share/dict/french
  23 italian -> /usr/share/dict/italian
 135 make-all
  23 ngerman -> /usr/share/dict/ngerman
  23 ogerman -> /usr/share/dict/ogerman
  23 spanish -> /usr/share/dict/spanish
1.7M twl06.txt
  21 words -> /usr/share/dict/words
$ cat make-all
#!/bin/sh

( cat `ls |grep -Ev '^all|.wordlist.asc'`
  sed -r '/^[0-9]{5}\s+/!d; s///; /\w/!d' *.wordlist.asc
) |sort -f |uniq -i >all


Expurgated and twl06.txt are Scrabble dictionaries that you'll have to
find specifically.  The .wordlist.asc files are for diceware.
Everything else came from a Debian package.  If you're not a word nut
like me, all you really need is the largest of each of the languages,
plus perhaps the standard English dictionary so you can determine if
something is an edge case.

This made it really easy for me to verify the cialis-in-word problem we
had here earlier; `grep -ci cialis all` currently counts 287 words.
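
For anyone without grep to hand, the same two kinds of query (words
containing a substring, and words of 29+ characters) can be sketched in
Python. The word list below is a tiny made-up sample, not the real "all"
file, and both function names are hypothetical.

```python
def count_containing(words, needle):
    """Count entries containing needle, ignoring case (like grep -ci)."""
    needle = needle.lower()
    return sum(1 for word in words if needle in word.lower())


def words_at_least(words, min_len):
    """Return the entries that are at least min_len characters long."""
    return [word for word in words if len(word) >= min_len]


# Tiny illustrative sample; the real list would be read one word per line.
SAMPLE_WORDS = [
    "specialist",
    "Socialism",
    "official",
    "cat",
    "antidisestablishmentarianism",
    "pneumonoultramicroscopicsilicovolcanoconiosis",
]

if __name__ == "__main__":
    print(count_containing(SAMPLE_WORDS, "cialis"))
    print(words_at_least(SAMPLE_WORDS, 29))
```

With the real dictionary, the list could be built with something like
[line.strip() for line in open("all")] before calling either function.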


Re: Spam

Posted by "Lawrence @ Rogers" <la...@nl.rogers.com>.
On 29/03/2011 9:27 PM, Martin Gregorie wrote:
> On Wed, 2011-03-30 at 00:58 +0200, martin@swetech.se wrote:
>> Recently I've been getting a lot of these mails with subjects like the one
>> below; they contain a link to some scam/Chinese crap factory.
>>
>> I run the latest SpamAssassin along with amavis, but these mails keep
>> getting through. Any ideas?
>>
>> Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat
> Since the longest (English) word I know has 28 letters
> (antidisestablishmentarianism), a private rule like:
>
> header VERY_LONG_WORD  Subject =~ /Re:\s+\S{29}/
>
> should catch that spam.
>
>
> Martin
>
>
We started getting those spams about 6 months ago. What I did was come
up with a low-scoring rule that hits on this:

# Rule 1: check whether the Subject contains only numbers, letters, or
# common formatting (no spaces) and is 31 or more characters long
header LW_SUBJECT_SPAMMY  Subject =~ /^[0-9a-zA-Z,.+_\-'!\\\/]{31,}$/
describe LW_SUBJECT_SPAMMY Subject appears spammy (31 or more characters without spaces. Only numbers, letters, and formatting)
score  LW_SUBJECT_SPAMMY 0.2
#tflags LW_SUBJECT_SPAMMY noautolearn

I'm sure this rule could use some improvement.
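
One possible improvement shows up when the pattern is tried against the
sample from this thread. The Python sketch below copies the regex unchanged
and assumes the rule is matched against the whole Subject value: because the
pattern is anchored at both ends and allows neither ':' nor spaces, a
leading "Re: " appears to stop it from matching.

```python
import re

# Same pattern as LW_SUBJECT_SPAMMY, without the surrounding slashes.
LW_SUBJECT_SPAMMY = re.compile(r"^[0-9a-zA-Z,.+_\-'!\\\/]{31,}$")

# Sample subject from the thread, with and without its "Re: " prefix.
BARE_SUBJECT = "YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat"
REPLY_SUBJECT = "Re: " + BARE_SUBJECT


def is_spammy(subject: str) -> bool:
    """Return True if the whole subject is one 31+ character run."""
    return LW_SUBJECT_SPAMMY.search(subject) is not None


if __name__ == "__main__":
    print("bare: ", is_spammy(BARE_SUBJECT))
    print("reply:", is_spammy(REPLY_SUBJECT))
```

If that holds in SpamAssassin too, dropping the leading anchor or allowing
an optional "Re:\s+" prefix might be worth testing.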

The ones we saw also always followed one of 2 patterns (sometimes
both in the same e-mail):

1) Hit the HTML_MESSAGE, and either FREEMAIL_FROM or TRACKER_ID, rules.
2) Hit MIME_QP_LONG_LINE and a network test.

We have the above 2 in the form of meta rules and scored at 1.0 each.

We also have a 3rd meta rule, with the first rule + the 2 described 
above, scored at 1.5

This has proven to be quite effective at nuking these spams without FP.
This is because the likelihood of a ham e-mail setting off all of the
above rules is quite low.
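
The way these scores stack can be modelled roughly in Python. This is a
sketch with several assumptions: the meta rule names (LW_META_HTML,
LW_META_QP, LW_META_ALL) and the NETWORK_TEST placeholder are made up, the
third meta rule is read as "subject rule AND both patterns", and only the
scores mentioned above are summed (any other rules would add their own).

```python
# Scores taken from the description above; the meta rule names are hypothetical.
SCORES = {
    "LW_SUBJECT_SPAMMY": 0.2,
    "LW_META_HTML": 1.0,   # pattern 1
    "LW_META_QP": 1.0,     # pattern 2
    "LW_META_ALL": 1.5,    # subject rule plus both patterns (assumed reading)
}


def evaluate(hits):
    """Given the set of base rules that hit, return (fired rules, summed score)."""
    subject = "LW_SUBJECT_SPAMMY" in hits
    pattern1 = "HTML_MESSAGE" in hits and (
        "FREEMAIL_FROM" in hits or "TRACKER_ID" in hits
    )
    # NETWORK_TEST is a placeholder for whichever network test is used.
    pattern2 = "MIME_QP_LONG_LINE" in hits and "NETWORK_TEST" in hits

    fired = []
    if subject:
        fired.append("LW_SUBJECT_SPAMMY")
    if pattern1:
        fired.append("LW_META_HTML")
    if pattern2:
        fired.append("LW_META_QP")
    if subject and pattern1 and pattern2:
        fired.append("LW_META_ALL")
    return fired, round(sum(SCORES[name] for name in fired), 2)


if __name__ == "__main__":
    all_hits = {"LW_SUBJECT_SPAMMY", "HTML_MESSAGE", "TRACKER_ID",
                "MIME_QP_LONG_LINE", "NETWORK_TEST"}
    print(evaluate(all_hits))
    print(evaluate({"HTML_MESSAGE", "FREEMAIL_FROM"}))
```

Under that reading, a message hitting everything collects 3.7 points from
these rules alone, while ham that trips a single pattern collects 1.0.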

Regards,
Lawrence



Re: Spam

Posted by Martin Gregorie <ma...@gregorie.org>.
On Wed, 2011-03-30 at 00:58 +0200, martin@swetech.se wrote:
> Recently I've been getting a lot of these mails with subjects like the one
> below; they contain a link to some scam/Chinese crap factory.
>
> I run the latest SpamAssassin along with amavis, but these mails keep
> getting through. Any ideas?
> 
> Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat

Since the longest (English) word I know has 28 letters
(antidisestablishmentarianism), a private rule like:

header VERY_LONG_WORD  Subject =~ /Re:\s+\S{29}/

should catch that spam.
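
A small Python check of the pattern, as a sketch only (the function name
hits is made up). It also shows one limitation to be aware of: \S counts any
non-space character, so a reply whose subject starts with a long URL would
hit as well.

```python
import re

# Same pattern as the rule above, without the surrounding slashes.
VERY_LONG_WORD = re.compile(r"Re:\s+\S{29}")

# Sample subject quoted in the thread.
SPAM_SUBJECT = "Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat"
# The 28-letter word stays just under the threshold.
WORD_SUBJECT = "Re: antidisestablishmentarianism"
# A URL is also a 29+ character run of non-space characters.
URL_SUBJECT = "Re: http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll"


def hits(subject: str) -> bool:
    """Return True if 'Re:' is followed by 29 or more non-space characters."""
    return VERY_LONG_WORD.search(subject) is not None


if __name__ == "__main__":
    for subject in (SPAM_SUBJECT, WORD_SUBJECT, URL_SUBJECT):
        print(hits(subject), subject)
```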


Martin