Posted to users@spamassassin.apache.org by Adam Moffett <ad...@plexicomm.net> on 2012/08/21 21:30:28 UTC

spam in foreign characters

I have a user who seems to get 4-5 messages per day with Chinese 
characters for the subject and body.  They come from a variety of 
domains and IP's so I guess she somehow got onto a list used to spam 
Chinese speaking people.

If I paste them into Google Translate they seem to be roughly the same 
kind of junk as our English spam: "work from home", "buy our drugs", 
etc.  The handful that I looked at closely had scores of 2.0-3.0.

Are there existing SpamAssassin rules that work on non-English 
characters?  Is there maybe something extra I should enable or install 
that would score these higher?

I'm sorry if it's an ignorant question, but the issue hasn't really come 
up here before.

Thanks.


Re: spam in foreign characters

Posted by Axb <ax...@gmail.com>.
On 08/21/2012 09:30 PM, Adam Moffett wrote:
> I have a user who seems to get 4-5 messages per day with Chinese
> characters for the subject and body.  They come from a variety of
> domains and IP's so I guess she somehow got onto a list used to spam
> Chinese speaking people.
>
> If I paste them into Google Translate they seem to be roughly the same
> kind of junk as our English spam: "work from home", "buy our drugs",
> etc.  The handful that I looked at closely had scores of 2.0-3.0.
>
> Are there existing SpamAssassin rules that work on non english
> characters?  Is there maybe something extra I should enable or install
> that would score these higher?
>
> I'm sorry if it's an ignorant question, but the issue hasn't really come
> up here before.

If you can set user preferences
(you may even want this site-wide):

ok_locales en

This will add some points to mail in "foreign languages".

"Western" languages are not affected. The docs say "(only allow 
English)", which should be corrected; further down they say:

"en - Western character sets in general"

See:
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.txt

"LANGUAGE OPTIONS"

h2h

Axb
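
As a rough illustration of the ok_locales suggestion above, a site-wide
setting might look like the following. The file path is an assumption (it
varies by install). Per the Conf documentation linked above, ok_locales is
what drives the CHARSET_FARAWAY family of rules, and "zh" is the locale code
it lists for Chinese.

```
# /etc/mail/spamassassin/local.cf  (path is an assumption; varies by install)

# Treat only Western character sets as expected. Mail that declares other
# character sets becomes eligible for the CHARSET_FARAWAY family of rules.
ok_locales en

# If Chinese mail is also legitimate for some users, list both locales
# instead (uncomment to use):
# ok_locales en zh
```

Running "spamassassin --lint" after editing is a quick way to check the file
for syntax errors.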

Re: spam in foreign characters

Posted by Adam Moffett <ad...@plexicomm.net>.
I think I'd have to read Chinese to tackle that accurately.

> So, you should probably try using ok_locales, and if it doesn't work,
> create your own rules to match these spams, if you can find good common
> patterns that don't seem likely to match non-spams (or match all Chinese
> email if that's what you want).  And please share what works.


RE: spam in foreign characters

Posted by Daniel Lemke <le...@jam-software.com>.
> -----Original Message-----
> From: Niamh Holding [mailto:niamh@fullbore.co.uk]
> Sent: Wednesday, August 22, 2012 8:01 AM
> To: users@spamassassin.apache.org
> Subject: Re: spam in foreign characters
>
>
> dcc> match all Chinese email if that's what you want
>
> mimeheader  NH_CHINESE                  Content-Type =~ /charset="?gb2312/i
> score       NH_CHINESE                  2.5
> describe    NH_CHINESE                  Chinese character set

'all' is such a strong word ;-)

That rule won't hit Chinese/Japanese/Korean mail that is sent as UTF-8 (often base64-encoded), because the declared charset is then utf-8 rather than gb2312.
For that mail the most reliable mechanism is a well-trained Bayes, as John already suggested.

You may also want to have a look at the TextCat plugin.
It doesn't work for all mail, but in combination with Bayes and ok_locales you should be able to filter most foreign-language spam.
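
A minimal sketch of enabling TextCat, assuming a stock SpamAssassin 3.3.x layout. Which .pre file holds the loadplugin line differs between installs, and the score shown is only an illustrative value, not a recommendation.

```
# In a .pre file (for example v310.pre): load the language-guessing plugin.
# It ships with SpamAssassin but is often left commented out.
loadplugin Mail::SpamAssassin::Plugin::TextCat

# In local.cf or user_prefs: languages considered acceptable.
ok_languages en

# Rule fired by TextCat when the guessed language is not in ok_languages.
# The score here is an assumption chosen for illustration.
score UNWANTED_LANGUAGE_BODY 2.0
```

Unlike ok_locales, which looks at declared character sets, ok_languages works on the guessed language of the text, so it can also apply to mail sent as UTF-8.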

Daniel

----------------------------------------------------
JAM Software GmbH
Geschäftsführer: Joachim Marder
Am Wissenschaftspark 26 * 54296 Trier * Germany
Tel: 0651-145 653 -0 * Fax: 0651-145 653 -29
Handelsregister Nr. HRB 4920 (AG Wittlich) http://www.jam-software.de

Re: spam in foreign characters

Posted by Niamh Holding <ni...@fullbore.co.uk>.
Hello Darxus,

Tuesday, August 21, 2012, 8:42:33 PM, you wrote:

dcc> match all Chinese email if that's what you want

mimeheader  NH_CHINESE                  Content-Type =~ /charset="?gb2312/i
score       NH_CHINESE                  2.5
describe    NH_CHINESE                  Chinese character set
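
A hypothetical variant of the rule above that also covers other common
Chinese character set labels. The rule name LOCAL_CHINESE_CHARSET and the
score are made up for this sketch. Like the original, it only sees the
declared charset, so mail sent as UTF-8 is still not matched.

```
# Hypothetical rule: match several Chinese charset labels in MIME headers.
# Requires the MIMEHeader plugin, which provides the "mimeheader" keyword.
mimeheader  LOCAL_CHINESE_CHARSET   Content-Type =~ /charset="?(?:gb2312|gbk|gb18030|big5)/i
describe    LOCAL_CHINESE_CHARSET   MIME part declares a Chinese character set
score       LOCAL_CHINESE_CHARSET   2.5
```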


-- 
Best regards,
 Niamh                            mailto:niamh@fullbore.co.uk

Re: spam in foreign characters

Posted by John Hardin <jh...@impsec.org>.
On Tue, 21 Aug 2012, Adam Moffett wrote:

> One of our users definitely emails with Chinese vendors.  I'm sure they 
> correspond in English, but I'm guessing the Chinese folks might have 
> Chinese characters in their signature line or some such.

Consider Bayes.

I have trained my Bayes with Chinese-language spams and they are all 
getting BAYES_99 now. If you do decide to train on Chinese-language spams, 
you will definitely want to also train hams from your user's Chinese 
vendors to catch any use of non-latin characters in .sigs or message 
headers.

Be sure to keep your training corpora on hand so that you can un-train 
those messages if it doesn't work out.
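
A sketch of the configuration side of this, assuming a site-wide local.cf.
Turning auto-learning off during hand-training is an assumption of this
example, not something required. The directory names in the comments are
placeholders.

```
# local.cf: make sure Bayes is on and its rules are scored.
use_bayes 1
use_bayes_rules 1

# Optional: turn off auto-learning while hand-training, so that only
# the messages you chose end up in the database.
bayes_auto_learn 0

# Training itself happens on the command line, not in this file:
#   sa-learn --spam   <dir-of-Chinese-spam>
#   sa-learn --ham    <dir-of-vendor-mail>
#   sa-learn --forget <dir-to-untrain>
```

Keeping the trained messages in separate directories is what makes the
--forget step possible later.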

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   USMC Rules of Gunfighting #20: The faster you finish the fight,
   the less shot you will get.
-----------------------------------------------------------------------
  3 days until the 1933rd anniversary of the destruction of Pompeii

Re: spam in foreign characters

Posted by Adam Moffett <ad...@plexicomm.net>.
Awesome, thanks for the tip.

Any guess how this affects messages with mixed character sets?  One of 
our users definitely emails with Chinese vendors.  I'm sure they 
correspond in English, but I'm guessing the Chinese folks might have 
Chinese characters in their signature line or some such.

Thanks.

> SpamAssassin has an ok_locales setting that lets you specify, basically,
> which languages you want to accept.  But it has problems:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4078
>
> I don't believe anybody has created rules to match these kinds of spams.
> A big part of the problem is lacking examples of non-English non-spam
> to verify the rules don't hit them.
>
> So, you should probably try using ok_locales, and if it doesn't work,
> create your own rules to match these spams, if you can find good common
> patterns that don't seem likely to match non-spams (or match all Chinese
> email if that's what you want).  And please share what works.
>
> ok_locales is defined in the Mail::SpamAssassin::Conf man page which can
> also be found here:
> http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html
>
> Hmm, ok_locales may actually work on Chinese, I don't see examples of
> problems with that language.
>
> On 08/21, Adam Moffett wrote:
>> I have a user who seems to get 4-5 messages per day with Chinese
>> characters for the subject and body.  They come from a variety of
>> domains and IP's so I guess she somehow got onto a list used to spam
>> Chinese speaking people.
>>
>> If I paste them into Google Translate they seem to be roughly the
>> same kind of junk as our English spam: "work from home", "buy our
>> drugs", etc.  The handful that I looked at closely had scores of
>> 2.0-3.0.
>>
>> Are there existing SpamAssassin rules that work on non english
>> characters?  Is there maybe something extra I should enable or
>> install that would score these higher?
>>
>> I'm sorry if it's an ignorant question, but the issue hasn't really
>> come up here before.
>>
>> Thanks.
>>


Re: spam in foreign characters

Posted by da...@chaosreigns.com.
SpamAssassin has an ok_locales setting that lets you specify, basically,
which languages you want to accept.  But it has problems:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4078

I don't believe anybody has created rules to match these kinds of spams.
A big part of the problem is lacking examples of non-English non-spam
to verify the rules don't hit them.

So, you should probably try using ok_locales, and if it doesn't work,
create your own rules to match these spams, if you can find good common
patterns that don't seem likely to match non-spams (or match all Chinese
email if that's what you want).  And please share what works.

ok_locales is defined in the Mail::SpamAssassin::Conf man page which can
also be found here:
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html

Hmm, ok_locales may actually work on Chinese, I don't see examples of
problems with that language.

On 08/21, Adam Moffett wrote:
> I have a user who seems to get 4-5 messages per day with Chinese
> characters for the subject and body.  They come from a variety of
> domains and IP's so I guess she somehow got onto a list used to spam
> Chinese speaking people.
> 
> If I paste them into Google Translate they seem to be roughly the
> same kind of junk as our English spam: "work from home", "buy our
> drugs", etc.  The handful that I looked at closely had scores of
> 2.0-3.0.
> 
> Are there existing SpamAssassin rules that work on non english
> characters?  Is there maybe something extra I should enable or
> install that would score these higher?
> 
> I'm sorry if it's an ignorant question, but the issue hasn't really
> come up here before.
> 
> Thanks.
> 

-- 
"There never has been an answer. There never will be an answer.
That's the answer." - Gertrude Stein
http://www.ChaosReigns.com