Posted to users@spamassassin.apache.org by Jay Sekora <js...@csail.mit.edu> on 2013/07/16 17:30:38 UTC

Current best-practices around normalize_charset?

Hi.  We're running SpamAssassin 3.3.1, and pursuant to some advice I've 
seen in archives of this list and spamassassin-dev (e.g., 
http://osdir.com/ml/spamassassin-dev/2009-07/msg00156.html), I am *not* 
using normalize_charset.  Unfortunately, this makes filtering text in 
binary encodings almost impossible, since even if you can come up with a 
word you want to match, word boundaries aren't at byte boundaries, so if 
I were to try to write rules byte-by-byte, I'd need several possible 
match strings, and I wouldn't be able to match the first or last 
character of the phrase I want to match (which for, say, Chinese, where 
words tend to be one or two characters long, is a big problem).  That's 
on top of the alternative patterns needed to represent non-Unicode 
encodings, of course.
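
To make that concrete, here is roughly what matching a single two-character 
word looks like against the raw, un-normalized body: one rule per encoding 
the word might arrive in.  The rule names are made up and the byte values 
are only my best reading of the UTF-8 and GB2312 encodings of one common 
greeting, so treat this as a sketch of the problem rather than as working 
rules:

  # the same two-character word, a different byte sequence per encoding
  body  ZH_WORD_UTF8    /\xe4\xbd\xa0\xe5\xa5\xbd/
  body  ZH_WORD_GB2312  /\xc4\xe3\xba\xc3/
  # ...and again for Big5, GBK, ISO-2022-CN, and so on

And with the legacy double-byte encodings the lead- and trail-byte ranges 
overlap, so a byte-level pattern can also match by accident across character 
boundaries, which is the word-boundary problem described above.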

Anyway, my question is, is that advice still valid (for 3.3.1, which is 
packaged for Debian Squeeze, or for latest stable)?  And if so, what do 
people tend to do to write rules for East Asian character sets (or, for 
that matter, for Western character sets encoded in binary to make them 
harder to filter)?  The traffic on the bug report quoted in the above 
message is kind of ambiguous.

(I will note that ok_languages and ok_locales are pretty useless here, 
at least for site-wide use, since we have users with correspondence in 
pretty much any language we've ever seen spam in.)

Jay

-- 
Jay Sekora
Linux system administrator and postmaster,
The Infrastructure Group
MIT Computer Science and Artificial Intelligence Laboratory

Re: Current best-practices around normalize_charset?

Posted by Ivo Truxa <iv...@truxa.cz>.
From: "Jay Sekora [via SpamAssassin]" 

I forgot to comment on this:

> Seems like just normalizing them to U+NNNN might be better than 
> trying to transcribe them.  (And that would let a brave or foolhardy 
> mail administrator write rules to match patterns seen in, say, 
> Chinese-language spam even without knowing Chinese, or even without 
> knowing what language the spam was in.)

You can already do that with UTF-8 normalization. You can write your rules 
directly in Unicode characters, or in numeric code points if you want; Perl 
regular expressions accept both without any problems. So you do not need 
ASCII normalization for that; it already works well with UTF-8 normalization.
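
For example (a sketch only: the rule names and scores are made up, and 
depending on the SA version the normalized body may be handed to the regex 
engine as UTF-8 octets rather than characters, in which case you would use 
\xNN byte escapes instead of \x{NNNN} code points):

  # literal UTF-8 characters, in a UTF-8-encoded .cf file
  body   LOCAL_ZH_FAPIAO_LIT  /发票/
  # the same two characters written as numeric code points
  body   LOCAL_ZH_FAPIAO_NUM  /\x{53D1}\x{7968}/
  score  LOCAL_ZH_FAPIAO_LIT  0.5
  score  LOCAL_ZH_FAPIAO_NUM  0.5

(发票, "invoice", turns up constantly in Chinese invoice spam.)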

Unfortunately, UTF-8 normalization does not solve the problem of the 
different ways people use (or fail to use) diacritics. In any European 
language (except English, and to a certain extent Dutch), there are many 
variants in which people may write the same word, with or without 
diacritics. That makes rule development quite difficult: you cannot write a 
simple rule to match a single word; you either have to use plenty of 
wildcards or pipe all the variants into the regex as alternatives.
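
For instance, to catch the Czech word "příloha" ("attachment") whether it is 
written with full, partial, or no diacritics, you already need something 
like this (the rule name is just a placeholder):

  body  LOCAL_CS_PRILOHA  /p[řr][íi]loha/i

and with longer words, or words where more letters carry diacritics, the 
character classes multiply quickly.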

UTF-8 normalization also does not help with text obfuscation through 
diacritics or through visually similar Unicode characters. Each Latin 
character has many versions with various diacritics, plus similar-looking 
Latin or non-Latin characters. I have not gathered exact statistics, but 
there may easily be 20 or even more variants of each Latin character. With 
20 variants per character, a single 5-letter word already has over 3 million 
(20^5 = 3,200,000) possible spellings. Without some rather aggressive 
reduction of the variants (such as the 7-bit US-ASCII normalization), you 
would have a hard time writing rules to match those obfuscations.
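
As a toy illustration: "Ṿïàǧrâ" is one of those millions of spellings, and a 
hand-written rule that tries to cover even a small fraction of them 
(hypothetical, and nowhere near complete) ends up looking like

  body  LOCAL_OBFU_EXAMPLE  /[vṿṽ][iïìí1][aàâä@][gǧĝġ][rŕřṛ][aàâä@]/i

whereas after folding to US-ASCII the same messages should be caught by a 
plain /viagra/i.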






Re: Current best-practices around normalize_charset?

Posted by Ivo Truxa <iv...@truxa.cz>.
From: "Jay Sekora [via SpamAssassin]" 
> Interesting idea!  I searched in the spamassassin-dev archives but I
> don't think I found the right patch; could you point me at it?

You can also find it on the SA Bugzilla (although it is not a bug) here:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7022


> How do you handle non-alphabetic scripts (like CJK, where a character
> may have multiple pronunciations both within and between languages)?
> Seems like just normalizing them to U+NNNN might be better than 
> trying to transcribe them.  (And that would let a brave or foolhardy 
> mail administrator write rules to match patterns seen in, say, 
> Chinese-language spam even without knowing Chinese, or even without 
> knowing what language the spam was in.)

I did very little. In fact I only made some relatively minor changes and 
incorporated the Text::Unidecode module, which handles the transcription. 
You can have a look at its description to get a better idea of how the 
transcription is handled. 

http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

In some cases the resulting text would not be well understood by native 
speakers, but since it is systematic, it can still be used by SpamAssassin 
rules.
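
If you want a quick feel for what the transcription produces, here is a 
minimal one-off Perl sketch (the sample strings are just my own examples):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use utf8;              # the string literals below are UTF-8
  use Text::Unidecode;   # exports unidecode()

  # Latin text with diacritics folds to plain ASCII
  print unidecode("příliš žluťoučký kůň"), "\n";   # prilis zlutoucky kun
  # CJK gets a systematic, pinyin-like transcription
  print unidecode("北京"), "\n";                    # something like "Bei Jing"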

I also wrote a standalone normalizing script that you can use for rule 
development, and to see how your text would be normalized to ASCII by the 
patched SA:

https://github.com/truxoft/sa-normalize

You need to install the Text::Unidecode module for the script to work (and, 
of course, for the patch too).
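
(On Debian that should be just "apt-get install libtext-unidecode-perl", or 
"cpan Text::Unidecode" elsewhere - package name quoted from memory, so 
double-check it.)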

> Anyway, glad to hear that normalize_charset hasn't been causing you
> problems, and for us, normalizing to UTF8 is almost certainly what we
> want if it's reasonably safe.

Do not take that as guaranteed. I ran UTF-8 normalization only for a short 
time before writing the US-ASCII normalization, since the latter is exactly 
what I needed and makes my life easier. Even with UTF-8 normalization, 
writing rules for international text with diacritics is a nightmare, because 
half of the people use no diacritics and the rest often use them partially 
or incorrectly, so unless you use ASCII normalization you need to write 
hundreds of permutations to catch all the cases. With plain US-ASCII, you 
write a single rule and you are finished, for all variants and for all 
charsets.
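
For instance (the rule name is just a placeholder): with the ASCII 
normalization in place, the single rule

  body  LOCAL_CS_PRILOHA  /priloha/i

should match "příloha", "prıloha", "priloha" and so on, whether the message 
arrives in UTF-8, ISO-8859-2, or Windows-1250, instead of needing one 
character class per letter for every variant.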

Ivo





Re: Current best-practices around normalize_charset?

Posted by "Jay A. Sekora" <js...@csail.mit.edu>.
On Wed, 2014-03-12 at 19:04 -0700, Ivo Truxa wrote:

> Your message is a few months old, but I see no answer, and I stumbled upon
> it while writing an enhanced version of the normalize_charset feature, so I
> thought I could perhaps help.

Thanks!  I'm glad to hear of your experiences.

> [R]egardless of whether you use normalization or not, as long as you need
> to match non-ASCII patterns you have to write rules in Unicode anyway,
> because you cannot reject Unicode messages.

Indeed!  And even if you only want to accept messages in English (or
some other ASCII-supported language), nowadays it's not at all uncommon
for messages to have dingbats or printer's quotation marks in them -- or
one of your correspondents might be sitting at a relative's computer or
in an internet cafe somewhere and the subject line might get the Chinese
equivalent of "Re:" prepended to it, or the body might have a disclaimer
in French appended.

> Another possibility may be to normalize, instead of to UTF-8, to plain
> 7-bit US-ASCII. The currently proposed patch for ASCII normalization also
> transliterates non-Latin alphabets. The patch was proposed on the dev list,
> so impatient and courageous users might want to try it on a non-production
> server, but be warned that it is not official code (at least not yet) and
> is currently very little tested.

Interesting idea!  I searched in the spamassassin-dev archives but I
don't think I found the right patch; could you point me at it?

How do you handle non-alphabetic scripts (like CJK, where a character
may have multiple pronunciations both within and between languages)?
Seems like just normalizing them to U+NNNN might be better than trying
to transcribe them.  (And that would let a brave or foolhardy mail
administrator write rules to match patterns seen in, say,
Chinese-language spam even without knowing Chinese, or even without
knowing what language the spam was in.)

Anyway, glad to hear that normalize_charset hasn't been causing you
problems, and for us, normalizing to UTF8 is almost certainly what we
want if it's reasonably safe.

Jay

-- 
Jay Sekora
Linux system administrator and postmaster,
The Infrastructure Group
MIT Computer Science and Artificial Intelligence Laboratory



Re: Current best-practices around normalize_charset?

Posted by Ivo Truxa <iv...@truxa.cz>.
Hello, 

Your message is a few months old, but I see no answer, and I stumbled upon it
while writing an enhanced version of the normalize_charset feature, so I
thought I could perhaps help.


Jay Sekora wrote
> Hi.  We're running SpamAssassin 3.3.1, and pursuant to some advice I've 
> seen in archives of this list and spamassassin-dev, I am *not* 
> using normalize_charset.

I do not know much about the original bug, but until recently I used Unicode
normalization without observing any problems. Perhaps I was lucky, or did not
look closely enough. However, that is somewhat beside the point: regardless
of whether you use normalization or not, as long as you need to match
non-ASCII patterns you have to write rules in Unicode anyway, because you
cannot reject Unicode messages. So by disabling the normalization you only
make your situation worse. Not only do you have to write rules in UTF-8
anyway (and risk that they will be slow), but in addition you need to write
the rules for every possible character set that can arrive (and you wrote
that your server needs to accept email in all possible languages, so there
would be dozens of different character sets). That is an inhuman task, and
the number of rules, or their complexity, would slow down your server
possibly more than the bug (if it still exists).

To my mind, anyone who needs to write rules for a multi-national server and
for Asian languages cannot get around normalization, or else has to stick
mostly to ASCII-only rules (which are not of much use for Asian languages).

Another possibility may be to normalize, instead of to UTF-8, to plain 7-bit
US-ASCII. The currently proposed patch for ASCII normalization also
transliterates non-Latin alphabets. The patch was proposed on the dev list,
so impatient and courageous users might want to try it on a non-production
server, but be warned that it is not official code (at least not yet) and is
currently very little tested.

Ivo




