Posted to users@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2008/02/15 13:58:58 UTC

RE: Rule for Russian character sets (=?koi8-r? not quite a charset)

On Fri, 2008-02-15 at 17:10 +1300, Michael Hutchinson wrote:
> > From: Karsten Bräckelmann [mailto:guenther@rudersport.de]

> > Why are you guys now trying to re-invent the wheel in the special case
> > of a gray asphalt street? What about a dirt track, grass, and anything
> > else a wheel works on?
> > 
> > I've pointed it out before. Just use ok_locales, which is all about
> > these char sets. No REs, almost no thinking required, no headache. A
> > single line, and you're done.
> 
> We don't want to "only allow" the English locale, because we (here at
> my work) do not want our international clients (non Russian) to be
> denied email service. 

ok_locales  en ja ko th zh

This will allow anything but Cyrillic char sets. Please note that en
does *not* mean "English locale", despite its name. It applies to all
Western charsets, covering German umlauts, Swedish, French, Turkish,
etc. Basically everything that uses the characters in this post, plus
language-specific chars.


> That aside, I really don't think getting detailed with Regular
> Expressions is re-inventing the wheel. Rather, it is expanding
> knowledge that will help write better rules in the future. (More
> flexible wheels, in your context).
> 
> Although I appreciated your earlier post of 'ok_locales', and
> understood it, I did not appreciate your Troll.

Sorry, I did not mean to troll or to cause any kind of offense.

However, you missed my point. Getting detailed with REs is a good thing,
sure. That is not what I was getting at: the RE in question does not
properly handle charset encoding. See the Subject of this thread for an
example which is not an encoding, but will be matched by your rule.

My point was that the rule discussed aims at being something that it
unfortunately is not, because charset encoding is slightly more complex
and definitely requires a closing part. A regular expression that does
this can be found in check_for_faraway_charset_in_headers() in
HeaderEval.pm:
  $hdr =~ /=\?(.+?)\?.\?.*?\?=/g

Hence my re-inventing the wheel analogy. And these wheels are quite
flexible, too. ;-)
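
To make the difference concrete, here is a minimal sketch in Python (not
Perl, and not SpamAssassin code) that ports the pattern quoted above and
compares it with a loose pattern that only looks for the opening part.
The LOOSE pattern is a hypothetical stand-in for the rule under
discussion, and the encoded sample header is made up for illustration.

```python
import re

# Python port of the pattern quoted above from HeaderEval.pm:
# opening "=?", charset, "?", one encoding character, "?", text, closing "?=".
ENCODED_WORD = re.compile(r"=\?(.+?)\?.\?.*?\?=")

# Hypothetical stand-in for a rule that only checks the opening part.
LOOSE = re.compile(r"=\?koi8-r\?", re.IGNORECASE)


def charsets(header: str) -> list[str]:
    """Return the charsets of all complete encoded words in a header."""
    return [name.lower() for name in ENCODED_WORD.findall(header)]


# The Subject of this thread: looks like an opening part, but never closes.
plain = "RE: Rule for Russian character sets (=?koi8-r? not quite a charset)"

# A made-up sample of a complete encoded word.
encoded = "=?koi8-r?B?8NLJ18XU?="

if __name__ == "__main__":
    print(bool(LOOSE.search(plain)))  # loose pattern fires on plain text
    print(charsets(plain))            # no complete encoded word found
    print(charsets(encoded))          # charset of the real encoded word
```

The loose pattern fires on the plain-text Subject, while the pattern
that insists on the closing part only reports a charset for the header
that really is encoded.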

Also, your rule applies to the Subject only, whereas ok_locales does
check all MIME parts and will trigger on Russian spam with a "western"
Subject.
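
A rough illustration of why looking at the Subject alone is not enough,
again as a Python sketch using only the standard email module. This is
not how SpamAssassin implements ok_locales; the sample message and the
helper name part_charsets are invented for the example.

```python
from email import message_from_string

# Made-up message: plain ASCII Subject, but a body declared as koi8-r.
RAW = (
    "Subject: Cheap watches\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=koi8-r\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "8NLJ18XU\n"
)


def part_charsets(raw: str) -> set[str]:
    """Collect the declared charset of every MIME part in the message."""
    msg = message_from_string(raw)
    found = set()
    for part in msg.walk():
        charset = part.get_content_charset()  # None if no charset declared
        if charset:
            found.add(charset)
    return found


if __name__ == "__main__":
    subject = message_from_string(RAW)["Subject"]
    print("=?" in subject)      # a Subject-only rule has nothing to match
    print(part_charsets(RAW))   # the body part still declares koi8-r
```

A rule restricted to the Subject never sees the charset here, whereas a
check over all MIME parts does.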


Hope this clarifies my previous posts and is appreciated again...

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}