Posted to users@spamassassin.apache.org by Michael Hutchinson <mh...@manux.co.nz> on 2008/02/17 21:36:11 UTC

FW: Rule for Russian character sets (=?koi8-r? not quite a charset)

-----Original Message-----<snipsnip>
> > We don't want to "only allow" the English locale, because we (here at
> > my work) do not want our international clients (non Russian) to be
> > denied email service.
> 
> ok_locales  en ja ko th zh
> 
> This will allow anything but Cyrillic char sets. Please note that en
> does *not* mean "English locale" despite its name. It applies to all
> Western charsets, including German Umlauts, Swedish, French, Turkish,
> etc. Basically everything that uses the characters in this post, plus
> language specific chars.
 
Ok now we're talking turkey. Thanks for providing the much needed
clarity on ok_locales. I may just employ that technique yet, pending
whether we get any more Russian spam through the gates.

> Sorry, I did not mean to troll nor any kind of offense.

You have my apologies. It being a Friday afternoon, I was pretty sick of
work and shouldn't have taken it out on you or the list. Sorry.
 
> However, you missed my point. Getting detailed with REs is a good thing,
> sure. I was not objecting to that -- but the RE in question does not
> properly handle charset encoding. See the Subject for an example which
> is not an encoded word, but will be matched by your rule.
> 
> My point was that the rule discussed aims at being something that it
> unfortunately is not, because charset encoding is slightly more complex
> and definitely requires a closing part. A Regular Expression that does
> this can be found in check_for_faraway_charset_in_headers() in
> HeaderEval.pm:
>   $hdr =~ /=\?(.+?)\?.\?.*?\?=/g
> 
> Hence my re-inventing-the-wheel analogy. And these wheels are quite
> flexible, too. ;-)
> 
> Also, your rule applies to the Subject only, whereas ok_locales does
> check all MIME parts and will trigger on Russian spam with a "western"
> Subject.

The RE in question (mine) was not written just for the Subject; a
separate rule covers the raw From: line as well. As we only score spam
here and leave filing it to the MUA (unless a score of 25 is reached, at
which point SA bins it), scoring against the Subject and From lines
makes sense: a simple match on (=?koi8-r?) in the Subject would not
score high enough on its own to get a message filtered or blocked. (I'm
trying to apply what I've learned from the SA webpage about writing
multiple low-scoring rules instead of a few big-scoring ones.)

I can see it is flawed, but I also have to admit that it is working
rather well at the moment. Mind you, I have taken the time to translate
some of the Russian spam, work out spammy phrases, and then have SA
score against those phrases.
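
To make the difference concrete, here is a minimal Python sketch comparing
a bare prefix match with the encoded-word pattern quoted above from
HeaderEval.pm. The prefix pattern is only an assumed stand-in for the rule
under discussion (the actual rule is not shown in this thread), and the
helper name and sample Subject values are hypothetical.

```python
import re

# Assumed stand-in for the rule under discussion: it only looks for the
# opening part of an encoded word and never checks for a closing "?=".
NAIVE = re.compile(r"=\?koi8-r\?", re.IGNORECASE)

# Python translation of the pattern quoted from HeaderEval.pm:
#   /=\?(.+?)\?.\?.*?\?=/
# It requires charset, one-character encoding, payload and the closing "?=".
ENCODED_WORD = re.compile(r"=\?(.+?)\?.\?.*?\?=")


def charsets_in_header(hdr):
    """Return the lowercased charset of every complete encoded word in hdr."""
    return [m.group(1).lower() for m in ENCODED_WORD.finditer(hdr)]


# The Subject of this thread: contains "=?koi8-r?" but no encoded word.
plain_subject = "Rule for Russian character sets (=?koi8-r? not quite a charset)"

# A hypothetical Subject carrying a complete base64 encoded word.
encoded_subject = "=?koi8-r?B?8NLJ18XU?="

if __name__ == "__main__":
    for subject in (plain_subject, encoded_subject):
        print(bool(NAIVE.search(subject)), charsets_in_header(subject))
```

The bare prefix match fires on both Subjects, while the full pattern only
reports a charset for the one that really contains an encoded word.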

> Hope this clarifies my previous posts and is appreciated again...

Your posts are appreciated, and sorry for the mean comment.

Cheers,
Mike


Re: FW: Rule for Russian character sets (=?koi8-r? not quite a charset)

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2008-02-18 at 09:36 +1300, Michael Hutchinson wrote:
> > > We don't want to "only allow" the English locale, because we (here at
> > > my work) do not want our international clients (non Russian) to be
> > > denied email service.
> > 
> > ok_locales  en ja ko th zh
> > 
> > This will allow anything but Cyrillic char sets. Please note that en
> > does *not* mean "English locale" despite its name. It applies to all
> > Western charsets, including German Umlauts, Swedish, French, Turkish,
> > etc. Basically everything that uses the characters in this post, plus
> > language specific chars.
>  
> Ok now we're talking turkey. Thanks for providing the much needed
> clarity on ok_locales. I may just employ that technique yet, pending
> whether we get any more Russian spam through the gates.
> 
> > Sorry, I did not mean to troll nor any kind of offense.
> 
> You have my apologies. It being a Friday afternoon, I was pretty sick of
> work and shouldn't have taken it out on you or the list. Sorry.

> > Hope this clarifies my previous posts and is appreciated again...
> 
> Your posts are appreciated, and sorry for the mean comment.

Thanks.  No offense taken, no harm done, don't worry. :)

  guenther


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}