Posted to users@spamassassin.apache.org by Martin Gregorie <ma...@gregorie.org> on 2011/12/15 01:09:21 UTC

Problems with Cyrillic spam

I'm getting spam with the Subject, Sender personal name and body all
written in Cyrillic, but, despite having "ok_locales en fr de" defined
in local.cf, no rules are fired to mark the message as being in an
unwanted language. 

The body text is in two MIME parts, one UTF-8 plaintext and the other
contains HTML, also encoded as UTF-8. The sending domain name varies,
but the TLD always seems to be .ru.

I'm running SA 3.3.2 and would appreciate knowing how it recognises that
a message contains a language that is not listed as belonging to an OK
locale.


Martin
 




Re: Problems with Cyrillic spam

Posted by da...@chaosreigns.com.
On 12/15, Martin Gregorie wrote:
> In that case I'm missing some information: how to write a rule that can
> interpret the value(s) returned by TextCat.

I think you're looking for:

ok_languages en fr de

- http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Plugin_TextCat.html

> Why wouldn't it be sensible to rewrite ok_locales to compare TextCat
> return value(s) against its list of OK codes?

Because that functionality already exists within TextCat?  

> Then why has ok_locales not been fixed already? This is not a criticism,
> just a request for information. Is it something that's difficult to do
> efficiently? I'd imagine that language recognition by looking at codepoint
> values is possible but not necessarily fast nor unambiguous.

Because it's not actually broken.  That bug should probably be closed.
Perhaps after noting the limited utility in the documentation.

ok_locales functions by identifying character sets that can only be used
for a specific language.  UTF8, Windows-1255, and koi8 are not such
character sets, because they can also be used to write in English.  

And, most importantly, as Kevin says here, people *do* use those character
sets to write in English:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4078#c27

Well, it's obvious that people write English in UTF8.  
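
To illustrate the point about character sets (this is a Python sketch for
illustration only, not SpamAssassin code; the function name and sample
strings are made up): plain English text encodes without error in UTF-8,
Windows-1255 and KOI8-R alike, so seeing one of those charset labels on a
message says nothing certain about its language.

```python
# Hypothetical helper: list which of the charsets named above can
# represent a given piece of text. Codec names are Python's own.
CHARSETS = ["utf-8", "cp1255", "koi8_r"]


def encodable_in(text, charsets=CHARSETS):
    """Return the charsets from `charsets` that can encode `text`."""
    usable = []
    for name in charsets:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this charset has no mapping for some character
        usable.append(name)
    return usable


english = "Cheap watches, buy now"
russian = "Привет, мир"

# English fits in every one of them, so none is language-specific.
print(encodable_in(english))
# Russian fits in UTF-8 and KOI8-R but not in the Hebrew code page.
print(encodable_in(russian))
```

A charset-based check can only fire when the charset rules out the
wanted languages, which none of these three does.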

> I've no time ATM and in any case I'm a middling to poor Perl coder. Now,
> if SA was written in C or Java....

I bet you know that's the best way to get better at a language.

-- 
"If you are not paranoid... you may not be paying attention."
 - jimh@creative-net.net, on an IDPA mailing list
http://www.ChaosReigns.com

Re: Problems with Cyrillic spam

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2011-12-15 at 10:57 -0500, darxus@chaosreigns.com wrote:
> On 12/15, Martin Gregorie wrote:
> > The problem that needs addressing is that the ok_locales configuration
> > parameter doesn't work. This appears to be because it thinks the
> > sender's choice of (in Windows terms) the character translation code
> > page is a reliable indication of the sender's locale. I accept that this
> 
> I'd argue that ok_locales is defined by the way it functions, which was
> dependent on the fact that at one time it was useful to differentiate
> languages by character set.  And TextCat's functionality is basically
> exactly what you're looking for.  So it would make less sense to redefine
> ok_locales, and more sense to fix TextCat.
> 
In that case I'm missing some information: how to write a rule that can
interpret the value(s) returned by TextCat.

Why wouldn't it be sensible to rewrite ok_locales to compare TextCat
return value(s) against its list of OK codes?

> I don't think your comment will help either way.  Cyrillic character sets
> aren't hard to find, and all the devs are aware of the problem.  
> 
Then why has ok_locales not been fixed already? This is not a criticism,
just a request for information. Is it something that's difficult to do
efficiently? I'd imagine that language recognition by looking at codepoint
values is possible but not necessarily fast nor unambiguous.
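
As a rough illustration of the ambiguity I mean (a Python sketch, not
anything SA does; the function names are invented for the example):
tallying letters by the script named in their Unicode character names
separates Cyrillic from Latin easily enough, but it cannot tell Russian
from Bulgarian, or English from French or German.

```python
import unicodedata
from collections import Counter


def script_histogram(text):
    """Count alphabetic characters by the first word of their Unicode
    name, e.g. 'LATIN' or 'CYRILLIC'. This is a crude stand-in for a
    real script property lookup."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most common script label, or None if no letters."""
    counts = script_histogram(text)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Guten Morgen"))    # German
print(dominant_script("Bonjour à tous"))  # French
print(dominant_script("Доброе утро"))     # Russian
print(dominant_script("Добро утро"))      # Bulgarian
```

The first two come out with the same label, as do the last two, so
codepoints alone identify a script rather than a language.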

> If, on the other hand, you want to fix TextCat, or otherwise implement a
> solution to the problem, and attach a patch to a bugzilla comment, that
> would be awesome.
> 
I've no time ATM and in any case I'm a middling to poor Perl coder. Now,
if SA was written in C or Java....


Martin



Re: Problems with Cyrillic spam

Posted by da...@chaosreigns.com.
On 12/15, Martin Gregorie wrote:
> The problem that needs addressing is that the ok_locales configuration
> parameter doesn't work. This appears to be because it thinks the
> sender's choice of (in Windows terms) the character translation code
> page is a reliable indication of the sender's locale. I accept that this

I'd argue that ok_locales is defined by the way it functions, which was
dependent on the fact that at one time it was useful to differentiate
languages by character set.  And TextCat's functionality is basically
exactly what you're looking for.  So it would make less sense to redefine
ok_locales, and more sense to fix TextCat.

> That said, I'm happy to become a bugzilla user, but before I add
> anything to it, I'd like to know if you'd prefer me to add comments to
> 4078 and/or 6364 or if it would be best to raise a new bug containing my
> suggestion #1. I've kept an example message that I can provide as
> evidence.

I don't think your comment will help either way.  Cyrillic character sets
aren't hard to find, and all the devs are aware of the problem.  

If, on the other hand, you want to fix TextCat, or otherwise implement a
solution to the problem, and attach a patch to a bugzilla comment, that
would be awesome.

-- 
"If you want to make an apple pie from scratch, you must first create
the universe." - Carl Sagan
http://www.ChaosReigns.com

Re: Problems with Cyrillic spam

Posted by Martin Gregorie <ma...@gregorie.org>.
On Wed, 2011-12-14 at 23:36 -0500, darxus@chaosreigns.com wrote:
> On 12/15, Martin Gregorie wrote:
> > Could somebody with access to the SA Bugzilla kindly add a comment to
> > bug 4078 saying that this is also an issue with Cyrillic encoded in
> > UTF-8? I'm asking because at present #4078 only mentions Windows code
> > pages and koi8. There is nothing to indicate that this is also a problem
> > with UTF-8.
> 
> Although as Karsten pointed out, bug 4078 isn't actually
> related, since that bug is actually related to character sets primarily in
> another language.  Which UTF8 is not.  Bug 6364 is probably exactly the
> same as your issue, just in a different language - needing TextCat fixed /
> rewritten.  
>
The actual problem is that bug 4078 is over-restrictive in its
applicability: it merely says that CHARSET_FARAWAY_HEADER isn't returned
if a message body is in Hebrew.

The problem that needs addressing is that the ok_locales configuration
parameter doesn't work. This appears to be because it thinks the
sender's choice of (in Windows terms) the character translation code
page is a reliable indication of the sender's locale. I accept that this
used to work, but since the widespread introduction of UTF-8 and other
Unicode encodings, any such assumption is deeply flawed.

The same comments are also applicable to textcat (bug 6364).

There are really only two possibilities for resolving these bugs: 
1) Fix bug 6364 by rewriting the code textcat uses to recognise the
   predominant language used in body text. Fix bug 4078 by rationalising
   ok_locales to use the revised textcat code to determine the locale
   used by the sender before comparing this with the list of acceptable
   locales.
2) Declare textcat and ok_locales to be irretrievably broken and
   remove them from future versions of SA.

That said, I'm happy to become a bugzilla user, but before I add
anything to it, I'd like to know if you'd prefer me to add comments to
4078 and/or 6364 or if it would be best to raise a new bug containing my
suggestion #1. I've kept an example message that I can provide as
evidence.


Martin



Re: Problems with Cyrillic spam

Posted by da...@chaosreigns.com.
On 12/15, Martin Gregorie wrote:
> Could somebody with access to the SA Bugzilla kindly add a comment to
> bug 4078 saying that this is also an issue with Cyrillic encoded in
> UTF-8? I'm asking because at present #4078 only mentions Windows code
> pages and koi8. There is nothing to indicate that this is also a problem
> with UTF-8.

Access to bugzilla is not restricted, just create an account and make the
comment yourself.  Although as Karsten pointed out, bug 4078 isn't actually
related, since that bug is actually related to character sets primarily in
another language.  Which UTF8 is not.  Bug 6364 is probably exactly the
same as your issue, just in a different language - needing TextCat fixed /
rewritten.  

-- 
"Wash daily from nose-tip to tail-tip; drink deeply, but never too deep;
And remember the night is for hunting, and forget not the day is for sleep."
- The Law of the Jungle, Rudyard Kipling
http://www.ChaosReigns.com

Re: Problems with Cyrillic spam

Posted by Martin Gregorie <ma...@gregorie.org>.
On Wed, 2011-12-14 at 19:38 -0500, darxus@chaosreigns.com wrote:
> On 12/15, Martin Gregorie wrote:
> > I'm getting spam with the Subject, Sender personal name and body all
> > written in Cyrillic, but, despite having "ok_locales en fr de" defined
> > in local.cf, no rules are fired to mark the message as being in an
> > unwanted language. 
> 
> Probably related to this:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4078
> 
> There's also TextCat, which is also broken:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6364
> 
> Basically, spamassassin's detection of languages is broken.
> 
I agree that it seems to be broken by UTF-8 in the way that bug 4078
describes for Windows codepages.

Could somebody with access to the SA Bugzilla kindly add a comment to
bug 4078 saying that this is also an issue with Cyrillic encoded in
UTF-8? I'm asking because at present #4078 only mentions Windows code
pages and koi8. There is nothing to indicate that this is also a problem
with UTF-8.


Martin



Re: Problems with Cyrillic spam

Posted by da...@chaosreigns.com.
On 12/15, Martin Gregorie wrote:
> I'm getting spam with the Subject, Sender personal name and body all
> written in Cyrillic, but, despite having "ok_locales en fr de" defined
> in local.cf, no rules are fired to mark the message as being in an
> unwanted language. 

Probably related to this:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4078

There's also TextCat, which is also broken:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6364

Basically, spamassassin's detection of languages is broken.

-- 
"Don't go around saying the world owes you a living. The world owes you
nothing. It was here first."  - Mark Twain
http://www.ChaosReigns.com

Re: Problems with Cyrillic spam

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-12-15 at 00:09 +0000, Martin Gregorie wrote:
> I'm running SA 3.3.2 and would appreciate knowing how it recognises that
> a message contains a language that is not listed as belonging to an OK
> locale.

It's based on the charset.

For obvious reasons, UTF-8 is excluded here. What would be necessary for
a plugin like this to work with UTF-8 is snooping the content. I once
had a quick look at it -- seems rather straightforward to solve for
Cyrillic, but was much harder e.g. for Chinese chars.
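
A minimal sketch of what snooping the content could look like for the
Cyrillic case, in Python rather than Perl and not taken from any SA
plugin. The function names and the 0.5 threshold are assumptions chosen
for illustration; the codepoint range covers the Unicode Cyrillic and
Cyrillic Supplement blocks (U+0400 to U+052F).

```python
def cyrillic_ratio(text):
    """Fraction of alphabetic characters that fall in the Cyrillic
    (U+0400-U+04FF) or Cyrillic Supplement (U+0500-U+052F) blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if 0x0400 <= ord(ch) <= 0x052F)
    return cyrillic / len(letters)


def looks_cyrillic(raw, threshold=0.5):
    """Decode a UTF-8 body part and report whether at least `threshold`
    of its letters are Cyrillic. The threshold is an arbitrary choice."""
    text = raw.decode("utf-8", errors="replace")
    return cyrillic_ratio(text) >= threshold


print(looks_cyrillic("Привет, мир".encode("utf-8")))
print(looks_cyrillic(b"Hello, world"))
```

This only identifies the script, and says nothing about how to handle
scripts such as Chinese where the ranges are far less tidy.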


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}