Posted to users@spamassassin.apache.org by Bogun Dmitriy <vu...@vugluskr.org.ua> on 2009/01/14 22:27:42 UTC

utf8

Hello.

Is there any way to make the configuration option "normalize_charset"
work? As I understand it, it does not work because of broken UTF-8 support.
But without it, there is no normal way to use SpamAssassin for
non-English messages.

I do not like having to write rules like this:
#body   LR_SEMINAR      /[[:blank:][:punct:]](((с|c)(е|e)(м|m)(и|u)(н|n|h)(а|a)(р|p))|((\xf1|\xd1|c)(\xe5|\xc5|e)(\xec|\xcc|m)(\xe8|\xc8|u)(\xed|\xcd|n)(\xe0|\xc0|a)(\xf0|\xd0|p))|((\xd3|\xf3|c)(\xc5|\xe5|e)(\xcd|\xed|m)(\xc9|\xe9|u)(\xce|\xee|n)(\xc1|\xe1|a)(\xd2|\xf2|p))|((\xe1|\x91|c)(\xa5|\x85|e)(\xac|\x8c|m)(\xa8|\x88|u)(\xad|\x8d|n)(\xa0|\x80|a)(\xe0|\x90|p))|((\xe1|\xc1|c)(\xd5|\xb5|e)(\xdc|\xbc|m)(\xd8|\xb8|u)(\xdd|\xbd|n)(\xd0|\xb0|a)(\xe0|\xc0|p)))[[:blank:][:punct:]]/i
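
For readers wondering where the byte escapes in a rule like this come from: the word being matched appears to be "семинар", and each legacy Cyrillic charset encodes it as a different byte sequence. The following is only an illustrative sketch in Python (not part of SpamAssassin); the charset list is an assumption about which encodings the rule was written for, and `encode_all` is a hypothetical helper name.

```python
# Illustrative only: show how one Cyrillic word becomes different bytes
# in different charsets. The charset list is an assumption, not something
# stated in the thread.
WORD = "семинар"
CHARSETS = ["cp1251", "koi8_r", "cp866", "iso8859_5", "utf_8"]


def encode_all(word, charsets=CHARSETS):
    """Return {charset name: space-separated hex bytes of the encoded word}."""
    result = {}
    for name in charsets:
        raw = word.encode(name)
        result[name] = " ".join(f"{byte:02x}" for byte in raw)
    return result


if __name__ == "__main__":
    for name, hex_bytes in encode_all(WORD).items():
        print(f"{name:10} {hex_bytes}")
```

The printed byte values can be compared with the `\x..` escapes in the rule. Without charset normalization, a rule has to list every such variant by hand.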

PS:

# spamassassin --version
SpamAssassin version 3.2.5
  running on Perl version 5.8.8

OS: Gentoo linux.



Re: utf8

Posted by Bogun Dmitriy <vu...@vugluskr.org.ua>.
On Sat, 17/01/2009 at 18:43 +0300, Sergey Kovalev wrote:

> Bogun Dmitriy wrote:
> > I have upgraded HTML::Parser to 3.59 (was 3.56), but it did not help:
> > the body is still not converted and my test rule does not match. I
> > tried messages in utf8, koi8-r and cp1251; none of them works. But when
> > I disabled normalize_charset, the UTF-8 message hit my rule, while the
> > others (koi8-r, cp1251) still did not. I think this is because local.cf
> > is in UTF-8 too.
> > 
> > Any suggestions on how to fix it?
> 
> One Russian SA user here on the list suggested adding the line
> use utf8;
> to Mail/SpamAssassin/Plugin/Check.pm
> and enabling normalize_charset in local.cf.
> 
> After that, my rules like
> body CYR_PORN_BODY_8 /(?:\b|^)оральный(?:$|\b)/
> began hitting the koi8-r spam that my postmaster@ receives a lot of.
> I use a Perl script that generates one-word rules from a list of "bad"
> words. It also adds a high-scoring summary rule that is hit when a
> certain number of the low-scoring rules hit.
> 
> I have not investigated whether it hits HTML or cp1251-encoded messages,
> because I have no time right now and have not seen them often.
> 
> But there is a side effect: SA starts complaining about some rules in
> 20_advance_fee.cf which contain non-ASCII characters. Since none of those
> rules are useful to me, I just renamed the file (I do not run sa-update
> from cron).

I applied this ugly hack, and it started working as described in the
documentation. I had found the same solution by googling. As I understand
it, this is an old bug. It works, which is good, but it is not a real
solution, which is bad.
Is there any developer here who can explain the project's policy on UTF-8
support? Will it be fixed, or does no one need it?

Re: utf8

Posted by Sergey Kovalev <sp...@kovalev.com.ru>.
Bogun Dmitriy wrote:
> I have upgraded HTML::Parser to 3.59 (was 3.56), but it did not help:
> the body is still not converted and my test rule does not match. I
> tried messages in utf8, koi8-r and cp1251; none of them works. But when
> I disabled normalize_charset, the UTF-8 message hit my rule, while the
> others (koi8-r, cp1251) still did not. I think this is because local.cf
> is in UTF-8 too.
> 
> Any suggestions on how to fix it?

One Russian SA user here on the list suggested adding the line
use utf8;
to Mail/SpamAssassin/Plugin/Check.pm
and enabling normalize_charset in local.cf.
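
As a rough illustration of what charset normalization is meant to achieve, the Python sketch below decodes each message body to Unicode first and then applies a single rule written once. It is not SpamAssassin's implementation: according to the thread, SpamAssassin relies on Encode::Detect, whereas this sketch simply trusts a charset passed in by the caller. The names `RULE`, `normalize` and `rule_hits` are hypothetical.

```python
import re

# One rule, written once in Unicode (a hypothetical stand-in for a body rule).
RULE = re.compile(r"\bсеминар\b", re.IGNORECASE)


def normalize(raw: bytes, charset: str) -> str:
    """Decode raw body bytes to a Unicode string; bad bytes become U+FFFD."""
    return raw.decode(charset, errors="replace")


def rule_hits(raw: bytes, charset: str) -> bool:
    """Return True if the single Unicode rule matches the normalized body."""
    return RULE.search(normalize(raw, charset)) is not None


if __name__ == "__main__":
    text = "Приглашаем на семинар!"
    for charset in ("koi8_r", "cp1251", "utf_8"):
        # The same rule is checked against the same text in three encodings.
        print(charset, rule_hits(text.encode(charset), charset))
```

Once every body is normalized to one representation, a rule no longer needs a separate alternation per charset.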

After that, my rules like
body CYR_PORN_BODY_8 /(?:\b|^)оральный(?:$|\b)/
began hitting the koi8-r spam that my postmaster@ receives a lot of.
I use a Perl script that generates one-word rules from a list of "bad"
words. It also adds a high-scoring summary rule that is hit when a
certain number of the low-scoring rules hit.
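
The generator script itself is not shown in the thread. A minimal sketch of the same idea, written in Python rather than Perl, might look like the following. The rule-name prefix, scores and threshold are all hypothetical; the output uses the standard `body`, `score` and `meta` configuration lines, with the `meta` rule summing the per-word rules.

```python
import re


def generate_rules(words, prefix="LOCAL_BADWORD", word_score=0.1,
                   threshold=3, summary_score=3.0):
    """Return config text: one body rule per word plus one summary meta rule.

    All names and numbers here are illustrative defaults, not values
    taken from the thread.
    """
    lines = []
    names = []
    for index, word in enumerate(words, start=1):
        name = f"{prefix}_{index}"
        names.append(name)
        # Escape regex metacharacters and the '/' rule delimiter.
        pattern = re.escape(word).replace("/", r"\/")
        lines.append(f"body  {name}  /(?:\\b|^){pattern}(?:$|\\b)/")
        lines.append(f"score {name}  {word_score}")
    # Summary rule: fires when at least `threshold` word rules hit.
    total = " + ".join(names)
    summary = f"{prefix}_MANY"
    lines.append(f"meta  {summary}  (({total}) >= {threshold})")
    lines.append(f"score {summary}  {summary_score}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_rules(["семинар", "оральный"], threshold=2))
```

The generated text would go into a local .cf file. As the thread notes, such rules only hit messages in other charsets when normalization is working.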

I have not investigated whether it hits HTML or cp1251-encoded messages,
because I have no time right now and have not seen them often.

But there is a side effect: SA starts complaining about some rules in
20_advance_fee.cf which contain non-ASCII characters. Since none of those
rules are useful to me, I just renamed the file (I do not run sa-update
from cron).

Re: utf8

Posted by Bogun Dmitriy <vu...@vugluskr.org.ua>.
On Thu, 15/01/2009 at 20:47 +0100, Benny Pedersen wrote:

> On Thu, January 15, 2009 17:27, Bogun Dmitriy wrote:
> 
> > perldoc Mail::SpamAssassin::Conf says that I need Encode::Detect
> 1.01 here

I have 1.01 too.

> > HTML::Parser version 3.46 or later. I have them both.
> 
> 3.59 here on my Gentoo

I have upgraded HTML::Parser to 3.59 (was 3.56), but it did not help: the
body is still not converted and my test rule does not match. I tried
messages in utf8, koi8-r and cp1251; none of them works. But when I
disabled normalize_charset, the UTF-8 message hit my rule, while the
others (koi8-r, cp1251) still did not. I think this is because local.cf
is in UTF-8 too.

Any suggestions on how to fix it?
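
One plausible reading of this observation, which is not confirmed anywhere in the thread, is a mismatch between byte strings and character strings: the rule text stays as raw UTF-8 bytes, while normalization turns the body into decoded characters. The Python sketch below only models that idea. It uses Latin-1 decoding as a stand-in for "one byte per character", and all names in it are hypothetical.

```python
WORD = "семинар"
TEXT = "Приглашаем на семинар"

# Rule text whose UTF-8 bytes are read one byte per character
# (Latin-1 maps every byte to exactly one code point).
rule_as_bytes = WORD.encode("utf-8").decode("latin-1")

# Body after charset normalization: properly decoded characters.
normalized_body = TEXT.encode("koi8_r").decode("koi8_r")

# UTF-8 body left unnormalized, viewed byte-per-character like the rule.
raw_utf8_body = TEXT.encode("utf-8").decode("latin-1")


def hits(pattern: str, body: str) -> bool:
    """Plain substring test, standing in for a regex rule match."""
    return pattern in body


if __name__ == "__main__":
    # Normalization off: raw bytes meet raw bytes, so the UTF-8 message hits.
    print("bytes rule vs raw utf-8 body:  ", hits(rule_as_bytes, raw_utf8_body))
    # Normalization on: characters in the body, bytes in the rule, so no hit.
    print("bytes rule vs normalized body: ", hits(rule_as_bytes, normalized_body))
    # Rule also read as characters: it hits the normalized body.
    print("char rule vs normalized body:  ", hits(WORD, normalized_body))
```

Under this model, a rule read as characters matches normalized bodies in any charset, which is consistent with the `use utf8;` workaround reported elsewhere in the thread.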

Re: utf8

Posted by Benny Pedersen <me...@junc.org>.
On Thu, January 15, 2009 17:27, Bogun Dmitriy wrote:

> perldoc Mail::SpamAssassin::Conf says that I need Encode::Detect

1.01 here

> HTML::Parser version 3.46 or later. I have them both.

3.59 here on my Gentoo

-- 
Benny Pedersen
Need more webspace ? http://www.servage.net/?coupon=cust37098


Re: utf8

Posted by Bogun Dmitriy <vu...@vugluskr.org.ua>.
On Thu, 15/01/2009 at 11:03 +0000, Justin Mason wrote:

> it should work, assuming you have the required CPAN module installed.

But it does not work.

There is a test message and a local.cf. Here is the processing log:

2009-01-15 14:51:42+02:00 mahoro.home.lan exim[22912]: SMTP connection from [192.168.214.254] (TCP/IP connection count = 1)
2009-01-15 14:51:44+02:00 mahoro.home.lan spamd[14177]: spamd: connection from alice.home.lan [192.168.214.254] at port 47223
2009-01-15 14:51:44+02:00 mahoro.home.lan spamd[14177]: spamd: setuid to mail succeeded
2009-01-15 14:51:44+02:00 mahoro.home.lan spamd[14177]: spamd: checking message <12...@localhost> for mail:8
2009-01-15 14:51:46+02:00 mahoro.home.lan spamd[14177]: spamd: clean message (0.0/10.0) for mail:8 in 2.6 seconds, 1666 bytes.
2009-01-15 14:51:46+02:00 mahoro.home.lan spamd[14177]: spamd: result: . 0 - BAYES_50,HTML_MESSAGE scantime=2.6,size=1666,user=mail,uid=8,required_score=10.0,rhost=alice.home.lan,raddr=192.168.214.254,rport=47223,mid=<12...@localhost>,bayes=0.500006,autolearn=disabled


perldoc Mail::SpamAssassin::Conf says that I need Encode::Detect and
HTML::Parser version 3.46 or later. I have them both.

[I] dev-perl/Encode-Detect (1.01@10.01.2009): Encode::Detect - An Encode::Encoding subclass that detects the encoding of data
[I] dev-perl/HTML-Parser (3.56@10.12.2008): Parse <HEAD> section of HTML documents

What am I missing?


Re: utf8

Posted by Benny Pedersen <me...@junc.org>.
On Thu, January 15, 2009 12:03, Justin Mason wrote:
> it should work, assuming you have the required CPAN module
> installed.

Which CPAN module is it?

I have also seen problems with some UTF-7 :/

-- 
Benny Pedersen


Re: utf8

Posted by Justin Mason <jm...@gmail.com>.
it should work, assuming you have the required CPAN module installed.

--j.
