Posted to users@spamassassin.apache.org by "David F. Skoll" <df...@roaringpenguin.com> on 2012/05/29 21:58:21 UTC

Canonicalizing text parts to UTF-8 before applying body rules

Hi,

This idea is growing out of a thread I started in which someone pointed me
to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

Ignoring the locale under which SA runs and also ignoring the character
encoding of the message can make body matching rules behave differently
on different systems and just plain incorrectly for some messages.

I'm thinking of making something (a plugin, maybe?) that canonicalizes
text/* parts to UTF-8 and lets you write rules using Unicode regexes.
Something like:

body_utf8  __DRUGS_MUSCLE1 /.. proper Unicode regex/...

According to the perlunicode man page:

   Regular Expressions
       The regular expression compiler produces polymorphic opcodes.  That
       is, the pattern adapts to the data and automatically switches to
       the Unicode character scheme when presented with data that is
       internally encoded in UTF-8 -- or instead uses a traditional byte
       scheme when presented with byte data.

so assuming we present it with proper UTF-8 data, the regexes should Just Work.
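
Very roughly, the sort of thing I have in mind (an untested sketch; the
MIME part-extraction plumbing is hand-waved here, but Encode is core Perl):

    use strict;
    use warnings;
    use Encode qw(decode FB_DEFAULT);

    # Stand-ins for a text/* part after transfer decoding, and the
    # charset taken from its Content-Type header.
    my $charset = 'ISO-8859-1';
    my $raw     = "na\xEFve remedies";   # "naive" with i-diaeresis, in Latin-1

    # Replace malformed byte sequences with U+FFFD instead of dying.
    my $text = decode($charset, $raw, FB_DEFAULT);

    # The pattern now sees characters, not bytes, so \w and \p{...}
    # behave as perlunicode promises regardless of the source charset.
    print "hit\n" if $text =~ /\bna\w+ve\b/;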

I'm not sure how easy this will be, but I think it's worthwhile.
In the long run, I think all body rules should be body_utf8 and another
rule type should provide access to the body in its original encoding if that
is needed.

Comments?  Suggestions?

Regards,

David.

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
> I'm idly wondering what effect this would have on the time to scan a
> single email. I'd suspect the time required would increase significantly
> if the user has a "bloody ridiculous (but effective) lot of rules",
> such as I use.

I had the same thought but figured that we would have to improve things.  
Going from UTF-8 to UTF-32 scares me to tears though.  I think it might 
be easier to convert the globe to English ;-)

Regards,
KAM

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Wed, 30 May 2012 08:26:44 -0700
jdow <jd...@earthlink.net> wrote:

> I'm idly wondering what effect this would have on the time to scan a
> single email.

Actually converting from the original encoding to UTF-8 is very fast.
Internally, Perl uses pretty fast C code to convert between character
encodings.

As for Unicode regexes, I think they're pretty efficient in Perl.  We
added UTF-8 support to our Bayes tokenizer and we use some pretty
hairy regexes to pick out tokens (handling CJK glyphs is interesting).
Performance seems decent enough.
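
For a flavour of it, here is a toy illustration of the kind of thing
involved (nothing like the real tokenizer): CJK text has no word
separators, so one cheap trick is to emit each Han character as its own
token while splitting everything else on \w+ as usual:

    use strict;
    use warnings;
    use utf8;

    binmode STDOUT, ':encoding(UTF-8)';

    my $text = "Cheap tickets 北京 到 東京 now";

    # \p{Han} is tried first, so each ideograph becomes its own token;
    # \w+ picks up everything else as ordinary words.
    my @tokens = $text =~ /(\p{Han}|\w+)/g;
    print join('|', @tokens), "\n";
    # Cheap|tickets|北|京|到|東|京|now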

Regards,

David.

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by jdow <jd...@earthlink.net>.
On 2012/05/29 13:18, Kevin A. McGrail wrote:
> On 5/29/2012 3:58 PM, David F. Skoll wrote:
>> This idea is growing out of a thread I started in which someone pointed me
>> to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062
>>
>> Ignoring the locale under which SA runs and also ignoring the character
>> encoding of the message can make body matching rules behave differently
>> on different systems and just plain incorrectly for some messages.
>>
>> I'm thinking of making something (a plugin, maybe?) that canonicalizes
>> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
>> Something like:
>>
>> body_utf8 __DRUGS_MUSCLE1 /.. proper Unicode regex/...
>>
>> According to the perlunicode man page:
>>
>> Regular Expressions
>> The regular expression compiler produces polymorphic opcodes. That
>> is, the pattern adapts to the data and automatically switches to
>> the Unicode character scheme when presented with data that is
>> internally encoded in UTF-8 -- or instead uses a traditional byte
>> scheme when presented with byte data.
>>
>> so assuming we present it with proper UTF-8 data, the regexes should Just Work.
>>
>> I'm not sure how easy this will be, but I think it's worthwhile.
>> In the long run, I think all body rules should be body_utf8 and another
>> rule type should provide access to the body in its original encoding if that
>> is needed.
>>
>> Comments? Suggestions?
> Your idea seems elegant to me. I'd help support it in SA.
>
> Regards,
> KAM

I'm idly wondering what effect this would have on the time to scan a single
email. I'd suspect the time required would increase significantly if the
user has a "bloody ridiculous (but effective) lot of rules", such as I use.

{o.o}

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 5/29/2012 3:58 PM, David F. Skoll wrote:
> This idea is growing out of a thread I started in which someone pointed me
> to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062
>
> Ignoring the locale under which SA runs and also ignoring the character
> encoding of the message can make body matching rules behave differently
> on different systems and just plain incorrectly for some messages.
>
> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:
>
> body_utf8  __DRUGS_MUSCLE1 /.. proper Unicode regex/...
>
> According to the perlunicode man page:
>
>     Regular Expressions
>         The regular expression compiler produces polymorphic opcodes.  That
>         is, the pattern adapts to the data and automatically switches to
>         the Unicode character scheme when presented with data that is
>         internally encoded in UTF-8 -- or instead uses a traditional byte
>         scheme when presented with byte data.
>
> so assuming we present it with proper UTF-8 data, the regexes should Just Work.
>
> I'm not sure how easy this will be, but I think it's worthwhile.
> In the long run, I think all body rules should be body_utf8 and another
> rule type should provide access to the body in its original encoding if that
> is needed.
>
> Comments?  Suggestions?
Your idea seems elegant to me.  I'd help support it in SA.

Regards,
KAM

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Thu, 31 May 2012 09:05:00 +0200
"Andrzej A. Filip" <an...@gmail.com> wrote:

> a) Unicode itself may require canonicalization too.

Perl's core Unicode::Normalize module should take care of that.
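
Something like this minimal sketch (normalizing to NFC after decoding):

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFC);

    my $composed   = "\x{00E9}";    # e-acute as one precomposed character
    my $decomposed = "e\x{0301}";   # 'e' plus COMBINING ACUTE ACCENT

    # String equality fails on the raw forms but holds after NFC.
    print $composed eq $decomposed           ? "equal\n" : "different\n";
    print NFC($composed) eq NFC($decomposed) ? "equal\n" : "different\n";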

> b) some spammers do not declare the encoding properly, so some
> encoding guessing would be handy

Possibly, but probably not.  Guessing can lead to problems.

> c) It would be nice to allow access to _both_ the raw (bytes) and the
> UTF-8-encoded message body

Yes, absolutely, but with different rule types.

> d) many people in the "ASCII part of the world" would not need it
> anyway :-)

UTF-8 is a proper superset of ASCII, so the ASCII part of the world would
be unaffected.
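
That is, every ASCII byte string is already valid UTF-8, so the round
trip is the identity:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $ascii = "Plain ASCII body text.";
    # Decoding ASCII bytes as UTF-8 and re-encoding changes nothing.
    print "unchanged\n"
        if encode('UTF-8', decode('UTF-8', $ascii)) eq $ascii;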

Regards,

David.

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by "Andrzej A. Filip" <an...@gmail.com>.
On 05/29/2012 09:58 PM, David F. Skoll wrote:
> This idea is growing out of a thread I started in which someone pointed me
> to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062
>
> Ignoring the locale under which SA runs and also ignoring the character
> encoding of the message can make body matching rules behave differently
> on different systems and just plain incorrectly for some messages.
>
> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:
>
> body_utf8  __DRUGS_MUSCLE1 /.. proper Unicode regex/...
>
> According to the perlunicode man page:
>
>    Regular Expressions
>        The regular expression compiler produces polymorphic opcodes.  That
>        is, the pattern adapts to the data and automatically switches to
>        the Unicode character scheme when presented with data that is
>        internally encoded in UTF-8 -- or instead uses a traditional byte
>        scheme when presented with byte data.
>
> so assuming we present it with proper UTF-8 data, the regexes should Just Work.
>
> I'm not sure how easy this will be, but I think it's worthwhile.
> In the long run, I think all body rules should be body_utf8 and another
> rule type should provide access to the body in its original encoding if that
> is needed.
>
> Comments?  Suggestions?
It is a nice idea IMHO.
But it is worth remembering:
a) Unicode itself may require canonicalization too.
    Some characters may be represented either as a single precomposed
character or as a composition of several characters
b) some spammers do not declare the encoding properly, so some encoding
guessing would be handy
c) It would be nice to allow access to _both_ the raw (bytes) and the
UTF-8-encoded message body
d) many people in the "ASCII part of the world" would not need it anyway :-)


Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Wed, 30 May 2012 14:43:54 +0100
RW <rw...@googlemail.com> wrote:

> UTF-8 won't work; it will need to be UTF-32 to be compatible with
> sa-compile.  From the re2c man page:

Ah.  Too bad. :(

(I don't use sa-compile, so this is not a killer problem for me, but
I can see how it could be for some people.)

On Wed, 30 May 2012 17:02:48 +0300
Henrik K <he...@hege.li> wrote:

> Frankly I believe there are so many dependencies in SA that all this
> is impossible without modifying the whole engine to support Unicode.
> I don't see a point in a standalone plugin; what good does it do for
> the current SA body rules?

Nothing.  The reason I proposed a plugin was to start converting some
of the worst offenders in terms of FPs over to UTF-8.  Converting
everything in SA to support Unicode is a huge effort; doing it
piecemeal is obviously less efficient in the long run, but may be an
easier path to actually getting stuff done.

> The way the current
> eval-body-chunk-magic-tricks-code works with all its dependencies, I
> don't even know if it's possible to implement similar stuff as a
> "plugin".

I have no idea either.  I haven't looked closely enough at the code to
know.

Regards,

David.

Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by Henrik K <he...@hege.li>.
On Wed, May 30, 2012 at 02:43:54PM +0100, RW wrote:
> On Tue, 29 May 2012 15:58:21 -0400
> David F. Skoll wrote:
> 
> 
> > I'm thinking of making something (a plugin, maybe?) that canonicalizes
> > text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> > Something like:
> 
> > According to the perlunicode man page:
> > 
> >    Regular Expressions
> >        The regular expression compiler produces polymorphic opcodes.
> > That is, the pattern adapts to the data and automatically switches to
> >        the Unicode character scheme when presented with data that is
> >        internally encoded in UTF-8 -- or instead uses a traditional
> > byte scheme when presented with byte data.
> > 
> > so assuming we present it with proper UTF-8 data, the regexes should
> > Just Work.
> 
> UTF-8 won't work; it will need to be UTF-32 to be compatible with
> sa-compile.  From the re2c man page:
> 
> -u     Generate a parser that supports Unicode chars (UTF-32). This
>        means the generated code can deal with any valid Unicode
>        character up to 0x10FFFF. When UTF-8 or UTF-16 needs to be
>        supported you need to convert the incoming stream to UTF-32
>        upon input yourself.

Frankly I believe there are so many dependencies in SA that all this is
impossible without modifying the whole engine to support Unicode.  I don't
see a point in a standalone plugin; what good does it do for the current SA
body rules?  The way the current eval-body-chunk-magic-tricks-code works
with all its dependencies, I don't even know if it's possible to implement
similar stuff as a "plugin".


Re: Canonicalizing text parts to UTF-8 before applying body rules

Posted by RW <rw...@googlemail.com>.
On Tue, 29 May 2012 15:58:21 -0400
David F. Skoll wrote:


> I'm thinking of making something (a plugin, maybe?) that canonicalizes
> text/* parts to UTF-8 and lets you write rules using Unicode regexes.
> Something like:

> According to the perlunicode man page:
> 
>    Regular Expressions
>        The regular expression compiler produces polymorphic opcodes.
> That is, the pattern adapts to the data and automatically switches to
>        the Unicode character scheme when presented with data that is
>        internally encoded in UTF-8 -- or instead uses a traditional
> byte scheme when presented with byte data.
> 
> so assuming we present it with proper UTF-8 data, the regexes should
> Just Work.

UTF-8 won't work; it will need to be UTF-32 to be compatible with
sa-compile.  From the re2c man page:

-u     Generate a parser that supports Unicode chars (UTF-32). This
       means the generated code can deal with any valid Unicode
       character up to 0x10FFFF. When UTF-8 or UTF-16 needs to be
       supported you need to convert the incoming stream to UTF-32
       upon input yourself.
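
The conversion itself would be cheap enough (a sketch; whether
sa-compile's generated scanners would want little- or big-endian units
is a guess to verify):

    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $utf8_bytes = encode('UTF-8', "\x{5317}\x{4EAC}");  # two CJK chars
    # Fixed-width output: one 32-bit unit per code point, no BOM.
    my $utf32      = encode('UTF-32LE', decode('UTF-8', $utf8_bytes));
    printf "%d UTF-8 bytes -> %d UTF-32 bytes\n",
        length($utf8_bytes), length($utf32);   # 6 -> 8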