Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2018/11/17 11:03:22 UTC

[Bug 7656] New: UTF8 rules, normalize_charset etc overhaul

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

            Bug ID: 7656
           Summary: UTF8 rules, normalize_charset etc overhaul
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: All
                OS: All
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: hege@hege.li
  Target Milestone: Undefined

There are a few related bugs, but I'm creating a new one to oversee this.

I don't think we should release 4.0.0 before all UTF8 related functionality
works adequately and is documented properly.

I ran a few tests with a message that contains either latin1 or utf8 encoded
text (plain text, or simple html without any encoding clauses), in three
variants each: Content-Type charset missing, declared as utf8, or declared as
latin1.

body RULE_LATIN1 /päivää/
body RULE_UTF8 /p\xc3\xa4iv\xc3\xa4\xc3\xa4/

TEXT/PLAIN  normalize_charset 0 / 1
utf8 message, no ct       RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct     RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct   RULE_UTF8   / RULE_UTF8
latin1 message, no ct     RULE_LATIN1 / <no hits>
latin1 message, utf8 ct   RULE_LATIN1 / <no hits>
latin1 message, latin1 ct RULE_LATIN1 / RULE_UTF8

TEXT/HTML  normalize_charset 0 / 1
utf8 message, no ct       RULE_UTF8 / RULE_UTF8
utf8 message, utf8 ct     RULE_UTF8 / RULE_UTF8
utf8 message, latin1 ct   RULE_UTF8 / RULE_UTF8
latin1 message, no ct     RULE_UTF8 / <no hits>
latin1 message, utf8 ct   RULE_UTF8 / <no hits>
latin1 message, latin1 ct RULE_UTF8 / RULE_UTF8

- With normalize_charset 1, a latin1 message doesn't hit either rule unless it
contains Content-Type..ISO-8859-1 ??

- html parser apparently assumes everything is UTF8. Only matches UTF8 rules?

One can't even use simple workarounds such as "body RULE_FOO /p.iv../" to match
umlauts (diacritics?) in UTF8 messages, as each of them obviously eats up two
bytes.
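
A minimal sketch of why that workaround fails, using Python's byte-oriented
re module as a stand-in for byte-level rule matching. The helper name hits()
and the sample bodies are illustrative assumptions, not SpamAssassin code.

```python
import re

# Workaround pattern from the report: one "." per non-ASCII letter.
WORKAROUND = re.compile(rb"p.iv..")

def hits(body: bytes) -> bool:
    """Return True if the byte-level pattern matches anywhere in body."""
    return WORKAROUND.search(body) is not None

latin1_body = "päivää".encode("latin-1")  # 6 bytes, each "ä" is \xe4
utf8_body = "päivää".encode("utf-8")      # 9 bytes, each "ä" is \xc3\xa4

# "." consumes exactly one byte, so it cannot span a two-byte UTF-8 "ä".
print(len(latin1_body), hits(latin1_body))
print(len(utf8_body), hits(utf8_body))
```

In the UTF-8 body the byte after "p" is \xc3 and the next one is \xa4, so the
literal "i" in the pattern never lines up.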

Let's not even get into other things yet like sa-compile (bug 7645), textcat
etc that all expect some correct encoding to work..

Unless people want to use multiple rules to match non-utf8 and utf8 messages,
perhaps the only sane solution would be to "upgrade" all non-utf8 rules to utf8
internally and match them against the utf8 upgraded body. In that case the two
rules above would actually be duplicates and work on any message.
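
A rough sketch of that "upgrade" idea, assuming rules that are not valid UTF-8
were written in Latin-1. Both function names are hypothetical, and a real
implementation would also have to deal with regex constructs such as character
classes, where a multi-byte character cannot simply be substituted.

```python
import re

def upgrade_rule(pattern: bytes, source_charset: str = "latin-1") -> bytes:
    """Re-encode a rule pattern to UTF-8; leave valid UTF-8 untouched."""
    try:
        pattern.decode("utf-8")  # already valid UTF-8 (or plain ASCII)
        return pattern
    except UnicodeDecodeError:
        return pattern.decode(source_charset).encode("utf-8")

def normalize_body(body: bytes, declared: str) -> bytes:
    """Hypothetical stand-in for body normalization: transcode to UTF-8."""
    return body.decode(declared).encode("utf-8")

rule_latin1 = "päivää".encode("latin-1")
rule_utf8 = "päivää".encode("utf-8")

# After upgrading, both rules are byte-for-byte identical.
print(upgrade_rule(rule_latin1) == upgrade_rule(rule_utf8))

# And the upgraded rule matches a Latin-1 message once the body is normalized.
body = normalize_body("Hyvää päivää".encode("latin-1"), "latin-1")
print(re.search(upgrade_rule(rule_latin1), body) is not None)
```

One caveat with this heuristic: a Latin-1 pattern whose bytes happen to form
valid UTF-8 would be left as-is.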

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|blocker                     |normal
   Target Milestone|Undefined                   |4.0.0

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #16 from Henrik Krohns <ap...@hege.li> ---

Changed normalize_charset 1 to be the default and added some docs.

Sending        trunk/UPGRADE
Sending        trunk/lib/Mail/SpamAssassin/Conf.pm
Sending        trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Transmitting file data ...done
Committing transaction...
Committed revision 1890317.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #3 from Henrik Krohns <he...@hege.li> ---
(In reply to Henrik Krohns from comment #0)
> latin1 message, no ct     RULE_LATIN1 / <no hits>
> latin1 message, utf8 ct   RULE_LATIN1 / <no hits>
> latin1 message, no ct     RULE_UTF8 / <no hits>
> latin1 message, utf8 ct   RULE_UTF8 / <no hits>

Ok these should be now fixed..

Basically Encode::Detect::Detector thinks body "päivää" is Windows-1255
(Hebrew!!). 

dbg: message: failed decoding as declared charset UTF-8
dbg: message: decoded as detected charset windows-1255, declared UTF-8

Why are we using a module that hasn't been updated in 10 years anyway? Maybe
look at Encode::Guess which has been in core at least since 5.8.8?

I simply added Latin diacritic letters to SA's own basic Win-1252 detection. I
borrowed the \xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe bit from textcat, also
looking at https://en.wikipedia.org/wiki/Windows-1252 it seems correct. Not
sure if the missing ÿ (\xff) should be added here and to textcat..
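
To illustrate the kind of fallback described here, a small sketch: try the
declared charset first, then fall back to Windows-1252 if the body contains
bytes from the quoted letter class. The function name guess_decode() and the
fallback order are assumptions for illustration, not the code in Node.pm.

```python
import re
from typing import Tuple

# High-bit Latin letters in Windows-1252, the byte class quoted above.
LATIN_LETTERS = re.compile(rb"[\xc0-\xd6\xd8-\xde\xe0-\xf6\xf8-\xfe]")

def guess_decode(body: bytes, declared: str = "utf-8") -> Tuple[str, str]:
    """Return (text, charset_used) using a hypothetical fallback order."""
    try:
        return body.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        pass  # declared charset is wrong or unknown, try to detect
    if LATIN_LETTERS.search(body):
        return body.decode("windows-1252", errors="replace"), "windows-1252"
    return body.decode("latin-1"), "latin-1"  # last resort, never fails

print(guess_decode("päivää".encode("latin-1"), "utf-8"))
print(guess_decode("päivää".encode("utf-8"), "utf-8"))
print(guess_decode(b"\xff", "utf-8"))  # \xff is outside the letter class
```

The last call shows the open question about ÿ: a body whose only high byte is
\xff does not trigger the Windows-1252 branch in this sketch.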

Sending        spamassassin-3.4/lib/Mail/SpamAssassin/Message/Node.pm
Sending        trunk/lib/Mail/SpamAssassin/Message/Node.pm
Transmitting file data ..done
Committing transaction...
Committed revision 1846805.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #7 from Henrik Krohns <ap...@hege.li> ---
(In reply to Kevin A. McGrail from comment #6)
> I know I have some rules that fire differently with normalize_charset.

Could you show some examples?

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #9 from Henrik Krohns <ap...@hege.li> ---
Well yes that pretty much sums up what was already said in this bug. You can't
expect to match extended ascii characters like before. It's nothing but a
documentation issue.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #8 from Kevin A. McGrail <km...@apache.org> ---
Sure, here's one example:

#ZWNJ
#ZWNJ 200C 157 https://en.wikipedia.org/wiki/Windows-1256
# Also want to look at Unicode U+200C. 
# Also 'zero-width joiner' which is Windows-1256 0x9E and Unicode U+200D. $a

# Per RW, switching for this to work with 'normalize_charset 1', \x9d needs to
# be replaced with (?:\x9d|\xe2\x80\x8c)
mimeheader      __KAM_ZWNJ1     Content-Type =~ /charset.+windows-1256/i
body            __KAM_ZWNJ2     /(?:\x9D|\xe2\x80\x8c)/
tflags          __KAM_ZWNJ2     multiple maxhits=16
body            __KAM_ZWNJ3     /\&\#x200B;/i
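
A quick check of the byte values behind that alternation, as a Python sketch
(the names ZWNJ_RE and count_hits() are made up for illustration). It relies
on Python's cp1256 codec for the Windows-1256 mapping.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

cp1256_bytes = ZWNJ.encode("cp1256")  # single byte in Windows-1256
utf8_bytes = ZWNJ.encode("utf-8")     # three bytes once normalized

# Same alternation as the __KAM_ZWNJ2 body rule.
ZWNJ_RE = re.compile(rb"(?:\x9D|\xe2\x80\x8c)")

def count_hits(body: bytes) -> int:
    """Count non-overlapping matches of the alternation in body."""
    return len(ZWNJ_RE.findall(body))

print(cp1256_bytes, utf8_bytes)
print(count_hits(b"a\x9db"))                          # raw Windows-1256 body
print(count_hits("a\u200cb\u200c".encode("utf-8")))   # normalized body

# Caveat: a bare \x9d also occurs as a continuation byte of unrelated UTF-8
# characters, e.g. U+05DD (Hebrew final mem) encodes as \xd7\x9d.
print(count_hits("\u05dd".encode("utf-8")))
```

The last line suggests the \x9d branch can hit unrelated characters in
normalized text, so it probably only makes sense combined with a charset
check like __KAM_ZWNJ1.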

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hege@hege.li

--- Comment #1 from Henrik Krohns <he...@hege.li> ---
Lots of talk here too that I haven't digested yet..
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |7645


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7645
[Bug 7645] Wide character in print at /usr/bin/sa-compile line 433

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #2 from Henrik Krohns <he...@hege.li> ---
(In reply to Henrik Krohns from comment #0)
> Unless people want to use multiple rules to match non-utf8 and utf8
> messages, perhaps the only sane solution would be to "upgrade" all non-utf8
> rules to utf8 internally and do the matching to utf8 upgraded body. In such
> case the two rules above would actually be duplicates and work on any
> message.

Basically with this I mean that normalize_charset should affect rule parsing
too and encode the rules (and resulting regexes) to UTF8? I don't think we can
simply tell users to "convert all your rules/files to UTF8, if you want them to
work". I don't use UTF8 in my editors or Linuxes anywhere. :-)

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #11 from Henrik Krohns <ap...@hege.li> ---
This bug already floods dev@ list, if someone wants to chime in, feel free.

I have no intention of spending time posting on users@ at this stage, when it's
still only an idea and much is left to do. Developers are the ones who need to
steer this ship.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@bugzilla.spamassassin.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |7022


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7022
[Bug 7022] normalize_charset

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #17 from Henrik Krohns <ap...@hege.li> ---
This was sufficiently resolved with the previous commit. No changes were made
to how rules/cf-files and body are processed; that would be too complicated for
now and potentially backwards breaking.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #4 from Henrik Krohns <ap...@hege.li> ---
So getting back to this.

I've been running my SA with normalize_charset 1 without any ill-effects so
far. Should we head towards activating it by default in 4.0.0?

The only thing left after that would be documenting what format .cf files are
expected to be in. Probably just "bytes" without any special encoding? For
anything other than personal use, pure ascii should be used for portability
(non-ascii characters should be written as hex escapes, e.g. \xff).

To be compatible with both normalize_charset 0/1, it should be clearly
documented that any rules expected to hit latin1 extended characters would need
to be written to include both latin1/utf8 - "ä" -> (?:\xe4|\xc3\xa4). We could
also detect this automatically from rules and output a warning that it should
be fixed.
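
As a sketch of both ideas (building the dual latin1/utf8 alternation, and
flagging rules that contain raw high-bit bytes), something along these lines
could work. The helper names dual_encoding() and needs_warning() are
hypothetical, and the check is deliberately simplistic.

```python
import re

def dual_encoding(char: str) -> bytes:
    """Build an ASCII-only alternation matching char as Latin-1 or UTF-8."""
    def esc(raw: bytes) -> bytes:
        # Render each byte as a \xNN escape so the rule file stays pure ASCII.
        return "".join("\\x%02x" % b for b in raw).encode("ascii")
    latin1 = char.encode("latin-1")  # raises for characters outside Latin-1
    utf8 = char.encode("utf-8")
    return b"(?:" + esc(latin1) + b"|" + esc(utf8) + b")"

HIGH_BYTE = re.compile(rb"[\x80-\xff]")

def needs_warning(rule_pattern: bytes) -> bool:
    """Flag rule patterns that contain raw (unescaped) high-bit bytes."""
    return HIGH_BYTE.search(rule_pattern) is not None

a_umlaut = dual_encoding("ä")
print(a_umlaut.decode("ascii"))

rule = re.compile(b"p" + a_umlaut + b"iv")
print(rule.search("päivää".encode("latin-1")) is not None)
print(rule.search("päivää".encode("utf-8")) is not None)
print(needs_warning("päivää".encode("latin-1")), needs_warning(b"p" + a_umlaut))
```

Note that this check only catches raw high bytes in the rule text. A rule
using a lone \xe4 escape without the utf8 alternative would need an extra
check on the escape sequences themselves.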

One thing to consider would be removing the whole normalize_charset option, and
just force everything normalized, plain and simple.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #14 from Henrik Krohns <ap...@hege.li> ---
Good to hear, I cast my official +1 for normalize_charset 1 too.

There don't seem to be any dependency problems: Encode::Detect can still remain
optional and the required HTML::Parser 3.46 is from 2005..

Will check if there's anything in tests that should be changed.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|4745                        |
         Depends on|                            |4745

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Giovanni Bechis <gi...@paclan.it> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |giovanni@paclan.it

--- Comment #13 from Giovanni Bechis <gi...@paclan.it> ---
I am +1 to enable "normalize_charset 1" on 4.0.0 by default,
I have had it enabled for a long time in production without any issues.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |4745

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |6234, 7072

ATTN: BUG BUMP: [Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by Bill Cole <sa...@billmail.scconsult.com>.

This bug is part of the complex related to smoothing out all the edge 
and corner cases of character set encoding for v4. There is some concern 
that changing the default for normalize_charset (to enable it), or even 
removing the switch altogether, first requires nailing down documentation 
of how to match problem characters like the Latin-1 "extended ASCII" 
range: basically any 8-bit character >127.

Making the change requires some work on rules that look for those 
high-bit-set characters by people who understand encoding issues and 
common failings (e.g. using a 1-byte high-bit-set character in a 
notionally UTF-8 document.) My personal opinion is that the change is 
worth the work, but I admit that I've not completely audited the default 
rules for problematic cases. I have been writing rules to work with 
normalize_charset for many years however. With reasonably modern Perl, 
there's no strong argument for normalize_charset=0 beyond the technical 
debt of code and rules written to accommodate it.

On 15 Apr 2021, at 8:55, bugzilla-daemon@spamassassin.apache.org wrote:

> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656
>
> Bill Cole <bi...@apache.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |billcole@apache.org
>
> --- Comment #15 from Bill Cole <bi...@apache.org> ---
> (In reply to Henrik Krohns from comment #12)
>> Bumping this bug. Comments? Monologs are getting a bit tiresome.. :-)
>
> +1
>
> The minor pain of revamping rules that match non-ASCII characters is
> compensated by the fact that this is a *normalization* and so reduces the
> frequency of edge cases that escape rules written (perhaps inadvertently) to
> depend on a particular subset of possible encodings. My personal experience
> running SA instances that see a lot of non-ASCII messages is that enabling
> normalize_charset is a best practice, and the default is basically tech debt.
>
> As for requiring discussion on-list, these comments are sent to the dev list.
> I'm going to bump it there to get the attention of anyone filtering out
> Bugzilla mail (!? if that's a thing...) and will also post on the Users list to
> get a broader audience.

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Bill Cole <bi...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |billcole@apache.org

--- Comment #15 from Bill Cole <bi...@apache.org> ---
(In reply to Henrik Krohns from comment #12)
> Bumping this bug. Comments? Monologs are getting a bit tiresome.. :-)

+1

The minor pain of revamping rules that match non-ASCII characters is
compensated by the fact that this is a *normalization* and so reduces the
frequency of edge cases that escape rules written (perhaps inadvertently) to
depend on a particular subset of possible encodings. My personal experience
running SA instances that see a lot of non-ASCII messages is that enabling
normalize_charset is a best practice, and the default is basically tech debt. 

As for requiring discussion on-list, these comments are sent to the dev list.
I'm going to bump it there to get the attention of anyone filtering out
Bugzilla mail (!? if that's a thing...) and will also post on the Users list to
get a broader audience.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

Kevin A. McGrail <km...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@apache.org

--- Comment #6 from Kevin A. McGrail <km...@apache.org> ---
I'm a 0 on this.  I haven't seen this proposed for default on dev@ or users@ and
would like to see that done.  I know I have some rules that fire differently
with normalize_charset.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #12 from Henrik Krohns <ap...@hege.li> ---
Bumping this bug. Comments? Monologs are getting a bit tiresome.. :-)

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #5 from Henrik Krohns <ap...@hege.li> ---
I tried performance tests with mass-check, there's absolutely no difference
here for normalize_charset, total duration was always within normal +-2%
variance.

Rule differences between these were mainly:

__HIGHBITS
MPART_ALT_DIFF_COUNT
TVD_SPACE_RATIO
__freemail_safe_fwd

As we can see from __freemail_safe_fwd, if normalize is on, we can't assume
that a single dot will match a character like "ä".. committed (?:\xe4|\xc3\xa4)
fix for it. Question arises whether regexes should be run with unicode
semantics (. = single character) instead of matching raw bytes.
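
The difference between the two semantics can be sketched in Python, where a
bytes pattern matches raw bytes and a str pattern matches whole characters.
This only illustrates the concept and says nothing about how Perl or
SpamAssassin would implement it. The pattern is a made-up example.

```python
import re

BYTE_RE = re.compile(rb"p.iv..")  # "." = one byte
CHAR_RE = re.compile(r"p.iv..")   # "." = one character

body_utf8 = "päivää".encode("utf-8")  # normalized body as raw bytes

# Raw-byte matching: each "ä" is two bytes, so the dots do not line up.
byte_hit = BYTE_RE.search(body_utf8) is not None

# Character semantics: decode first, then every "." covers a whole "ä".
char_hit = CHAR_RE.search(body_utf8.decode("utf-8")) is not None

print(byte_hit, char_hit)
```

The trade-off is that character semantics require the body to be decoded
successfully first, while byte rules keep working on anything.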

Have to investigate if the others need fixing.

[Bug 7656] UTF8 rules, normalize_charset etc overhaul

Posted by bu...@spamassassin.apache.org.

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656

--- Comment #10 from Kevin A. McGrail <km...@apache.org> ---
Well for it to be the default in 4.0.0, I'd like it to be discussed on list,
please.
