Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2015/01/29 16:01:58 UTC

[Bug 7126] New: Incorrect character set detections by normalize_charset

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

            Bug ID: 7126
           Summary: Incorrect character set detections by
                    normalize_charset
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: Mark.Martinec@ijs.si

Noticing that several of our local mail messages are considered by
MS::Message::Node::_normalize() / normalize_charset as being written
in far-East character sets and decoded as such, which clearly does not
make sense (they are actually in UTF-8 or Windows-1252 or ISO-8859-2),
I have put up an alternative reference implementation of _normalize()
and compared the results of the two, while manually checking the
reported differences.

In our case, one day's statistics show that more than 8 % of the
decisions taken by _normalize() were wrong. The most common
differences were:

- decoded as big5      (should be decoded as iso-8859-2)
- decoded as euc-kr    (should be decoded as utf-8)
- decoded as euc-jp    (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8     (should be decoded as windows-1252)
- not decoded          (should be decoded as gb2312)
- not decoded          (should be decoded as gbk)
- not decoded          (should be decoded as utf-8)

The source of the problem in my opinion is that the existing
_normalize() puts too much reliance on Encode::Detect::Detector
and the underlying "Mozilla's universal charset detector",
instead of trusting a declared character set (in a Content-Type),
and falling back to guesswork only when the declared character
set seems inconsistent with actual contents of a message part.
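
To illustrate why content-only guessing is inherently ambiguous, here is a
small sketch in Python (not the Perl code under discussion). It only shows
that the same octets can decode without error under more than one character
set, which is why the declared charset is a useful first hint. The sample
strings are made up for illustration.

```python
# The same octets decode without error under more than one charset,
# so validity alone cannot tell which reading was intended.
utf8_octets = "\u00e9".encode("utf-8")      # e with acute, two octets: 0xC3 0xA9
as_utf8 = utf8_octets.decode("utf-8")       # the intended single character
as_cp1252 = utf8_octets.decode("cp1252")    # also succeeds, but yields mojibake

single = b"\xe8"
as_latin2 = single.decode("iso-8859-2")     # c with caron (U+010D)
as_win1252 = single.decode("cp1252")        # e with grave (U+00E8)

print(repr(as_utf8), repr(as_cp1252), repr(as_latin2), repr(as_win1252))
```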

While relying primarily on guesswork may have made good sense
ten years ago, and probably still produces sensible results
in the far-East (as it errs on the side of far-Eastern character
sets), in my opinion that logic is flawed now that UTF-8 is
much more widespread.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5271|0                           |1
        is obsolete|                            |
   Attachment #5272|0                           |1
        is obsolete|                            |

--- Comment #8 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5277
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5277&action=edit
The suggested replacement subroutine MS::Message::Node::_normalize() - V2

In view of:

  [Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8
     will give garbage when decoding entities,

  and HTML::Parser bug:
    https://rt.cpan.org/Public/Bug/Display.html?id=99755

it seems desirable to be able to obtain from sub _normalize() either
decoded characters (Unicode) or UTF-8 encoded octets, so I have
generalized the proposed replacement sub _normalize() to provide
one or the other, based on an optional parameter. In its absence
it defaults to the current behaviour (returns UTF-8 octets),
preserving compatibility.

Attached is my last version of sub _normalize().
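
The idea of an optional parameter selecting the return form can be sketched
as follows. This is a Python illustration only, not the attached Perl
subroutine; the function name normalize_part and the parameter name
return_decoded are hypothetical.

```python
from typing import Union


def normalize_part(octets: bytes, charset: str,
                   return_decoded: bool = False) -> Union[str, bytes]:
    """Decode octets from charset; return characters or UTF-8 octets.

    return_decoded is a hypothetical flag: when omitted, the result is
    UTF-8 encoded octets, mirroring the compatible default described above.
    """
    text = octets.decode(charset, "strict")
    return text if return_decoded else text.encode("utf-8")


# Default: UTF-8 octets. With the flag: decoded characters.
as_octets = normalize_part(b"\xe8", "iso-8859-2")
as_chars = normalize_part(b"\xe8", "iso-8859-2", return_decoded=True)
```

Keeping the octets form as the default means existing callers see no change,
while a caller that feeds an HTML parser can ask for characters instead.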



Bug 7126: Incorrect character set detections
by normalize_charset - sub _normalize() V2
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1659255.

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #6 from Mark Martinec <Ma...@ijs.si> ---
Some interesting statistics, collected from 100,000 textual mail parts
as seen in two working days at our site. A single mail message can
be counted as more than one part (e.g. text/plain + text/html in case
of multipart/alternative), so the number of mail messages analyzed is
slightly less than half that number.

The debug messages were grepped by a ': message: .*charset' and
grouped into the following groups:

  11.1%  true US-ASCII (kept unchanged)
  67.5%  valid UTF-8 as declared in Content-Type (kept unchanged)
   0.2%  valid UTF-8 as detected/guessed (kept unchanged)
  20.8%  decoded (non-UTF-8) as declared in Content-Type
   0.4%  decoded (non-UTF-8) as detected/guessed

The 'decoded' and 'as detected/guessed' only occur with a setting:
  normalize_charset 1
(otherwise these would just have been kept as unchanged octets / Mojibake).

Condensing the above further yields:

  11.1%  true US-ASCII (kept unchanged)
  67.6%  is UTF-8      (kept unchanged)
  21.3%  decoded into UTF-8 (when normalize_charset is enabled)

So, 67.6% is natively UTF-8, and 88.9% of textual parts end up
as UTF-8 if normalize_charset is enabled. The remaining 11.1%
of textual mail parts is just plain ASCII text.
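
The arithmetic behind the 88.9% figure, as a short Python check using only
the three summarized percentages quoted above (the dictionary keys are just
labels chosen for this sketch):

```python
# Percentages of textual mail parts, taken from the summary above.
groups = {
    "us_ascii_unchanged": 11.1,
    "utf8_unchanged": 67.6,
    "decoded_into_utf8": 21.3,
}

# Parts that end up as UTF-8 when normalize_charset is enabled.
utf8_total = round(groups["utf8_unchanged"] + groups["decoded_into_utf8"], 1)

# Sanity check: the three groups should cover all textual parts.
grand_total = round(sum(groups.values()), 1)

print(utf8_total, grand_total)
```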


Interestingly (though not directly comparable), our 88.9% UTF-8 figure
is close to the 82.5% reported in "Usage of character encodings
for websites", January 2015:
  "UTF-8 is used by 82.5% of all the websites whose character
   encoding we know."
http://w3techs.com/technologies/overview/character_encoding/all

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #2 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5272
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5272&action=edit
The full proposed patch (includes documentation update, Conf.pm,
DependencyInfo.pm)

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #4 from Mark Martinec <Ma...@ijs.si> ---
Some small refinements:
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656048.


- noticed that some undecodable/invalid text is 'almost ASCII' but
  contains a few characters such as NBSP (non-breaking space) or SHY
  (soft hyphen), or some Windows-1252 punctuation at codes that are
  unassigned in ISO-8859; such text is now decoded as
  Windows-1252;

- improved debugging: report if some encoding is unrecognized
  or unsupported by module Encode
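
The 'almost ASCII' refinement in the first item above could look roughly
like the following. This is a Python sketch of the idea only, not the
committed Perl code; the function name and the exact set of accepted octets
are assumptions. Note that Python's cp1252 codec leaves a few codes in the
0x80..0x9F range undefined, so decoding can still fail and is handled.

```python
import re
from typing import Optional

# ASCII, plus octets 0x80..0x9F (where Windows-1252 places punctuation and
# ISO-8859 assigns no printable characters), plus NBSP (0xA0) and SHY (0xAD).
_ALMOST_ASCII = re.compile(rb"[\x00-\x9f\xa0\xad]*")


def decode_almost_ascii(octets: bytes) -> Optional[str]:
    """Return text decoded as Windows-1252 if it is 'almost ASCII', else None."""
    if not _ALMOST_ASCII.fullmatch(octets):
        return None
    try:
        return octets.decode("cp1252", "strict")
    except UnicodeDecodeError:
        # A code that is unassigned even in Windows-1252 (e.g. 0x81).
        return None


quoted = decode_almost_ascii(b"\x93quoted\x94\xa0text")
rejected = decode_almost_ascii(b"caf\xe9")
```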

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5271
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5271&action=edit
The suggested replacement subroutine MS::Message::Node::_normalize()

This implementation is compatible with the existing implementation;
the two agree with each other in about 92 % of cases.

Implemented with emphasis on:
- trust a declared character set as long as this assumption is not
  invalidated by the actual text (failed attempted strict decoding)
- avoid unnecessary decoding (by Encode::decode) where possible,
- avoid calling Encode::Detect::Detector unless necessary,
- produce useful debug diagnostics.
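
The priorities listed above can be sketched as a decision order. This is a
minimal Python illustration under stated assumptions, not the attached Perl
subroutine: choose_charset is a hypothetical name, and detect stands in for
a charset guesser (the role Encode::Detect::Detector plays in the original)
supplied by the caller.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("normalize")


def choose_charset(octets: bytes, declared: Optional[str],
                   detect: Optional[Callable[[bytes], Optional[str]]] = None
                   ) -> Optional[str]:
    """Return the charset to decode with, or None to leave octets as-is."""
    # 1. Pure ASCII needs no decoding at all.
    if octets.isascii():
        return "us-ascii"
    # 2. Trust the declared charset unless strict decoding disproves it.
    if declared:
        try:
            octets.decode(declared, "strict")
            return declared.lower()
        except LookupError:
            log.debug("charset %r unrecognized or unsupported", declared)
        except UnicodeDecodeError:
            log.debug("text is not valid %s, will try guessing", declared)
    # 3. Only now fall back to guesswork, and verify the guess too.
    guessed = detect(octets) if detect else None
    if guessed:
        try:
            octets.decode(guessed, "strict")
            return guessed.lower()
        except (LookupError, UnicodeDecodeError):
            log.debug("guessed charset %r does not fit the text", guessed)
    return None
```

For example, choose_charset(b"\xe9", "utf-8", detect=lambda o: "windows-1252")
rejects the declared UTF-8 (the octet is not valid UTF-8) and only then
consults the guesser.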

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #3 from Mark Martinec <Ma...@ijs.si> ---
Bug 7126: Incorrect character set detections by normalize_charset
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Message/Node.pm
  Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1655758.

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.4.1

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #7 from Mark Martinec <Ma...@ijs.si> ---
(sometimes a text is declared as ISO-8859-* but is actually UTF-8)

Bug 7126 - some more tweaks at sub _normalize
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1657862.
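
Text declared as ISO-8859-* but actually UTF-8 is hard to catch by strict
decoding alone, because a single-byte charset such as ISO-8859-1 accepts
every octet. One possible way to handle it, sketched in Python (this is an
assumption about the approach, not a transcription of the committed change;
the function names are hypothetical):

```python
from typing import Tuple


def looks_like_utf8(octets: bytes) -> bool:
    """True if octets contain non-ASCII bytes and form valid UTF-8."""
    if octets.isascii():
        return False
    try:
        octets.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return False
    return True


def decode_declared_latin(octets: bytes, declared: str) -> Tuple[str, str]:
    """Decode a part, preferring UTF-8 over a declared ISO-8859-* charset
    when the octets are valid multi-byte UTF-8. Returns (text, charset used)."""
    if declared.lower().startswith("iso-8859-") and looks_like_utf8(octets):
        return octets.decode("utf-8"), "utf-8"
    return octets.decode(declared), declared


word = "\u017e\u00e1ba"  # a short Latin-2 word used as sample text
mislabeled = decode_declared_latin(word.encode("utf-8"), "ISO-8859-2")
honest = decode_declared_latin(word.encode("iso-8859-2"), "ISO-8859-2")
```

The heuristic relies on the observation that genuine ISO-8859-* text with
accented letters rarely happens to form valid UTF-8 multi-byte sequences.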

[Bug 7126] Incorrect character set detections by normalize_charset

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126

--- Comment #5 from Mark Martinec <Ma...@ijs.si> ---
Seen a declared character set 'ANSI X3.4-1986' (i.e. US-ASCII)
in the wild, interesting.

Bug 7126: refinements: ANSI X3.4-1986, Windows-1252 quotes
  Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656447.
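
Handling unusual spellings such as 'ANSI X3.4-1986' amounts to mapping
charset aliases onto one canonical name before deciding how to decode.
A small Python sketch of that idea follows; the alias set and the function
name are assumptions made for illustration, not the list used in Node.pm.

```python
import re

# A few registered names for US-ASCII, written in lowercase with
# underscores (illustrative subset only).
_ASCII_ALIASES = {
    "ansi_x3.4-1968",
    "ansi_x3.4-1986",
    "iso646-us",
    "us-ascii",
    "ascii",
}


def canonical_charset(name: str) -> str:
    """Lowercase a declared charset name, unify spaces and underscores,
    and fold known US-ASCII aliases onto 'us-ascii'."""
    key = re.sub(r"[\s_]+", "_", name.strip().lower())
    return "us-ascii" if key in _ASCII_ALIASES else key


seen_in_the_wild = canonical_charset("ANSI X3.4-1986")
```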
