Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2015/01/29 16:01:58 UTC
[Bug 7126] New: Incorrect character set detections by normalize_charset
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
Bug ID: 7126
Summary: Incorrect character set detections by normalize_charset
Product: Spamassassin
Version: 3.4.0
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: P2
Component: Libraries
Assignee: dev@spamassassin.apache.org
Reporter: Mark.Martinec@ijs.si
I noticed that several of our local mail messages are treated by
MS::Message::Node::_normalize() / normalize_charset as being written
in Far Eastern character sets and are decoded as such, which clearly
makes no sense (they are actually in UTF-8, Windows-1252 or ISO-8859-2).
I have therefore put together an alternative reference implementation
of _normalize() and compared the results of the two, manually checking
the reported differences.
In our case the one-day statistics show that more than 8% of the
decisions taken by _normalize() were wrong. The most common
differences were:
- decoded as big5 (should be decoded as iso-8859-2)
- decoded as euc-kr (should be decoded as utf-8)
- decoded as euc-jp (should be decoded as utf-8)
- decoded as shift_jis (should be decoded as windows-1252)
- decoded as utf-8 (should be decoded as windows-1252)
- not decoded (should be decoded as gb2312)
- not decoded (should be decoded as gbk)
- not decoded (should be decoded as utf-8)
The source of the problem, in my opinion, is that the existing
_normalize() relies too heavily on Encode::Detect::Detector
and the underlying "Mozilla's universal charset detector",
instead of trusting the declared character set (in a Content-Type)
and falling back to guesswork only when the declared character
set seems inconsistent with the actual contents of a message part.
While relying primarily on guesswork may have made good sense
ten years ago, and probably still produces sensible results
in the Far East (as it errs on the side of Far Eastern character
sets), UTF-8 is much more widespread nowadays, so in my
opinion the logic is now flawed.
--
You are receiving this mail because:
You are the assignee for the bug.
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
Mark Martinec <Ma...@ijs.si> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5271|0                           |1
        is obsolete|                            |
   Attachment #5272|0                           |1
        is obsolete|                            |
--- Comment #8 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5277
--> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5277&action=edit
The suggested replacement subroutine MS::Message::Node::_normalize() - V2
In view of:
[Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8
will give garbage when decoding entities,
and HTML::Parser bug:
https://rt.cpan.org/Public/Bug/Display.html?id=99755
it seems desirable to be able to obtain from sub _normalize either
decoded characters (Unicode), or encoded as UTF-8 octets,
so I have generalized the proposed replacement sub _normalize()
to provide one or the other, based on an optional parameter.
In its absence it defaults to current behaviour (returns UTF-8
octets), preserving compatibility.
Attached is my last version of sub _normalize().
Bug 7126: Incorrect character set detections
by normalize_charset - sub _normalize() V2
Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1659255.
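
The optional-parameter idea described in this comment can be illustrated
with a minimal sketch. It is written in Python rather than Perl and is not
the attached subroutine; the function name normalize_output and the
parameter name return_decoded are hypothetical, chosen only for this
illustration.

```python
def normalize_output(octets, charset, return_decoded=False):
    """Decode octets from charset, then return either form.

    Hypothetical helper: by default it returns UTF-8 encoded octets
    (the backward-compatible behaviour described in the comment);
    with return_decoded=True it returns decoded Unicode characters.
    """
    text = octets.decode(charset, errors="strict")
    if return_decoded:
        return text                 # Unicode characters
    return text.encode("utf-8")     # UTF-8 octets (default)
```

A caller that feeds an HTML parser expecting characters would pass
return_decoded=True; existing callers keep receiving UTF-8 octets.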
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #6 from Mark Martinec <Ma...@ijs.si> ---
Some interesting statistics, collected from 100,000 textual mail parts
as seen in two working days at our site. A single mail message can
be counted as more than one part (e.g. text/plain + text/html in case
of multipart/alternative), so the number of mail messages analyzed is
slightly less than half that number.
The debug messages were filtered with grep using the pattern
': message: .*charset' and sorted into the following groups:
11.1% true US-ASCII (kept unchanged)
67.5% valid UTF-8 as declared in Content-Type (kept unchanged)
0.2% valid UTF-8 as detected/guessed (kept unchanged)
20.8% decoded (non-UTF-8) as declared in Content-Type
0.4% decoded (non-UTF-8) as detected/guessed
The 'decoded' and 'as detected/guessed' only occur with a setting:
normalize_charset 1
(otherwise these would just have been kept as unchanged octets / Mojibake).
Summarizing the above further yields:
11.1% true US-ASCII (kept unchanged)
67.6% is UTF-8 (kept unchanged)
21.3% decoded into UTF-8 (when normalize_charset is enabled)
So, 67.6% is natively UTF-8, and 88.9% of textual parts end up
as UTF-8 if normalize_charset is enabled. The remaining 11.1%
of textual mail parts is just plain ASCII text.
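
The arithmetic behind these figures can be checked with a short sketch.
It only uses the three summary percentages quoted above; the variable
names are illustrative, and Decimal is used merely to avoid binary
floating-point rounding noise.

```python
from decimal import Decimal

# Summary percentages as quoted in this comment.
ascii_share = Decimal("11.1")   # true US-ASCII, kept unchanged
native_utf8 = Decimal("67.6")   # already UTF-8, kept unchanged
decoded     = Decimal("21.3")   # decoded into UTF-8 by normalize_charset

# Share of textual parts that end up as UTF-8 with normalize_charset on.
utf8_after_normalization = native_utf8 + decoded

# All three groups together should cover every textual part.
total = ascii_share + utf8_after_normalization

print(utf8_after_normalization, total)
```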
Interestingly (although not directly comparable), our 88.9% UTF-8 figure
is close to the 82.5% reported in "Usage of character encodings
for websites", January 2015:
"UTF-8 is used by 82.5% of all the websites whose character
encoding we know."
http://w3techs.com/technologies/overview/character_encoding/all
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #2 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5272
--> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5272&action=edit
The full proposed patch (includes documentation update, Conf.pm, DependencyInfo.pm)
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #4 from Mark Martinec <Ma...@ijs.si> ---
Some small refinements:
Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656048.
- noticed that some undecodable/invalid text is 'almost ASCII' but
contains a few characters such as NBSP (non-breaking space) or SHY
(soft hyphen), or some Windows-1252 punctuation at codes
which are unassigned in ISO-8859, so deal with that: decode it
as Windows-1252;
- improved debugging: report if some encoding is unrecognized
or unsupported by module Encode
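
The 'almost ASCII' refinement can be sketched as follows. This is a Python
illustration, not the committed Perl code; the exact byte set is an
assumption made for the example (NBSP, SHY, and a handful of Windows-1252
punctuation codes in the 0x80-0x9F range), and the names
ALMOST_ASCII_EXTRAS and decode_almost_ascii are hypothetical. It is meant
for text that has already failed decoding by other means.

```python
# Assumed set of "harmless extras": NBSP (0xA0), SHY (0xAD) and some
# Windows-1252 punctuation codes (euro sign, low quotes, ellipsis,
# angle quotes, curly quotes, bullet, en dash, em dash).
ALMOST_ASCII_EXTRAS = frozenset([
    0xA0, 0xAD,
    0x80, 0x82, 0x84, 0x85, 0x8B,
    0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x9B,
])

def decode_almost_ascii(octets):
    """Return text decoded as Windows-1252 if every byte is either
    7-bit ASCII or one of the assumed extras; otherwise return None."""
    if all(b < 0x80 or b in ALMOST_ASCII_EXTRAS for b in octets):
        return octets.decode("cp1252")
    return None
```

Restricting the check to bytes that Windows-1252 actually defines keeps
the decode step from failing on the few codes that cp1252 leaves
unassigned.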
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5271
--> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5271&action=edit
The suggested replacement subroutine MS::Message::Node::_normalize()
This implementation is compatible with the existing implementation;
the two agree with each other in about 92% of cases.
Implemented with emphasis on:
- trust a declared character set as long as this assumption is not
invalidated by the actual text (failed attempted strict decoding)
- avoid unnecessary decoding (by Encode::decode) where possible,
- avoid calling Encode::Detect::Detector unless necessary,
- produce useful debug diagnostics.
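
The design goals listed above can be sketched roughly as follows. This is
a Python illustration under stated assumptions, not the attached Perl
subroutine: the names try_strict and normalize, the guess_charset callback
(standing in for a detector such as Encode::Detect::Detector) and the log
list are all hypothetical.

```python
def try_strict(octets, charset):
    """Strictly decode octets; return None on failure or unknown charset."""
    try:
        return octets.decode(charset, errors="strict")
    except (LookupError, UnicodeDecodeError):
        return None

def normalize(octets, declared=None, guess_charset=None, log=None):
    """Return (text, charset_used), or (None, None) if nothing fits."""
    log = [] if log is None else log
    # Avoid unnecessary work: pure 7-bit text needs no detection.
    if octets.isascii():
        log.append("kept as us-ascii")
        return octets.decode("ascii"), "us-ascii"
    # Trust the declared charset unless strict decoding invalidates it.
    if declared:
        text = try_strict(octets, declared)
        if text is not None:
            log.append(f"decoded as declared {declared}")
            return text, declared
        log.append(f"declared {declared} failed strict decoding")
    # Only now fall back to guesswork (hypothetical detector callback).
    if guess_charset is not None:
        guessed = guess_charset(octets)
        if guessed:
            text = try_strict(octets, guessed)
            if text is not None:
                log.append(f"decoded as guessed {guessed}")
                return text, guessed
    log.append("not decoded")
    return None, None
```

The detector callback is invoked only after the declared charset has been
tried and rejected, which mirrors the "avoid calling the detector unless
necessary" goal.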
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #3 from Mark Martinec <Ma...@ijs.si> ---
Bug 7126: Incorrect character set detections by normalize_charset
Sending lib/Mail/SpamAssassin/Conf.pm
Sending lib/Mail/SpamAssassin/Message/Node.pm
Sending lib/Mail/SpamAssassin/Util/DependencyInfo.pm
Committed revision 1655758.
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
Mark Martinec <Ma...@ijs.si> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.4.1
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #7 from Mark Martinec <Ma...@ijs.si> ---
(sometimes a text is declared as ISO-8859-* but is actually UTF-8)
Bug 7126 - some more tweaks at sub _normalize
Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1657862.
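
The case mentioned in this comment (text declared as ISO-8859-* that is
actually UTF-8) could be handled along these lines. This is a Python
sketch of one plausible check, not the committed Perl change; the function
name effective_charset is hypothetical.

```python
def effective_charset(octets, declared):
    """Prefer UTF-8 over a declared ISO-8859-* charset when the octets
    contain non-ASCII bytes and form strictly valid UTF-8."""
    if declared.lower().startswith("iso-8859-") and not octets.isascii():
        try:
            octets.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return declared      # not valid UTF-8: keep the declaration
        return "utf-8"           # valid multi-byte UTF-8: override
    return declared
```

The reasoning behind the check is that genuine ISO-8859 text with high-bit
bytes rarely happens to be well-formed UTF-8, so a successful strict
UTF-8 decode is strong evidence of a wrong declaration.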
[Bug 7126] Incorrect character set detections by normalize_charset
Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7126
--- Comment #5 from Mark Martinec <Ma...@ijs.si> ---
I have seen a declared character set 'ANSI X3.4-1986' (i.e. US-ASCII)
in the wild. Interesting.
Bug 7126: refinements: ANSI X3.4-1986, Windows-1252 quotes
Sending lib/Mail/SpamAssassin/Message/Node.pm
Committed revision 1656447.
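
Handling an unusual charset label such as the one above amounts to an
alias lookup before decoding. A minimal Python sketch follows; the table
CHARSET_ALIASES and the function canonical_charset are hypothetical and
cover only the US-ASCII labels relevant here, not whatever mapping the
committed code uses.

```python
import re

# Illustrative alias table: both labels denote US-ASCII.
CHARSET_ALIASES = {
    "ansi x3.4-1986": "us-ascii",
    "ansi x3.4-1968": "us-ascii",
}

def canonical_charset(name):
    """Map a declared charset label to a canonical lower-case name."""
    cleaned = name.strip().strip("'\"").lower()
    # Treat runs of spaces or underscores alike for the lookup key only,
    # so 'ANSI X3.4-1986' and 'ANSI_X3.4-1986' both match.
    key = re.sub(r"[\s_]+", " ", cleaned)
    return CHARSET_ALIASES.get(key, cleaned)
```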