Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2015/08/11 01:40:32 UTC

[Bug 7236] New: Net::DNS assumes strings in TXT resource records are in UTF-8 and gratuitously tries to decodes it

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7236

            Bug ID: 7236
           Summary: Net::DNS assumes strings in TXT resource records are
                    in UTF-8 and gratuitously tries to decodes it
           Product: Spamassassin
           Version: 3.4.1
          Hardware: All
                OS: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: Mark.Martinec@ijs.si

Noticed that some tokens that a Bayes plugin receives from a mail header
are flagged as perl native characters (utf8 flag on), even though they
are in plain ASCII. When such native strings are then concatenated with
other text in plain octets, e.g. UTF-8 encoded (utf8 flag off), perl does
an implicit upgrade of the result to native perl characters, which is
undesired for two reasons: processing such text through regexps may be
slower, and the resulting upgraded text (e.g. if it was in UTF-8
octets) will be doubly encoded - which is sometimes referred to as an
apparent (not actual) "Unicode Bug" (btw, we have been hit by this in
the HTML decoding module).
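
A minimal sketch of the double-encoding effect, written in Python rather
than perl, so it is a model of the byte-level outcome and not the actual
SpamAssassin code. It assumes that an implicit upgrade reads each octet
as a Latin-1 code point; the helper name perl_style_concat is
hypothetical.

```python
# Python model of the effect; not SpamAssassin or Net::DNS code.

def perl_style_concat(native: str, octets: bytes) -> str:
    """Hypothetical helper: join a character string with raw octets the
    way an implicit upgrade would, reading each octet as Latin-1."""
    return native + octets.decode("latin-1")

utf8_octets = "\u00e9".encode("utf-8")      # e-acute as UTF-8: b'\xc3\xa9'
joined = perl_style_concat("token:", utf8_octets)

# Writing the joined text out as UTF-8 encodes the two octets a second time.
double_encoded = joined.encode("utf-8")
print(double_encoded)                       # b'token:\xc3\x83\xc2\xa9'
```

The single character has become four octets, which is the "doubly
encoded" result described above.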

The perlunicode man page explains:

The "Unicode Bug"
  The term, "Unicode bug" has been applied to an inconsistency with the
  code points in the "Latin-1 Supplement" block, that is, between 128 and
  255.  Without a locale specified, unlike all other characters or code
  points, these characters can have very different semantics depending on
  the rules in effect.  (Characters whose code points are above 255 force
  Unicode rules; whereas the rules for ASCII characters are the same
  under both ASCII and Unicode rules.)

  Under Unicode rules, these upper-Latin1 characters are interpreted as
  Unicode code points, which means they have the same semantics as
  Latin-1 (ISO-8859-1) and C1 controls.

  As explained in "ASCII Rules versus Unicode Rules", under ASCII rules,
  they are considered to be unassigned characters.

  This can lead to unexpected results.  For example, a string's semantics
  can suddenly change if a code point above 255 is appended to it, which
  changes the rules from ASCII to Unicode.  As an example, consider the
  following program and its output:
  [...]


Net::DNS started behaving like this with version 0.69.

Net::DNS::RR::TXT::txtdata() calls Net::DNS::Text::value(),
which calls _decode_utf8 and hence Encode::decode. The result is
a native perl string (utf8 flag on), regardless of whether the
string of octets was in plain ASCII (which it usually is) or not.

Apart from the undesired utf8 flag (which in itself is not incorrect),
Encode::decode_utf8 turns each octet (e.g. a Latin-* or KOI8
character) which does not form a valid UTF-8 sequence
into a Unicode "Replacement character" U+FFFD (�, encoded as UTF-8:
\x{EF}\x{BF}\x{BD}), thus making a reverse operation impossible,
i.e. it loses information.
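
The information loss can be illustrated with a small sketch, again in
Python as a stand-in for the perl behaviour. It assumes a decoder that
substitutes U+FFFD for undecodable octets, which is what Python's
errors="replace" does; the helper name lossy_decode is hypothetical.

```python
def lossy_decode(octets: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable octets."""
    return octets.decode("utf-8", errors="replace")

# Two different Latin-1 strings (e-acute vs. e-grave in the last octet).
latin1_e_acute = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
latin1_e_grave = "caf\u00e8".encode("latin-1")   # b'caf\xe8'

a = lossy_decode(latin1_e_acute)
b = lossy_decode(latin1_e_grave)

print(a == b)              # True: both collapse to 'caf\ufffd'
print(a.encode("utf-8"))   # b'caf\xef\xbf\xbd'
```

Since two distinct inputs yield the same decoded string, no reverse
operation can tell them apart.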

Found the following explanation from a module maintainer:

  https://www.nlnetlabs.nl/pipermail/net-dns-users/2012/000005.html

which claims:
| The relevent RFCs define text to be encoded on the wire using ASCII,
| which is a proper subset of the UTF8 encoding used by the Text module

which isn't something that RFC1035 claims:
  <character-string> is a single length octet followed by that number
  of characters.  <character-string> is treated as binary information,
  and can be up to 256 characters in length (including the length octet)
(and states elsewhere that certain characters must be quoted in a zone file)
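
As an illustration of the RFC 1035 wording, a sketch in Python (not
Net::DNS code) that splits TXT RDATA into its <character-string>
fields while keeping them as raw octets. The function name
split_character_strings and the sample RDATA are made up for this
example.

```python
from typing import List

def split_character_strings(rdata: bytes) -> List[bytes]:
    """Split TXT RDATA into <character-string> fields: one length octet
    followed by that many octets, with no charset interpretation."""
    strings = []
    pos = 0
    while pos < len(rdata):
        length = rdata[pos]
        chunk = rdata[pos + 1 : pos + 1 + length]
        if len(chunk) != length:
            raise ValueError("truncated <character-string>")
        strings.append(chunk)
        pos += 1 + length
    return strings

# One ASCII string and one string of arbitrary high-bit octets.
rdata = b"\x05v=abc" + b"\x02\xe9\xff"
print(split_character_strings(rdata))   # [b'v=abc', b'\xe9\xff']
```

Nothing in the parsing step needs to know, or guess, an encoding.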

Moreover, the posting claims:
| Text rdata fields will return a standard Perl string using the internal
| encoding appropriate for the platform.  Characters which cannot
| otherwise be represented will be expressed as RFC1035 numeric escape
| sequences.
which clearly isn't the case: octets which cannot be decoded as
UTF-8 are turned into a Unicode "Replacement character", instead of
being expressed as an RFC 1035 numeric escape sequence. For example,
if strings in a TXT RR are in Latin-1 (or worse: Cyrillic KOI8),
the resulting text contains lots of � characters, and there is no way
to obtain the original text.
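
For comparison, a sketch of what RFC 1035 numeric escapes (\DDD, a
backslash followed by three decimal digits) would look like, and why
they are reversible. This is Python written for illustration, not the
Net::DNS implementation; both function names are hypothetical, and the
unescape helper assumes well-formed input.

```python
def escape_rfc1035(data: bytes) -> str:
    """Render octets as text, using \\DDD for anything outside
    printable ASCII and a backslash before '"' and '\\'."""
    out = []
    for octet in data:
        ch = chr(octet)
        if ch in '"\\':
            out.append("\\" + ch)
        elif 0x20 <= octet <= 0x7E:
            out.append(ch)
        else:
            out.append("\\%03d" % octet)
    return "".join(out)

def unescape_rfc1035(text: str) -> bytes:
    """Inverse of escape_rfc1035 (assumes well-formed input)."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            digits = text[i + 1 : i + 4]
            if len(digits) == 3 and digits.isdigit():
                out.append(int(digits))     # \DDD: decimal octet value
                i += 4
            else:
                out.append(ord(text[i + 1]))  # \X: literal character
                i += 2
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)

original = b"caf\xe9"
escaped = escape_rfc1035(original)
print(escaped)                                 # caf\233
print(unescape_rfc1035(escaped) == original)   # True
```

Unlike the replacement character, the escaped form keeps every octet
recoverable.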


There is no documented way to avoid this. I cannot find a way around
it, except by hot-patching or replacing one of these two modules,
which would not be future-proof.

I claim that this is incorrect behaviour in Net::DNS, and it certainly
is undesired. The DNS system (RFC 1035) has no inherent character set
or encoding, and is 8-bit clean: strings in TXT records can contain
arbitrary octets (up to the length limit). RFC 1035 does impose
encoding rules for zone file syntax, but this still does not restrict
what can go into <character-string> fields. It is up to an application
of DNS to attribute any encoding semantics to TXT resource records,
if it wants to.

For lack of a better solution, I'd like to make a simple change
to the three plugins that make use of Net::DNS::RR::TXT::txtdata,
namely to encode native perl strings back to UTF-8 octets if it
happens that a result is in native characters (i.e. Net::DNS >= 0.69).
There is no way to restore lost bytes, but hopefully these are
rare in actual DNS responses.
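
The shape of that change can be sketched as follows. This is a Python
analogue, not the committed perl code: Python's str/bytes distinction
stands in for perl's utf8 flag, and the helper name to_utf8_octets is
hypothetical.

```python
from typing import Union

def to_utf8_octets(value: Union[str, bytes]) -> bytes:
    """Return octets unchanged; encode character strings to UTF-8.

    Models "encode back only if the result is in native characters"."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value

# Older library behaviour (octets) passes through untouched.
print(to_utf8_octets(b"v=spf1 -all"))    # b'v=spf1 -all'

# Newer behaviour (decoded characters) is encoded back to octets.
print(to_utf8_octets("v=spf1 -all"))     # b'v=spf1 -all'

# Already-lost octets stay lost: U+FFFD is encoded, not restored.
print(to_utf8_octets("caf\ufffd"))       # b'caf\xef\xbf\xbd'
```

Callers then always work with octets, whichever library version
produced the value.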

Btw, remember that we have hit exactly the same issue with the
'question' section of a DNS response, which has been dealt with
in Bug 6959.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7236] Net::DNS assumes strings in TXT resource records are in UTF-8 and gratuitously tries to decodes it

Posted by bu...@bugzilla.spamassassin.org.

--- Comment #2 from Mark Martinec <Ma...@ijs.si> ---
Forgot to do the same in Dns.pm:

trunk:
  Sending lib/Mail/SpamAssassin/Dns.pm
  Sending lib/Mail/SpamAssassin/Plugin/AskDNS.pm  (updated comment)
Committed revision 1695336.

[Bug 7236] Net::DNS assumes strings in TXT resource records are in UTF-8 and gratuitously tries to decodes it

Posted by bu...@bugzilla.spamassassin.org.

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |4.0.0

--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
trunk:
  Sending lib/Mail/SpamAssassin/Plugin/ASN.pm
  Sending lib/Mail/SpamAssassin/Plugin/AskDNS.pm
  Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
Committed revision 1695182.

[Bug 7236] Net::DNS assumes strings in TXT resource records are in UTF-8 and gratuitously tries to decodes it

Posted by bu...@bugzilla.spamassassin.org.

Kevin A. McGrail <km...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
                 CC|                            |kmcgrail@apache.org

--- Comment #3 from Kevin A. McGrail <km...@apache.org> ---
Closing.  Fix committed 3 years ago.
