Posted to j-dev@xerces.apache.org by bu...@apache.org on 2003/11/10 21:03:18 UTC

DO NOT REPLY [Bug 24579] New: - [XML 1.0] - E27: Must reject non-shortest forms in UTF-8

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=24579>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

           Summary: [XML 1.0] - E27: Must reject non-shortest forms in UTF-8
           Product: Xerces2-J
           Version: 2.5.0
          Platform: All
               URL: http://www.w3.org/XML/xml-V10-2e-errata#E27
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Other
        AssignedTo: xerces-j-dev@xml.apache.org
        ReportedBy: mrglavas@ca.ibm.com


E27 [1] states that "it is a fatal error if an entity encoded in UTF-8 contains 
any irregular code unit sequences, as defined in Unicode 3.1".  I looked at this 
erratum some time ago. In addition to treating irregular code unit sequences as 
a fatal error, we should also reject non-shortest forms. These non-shortest 
forms (such as C0 80 or E0 80 80, which both decode to code point 0) are not 
legal in Unicode 3.1. See "UTF-8 Corrigendum" and "Table 3.1B. Legal UTF-8 Byte 
Sequences" of Unicode 3.1 [3].

[1] http://www.w3.org/XML/xml-V10-2e-errata#E27
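
To illustrate the check being requested, here is a minimal sketch of a
validator that rejects non-shortest forms by restricting the lead byte and the
first trailing byte to the ranges of Table 3.1B as I understand them. The class
and method names are hypothetical and are not part of Xerces; a real fix would
belong in the parser's UTF-8 reader. The sketch does not detect the irregular
sequences (surrogates encoded directly in UTF-8) that E27 itself names.

```java
// Hypothetical helper, not Xerces code: checks a byte array against the
// lead-byte / trailing-byte ranges of Table 3.1B (Unicode 3.1).
final class Utf8ShortestFormCheck {

    private Utf8ShortestFormCheck() {}

    /** Returns true only if every byte sequence falls in a legal range. */
    static boolean isLegalUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int lead = bytes[i++] & 0xFF;
            int trailing;   // number of trailing bytes expected
            int lo = 0x80;  // allowed range for the FIRST trailing byte
            int hi = 0xBF;

            if (lead <= 0x7F) {
                trailing = 0;
            } else if (lead >= 0xC2 && lead <= 0xDF) {
                trailing = 1;              // C0 and C1 are always non-shortest
            } else if (lead >= 0xE0 && lead <= 0xEF) {
                trailing = 2;
                if (lead == 0xE0) lo = 0xA0; // E0 80..9F xx is non-shortest
            } else if (lead >= 0xF0 && lead <= 0xF4) {
                trailing = 3;
                if (lead == 0xF0) lo = 0x90; // F0 80..8F xx xx is non-shortest
                if (lead == 0xF4) hi = 0x8F; // beyond U+10FFFF otherwise
            } else {
                return false;              // 80..C1 or F5..FF as a lead byte
            }

            for (int k = 0; k < trailing; k++) {
                if (i >= bytes.length) return false; // truncated sequence
                int b = bytes[i++] & 0xFF;
                if (b < lo || b > hi) return false;
                lo = 0x80;                 // later trailing bytes: 80..BF
                hi = 0xBF;
            }
        }
        return true;
    }
}
```

With this sketch, both examples from the report are rejected: the lead byte C0
is never legal, and E0 must be followed by a byte in A0..BF. A parser applying
the same ranges would report a fatal error instead of returning false.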
