Posted to j-dev@xerces.apache.org by bu...@apache.org on 2003/01/21 06:02:32 UTC

DO NOT REPLY [Bug 16287] New: - The parser cannot parse some UTF-8 encoded Japanese documents

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=16287>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=16287

The parser cannot parse some UTF-8 encoded Japanese documents

           Summary: The parser cannot parse some UTF-8 encoded Japanese
                    documents
           Product: Xerces-J
           Version: 1.4.4
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Major
          Priority: Other
         Component: Core
        AssignedTo: xerces-j-dev@xml.apache.org
        ReportedBy: moritaku@bx.jp.nec.com
                CC: moritaku@bx.jp.nec.com


Some UTF-8 encoded Japanese documents cause a fatal error.

If a name containing multi-byte UTF-8 characters reaches or crosses any
16 KB boundary in its file, the parser reports
'[Fatal Error] testdata.xml:266:14: Element type "--a substring of the element
name in Japanese--" must be followed by either attribute specifications, ">" or
"/>".'

The following is part of a hex dump of the document.
----
00003fe0  20 3c e6 97 a5 e6 9c ac  e8 aa 9e e3 81 ae e3 81  | <..............|
00003ff0  bf e3 81 ae e3 82 a8 e3  83 ac e3 83 a1 e3 83 b3  |................|
00004000  e3 83 88 e5 90 8d e3 81  a7 e3 82 82 e3 83 80 e3  |................|
00004010  83 a1 e3 81 a7 e3 81 97  e3 82 87 3e e6 97 a5 e6  |...........>....|
00004020  9c ac e8 aa 9e e3 81 ae  e3 81 bf e3 81 ae e3 82  |................|
00004030  a8 e3 83 ac e3 83 a1 e3  83 b3 e3 83 88 e5 90 8d  |................|
00004040  e3 82 82 e3 83 80 e3 83  a1 e3 81 a7 e3 81 97 e3  |................|
00004050  82 87 3c 2f e6 97 a5 e6  9c ac e8 aa 9e e3 81 ae  |..</............|
00004060  e3 81 bf e3 81 ae e3 82  a8 e3 83 ac e3 83 a1 e3  |................|
00004070  83 b3 e3 83 88 e5 90 8d  e3 81 a7 e3 82 82 e3 83  |................|
00004080  80 e3 83 a1 e3 81 a7 e3  81 97 e3 82 87 3e 0a 3c  |.............>.<|
00004090  2f 64 6f 63 3e 0a                                 |/doc>.|
----
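
The offsets in the dump can be checked with a few lines of arithmetic. The
sketch below is only an illustration: BoundaryCheck, touchesBoundary and
utf8Length are hypothetical names, not part of Xerces, and "reaches or
crosses" is interpreted as "the byte range of the name extends to the next
multiple of 16384 or beyond". The start offset 0x3fe2 is read from the dump
('<' sits at 0x3fe1).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces or of the original report.
final class BoundaryCheck {
    /** Block size taken from the report's "16kbyte-length boundary". */
    static final int BLOCK = 16 * 1024;

    /**
     * Returns true if the byte range [start, end) reaches or crosses
     * the first multiple of block that lies strictly after start.
     */
    static boolean touchesBoundary(long start, long end, int block) {
        long next = (start / block + 1) * block;
        return end >= next;
    }

    /** Number of bytes s occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Element name taken from the hex dump: 19 characters, 3 bytes each.
        String name = "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
                + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087";
        long start = 0x3fe2;                    // first byte after '<' at 0x3fe1
        long end = start + utf8Length(name);    // offset of the closing '>'
        System.out.println(Long.toHexString(end));               // 401b
        System.out.println(touchesBoundary(start, end, BLOCK));  // true
    }
}
```

The name occupies 0x3fe2 through 0x401a, so it spans the boundary at 0x4000,
which matches the dump.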

The following code generates test data that triggers the problem.
----
import java.io.FileOutputStream;

public class MakeTestData {
    static final String xmldecl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";

    static final String rootbgn = "<doc>\n";
    static final String elem1 =
        "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    static final String elem2 =
        "  <\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>"
        + "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3082\u30c0\u30e1\u3067\u3057\u3087"
        + "</\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>\n";
    static final String rootend = "</doc>\n";
    static final String fname = "testdata.xml";
    static final String fenc = "UTF-8";

    public static void main(String[] args) {
        StringBuffer buf = new StringBuffer();

        buf.append(xmldecl);
        buf.append(rootbgn);

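        // 263 copies of elem1 (62 bytes each in UTF-8) after the 45-byte
        // prologue put elem2 at offset 0x3fdf, so its element name spans
        // the 16 KB boundary at 0x4000.
        for (int i=0; i<263; i++) {
            buf.append(elem1);
        }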

        buf.append(elem2);
        buf.append(rootend);

        String testdata = buf.toString();

        try {
            FileOutputStream fos = new FileOutputStream(fname);
            try {
                fos.write(testdata.getBytes(fenc));
            } finally {
                fos.close();  // close the file even if the write fails
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
----
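
To reproduce the error, the generated file has to be run through the parser.
The report does not say which API or sample program was used, so the
following is only a sketch that assumes the standard JAXP SAX entry points
(SAXParserFactory, DefaultHandler) with whatever parser implementation is
first on the classpath. ParseTestData and tryParse are hypothetical names,
and the output format merely imitates the message quoted above.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reproduction harness, not taken from the original report.
final class ParseTestData {
    /** Returns null if the stream parses cleanly, else "line:column: message". */
    static String tryParse(InputStream in) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(in, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are reported as plain text.
            return e.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "testdata.xml";
        InputStream in = new FileInputStream(file);
        try {
            String error = tryParse(in);
            System.out.println(error == null
                    ? "parsed OK"
                    : "[Fatal Error] " + file + ":" + error);
        } finally {
            in.close();
        }
    }
}
```

Run MakeTestData first, then run ParseTestData in the same directory with the
Xerces-J version under test on the classpath.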
