Posted to j-dev@xerces.apache.org by bu...@apache.org on 2003/01/21 06:02:32 UTC
DO NOT REPLY [Bug 16287] New: -
The parser cannot parse some UTF-8 encoded Japanese documents
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=16287>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=16287
The parser cannot parse some UTF-8 encoded Japanese documents
Summary: The parser cannot parse some UTF-8 encoded Japanese documents
Product: Xerces-J
Version: 1.4.4
Platform: All
OS/Version: All
Status: NEW
Severity: Major
Priority: Other
Component: Core
AssignedTo: xerces-j-dev@xml.apache.org
ReportedBy: moritaku@bx.jp.nec.com
CC: moritaku@bx.jp.nec.com
Some UTF-8 encoded Japanese documents cause a fatal error.
If a name containing multi-byte UTF-8 characters reaches or crosses a 16 KB
boundary in the file (an offset that is a multiple of 16384 bytes), the parser
reports
'[Fatal Error] testdata.xml:266:14: Element type "--a substring of the element
name in Japanese--" must be followed by either attribute specifications, ">" or
"/>".'
The following is part of a hex dump of the document. The element name in the
start tag that begins at offset 0x3fe1 spans the 16 KB boundary at 0x4000.
----
00003fe0 20 3c e6 97 a5 e6 9c ac e8 aa 9e e3 81 ae e3 81 | <..............|
00003ff0 bf e3 81 ae e3 82 a8 e3 83 ac e3 83 a1 e3 83 b3 |................|
00004000 e3 83 88 e5 90 8d e3 81 a7 e3 82 82 e3 83 80 e3 |................|
00004010 83 a1 e3 81 a7 e3 81 97 e3 82 87 3e e6 97 a5 e6 |...........>....|
00004020 9c ac e8 aa 9e e3 81 ae e3 81 bf e3 81 ae e3 82 |................|
00004030 a8 e3 83 ac e3 83 a1 e3 83 b3 e3 83 88 e5 90 8d |................|
00004040 e3 82 82 e3 83 80 e3 83 a1 e3 81 a7 e3 81 97 e3 |................|
00004050 82 87 3c 2f e6 97 a5 e6 9c ac e8 aa 9e e3 81 ae |..</............|
00004060 e3 81 bf e3 81 ae e3 82 a8 e3 83 ac e3 83 a1 e3 |................|
00004070 83 b3 e3 83 88 e5 90 8d e3 81 a7 e3 82 82 e3 83 |................|
00004080 80 e3 83 a1 e3 81 a7 e3 81 97 e3 82 87 3e 0a 3c |.............>.<|
00004090 2f 64 6f 63 3e 0a |/doc>.|
----
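The offsets in the dump can be cross-checked with a little arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes each element line is indented with two spaces, which is what the dump offsets imply (the start tag's '<' sits at 0x3fe1 after a 45-byte prolog and 263 filler lines).

```java
import java.nio.charset.StandardCharsets;

public class BoundaryCheck {
    // XML declaration plus the root start tag, as written by the test generator.
    static final String PROLOG = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<doc>\n";
    // Filler line. Assumption: two leading spaces (inferred from the dump offsets).
    static final String ELEM1 = "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    // The long Japanese element name of the final element (19 characters).
    static final String NAME2 = "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087";

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Byte offset of the first byte of the long name after the given number of filler lines. */
    static int nameStart(int copies) {
        return utf8Length(PROLOG) + copies * utf8Length(ELEM1) + utf8Length("  <");
    }

    /** True if the byte range [start, start + length) contains bytes on both sides of boundary. */
    static boolean straddles(int start, int length, int boundary) {
        return start < boundary && start + length > boundary;
    }

    public static void main(String[] args) {
        int start = nameStart(263);
        int length = utf8Length(NAME2);
        System.out.printf("name occupies 0x%x..0x%x%n", start, start + length - 1);
        System.out.println("straddles 0x4000: " + straddles(start, length, 0x4000));
    }
}
```

Under these assumptions the name occupies 0x3fe2 through 0x401a, which agrees with the '<' at 0x3fe1 and the '>' at 0x401b in the dump.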
The following code generates test data that triggers the problem.
----
import java.io.FileOutputStream;

public class MakeTestData {
    static final String xmldecl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    static final String rootbgn = "<doc>\n";
    // Filler element, 62 bytes in UTF-8. The two leading spaces matter: they
    // put the '<' of elem2 at offset 0x3fe1, as shown in the hex dump.
    static final String elem1 = "  <\u30a8\u30ec\u30e1\u30f3\u30c8>"
        + "\u65e5\u672c\u8a9e\u8981\u7d20\u540d\u3060\u3088"
        + "</\u30a8\u30ec\u30e1\u30f3\u30c8>\n";
    // Element whose 57-byte name in the start tag crosses offset 0x4000 (16 KB).
    static final String elem2 = "  <\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>"
        + "\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8\u540d\u3082\u30c0\u30e1\u3067\u3057\u3087"
        + "</\u65e5\u672c\u8a9e\u306e\u307f\u306e\u30a8\u30ec\u30e1\u30f3\u30c8"
        + "\u540d\u3067\u3082\u30c0\u30e1\u3067\u3057\u3087>\n";
    static final String rootend = "</doc>\n";
    static final String fname = "testdata.xml";
    static final String fenc = "UTF-8";

    public static void main(String[] args) {
        // Build the document: prolog, 263 filler elements, then the long-named element.
        StringBuffer buf = new StringBuffer();
        buf.append(xmldecl);
        buf.append(rootbgn);
        for (int i = 0; i < 263; i++) {
            buf.append(elem1);
        }
        buf.append(elem2);
        buf.append(rootend);
        String testdata = buf.toString();
        // Write the document to testdata.xml as UTF-8 bytes.
        try {
            FileOutputStream fos = new FileOutputStream(fname);
            fos.write(testdata.getBytes(fenc));
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
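
To check whether a given parser accepts the generated file, a small JAXP harness along the following lines could be used. This is a minimal sketch, not part of the original report: the class name ParseTestData and its parse helper are hypothetical, and which parser implementation actually runs depends on what the JAXP factory finds on the classpath (Xerces-J 1.4.4 would have to be placed there to reproduce the reported error).

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ParseTestData {
    /** Parses the file; returns null on success, or a description of the parse error. */
    static String parse(File file) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(file, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getSystemId() + ":" + e.getLineNumber() + ":"
                + e.getColumnNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the file name used by the generator above.
        String path = args.length > 0 ? args[0] : "testdata.xml";
        String error = parse(new File(path));
        System.out.println(error == null ? "parsed without errors" : "[Fatal Error] " + error);
    }
}
```

A parser that handles the boundary correctly should report no error for testdata.xml, since the generated document is well-formed.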