Posted to j-dev@xerces.apache.org by Dmitry Mordovin <dm...@dwide.com> on 2015/03/27 07:52:11 UTC

UTF8Reader: Invalid byte sequence

Hi!

I am trying to parse an HTML string that contains English, Russian and Vietnamese characters.

Sample:

Document doc = builder.parse(new StringBufferInputStream("<html><body>Eng Рус Việt Nam</body></html>"));

The Java file is stored as UTF-8.
I even checked the string "Eng Рус Việt Nam" with an online conversion
service. Result: the input string encoding is the same as the output, UTF-8.
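
An online converter only sees the text, not the bytes the parser receives. A local check may be more telling. The following is a minimal sketch, not a definitive diagnosis; the class name ByteCheck and its helper methods are hypothetical. It compares the bytes produced by the deprecated java.io.StringBufferInputStream (whose Javadoc states that only the low eight bits of each character are used) with the real UTF-8 encoding of the same string. The string is written with \u escapes so the result does not depend on the source file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringBufferInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteCheck {
    // Drain a stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }

    // Bytes as the parser would see them through StringBufferInputStream.
    @SuppressWarnings("deprecation")
    static byte[] viaStringBufferInputStream(String s) throws IOException {
        return readAll(new StringBufferInputStream(s));
    }

    // Bytes of the proper UTF-8 encoding of the same string.
    static byte[] viaUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "Eng Рус Việt Nam" with a precomposed U+1EC7 (assumption).
        String s = "Eng \u0420\u0443\u0441 Vi\u1EC7t Nam";
        byte[] low = viaStringBufferInputStream(s);
        byte[] utf8 = viaUtf8(s);
        System.out.println("StringBufferInputStream bytes: " + low.length);
        System.out.println("UTF-8 bytes:                   " + utf8.length);
        System.out.println("identical: " + Arrays.equals(low, utf8));
    }
}
```

If the two arrays differ, the stream is not delivering UTF-8. In this sketch the character U+1EC7 is truncated to the single byte 0xC7, which looks like the lead byte of a 2-byte UTF-8 sequence but is followed by a plain ASCII "t"; that would be consistent with the "Invalid byte 2 of 2-byte UTF-8 sequence" message.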

The Java application throws this exception in the parse call:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence.
         at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:691)
         at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:372)
         at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1743)
         at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1413)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2823)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
         at cc.jmitty.PowerWorker.doProceedPDFRequest(PowerWorker.java:268)
         at cc.jmitty.PowerWorker.doSendPDF(PowerWorker.java:187)
         at cc.jmitty.PowerWorker.run(PowerWorker.java:93)


Do you have any idea how I can check my string, or is there another solution?
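
One alternative that might be worth trying is to hand the parser characters instead of bytes, so that no byte decoding takes place at all. The following is a minimal sketch under that assumption; the class name ParseFromString and its parse helper are hypothetical, and it only covers markup that is well-formed XML, which arbitrary HTML may not be. It wraps the string in a java.io.StringReader and passes it to DocumentBuilder.parse through an org.xml.sax.InputSource.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseFromString {
    // Parse well-formed markup held in a String via a character stream,
    // so the parser never has to guess or decode a byte encoding.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Same sample as in the post, written with \u escapes.
        Document doc = parse(
                "<html><body>Eng \u0420\u0443\u0441 Vi\u1EC7t Nam</body></html>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If an InputStream is required, new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)) should deliver genuine UTF-8 bytes in place of StringBufferInputStream.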

Dmitry