Posted to j-dev@xerces.apache.org by "Jan Berkel (JIRA)" <xe...@xml.apache.org> on 2015/11/19 03:14:12 UTC

[jira] [Created] (XERCESJ-1668) Off-by-one bug w/ surrogates in UTF8Reader

Jan Berkel created XERCESJ-1668:
-----------------------------------

             Summary: Off-by-one bug w/ surrogates in UTF8Reader
                 Key: XERCESJ-1668
                 URL: https://issues.apache.org/jira/browse/XERCESJ-1668
             Project: Xerces2-J
          Issue Type: Bug
          Components: Other
            Reporter: Jan Berkel


There's a bug in the surrogate handling of UTF8Reader when the reader buffer is exhausted and only the high surrogate can be written. On the next read the low surrogate gets written, but the buffer space calculation is off by one.

This gets triggered when parsing the current [enwiktionary dump file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].

{noformat}
org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.
{noformat}

In the attached patch I added a test case for this bug. A related issue: when the low surrogate is written as the last part of the stream, -1 is returned instead of 1.
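
For illustration, here is a minimal sketch of the two behaviours described above. It is not the Xerces UTF8Reader code and not the attached patch: the class SurrogateCarrySketch, its pendingLow field and the readAll helper are hypothetical names, and the decoder only handles ASCII and 4-byte UTF-8 sequences. It shows a low surrogate being carried over to the next read when only the high surrogate fits, and a read that writes only that carried-over low surrogate returning 1 rather than -1.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: handles only ASCII and 4-byte UTF-8 sequences. */
final class SurrogateCarrySketch {
    private final byte[] in;
    private int pos = 0;
    // Low surrogate that did not fit into the previous read, or -1 if none.
    private int pendingLow = -1;

    SurrogateCarrySketch(byte[] utf8) {
        this.in = utf8;
    }

    /** Fills ch[offset..offset+length) and returns the char count, or -1 at end of input. */
    int read(char[] ch, int offset, int length) {
        if (length <= 0) {
            return 0;
        }
        int out = offset;
        final int end = offset + length;
        // Emit the carried-over low surrogate first. It consumes one slot,
        // so the space left for new input is (end - out), not length.
        if (pendingLow != -1) {
            ch[out++] = (char) pendingLow;
            pendingLow = -1;
        }
        while (out < end && pos < in.length) {
            int b0 = in[pos] & 0xFF;
            if (b0 < 0x80) {                      // ASCII
                ch[out++] = (char) b0;
                pos++;
                continue;
            }
            if ((b0 & 0xF8) != 0xF0 || pos + 3 >= in.length) {
                throw new IllegalArgumentException("sketch supports only ASCII and complete 4-byte sequences");
            }
            int cp = ((b0 & 0x07) << 18)
                   | ((in[pos + 1] & 0x3F) << 12)
                   | ((in[pos + 2] & 0x3F) << 6)
                   |  (in[pos + 3] & 0x3F);
            pos += 4;
            ch[out++] = Character.highSurrogate(cp);
            if (out < end) {
                ch[out++] = Character.lowSurrogate(cp);
            } else {
                pendingLow = Character.lowSurrogate(cp);  // buffer full: carry to next read
            }
        }
        int count = out - offset;
        // Report end of input only if nothing at all was written in this call.
        return count == 0 ? -1 : count;
    }

    /** Decodes the whole input through a buffer of the given size. */
    static String readAll(byte[] utf8, int bufferSize) {
        SurrogateCarrySketch r = new SurrogateCarrySketch(utf8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[bufferSize];
        int n;
        while ((n = r.read(buf, 0, bufferSize)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, read through a 2-char buffer.
        byte[] data = "a\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        SurrogateCarrySketch r = new SurrogateCarrySketch(data);
        char[] buf = new char[2];
        System.out.println(r.read(buf, 0, 2)); // 'a' and the high surrogate
        System.out.println(r.read(buf, 0, 2)); // only the carried-over low surrogate
        System.out.println(r.read(buf, 0, 2)); // end of input
    }
}
```

With a 2-char buffer, the second call writes only the low surrogate and still reports one char; end of input is signalled on the following call.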



