Posted to j-dev@xerces.apache.org by "Timo Boehme (Jira)" <xe...@xml.apache.org> on 2021/01/07 18:51:00 UTC

[jira] [Commented] (XERCESJ-1668) Off-by-one bug w/ surrogates in UTF8Reader

    [ https://issues.apache.org/jira/browse/XERCESJ-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260733#comment-17260733 ] 

Timo Boehme commented on XERCESJ-1668:
--------------------------------------

I have the same problem (org.xml.sax.SAXParseException; lineNumber: 6414317; columnNumber: 1136; Invalid byte 2 of 4-byte UTF-8 sequence) with an update file from the Medline corpus of the NIH (see [https://pubmed.ncbi.nlm.nih.gov/help/#download-pubmed-data], file [https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed21n1052.xml.gz]). The file passes all UTF-8 tests and contains three 4-byte sequences, each starting with an 'F4' byte followed by three correct continuation bytes.

This is a severe problem because valid, simple XML files are rejected by the parser.
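
For anyone who wants to repeat the UTF-8 check, here is a minimal sketch of how such a test could look with the JDK's strict decoder. The class name Utf8F4Check, the method decodeStrict and the sample bytes F4 80 80 80 (U+100000) are made up for illustration; they are not taken from the Medline file or from Xerces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces.
class Utf8F4Check {
    /** Strictly decodes bytes as UTF-8; malformed input raises IllegalArgumentException. */
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative sample only: a 4-byte sequence with lead byte F4.
        byte[] sample = {(byte) 0xF4, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        String s = decodeStrict(sample);
        // A supplementary code point occupies two Java chars (a surrogate pair).
        System.out.println("chars=" + s.length()
                + " codePoint=U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```

A sequence that decodes cleanly here becomes a high/low surrogate pair in Java, which is the case the quoted report below is about.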

> Off-by-one bug w/ surrogates in UTF8Reader
> ------------------------------------------
>
>                 Key: XERCESJ-1668
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1668
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: Other
>            Reporter: Jan Berkel
>            Priority: Major
>         Attachments: surrogate.patch
>
>
> There's a bug in the surrogate handling when the reader buffer is exhausted and only the high-part can be written. On the next run the low-part gets added but the buffer space calculation is off by one.
> This gets triggered when parsing the current [enwiktionary dump file|http://dumps.wikimedia.org/enwiktionary/20151102/enwiktionary-20151102-pages-articles.xml.bz2].
> {noformat}
> org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.
> {noformat}
> In the attached patch I added a fix and a test case for this bug. Another related issue is that when the low part is written as the last part of the stream, -1 is returned instead of 1.
> Is UTF8Reader still necessary? It might be safer to just use a plain InputStreamReader.
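
The situation described in the report (the char buffer fills up between the high and the low surrogate) can be sketched against a plain InputStreamReader as follows. This only illustrates the JDK reader under a deliberately tiny buffer; it does not reproduce the UTF8Reader code path, and the class name SplitSurrogateRead, the method readWithBuffer and the sample string are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xerces.
class SplitSurrogateRead {
    /** Reads UTF-8 bytes through InputStreamReader using a char buffer of the given size. */
    static String readWithBuffer(byte[] utf8, int bufferSize) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[bufferSize];
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            // With bufferSize == 1 the two halves of a surrogate pair
            // necessarily arrive in separate read() calls.
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative input: 'a', U+10FFFF (UTF-8 bytes F4 8F BF BF), 'b'.
        String original = "a" + new String(Character.toChars(0x10FFFF)) + "b";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(original.equals(readWithBuffer(utf8, 1)));
    }
}
```

Running the same input through the reader under test with several buffer sizes and comparing against the original string is one way to check that a split pair survives the buffer boundary.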


