Posted to j-dev@xerces.apache.org by "Michael McCandless (Commented) (JIRA)" <xe...@xml.apache.org> on 2012/03/29 20:28:23 UTC

[jira] [Commented] (XERCESJ-1257) buffer overflow in UTF8Reader for characters out of BMP

    [ https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241487#comment-13241487 ] 

Michael McCandless commented on XERCESJ-1257:
---------------------------------------------

Note that this (more recent) Wikipedia export also hits this bug: enwiki-20110115-pages-articles.xml.bz2

We are still struggling with this nasty Xerces UTF-8 bug in Lucene, this time because we (the Lucene committers) want/need to stop shipping the custom Xerces Java JAR (compiled with the patch on this issue) in our source releases.

At first, we explored ant automation, to pull the Xerces Java 2.9.1 source release, apply the patch here, and build the custom JAR... that seems to work but:

In LUCENE-3937 we found a new approach: we can instead work around this bug by letting the JVM, not Xerces, (correctly) decode the UTF-8, which we do by passing a Reader instead of an InputStream to Xerces (I now see that this was already suggested by Michael as a workaround: doh!).

Then we can use the stock (but buggy) Xerces releases... no patches / custom Xerces JARs needed in Lucene.
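
For anyone who wants to try the Reader workaround, here is a minimal sketch of the idea. It is not the actual LUCENE-3937 change. The class and method names (ReaderWorkaround, parseText) are made up for illustration, and it goes through the standard JAXP SAX API, so which parser implementation actually runs depends on what is on the classpath. It also assumes the input really is UTF-8: when handed a character stream, the parser does not use the encoding named in the XML declaration.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class ReaderWorkaround {

    /**
     * Parses UTF-8 encoded XML and returns all character data seen.
     * The bytes are decoded by the JVM (InputStreamReader), so the
     * parser's own UTF-8 reader is never involved.
     */
    static String parseText(InputStream in) throws Exception {
        final StringBuilder text = new StringBuilder();

        // Decode in the JVM: assumption is that the stream is UTF-8.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);

        // An InputSource built from a Reader hands the parser chars, not bytes.
        InputSource source = new InputSource(reader);

        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so accumulate.
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }
}
```

Calling parseText on a stream that contains characters outside the BMP should hand them back as ordinary surrogate pairs, since the decoding happened before the parser saw the data.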

Still, it would be best if the Xerces committers could commit the current patch (if there are no problems with it) and finally resolve this longstanding issue.  Or maybe disable Xerces's custom UTF8 decoding (just use the JVM's)?
                
> buffer overflow in UTF8Reader for characters out of BMP
> -------------------------------------------------------
>
>                 Key: XERCESJ-1257
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1257
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: JAXP (javax.xml.parsers)
>    Affects Versions: 2.9.0
>         Environment: Any
>            Reporter: Robert Stojnic
>            Assignee: Michael Glavassevich
>            Priority: Minor
>         Attachments: TestXerces.java, UTF8Reader.patch
>
>
> There is an ArrayIndexOutOfBoundsException in org.apache.xerces.impl.io.UTF8Reader, in read(char[],int,int) for 4-byte UTF-8 chars.
> Imagine the following scenario. read() has an output buffer of size N; it reads N-1 ASCII chars and stores them in the output buffer. Let the Nth char be the first byte of a 4-byte UTF-8 char. The other 3 bytes are fetched by invoking read() on the input stream. From these a surrogate pair of Java chars is made; however, the method does not check whether both chars can fit into the output buffer ... In most cases, they would fit into the output buffer (e.g. if there are some other multi-byte chars in the fetched text), so the bug is very rare, but it still happens.
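
To make the scenario in the issue description concrete, here is a rough sketch of the kind of room check it says is missing. This is neither the Xerces UTF8Reader code nor the attached UTF8Reader.patch. The class and method names are hypothetical, and the input bytes are assumed to already be a well-formed 4-byte UTF-8 sequence. The point is only that one such sequence always needs two free char slots, not one.

```java
// Hypothetical illustration, not the Xerces implementation.
public class SurrogateRoomCheck {

    /**
     * Decodes one well-formed 4-byte UTF-8 sequence (b0..b3, each 0..255)
     * into out[] starting at pos.
     *
     * Returns the number of chars written (2), or 0 if fewer than two
     * slots are free, in which case nothing is written and the caller
     * would have to keep the sequence for the next read.
     */
    static int writeSupplementary(int b0, int b1, int b2, int b3, char[] out, int pos) {
        // Reassemble the 21-bit code point from the 3 + 6 + 6 + 6 payload bits.
        int codePoint = ((b0 & 0x07) << 18)
                      | ((b1 & 0x3F) << 12)
                      | ((b2 & 0x3F) << 6)
                      |  (b3 & 0x3F);

        // A supplementary code point becomes a surrogate pair: two chars.
        // Checking for a single free slot is the mistake described above.
        if (pos + 2 > out.length) {
            return 0;
        }

        // Writes the high and low surrogate and returns 2.
        return Character.toChars(codePoint, out, pos);
    }
}
```

With a 4-char buffer and pos = 3 (the "N-1 ASCII chars already stored" case), this sketch refuses to write instead of running past the end of the array.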
