Posted to j-dev@xerces.apache.org by "Ian Upright (JIRA)" <xe...@xml.apache.org> on 2016/01/21 18:24:39 UTC

[jira] [Commented] (XERCESJ-1257) buffer overflow in UTF8Reader for characters out of BMP

    [ https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110954#comment-15110954 ] 

Ian Upright commented on XERCESJ-1257:
--------------------------------------

For those using mwdumper to load Wikipedia or other dumps who run into this issue, the change below seemed to fix it. (It also serves as an example of how to work around the problem.) However, it would be good to have the real issue addressed. I would vote to modify Xerces to simply use the JVM to decode UTF-8, as Michael suggested.

        public void readDump() throws IOException {
                try {
                        SAXParserFactory factory = SAXParserFactory.newInstance();
                        SAXParser parser = factory.newSAXParser();
                        // Let the JVM decode UTF-8: with a character stream on the InputSource,
                        // the parser reads chars directly instead of decoding the bytes itself.
                        Reader reader = new InputStreamReader(input,"UTF-8");
                        InputSource is = new InputSource(reader);
                        is.setEncoding("UTF-8");
                        parser.parse(is, this);
                } catch (ParserConfigurationException e) {
                        // Rethrow as IOException, keeping the original exception as the cause.
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
                } catch (SAXException e) {
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
                }
                writer.close();
        }
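
For anyone who wants to try the same idea outside mwdumper, here is a minimal standalone sketch. The class name JvmUtf8ParseDemo, the parseText helper and the sample <page> document are made up for illustration and are not part of mwdumper or Xerces. It feeds the parser a Reader, so the JVM does the UTF-8 decoding, and collects the character data from a document that contains one character outside the BMP (U+1D11E).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class JvmUtf8ParseDemo {

    // Hypothetical helper: parses UTF-8 encoded XML and returns all character data.
    // The bytes are decoded by java.io.InputStreamReader, not by the XML parser.
    public static String parseText(InputStream input) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8);
        parser.parse(new InputSource(reader), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node, so append each chunk.
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D11E is written as the surrogate pair \uD834\uDD1E in Java source
        // and is encoded as four bytes in UTF-8.
        String xml = "<page>a\uD834\uDD1Eb</page>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String text = parseText(new ByteArrayInputStream(bytes));
        System.out.println("chars=" + text.length()
                + " codePoints=" + text.codePointCount(0, text.length()));
    }
}
```

This only shows the workaround on a tiny input. It does not try to reproduce the overflow itself, which depends on where the 4-byte character falls relative to the parser's internal buffer.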


> buffer overflow in UTF8Reader for characters out of BMP
> -------------------------------------------------------
>
>                 Key: XERCESJ-1257
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1257
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: JAXP (javax.xml.parsers)
>    Affects Versions: 2.9.0
>         Environment: Any
>            Reporter: Robert Stojnic
>            Assignee: Michael Glavassevich
>            Priority: Minor
>         Attachments: TestXerces.java, UTF8Reader.patch, XERCESJ-1257_tests.patch
>
>
> There is an ArrayIndexOutOfBoundsException in org.apache.xerces.impl.io.UTF8Reader, in read(char[],int,int), for 4-byte UTF-8 chars.
> Imagine the following scenario. read() has a buffer of size N, and it reads N-1 ASCII chars and stores them in the output buffer. Let the Nth char be the first byte of a 4-byte UTF-8 char. The other 3 bytes are fetched by invoking read() on the input stream. From these a surrogate pair of Java chars is made; however, the method does not check whether both chars can fit into the output buffer ... In most cases they would fit into the output buffer (e.g. if there are some other multi-byte chars in the fetched text), so the bug is very rare, but it still happens.
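
The scenario above comes down to one missing bounds check. The sketch below is illustrative only and is not the Xerces code or the attached patch. The class SurrogateRoomCheck and its methods charsNeeded and fits are hypothetical names. It shows that a 4-byte UTF-8 sequence needs two Java chars, so one free slot at the end of the output buffer is not enough.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateRoomCheck {

    // Number of Java chars (UTF-16 code units) produced by the UTF-8 sequence
    // that starts with the given lead byte. Lead bytes of the form 11110xxx
    // start a 4-byte sequence, which decodes to a surrogate pair.
    public static int charsNeeded(int leadByte) {
        return (leadByte & 0xF8) == 0xF0 ? 2 : 1;
    }

    // True if the decoded chars fit into the output buffer, where pos is the
    // next free index and end is the exclusive upper bound of the buffer.
    public static boolean fits(int leadByte, int pos, int end) {
        return pos + charsNeeded(leadByte) <= end;
    }

    public static void main(String[] args) {
        // U+1D11E lies outside the BMP: four bytes in UTF-8, two chars in Java.
        byte[] utf8 = "\uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
        int lead = utf8[0] & 0xFF;

        int n = 10;       // output buffer size N from the scenario
        int pos = n - 1;  // N-1 ASCII chars are already stored
        System.out.println("utf8 bytes=" + utf8.length
                + " charsNeeded=" + charsNeeded(lead)
                + " fits=" + fits(lead, pos, n));
    }
}
```

A reader that performs a check like fits() before writing could stop and return the chars decoded so far, deferring the 4-byte character to the next read() call. Whether the attached UTF8Reader.patch takes that approach is not stated here.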



