Posted to issues@commons.apache.org by "Benedikt Ritter (JIRA)" <ji...@apache.org> on 2014/01/21 17:23:23 UTC

[jira] [Commented] (LANG-955) StringEscapeUtils.escapeXml doesn't remove invalid characters

    [ https://issues.apache.org/jira/browse/LANG-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877578#comment-13877578 ] 

Benedikt Ritter commented on LANG-955:
--------------------------------------

Patches welcome.

> StringEscapeUtils.escapeXml doesn't remove invalid characters
> -------------------------------------------------------------
>
>                 Key: LANG-955
>                 URL: https://issues.apache.org/jira/browse/LANG-955
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.1
>         Environment: Ubuntu 13.10
>            Reporter: Adam Hooper
>              Labels: xml
>             Fix For: Patch Needed
>
>
> escapeXml lets non-text characters pass through into XML files:
> {code}
> scala> org.apache.commons.lang3.StringEscapeUtils.escapeXml("\u0004").codePointAt(0)
> res4: Int = 4
> {code}
> I would expect the result to be an exception -- either from StringEscapeUtils (refusing to encode it) or, preferably, from String.codePointAt, complaining that the string is empty. \u0004 is not a valid character in XML 1.0, and there is no way to represent it in an XML document -- not even by escaping it.
> Wikipedia summarizes the characters that are not allowed in XML -- even after escaping: http://en.wikipedia.org/wiki/Valid_characters_in_XML. The reason for disallowing them: XML is a text interchange format, and control characters are not text.
> If StringEscapeUtils.escapeXml lets invalid XML characters through -- whether escaped or not -- its output is not well-formed XML, and conforming XML parsers will refuse to read such files.
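
For illustration only, here is a minimal sketch of the kind of filter the report asks for. It is not a patch against Commons Lang. It drops every code point outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names (XmlCharFilter, isValidXml10, stripInvalidXml10) are hypothetical, and silently dropping characters instead of throwing is an assumption; the report would equally accept an exception.

```java
// Hypothetical helper, not part of Commons Lang.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every code point that XML 1.0 forbids removed. */
    static String stripInvalidXml10(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so supplementary characters stay intact;
            // an unpaired surrogate is returned as-is and rejected below.
            int cp = input.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Run before escaping, such a filter turns the "\u0004" input from the report into an empty string, so the codePointAt(0) call would throw, which is the outcome the reporter prefers.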


