Posted to c-dev@xerces.apache.org by "Andreas Krantz (JIRA)" <xe...@xml.apache.org> on 2018/01/11 11:20:00 UTC

[jira] [Commented] (XERCESC-1854) Serialization does not detect invalid XML characters

    [ https://issues.apache.org/jira/browse/XERCESC-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322064#comment-16322064 ] 

Andreas Krantz commented on XERCESC-1854:
-----------------------------------------

The method {{DOMLSSerializerImpl::ensureValidString}} was introduced as a fix in 3.2.0, but it rests on a wrong assumption.
The new implementation treats the surrogate range xD800 - xDFFF as invalid for XMLCh, which is UTF-16.
The problem is that the valid Unicode range x10000 - x10FFFF is encoded using exactly those surrogate values.

x10FFFF becomes xDBFF,xDFFF
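
For reference, the mapping above follows from the standard UTF-16 arithmetic. The following is only a minimal sketch; the function name {{toSurrogatePair}} is made up for illustration and is not part of Xerces-C++:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair: {leading (high) surrogate, trailing (low) surrogate}.
// Hypothetical helper for illustration only, not a Xerces-C++ function.
std::pair<char16_t, char16_t> toSurrogatePair(std::uint32_t codePoint)
{
    assert(codePoint >= 0x10000 && codePoint <= 0x10FFFF);

    // Subtracting 0x10000 leaves a 20-bit value.
    const std::uint32_t offset = codePoint - 0x10000;

    // The upper 10 bits go into the leading surrogate (xD800..xDBFF),
    // the lower 10 bits into the trailing surrogate (xDC00..xDFFF).
    const char16_t lead  = static_cast<char16_t>(0xD800 + (offset >> 10));
    const char16_t trail = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    return {lead, trail};
}
```

With this sketch, {{toSurrogatePair(0x10FFFF)}} yields xDBFF,xDFFF and {{toSurrogatePair(0x10000)}} yields xD800,xDC00, so every supplementary character lands inside the range that is now rejected.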

The surrogates are handled correctly by the reader code, but it is no longer possible to serialize a DOM that was read this way.
e.g. const std::u16string xmlString{ u"<?xml version=\"1.0\" encoding=\"UTF-16\" standalone=\"yes\" ?><root>\U0010FFFF</root>" };

This potentially breaks our format if we upgrade to 3.2.0.

I am not sure whether it is possible to reopen this issue for a fix in 3.2.1?

A closer look at
{{inline bool XMLChar1_0::isXMLChar(const XMLCh toCheck, const XMLCh toCheck2)}}

shows that it takes two parameters in order to handle surrogates. However, {{ensureValidString}} must detect the leading surrogate and act on it.
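
To illustrate the kind of check I mean, here is a rough, self-contained sketch of a surrogate-aware validation loop. It is not the Xerces implementation and does not use the Xerces API: it works on {{std::u16string}} instead of XMLCh, and all helper names ({{isXmlCharUnit}}, {{isLeadSurrogate}}, {{isTrailSurrogate}}, {{isValidXml10String}}) are assumptions made up for this example. The character ranges are taken from the XML 1.0 Char production.

```cpp
#include <cstddef>
#include <string>

// XML 1.0 Char production restricted to a single UTF-16 code unit:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
// (hypothetical helper, not XMLChar1_0::isXMLChar)
bool isXmlCharUnit(char16_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD);
}

bool isLeadSurrogate(char16_t c)  { return c >= 0xD800 && c <= 0xDBFF; }
bool isTrailSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Returns true if the string contains only valid XML 1.0 characters.
// A leading surrogate is accepted only when it is immediately followed
// by a trailing surrogate; the pair then encodes a code point in
// x10000-x10FFFF, which is valid. Lone surrogates are rejected.
bool isValidXml10String(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isLeadSurrogate(s[i])) {
            if (i + 1 >= s.size() || !isTrailSurrogate(s[i + 1]))
                return false;   // unpaired leading surrogate
            ++i;                // skip the trailing half of the pair
            continue;
        }
        // Covers control characters such as x04, xFFFE/xFFFF and
        // trailing surrogates that appear without a leading one.
        if (!isXmlCharUnit(s[i]))
            return false;
    }
    return true;
}
```

In this sketch the string u"<root>\U0010FFFF</root>" is accepted, while a string containing x04 or a lone xDFFF is still rejected, which is the behavior the original bug report asked for.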

> Serialization does not detect invalid XML characters
> ----------------------------------------------------
>
>                 Key: XERCESC-1854
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1854
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: DOM
>    Affects Versions: 3.0.1
>            Reporter: Boris Kolpackov
>            Assignee: Alberto Massari
>             Fix For: 3.2.0
>
>         Attachments: test.cxx
>
>
> The attached test case serializes an invalid XML 1.0 document that contains a character with value 0x04. See http://www.w3.org/TR/REC-xml/#NT-Char for the list of valid characters in an XML 1.0 document.
> I've done some digging and it seems that XMLFormatter should check for this. In fact, there is already code for XML 1.1 that checks for these control characters since they need to be escaped in 1.1. It looks like we need to check for invalid characters when in the 1.0 mode. There is the XMLChar1_0::isXMLChar() function which can presumably be used.


