Posted to j-users@xerces.apache.org by Tom Sugden <to...@epcc.ed.ac.uk> on 2003/10/01 16:57:57 UTC

Invalid XML characters?

Hello,

I was wondering whether anyone could clarify something for me. I've noticed
some behaviour with the Xerces SAX parser (version 2.4.0 according to the jar
manifest) that may constitute a bug. When attempting to parse XML character
data that contains an unusual character (U+000C, form feed) wrapped in a
CDATA section, the parser throws an org.xml.sax.SAXParseException.

The XML specification seems to indicate that valid character data is any
Unicode character, excluding the surrogate blocks, FFFE, and FFFF. Since 0xC
is neither within the surrogate blocks nor equivalent to 0xFFFE or 0xFFFF, I
was surprised by this exception. I wrote a small test program to try parsing
a series of documents containing each possible Unicode character within a
CDATA section, excluding the surrogate blocks and FFFE and FFFF. This seemed
to identify a further 151 characters that would cause either an
org.xml.sax.SAXParseException or a java.io.UTFDataFormatException to be
raised.

Is this the desired behaviour? And if so, can anyone recommend a technique
for transforming data retrieved from a relational database table (that may
contain these unusual characters) in such a way that it can safely be
encoded into an XML document without raising an exception?

Thanks,
Tom Sugden


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: Invalid XML characters?

Posted by Michael Glavassevich <mr...@apache.org>.
Hi Tom,

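Only Unicode characters that match the Char production [1] are allowed to
appear in XML 1.0 documents, and wrapping them in a CDATA section does not
change that. Char covers #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
#x10000-#x10FFFF, so U+000C (form feed) is excluded along with the other
C0 control characters. If you want to include something that isn't text in
your document, you need to encode it. One such encoding is Base64.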

[1] http://www.w3.org/TR/REC-xml#charsets
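
As a rough illustration of both points, here is a minimal Java sketch. The
class name XmlText and its method names are made up for this example and are
not part of Xerces. It also assumes Java 8 or later for java.util.Base64,
which is newer than the Xerces 2.4.0 setup in the question. The idea is to
test each code point against the Char production and fall back to Base64
only for values that fail the test.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper, not part of Xerces or any XML library. */
public final class XmlText {
    private XmlText() {}

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if every code point in s may appear in an XML 1.0 document. */
    public static boolean isXmlSafe(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);   // an unpaired surrogate comes back as-is and fails the check
            if (!isXmlChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** Encodes the UTF-8 bytes of s as Base64 (assumes s is well-formed UTF-16). */
    public static String toBase64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses toBase64. */
    public static String fromBase64(String b64) {
        return new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
    }
}
```

A writer could call XmlText.isXmlSafe(value) on each database column value,
emit the value as ordinary (escaped) text when it passes, and otherwise emit
XmlText.toBase64(value) together with some marker, such as an attribute of
your own choosing, so that the reader knows to decode it.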


-- 
--------------------
Michael Glavassevich
mrglavas@apache.org
