Posted to j-users@xerces.apache.org by Jakub Kahovec <j....@imperial.ac.uk> on 2005/02/28 21:08:55 UTC
utf-8 characters problem
Hi,
when I parse an XML document (with Xerces 2.6.2) that specifies UTF-8
encoding in its XML declaration and contains characters written in
character reference form (&#xxxx;),
the parser replaces these references with ASCII characters. For some
characters this is OK, but InvisibleTimes, for instance, is changed into
an incorrect, strange character sequence.
I'd like to know whether it is possible to stop the parser from expanding
character references. Or is there a recommended way to handle
these characters?
Here is a piece of my 'problematic' XML document:
<?xml version="1.0" encoding="UTF-8"?>
<mathDoc>
<p>Factorise the following quadratic expression:
<math>
<mrow>
<msup>
<mrow>
<mi>x</mi>
</mrow>
<mrow>
<mn>2</mn>
</mrow>
</msup>
<mo>&#x002B;</mo> <!-- gets replaced with the character + -->
<mi>p</mi>
<mo>&#x2062;</mo> <!-- here is InvisibleTimes -->
<mi>x</mi>
<mo>&#x002B;</mo> <!-- gets replaced with the character + -->
<mi>q</mi>
</mrow>
</math>
</mathDoc>
Thanks so much
Jakub
---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org
Re: utf-8 characters problem
Posted by Bob Foster <bo...@objfac.com>.
Exactly what Xerces or standard API is producing this result? Are you
sure you're not looking at the result in some editor (that is using the
wrong code page to represent your characters)?
XML parsers deliver characters in Unicode. You are apparently trying to
use the characters as though each character had eight bits.
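One way to check what the parser really delivers is to print code points instead of the characters themselves, so the console's code page cannot distort the result. The following is only a sketch, assuming the standard JAXP DOM API (backed by Xerces or the JDK's built-in parser); the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces: dumps parsed text as code points.
class CharRefDump {

    /** Parses xml and returns the root element's text as space-separated U+XXXX tokens. */
    static String codePointsOfRootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            String text = doc.getDocumentElement().getTextContent();

            StringBuilder out = new StringBuilder();
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(String.format("U+%04X", cp));
                i += Character.charCount(cp);
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // The reference is expanded to a single Unicode character, not to bytes.
        System.out.println(codePointsOfRootText("<mo>&#x2062;</mo>")); // prints U+2062
        System.out.println(codePointsOfRootText("<mo>&#x002B;</mo>")); // prints U+002B
    }
}
```

If the dump shows U+2062 but the character looks wrong on screen or in a file, the damage is happening after parsing, when the string is written out in some other encoding.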
Tell us a little more about what steps you took to see what you describe
and maybe someone will be able to help.
Bob Foster
RE: utf-8 characters problem
Posted by "F. Andy Seidl" <fa...@myst-technology.com>.
Jakub,
When a character is expressed as a numeric character reference, the parser is
not allowed to change the numeric value of the character. So, when using
numeric character references, it is important to use the Unicode code point
values. Since ASCII values are also Unicode, it is always safe to do something
like &#32;. But for non-ASCII characters, you need to be more careful. Some,
like the circled-C (C) copyright symbol, are hex A9 in both the Windows
character set *and* in Unicode. So, &#xA9; often works *by accident* in XML
documents, whereas the trademark (TM) character is 153 (hex 99) in
Windows-1252 but U+2122 in Unicode, so it is not the same in Windows and
Unicode and is often found to be the source of problems in
XML documents originating on Windows.
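To make the difference concrete, here is a small sketch that compares what a byte means in Windows-1252 with what the same number means as an XML character reference. It assumes the JDK ships the "windows-1252" charset (standard JDKs do) and uses the JAXP DOM API; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class Cp1252VsUnicode {

    /** Decodes a single byte as Windows-1252 and returns the Unicode code point. */
    static int cp1252ToCodePoint(int byteValue) {
        byte[] one = {(byte) byteValue};
        return new String(one, Charset.forName("windows-1252")).codePointAt(0);
    }

    /** Returns the code point an XML parser reports for the decimal reference &#n;. */
    static int charRefToCodePoint(int n) {
        try {
            String xml = "<t>&#" + n + ";</t>";
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent().codePointAt(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // 0xA9 is the copyright sign in both, so &#169; happens to work.
        System.out.printf("cp1252 0xA9 -> U+%04X, &#169; -> U+%04X%n",
                cp1252ToCodePoint(0xA9), charRefToCodePoint(169));
        // 0x99 is the trademark sign in Windows-1252, but &#153; is U+0099,
        // a control character. The trademark sign needs &#8482; (U+2122).
        System.out.printf("cp1252 0x99 -> U+%04X, &#153; -> U+%04X, &#8482; -> U+%04X%n",
                cp1252ToCodePoint(0x99), charRefToCodePoint(153), charRefToCodePoint(8482));
    }
}
```

The parser never consults the document's encoding when it expands a numeric reference; the number is always taken as a Unicode code point.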
The best thing is to avoid using numeric character references and just encode
the character as a UTF-8 byte sequence (or the appropriate byte
sequence for the charset in effect). That way, XML parsers and serializers
are free to translate the character as appropriate for the charset in
effect.
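As a rough illustration of that last point, the sketch below builds a DOM containing InvisibleTimes (U+2062) as a plain character and serializes it twice with the standard JAXP Transformer, once as UTF-8 and once as US-ASCII. The class and method names are invented for the example, and the exact escaped form the serializer picks for US-ASCII is implementation-dependent, so the code only checks that the bytes are pure ASCII and that the text survives a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, for illustration only.
class EncodingRoundTrip {

    /** Serializes <mo>text</mo> to bytes in the requested output encoding. */
    static byte[] serialize(String text, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element mo = doc.createElement("mo");
            mo.setTextContent(text);
            doc.appendChild(mo);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    /** Parses serialized bytes and returns the root element's text. */
    static String parseText(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String invisibleTimes = "\u2062";
        for (String enc : new String[] {"UTF-8", "US-ASCII"}) {
            byte[] bytes = serialize(invisibleTimes, enc);
            // Both encodings should give back the same single character.
            System.out.println(enc + ": " + bytes.length + " bytes, round trip ok = "
                    + invisibleTimes.equals(parseText(bytes)));
        }
    }
}
```

The in-memory string is the same in both cases; only the serializer decides whether the character goes out as raw bytes or as a character reference, based on the output encoding.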
-- fas
F. Andy Seidl, Co-founder
MyST Technology Partners
http://myst-technology.com | http://blogsite.com