Posted to j-users@xerces.apache.org by Jakub Kahovec <j....@imperial.ac.uk> on 2005/02/28 21:08:55 UTC

utf-8 characters problem

Hi,
when I parse an XML document (with Xerces 2.6.2) whose XML declaration 
specifies UTF-8 encoding and which contains characters written in 
character reference form (&#xxxx;), the parser replaces these references 
with ASCII characters. For some characters this is fine, but 
InvisibleTimes, for instance, is changed into an incorrect, strange 
character sequence.
I'd like to know whether it is possible to stop the parser from replacing 
characters given in character reference form. Or is there a recommended 
way to handle these characters?

Here is a piece of my 'problematic' xml document

<?xml version="1.0" encoding="UTF-8"?>
<mathDoc>

<p>Factorise the following quadratic expression:
        <math>
          <mrow>
            <msup>
              <mrow>
                <mi>x</mi>
              </mrow>
              <mrow>
                <mn>2</mn>
              </mrow>
            </msup>
            <mo>&#x002b;</mo> <!-- replaced with the character + -->
            <mi>p</mi>
            <mo>&#x2062;</mo> <!-- here is InvisibleTimes -->
            <mi>x</mi>
            <mo>&#x002b;</mo> <!-- replaced with the character + -->
            <mi>q</mi>
          </mrow>
        </math>
</p>

</mathDoc>

Thanks so much

Jakub

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org

Re: utf-8 characters problem

Posted by Bob Foster <bo...@objfac.com>.

Exactly what Xerces or standard API is producing this result? Are you 
sure you're not looking at the result in some editor (that is using the 
wrong code page to represent your characters)?

XML parsers deliver characters in Unicode. You are apparently trying to 
use the characters as though each character had eight bits.
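
One way to check what the parser actually delivers, independent of any editor or console code page, is to print the code points of the parsed text. The following is only a minimal sketch: it assumes a recent JDK and its bundled JAXP parser rather than Xerces 2.6.2 specifically, and the class and method names (CodePointCheck, codePointsOf) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Xerces or the original post.
class CodePointCheck {
    /** Parses a small document and returns the root's text content as "U+XXXX" tokens. */
    static String codePointsOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            String text = doc.getDocumentElement().getTextContent();
            StringBuilder out = new StringBuilder();
            // Iterate over code points, not chars, so supplementary characters stay whole.
            text.codePoints().forEach(cp -> {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(String.format("U+%04X", cp));
            });
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(codePointsOf("<mo>&#x2062;</mo>")); // prints U+2062
        System.out.println(codePointsOf("<mo>&#x002b;</mo>")); // prints U+002B
    }
}
```

If this prints U+2062 for the InvisibleTimes operator, the parser has done its job, and the strange characters most likely come from how the text is written out or displayed afterwards.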

Tell us a little more about what steps you took to see what you describe 
and maybe someone will be able to help.

Bob Foster





RE: utf-8 characters problem

Posted by "F. Andy Seidl" <fa...@myst-technology.com>.

Jakub,
When a character is expressed as a numeric character reference, the parser is
not allowed to change the numeric value of the character.  So, when using
numeric references, it is important to use the Unicode character values.
Since ASCII values are also Unicode, it is always safe to do something like
&#x20;.  But for non-ASCII characters, you need to be more careful.  Some,
like the circled-R (R) registered symbol, are hex AE in both the Windows
character set *and* in Unicode.  So &#xAE; often works *by accident* in XML
documents, whereas the trademark (TM) character (153, hex 99, in
Windows-1252) is not the same in Windows and Unicode (where it is U+2122)
and is often found to be the source of problems in XML documents
originating on Windows.
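
The difference between a Windows byte value and a Unicode code point can be checked with a few lines of Java. This is just a sketch: it assumes the JDK in use ships the "windows-1252" charset (standard builds do), and the class and method names are invented for the example. In windows-1252 the registered sign is byte 0xAE and the trademark sign is byte 0x99.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
class Cp1252VsUnicode {
    /** Decodes one byte as windows-1252 and returns the resulting Unicode code point. */
    static int decodeCp1252(int byteValue) {
        byte[] single = { (byte) byteValue };
        String decoded = new String(single, Charset.forName("windows-1252"));
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // Registered sign: byte value and code point happen to agree.
        System.out.printf("0xAE -> U+%04X%n", decodeCp1252(0xAE)); // prints 0xAE -> U+00AE
        // Trademark sign: the byte is 0x99, but the code point is U+2122.
        System.out.printf("0x99 -> U+%04X%n", decodeCp1252(0x99)); // prints 0x99 -> U+2122
    }
}
```

So a reference like &#x99; written from the Windows byte value does not name the trademark sign; in Unicode, U+0099 is a C1 control character, and the trademark sign has to be written as &#x2122;.
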

The best thing is to avoid numeric character references and just encode
the character as a UTF-8 byte sequence (or the appropriate byte sequence
for the charset in effect).  That way, XML parsers and serializers are
free to translate the character as appropriate for the charset in effect.
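
As a rough illustration of a serializer translating characters for the charset in effect, the sketch below re-serializes a fragment with a JAXP identity transform and a chosen output encoding. It assumes the JDK's built-in TransformerFactory, not any particular Xerces serializer, and the names AsciiSerialization and reserialize are hypothetical. With an output encoding such as US-ASCII, a character that the encoding cannot represent should come out as a character reference again.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; assumes the JDK's default JAXP transformer.
class AsciiSerialization {
    /** Runs an identity transform and returns the output produced in the given encoding. */
    static String reserialize(String xml, String encoding) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(bytes));
            // Decode with the same charset that the serializer was asked to write.
            return new String(bytes.toByteArray(), Charset.forName(encoding));
        } catch (Exception e) {
            throw new IllegalStateException("could not re-serialize sample", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mo>&#x2062;</mo>";
        // US-ASCII cannot represent U+2062, so it is written as a character reference.
        System.out.println(reserialize(xml, "US-ASCII"));
        // UTF-8 can represent it, so the (invisible) character is written directly.
        System.out.println(reserialize(xml, "UTF-8"));
    }
}
```

Whether the reference is written in decimal or hexadecimal form is up to the serializer; both forms denote the same character.
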
  -- fas
 F. Andy Seidl, Co-founder
MyST Technology Partners
http://myst-technology.com | http://blogsite.com
 
 

