Posted to j-users@xerces.apache.org by "Kahovec, Jakub" <j....@imperial.ac.uk> on 2005/03/01 00:21:03 UTC

RE: utf-8 characters problem

It is produced by Xerces 2.6.2 (LSParser, LSSerializer and XMLSerializer).
I've been using the Xerces parser and serializer in my Java authoring
tool to load and save documents. I noticed the encoding problem when I
loaded an XML document containing character references and displayed it
in the JEditorPane component. Instead of &#x002b; and &#x2062; I saw '+'
and a square-like character. I tried serializing the XML document to the
console as well as to a file, and loading the document via an InputStream
or a Reader with LSInput, but I never got a result in which the character
references kept their original form.
Only when I explicitly set the encoding in LSInput to ISO-8859-1 and
loaded the document via an InputStream did the sequence &#x2062; stay in
the same form, but the sequence &#x002b; was still changed to the '+'
character.
Then I inspected the structure of the DOM document in the debugger
(Eclipse 3.1) but saw the same results (a '+' character and a square
character; probably that is only a problem with displaying UTF-8
characters in Eclipse).
So to be honest I don't know how to find out where the problem is:
whether it occurs during parsing, serializing or displaying the data.
I'm not very experienced with encodings and charsets, but as far as I
know Java handles characters internally as UTF-16. Could that be part
of the problem? I don't really know.
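
On the UTF-16 question, a minimal Java sketch may help separate the
in-memory form from the bytes on disk. It assumes a current JDK
(StandardCharsets is newer than the Java 1.4 mentioned later in this
thread), and the class and method names are invented for illustration.
It shows that InvisibleTimes (U+2062) is a single char inside a String
and only becomes a multi-byte sequence once it is encoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class InvisibleTimesBytes {

    // Formats the bytes of s in the given charset as space-separated hex.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String invisibleTimes = "\u2062"; // what a parser delivers for &#x2062;
        System.out.println(invisibleTimes.length());                     // 1
        System.out.println(hex(invisibleTimes, StandardCharsets.UTF_8)); // E2 81 A2
        System.out.println(hex("+", StandardCharsets.UTF_8));            // 2B
    }
}
```

If this holds, Java's internal UTF-16 is unlikely to be the cause by
itself; trouble would start only where the String is turned into bytes,
or drawn with a font that lacks the glyph.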

Thanks for any ideas.

Jakub


-----Original Message-----
From: Bob Foster [mailto:bob@objfac.com]
Sent: Mon 2/28/2005 10:36 PM
To: xerces-j-user@xml.apache.org
Subject: Re: utf-8 characters problem
 
Exactly what Xerces or standard API is producing this result? Are you 
sure you're not looking at the result in some editor (that is using the 
wrong code page to represent your characters)?

XML parsers deliver characters in Unicode. You are apparently trying to 
use the characters as though each character had eight bits.

Tell us a little more about what steps you took to see what you describe 
and maybe someone will be able to help.

Bob Foster

Jakub Kahovec wrote:
> Hi,
> when I parse an XML document (with Xerces 2.6.2) whose XML
> declaration specifies UTF-8 encoding and which contains characters
> in character reference form (&#xxxx;),
> the parser replaces these references with ASCII characters. For some
> characters that is OK, but InvisibleTimes, for instance, turns into a
> strange, incorrect character sequence.
> I'd like to know whether it is possible to prevent the parser from
> expanding character references. Or is there some recommendation on how
> to handle these characters?
> 
> Here is a piece of my 'problematic' XML document:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <mathDoc>
> 
> <p>Factorise the following quadratic expression:
>   <math>
>     <mrow>
>       <msup>
>         <mrow>
>           <mi>x</mi>
>         </mrow>
>         <mrow>
>           <mn>2</mn>
>         </mrow>
>       </msup>
>       <mo>&#x002b;</mo> <!-- replaced with the character + -->
>       <mi>p</mi>
>       <mo>&#x2062;</mo> <!-- this is InvisibleTimes -->
>       <mi>x</mi>
>       <mo>&#x002b;</mo> <!-- replaced with the character + -->
>       <mi>q</mi>
>     </mrow>
>   </math>
> </p>
> 
> </mathDoc>
> 
> Thanks so much
> 
> Jakub



---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org



Re: utf-8 characters problem

Posted by Bob Foster <bo...@objfac.com>.
If you read the file in UTF-8, parse it, serialize it without adding any 
whitespace and write the result back out in UTF-8, the only difference 
between the two documents (in your example) will be that character 
references are expanded.
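
The expansion of character references can be checked directly. The
following is a sketch using the JAXP parser bundled with the JDK (an
assumption; the thread uses Xerces 2.6.2 through the LS API), with
hypothetical class and method names. It parses the same element once
with a character reference and once with the literal character, then
compares the resulting text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class CharRefExpansion {

    // Parses a UTF-8 encoded XML string and returns the root element's text.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String viaReference = rootText("<mo>&#x2062;</mo>");
        String viaLiteral = rootText("<mo>\u2062</mo>");
        // Both spellings produce the same single character in the DOM.
        System.out.println(viaReference.equals(viaLiteral));
        System.out.println(rootText("<mo>&#x002b;</mo>"));
    }
}
```

Both forms yield the same DOM text, so the DOM carries no record of
which characters were originally written as references.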

The trouble arises when you don't specify the encoding on the way out. 
Then Java will use whatever is set as the platform encoding, e.g., win1250.

What normal text editors do with a UTF-8 file is really outside the 
scope here. You have to use a competent editor.
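
As a sketch of stating the encoding on the way out, the example below
uses the DOM Level 3 LSSerializer from the JDK's built-in implementation
(assumed here in place of Xerces 2.6.2; the class and method names are
hypothetical). It writes to a byte stream with the encoding set on the
LSOutput. As an aside that goes beyond what is said above: choosing
US-ASCII as the output encoding typically makes the serializer write
every non-ASCII character as a numeric character reference, which any
plain editor can show. Whether the reference comes out in decimal or
hexadecimal depends on the implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class ExplicitEncodingDemo {

    // Parses XML bytes and serializes the DOM back to bytes in the named encoding.
    static byte[] roundTrip(byte[] xml, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            DOMImplementationLS ls =
                    (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
            LSSerializer serializer = ls.createLSSerializer();
            LSOutput out = ls.createLSOutput();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            out.setByteStream(bytes);  // byte stream: the serializer does the encoding
            out.setEncoding(encoding); // stated explicitly, no platform default involved
            if (!serializer.write(doc, out)) {
                throw new IllegalStateException("serialization failed");
            }
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = "<t>\u011B&#x2062;</t>".getBytes(StandardCharsets.UTF_8);
        String utf8 = new String(roundTrip(input, "UTF-8"), StandardCharsets.UTF_8);
        String ascii = new String(roundTrip(input, "US-ASCII"), StandardCharsets.US_ASCII);
        // In UTF-8 the characters are written literally.
        System.out.println(utf8.contains("\u011B\u2062"));
        // In US-ASCII they should appear as numeric character references.
        System.out.println(ascii);
    }
}
```

Note that the US-ASCII variant would turn the Czech letters into
references as well, so it is a trade-off rather than an exact match for
a file that mixes literal characters and references.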

Bob Foster



Re: utf-8 characters problem

Posted by Jakub Kahovec <j....@imperial.ac.uk>.
I've been experimenting a bit with serializing and parsing (Java 1.4,
Xerces 2.6.2, Windows XP), and here are the results I got.
This is the input XML file:

<?xml version="1.0" encoding="UTF-8"?>
<testEncoding>
<czechCharsInUTF8>ěčřžýá</czechCharsInUTF8>
<ecaron>&#283;</ecaron>
<scaron>&#353;</scaron>
<invisibleTimesHex>&#x2062;</invisibleTimesHex>
<invisibleTimeDec>&#8290;</invisibleTimeDec>
<visibleTimes>&#x002a;</visibleTimes>
<plus>&#x002b;</plus>
</testEncoding>

After parsing and serializing from/to a file via a byte stream, I got
this output:

<?xml version="1.0" encoding="UTF-8"?>
<testEncoding>
<czechCharsInUTF8>ěčřžýá</czechCharsInUTF8>
<ecaron>Ä›</ecaron>
<scaron>š</scaron>
<invisibleTimesHex>⁢</invisibleTimesHex>
<invisibleTimeDec>⁢</invisibleTimeDec>
<visibleTimes>*</visibleTimes>
<plus>+</plus>
</testEncoding>

It seems pretty good: all characters are in UTF-8. The problem is again
with InvisibleTimes. Editing it is practically impossible, because
normal text editors show it as the sequence ⁢, which nobody can make
sense of.


After parsing and serializing from/to a file via a char stream, I got
this output:

<?xml version="1.0" encoding="UTF-16"?>
<testEncoding>
<czechCharsInUTF8>&#xc4;›ÄŤĹ™ĹľĂ˝Ăˇ</czechCharsInUTF8>
<ecaron>ě</ecaron>
<scaron>š</scaron>
<invisibleTimesHex>?</invisibleTimesHex>
<invisibleTimeDec>?</invisibleTimeDec>
<visibleTimes>*</visibleTimes>
<plus>+</plus>
</testEncoding>

It's completely useless: some of the characters are in the win1250
charset (ecaron and scaron), some are in UTF-8 (part of the
<czechCharsInUTF8> tag), and some are just question marks (the
invisibleTimes tags).
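
One plausible source of the question marks, sketched under assumptions:
the thread does not show the code, but if the char stream was a Writer
bound to the platform default encoding, characters outside that charset
are silently replaced. The example uses ISO-8859-1 as a stand-in for
win1250 because every JVM is guaranteed to provide it; the class and
method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the thread.
final class WriterEncodingDemo {

    // Writes text through a Writer bound to the given charset and returns the raw bytes.
    static byte[] writeWith(String text, Charset cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // U+2062 has no ISO-8859-1 representation, so the writer substitutes '?'.
        byte[] legacy = writeWith("\u2062", StandardCharsets.ISO_8859_1);
        System.out.println(legacy.length + " byte, value '" + (char) legacy[0] + "'");

        // With an explicit UTF-8 writer the character survives as three bytes.
        byte[] utf8 = writeWith("\u2062", StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes");
    }
}
```

Wrapping the output stream in an OutputStreamWriter with an explicit
UTF-8 charset, as in the second call, avoids the replacement.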


These results leave me a bit confused about which method I should use
to get the following result:

<?xml version="1.0" encoding="UTF-8"?>
<testEncoding>
<czechCharsInUTF8>ěčřžýá</czechCharsInUTF8>
<ecaron>Ä›</ecaron>
<scaron>š</scaron>
<invisibleTimesHex>&#x2062;</invisibleTimesHex>
<invisibleTimeDec>&#8290;</invisibleTimeDec>
<visibleTimes>*</visibleTimes>
<plus>+</plus>
</testEncoding>





Re: utf-8 characters problem

Posted by Bob Foster <bo...@objfac.com>.
As others have suggested, the problem is in JEditorPane. You need to tell 
it to use a font that can display all of your characters. Unfortunately, 
that's platform-specific and I'm not much of a JEditorPane user 
(Eclipse/SWT for me), but somebody can probably help you if you say what 
platform you're running on.

Bob Foster
