Posted to j-users@xerces.apache.org by Scott Eade <se...@backstagetech.com.au> on 2003/02/21 08:42:33 UTC

charset problem - UTF-8

I have had a brief scan of the mail archive and not come across anything
like this, but that said, I am not sure exactly where this problem might
be coming from.

Here is what I have:
1. Some data in a MySQL database that contains "right single quotation
marks" (Unicode U+2019) - thanks to the content being pasted in from MS Word.
2. The data is included in a CDATA section in a jdom-b8 tree.
3. A jdom XMLOutputter created with the encoding set to UTF-8
    XMLOutputter outputter = new XMLOutputter("  ", true, "UTF-8");
4. A HttpServletResponse with ContentType set to "text/xml; charset=UTF-8".
    HttpServletResponse response = whatever...;
    response.setContentType("text/xml; charset=UTF-8");
5. The Writer for the response is used to output the content
    outputter.output(doc, response.getWriter());
    response.flushBuffer();

Now the trouble is that the \u2019 characters do not seem to be written
correctly to the output stream (I am expecting to see "&#8217;" as a
replacement for these characters, but instead I am seeing the square block
placeholder - platform is win2k).

I am at a loss as to what to try.  I have gone from jdom-b7 to jdom-b8 and from
xercesj-1.3.0 to xercesj-2.0.2 to xercesj-2.3.0 and the problem persists.

Interestingly, some other characters are being correctly converted to
character references, but elsewhere in the same document they sometimes
are not.

Any clues would be most welcome.  I'll probably try the jdom list as well.

Thanks in advance for any replies.

Cheers,

Scott
-- 
Scott Eade
Backstage Technologies Pty. Ltd.
http://www.backstagetech.com.au
.Mac Chat/AIM: seade at mac dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-user-help@xml.apache.org


Re: charset problem - UTF-8

Posted by "Jesus M. Salvo Jr." <jm...@ihug.com.au>.
Scott Eade wrote:

>Okay, I'll answer my own question:
>1. The character \u2019 will not be converted to a character reference when
>UTF-8 is used 
>
Correct

>(it will use three bytes and will not be displayed correctly in
>applications that do not correctly deal with UTF-8 - e.g. Windows notepad).
>
Notepad _can_ display Unicode characters from files that have been saved 
as UTF-8, as long as the font you use in Notepad can display that 
character. At work, we have lots of files that contain Chinese 
characters that are saved as UTF-8, and I use the SimSun or SimHei font 
to view those files, including XML files in UTF-8.

When you do a "Save As", you have the option to save a file as UTF-8 
(and UTF-16 I think). Notepad also puts a BOM (Byte Order Mark) at the 
front of the file. You can see this BOM through a hex editor.
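
For what it is worth, the UTF-8 BOM is the three bytes EF BB BF (the 
UTF-8 encoding of U+FEFF). A minimal sketch of checking for it and 
stripping it before handing the bytes to something that does not expect 
it. The class and method names here are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomCheck {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the data begins with the three UTF-8 BOM bytes.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns a copy without the leading BOM, or the input unchanged if there is none.
    static byte[] stripUtf8Bom(byte[] data) {
        return startsWithUtf8Bom(data)
                ? Arrays.copyOfRange(data, 3, data.length)
                : data;
    }

    public static void main(String[] args) {
        // Simulate file content that starts with a BOM.
        byte[] withBom = "\uFEFF<a/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithUtf8Bom(withBom));
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8));
    }
}
```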

>2. In the cases where character references are used an editing component is
>causing them to be encoded - the component is not being used in the places
>where the characters are not encoded.
>3. Windows file encodings are a PITA.
>
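The default Windows encoding is called windows-1252, in most cases at 
least (it will of course be different for someone running Windows Thai).
It's _not_ the same as iso-8859-1. You can think of windows-1252 as a 
superset of iso-8859-1: it puts printable characters such as the smart 
quotes in the 0x80-0x9F range, where iso-8859-1 only has control codes.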

http://czyborra.com/charsets/iso8859.html

On some websites, what were supposed to be "smart quote" characters 
appear as question marks or as some other funny character in a non-IE 
browser.
It turns out that the HTTP header for the webpage advertised 
iso-8859-1, but the file itself was encoded in windows-1252.
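
A small sketch of that mismatch, decoding the single byte 0x92 (the 
windows-1252 right single quotation mark) under both charsets. The class 
name is made up, and it assumes the JRE ships the windows-1252 charset, 
which the common ones do but which is not guaranteed:

```java
import java.nio.charset.Charset;

public class SmartQuoteDecode {
    // Decode raw bytes using the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0x92 is the right single quotation mark in windows-1252.
        byte[] smartQuote = {(byte) 0x92};

        String asCp1252 = decode(smartQuote, "windows-1252");
        String asLatin1 = decode(smartQuote, "ISO-8859-1");

        // windows-1252 maps the byte to U+2019; iso-8859-1 maps it to the
        // control code U+0092, which a browser has nothing sensible to draw for.
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
    }
}
```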

>4. I know more now than I did before.
>
>Sorry for the noise.
>
>Scott
>  
>





Re: charset problem - UTF-8

Posted by Scott Eade <se...@backstagetech.com.au>.
Okay, I'll answer my own question:
1. The character \u2019 will not be converted to a character reference when
UTF-8 is used (it will use three bytes and will not be displayed correctly in
applications that do not correctly deal with UTF-8 - e.g. Windows notepad).
2. In the cases where character references are used an editing component is
causing them to be encoded - the component is not being used in the places
where the characters are not encoded.
3. Windows file encodings are a PITA.
4. I know more now than I did before.
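
Point 1 can be checked with the JDK alone. The sketch below is only an 
illustration and is not how JDOM's XMLOutputter is implemented: the 
class and the escapeFor helper are hypothetical, and the helper ignores 
XML markup characters such as < and &. It shows that U+2019 encodes to 
the bytes E2 80 99 in UTF-8, so no character reference is needed, while 
a charset that cannot represent it would need "&#8217;":

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharRefSketch {
    // Hypothetical helper: keep characters the target charset can encode,
    // replace the rest with decimal character references.
    static String escapeFor(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Space-separated upper-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "it\u2019s";
        // UTF-8 can represent U+2019 directly, so nothing is escaped.
        System.out.println(hex("\u2019".getBytes(StandardCharsets.UTF_8)));
        System.out.println(escapeFor(text, StandardCharsets.UTF_8).equals(text));
        // ISO-8859-1 cannot, so a character reference is emitted instead.
        System.out.println(escapeFor(text, StandardCharsets.ISO_8859_1));
    }
}
```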

Sorry for the noise.

Scott
-- 
Scott Eade
Backstage Technologies Pty. Ltd.
http://www.backstagetech.com.au
.Mac Chat/AIM: seade at mac dot com


