Posted to c-dev@xerces.apache.org by "Dominik Stadler (JIRA)" <xe...@xml.apache.org> on 2004/12/03 18:15:20 UTC
[jira] Created: (XERCESC-1305) Problem with XMLString::transcode() on Solaris
Problem with XMLString::transcode() on Solaris
----------------------------------------------
Key: XERCESC-1305
URL: http://nagoya.apache.org/jira/browse/XERCESC-1305
Project: Xerces-C++
Type: Bug
Components: Utilities
Versions: 2.4.0, 2.6.0
Environment: Solaris 8, Forte 8 Solaris C++ Compiler
Reporter: Dominik Stadler
Attachments: XercesTestcase.h
We have a problem on Sun Solaris where XMLString::transcode() does not seem to convert characters correctly from the ISO-8859-1 character set to the Unicode/XMLCh representation.
We have ISO-8859-1 set as the locale's code page through the environment variable LC_ALL.
When we call XMLString::transcode() for characters above 127 (hex 0x7F), we get invalid Unicode characters back.
The same application works fine on Linux.
This is a small testcase that shows the problem:
The output on Solaris is:
------------------- start of Solaris output -------------------------
Converted the character, result:
00 23 00 54 00 45 00 53 00 54 00 23
------------------- end of Solaris output -------------------------
This is wrong, as the Unicode representation of the pound sign (£) is 0x00A3, not 0x0023!
On Linux the output is correct:
------------------- start of Linux output -------------------------
Converted the character, result:
00 A3 00 54 00 45 00 53 00 54 00 A3
------------------- end of Linux output -------------------------
I will attach a testcase that shows the problem.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://nagoya.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org
[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris
Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
[ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=comments#action_56255 ]
Dominik Stadler commented on XERCESC-1305:
------------------------------------------
If I run mbtowc() with the pound sign, I get the following four hex bytes as the resulting wchar_t:
30 00 00 23
Xerces then just cuts off the first two bytes, which results in the incorrect value "00 23" reported above.
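The truncation described above can be illustrated with a small sketch. It is not the actual Xerces code: the 16-bit typedef and the helper names narrowToXMLCh and survivesNarrowing are hypothetical, and the value 0x30000023 is simply the four bytes quoted above read as one big-endian 32-bit number.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 16-bit XMLCh; the real typedef depends on the Xerces build.
typedef std::uint16_t XMLCh16;

// Mimics a plain narrowing cast from a 32-bit wchar_t value to a 16-bit unit:
// the upper 16 bits are simply discarded.
inline XMLCh16 narrowToXMLCh(std::uint32_t wideValue)
{
    return static_cast<XMLCh16>(wideValue & 0xFFFFu);
}

// Returns true when the narrowed value still equals the expected Unicode code point.
inline bool survivesNarrowing(std::uint32_t wideValue, std::uint32_t expectedCodePoint)
{
    return narrowToXMLCh(wideValue) == expectedCodePoint;
}
```

Calling narrowToXMLCh(0x30000023u) yields 0x0023, which matches the Solaris output, whereas a wchar_t that already holds the code point 0x000000A3 narrows to the expected 0x00A3.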
[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris
Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
[ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=comments#action_56331 ]
Dominik Stadler commented on XERCESC-1305:
------------------------------------------
I inquired at Sun about this issue and this is the response:
------------------------------------------------------
Solaris uses Unicode (UTF-32) at wchar_t only if the current locale is Unicode/UTF-8 locale. In any other locale, the wchar_t is not in UTF-32. This is due to the fact that wchar_t is an opaque data type in POSIX and we have been supporting wchar_t long before Unicode in our systems.
MSFT Windows declared that their wchar_t is Unicode when they created Windows NT as long as you define the _UNICODE macro in your VB/VC++ programs and that makes people think all wchar_t, regardless of platforms use Unicode but that's not really true.
To have UTF-32, please use iconv(3C) code conversions from the current locale's codeset (i.e., nl_langinfo(CODESET)) to UTF-32 or UTF-32BE/UTF-32LE.
By the way, we guarantee that the wchar_t is in UTF-32 if the current locale is a Unicode/UTF-8 locale.
------------------------------------------------------
So this is definitely incorrect in the current Xerces without ICU.
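As a rough sketch of the iconv route suggested in the response, the following converts bytes from a named codeset into 16-bit units without going through wchar_t. The helper name toUtf16 is hypothetical and this is not how Xerces is implemented. It targets UTF-16BE rather than UTF-32 only because the output in this report is shown as 16-bit units. The source codeset is passed in explicitly; in a real program it could come from nl_langinfo(CODESET), and codeset names accepted by iconv_open() vary between platforms.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>
#include <vector>

// Converts `input`, encoded in `fromCodeset`, to UTF-16 code units via iconv.
// Returns an empty vector if the conversion cannot be set up or fails.
std::vector<unsigned short> toUtf16(const std::string& input, const char* fromCodeset)
{
    std::vector<unsigned short> result;

    iconv_t cd = iconv_open("UTF-16BE", fromCodeset);
    if (cd == (iconv_t)-1)
        return result;

    // Copy the input because iconv() takes a non-const pointer on many systems.
    std::vector<char> in(input.begin(), input.end());
    in.push_back('\0');                      // keeps &in[0] valid for empty input
    std::vector<char> out(input.size() * 4 + 4); // generous upper bound for UTF-16

    char*  inPtr   = &in[0];
    size_t inLeft  = input.size();
    char*  outPtr  = &out[0];
    size_t outLeft = out.size();

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return result;

    // Assemble big-endian byte pairs into 16-bit code units.
    size_t produced = out.size() - outLeft;
    for (size_t i = 0; i + 1 < produced; i += 2) {
        unsigned hi = static_cast<unsigned char>(out[i]);
        unsigned lo = static_cast<unsigned char>(out[i + 1]);
        result.push_back(static_cast<unsigned short>((hi << 8) | lo));
    }
    return result;
}
```

With the ISO-8859-1 input "\xA3TEST\xA3", a working iconv should give six code units beginning and ending with 0x00A3, independent of how the platform represents wchar_t.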
[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris
Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.
[ http://issues.apache.org/jira/browse/XERCESC-1305?page=comments#action_59198 ]
Alberto Massari commented on XERCESC-1305:
------------------------------------------
For the record, I tried reproducing the bug using Solaris 10 (x86) and Sun Studio 10, but the testcase reports the correct result.
Someone with a SPARC should try reproducing it...
Alberto
[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris
Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
[ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=comments#action_56253 ]
Dominik Stadler commented on XERCESC-1305:
------------------------------------------
I found some discussion around this issue at
http://groups.yahoo.com/group/i18n-prog/message/1257
------- begin quote ------------
I am sorry this keeps coming as a surprise to people. This UNIX
behavior is well-documented and existed before Unicode was implemented
on UNIX (and Windows). Note that UCS-2 is insufficient for properly
covering Unicode characters.
The way wchar works is to take variable byte-length encodings and make
them a uniform 4 bytes/codepoint, so as to make certain types of
processing more straightforward. So, for example, if you're working in
EUC-JP, you don't have to worry whether you're taking 1 or 2 or 3
bytes/codepoint, you can be assured that you take 4 bytes and you get
one codepoint.
------- end quote ------------
So the problem is that Xerces, on all platforms that use the Iconv transcoder with mbstowcs() or mbtowc(), assumes that wchar_t is UCS-2. On Solaris in particular this is not the case: wchar_t holds something different.
[jira] Updated: (XERCESC-1305) Problem with XMLString::transcode() on Solaris
Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
[ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=history ]
Dominik Stadler updated XERCESC-1305:
-------------------------------------
Attachment: XercesTestcase.h
Testcase that prints the correct text on Linux but not on Solaris