Posted to c-dev@xerces.apache.org by "Dominik Stadler (JIRA)" <xe...@xml.apache.org> on 2004/12/03 18:15:20 UTC

[jira] Created: (XERCESC-1305) Problem with XMLString::transcode() on Solaris

Problem with XMLString::transcode() on Solaris
----------------------------------------------

         Key: XERCESC-1305
         URL: http://nagoya.apache.org/jira/browse/XERCESC-1305
     Project: Xerces-C++
        Type: Bug
  Components: Utilities  
    Versions: 2.4.0, 2.6.0    
 Environment: Solaris 8, Forte 8 Solaris C++ Compiler
    Reporter: Dominik Stadler
 Attachments: XercesTestcase.h

On Sun Solaris, XMLString::transcode() does not appear to convert characters from the ISO-8859-1 character set to their Unicode (XMLCh) representation correctly.

We set ISO-8859-1 as the locale codepage through the LC_ALL environment variable.

When we call XMLString::transcode() on characters above code 127 (0x7F), we get invalid Unicode characters back.

The same application works fine on Linux.

A small testcase shows the problem. Its output on Solaris is:
------------------- start of Solaris output -------------------------
Converted the character, result:
00 23  00 54  00 45  00 53  00 54  00 23
------------------- end of Solaris output -------------------------

This is wrong: the Unicode code point of the pound sign (£) is U+00A3, not U+0023, which is the number sign (#).

On Linux the output is correct:
------------------- start of Linux output -------------------------
Converted the character, result:
00 A3  00 54  00 45  00 53  00 54  00 A3
------------------- end of Linux output -------------------------

I will attach a testcase that shows the problem.
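
For readers without the attachment, the dump format shown above (two hex bytes per 16-bit unit, high byte first) could be produced by a helper along these lines. This is a guess at the presentation only, not the code in XercesTestcase.h; the name hexDump16 is hypothetical and std::uint16_t stands in for XMLCh.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: formats each 16-bit unit as "HH LL" (high byte first),
// separating units with two spaces, as in the output listings above.
std::string hexDump16(const std::vector<std::uint16_t>& units)
{
    std::string out;
    char buf[8];
    for (std::size_t i = 0; i < units.size(); ++i) {
        if (i != 0)
            out += "  ";
        std::snprintf(buf, sizeof buf, "%02X %02X",
                      (units[i] >> 8) & 0xFFu, units[i] & 0xFFu);
        out += buf;
    }
    return out;
}
```

Feeding it the six units U+00A3, 'T', 'E', 'S', 'T', U+00A3 gives the line reported for Linux.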


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://nagoya.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris

Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
     [ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=comments#action_56255 ]
     
Dominik Stadler commented on XERCESC-1305:
------------------------------------------

If I run mbtowc() on the pound sign, the resulting wchar_t contains the following four bytes (hex):

30 00 00 23

Xerces then simply cuts off the first two bytes, which yields the incorrect value "00 23" reported above.
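
The truncation described above can be sketched with plain integer arithmetic. This is only an illustration, not the Xerces source: the function name truncateTo16 is hypothetical, std::uint32_t models a 4-byte wchar_t, and the value 0x30000023 assumes the bytes 30 00 00 23 are read in big-endian order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of narrowing a 4-byte wchar_t value to a 16-bit unit:
// the upper 16 bits are discarded, whatever they mean in the current locale.
inline std::uint16_t truncateTo16(std::uint32_t wideValue)
{
    return static_cast<std::uint16_t>(wideValue & 0xFFFFu);
}

// The value reported in this comment: the locale-specific wchar_t for the
// pound sign loses its upper half and ends up as 0x0023 ('#').
inline void truncationExample()
{
    assert(truncateTo16(0x30000023u) == 0x0023u);
}
```

The narrowing is only harmless when the wide value already is a Unicode code point below 0x10000, which is exactly what cannot be assumed here.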


[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris

Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
     [ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=comments#action_56331 ]
     
Dominik Stadler commented on XERCESC-1305:
------------------------------------------

I inquired at Sun about this issue and this is the response:

------------------------------------------------------
Solaris uses Unicode (UTF-32) for wchar_t only if the current locale is a Unicode/UTF-8 locale. In any other locale, wchar_t is not in UTF-32. This is because wchar_t is an opaque data type in POSIX, and we have been supporting wchar_t in our systems since long before Unicode.

MSFT Windows declared that their wchar_t is Unicode when they created Windows NT as long as you define the _UNICODE macro in your VB/VC++ programs and that makes people think all wchar_t, regardless of platforms use Unicode but that's not really true.

To have UTF-32, please use iconv(3C) code conversions between the current locale's codeset (i.e., nl_langinfo(CODESET)) and UTF-32 or UTF-32BE/UTF-32LE.

By the way, we guarantee that the wchar_t is in UTF-32 if the current locale is a Unicode/UTF-8 locale.
------------------------------------------------------

So the assumption that wchar_t holds Unicode is definitely incorrect in current Xerces builds that do not use ICU.
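
The iconv route suggested in the response could look roughly like the sketch below. It is not a patch for Xerces: the function name toUtf32 is hypothetical, and the codeset names "UTF-32BE" and "ISO-8859-1" are assumed to be accepted by the local iconv implementation (the names differ between platforms). In a real program the source codeset would come from nl_langinfo(CODESET) after setlocale(); here it is passed in explicitly.

```cpp
#include <iconv.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Converts bytes in the given codeset to UTF-32 code points via iconv(3).
std::vector<std::uint32_t> toUtf32(const std::string& bytes, const char* fromCodeset)
{
    iconv_t cd = iconv_open("UTF-32BE", fromCodeset);
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    // Each input byte yields at most one code point, i.e. at most 4 output bytes.
    std::vector<char> out(bytes.size() * 4 + 4);
    char* inPtr = const_cast<char*>(bytes.data());
    std::size_t inLeft = bytes.size();
    char* outPtr = out.data();
    std::size_t outLeft = out.size();

    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)(-1))
        throw std::runtime_error("iconv failed");

    std::size_t produced = out.size() - outLeft;
    assert(produced % 4 == 0);

    // Assemble big-endian byte quadruples into code points.
    std::vector<std::uint32_t> result;
    for (std::size_t i = 0; i < produced; i += 4) {
        result.push_back((std::uint32_t(static_cast<unsigned char>(out[i])) << 24) |
                         (std::uint32_t(static_cast<unsigned char>(out[i + 1])) << 16) |
                         (std::uint32_t(static_cast<unsigned char>(out[i + 2])) << 8) |
                          std::uint32_t(static_cast<unsigned char>(out[i + 3])));
    }
    return result;
}
```

Asking for UTF-32BE explicitly avoids a byte-order mark in the output and makes the result independent of the host byte order and of whatever wchar_t means in the current locale.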


[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris

Posted by "Alberto Massari (JIRA)" <xe...@xml.apache.org>.
     [ http://issues.apache.org/jira/browse/XERCESC-1305?page=comments#action_59198 ]
     
Alberto Massari commented on XERCESC-1305:
------------------------------------------

For the record, I tried reproducing the bug using Solaris 10 (x86) and Sun Studio 10, but the testcase reports the correct result.
Someone with a SPARC should try reproducing it...

Alberto


[jira] Commented: (XERCESC-1305) Problem with XMLString::transcode() on Solaris

Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
     [ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=comments#action_56253 ]
     
Dominik Stadler commented on XERCESC-1305:
------------------------------------------

I found some discussion around this issue at 

http://groups.yahoo.com/group/i18n-prog/message/1257

------- begin quote ------------
I am sorry this keeps coming as a surprise to people. This UNIX
behavior is well-documented and existed before Unicode was implemented
on UNIX (and Windows). Note that UCS-2 is insufficient for properly
covering Unicode characters.

The way wchar works is to take variable byte-length encodings and make
them a uniform 4 bytes/codepoint, so as to make certain types of
processing more straightforward. So, for example, if you're working in
EUC-JP, you don't have to worry whether you're taking 1 or 2 or 3
bytes/codepoint, you can be assured that you take 4 bytes and you get
one codepoint.
------- end quote ------------

So the problem is that Xerces, on all platforms that use the Iconv transcoder with mbstowcs() or mbtowc(), assumes that wchar_t is UCS-2. On Solaris in particular this is not the case: wchar_t is something different.
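
One way a program can find out whether that assumption is safe is the standard macro __STDC_ISO_10646__, which an implementation defines only when wchar_t values are ISO 10646 (Unicode) code points in every locale. The sketch below is illustrative; the function name describeWcharEncoding is hypothetical, and no claim is made here about which platforms define the macro.

```cpp
#include <string>

// Reports what the implementation promises about wchar_t values.
// If __STDC_ISO_10646__ is not defined, results of mbtowc()/mbstowcs()
// must not be treated as Unicode without an explicit conversion.
std::string describeWcharEncoding()
{
#if defined(__STDC_ISO_10646__)
    return "wchar_t holds ISO 10646 code points in every locale";
#else
    return "wchar_t encoding is locale-dependent; convert explicitly";
#endif
}
```

Even where the macro is defined, a 4-byte wchar_t can hold code points above 0xFFFF, so copying it into a 16-bit unit is still only valid for the Basic Multilingual Plane.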


[jira] Updated: (XERCESC-1305) Problem with XMLString::transcode() on Solaris

Posted by "Dominik Stadler (JIRA)" <xe...@xml.apache.org>.
     [ http://nagoya.apache.org/jira/browse/XERCESC-1305?page=history ]

Dominik Stadler updated XERCESC-1305:
-------------------------------------

    Attachment: XercesTestcase.h

Testcase that prints the correct text on Linux but not on Solaris
