Posted to c-dev@xerces.apache.org by Geoff Coffey <gc...@wmotion.com> on 2001/10/03 04:22:33 UTC

Re: Windows 98 transcoder problem

My understanding now is that the standard specification for wcstombs does
not require the needed size to be returned when a buffer size of zero is
passed in. That seems to be only a (documented) side effect of the
implementations in VC++ and Borland. The standard also doesn't dictate that
the wchar_t* be transcoded to the local code page (?). In the CodeWarrior
implementation, as of version 7, wcstombs transcodes the wchar_t* to UTF-8,
which in their opinion complies with the standard (since UTF-8 is a
"multi-byte string").
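
On the length question, here is a minimal sketch, assuming the C library
provides the restartable variant wcsrtombs (declared in <wchar.h>). Unlike
wcstombs, its specification covers a null destination: nothing is stored and
the return value is the number of bytes the conversion would need. The helper
name wideToMultiByte is hypothetical, and the output encoding still depends on
the current C locale.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// Hypothetical helper, not part of Xerces: converts a wide string to the
// multibyte encoding of the current C locale using only standard calls.
// Returns false if a character cannot be represented in that encoding.
bool wideToMultiByte(const wchar_t* wide, std::string& out)
{
    assert(wide != nullptr);

    // Pass 1: with a null destination, wcsrtombs stores nothing and returns
    // the number of bytes needed, not counting the terminating null.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide;
    const std::size_t needed = std::wcsrtombs(nullptr, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1))
        return false;

    // Pass 2: convert into a buffer sized for the result plus terminator.
    std::string buffer(needed + 1, '\0');
    state = std::mbstate_t();
    src = wide;
    const std::size_t written =
        std::wcsrtombs(&buffer[0], &src, buffer.size(), &state);
    if (written == static_cast<std::size_t>(-1))
        return false;

    buffer.resize(written);  // drop the terminator from the std::string
    out.swap(buffer);
    return true;
}
```

A caller would pass a null-terminated wchar_t string and check the return
value before using the output string.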

So in my mind, the implementation in "Win32Transcoder.cpp" is a bug. It
should instead call WideCharToMultiByte directly, which has the desired
behavior across all compilers. Since this is a platform-specific source file,
I see no reason to use the standard C routine when the Win32 routine is
available by default.
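
A minimal sketch of what such a direct call might look like, assuming the
ANSI code page (CP_ACP) is the desired target and no special handling of
unmappable characters is needed. The function name wideToCodePage is
hypothetical and this is not the actual Win32Transcoder.cpp code. It only
illustrates the two-call pattern in which a zero-sized output buffer makes
WideCharToMultiByte report the required size.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: transcodes a null-terminated wide string to the
// system ANSI code page. Returns false if the conversion fails.
bool wideToCodePage(const wchar_t* wide, std::string& out)
{
    assert(wide != nullptr);

    // Pass 1: output size 0 asks only for the required byte count. Because
    // the input length is -1, the count includes the terminating null.
    const int needed =
        WideCharToMultiByte(CP_ACP, 0, wide, -1, NULL, 0, NULL, NULL);
    if (needed <= 0)
        return false;

    // Pass 2: convert into a buffer of exactly that size.
    std::string buffer(static_cast<std::size_t>(needed), '\0');
    const int written =
        WideCharToMultiByte(CP_ACP, 0, wide, -1, &buffer[0], needed, NULL, NULL);
    if (written <= 0)
        return false;

    buffer.resize(static_cast<std::size_t>(written) - 1);  // drop terminator
    out.swap(buffer);
    return true;
}
#endif  // _WIN32
```

The sizing call and the converting call use the same code page and flags, so
the size reported by the first call matches what the second one writes.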

Does this make sense? Is this the kind of thing that should be reported as a
bug? I've already fixed it on my end as it is a very simple change.

Thanks,

Geoff

On 9/28/01 3:33 AM, "Don Mastrovito" <dm...@marathontechnologies.com>
wrote:

> Geoff and Dean,
> 
> After looking at the C RTL sources for both Borland and MSVC, wcstombs()
> returns -1 on errors using either compiler.  The Borland documentation
> states "If an invalid multibyte character is encountered, wcstombs returns
> (size_t) -1. Otherwise, the function returns the number of bytes modified,
> not including the terminating code, if any."
> 
> Regarding the 2 or more byte issue:  Both implementations of wcstombs rely
> on a compile time quantity for the maximum number of bytes that a
> multi-byte character can contain.
> 
> Borland mbyte1.c:
> #define MB_MAX_CHARLEN  2           // current maximum MBCS character length
> 
> MSVC limits.h:
> #define MB_LEN_MAX    2             /* max. # bytes in multibyte char */
> 
> Additionally, both implementations utilize a Windows API to determine the
> correct string length.  It takes into account the current code page and how
> to deal with Unicode characters that don't directly translate into
> multi-byte.  Look up "WideCharToMultiByte" in the Platform SDK documentation
> for all the details.  I don't know of a standard C library equivalent to
> WideCharToMultiByte.
> 
> HTH,
> 
> Don
> 
> At 01:36 AM 9/28/2001 -0700, you wrote:
>> On 9/28/01 12:50 AM, "Dean Roddey" <dr...@charmedquark.com> wrote:
>> 
>>> No, definitely not 2 bytes. UTF-8 can take up to 6 bytes to hold a single
>>> Unicode character, and others can take 3 or 4 and whatnot. You really need
>>> to know what the target is going to take. And you can't really afford to do
>>> a worst case. If they are about to transcode a large amount of text,
>>> allocating 6 bytes per source Unicode char would be really piggy. Those
>>> other platforms have to have a function to do this calculation, since it's
>>> fundamental to doing transcoding.
>> 
>> Except that wcstombs would never transcode to UTF-8...if I understand it
>> correctly. It transcodes to whatever encoding makes sense in the current
>> locale, so the question is, can a "multi-byte" string ever require more than
>> 2 bytes per character? I know in my case it cannot because I'm always
>> dealing with iso_8859-1, which is always 1 byte per character. I took my
>> assumption above from this line in the wcstombs documentation at msdn:
>> 
>> "If there are two bytes in the multibyte output string for every wide
>> character in the input string, the result is guaranteed to fit."
>> 
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_crt_wcstombs.asp
>> 
>> 
>> Which applies at least to the MSVC++ implementation. Metrowerks'
>> implementation is actually simple-minded (it copies the low order bytes of
>> each wchar_t into a new char array) so as I said, for my purposes, my
>> assumption should be fine...
>> 
>> Is there a way in the standard C library to determine the necessary length?
>> 
>> Thanks,
>> 
>> Geoff
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
>> For additional commands, e-mail: xerces-c-dev-help@xml.apache.org
> 