Posted to c-dev@xerces.apache.org by Geoff Coffey <gc...@wmotion.com> on 2001/09/28 08:55:57 UTC

Windows 98 transcoder problem

OK, thanks to those who addressed my previous questions. I have one more
question, this time on the Windows side.

Our Windows work was done on Windows 2000, and everything worked well. But
we received reports from testers that it was not working on Windows 98. In
digging through it, I see that the problem lies in
Win32LCPTranscoder::transcode(XMLCh *). This code calls wcstombs() once to
determine the length of the transcoded string, and again to perform the
transcoding. The first call, in our Windows 98 environment, returns 0.
Consequently, an empty string is returned.
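
For reference, the two-call pattern described above looks roughly like this.
It is only a sketch in plain C, not the Xerces source: the name
transcode_two_pass is made up for illustration, wchar_t stands in for XMLCh,
and it assumes the runtime's wcstombs() accepts a null destination and
returns the required length, which is exactly the behavior in question here.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not the Xerces code): converts src to the current
 * locale's multibyte encoding. Returns a malloc'd, NUL-terminated string
 * that the caller must free, or NULL on failure. */
char *transcode_two_pass(const wchar_t *src)
{
    size_t needed;
    char *out;

    assert(src != NULL);

    /* First call: length query only. ASSUMPTION: a null destination makes
     * wcstombs() return the byte count (excluding the terminator). */
    needed = wcstombs(NULL, src, 0);
    if (needed == (size_t)-1)
        return NULL;                 /* a character could not be converted */

    out = malloc(needed + 1);
    if (out == NULL)
        return NULL;

    /* Second call: the actual conversion into the sized buffer. */
    if (wcstombs(out, src, needed + 1) == (size_t)-1) {
        free(out);
        return NULL;
    }
    out[needed] = '\0';              /* terminate explicitly */
    return out;
}
```

If the first call returns 0 for a non-empty input, as reported here for
Windows 98, the function hands back an empty string rather than NULL, which
matches the symptom.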

I can't fathom why this is failing. I looked at MSDN and couldn't find any
valid explanation. The toTranscode XMLCh string is valid (I can view it in
the debugger as a "unicode string" and it looks correct, and again it works
well on Windows 2000).

Has anyone run into this before?

Thanks,

Geoff


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


Re: Windows 98 transcoder problem

Posted by Geoff Coffey <gc...@wmotion.com>.
My understanding now is that the standard specification for wcstombs does
not dictate that if a buffer size of zero is passed in, the size needed
should be returned. This seems to be only a (documented) side effect of the
implementations in VC++ and Borland. Nor does it dictate that the
wchar_t* be transcoded to the local code page (?). In the CodeWarrior
implementation, as of version 7, wcstombs transcodes the wchar_t* to UTF-8,
which complies with the standard in their opinion (since UTF-8 is a
"multi-byte string").

So in my mind, the implementation in "Win32Transcoder.cpp" is a bug. It
should instead call WideCharToMultiByte directly, which has the desired
behavior across all compilers. Since this is a platform-specific source file,
I see no reason to use the standard C routine when the Win32 routine is by
default available.
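
A minimal sketch of what calling WideCharToMultiByte directly might look
like follows. It is not necessarily the change referred to in this message:
the helper name transcode_lcp is hypothetical, wchar_t stands in for XMLCh,
and CP_ACP (the system ANSI code page) is an assumed choice for "local code
page". The non-Windows branch is there only so the sketch builds elsewhere;
it relies on the POSIX guarantee that wcstombs() with a null destination
returns the required length.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns a malloc'd, NUL-terminated copy of src in
 * the local code page, or NULL on failure. Caller frees the result. */
char *transcode_lcp(const wchar_t *src)
{
    char *out;

    assert(src != NULL);
#ifdef _WIN32
    {
        /* cchWideChar = -1: src is NUL-terminated and the returned byte
         * count includes the terminator. cbMultiByte = 0: size query. */
        int bytes = WideCharToMultiByte(CP_ACP, 0, src, -1,
                                        NULL, 0, NULL, NULL);
        if (bytes <= 0)
            return NULL;
        out = malloc((size_t)bytes);
        if (out == NULL)
            return NULL;
        if (WideCharToMultiByte(CP_ACP, 0, src, -1,
                                out, bytes, NULL, NULL) == 0) {
            free(out);
            return NULL;
        }
        return out;
    }
#else
    {
        /* Stand-in for non-Windows builds (POSIX wcstombs semantics). */
        size_t needed = wcstombs(NULL, src, 0);
        if (needed == (size_t)-1)
            return NULL;
        out = malloc(needed + 1);
        if (out == NULL)
            return NULL;
        if (wcstombs(out, src, needed + 1) == (size_t)-1) {
            free(out);
            return NULL;
        }
        out[needed] = '\0';
        return out;
    }
#endif
}
```

The point of the Win32 branch is that the size query is part of the
documented contract of WideCharToMultiByte, so it does not depend on which
compiler's C runtime is linked in.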

Does this make sense? Is this the kind of thing that should be reported as a
bug? I've already fixed it on my end as it is a very simple change.

Thanks,

Geoff

On 9/28/01 3:33 AM, "Don Mastrovito" <dm...@marathontechnologies.com>
wrote:

> Geoff and Dean,
> 
> After looking at the C RTL sources for both Borland and MSVC, wcstombs()
> returns -1 on errors using either compiler.  The Borland documentation
> states "If an invalid multibyte character is encountered, wcstombs returns
> (size_t) -1. Otherwise, the function returns the number of bytes modified,
> not including the terminating code, if any."
> 
> Regarding the 2 or more byte issue:  Both implementations of wcstombs rely
> on a compile time quantity for the maximum number of bytes that a
> multi-byte character can contain.
> 
> Borland mbyte1.c:
> #define MB_MAX_CHARLEN  2           // current maximum MBCS character length
> 
> MSVC limits.h:
> #define MB_LEN_MAX    2             /* max. # bytes in multibyte char */
> 
> Additionally, both implementations utilize a Windows API to determine the
> correct string length.  It takes into account the current code page and how
> to deal with Unicode characters that don't directly translate into
> multi-byte.  Look up "WideCharToMultiByte" in the Platform SDK documentation
> for all the details.  I don't know of a standard C library equivalent to
> WideCharToMultiByte.
> 
> HTH,
> 
> Don
> 
> At 01:36 AM 9/28/2001 -0700, you wrote:
>> On 9/28/01 12:50 AM, "Dean Roddey" <dr...@charmedquark.com> wrote:
>> 
>>> No, definitely not 2 bytes. UTF-8 can take up to 6 bytes to hold a single
>>> Unicode character, and others can take 3 or 4 and whatnot. You really need
>>> to know what the target is going to take. And you can't really afford to do
>>> a worst case. If they are about to transcode a large amount of text,
>>> allocating 6 bytes per source Unicode char would be really piggy. Those
>>> other platforms have to have a function to do this calculation, since it's
>>> fundamental to doing transcoding.
>> 
>> Except that wcstombs would never transcode to UTF-8...if I understand it
>> correctly. It transcodes to whatever encoding makes sense in the current
>> locale, so the question is, can a "multi-byte" string ever require more than
>> 2 bytes per character? I know in my case it cannot because I'm always
>> dealing with ISO-8859-1, which is always 1 byte per character. I took my
>> assumption above from this line in the wcstombs documentation at MSDN:
>> 
>> "If there are two bytes in the multibyte output string for every wide
>> character in the input string, the result is guaranteed to fit."
>> 
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_crt_wcstombs.asp
>> 
>> Which applies at least to the MSVC++ implementation. Metrowerks'
>> implementation is actually simple-minded (it copies the low-order bytes of
>> each wchar_t into a new char array) so, as I said, for my purposes, my
>> assumption should be fine...
>> 
>> Is there a way in the standard c library to determine the necessary length?
>> 
>> Thanks,
>> 
>> Geoff


Re: Windows 98 transcoder problem

Posted by Don Mastrovito <dm...@marathontechnologies.com>.
Geoff and Dean,

After looking at the C RTL sources for both Borland and MSVC, wcstombs() 
returns -1 on errors using either compiler.  The Borland documentation 
states "If an invalid multibyte character is encountered, wcstombs returns 
(size_t) -1. Otherwise, the function returns the number of bytes modified, 
not including the terminating code, if any."

Regarding the 2 or more byte issue:  Both implementations of wcstombs rely 
on a compile time quantity for the maximum number of bytes that a 
multi-byte character can contain.

Borland mbyte1.c:
#define MB_MAX_CHARLEN  2           // current maximum MBCS character length

MSVC limits.h:
#define MB_LEN_MAX    2             /* max. # bytes in multibyte char */

Additionally, both implementations utilize a Windows API to determine the 
correct string length.  It takes into account the current code page and how 
to deal with Unicode characters that don't directly translate into 
multi-byte.  Look up "WideCharToMultiByte" in the Platform SDK documentation
for all the details.  I don't know of a standard C library equivalent to
WideCharToMultiByte.

HTH,

Don

At 01:36 AM 9/28/2001 -0700, you wrote:
>On 9/28/01 12:50 AM, "Dean Roddey" <dr...@charmedquark.com> wrote:
>
> > No, definitely not 2 bytes. UTF-8 can take up to 6 bytes to hold a single
> > Unicode character, and others can take 3 or 4 and whatnot. You really need
> > to know what the target is going to take. And you can't really afford to do
> > a worst case. If they are about to transcode a large amount of text,
> > allocating 6 bytes per source Unicode char would be really piggy. Those
> > other platforms have to have a function to do this calculation, since it's
> > fundamental to doing transcoding.
>
>Except that wcstombs would never transcode to UTF-8...if I understand it
>correctly. It transcodes to whatever encoding makes sense in the current
>locale, so the question is, can a "multi-byte" string ever require more than
>2 bytes per character? I know in my case it cannot because I'm always
>dealing with ISO-8859-1, which is always 1 byte per character. I took my
>assumption above from this line in the wcstombs documentation at MSDN:
>
>"If there are two bytes in the multibyte output string for every wide
>character in the input string, the result is guaranteed to fit."
>
>http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_crt_wcstombs.asp
>
>Which applies at least to the MSVC++ implementation. Metrowerks'
>implementation is actually simple-minded (it copies the low-order bytes of
>each wchar_t into a new char array) so, as I said, for my purposes, my
>assumption should be fine...
>
>Is there a way in the standard c library to determine the necessary length?
>
>Thanks,
>
>Geoff


Re: Windows 98 transcoder problem

Posted by Geoff Coffey <gc...@wmotion.com>.
On 9/28/01 12:50 AM, "Dean Roddey" <dr...@charmedquark.com> wrote:

> No, definitely not 2 bytes. UTF-8 can take up to 6 bytes to hold a single
> Unicode character, and others can take 3 or 4 and whatnot. You really need
> to know what the target is going to take. And you can't really afford to do
> a worst case. If they are about to transcode a large amount of text,
> allocating 6 bytes per source Unicode char would be really piggy. Those
> other platforms have to have a function to do this calculation, since it's
> fundamental to doing transcoding.

Except that wcstombs would never transcode to UTF-8...if I understand it
correctly. It transcodes to whatever encoding makes sense in the current
locale, so the question is, can a "multi-byte" string ever require more than
2 bytes per character? I know in my case it cannot because I'm always
dealing with ISO-8859-1, which is always 1 byte per character. I took my
assumption above from this line in the wcstombs documentation at MSDN:

"If there are two bytes in the multibyte output string for every wide
character in the input string, the result is guaranteed to fit."

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/HTML/_crt_wcstombs.asp

Which applies at least to the MSVC++ implementation. Metrowerks'
implementation is actually simple-minded (it copies the low-order bytes of
each wchar_t into a new char array) so, as I said, for my purposes, my
assumption should be fine...

Is there a way in the standard c library to determine the necessary length?
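
One candidate answer, offered as a sketch rather than something tested on
these compilers: wcsrtombs(), declared in <wchar.h> since Amendment 1 to C90
(and in C99), is specified to perform a length-only pass when the
destination pointer is null. The helper name multibyte_length is
hypothetical, and whether the MSVC, Borland, or CodeWarrior runtimes of this
era ship a conforming wcsrtombs() is an assumption that would need checking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes src needs in the current locale's
 * multibyte encoding, excluding the terminating NUL, or (size_t)-1 if some
 * character cannot be converted. */
size_t multibyte_length(const wchar_t *src)
{
    mbstate_t state;
    const wchar_t *cursor = src;     /* wcsrtombs takes a pointer to it */

    assert(src != NULL);

    memset(&state, 0, sizeof state); /* start in the initial shift state */

    /* Null destination: nothing is stored and the length argument is
     * ignored; only the required byte count is returned. */
    return wcsrtombs(NULL, &cursor, 0, &state);
}
```

A caller would allocate multibyte_length(src) + 1 bytes and then run the
real conversion, after checking for the (size_t)-1 error value.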

Thanks,

Geoff




Re: Windows 98 transcoder problem

Posted by Dean Roddey <dr...@charmedquark.com>.
> At any rate, it should be safe to modify Win32LCPTranscoder::transcode()
> to simply allocate 2 bytes per unicode character before transcoding,
> shouldn't it?
>

No, definitely not 2 bytes. UTF-8 can take up to 6 bytes to hold a single
Unicode character, and others can take 3 or 4 and whatnot. You really need
to know what the target is going to take. And you can't really afford to do
a worst case. If they are about to transcode a large amount of text,
allocating 6 bytes per source Unicode char would be really piggy. Those
other platforms have to have a function to do this calculation, since it's
fundamental to doing transcoding.
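
To make the trade-off above concrete, here is a sketch of the worst-case
approach in plain C. Instead of a hard-coded 2 or 6 it uses MB_CUR_MAX from
<stdlib.h>, the run-time maximum number of bytes per character in the
current locale. The name transcode_worst_case is hypothetical and wchar_t
stands in for XMLCh. Note that the peak allocation is still the worst case,
which is the cost being objected to; the realloc only trims the buffer
afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts src with a single wcstombs() call into a
 * worst-case-sized buffer, then shrinks it. Returns a malloc'd string the
 * caller must free, or NULL on failure. Size overflow is not checked. */
char *transcode_worst_case(const wchar_t *src)
{
    size_t capacity, written;
    char *out, *trimmed;

    assert(src != NULL);

    /* Upper bound: every wide character takes MB_CUR_MAX bytes. */
    capacity = wcslen(src) * MB_CUR_MAX + 1;
    out = malloc(capacity);
    if (out == NULL)
        return NULL;

    written = wcstombs(out, src, capacity);
    if (written == (size_t)-1) {
        free(out);
        return NULL;                 /* a character could not be converted */
    }
    out[written] = '\0';             /* written < capacity by construction */

    /* Give back the unused tail; keep the original if realloc fails. */
    trimmed = realloc(out, written + 1);
    return trimmed != NULL ? trimmed : out;
}
```

For a large document this briefly holds MB_CUR_MAX bytes per source
character, so an exact length query is still preferable where the platform
offers one.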

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"





Re: Windows 98 transcoder problem

Posted by Geoff Coffey <gc...@wmotion.com>.
My apologies for the hasty post....it's late :)

In looking a little more, it looks like wcstombs is not _supposed_ to return
the needed size when the buffer pointer is null, although MSDN documents it
this way, so I assume it does work this way in MSVC++??

But in every other reference, I see no mention of this behavior, and looking
at CodeWarrior's implementation of wcstombs, it clearly does not expect this
case.

At any rate, it should be safe to modify Win32LCPTranscoder::transcode() to
simply allocate 2 bytes per unicode character before transcoding, shouldn't
it?

Thanks,

Geoff

On 9/27/01 11:55 PM, "Geoff Coffey" <gc...@wmotion.com> wrote:

> OK, thanks to those who addressed my previous questions. I have one more
> question, this time on the Windows side.
> 
> Our Windows work was done on Windows 2000, and everything worked well. But
> we received reports from testers that it was not working on Windows 98. In
> digging through it, I see that the problem lies in
> Win32LCPTranscoder::transcode(XMLCh *). This code calls wcstombs() once to
> determine the length of the transcoded string, and again to perform the
> transcoding. The first call, in our Windows 98 environment, returns 0.
> Consequently, an empty string is returned.
> 
> I can't fathom why this is failing. I looked at MSDN and couldn't find any
> valid explanation. The toTranscode XMLCh string is valid (I can view it in
> the debugger as a "unicode string" and it looks correct, and again it works
> well on Windows 2000).
> 
> Has anyone run into this before?

