Posted to c-dev@xerces.apache.org by Ronald Lamprecht <R....@T-Online.de> on 2006/10/09 11:10:59 UTC

hardcoded transcoder assumptions in trunk 3.0

Hi,

For a gcc (mingw cross and native) compiled test version of a Windows
application I received error reports from Russia about an inability to
write XML files to paths containing Cyrillic characters. I can confirm
this problem, and it also occurs with Euro signs in the write path
components.

I tracked down the main problem to WindowsFileMgr.cpp l.141 (trunk rev
453748):

>     if (_onNT)
>     {
>         retVal = ::CreateFileW
>             (
>             (LPCWSTR) nameToOpen

where "nameToOpen" is the internal XMLCh* string. No transcoding is
performed.

This seems to cause no problems if you use MSVC with the 
"windows" transcoder, as XMLCh equals wchar_t. But for gcc (mingw) you 
have to choose another transcoder (iconv), and XMLCh is then 
declared as uint16. Passing this encoded string to Windows fails, of course.
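
For illustration only, here is a minimal sketch of an explicit conversion
from a 16-bit buffer to a wide string, instead of a plain cast. It is not
the Xerces code and not a proposed patch: utf16ToWide is a hypothetical
helper, and it assumes the buffer holds zero-terminated UTF-16 code units.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of Xerces: copies zero-terminated UTF-16
// code units (as stored in a 16-bit XMLCh buffer) into a std::wstring.
std::wstring utf16ToWide(const std::uint16_t* src)
{
    std::wstring out;
    if (!src)
        return out;
    while (*src) {
        std::uint32_t unit = *src++;
        // Where wchar_t is wider than 16 bits (not the case on Windows),
        // combine a valid surrogate pair into one code point.
        if (sizeof(wchar_t) > 2 && unit >= 0xD800 && unit <= 0xDBFF
            && *src >= 0xDC00 && *src <= 0xDFFF) {
            std::uint32_t low = *src++;
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}
```

On Windows the result of utf16ToWide(nameToOpen).c_str() could be handed to
a wide-character API. This only helps if the buffer really contains UTF-16;
it does not repair a buffer that was filled with some other encoding.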

Please check that any fix works with Cyrillic characters, Euro signs,
etc. in directory path components as well as in the filename itself. I got
the first case running with a simple fix, but that did not solve the second
case. There might be another code fragment within Xerces that causes 
problems, too.

BTW the current config selects the windows transcoder as default for a
native mingw configuration. But the windows transcoder will not compile
with mingw, as it hardcodes the assumption that XMLCh equals wchar_t. Please
either avoid this assumption in the windows transcoder or select another
transcoder (e.g. iconv) as default for mingw.

Thanks

Ronald


---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org


Re: hardcoded transcoder assumptions in trunk 3.0

Posted by Ronald Lamprecht <R....@T-Online.de>.
Hi Alberto,

Alberto Massari wrote:
> We can disagree on whether it's a hardcoded assumption, but the fact is 
> that Xerces assumes that an XMLCh buffer contains UTF-16 characters, and 
> Windows assumes that wchar_t contains UTF-16 characters, and this means 
> that the Windows API can directly use XMLCh buffers. The open question 
> is why iconv generated non-UTF-16 buffers, not why WindowsFileManager 
> didn't attempt to convert the input buffer.

I agree with your analysis and explanation. Would you please add the 
appropriate report to Jira, since you can explain the problem together 
with the relevant internal constraints and hints?

>> But your commit in revision 454356, which I thought was 
>> compatible with the patch above, brought all the problems back, even 
>> though both type definitions of your patch are set correctly. The 
>> problems show up with both transcoders: iconv problems as expected, 
>> but windows transcoder problems, too.
> 
> 
> The patch should only allow WindowFileManager to compile under MinGW, 
> and should have no effect on the runtime behaviour (if it worked when 
> you manually patched the headers, it should also work afterwards).

Pardon ... after a complete cleanup of everything, a complete 
recompilation, and a clean installation, trunk revision 454356 seems to work 
with the "windows" transcoder and WindowFileManager. I'll continue testing.

>> Let me briefly explain what we are doing:
>>
>> We get the save path by a getenv("HOME") call. We append the 
>> unproblematic filename. We transcode this local string via 
>> XMLString::transcode to an XMLCh string and use it as argument to 
>> domSer->writeToURI(doc, path).
>>
>> That is all - you can just add the three lines to your favorite 
>> testapp. If the environment variable "HOME" is set to "P:/€uro" we 
>> run into trouble.
>>
>> Another observation that may be a hint:
>>
>> We also transcode the XMLCh string to utf8 using a transcoder received by
>>
>> XMLPlatformUtils::fgTransService->makeNewTranscoderFor(XMLRecognizer::UTF_8, 
>> initResult, 4096);
>>
>> and display the resulting utf8 string on screen. 
> 
> I got lost in the description of the problem: where does the error 
> occur? When you invoke writeToURI or when you print the transcoded UTF-8 
> string?
> 
> Just to recap:
> 
> const char* env = getenv("HOME");
> char buffer[256];
> strcpy(buffer, env);
> strcat(buffer, "/dummy.xml");
> XMLCh* path = XMLString::transcode(buffer);
> domSer->writeToURI(doc, path);
> 
> XMLTransService::Codes initResult;
> XMLTranscoder* xCoder =
>     XMLPlatformUtils::fgTransService->makeNewTranscoderFor(XMLRecognizer::UTF_8,
>                                                           initResult, 4096);
> char buf2[256];
> unsigned int charsEaten;
> unsigned int srcChars = XMLString::stringLen(path);
> // leave room for the terminating zero, which transcodeTo does not add
> unsigned int outBytes = xCoder->transcodeTo(path, srcChars,
>                                 (XMLByte*) buf2, 255,
>                                 charsEaten, XMLTranscoder::UnRep_RepChar);
> buf2[outBytes] = 0;
> printf("%s", buf2);
> 
> Is this what you are doing? If yes, where does the error occur? BTW, do 
> you realize you are printing a UTF-8 string to a console that may be set 
> to only handle a specific codepage?

Of course we do not printf to the console but display the string with a 
UTF-8 based GUI.

Your code is essentially what we do. You should be able to use 
it for testing the iconv transcoder. The problem occurred in 
writeToURI for existing paths containing Euro signs or Cyrillic 
characters. The Xerces error messages look like:

Could not open file: C:\Documents and 
Settings\Костя.INFINITY-G3\Application Data/...

The GUI display of the path was a verification test for the transcoding 
that failed, too.
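
For cross-checking the transcoder output independently of Xerces, a
minimal sketch of a hand-written UTF-16 to UTF-8 conversion follows.
utf16ToUtf8 is a hypothetical helper, not a Xerces function, and it
assumes the XMLCh buffer is zero-terminated and made of 16-bit UTF-16
code units.

```cpp
#include <cstdint>
#include <string>

// Hypothetical reference encoder: converts zero-terminated UTF-16 code
// units to a UTF-8 byte string. Unpaired surrogates are not rejected;
// they are simply encoded as three bytes.
std::string utf16ToUtf8(const std::uint16_t* s)
{
    std::string out;
    while (s && *s) {
        std::uint32_t cp = *s++;
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && *s >= 0xDC00 && *s <= 0xDFFF) {
            std::uint32_t low = *s++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For a correct UTF-16 buffer, the Euro sign (U+20AC) comes out as the bytes
E2 82 AC and the Cyrillic letter К (U+041A) as D0 9A. If the bytes from
the Xerces transcoder differ from this reference for the same buffer, the
transcoder is the suspect; if both agree but are wrong, the XMLCh buffer
itself is likely not UTF-16.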

Thanks for your help.

Ronald



Re: hardcoded transcoder assumptions in trunk 3.0

Posted by Alberto Massari <am...@datadirect.com>.
Hi Ronald,

At 11.20 10/10/2006 +0200, Ronald Lamprecht wrote:
>Hi Alberto,
>
>Alberto Massari wrote:
>>Hi Ronald,
>>I think the error is in the iconv transcoder; 
>>the openFile(XMLCh*) method expects the XMLCh 
>>to be UTF-16, and if it's UTF-16 it can be used 
>>directly with the CreateFileW API. Can you open 
>>a bug in Jira attaching a sample file?
>
>That is what I called "hardcoded assumptions".

We can disagree on whether it's a hardcoded 
assumption, but the fact is that Xerces assumes 
that an XMLCh buffer contains UTF-16 characters, 
and Windows assumes that wchar_t contains UTF-16 
characters, and this means that the Windows API can 
directly use XMLCh buffers. The open question 
is why iconv generated non-UTF-16 buffers, 
not why WindowsFileManager didn't attempt to convert the input buffer.
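
One way to check what a transcoder actually put into the buffer is to dump
its code units. The following is only a sketch under the assumption that
XMLCh is a 16-bit unsigned type; dumpCodeUnits is a hypothetical
diagnostic helper, not part of Xerces.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical diagnostic helper: formats each 16-bit code unit of a
// zero-terminated buffer as four hex digits, separated by spaces.
std::string dumpCodeUnits(const std::uint16_t* s)
{
    std::string out;
    char tmp[8];
    for (; s && *s; ++s) {
        std::snprintf(tmp, sizeof tmp, "%04X", static_cast<unsigned>(*s));
        if (!out.empty())
            out += ' ';
        out += tmp;
    }
    return out;
}
```

For the path "P:/€uro" a UTF-16 buffer should start with
0050 003A 002F 20AC. A unit such as 0080 in place of 20AC would suggest
that a Windows-1252 byte was widened without being converted.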


>>As for MinGW, I just tested the latest gcc 
>>2.4.2 and found out that the Windows API is now 
>>defined in terms of wchar_t, instead of 
>>unsigned short; I'll try to make configure 
>>detect this case, but in the meanwhile you 
>>should be able to compile Xerces on MinGW with 
>>the Windows transcoder by changing config.h and 
>>Xerces_autoconf_config.hpp to have
>>   #define XERCES_XMLCH_T wchar_t
>>Hope this helps,
>
>What worked immediately was configuring Xerces 
>without disabling the windows transcoder, which 
>is then chosen as the default, and applying the 
>patch above. Recompiling Xerces and our 
>application showed no transcoding problems, and 
>opening and saving files worked even with a Euro sign.
>
>But your commit in revision 454356, which I 
>thought was compatible with the patch above, 
>brought all the problems back, even though both type 
>definitions of your patch are set correctly. The 
>problems show up with both transcoders: iconv 
>problems as expected, but windows transcoder problems, too.

The patch should only allow WindowFileManager to 
compile under MinGW, and should have no effect on 
the runtime behaviour (if it worked when you 
manually patched the headers, it should also work afterwards).


>This looks like further side effects of the configure.ac change.
>
>Let me briefly explain what we are doing:
>
>We get the save path by a getenv("HOME") call. 
>We append the unproblematic filename. We transcode 
>this local string via XMLString::transcode to an 
>XMLCh string and use it as argument to domSer->writeToURI(doc, path).
>
>That is all - you can just add the three lines 
>to your favorite testapp. If the environment variable 
>"HOME" is set to "P:/€uro" we run into trouble.
>
>Another observation that may be a hint:
>
>We also transcode the XMLCh string to utf8 using a transcoder received by
>
>XMLPlatformUtils::fgTransService->makeNewTranscoderFor(XMLRecognizer::UTF_8, 
>initResult, 4096);
>
>and display the resulting utf8 string on screen.
>
>In the first patched Xerces version the string 
>is displayed correctly. With the current trunk 
>version the Euro-Sign is garbled - the transcode to utf-8 seems not to work.
>
>Any ideas?

I got lost in the description of the problem: 
where does the error occur? When you invoke 
writeToURI or when you print the transcoded UTF-8 string?

Just to recap:

const char* env = getenv("HOME");
char buffer[256];
strcpy(buffer, env);
strcat(buffer, "/dummy.xml");
XMLCh* path = XMLString::transcode(buffer);
domSer->writeToURI(doc, path);

XMLTransService::Codes initResult;
XMLTranscoder* xCoder =
    XMLPlatformUtils::fgTransService->makeNewTranscoderFor(XMLRecognizer::UTF_8,
                                                          initResult, 4096);
char buf2[256];
unsigned int charsEaten;
unsigned int srcChars = XMLString::stringLen(path);
// leave room for the terminating zero, which transcodeTo does not add
unsigned int outBytes = xCoder->transcodeTo(path, srcChars,
                                 (XMLByte*) buf2, 255,
                                 charsEaten, XMLTranscoder::UnRep_RepChar);
buf2[outBytes] = 0;
printf("%s", buf2);

Is this what you are doing? If yes, where does 
the error occur? BTW, do you realize you are 
printing a UTF-8 string to a console that may be 
set to only handle a specific codepage?

Alberto 




Re: hardcoded transcoder assumptions in trunk 3.0

Posted by Ronald Lamprecht <R....@T-Online.de>.
Hi Alberto,

Alberto Massari wrote:
> Hi Ronald,
> I think the error is in the iconv transcoder; the openFile(XMLCh*) 
> method expects the XMLCh to be UTF-16, and if it's UTF-16 it can be used 
> directly with the CreateFileW API. Can you open a bug in Jira attaching 
> a sample file?

That is what I called "hardcoded assumptions".

> As for MinGW, I just tested the latest gcc 2.4.2 and found out that the 
> Windows API is now defined in terms of wchar_t, instead of unsigned 
> short; I'll try to make configure detect this case, but in the meanwhile 
> you should be able to compile Xerces on MinGW with the Windows 
> transcoder by changing config.h and Xerces_autoconf_config.hpp to have
> 
>   #define XERCES_XMLCH_T wchar_t
> 
> Hope this helps,

What worked immediately was configuring Xerces without disabling the 
windows transcoder, which is then chosen as the default, and applying the 
patch above. Recompiling Xerces and our application showed no transcoding 
problems, and opening and saving files worked even with a Euro sign.

But your commit in revision 454356, which I thought was 
compatible with the patch above, brought all the problems back, even 
though both type definitions of your patch are set correctly. The 
problems show up with both transcoders: iconv problems as expected, but 
windows transcoder problems, too.

This looks like further side effects of the configure.ac change.

Let me briefly explain what we are doing:

We get the save path by a getenv("HOME") call. We append the unproblematic 
filename. We transcode this local string via XMLString::transcode to an 
XMLCh string and use it as argument to domSer->writeToURI(doc, path).

That is all - you can just add the three lines to your favorite testapp. 
If the environment variable "HOME" is set to "P:/€uro" we run into trouble.

Another observation that may be a hint:

We also transcode the XMLCh string to utf8 using a transcoder received by

XMLPlatformUtils::fgTransService->makeNewTranscoderFor(XMLRecognizer::UTF_8, 
initResult, 4096);

and display the resulting utf8 string on screen.

In the first patched Xerces version the string is displayed correctly. 
With the current trunk version the Euro-Sign is garbled - the transcode 
to utf-8 seems not to work.

Any ideas?

Ronald



Re: hardcoded transcoder assumptions in trunk 3.0

Posted by Alberto Massari <am...@datadirect.com>.
Hi Ronald,
I think the error is in the iconv transcoder; the openFile(XMLCh*) 
method expects the XMLCh to be UTF-16, and if it's UTF-16 it can be 
used directly with the CreateFileW API. Can you open a bug in Jira 
attaching a sample file?
As for MinGW, I just tested the latest gcc 2.4.2 and found out that 
the Windows API is now defined in terms of wchar_t, instead of 
unsigned short; I'll try to make configure detect this case, but in 
the meantime you should be able to compile Xerces on MinGW with the 
Windows transcoder by changing config.h and Xerces_autoconf_config.hpp to have

   #define XERCES_XMLCH_T wchar_t

Hope this helps,
Alberto



