Posted to c-dev@xerces.apache.org by "Janusz Nykiel (JIRA)" <xe...@xml.apache.org> on 2008/06/25 12:07:45 UTC
[jira] Created: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Win32Transcoder uses "best-fit" algorithm causing data loss
-----------------------------------------------------------
Key: XERCESC-1813
URL: https://issues.apache.org/jira/browse/XERCESC-1813
Project: Xerces-C++
Issue Type: Bug
Components: Utilities
Affects Versions: 2.8.0
Environment: Windows
Reporter: Janusz Nykiel
Win32Transcoder implicitly uses Windows' "best-fit" algorithm when it calls the WinAPI WideCharToMultiByte function. The algorithm may transliterate characters according to the built-in Windows code page tables, which contain arbitrary rules: for example, ∞ (the infinity symbol) is changed to 8 (the digit) when the target character set doesn't have it (see http://blogs.msdn.com/michkap/archive/2005/02/13/371895.aspx). Thus the original character data may be lost. Sometimes the output XML may be outright malformed: for example, when the output character set is ISO-8859-2 and the input XML contains the double angle quotation marks U+00AB («) and U+00BB (»), they are transliterated to < and >, respectively.
WideCharToMultiByte has a flag that controls use of the "best-fit" algorithm: WC_NO_BEST_FIT_CHARS, available starting with Windows 98/2000. Adding this flag to the WideCharToMultiByte invocations in the ::transcodeTo and ::canTranscodeTo methods of Win32Transcoder fixes the problem.
The documentation for WideCharToMultiByte states:
"
For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "∞" (infinity) maps to 8 (eight) in some code pages.
"
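To illustrate the flag described above, here is a minimal sketch of a strict representability check. It is not the Win32Transcoder code and not the patch itself: the helper name transcodesExactly is hypothetical, and the non-Windows branch is only a conservative stand-in (ASCII only) so that the sketch compiles elsewhere. It assumes that 28592 is the Windows code page identifier for ISO-8859-2.

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper, not part of Xerces-C++: returns true only when every
// character of `text` has an exact mapping in the target code page, that is,
// when no best-fit substitution and no default character would be needed.
bool transcodesExactly(const std::wstring& text, unsigned int codePage)
{
    // WideCharToMultiByte takes the length as an int.
    assert(text.size() <= static_cast<std::size_t>(INT_MAX));
    if (text.empty())
        return true;

#ifdef _WIN32
    const int length = static_cast<int>(text.size());

    // First call: ask for the required buffer size, with best-fit disabled.
    const int needed = ::WideCharToMultiByte(
        codePage, WC_NO_BEST_FIT_CHARS,
        text.data(), length, NULL, 0, NULL, NULL);
    if (needed <= 0)
        return false;

    // Second call: convert, and let Windows report whether it had to fall
    // back to the default character for anything it could not map.
    std::vector<char> buffer(static_cast<std::size_t>(needed));
    BOOL usedDefault = FALSE;
    const int written = ::WideCharToMultiByte(
        codePage, WC_NO_BEST_FIT_CHARS,
        text.data(), length, buffer.data(), needed, NULL, &usedDefault);
    return written > 0 && usedDefault == FALSE;
#else
    // Stand-in for other platforms (an assumption of this sketch): accept
    // only ASCII, which is conservative and rejects some valid characters.
    (void)codePage;
    for (wchar_t ch : text) {
        if (static_cast<unsigned long>(ch) > 0x7F)
            return false;
    }
    return true;
#endif
}
```

With the flag set, a call such as transcodesExactly(L"\u00ABx\u00BB", 28592) should report false instead of silently accepting < and > as substitutes, which lets the caller decide how to handle the unrepresentable characters.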
Example input XML:
<?xml version='1.0' encoding='windows-1250' ?>
<test>zażółć gęślą «jaźń»</test>
Expected output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą «jaźń»</test>
Actual output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą <jaźń></test>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org
[jira] Updated: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Posted by "Janusz Nykiel (JIRA)" <xe...@xml.apache.org>.
[ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Janusz Nykiel updated XERCESC-1813:
-----------------------------------
Attachment: output-expected.xml, output-actual.xml, input.xml
[jira] Commented: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
[ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608125#action_12608125 ]
David Bertoni commented on XERCESC-1813:
----------------------------------------
Janusz, can you attach these three snippets as documents, so there aren't any encoding issues with the Jira database and email clients? Thanks!
[jira] Updated: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Posted by "Boris Kolpackov (JIRA)" <xe...@xml.apache.org>.
[ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Boris Kolpackov updated XERCESC-1813:
-------------------------------------
Fix Version/s: 2.9.0, 3.0.0
Would be nice to fix for 3.0.0, 2.9.0.
[jira] Assigned: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
[ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Bertoni reassigned XERCESC-1813:
--------------------------------------
Assignee: David Bertoni
[jira] Commented: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
[ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608118#action_12608118 ]
David Bertoni commented on XERCESC-1813:
----------------------------------------
I'm already working on another issue with the Win32 transcoder, so I'll take this one.
[jira] Resolved: (XERCESC-1813) Win32Transcoder uses "best-fit"
algorithm causing data loss
Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
[ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Bertoni resolved XERCESC-1813.
------------------------------------
Resolution: Fixed
I've applied patches to the Xerces-C 2.9 branch and the trunk. If possible, please verify the issue is resolved.