Posted to c-dev@xerces.apache.org by "Janusz Nykiel (JIRA)" <xe...@xml.apache.org> on 2008/06/25 12:07:45 UTC

[jira] Created: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Win32Transcoder uses "best-fit" algorithm causing data loss
-----------------------------------------------------------

                 Key: XERCESC-1813
                 URL: https://issues.apache.org/jira/browse/XERCESC-1813
             Project: Xerces-C++
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 2.8.0
         Environment: Windows
            Reporter: Janusz Nykiel


Win32Transcoder implicitly uses Windows' "best-fit" algorithm when calling the WinAPI WideCharToMultiByte function. Best-fit mapping transliterates characters according to built-in Windows code page tables, whose rules can be arbitrary: for example, ∞ (the infinity symbol) is changed to 8 (the digit) when the target character set doesn't have it (see http://blogs.msdn.com/michkap/archive/2005/02/13/371895.aspx). The original character data may therefore be lost. In some cases the output XML is outright malformed: for example, when the output character set is ISO-8859-2 and the input XML contains U+00AB («) and U+00BB (»), the double angle quotation marks, they are transliterated to < and >, respectively.

WideCharToMultiByte has a flag that disables the "best-fit" algorithm: WC_NO_BEST_FIT_CHARS, available starting with Windows 98/2000. Adding this flag to the WideCharToMultiByte invocations in the ::transcodeTo and ::canTranscodeTo methods of Win32Transcoder fixes the problem.
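
For illustration only, here is a minimal sketch of the kind of call described above. It is not the Win32Transcoder code or the applied patch. The helper name canTranscodeStrictly is hypothetical, and the non-Windows branch is merely a stand-in (it treats only ASCII as representable) so the sketch compiles elsewhere. On Windows, the lpUsedDefaultChar out-parameter reports whether any character had to be replaced by the default character, which is what happens to unmappable input once best-fit is disabled.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of Xerces-C++): reports whether every
// character of `text` maps exactly into `codePage` with best-fit disabled.
bool canTranscodeStrictly(const std::wstring& text, unsigned int codePage)
{
    if (text.empty())
        return true;  // nothing to convert; the API rejects a zero length

#ifdef _WIN32
    const int len = static_cast<int>(text.size());

    // First call: null output buffer, size 0, returns the required byte count.
    const int needed = ::WideCharToMultiByte(
        codePage, WC_NO_BEST_FIT_CHARS, text.c_str(), len,
        nullptr, 0, nullptr, nullptr);
    if (needed <= 0)
        return false;

    // Second call: convert for real and ask whether the default
    // replacement character had to be used for any input character.
    std::string bytes(static_cast<std::size_t>(needed), '\0');
    BOOL usedDefault = FALSE;
    const int written = ::WideCharToMultiByte(
        codePage, WC_NO_BEST_FIT_CHARS, text.c_str(), len,
        &bytes[0], needed, nullptr, &usedDefault);
    return written > 0 && usedDefault == FALSE;
#else
    // Stand-in for non-Windows builds (an assumption, not a transcoder):
    // only ASCII is considered representable.
    (void)codePage;
    for (wchar_t ch : text)
        if (static_cast<unsigned long>(ch) > 0x7F)
            return false;
    return true;
#endif
}
```

With code page 28592 (the Windows identifier for ISO-8859-2), canTranscodeStrictly(L"\u00ABabc\u00BB", 28592) should return false, because « and » are no longer silently mapped to < and >.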

The documentation for WideCharToMultiByte states:
"
For strings that require validation, such as file, resource, and user names, the application should always use the WC_NO_BEST_FIT_CHARS flag with WideCharToMultiByte. This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "∞" (infinity) maps to 8 (eight) in some code pages.
"


Example input XML:
<?xml version='1.0' encoding='windows-1250' ?>
<test>zażółć gęślą «jaźń»</test>

Expected output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą &#xAB;jaźń&#xBB;</test>

Actual output XML:
<?xml version='1.0' encoding='iso-8859-2' ?>
<test>zażółć gęślą <jaźń></test>
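
To make the expected output concrete, the following portable sketch shows the idea of writing a character the target encoding cannot represent as a numeric character reference instead of substituting a look-alike. It is an assumption-laden illustration, not the Xerces-C++ serializer: escapeUnrepresentable and the canEncode predicate are hypothetical names, and the caller supplies the predicate that decides what the target character set can hold.

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical helper: copies code points the target can encode and
// replaces every other code point with an XML reference such as &#xAB;.
std::u32string escapeUnrepresentable(
    const std::u32string& text,
    const std::function<bool(char32_t)>& canEncode)
{
    std::u32string out;
    for (char32_t cp : text) {
        if (canEncode(cp)) {
            out.push_back(cp);
            continue;
        }
        // "&#x" + at most 8 hex digits + ";" + terminator fits in 16 bytes.
        char buf[16];
        std::snprintf(buf, sizeof buf, "&#x%lX;",
                      static_cast<unsigned long>(cp));
        for (const char* p = buf; *p != '\0'; ++p)
            out.push_back(static_cast<char32_t>(*p));
    }
    return out;
}
```

The predicate only gives a trustworthy answer if the underlying transcoder reports unmappable characters honestly, which is why best-fit substitution in canTranscodeTo leads to the malformed output shown above.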


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: c-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: c-dev-help@xerces.apache.org


[jira] Updated: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Posted by "Janusz Nykiel (JIRA)" <xe...@xml.apache.org>.
     [ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Janusz Nykiel updated XERCESC-1813:
-----------------------------------

    Attachment: output-expected.xml
                output-actual.xml
                input.xml




[jira] Commented: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
    [ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608125#action_12608125 ] 

David Bertoni commented on XERCESC-1813:
----------------------------------------

Janusz, can you attach these three snippets as documents, so there aren't any encoding issues with the Jira database and email clients?  Thanks!




[jira] Updated: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Posted by "Boris Kolpackov (JIRA)" <xe...@xml.apache.org>.
     [ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Boris Kolpackov updated XERCESC-1813:
-------------------------------------

    Fix Version/s: 2.9.0
                   3.0.0

Would be nice to fix for 3.0.0, 2.9.0.




[jira] Assigned: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
     [ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Bertoni reassigned XERCESC-1813:
--------------------------------------

    Assignee: David Bertoni




[jira] Commented: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
    [ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608118#action_12608118 ] 

David Bertoni commented on XERCESC-1813:
----------------------------------------

I'm already working on another issue with the Win32 transcoder, so I'll take this one.




[jira] Resolved: (XERCESC-1813) Win32Transcoder uses "best-fit" algorithm causing data loss

Posted by "David Bertoni (JIRA)" <xe...@xml.apache.org>.
     [ https://issues.apache.org/jira/browse/XERCESC-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Bertoni resolved XERCESC-1813.
------------------------------------

    Resolution: Fixed

I've applied patches to the Xerces-C 2.9 branch and the trunk.  If possible, please verify the issue is resolved.

