Posted to dev@xalan.apache.org by bu...@apache.org on 2001/05/07 09:24:45 UTC

[Bug 1639] New - Xalan escaping characters for ISO encodings other than ISO-8859-1

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=1639

*** shadow/1639	Mon May  7 00:24:45 2001
--- shadow/1639.tmp.15531	Mon May  7 00:24:45 2001
***************
*** 0 ****
--- 1,58 ----
+ +============================================================================+
+ | Xalan escaping characters for ISO encodings other than ISO-8859-1          |
+ +----------------------------------------------------------------------------+
+ |        Bug #: 1639                        Product: XalanJ2                 |
+ |       Status: NEW                         Version: 2.0.x                   |
+ |   Resolution:                            Platform: PC                      |
+ |     Severity: Normal                   OS/Version:                         |
+ |     Priority:                           Component: org.apache.xalan.serial |
+ +----------------------------------------------------------------------------+
+ |  Assigned To: xalan-dev@xml.apache.org                                     |
+ |  Reported By: tgeor@yahoo.com                                              |
+ |      CC list: Cc:                                                          |
+ +----------------------------------------------------------------------------+
+ |          URL:                                                              |
+ +============================================================================+
+ |                              DESCRIPTION                                   |
+ I found that the Xalan serializer escapes characters when you use an encoding
+ of another language.
+ 
+ Example
+ 
+ ------------ foo.xml ------------ 
+ <?xml version="1.0" encoding="ISO-8859-7"?>
+ <doc>ΑΒΓ (ABC in Greek)</doc>
+ 
+ ------------ foo.xsl ------------ 
+ <?xml version="1.0"?> 
+ <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
+   <xsl:output method="xml" encoding="ISO-8859-7"/>
+   <xsl:template match="doc">
+     <out><xsl:value-of select="."/></out>
+   </xsl:template>
+ </xsl:stylesheet>
+ 
+ ------------ foo.out ------------ 
+ <?xml version="1.0" encoding="ISO-8859-7"?>
+ <out>&#913;&#914;&#915; (ABC in Greek)</out>
+ 
+ The expected output should be
+ 
+ <?xml version="1.0" encoding="ISO-8859-7"?>
+ <out>ΑΒΓ (ABC in Greek)</out>
+ 
+ The same happens to attribute values when you have non-English characters.
+ 
+ The problem is in the code of org.apache.xalan.serialize.SerializerToXML and
+ org.apache.xalan.serialize.SerializerToHTML, where they check whether ch <
+ m_maxCharacter. When Java reads data from an input stream, it converts the
+ characters to Unicode, so, for example, the Greek letter A, which has a value
+ of 0xC1 (193) in ISO-8859-7, becomes the Unicode character 0x0391 (913). The
+ maximum printable character in ISO formats is 0xFF (255), so the comparison
+ ch < m_maxCharacter will always be false for these letters.
+ 
+ You should compare the output character values with m_maxCharacter, not the
+ Unicode character values. A solution could be to translate the characters to
+ the output encoding using the getBytes(m_encoding) method of the String class
+ and compare the byte values with m_maxCharacter, but there will be a
+ performance overhead.
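
A rough sketch of the idea, for illustration only. It is not the Xalan code: the class EncodingEscapeSketch and the method escapeUnencodable are hypothetical names. Instead of getBytes(m_encoding), it asks a java.nio.charset.CharsetEncoder whether the output encoding can represent each character, which assumes a JDK that ships java.nio.charset (1.4 or later) and has ISO-8859-7 installed. It handles only characters in the Basic Multilingual Plane and does not escape markup characters such as < and &.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingEscapeSketch {

    // Hypothetical helper (not part of Xalan): copies the text, replacing
    // every character that the target encoding cannot represent with a
    // numeric character reference. Surrogate pairs are not handled.
    public static String escapeUnencodable(String text, String encoding) {
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (encoder.canEncode(ch)) {
                out.append(ch);                  // representable: write as-is
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Byte 0xC1 in ISO-8859-7 is decoded by Java to U+0391.
        String alpha = new String(new byte[] {(byte) 0xC1}, "ISO-8859-7");
        System.out.println((int) alpha.charAt(0)); // prints 913

        String greek = "\u0391\u0392\u0393";     // Greek capital Alpha, Beta, Gamma

        // ISO-8859-7 can represent these letters, so nothing is escaped.
        System.out.println(escapeUnencodable(greek, "ISO-8859-7").equals(greek)); // prints true

        // ISO-8859-1 cannot, so they become character references.
        System.out.println(escapeUnencodable(greek, "ISO-8859-1")); // prints &#913;&#914;&#915;
    }
}
```

With a check like this, the decision to escape depends on the output encoding rather than on the Unicode value of the character, so the foo.xml example above would keep its Greek letters when the output encoding is ISO-8859-7. The encoder is created once and reused for every character, which may reduce the per-character overhead compared with calling getBytes repeatedly.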