Posted to dev@ode.apache.org by "Alexey Ousov (JIRA)" <ji...@apache.org> on 2008/12/30 14:34:44 UTC

[jira] Created: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

utf-8 encoding is handled incorrectly within xslt stylesheets
-------------------------------------------------------------

                 Key: ODE-472
                 URL: https://issues.apache.org/jira/browse/ODE-472
             Project: ODE
          Issue Type: Bug
          Components: BPEL Runtime
    Affects Versions: 1.2
            Reporter: Alexey Ousov


The bug occurs when UTF-8 encoded characters appear either within the stylesheet itself or inside documents referenced with the document() function. All such characters are encoded twice.

So if we have in the XSLT something like:
<xsl:value-of select="&#x00e0;" />
which is UTF-8 encoded as "C3 A0", then in the result node we will have the sequence "C3 83 C2 A0", which is the UTF-8 encoding of "&#x00c3;&#x00a0;".

The cause of the bug is the XslRuntimeUriResolver class, which reads files into a string without detecting the file encoding. I made a quick fix, which fixes only the document() function with the XPath 1.0 runtime. Deeper investigation is needed, so hopefully a full fix will be available after New Year.
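
The double encoding described above can be reproduced outside ODE. What follows is a minimal sketch, not ODE code. It assumes the UTF-8 bytes were decoded with a single-byte charset (ISO-8859-1 here, standing in for whatever the platform default happens to be) and then written out again as UTF-8. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: reproduces the "C3 A0" -> "C3 83 C2 A0" effect.
class DoubleEncodingDemo {

    // Formats bytes as space-separated upper-case hex, e.g. "C3 A0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes UTF-8 bytes with the wrong charset, then re-encodes as UTF-8.
    static byte[] doubleEncode(byte[] utf8Bytes) {
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = "\u00e0".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(original));               // C3 A0
        System.out.println(hex(doubleEncode(original))); // C3 83 C2 A0
    }
}
```

Each byte of the original sequence is treated as a separate character (U+00C3 and U+00A0), and each of those becomes a two-byte UTF-8 sequence on output.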

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

Posted by "Alexey Ousov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ODE-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Ousov updated ODE-472:
-----------------------------

    Attachment: ODE-472-ful.patch

Full patch for this bug report attached.

- The XML encoding of external files is determined by the XML parser itself (external files are returned as streams).
- To keep compiler compatibility, XSLT stylesheets are still loaded from a String. To determine the correct encoding, this tip was used: http://www.ibm.com/developerworks/library/x-tipsaxxni/ . It has no external dependencies and uses the bundled Java XML parser if none is provided. However, it requires Java 5 for the "org.xml.sax.ext.Locator2" interface.
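
For readers without the patch at hand, the Locator2 idea can be sketched roughly as follows. This is not the attached patch, only an illustration under stated assumptions: the class SaxEncodingProbe and its methods are hypothetical, the parser is whatever SAXParserFactory returns by default, and the encoding is read at the first start tag, by which point the XML declaration has been processed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: asks the SAX parser which encoding it used for a document.
class SaxEncodingProbe extends DefaultHandler {
    private Locator locator;
    private String encoding; // stays null if the parser does not report one

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Locator2 (SAX2 extensions, Java 5+) exposes the entity's encoding.
        if (encoding == null && locator instanceof Locator2) {
            encoding = ((Locator2) locator).getEncoding();
        }
    }

    // Returns the charset the parser reports for the raw bytes, or null if unknown.
    static Charset detect(byte[] xmlBytes) {
        SaxEncodingProbe probe = new SaxEncodingProbe();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), probe);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return probe.encoding == null ? null : Charset.forName(probe.encoding);
    }

    // Decodes the raw bytes with the detected charset, falling back to UTF-8.
    static String decode(byte[] xmlBytes) {
        Charset cs = detect(xmlBytes);
        return new String(xmlBytes, cs != null ? cs : Charset.forName("UTF-8"));
    }
}
```

The sketch parses the whole document just to learn the encoding. A real implementation would probably stop after the first element.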

After applying the patch, all affected BPEL processes should be recompiled (because the XSLT body is stored inside the compiled BPEL process).




[jira] Commented: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

Posted by "Karthick Sankarachary (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ODE-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663469#action_12663469 ] 

Karthick Sankarachary commented on ODE-472:
-------------------------------------------

Alexey,

I think you hit the nail on the head. The bottom line is that when you construct the StreamSource class, you should use either a stream, in which case the XML parser will resolve the XML character encoding for you, or a reader, in which case the character encoding must already have been resolved.
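
To make the stream-versus-reader distinction concrete, here is a small sketch using only the standard JAXP API. It is not taken from ODE. The class StreamSourceDemo and its helper methods are invented for this example, and it uses an identity transform simply as a convenient way to push a StreamSource through the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Illustration only: same bytes, fed to the parser as a stream and as a reader.
class StreamSourceDemo {

    // Runs an identity transform and returns the text of the root element.
    static String rootText(StreamSource source) {
        try {
            DOMResult result = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(source, result);
            return ((Document) result.getNode()).getDocumentElement().getTextContent();
        } catch (TransformerException e) {
            throw new IllegalStateException("Identity transform failed", e);
        }
    }

    // Byte stream: the parser reads the XML declaration and picks the encoding.
    static String viaStream(byte[] xml) {
        return rootText(new StreamSource(new ByteArrayInputStream(xml)));
    }

    // Character stream: the caller has already chosen the encoding, right or wrong.
    static String viaReader(byte[] xml, Charset assumed) {
        return rootText(new StreamSource(
                new InputStreamReader(new ByteArrayInputStream(xml), assumed)));
    }
}
```

With a UTF-8 document containing U+00E0, viaStream returns the character intact, while viaReader with ISO-8859-1 returns the two characters U+00C3 U+00A0, because the parser does not re-check the encoding declaration once it is handed characters.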

Looking at your patch, if the body of the style sheet is already initialized, as is usually the case, then odds are that we won't be using the right character encoding. As you pointed out, we need to fix the BpelCompiler.loadXsltSheet(URI) method, so that it is aware of XML encoding declarations. 

The trick is to auto-detect the XML character encoding as described in http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info. You might want to consider using org.apache.xmlbeans.impl.common.XmlEncodingSniffer or some helper class like that which does the grunt work for you.
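
As a rough idea of what such a helper does, below is a deliberately simplified sniffer. It is not the XmlEncodingSniffer API and covers far less than the W3C appendix: it assumes the document either starts with a UTF-8 or UTF-16 byte order mark, or is ASCII-compatible up to the end of its XML declaration. The class name SimpleXmlEncodingSniffer and the 200-byte look-ahead are arbitrary choices for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: BOM check, then the encoding pseudo-attribute, then UTF-8.
class SimpleXmlEncodingSniffer {

    // Matches encoding="..." inside an XML declaration at the very start.
    private static final Pattern DECLARED = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    static Charset sniff(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(b, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(b, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // No BOM: read the head byte-for-byte (ISO-8859-1 never fails to decode)
        // and look for a declared encoding.
        String head = new String(b, 0, Math.min(b.length, 200), StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        // XML default when nothing else is known.
        return StandardCharsets.UTF_8;
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

UTF-32, EBCDIC, and UTF-16 without a byte order mark are all ignored here, which is one reason to prefer an existing helper class over a hand-written routine.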

Regards,
Karthick




[jira] Updated: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

Posted by "Tammo van Lessen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ODE-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tammo van Lessen updated ODE-472:
---------------------------------

    Fix Version/s: 1.3.5



[jira] Updated: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

Posted by "Alexey Ousov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ODE-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Ousov updated ODE-472:
-----------------------------

    Attachment: ODE-472-quickfix.patch

Attached partial quick fix




[jira] Updated: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

Posted by "Alexey Ousov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ODE-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Ousov updated ODE-472:
-----------------------------

    Attachment: test1.par.zip

Simple test case. If the above patch is applied, the document() function works as expected.




[jira] Updated: (ODE-472) utf-8 encoding is handled incorrectly within xslt stylesheets

Posted by "Alexey Ousov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ODE-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Ousov updated ODE-472:
-----------------------------

    Attachment: ODE-472.patch

Added a partial fix for the XPath 1.0 and XPath 2.0 runtimes. The problem with XML documents in various encodings loaded via the document() function is fixed, but the problem with the XSLT stylesheet itself being in various encodings is not. The problem is in this function:
    private String loadXsltSheet(URI uri) {

        // TODO: lots of null returns, should have some better error messages.
        InputStream is;
        try {
            is = _resourceFinder.openResource(uri);
        } catch (Exception e1) {
            return null;
        }
        if (is == null)
            return null;

        try {
            return new String(StreamUtils.read(is));
        } catch (IOException e) {
            __log.debug("IO error", e);
            // todo: this should produce a message
            return null;
        } finally {
            try {
                is.close();
            } catch (Exception ex) {
                // No worries.
            }
        }
    }

As the documentation says, new String(StreamUtils.read(is)) "Constructs a new String by decoding the specified array of bytes using the platform's default charset.", so we need some way to find the encoding of the XSLT stylesheet. An XML parser finds the encoding automatically, so one way is to use an XML parser to load and save the XSLT stylesheet. Another way is to write a custom routine to identify the encoding of the XSLT stylesheet.
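
The first option (letting an XML parser load the sheet and serializing it back to a String) might look roughly like the sketch below. This is an assumption-laden illustration, not the ODE fix: XsltSheetLoader and loadAsString are hypothetical names, the raw bytes are assumed to have been read already, and an identity transform stands in for "load and save".

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustration only: bytes in, correctly decoded String out, via the XML parser.
class XsltSheetLoader {

    static String loadAsString(byte[] sheetBytes) {
        try {
            // With no stylesheet argument, newTransformer() is an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // The result is already a String of characters, so an encoding
            // declaration in it would be meaningless.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            // A byte stream source lets the parser honour the encoding declaration.
            identity.transform(new StreamSource(new ByteArrayInputStream(sheetBytes)),
                               new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not load stylesheet", e);
        }
    }
}
```

One trade-off is that the returned text is a re-serialization rather than a byte-for-byte copy, so details such as attribute quoting or the way entity references are written may differ from the original file.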

It would be preferable not to hold the sheet body as a string, but rather as a byte array, or not to hold it at all and load the XSLT directly from the file, but this would break compiled process compatibility with older versions.

