Posted to solr-dev@lucene.apache.org by "Age Jan Kuperus (JIRA)" <ji...@apache.org> on 2009/11/04 15:55:32 UTC

[jira] Issue Comment Edited: (SOLR-412) XsltWriter does not output UTF-8 by default

    [ https://issues.apache.org/jira/browse/SOLR-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773501#action_12773501 ] 

Age Jan Kuperus edited comment on SOLR-412 at 11/4/09 2:54 PM:
---------------------------------------------------------------

IMHO the XSLT 1.0 specification (http://www.w3.org/TR/xslt#output) is a bit clearer on the usage of these attributes:

"The method attribute on xsl:output identifies the overall method that should be used for outputting the result tree. The value must be a QName. If the QName does not have a prefix, then it identifies a method specified in this document and must be one of xml, html or text."

"encoding specifies the preferred character encoding that the XSLT processor should use to encode sequences of characters as sequences of bytes; the value of the attribute should be treated case-insensitively; the value must contain only characters in the range #x21 to #x7E (i.e. printable ASCII characters); the value should either be a charset registered with the Internet Assigned Numbers Authority [IANA], [RFC2278] or start with X-"

"media-type specifies the media type (MIME content type) of the data that results from outputting the result tree; the charset parameter should not be specified explicitly; instead, when the top-level media type is text, a charset parameter should be added according to the character encoding actually used by the output method"

If I understand this correctly, the correct output specification is <xsl:output method="xml" encoding="utf-8" />, and a charset parameter should never be written into media-type, as in <xsl:output media-type="text/xml; charset=UTF-8"/>.
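As a rough illustration of where these three attributes end up on the Java side: the standard JAXP API exposes them as output properties (OutputKeys.METHOD, OutputKeys.ENCODING, OutputKeys.MEDIA_TYPE) on a compiled stylesheet. The sketch below is not Solr code; the class name OutputPropertiesSketch and the inline stylesheet are made up for the example, and values for attributes the stylesheet leaves out may be processor defaults rather than null.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper for illustration only; not part of Solr. */
final class OutputPropertiesSketch {

    /** Minimal stylesheet using the xsl:output form recommended above. */
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" encoding=\"utf-8\"/>"
      + "<xsl:template match=\"/\"><out/></xsl:template>"
      + "</xsl:stylesheet>";

    private OutputPropertiesSketch() {}

    /** Compiles the stylesheet and returns its effective xsl:output properties. */
    static Properties outputProperties(String xslt) {
        try {
            Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
            return templates.getOutputProperties();
        } catch (TransformerConfigurationException e) {
            // Rethrow unchecked so callers of this sketch stay simple.
            throw new IllegalStateException("Could not compile stylesheet", e);
        }
    }

    public static void main(String[] args) {
        Properties props = outputProperties(STYLESHEET);
        // media-type is not set in the stylesheet, so whatever is printed
        // for it comes from the XSLT processor, not from the author.
        System.out.println("method     = " + props.getProperty(OutputKeys.METHOD));
        System.out.println("encoding   = " + props.getProperty(OutputKeys.ENCODING));
        System.out.println("media-type = " + props.getProperty(OutputKeys.MEDIA_TYPE));
    }
}
```

A response writer could read the same three properties to build its Content-Type header instead of relying on the container default.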

My suggestion would be to change XSLTResponseWriter.getContentType() in such a way that (in pseudocode):
if encoding is null
..  encoding = "utf-8"
end if
if media-type is not null
..  /* next if is for compatibility with the workaround only */
..  if media-type contains "charset="
....    return media-type
..  else
....    return media-type + "; charset=" + encoding
..  end if
else
..  if method is "html" or the first element in the final output is <html>
....    media-type = "text/html"
..  elseif method is "text"
....    media-type = "text/plain"
..  else /* it must be xml */
....    media-type = "text/xml"
..  end if
..  return media-type + "; charset=" + encoding
end if
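The same logic as a minimal Java sketch. This is not the actual XSLTResponseWriter code: the class ContentTypeSketch and its contentType method are hypothetical, and the three parameters are assumed to hold the raw xsl:output values (null when the stylesheet does not set them). The "first element in the final output is <html>" check is left out, because it needs the transformed output rather than just the stylesheet settings.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not Solr's implementation. */
final class ContentTypeSketch {

    private ContentTypeSketch() {}

    /**
     * Builds a Content-Type header value from xsl:output settings.
     *
     * @param method    xsl:output method ("xml", "html" or "text"), or null
     * @param encoding  xsl:output encoding, or null
     * @param mediaType xsl:output media-type, or null
     */
    static String contentType(String method, String encoding, String mediaType) {
        // Default to UTF-8 when the stylesheet does not name an encoding.
        if (encoding == null) {
            encoding = "utf-8";
        }
        if (mediaType != null) {
            // Compatibility with the workaround only: a media-type that
            // already carries a charset parameter is passed through as-is.
            if (mediaType.toLowerCase(Locale.ROOT).contains("charset=")) {
                return mediaType;
            }
            return mediaType + "; charset=" + encoding;
        }
        // No media-type given: derive it from the output method.
        if ("html".equals(method)) {
            mediaType = "text/html";
        } else if ("text".equals(method)) {
            mediaType = "text/plain";
        } else {
            mediaType = "text/xml"; // it must be xml
        }
        return mediaType + "; charset=" + encoding;
    }

    public static void main(String[] args) {
        // The stylesheet from the issue description: method and encoding only.
        System.out.println(contentType("xml", "utf-8", null));
        // The workaround stylesheet: charset already inside media-type.
        System.out.println(contentType(null, "UTF-8", "text/xml; charset=UTF-8"));
    }
}
```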

> XsltWriter does not output UTF-8 by default
> -------------------------------------------
>
>                 Key: SOLR-412
>                 URL: https://issues.apache.org/jira/browse/SOLR-412
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 1.2
>         Environment: Tomcat 5.5
> Linux Red Hat ES4  (2.6.9-5.ELsmp from 'uname -a')
>            Reporter: Lance Norskog
>
> XsltWriter outputs XML text in ISO-8859-1 encoding by default.
> Tomcat 5.5 has URIEncoding="UTF-8" set in the <Connector> element as described in the Wiki.
> This output description in the XML: 
> <xsl:output method="xml" encoding="utf-8" />
> gives output with this header:
> HTTP/1.1 200 OK
> Server: Apache-Coyote/1.1
> Content-Type: text/xml;charset=ISO-8859-1
> Transfer-Encoding: chunked
> Date: Wed, 14 Nov 2007 17:49:11 GMT
> I had to change the <xsl:output> directive to this:
>  <xsl:output media-type="text/xml; charset=UTF-8" encoding="UTF-8"/>
> This is the root cause of SOLR-233.
