Posted to users@cocoon.apache.org by Peter Flynn <pf...@ucc.ie> on 2010/12/17 16:06:37 UTC

Encoding

I restored the Xalan settings after (failing to) add Saxon by copying
Emacs' ~ backup copies of cocoon.xconf and sitemap.xmap, but now
suddenly there are Unicode replacement characters (U+FFFD) appearing for
accents in pages which were working before.

The data is taken from a feed from an Oracle Application Server giving an
HTML <table> fragment, e.g.
http://rss.ucc.ie/live/w_rms_profile_list.show?p_school_id=A005
which dog and wget identify in the headers as
Content-Type: text/html; charset=WINDOWS-1252
(yes, I know, yuck...not my server)

[That URI may not be accessible off-campus]

This is processed by a pipeline to ensure it is XML:

<map:match pattern="people-in-schools/*">
  <map:generate type="html"
  src="http://rss.ucc.ie/dev/w_rms_profile_list.show?p_school_id={1}"/>
  <map:serialize type="xml"/>
</map:match>

so that
http://publish.ucc.ie/researchprofiles/people-in-schools/A005
produces XML I can consume in my XSLT. However, this is appearing as:

<?xml version="1.0" encoding="ISO-8859-1"?><html...etc

despite the fact that the sitemap.xmap says very clearly:

<map:serializer logger="sitemap.serializer.xml"
	mime-type="application/xml" name="xml"
	src="org.apache.cocoon.serialization.XMLSerializer">
    <encoding>UTF-8</encoding>
</map:serializer>

The result is that the output at
http://publish.ucc.ie/researchprofiles/A005
has Unicode replacement characters instead of accents.

I thought it should enforce translation to UTF-8 but obviously I have
missed something....but what?
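
For readers following the thread, here is a minimal sketch of one common way U+FFFD turns up: bytes written in a single-byte encoding (Windows-1252 here, as declared by the feed) being decoded as UTF-8. It is plain Python, not Cocoon code, the sample name is made up, and it is not claimed to be the exact path the bytes took in this pipeline.

```python
# Hypothetical sample text; any accented name would show the same effect.
text = "Séamus Ó Néill"

# Bytes as a server declaring charset=WINDOWS-1252 would send them.
raw = text.encode("cp1252")

# Decoding with the declared charset round-trips cleanly.
good = raw.decode("cp1252")

# Decoding the same bytes as UTF-8 fails on each accented byte;
# errors="replace" substitutes U+FFFD for every invalid byte.
bad = raw.decode("utf-8", errors="replace")

print(good == text)
print(bad.count("\ufffd"), "replacement characters")
```

The accented letters are single bytes in Windows-1252 (0xE9, 0xD3) that are not valid on their own in UTF-8, so each one is replaced rather than translated.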

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Encoding

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Laurent,

On 12/17/2010 11:25 AM, Laurent Medioni wrote:
> Have a look at http://wiki.apache.org/tomcat/FAQ/CharacterEncoding I
> think the comment you refer to tries to say that if no
> charset/encoding is set when producing a response then assume the
> ISO-8859-1 default value (do not ask why ;) ).

That wiki page explains why. I know because I wrote it :) It's all in
the servlet and HTTP specifications.

There's actually an open issue in Tomcat that proposes to switch the
default request body encoding /and/ URI encoding to UTF-8. Comments welcome:
https://issues.apache.org/bugzilla/show_bug.cgi?id=48550

-chris



RE: Encoding

Posted by Laurent Medioni <lm...@odyssey-group.com>.
Setting container-encoding to UTF-8 enables you to share your servlet container (keeping its Latin1 default) with other applications not supporting UTF-8 (and not fiddling with encodings...), if relevant (we had the case...).

Have a look at http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
I think the comment you refer to tries to say that if no charset/encoding is set when producing a response then assume the ISO-8859-1 default value (do not ask why ;) ).
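
To make that fallback concrete, a minimal sketch in Python (not Tomcat or Cocoon code): it reads the charset parameter from a Content-Type value and falls back to ISO-8859-1 when none is declared. The function name and the fallback constant are made up for the illustration.

```python
from email.message import Message

# Default named in this thread for responses that declare no charset.
FALLBACK_CHARSET = "iso-8859-1"


def effective_charset(content_type, fallback=FALLBACK_CHARSET):
    """Return the charset declared in a Content-Type value, else the fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the fallback when no charset is declared.
    return msg.get_content_charset(fallback).lower()


for header in ("text/html", "text/html; charset=UTF-8"):
    charset = effective_charset(header)
    # The same character becomes different bytes under each charset.
    print(header, "->", charset, "é".encode(charset))
```

The point is only that an undeclared charset does not mean "no encoding": some default is applied, and the bytes for an accented character differ between that default and UTF-8.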

Alternatively try to do the equivalent of response.setContentType("text/html; charset=UTF-8") in your XSL (<xsl:output encoding='utf-8'/> ? sorry from memory, not an XSL specialist...), then you won't get the default encoding back.

Laurent
 

-----Original Message-----
From: Peter Flynn [mailto:pflynn@ucc.ie] 
Sent: Friday, 17 December 2010 17:06
To: users@cocoon.apache.org
Subject: Re: Encoding

On 17/12/10 15:37, Laurent Medioni wrote:
> What is your 
> <init-param>
>       <param-name>container-encoding</param-name>
>       <param-value>UTF-8</param-value>
> </init-param>
> In web.xml ?

Interesting. ISO-8859-1, because

<!--
      Set encoding used by the container. If not set the ISO-8859-1 encoding
      will be assumed.
      Since the servlet specification requires that the ISO-8859-1 encoding
      is used (by default), you should never change this value unless
      you have a buggy servlet container.
    -->

I wouldn't call Tomcat buggy, exactly, but the servlet spec made a poor
choice in making ISO-8859-1 the default, given that the rest of the
planet is going down the UTF-{8|16|32|64} road :-)

Certainly fixes the problem though...very many thanks.

///Peter



____________________________________________________________

• This email and any files transmitted with it are CONFIDENTIAL and intended
  solely for the use of the individual or entity to which they are addressed.
• Any unauthorized copying, disclosure, or distribution of the material within
  this email is strictly forbidden.
• Any views or opinions presented within this e-mail are solely those of the
  author and do not necessarily represent those of Odyssey Financial
  Technologies SA unless otherwise specifically stated.
• An electronic message is not binding on its sender. Any message referring to
  a binding engagement must be confirmed in writing and duly signed.
• If you have received this email in error, please notify the sender immediately
  and delete the original.

Re: Encoding

Posted by Peter Flynn <pf...@ucc.ie>.
On 17/12/10 15:37, Laurent Medioni wrote:
> What is your 
> <init-param>
>       <param-name>container-encoding</param-name>
>       <param-value>UTF-8</param-value>
> </init-param>
> In web.xml ?

Interesting. ISO-8859-1, because

<!--
      Set encoding used by the container. If not set the ISO-8859-1 encoding
      will be assumed.
      Since the servlet specification requires that the ISO-8859-1 encoding
      is used (by default), you should never change this value unless
      you have a buggy servlet container.
    -->

I wouldn't call Tomcat buggy, exactly, but the servlet spec made a poor
choice in making ISO-8859-1 the default, given that the rest of the
planet is going down the UTF-{8|16|32|64} road :-)

Certainly fixes the problem though...very many thanks.

///Peter



RE: Encoding

Posted by Laurent Medioni <lm...@odyssey-group.com>.
What is your 
<init-param>
      <param-name>container-encoding</param-name>
      <param-value>UTF-8</param-value>
</init-param>
In web.xml ?


Re: Encoding

Posted by Peter Flynn <pf...@ucc.ie>.
On 17/12/10 15:06, Peter Flynn wrote:
[...]
> The result is that the output at
> http://publish.ucc.ie/researchprofiles/A005
> has Unicode replacement characters instead of accents.

Curiouser and curiouser, that page serves as UTF-8 but lower down it says:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
         "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-ie">
   <head xmlns="" xmlns:h="http://www.w3.org/1999/xhtml">
      <meta http-equiv="Content-Type"
            content="text/html; charset=ISO-8859-1">
      <!--School: A005; Researcher-in-School: ; Real School: -->
      <meta content="no-cache" http-equiv="Pragma">

That is generated by

  <xsl:template match="h:head">
    <head>
    <xsl:comment>
      <xsl:text>School: </xsl:text>
      <xsl:value-of select="$school"/>
      <xsl:text>; Researcher-in-School: </xsl:text>
      <xsl:value-of select="$researcher-in-school"/>
      <xsl:text>; Real School: </xsl:text>
      <xsl:value-of select="$real-school"/>
    </xsl:comment>
    <meta http-equiv="Pragma" content="no-cache"/>

so WTF is that
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
coming from? Is Cocoon sticking it in by itself? The page template which
I take for the framework is
http://www.ucc.ie/en/old-design-base/
and that says quite clearly
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Something, somewhere is sticking a bogus encoding in the works.
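
A quick way to spot this kind of disagreement is to compare the charset in the HTTP header with the one in the page's <meta> element. The following is only a sketch in Python using the standard library; the helper names are invented for the illustration and it says nothing about which component inserted the <meta>.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Records the charset of the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attributes.get("content") or "")


def declared_charsets(header_value, html_text):
    """Return (charset from the HTTP header, charset from the <meta> element)."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return charset_from_content_type(header_value), finder.charset


page = """<html xmlns="http://www.w3.org/1999/xhtml" lang="en-ie">
   <head>
      <meta http-equiv="Content-Type"
            content="text/html; charset=ISO-8859-1">
      <meta content="no-cache" http-equiv="Pragma">
   </head>
</html>"""

header_charset, meta_charset = declared_charsets("text/html; charset=UTF-8", page)
print(header_charset, meta_charset, header_charset == meta_charset)
```

Fed the header and the <head> fragment quoted above, the two values differ, which is the mismatch described in this message.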

///Peter
