Posted to users@cocoon.apache.org by ". ." <sv...@hotmail.com> on 2010/09/28 16:09:08 UTC

System upgrade and now Cocoon is escaping tabs/entities.

Hallo,

We've come across a really annoying problem since a server upgrade.

We have a web application based on Cocoon 2.1.6 and Tomcat 5.0.x which has been working fine for years. Recently we have been having some problems with the physical hardware in our servers, so we decided to migrate to virtual servers and upgrade some bits and pieces along the way.

Our original application components were:

NetBSD 3.0.3 with Suse 9.x Linux compatibility layer.
Sun JDK 1.4.26
Tomcat 5.0.23
Cocoon 2.1.6

As part of the upgrade we switched to:

Centos 5.3
Sun JDK 1.6.21
Tomcat 5.0.30
Cocoon 2.1.6

We retained all the original configs and Jars/files for Cocoon and things are running well except for two problems.

Firstly, if any of our source XML/XSL files use tabs to indent the nodes, the output now escapes them as &#A9;, which it didn't do before. This isn't a problem for output displayed in a browser, but we have a number of legacy Flash components which, annoyingly, don't recognise this as whitespace and refuse to load.
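One way to reproduce this outside Cocoon is to push a small document containing tabs through whatever JAXP transformer the JVM resolves and inspect the result. The following is only a sketch: the class and method names are made up for illustration, and it assumes the serializer in question ultimately goes through the same JAXP/Xalan serialization code, which may not hold for every setup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical probe class, not part of Cocoon.
class TabEscapeProbe {
    // Runs an identity transform with whichever TransformerFactory is on the classpath.
    static String serialize(String xml) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("identity transform failed", e);
        }
    }

    public static void main(String[] args) {
        String result = serialize("<root>\n\t<item>a\tb</item>\n</root>");
        // Report whether the tab characters came back as character references.
        System.out.println(result.contains("&#9;")
                ? "tabs were escaped as &#9;"
                : "tabs were written literally");
    }
}
```

Running it under each JDK, with and without the webapp's Xalan jars on the classpath, should show which combination introduces the escaping.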

Secondly, we have a version of our site using Cyrillic characters, and sadly this was not developed using UTF-8 (I don't know why). We're using some butchered hack to use the windows-1251 character set. What we are getting now is the error:

"org.xml.sax.SAXException: Attempt to output character of integral value 
1057 that is not represented in specified output encoding of 
windows-1251."
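For reference, the value 1057 is U+0421 (Cyrillic capital letter ES), which windows-1251 does contain, so the message hints that the serializer does not recognise the encoding rather than that the character is truly unmappable. A minimal sketch to check what the JVM's own charset support says; the class name is hypothetical, and it assumes the JVM ships a windows-1251 charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, not part of Cocoon or Xalan.
class Cp1251Check {
    // True if the JVM's windows-1251 encoder can represent the given code point.
    static boolean representable(int codePoint) {
        CharsetEncoder encoder = Charset.forName("windows-1251").newEncoder();
        return encoder.canEncode(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int reported = 1057; // the integral value quoted in the SAXException
        System.out.println("U+0" + Integer.toHexString(reported).toUpperCase()
                + " representable in windows-1251: " + representable(reported));
    }
}
```

If this reports true while the serializer still refuses the character, the serializer's own encoding tables are the more likely culprit.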

I have a theory that the two problems are related and we're keen to try and get the system working the way it was. If we can solve the whitespace/tab escaping that's 80% of the battle though.

The nearest info I've found to the tab escaping problem said to check what XML serializer we're using and it's "org.apache.cocoon.serialization.XMLSerializer" as defined in sitemap.xmap which seems to be the preferred version.

At this point I'm stumped as to what part of our "upgrade" would have caused our output to suddenly start escaping whitespace.

Any ideas?

- J

RE: System upgrade and now Cocoon is escaping tabs/entities.

Posted by ". ." <sv...@hotmail.com>.
Chris,

So it turned out updating Xalan fixed the problem completely.

We went with Xalan 2.7.1 (which has Xerces 2.9.0 included).

We replaced 'xercesImpl.jar' and 'xml-apis.jar' in Tomcat's endorsed folder, replaced 'xalan-2.6.1-dev-20041008T0304.jar' with 'xalan.jar' from 2.7.1, and added 'serializer.jar', both in our lib folder.

Restarted Tomcat and the problem went away and nothing else on the site was affected. In fact, it seems a little faster now. :)
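After swapping jars like this, it can be worth confirming which XSLT implementation the JVM actually resolves and where it was loaded from. A rough sketch, with a made-up class name; the location may be absent or look different when the class comes from the JDK itself rather than a jar:

```java
import java.security.CodeSource;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic class for illustration only.
class XsltImplReport {
    // Describes a class and the jar or directory it was loaded from, if known.
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "(no code source; typically the JDK's built-in copy)"
                : source.getLocation().toString();
        return type.getName() + " loaded from " + location;
    }

    public static void main(String[] args) {
        // The factory class reveals whether Xalan or the JDK's internal copy is in use.
        System.out.println(describe(TransformerFactory.newInstance().getClass()));
    }
}
```

Inside a webapp the same two lines could be dropped into a JSP, since the result depends on the classloader that runs them.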

So now we're running fine on CentOS 5, JDK 1.6.21 and Tomcat 5.0.28.

- J





> Date: Wed, 29 Sep 2010 09:41:55 -0400
> From: chris@christopherschultz.net
> To: users@cocoon.apache.org
> Subject: Re: System upgrade and now Cocoon is escaping tabs/entities.
> 
> [snip]

Re: System upgrade and now Cocoon is escaping tabs/entities.

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

J,

On 9/29/2010 1:10 AM, . . wrote:
>> &#a9 should be a copyright symbol if you're using ASCII.
>>
>> I suspect that &#a9 is being used instead of a newline (0xa) followed by
>> a tab (0x9).
> 
> Actually it was a typo on my part. It's using &#9; :( *oops*

Yeah, that makes a ton of difference. I'm glad it wasn't 0xa9, 'cause
that would have been a real mess. :)

>> [file.encoding] is likely to solve both of your problems.
> 
> I wrote a little JSP page to spit out the
> System.getProperty("file.encoding") value and got some surprising
> results. I tried two of the existing machines and got ISO-8859-1 for one
> and ANSI_X3.4-1968 for the other.

ANSI_X3.4-1968, as you probably found out, is essentially basic ASCII,
and ISO-8859-1 is ASCII plus a few other things, so they are compatible.
It's not surprising that these two character sets are both working: if
one works, the other has a good chance of working.
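The compatibility described here can be illustrated with a short sketch (the class name is hypothetical): text limited to ASCII characters encodes to identical bytes in both charsets, and the two only diverge once a character above 0x7F appears.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class AsciiLatin1Compat {
    // True if the text encodes to the same bytes in US-ASCII and ISO-8859-1.
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.US_ASCII),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Pure ASCII markup, including a tab, is byte-identical in both charsets.
        System.out.println(sameBytes("<item>a\tb</item>"));
        // U+00E9 exists in ISO-8859-1 but not in US-ASCII, so the bytes differ.
        System.out.println(sameBytes("caf\u00e9"));
    }
}
```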

> The application runs fine on both of them. On the new server that too
> is giving out ISO-8859-1.

Interesting.

> That said, we did an experiment last night and copied the entire
> previous Tomcat folder over to the new CentOS server and ran it with Sun
> JDK 1.4.29 - the problem disappeared. When we ran it with JDK 1.5 or 1.6
> the problem manifested itself.
> 
> So the problem appears to be related to the JDK in some way. Googling I
> came up with this:
> 
> http://stackoverflow.com/questions/1059854/how-do-you-prevent-a-javax-transformer-from-escaping-whitespace
> 
> Which makes me wonder if the old Xalan from our previous Tomcat is
> having issues with JDK 1.5 and up. I guess an Xalan upgrade is in order.

Cocoon packages its own Xalan library, so that shouldn't be the
problem, although I can't remember when Sun started packaging Xalan with
Java. At some point, I think they even removed it. What version of Xalan
are you running? It should be in your webapp's WEB-INF/lib directory. I
don't think there's been a Xalan update in quite a few years.

Let us know how things turn out.

>> NB: Tomcat 5.0 has been retired and really should be replaced. Upgrading
>> to Tomcat 6.0 shouldn't be too much trouble.
> 
> Only issue there is we have to support this legacy application for
> another 12 months and it's a "hand me down" so we have little or no
> source code or documentation. Porting it now would take up more
> time/effort than is financially viable right now :(

Technically speaking, servlet containers are supposed to be backward
compatible. I wouldn't be surprised if, given a review of your <Context>
element for Tomcat (it should go into META-INF/context.xml, now in your
webapp, instead of in conf/server.xml for the server), everything else
works exactly as it did before.

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkyjQiMACgkQ9CaO5/Lv0PBtOACeKG7EgdIqh+vDNND8wFKAtGHM
N08AnjBBlR2cvmgIu1BfIDy79bMSAs7Q
=h7CA
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


RE: System upgrade and now Cocoon is escaping tabs/entities.

Posted by ". ." <sv...@hotmail.com>.
> &#a9 should be a copyright symbol if you're using ASCII.
> 
> I suspect that &#a9 is being used instead of a newline (0xa) followed by
> a tab (0x9).

Actually it was a typo on my part. It's using &#9; :( *oops*

> My guess is that your JVM's file.encoding system property used to be
> something like ISO-8859-1 or UTF-8 and now it's been changed to
> something that is more exotic, perhaps even mandating 16-bit characters
> (though your pages would be horribly jumbled if everything were
> interpreted as 16-bit characters).
> 
> Check the file.encoding of your JVM in the old, working system relative
> to the new, broken one. Also, check to make sure that your XML files
> have the "encoding" set in the <?xml?> processing instruction, and that
> the encoding actually matches what you used when you wrote the file to
> the disk. Finally, check to see if you have BOM characters at the start
> of your XML files.
> 
> This is likely to solve both of your problems.

I wrote a little JSP page to spit out the System.getProperty("file.encoding") value and got some surprising results. I tried two of the existing machines and got ISO-8859-1 for one and ANSI_X3.4-1968 for the other. The application runs fine on both of them. On the new server that too is giving out ISO-8859-1.
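The same check can be done from a plain Java class instead of a JSP. This is just a sketch with a made-up class name; it also prints Charset.defaultCharset(), which is what most of the JDK's I/O classes actually use and may be worth comparing against the property:

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for illustration only.
class EncodingReport {
    // Summarises the file.encoding property next to the JVM's effective default charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```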

That said, we did an experiment last night and copied the entire previous Tomcat folder over to the new CentOS server and ran it with Sun JDK 1.4.29 - the problem disappeared. When we ran it with JDK 1.5 or 1.6 the problem manifested itself.

So the problem appears to be related to the JDK in some way. Googling I came up with this:

http://stackoverflow.com/questions/1059854/how-do-you-prevent-a-javax-transformer-from-escaping-whitespace

Which makes me wonder if the old Xalan from our previous Tomcat is having issues with JDK 1.5 and up. I guess a Xalan upgrade is in order.

> NB: Tomcat 5.0 has been retired and really should be replaced. Upgrading
> to Tomcat 6.0 shouldn't be too much trouble.

Only issue there is we have to support this legacy application for another 12 months and it's a "hand me down" so we have little or no source code or documentation. Porting it now would take up more time/effort than is financially viable right now :(

- J

Re: System upgrade and now Cocoon is escaping tabs/entities.

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

J,

On 9/28/2010 10:09 AM, . . wrote:
> Our original application components were:
> 
> NetBSD 3.0.3 with Suse 9.x Linux compatibility layer.
> Sun JDK 1.4.26
> Tomcat 5.0.23
> Cocoon 2.1.6
> 
> As part of the upgrade we switched to:
> 
> Centos 5.3
> Sun JDK 1.6.21
> Tomcat 5.0.30
> Cocoon 2.1.6

[snip]

> Firstly, if any of our source XML/XSL files use tabs to indent the
> nodes, the outputted source escapes them as &#A9; which it didn't do
> before. This isn't a problem for output to be displayed in a browser but
> we have a number of legacy Flash components which, annoyingly, don't
> recognise this as whitespace and refuse to load, causing the Flash
> component to fail.

&#xa9; would be a copyright symbol if you're using ISO-8859-1 (it's outside plain ASCII).

I suspect that &#a9 is being used instead of a newline (0xa) followed by
a tab (0x9).

My guess is that your JVM's file.encoding system property used to be
something like ISO-8859-1 or UTF-8 and now it's been changed to
something that is more exotic, perhaps even mandating 16-bit characters
(though your pages would be horribly jumbled if everything were
interpreted as 16-bit characters).

Check the file.encoding of your JVM in the old, working system relative
to the new, broken one. Also, check to make sure that your XML files
have the "encoding" set in the <?xml?> declaration, and that
the encoding actually matches what you used when you wrote the file to
the disk. Finally, check to see if you have BOM characters at the start
of your XML files.
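The BOM part of that check can be sketched as follows. The class and method names are made up for illustration, and only the UTF-8 and UTF-16 byte order marks are covered:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only.
class BomSniffer {
    // Returns the encoding implied by a leading BOM, or null if none is present.
    // Note: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE here.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BomSniffer <file.xml>");
            return;
        }
        String bom = detectBom(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(bom == null ? "no BOM found" : "BOM indicates " + bom);
    }
}
```

Comparing the result with the encoding named in each file's declaration would show any mismatch between the two.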

This is likely to solve both of your problems.

NB: Tomcat 5.0 has been retired and really should be replaced. Upgrading
to Tomcat 6.0 shouldn't be too much trouble.

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkyiNykACgkQ9CaO5/Lv0PD5xgCbBS0jEpDVsd5z9OA3vwlkOqKr
WNoAoLLZfRUNW+Dbx/UiGyyOXLtdV2y9
=RGqP
-----END PGP SIGNATURE-----
