Posted to dev@cocoon.apache.org by Jeremy Quinn <je...@media.demon.co.uk> on 2003/03/03 19:05:33 UTC

form encoding problems

Hi All,

This is possibly a trivial mistake ... but I never came across it 
before.

I have a search form for searching Lucene. Mozilla confirms the page is 
in UTF-8 encoding.

I enter a string with accented characters into the query field. eg 
'éclair' (e-acute).

The form comes back with the string now reading 'Ã©clair'. (A-tilde, 
Copyright sign). Mozilla says the encoding is still UTF-8. (The value 
has been picked up by an InputModule and fed via the SiteMap to XSLT).

The query string in the URL reads 'query=%C3%A9clair', which are the 
unicodes for 'A-tilde' and 'Copyright' characters. (Which would imply 
to me that the Browser incorrectly encoded the query.)

This makes me feel like I have done something really dumb, but I cannot 
work out what ;)

Incidentally, the search form in the Cocoon Samples does exactly the 
same thing!!

Any suggestions?

regards Jeremy
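
[Editor's sketch] The symptom described above can be reproduced outside
Cocoon. The bytes C3 A9 are the UTF-8 encoding of e-acute, so the browser's
'query=%C3%A9clair' is a correct UTF-8 submission; the A-tilde and Copyright
sign only appear when those two bytes are decoded as ISO-8859-1. The class
and method names below are illustrative assumptions, not Cocoon code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative only: shows how one percent-encoded value decodes
// differently depending on the charset assumed by the receiver.
class QueryDecodingDemo {

    // Percent-decode a query value, interpreting the bytes in the given charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C3%A9clair";

        // Decoded as UTF-8: the two bytes form a single character (e-acute).
        String asUtf8 = decode(raw, "UTF-8");
        // Decoded as ISO-8859-1: each byte becomes its own character.
        String asLatin1 = decode(raw, "ISO-8859-1");

        System.out.println(asUtf8.length());   // prints 6
        System.out.println(asLatin1.length()); // prints 7
        System.out.println(asLatin1.equals("\u00c3\u00a9clair")); // prints true
    }
}
```

Unicode escapes are used instead of literal accented characters so that the
result does not depend on the source file or console encoding.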

Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Monday, March 3, 2003, at 06:57 PM, Konstantin Piroumian wrote:

> Take a look at i18n samples, particularly the XSP page
> (/samples/i18n/simple.xsp) and try to enter something like that in the 
> input
> box there, then submit and see the 'Hello Tomcat' paragraph ('Tomcat' 
> should
> be replaced by the entered string). If this works correctly then take 
> a look
> at the source code of simple.xsp - maybe that's what you are looking 
> for.
>

OK, your XSP is using the @form-encoding="UTF-8" attribute in the 
<xsp-request:get-parameter/> tag.

I am retrieving request parameters using InputModules.

I wonder what the Request InputModules do about form encoding?

regards Jeremy


Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Monday, March 3, 2003, at 06:57 PM, Konstantin Piroumian wrote:

> Hi!
>
> Take a look at i18n samples, particularly the XSP page
> (/samples/i18n/simple.xsp) and try to enter something like that in the 
> input
> box there, then submit and see the 'Hello Tomcat' paragraph ('Tomcat' 
> should
> be replaced by the entered string). If this works correctly then take 
> a look
> at the source code of simple.xsp - maybe that's what you are looking 
> for.

OK, that's a lead.
The browser is still doing its funny encoding, but you are handling it 
properly somehow, and the string gets passed through correctly.

Many thanks, I've got something to examine!
I'll see if I can fix the search sample when I work out what it is.

regards Jeremy


Re: form encoding problems

Posted by Konstantin Piroumian <kp...@apache.org>.
Hi!

Take a look at i18n samples, particularly the XSP page
(/samples/i18n/simple.xsp) and try to enter something like that in the input
box there, then submit and see the 'Hello Tomcat' paragraph ('Tomcat' should
be replaced by the entered string). If this works correctly then take a look
at the source code of simple.xsp - maybe that's what you are looking for.

I've checked the i18n samples in IE 5+ and everything was fine: even Chinese
and Japanese characters were copied/pasted/submitted/displayed correctly.

--
  Konstantin




Re: form encoding problems

Posted by Bruno Dumon <br...@outerthought.org>.
On Wed, 2003-03-12 at 17:52, Jeremy Quinn wrote:
[...]
> As far as I can tell, yes it did solve it.
> I was making only one change at a time, after this one, it worked ;)
> 
> > I'm wondering how the SetCharacterEncodingAction could actually do
> > anything useful. According to the servlet spec (I'm looking at version
> > 2.3), the request.setCharacterEncoding method only does something if
> > called before any data is read from the request.
> >
> 
> Interesting
> 
> > Since Cocoon itself reads parameters from the request (such as
> > cocoon-reload) before any action is executed, this action obviously
> > cannot do anything useful?
> >
> 
> Hmmm
> 

I just found out that actually it can work -- the setCharacterEncoding
method in Cocoon's request object doesn't correspond to the servlet
spec's setCharacterEncoding method but causes some Cocoon-specific
decoding/encoding trick to happen.
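
[Editor's sketch] One plausible reading of the "decoding/encoding trick" is
a re-decode: turn the string the container produced back into bytes using
the container's encoding, then decode those bytes with the form encoding.
The helper below is a hypothetical illustration under that assumption; it
is not taken from Cocoon's source, and the names are invented.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not Cocoon's actual implementation.
class RecodeDemo {

    // Undo the container's decoding and redo it with the form encoding.
    static String recode(String value, String containerEncoding, String formEncoding) {
        if (value == null || containerEncoding.equalsIgnoreCase(formEncoding)) {
            return value; // nothing to repair
        }
        try {
            // Recover the original bytes as sent by the browser...
            byte[] raw = value.getBytes(containerEncoding);
            // ...and decode them the way the form actually encoded them.
            return new String(raw, formEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // What the container hands over after decoding UTF-8 bytes as ISO-8859-1.
        String fromContainer = "\u00c3\u00a9clair";
        String fixed = recode(fromContainer, "ISO-8859-1", "UTF-8");
        System.out.println(fixed.equals("\u00e9clair")); // prints true
    }
}
```

The round trip is only lossless because ISO-8859-1 maps every byte value to
exactly one character, which would explain why the container side is
expected to be ISO-8859-1.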

> > Wouldn't it be better if we logged a big warning in this action 
> > pointing
> > to the container/form-encoding parameters in the web.xml (and the same
> > in its javadoc)?
> >
> 
> Yes, this is a better technique.
> I had an idea there may be a configuration here, but could not find an 
> example of it.
> 

Since I now found out about the above, this action actually does
something useful, though I'm not sure if it's good to promote this if
we'd ever like to migrate to the servlet spec's setCharacterEncoding
method.

[...]
> >  * in the web.xml: set container-encoding to ISO-8859-1 (don't know why
> > it's configurable because it should be ISO-8859-1 per spec), and set
> > form-encoding to the same encoding of the serializer.
> 
> Let's put the config in; it will make it easier for others to see what to 
> change. We work exclusively in UTF-8 for instance.

+1

Since the serializers are using UTF-8 by default, it's only logical that
the decoding also uses UTF-8 by default.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org


Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Tuesday, March 11, 2003, at 01:34 PM, Bruno Dumon wrote:

> On Mon, 2003-03-03 at 23:00, Jeremy Quinn wrote:
> [...]
>> I have got it. This was answered on the users list a while back, sorry
>> guys.
>>
>> Answer, use the SetCharacterEncodingAction in the Pipeline.
>>
>> Works with InputModules too.
>
> (a bit late to jump into this thread)
>
> Are you sure that it was adding this action that solved your problem?
>

As far as I can tell, yes it did solve it.
I was making only one change at a time, after this one, it worked ;)

> I'm wondering how the SetCharacterEncodingAction could actually do
> anything useful. According to the servlet spec (I'm looking at version
> 2.3), the request.setCharacterEncoding method only does something if
> called before any data is read from the request.
>

Interesting

> Since Cocoon itself reads parameters from the request (such as
> cocoon-reload) before any action is executed, this action obviously
> cannot do anything useful?
>

Hmmm

> Wouldn't it be better if we logged a big warning in this action 
> pointing
> to the container/form-encoding parameters in the web.xml (and the same
> in its javadoc)?
>

Yes, this is a better technique.
I had an idea there may be a configuration here, but could not find an 
example of it.

> As for the correct way to do things, this is what I understand from it:
>
>  * set the encoding of the HTML serializer to the encoding you'd like 
> to
> use
>

Yep

>  * make sure a <head> element exists in the html you generate, so that
> the serializer can add a <meta ... tag into it (from Stefano's
> experiments, this does not seem to work with the xhtml serializer)
>

Good to know

>  * in the web.xml: set container-encoding to ISO-8859-1 (don't know why
> it's configurable because it should be ISO-8859-1 per spec), and set
> form-encoding to the same encoding of the serializer.

Let's put the config in; it will make it easier for others to see what to 
change. We work exclusively in UTF-8 for instance.

I find it very strange that ISO-8859-1 should be the standard (how 
parochially European ;) surely in this day and age it should be UTF-8. 
;)

BTW. While I was searching Cocoon's codebase looking for code that sets 
up encodings, I found a FIXME in SQLTransformer, that makes the 
un-configurable assumption that ISO-8859-1 is the encoding of your DB.

I have been meaning to find time to make this configurable.


thanks for the feedback

regards Jeremy


Re: form encoding problems

Posted by Bruno Dumon <br...@outerthought.org>.
On Mon, 2003-03-03 at 23:00, Jeremy Quinn wrote:
[...]
> I have got it. This was answered on the users list a while back, sorry 
> guys.
> 
> Answer, use the SetCharacterEncodingAction in the Pipeline.
> 
> Works with InputModules too.

(a bit late to jump into this thread)

Are you sure that it was adding this action that solved your problem?

I'm wondering how the SetCharacterEncodingAction could actually do
anything useful. According to the servlet spec (I'm looking at version
2.3), the request.setCharacterEncoding method only does something if
called before any data is read from the request.

Since Cocoon itself reads parameters from the request (such as
cocoon-reload) before any action is executed, this action obviously
cannot do anything useful?
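
[Editor's sketch] The ordering rule referred to above can be modelled with
a toy class. ToyRequest is an invented stand-in, not the servlet API: it
decodes its single parameter lazily on first read and ignores any encoding
change made after that, which is the behaviour the spec describes for
setCharacterEncoding.

```java
import java.nio.charset.Charset;

// Toy model only: mimics the "must be called before data is read" rule.
class ToyRequest {
    private final byte[] rawValue;          // undecoded bytes of one parameter
    private String encoding = "ISO-8859-1"; // default assumed by the container
    private String cached;                  // decoded value, fixed on first read

    ToyRequest(byte[] rawValue) {
        this.rawValue = rawValue;
    }

    void setCharacterEncoding(String enc) {
        if (cached == null) {
            encoding = enc; // silently ignored once the parameter has been read
        }
    }

    String getParameter() {
        if (cached == null) {
            cached = new String(rawValue, Charset.forName(encoding));
        }
        return cached;
    }

    public static void main(String[] args) {
        byte[] eAcuteUtf8 = {(byte) 0xC3, (byte) 0xA9};

        // Encoding set before the first read: decoded as one character.
        ToyRequest early = new ToyRequest(eAcuteUtf8);
        early.setCharacterEncoding("UTF-8");
        System.out.println(early.getParameter().length()); // prints 1

        // Parameter read first (as a framework might do): the later call has no effect.
        ToyRequest late = new ToyRequest(eAcuteUtf8);
        late.getParameter();
        late.setCharacterEncoding("UTF-8");
        System.out.println(late.getParameter().length()); // prints 2
    }
}
```

In this model, anything that reads a parameter early, such as a framework
checking for a control parameter, fixes the decoding for the rest of the
request.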

Wouldn't it be better if we logged a big warning in this action pointing
to the container/form-encoding parameters in the web.xml (and the same
in its javadoc)?

As for the correct way to do things, this is what I understand from it:

 * set the encoding of the HTML serializer to the encoding you'd like to
use

 * make sure a <head> element exists in the html you generate, so that
the serializer can add a <meta ... tag into it (from Stefano's
experiments, this does not seem to work with the xhtml serializer)

 * in the web.xml: set container-encoding to ISO-8859-1 (don't know why
it's configurable because it should be ISO-8859-1 per spec), and set
form-encoding to the same encoding of the serializer.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org


Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Monday, March 3, 2003, at 09:39 PM, Torsten Curdt wrote:

>>> I recall a similar problem a long time ago. I solved it by changing
>>> some of the HTML serializer settings. The form encoding was fine in 
>>> IE
>>> but crap in NS/Mozilla. Sorry, cannot really remember what it was :-/
>>
>> Have a beer, maybe you will remember ;)
>
> *hick* ...I know now ;-)
>
> ...well ok - I looked it up. We have set the serializer encoding to
> ISO-8859-1 and it worked. At least for german :)

I have got it. This was answered on the users list a while back, sorry 
guys.

Answer, use the SetCharacterEncodingAction in the Pipeline.

Works with InputModules too.

Thanks for your help

regards Jeremy


Re: form encoding problems

Posted by Torsten Curdt <tc...@dff.st>.
> > I recall a similar problem a long time ago. I solved it by changing
> > some of the HTML serializer settings. The form encoding was fine in IE 
> > but crap in NS/Mozilla. Sorry, cannot really remember what it was :-/
> 
> Have a beer, maybe you will remember ;)

*hick* ...I know now ;-)

...well ok - I looked it up. We have set the serializer encoding to
ISO-8859-1 and it worked. At least for german :)

cheers
--
Torsten


Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Monday, March 3, 2003, at 06:16 PM, Torsten Curdt wrote:

>> Hi All,
>> This is possibly a trivial mistake ... but I never came across it 
>> before.
>
> I recall a similar problem a long time ago. I solved it by changing
> some of the HTML serializer settings. The form encoding was fine in IE 
> but crap in NS/Mozilla. Sorry, cannot really remember what it was :-/

Have a beer, maybe you will remember ;)

>
>> I have a search form for searching Lucene. Mozilla confirms the page 
>> is in UTF-8 encoding.
>> I enter a string with accented characters into the query field. eg 
>> 'éclair' (e-acute).
>> The form comes back with the string now reading 'Ã©clair'. (A-tilde, 
>> Copyright sign). Mozilla says the encoding is still UTF-8. (The value 
>> has been picked up by an InputModule and fed via the SiteMap to XSLT).
>> The query string in the URL reads 'query=%C3%A9clair', which are the 
>> unicodes for 'A-tilde' and 'Copyright' characters. (Which would imply 
>> to me that the Browser incorrectly encoded the query.)
>
> How do other browsers behave?

Same way, I just quoted Mozilla because it has a handy 'page info' 
dialog.

>
>> This makes me feel like I have done something really dumb, but I 
>> cannot work out what ;)
>
> Well, good luck :)
>
>> Any suggestions?
>
> Have a beer and then come back - maybe you'll see it then ;)

Cheers! Hic! %}

regards Jeremy


Re: form encoding problems

Posted by Torsten Curdt <tc...@dff.st>.
> Hi All,
> 
> This is possibly a trivial mistake ... but I never came across it before.

I recall a similar problem a long time ago. I solved it by changing
some of the HTML serializer settings. The form encoding was fine in IE 
but crap in NS/Mozilla. Sorry, cannot really remember what it was :-/

> I have a search form for searching Lucene. Mozilla confirms the page is 
> in UTF-8 encoding.
> 
> I enter a string with accented characters into the query field. eg 
> 'éclair' (e-acute).
> 
> The form comes back with the string now reading 'Ã©clair'. (A-tilde, 
> Copyright sign). Mozilla says the encoding is still UTF-8. (The value 
> has been picked up by an InputModule and fed via the SiteMap to XSLT).
> 
> The query string in the URL reads 'query=%C3%A9clair', which are the 
> unicodes for 'A-tilde' and 'Copyright' characters. (Which would imply to 
> me that the Browser incorrectly encoded the query.)

How do other browsers behave?

> This makes me feel like I have done something really dumb, but I cannot 
> work out what ;)

Well, good luck :)

> Any suggestions?

Have a beer and then come back - maybe you'll see it then ;)

cheers
--
Torsten


Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Monday, March 3, 2003, at 10:42 PM, Artur Bialecki wrote:

> You might want to set the following init-params for the Cocoon servlet
> in your web.xml
>
> form-encoding to UTF-8
> container-encoding to ISO-8859-1

I thought I ought to be able to do this, but did not work out how.

regards Jeremy


RE: form encoding problems

Posted by Artur Bialecki <ar...@digitalfairway.com>.
You might want to set the following init-params for the Cocoon servlet
in your web.xml

form-encoding to UTF-8
container-encoding to ISO-8859-1

Artur...




Re: form encoding problems

Posted by Jeremy Quinn <je...@media.demon.co.uk>.
On Monday, March 3, 2003, at 06:16 PM, Leo Sutic wrote:

> Are you using Tomcat?

Yeah, sorry I should have mentioned that.

> Tomcat 3.x (I think) has a problem with UTF-8 decoding of parameters.

Tomcat 4.1.18 + Apache2 mod_proxy

>
> I solved it by putting in a <meta http-equiv="encoding" value="ASCII"/>
> tag.

Yek! ;)

How does that mesh with:
<META http-equiv="Content-Type" content="text/html; charset=UTF-8"> ?

> I have no idea if Jetty is affected.

Not tried it either. But earlier I reported encoding problems in 
Cocoon's new test page from Jetty.

Hmmm.

Thanks for your reply.

regards Jeremy


RE: form encoding problems

Posted by Leo Sutic <le...@inspireinfrastructure.com>.
Are you using Tomcat?

Tomcat 3.x (I think) has a problem with UTF-8 decoding of parameters.

I solved it by putting in a <meta http-equiv="encoding" value="ASCII"/>
tag.

I have no idea if Jetty is affected.

/LS

> From: Jeremy Quinn [mailto:jeremy@media.demon.co.uk] 
> Any suggestions?