Posted to users@cocoon.apache.org by Tuomo L <tl...@cc.hut.fi> on 2004/10/28 21:35:32 UTC

Encoding problems, still!

Hi,

We're having some serious encoding problems. This happens only with the 
@href attributes in html, when using characters like å, ä and ö (in the 
Finnish alphabet). Form encoding works just fine. I've gone through 
all the threads concerning encoding (other people having encoding problems 
too). No luck so far. Is this still an issue in Cocoon? Could someone 
please tell what's wrong?

Cocoon 2.1.5
Tomcat 4.1.24
Windows XP / IE6

Thanks in advance,
Tuomo

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@cocoon.apache.org
For additional commands, e-mail: users-help@cocoon.apache.org


Re: Encoding problems, still!

Posted by "Volkm@r" <pl...@arcor.de>.
Tuomo L wrote:
> ...adding to my latest post
> 
> The URL-encoding is done wrong when serializing to HTML. According to 
> specs "äö" should become "%E4%F6" when encoded, not "%C3%A4%C3%B6". This 
                           iso-8859-1                       utf-8

But you'd better not mix up utf-8 with iso8859-1.
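
To make the two byte sequences above concrete, here is a minimal Java sketch that percent-encodes "äö" in both charsets. The class and method names are made up for illustration, and java.net.URLEncoder implements form-style encoding (a space becomes "+"), so it only approximates what the HTML serializer does for an href; for these two characters the escapes are the same.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class HrefEscapeDemo {
    /** Converts the text to bytes in the named charset, then percent-encodes those bytes. */
    static String escape(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "äö" written as Unicode escapes so the source file encoding does not matter.
        String text = "\u00e4\u00f6";
        System.out.println(escape(text, "ISO-8859-1")); // %E4%F6
        System.out.println(escape(text, "UTF-8"));      // %C3%A4%C3%B6
    }
}
```

Each character is one byte in ISO-8859-1 and two bytes in UTF-8, which is why the second form is twice as long.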

> seems to be the problem. So far I've noticed this problem with the 
> HREF-attribute only.
> 
> For a test I made a stylesheet that substitutes "ä" with "%E4" before 
> serializing to HTML. This works, but it should be done by the 
> serializer, right?

This means you are still using the built-in and uncorrected urlencoding?

How did you define the "set-encoding" action in the components section 
of sitemap.xmap?
<map:components>
   <map:actions>
<!-- read http://wiki.apache.org/cocoon/RequestParameterEncoding -->
     <map:action name="set-encoding"
         src="org.apache.cocoon.acting.SetCharacterEncodingAction"/>
   </map:actions>
</map:components>

And how are you using it in the pipeline?
<map:pipeline>
   <map:act type="set-encoding">
     <map:parameter name="form-encoding" value="utf-8"/>
   </map:act>
   ...
</map:pipeline>

-- 
Volkm@r




Re: Encoding problems, still!

Posted by Tuomo L <tl...@cc.hut.fi>.
...adding to my latest post

The URL-encoding is done wrong when serializing to HTML. According to 
specs "äö" should become "%E4%F6" when encoded, not "%C3%A4%C3%B6". This 
seems to be the problem. So far I've noticed this problem with the 
HREF-attribute only.

For a test I made a stylesheet that substitutes "ä" with "%E4" before 
serializing to HTML. This works, but it should be done by the 
serializer, right?

Seems like a Cocoon issue.

-Tuomo

On Thu, 28 Oct 2004, Joerg Heinicke wrote:

> On 28.10.2004 21:35, Tuomo L wrote:
>
>> We're having some serious encoding problems. This happens only with the 
>> @href attributes in html, when using characters like å, ä and ö (in Finnish 
>> alphabet). Form encoding works just fine. I've gone through all the threads 
>> concerning encoding (other people having encoding problems too). No luck so 
>> far. Is this still an issue in Cocoon? Could someone please tell what's 
>> wrong?
>
> What's the page encoding? Forms work like expected? Just the links don't 
> work? This normally points to a different page encoding than UTF-8 as link 
> requests are encoded in UTF-8 while form requests are encoded in page 
> encoding. I don't think it is a Cocoon issue.
>
> Joerg



Re: XSLT & XPath Version in Cocoon 2.1.5.1

Posted by Philipp Rech <ph...@gmx.de>.
thank you! so per default i am using the Xalan engine, right? And i would
have to change the "root" sitemap to set Saxon 8 as default, right? 
-phil

<map:transformers default="xslt">

    <!-- NOTE: This is the default XSLT processor. -->
    <map:transformer logger="sitemap.transformer.xslt" name="xslt"
pool-grow="2" pool-max="32" pool-min="8"
src="org.apache.cocoon.transformation.TraxTransformer">
      <use-request-parameters>false</use-request-parameters>
      <use-session-parameters>false</use-session-parameters>
      <use-cookie-parameters>false</use-cookie-parameters>
      <xslt-processor-role>xalan</xslt-processor-role>
      <check-includes>true</check-includes>
    </map:transformer>




> >>>>> "Philipp" == Philipp Rech <ph...@gmx.de> writes:
> 
>     Philipp> Hello, what XSLT and XPath Version can I use with my
>     Philipp> Cocoon (Version 2.1.5.1)?  Can I use XSLT and XPath 2.0
>     Philipp> already?  Thank you!
> 
> Yes. Saxon 8 works.
> -- 
> Colin Paul Adams
> Preston Lancashire
> 




Re: XSLT & XPath Version in Cocoon 2.1.5.1

Posted by Bertrand Delacretaz <bd...@apache.org>.
> ...Yes. Saxon 8 works.

and http://wiki.apache.org/cocoon/Saxon explains how to set it up.

-Bertrand

Re: XSLT & XPath Version in Cocoon 2.1.5.1

Posted by Colin Paul Adams <co...@colina.demon.co.uk>.
>>>>> "Philipp" == Philipp Rech <ph...@gmx.de> writes:

    Philipp> Hello, what XSLT and XPath Version can I use with my
    Philipp> Cocoon (Version 2.1.5.1)?  Can I use XSLT and XPath 2.0
    Philipp> already?  Thank you!

Yes. Saxon 8 works.
-- 
Colin Paul Adams
Preston Lancashire



Re: XSLT & XPath Version in Cocoon 2.1.5.1

Posted by Joerg Heinicke <jo...@gmx.de>.
On 29.10.2004 14:10, Philipp Rech wrote:

> Hello,
> 
> what XSLT and XPath Version can I use 
> with my Cocoon (Version 2.1.5.1)? 
> Can I use XSLT and XPath 2.0 already?

It depends on your XML libraries. Cocoon comes with the most recent 
releases of Xalan and Xerces. The Xalan release does not provide 
XSLT/XPath 2.0 support. AFAIK there is a Xalan branch for XSLT 2.0. 
Alternatively you can try Saxon with XSLT 2.0 support.
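
One way to see what a given set of XML libraries supports is to ask the processor itself. The sketch below is an illustration only: it goes through the plain JAXP TransformerFactory lookup rather than Cocoon's own processor selection (the xslt-processor-role setting), and the class name is made up. It runs a tiny stylesheet that prints system-property('xsl:version'), which the XSLT spec defines as the version implemented by the processor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltVersionProbe {
    // Minimal stylesheet that outputs the XSLT version the processor claims to implement.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select=\"system-property('xsl:version')\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Returns "<factory class> supports XSLT <version>" for whatever processor JAXP finds. */
    static String probe() {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader("<dummy/>")),
                                  new StreamResult(out));
            return factory.getClass().getName() + " supports XSLT " + out.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT probe failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```

Which factory class is reported depends on the jars on the classpath, so the output differs between a Xalan and a Saxon setup.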

Joerg



XSLT & XPath Version in Cocoon 2.1.5.1

Posted by Philipp Rech <ph...@gmx.de>.
Hello,

what XSLT and XPath Version can I use 
with my Cocoon (Version 2.1.5.1)? 
Can I use XSLT and XPath 2.0 already?
Thank you!

phil 




Re: Encoding problems, still!

Posted by Joerg Heinicke <jo...@gmx.de>.
On 31.10.2004 18:16, Marc Portier wrote:

>>> So assuming all this reasoning is ok, what could never work is this:
>>>
>>> - change your form-encoding (and matching setting of serialization) 
>>> to anything else then UTF-8, cos then request-params in forms and 
>>> pre-built ones in url's get encoded differently and we have no way to 
>>> make a distinction over at cocoon's side

So the theoretical problem is clear when having form-encoding different 
than UTF-8.

>>> It's sad news for Tuomo, but I can't see why it wouldn't be just 
>>> working if (and only if)
>>>
>>> - this is about parameter-values and NOT about URL's or 
>>> parameter-names (because there we *need* to do some work)

>>> - form-encoding is strictly kept to 'utf-8' (thx for the lesson) and 
>>> the serializer follows that (meta-equiv and all)
>>
>> These don't help either, since the UTF-8 encoded parameter values are 
>> read in as ISO-8859-1 and the output is invalid. If these parameter 
> 
> now this I don't understand

Now the practical one (i.e. implementation).

> they are indeed read in using ISO-8859-1, but then inside cocoon they 
> get re-en-decoded:
> 1. yourUtf8UrlEncodedValue --> first urldecoded and then interpreted by 
> container using ISO-8859-1
> 2. this result re-encoded by cocoon using 'container-encoding' 
> (==ISO-8859-1)
> 3. the bytes coming out of that should equal the bytes of the 
> parameter-value right after url-encoding
> 4. so decoding these with 'form-encoding' (==UTF-8) should really just work

This must work if form-encoding is set to UTF-8, because a URL request 
and a form request then both send UTF-8 encoded values. There should be 
no difference on Cocoon's side in handling both. But if form-encoding is 
not UTF-8, step 4 is done 'wrong': Cocoon would need to decode a URL 
request as UTF-8, but it uses form-encoding.
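
A small Java simulation of that mismatch, as a sketch only. It assumes the container decodes with ISO-8859-1 and that the recoding step is the getBytes/new String trick quoted elsewhere in this thread; the class and method names are invented for the example and are not Cocoon API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormEncodingMismatchDemo {
    /**
     * Simulates what the application sees for a parameter value:
     * the container decodes the wire bytes as ISO-8859-1, then the value
     * is turned back into bytes and decoded with the configured form-encoding.
     */
    static String received(byte[] wireBytes, Charset formEncoding) {
        String fromContainer = new String(wireBytes, StandardCharsets.ISO_8859_1);
        byte[] originalBytes = fromContainer.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, formEncoding);
    }

    public static void main(String[] args) {
        String text = "\u00e4\u00f6"; // "äö"
        // A link escaped by the serializer always carries UTF-8 bytes.
        byte[] linkBytes = text.getBytes(StandardCharsets.UTF_8);
        // A form on an ISO-8859-1 page is submitted in the page encoding.
        byte[] formBytes = text.getBytes(StandardCharsets.ISO_8859_1);

        // form-encoding = ISO-8859-1: the form works, the link does not.
        System.out.println(received(formBytes, StandardCharsets.ISO_8859_1).equals(text)); // true
        System.out.println(received(linkBytes, StandardCharsets.ISO_8859_1).equals(text)); // false

        // form-encoding = UTF-8 (with a UTF-8 page): the link works.
        System.out.println(received(linkBytes, StandardCharsets.UTF_8).equals(text));      // true
    }
}
```

With form-encoding set to ISO-8859-1 the link value arrives as four characters instead of two, because each UTF-8 byte is taken for a Latin-1 character.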

> I'ld like to just understand first, and if we need to then also fix this for sure...

I don't see how we can fix it on Cocoon's side. How would Cocoon know 
whether it's a form request (where you need form-encoding) or a URL 
request (where you need UTF-8)?

Joerg




Re: Encoding problems, still!

Posted by Marc Portier <mp...@outerthought.org>.

Tuomo L wrote:
>>
>>
>> taking one step at the time (what am I not seeing?):
>> - suppose a sax stream (producing xhtml) before serialization has a 
>> @href holding an eurosign (\u20AC unicode char)
>>
>> - I hear you guys saying that xalan will recognize the uri-type 
>> attribute and serialize this character out as %E2%82%AC regardless of 
>> the chosen output encoding (didn't catch it but I am assuming that the 
>> output-encoding is set to UTF-8 anyways, and matches the form-encoding 
>> setting)
>>
>> - so we get an html page out telling the browser it is utf-8 encoded
>> - so the browser will apply utf-8 encoding to form-values (and names) 
>> if this were about a form, but it's about this ready @href
>>
>> - now this @href already has this same encoding (thx xalan) in place: 
>> so things should work the same as for the form (as long as the 
>> mentioned eurosign is strictly in the parameter-values)
>>
>>
>> So assuming all this reasoning is ok, what could never work is this:
>>
>> - change your form-encoding (and matching setting of serialization) to 
>> anything else then UTF-8, cos then request-params in forms and 
>> pre-built ones in url's get encoded differently and we have no way to 
>> make a distinction over at cocoon's side
>>
> 
> You're right.
> 

thx for confirming

>>
>> It's sad news for Tuomo, but I can't see why it wouldn't be just 
>> working if (and only if)
>>
>> - this is about parameter-values and NOT about URL's or 
>> parameter-names (because there we *need* to do some work)
> 
> 
> Yes, I was talking about parameter values all the time, but didn't show 
> it clear enough in the example. It should be:
> 
> <a href="someurl?foo=äö" foo="äö">äö</a>
> 

ok, that makes things clear


> Where the foo's value gets UTF-8 encoded by Xalan during serialization, 
> no matter what the settings are where ever.
> 
>> - container-encoding is traditionally set to ISO-8859-1 (unless using 
>> a container like jetty where you can modify it's internal behaviour)
> 
> 
> Mine is set to ISO-8859-1.
> 

good, keep it like that

>> - form-encoding is strictly kept to 'utf-8' (thx for the lesson) and 
>> the serializer follows that (meta-equiv and all)
> 
> 
> These don't help either, since the UTF-8 encoded parameter values are 
> read in as ISO-8859-1 and the output is invalid. If these parameter 

now this I don't understand

they are indeed read in using ISO-8859-1, but then inside cocoon they 
get re-en-decoded:
1. yourUtf8UrlEncodedValue --> first urldecoded and then interpreted by 
container using ISO-8859-1
2. this result re-encoded by cocoon using 'container-encoding' 
(==ISO-8859-1)
3. the bytes coming out of that should equal the bytes of the 
parameter-value right after url-encoding
4. so decoding these with 'form-encoding' (==UTF-8) should really just work
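
The four steps can be replayed outside a container with a few lines of Java. This is a sketch under stated assumptions: URLDecoder stands in for the container's query-string parsing, container-encoding is ISO-8859-1, form-encoding is UTF-8, and the class and method names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class ParameterRecodeDemo {
    /** Step 1: url-decode the raw value and interpret the bytes as ISO-8859-1, like the container. */
    static String containerDecode(String rawQueryValue) {
        try {
            return URLDecoder.decode(rawQueryValue, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Steps 2 to 4: re-encode with container-encoding, then decode with form-encoding. */
    static String recode(String fromContainer) {
        // Steps 2 and 3: these are the same bytes the browser sent after url-decoding.
        byte[] originalBytes = fromContainer.getBytes(StandardCharsets.ISO_8859_1);
        // Step 4: decode them with the real encoding of the request.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "%C3%A4%C3%B6";               // "äö" as escaped in the href
        String garbled = containerDecode(raw);     // four Latin-1 characters
        String fixed = recode(garbled);            // two characters again
        System.out.println(garbled.length() + " -> " + fixed.length()); // 4 -> 2
        System.out.println(fixed.equals("\u00e4\u00f6"));               // true
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in step 1.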


> values are now put for example in database, there are several '?'-marks 
> where those special characters should appear.
>

well, as a general remark you have to be careful with both

1. databases --> they typically have an encoding set too, and you should 
consult the settings of your jdbc driver to make sure you're not having 
a mismatch there

2. interpreting question-marks: I remember spending oodles of time 
looking at something that had worked all along, just because the tool I 
used to read the logfiles or sql-output did not support the encoding 
or was using a font that had no glyph for a certain character (then you 
can spot these question marks while all is well in fact)

anyways: safest thing to do is some code step debugging (at the level of 
the 'decode' method mentioned earlier) or inserting javacode that counts 
the length of the string or even better compares/or dumps intvalues of 
all chars in JVM memory
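
A possible shape for such a helper, as a sketch only (the class and method names are made up). It prints the length and the numeric value of every char, so the result does not depend on the console, log viewer or font used to look at it.

```java
class CharDump {
    /** Returns the string length followed by the code unit of each char, e.g. "length=2: U+00E4 U+00F6". */
    static String dump(String value) {
        StringBuilder sb = new StringBuilder();
        sb.append("length=").append(value.length()).append(':');
        for (int i = 0; i < value.length(); i++) {
            // Cast to int so the numeric value is formatted, not the character itself.
            sb.append(String.format(" U+%04X", (int) value.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u00e4\u00f6"));             // length=2: U+00E4 U+00F6
        System.out.println(dump("\u00c3\u00a4\u00c3\u00b6")); // length=4: U+00C3 U+00A4 U+00C3 U+00B6
        System.out.println(dump("??"));                       // length=2: U+003F U+003F
    }
}
```

A real "äö" shows two values above U+007F, a doubly decoded one shows four, and a value that was already lost shows U+003F, the question mark itself.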

best to take it one step at a time...


> Maybe I just have to send the parameters within a form (as Joerg had 
> done it), which is not a very practical when you only need to do a 
> simple HTTP-GET with parameters. Or then I use a XSL-stylesheet which 

I agree

and as argued above this doesn't make sense: the form will be encoding 
the values exactly in the same way (ie. first utf-8 then url-encode) as 
xalan prepared things... so things should really just work IMHO

> converts all the special characters in parameter values to ISO-8859-1 

juk

> before Xalan serialization. This works, but is also inpractical, since I 
> have to write a long xsl:choose-section. Doing it this way also 
> decreases the performance of my application.
> 
> Can we come up with a better solution?
> 
> Thank you guys for taking interest in this issue.
> 

I'ld like to just understand first, and if we need to then also fix this 
for sure...

regards,
-marc= (off for 5 days, alas; I hope you guys find a nice way out - and 
let us know)
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org



Re: Encoding problems, still!

Posted by Tuomo L <tl...@cc.hut.fi>.
>
>
> taking one step at the time (what am I not seeing?):
> - suppose a sax stream (producing xhtml) before serialization has a @href 
> holding an eurosign (\u20AC unicode char)
>
> - I hear you guys saying that xalan will recognize the uri-type attribute and 
> serialize this character out as %E2%82%AC regardless of the chosen output 
> encoding (didn't catch it but I am assuming that the output-encoding is set 
> to UTF-8 anyways, and matches the form-encoding setting)
>
> - so we get an html page out telling the browser it is utf-8 encoded
> - so the browser will apply utf-8 encoding to form-values (and names) if this 
> were about a form, but it's about this ready @href
>
> - now this @href already has this same encoding (thx xalan) in place: so 
> things should work the same as for the form (as long as the mentioned 
> eurosign is strictly in the parameter-values)
>
>
> So assuming all this reasoning is ok, what could never work is this:
>
> - change your form-encoding (and matching setting of serialization) to 
> anything else then UTF-8, cos then request-params in forms and pre-built ones 
> in url's get encoded differently and we have no way to make a distinction 
> over at cocoon's side
>

You're right.

>
> It's sad news for Tuomo, but I can't see why it wouldn't be just working if 
> (and only if)
>
> - this is about parameter-values and NOT about URL's or parameter-names 
> (because there we *need* to do some work)

Yes, I was talking about parameter values all the time, but didn't show it 
clear enough in the example. It should be:

<a href="someurl?foo=äö" foo="äö">äö</a>

Where the foo's value gets UTF-8 encoded by Xalan during serialization, no 
matter what the settings are where ever.

> - container-encoding is traditionally set to ISO-8859-1 (unless using a 
> container like jetty where you can modify it's internal behaviour)

Mine is set to ISO-8859-1.

> - form-encoding is strictly kept to 'utf-8' (thx for the lesson) and the 
> serializer follows that (meta-equiv and all)

These don't help either, since the UTF-8 encoded parameter values are 
read in as ISO-8859-1 and the output is invalid. If these parameter values 
are now put for example in database, there are several '?'-marks where 
those special characters should appear.

Maybe I just have to send the parameters within a form (as Joerg had done 
it), which is not very practical when you only need to do a simple 
HTTP-GET with parameters. Or I could use an XSL stylesheet which converts 
all the special characters in parameter values to ISO-8859-1 before Xalan 
serialization. This works, but is also impractical, since I have to write 
a long xsl:choose section. Doing it this way also decreases the 
performance of my application.

Can we come up with a better solution?

Thank you guys for taking interest in this issue.

-Tuomo

>
>
> regards,
> -marc=
> -- 
> Marc Portier                            http://outerthought.org/
> Outerthought - Open Source, Java & XML Competence Support Center
> Read my weblog at                http://blogs.cocoondev.org/mpo/
> mpo@outerthought.org                              mpo@apache.org
>



Re: Encoding problems, still!

Posted by Marc Portier <mp...@outerthought.org>.

Joerg Heinicke wrote:
> On 30.10.2004 02:42, Marc Portier wrote:
> 
> That late? ;-)
> 

ugh

>>> But then in the bug report for Xalan (someone having this same 
>>> problem) it says:
>>>
>>> "According to section 16.2 of the XSLT Recommendation [1], non-ASCII 
>>> characters in URI attribute values should be escaped using the method 
>>> recommended in Section B.2.1 of the HTML 4.0 Recommendation [2]. The 
>>> latter recommends that non-ASCII characters be represented in UTF-8 
>>> prior to applying the "%HH" escaping described by the URI RFC, 
>>> regardless of the output encoding."
>>>
>>
>> nifty, didn't know... so whatever output encoding you set the uri's 
>> will be utf-8 encoded, and then url-encoded?
> 
> 
> Yes, that's how I understand it and wrote it in my first reply to 
> Tuomo's question.
> 
>> haven't ever seen this, I was under the impression that to xalan 
>> attributes were just attributes and would have expected characters to 
>> be replaced by character-entity-refs depending on if they are 
>> supported or not by the applied output-encoding
> 
> 
> No, Xalan handles href attributes differently.
> 

thx for boosting my knowledge,
now, this isn't actually making things easier, is it?

>>> This is what Xalan does (HTML serialization), so it obeys the spec.
>>>
>>> Correct me if I'm wrong, but during serialization if there are 
>>> special characters (above 128) in an URL:s request parameters 
>>> (href-attributes etc.), they are first encoded in UTF-8 by Xalan. 
>>> Even if the browser 
>>
>>
>> apparently, would like to see some test evidence to be on the safe 
>> side though
> 
> 
> I can confirm this behaviour for old versions of Xalan coming with 
> Cocoon 2.0 RC 1. At that time we tried to produce links with request 
> params and they did not work because of encoding. We had to change the 
> links to some form.submit() javascript stuff.
> 
>>> detects the page as ISO-8859-1 or anything else, these URL:s in the 
>>> HTML source contain parameters in UTF-8. Now, when user clicks on 
>>> this link, 
>>
>>
>> but it is not about request-parameters is it?
> 
> 
> It is as far as I understand.
> 

well, then I missed some question-mark somewhere ;-)

scanning back through the history I did find this:

> <a href="äö" foo="äö">äö</a>

this is NOT about request-parameter values IMHO

>> it is about the proper URL part, no?
> 
> 
> Don't know exactly. Had no tests for URL part and request param part.
> 
>> as in:
>>
>> http://server:port/path/more-path?request-param=value
>> ---------------------------------|-------------------
>>  >>  area-not-fixed-by-cocoon << |  >> area fixed by cocoon <<
>>
>> (in fact I'm even doubting if we are fixing the names of the 
>> request-params (actually my guess would be we're only doing the values))
>>
>> see 
>> http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/environment/http/HttpRequest.java?rev=55600&root=Apache-SVN&view=auto 
>>
>> there is the internal decode() method. it gets only called from areas 
>> that do with request-parameter-values (as I started to think: not even 
>> the names)
>>
>>> Cocoon reads the request parameters in as ISO-8859-1, and converts 
>>> them to UTF-8, without knowing that these parameters were already UTF-8!
> 
> 
> That's how I understand it (just the first part is not done by Cocoon, 
> but by the container as Mark wrote below too).
> 
>> nope, don't think so... first nuance (see above) the container reads
>> and applies (typically) ISO-8859-1,...
>>
>> and cocoon correctly re-encodes request-parameter-values based on its 
>> 'form-encoding', but isn't (at least to my knowledge) touching the url 
>> part of things
> 
> 
> But if you convert values from ISO-8859-1 to UTF-8 though they already 
> have been UTF-8 and not ISO-8859-1 you are in troubles like Tuomo, 
> aren't you?
> 

you get me doubting :-)

first reading said yes, but I'm not convinced, as long as it is about 
the values and not the names or the @action part we're in good shape, no?


taking one step at the time (what am I not seeing?):
- suppose a sax stream (producing xhtml) before serialization has a 
@href holding an eurosign (\u20AC unicode char)

- I hear you guys saying that xalan will recognize the uri-type 
attribute and serialize this character out as %E2%82%AC regardless of 
the chosen output encoding (didn't catch it but I am assuming that the 
output-encoding is set to UTF-8 anyways, and matches the form-encoding 
setting)

- so we get an html page out telling the browser it is utf-8 encoded
- so the browser will apply utf-8 encoding to form-values (and names) if 
this were about a form, but it's about this ready @href

- now this @href already has this same encoding (thx xalan) in place: so 
things should work the same as for the form (as long as the mentioned 
eurosign is strictly in the parameter-values)


So assuming all this reasoning is ok, what could never work is this:

- change your form-encoding (and matching setting of serialization) to 
anything else than UTF-8, cos then request-params in forms and pre-built 
ones in url's get encoded differently and we have no way to make a 
distinction over at cocoon's side


It's sad news for Tuomo, but I can't see why it wouldn't be just working 
if (and only if)

- this is about parameter-values and NOT about URL's or parameter-names 
(because there we *need* to do some work)
- container-encoding is traditionally set to ISO-8859-1 (unless using a 
container like jetty where you can modify its internal behaviour)
- form-encoding is strictly kept to 'utf-8' (thx for the lesson) and the 
serializer follows that (meta-equiv and all)


regards,
-marc=
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org



Re: Encoding problems, still!

Posted by Joerg Heinicke <jo...@gmx.de>.
On 30.10.2004 02:42, Marc Portier wrote:

That late? ;-)

>> But then in the bug report for Xalan (someone having this same 
>> problem) it says:
>>
>> "According to section 16.2 of the XSLT Recommendation [1], non-ASCII 
>> characters in URI attribute values should be escaped using the method 
>> recommended in Section B.2.1 of the HTML 4.0 Recommendation [2]. The 
>> latter recommends that non-ASCII characters be represented in UTF-8 
>> prior to applying the "%HH" escaping described by the URI RFC, 
>> regardless of the output encoding."
>>
> 
> nifty, didn't know... so whatever output encoding you set the uri's will 
> be utf-8 encoded, and then url-encoded?

Yes, that's how I understand it and wrote it in my first reply to 
Tuomo's question.

> haven't ever seen this, I was under the impression that to xalan 
> attributes were just attributes and would have expected characters to be 
> replaced by character-entity-refs depending on if they are supported or 
> not by the applied output-encoding

No, Xalan handles href attributes differently.

>> This is what Xalan does (HTML serialization), so it obeys the spec.
>>
>> Correct me if I'm wrong, but during serialization if there are special 
>> characters (above 128) in an URL:s request parameters (href-attributes 
>> etc.), they are first encoded in UTF-8 by Xalan. Even if the browser 
> 
> apparently, would like to see some test evidence to be on the safe side 
> though

I can confirm this behaviour for old versions of Xalan coming with 
Cocoon 2.0 RC 1. At that time we tried to produce links with request 
params and they did not work because of encoding. We had to change the 
links to some form.submit() javascript stuff.

>> detects the page as ISO-8859-1 or anything else, these URL:s in the 
>> HTML source contain parameters in UTF-8. Now, when user clicks on this 
>> link, 
> 
> but it is not about request-parameters is it?

It is as far as I understand.

> it is about the proper URL part, no?

Don't know exactly. Had no tests for URL part and request param part.

> as in:
> 
> http://server:port/path/more-path?request-param=value
> ---------------------------------|-------------------
>  >>  area-not-fixed-by-cocoon << |  >> area fixed by cocoon <<
> 
> (in fact I'm even doubting if we are fixing the names of the 
> request-params (actually my guess would be we're only doing the values))
> 
> see 
> http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/environment/http/HttpRequest.java?rev=55600&root=Apache-SVN&view=auto 
> 
> there is the internal decode() method. it gets only called from areas 
> that do with request-parameter-values (as I started to think: not even 
> the names)
> 
>> Cocoon reads the request parameters in as ISO-8859-1, and converts 
>> them to UTF-8, without knowing that these parameters were already UTF-8!

That's how I understand it (just the first part is not done by Cocoon, 
but by the container, as Marc wrote below too).

> nope, don't think so... first nuance (see above) the container reads
> and applies (typically) ISO-8859-1,...
> 
> and cocoon correctly re-encodes request-parameter-values based on its 
> 'form-encoding', but isn't (at least to my knowledge) touching the url 
> part of things

But if you convert values from ISO-8859-1 to UTF-8 though they already 
have been UTF-8 and not ISO-8859-1, you are in trouble like Tuomo, 
aren't you?

Joerg



Re: Encoding problems, still!

Posted by Marc Portier <mp...@outerthought.org>.

Tuomo L wrote:
> Ok, now I'm really confused.
> 
> In Bruno's excellent paper about Cocoon encoding, there's a section that 
> says:
> 
> "For Java-insiders: what Cocoon actually does internally is apply the 
> following trick to get a parameter correctly decoded: suppose "value" is 
> a string containing a request parameter, then Cocoon will do:
> 
> value = new String(value.getBytes("ISO-8859-1"), "UTF-8");      "
> 

correct.

this trick is the re-en-decoding
we get a string from getParameter, we encode it to bytes with ISO-8859-1 
and decode from there with UTF-8

why? to correct the container's mistake

the container will have received bytes (let's call these the 
original-request-parameter-bytes) but will have applied his 
'container-encoding'  on those to be able to return a String over 
getParameter.

(NOTE: this container encoding is a property of your chosen container and 
is typically fixed to iso-8859-1; unless you are running jetty with 
the mentioned charset-property set, you should never change this)

now, cocoon knows from the form-encoding in which encoding forms have 
been serialized out, and thus how request params will be *really* encoded

so to correct the error the container made we encode back to the 
original bytes using latin-1 and then apply the correct form-encoding 
(utf-8)

between servlet-spec 2.2 and 2.3 this issue occurred to the peeps doing 
the spec and they added setCharacterEncoding() to the servlet-request 
and mention explicitly that you need to call that before calling 
getParameter (or any related action that requires parsing and thus 
decoding the query-string)


> But then in the bug report for Xalan (someone having this same problem) 
> it says:
> 
> "According to section 16.2 of the XSLT Recommendation [1], non-ASCII 
> characters in URI attribute values should be escaped using the method 
> recommended in Section B.2.1 of the HTML 4.0 Recommendation [2]. The 
> latter recommends that non-ASCII characters be represented in UTF-8 
> prior to applying the "%HH" escaping described by the URI RFC, 
> regardless of the output encoding."
> 

nifty, didn't know... so whatever output encoding you set the uri's will 
be utf-8 encoded, and then url-encoded?

haven't ever seen this, I was under the impression that to xalan 
attributes were just attributes and would have expected characters to be 
replaced by character-entity-refs depending on if they are supported or 
not by the applied output-encoding

> This is what Xalan does (HTML serialization), so it obeys the spec.
> 
> Correct me if I'm wrong, but during serialization if there are special 
> characters (above 128) in an URL:s request parameters (href-attributes 
> etc.), they are first encoded in UTF-8 by Xalan. Even if the browser 

apparently, would like to see some test evidence to be on the safe side 
though

> detects the page as ISO-8859-1 or anything else, these URL:s in the HTML 
> source contain parameters in UTF-8. Now, when user clicks on this link, 

but it is not about request-parameters is it?
it is about the proper URL part, no?

as in:

http://server:port/path/more-path?request-param=value
---------------------------------|-------------------
  >>  area-not-fixed-by-cocoon << |  >> area fixed by cocoon <<

(in fact I'm even doubting if we are fixing the names of the 
request-params (actually my guess would be we're only doing the values))
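
The split in the diagram above can be looked at with java.net.URI. This
is a sketch only: the host, the port 8080 and the escaped path segment
are invented for the example, and the class has nothing to do with
Cocoon's HttpRequest wrapper.

```java
import java.net.URI;

final class UrlPartsDemo {
    // Path exactly as it appears on the wire, escapes untouched.
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    // Path with %HH escapes decoded; java.net.URI decodes them as UTF-8.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Query string exactly as it appears on the wire.
    static String rawQuery(String url) {
        return URI.create(url).getRawQuery();
    }

    public static void main(String[] args) {
        String url = "http://server:8080/path/%C3%A4%C3%B6?request-param=%C3%A4";
        System.out.println(rawPath(url));     // the part left of the '?'
        System.out.println(decodedPath(url)); // same part, escapes decoded
        System.out.println(rawQuery(url));    // the part right of the '?'
    }
}
```

The path and the query are decoded separately, so a fix that only
touches parameter values leaves the path exactly as the container
decoded it.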

see 
http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/environment/http/HttpRequest.java?rev=55600&root=Apache-SVN&view=auto

there is the internal decode() method. it only gets called from areas 
that deal with request-parameter-values (as I started to think: not even 
the names)

> Cocoon reads the request parameters in as ISO-8859-1, and converts them 
> to UTF-8, without knowing that these parameters were already UTF-8!
> 

nope, don't think so... first nuance (see above) the container reads
and applies (typically) ISO-8859-1,...

and cocoon correctly re-encodes request-parameter-values based on its 
'form-encoding', but isn't (at least to my knowledge) touching the url 
part of things


(sorry for the confusion but that exactly was the executive summary from 
my previous post)


hope this clarifies the issue
hope this strengthens your trust in the proposed workarounds...

-marc=


> My knowledge of the Cocoon internals is not very good, but could this be 
> the problem?
> 
> -Tuomo

-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org



Re: Encoding problems, still!

Posted by Tuomo L <tl...@cc.hut.fi>.
Ok, now I'm really confused.

In Bruno's excellent paper about Cocoon encoding, there's a section that 
says:

"For Java-insiders: what Cocoon actually does internally is apply the 
following trick to get a parameter correctly decoded: suppose "value" is a 
string containing a request parameter, then Cocoon will do:

value = new String(value.getBytes("ISO-8859-1"), "UTF-8");      "
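
The quoted one-liner can be tried in isolation. A minimal sketch,
assuming the container has already decoded UTF-8 bytes as ISO-8859-1;
the class and method names are made up and this is not Cocoon's actual
code.

```java
import java.nio.charset.StandardCharsets;

final class ReDecodeDemo {
    // Recovers the original bytes via ISO-8859-1 (one char per byte),
    // then decodes those bytes again as UTF-8.
    static String redecode(String misread) {
        byte[] original = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes C3 A4 C3 B6 read as ISO-8859-1 give these four characters
        String misread = "\u00c3\u00a4\u00c3\u00b6";
        // After the trick: a-umlaut, o-umlaut
        System.out.println(redecode(misread));
    }
}
```

The trick is lossless because ISO-8859-1 maps every byte to exactly one
character, so getBytes returns the bytes the browser sent.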

But then in the bug report for Xalan (someone having this same problem) it 
says:

"According to section 16.2 of the XSLT Recommendation [1], non-ASCII 
characters in URI attribute values should be escaped using the method 
recommended in Section B.2.1 of the HTML 4.0 Recommendation [2]. The 
latter recommends that non-ASCII characters be represented in UTF-8 prior 
to applying the "%HH" escaping described by the URI RFC, regardless of the 
output encoding."

This is what Xalan does (HTML serialization), so it obeys the spec.

Correct me if I'm wrong, but during serialization if there are special 
characters (above 127) in a URL's request parameters (href-attributes 
etc.), they are first encoded in UTF-8 by Xalan. Even if the browser 
detects the page as ISO-8859-1 or anything else, these URLs in the HTML 
source contain parameters in UTF-8. Now, when the user clicks on this link, 
Cocoon reads the request parameters in as ISO-8859-1, and converts them 
to UTF-8, without knowing that these parameters were already UTF-8!

My knowledge of the Cocoon internals is not very good, but could this be 
the problem?

-Tuomo




Re: Encoding problems, still!

Posted by Marc Portier <mp...@outerthought.org>.
just scanning through this issue fast it seems to me like more evidence 
of things expressed here: 
http://marc.theaimsgroup.com/?t=109231177100007&r=1&w=2


rehashing what I read from Tuomo's setup:

- cocoon-servlet init params are set to have container-encoding 
unchanged (thus iso_8859_1) like we recommend and form-encoding to utf-8 
to make sure his forms can support wide variety of characters

- as a consequence of this last setting (and the well-known 
browser-limitation) this means we need to sync the encoding on the 
serializer to this same utf-8

- because of this setting there is no reason to complain about the 
resulting HTML, that is full of utf-8 encoding, no need to refer to 
specs or blame cocoon: xml serialization was requested to use utf-8 so 
it does (even xalan does its work here I suppose)


now, what goes wrong?

well, I had planned to get into this during gt2004's hackathon but got 
distracted on other issues.  Lacking the experience of the in-depth 
debugging session I can't really do more than express my current 
'suspicions'

(as stated in the thread above)
we've done quite a good job at solving the issue regarding encodings of 
request-parameters and even extended the servlet 2.3 new insights in 
doing so (setCharacterEncoding()) to support even 2.2 containers

however, one important part of the request object set of getters is 
escaping this: the URL (and some of its derived 'paths' as well I assume)

This explains why encoding in form-request params gets fixed correctly, 
but the url itself remains broken --> consequence:
- you can't link to non-latin-char-urls but you can pass 
non-latin-request-params

in more cocoon detail this means you can't expect cocoon matchers to get 
correctly triggered by non-latin-urls as well as you can't automount 
sitemaps in directories with non-latin-only-names...
(or read resources with non-latin-only-names as the original post of the 
other thread was about)



Suggestion:
1. do some tests to verify above and list them as known limitations on 
appropriate wikis. --> tell about the two workarounds:
  a/ to avoid non-latin urls (even if w3c says all urls should be utf-8 
encoded)
  b/ use jetty, set org.mortbay.util.URI.charset property and then DO 
change the cocoon 'container-encoding' param accordingly

2. (assuming my analysis is correct and gets confirmed by the tests) 
extend our http-wrapping-encoding-fix to include the urls and paths as 
well (using the tests as a way to verify the success of this)

3. start the crusade for the abolishment of all encodings but utf-8!


The time consuming part here is jamming together an easy deployable 
testsuite (zip with automount sitemap and all needed stuff inside) 
covering the various aspects... would be cool if somebody else could be 
doing that...

regards,
-marc=


-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
mpo@outerthought.org                              mpo@apache.org



Re: Encoding problems, still!

Posted by Joerg Heinicke <jo...@gmx.de>.
On 29.10.2004 08:44, Tuomo L wrote:

>>> We're having some serious encoding problems. This happens only with 
>>> the @href attributes in html, when using characters like å, ä and ö 
>>> (in Finnish alphabet). Form encoding works just fine. I've gone 
>>> through all the threads concerning encoding (other people having 
>>> encoding problems too). No luck so far. Is this still an issue in 
>>> Cocoon? Could someone please tell what's wrong?
>>
>>
>> What's the page encoding? Forms work like expected? Just the links 
>> don't work? This normally points to a different page encoding than 
>> UTF-8 as link requests are encoded in UTF-8 while form requests are 
>> encoded in page encoding. I don't think it is a Cocoon issue.

First a link about all the encodings:
http://wiki.apache.org/cocoon/RequestParameterEncoding (mostly written
by Bruno).

> According to IE, the page encoding is set to UTF-8. The
> container-encoding and form-encoding in web.xml (Tomcat) are set to UTF-8.

The container-encoding should not be touched at all and remain ISO-8859-1.

> HTMLSerializer is set to use UTF-8 (mime-type="text/html; charset=utf-8")
> and has the parameter <encoding>UTF-8</encoding>.

This should result in <meta http-equiv="Content-Type"
content="text/html;charset=utf-8">. The request encoding header should
have the same value ... which is not that easy when using a recent Tomcat:
http://issues.apache.org/bugzilla/show_bug.cgi?id=26997

> The xsl stylesheets use ISO-8859-1, though.

That's not a problem.

> I've also tried setting everything to ISO-8859-1, but
> the problem with the href-attributes in html remains. Mozilla Firefox
> shows the characters correctly when doing "view source", but if I save the
> document on disk and open with ASCII-editor, the encoding is wrong there
> with both IE and Mozilla. So maybe it's not a browser problem?
> 
> Here's an example:
> 
> <a href="äö" foo="äö">äö</a>
> 
> becomes:
> 
> <a href="%C3%A4%C3%B6" foo="&auml;&ouml;">&auml;&ouml;</a>
> 
> when it should read (I think):
> 
> <a href="&auml;&ouml;" foo="&auml;&ouml;">&auml;&ouml;</a>

...
follow-up mail:
> The URL-encoding is done wrong when serializing to HTML. According to
> specs "äö" should become "%E4%F6" when encoded, not "%C3%A4%C3%B6".
> This seems to be the problem. So far I've noticed this problem with
> the HREF-attribute only.
> 
> For a test I made a stylesheet that substitutes "ä" with "%E4"
> before serializing to HTML. This works, but it should be done by the
> serializer, right?
> 
> Seems like a Cocoon issue.

If it were an error at all, it would be a Xalan serializer problem, I
think. But bugs were reported on this topic and rejected because
of the specs (I think the reporters had the same problem as you):

http://nagoya.apache.org/jira/browse/XALANJ-1412
http://nagoya.apache.org/jira/browse/XALANJ-1548

As I wrote: you simply get different request encodings when sending a
form or just clicking <a href=""/>.

Joerg



Re: Encoding problems, still!

Posted by Tuomo L <tl...@cc.hut.fi>.

On Fri, 29 Oct 2004, Kees van Dieren wrote:

> In mozilla Firefox, when you open the right-click menu and choose "View
> page Info" what is the value for Encoding  there?

The same as the one used in the serializer configuration. No problem here.

>
> The serializer still seems to use ISO-8859 (i.e. not UTF-8) (according to
> the link problem)?

The serializer uses the configured encoding, EXCEPT for the 
url-attributes, like @href. Here's the problem.

>
> The serializer adds the encoding type to the resulting html page, just
> behind the <head> tag. It should add something like:
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> What is the charset in the resulting html page there?

The one set in the serializer's configuration.

>
> Encoding could also be sent within the headers. You can use WGet to view
> the headers (google for it to download it). with wget -d http://address/
> you might see the encoding type specified in the content-type header. Is
> it correct there?

Haven't tested this, but IE and Mozilla detect the encoding as 
expected. The problem seems to be the combination of Xalan serialization 
and Cocoon's internal "trick" to convert each request parameter's value 
from ISO-8859-1 to UTF-8. This works only if the values are not UTF-8 
already, which is not the case after Xalan's serialization.
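
What the conversion does to a value that is already a correctly decoded
string can be checked with a few lines of Java. This is a sketch of the
hypothesis only; it does not show that this is what actually happens
inside Cocoon or the container, and the names are made up.

```java
import java.nio.charset.StandardCharsets;

final class DoubleDecodeDemo {
    // Same conversion as the ISO-8859-1 -> UTF-8 trick.
    static String redecode(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    // True if decoding produced U+FFFD, the Unicode replacement character.
    static boolean isDamaged(String value) {
        return value.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Already correct: a-umlaut, o-umlaut (bytes E4 F6 in ISO-8859-1)
        String correct = "\u00e4\u00f6";
        String result = redecode(correct);
        // E4 F6 is not valid UTF-8, so the decoder substitutes U+FFFD
        System.out.println(isDamaged(result));
    }
}
```

So the conversion is only safe when the input really is UTF-8 that was
read as ISO-8859-1; applied to anything else it destroys the non-ASCII
characters.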

-Tuomo



Re: Encoding problems, still!

Posted by Kees van Dieren <ke...@keesvandieren.nl>.
In Mozilla Firefox, when you open the right-click menu and choose "View
Page Info", what is the value for Encoding there?

The serializer still seems to use ISO-8859 (i.e. not UTF-8) (according to
the link problem)?

The serializer adds the encoding type to the resulting html page, just
behind the <head> tag. It should add something like:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
What is the charset in the resulting html page there?

Encoding could also be sent within the headers. You can use wget to view
the headers (google for it to download it). With wget -d http://address/
you might see the encoding type specified in the content-type header. Is
it correct there?
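
Once the header or the meta tag value is at hand, the charset can be
pulled out of it mechanically. A small sketch, assuming the value looks
like "text/html; charset=UTF-8"; the class name and the regular
expression are made up for the example and do not cover every legal
header form.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharsetDemo {
    // Matches charset=VALUE, with optional spaces and optional quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset name in upper case, or null if none is given.
    static String charsetOf(String contentType) {
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

Comparing the value from the HTTP header with the one from the meta tag
shows quickly whether the two disagree.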

Kind regards


Re: Encoding problems, still!

Posted by Tuomo L <tl...@cc.hut.fi>.
On Thu, 28 Oct 2004, Joerg Heinicke wrote:

> On 28.10.2004 21:35, Tuomo L wrote:
>
>> We're having some serious encoding problems. This happens only with the 
>> @href attributes in html, when using characters like å, ä and ö (in Finnish 
>> alphabet). Form encoding works just fine. I've gone through all the threads 
>> concerning encoding (other people having encoding problems too). No luck so 
>> far. Is this still an issue in Cocoon? Could someone please tell what's 
>> wrong?
>
> What's the page encoding? Forms work like expected? Just the links don't 
> work? This normally points to a different page encoding than UTF-8 as link 
> requests are encoded in UTF-8 while form requests are encoded in page 
> encoding. I don't think it is a Cocoon issue.
>
> Joerg

Thanks Joerg,

According to IE, the page encoding is set to UTF-8. The
container-encoding and form-encoding in web.xml (Tomcat) are set to UTF-8.
HTMLSerializer is set to use UTF-8 (mime-type="text/html; charset=utf-8")
and has the parameter <encoding>UTF-8</encoding>. The XSL stylesheets use
ISO-8859-1, though. I've also tried setting everything to ISO-8859-1, but
the problem with the href attributes in HTML remains. Mozilla Firefox
shows the characters correctly in "view source", but if I save the
document to disk and open it in a plain-text editor, the encoding is wrong
with both IE and Mozilla. So maybe it's not a browser problem?

Here's an example:

<a href="äö" foo="äö">äö</a>

becomes:

<a href="%C3%A4%C3%B6" foo="&auml;&ouml;">&auml;&ouml;</a>

when it should read (I think):

<a href="&auml;&ouml;" foo="&auml;&ouml;">&auml;&ouml;</a>
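
For reference, the escaped href value above can be reproduced outside
Cocoon with a minimal Java sketch. It only shows what percent-encoding
"äö" yields under UTF-8 and under ISO-8859-1; it makes no claim about how
the HTML serializer is implemented. The class and method names are made up
for illustration; java.net.URLEncoder is the only JDK class used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class HrefEncodingDemo {
    // "äö" written with Unicode escapes so the source file's own
    // encoding cannot influence the result.
    static final String SAMPLE = "\u00e4\u00f6";

    // Hypothetical helper: percent-encode text using the given charset.
    static String percentEncode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Each character becomes two bytes in UTF-8.
        System.out.println(percentEncode(SAMPLE, "UTF-8"));      // %C3%A4%C3%B6
        // Each character is a single byte in ISO-8859-1.
        System.out.println(percentEncode(SAMPLE, "ISO-8859-1")); // %E4%F6
    }
}
```

The first output matches the href value shown above, which suggests the
attribute is being URL-escaped as UTF-8 rather than written as a character
entity.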


What's happening?

-Tuomo



Re: Encoding problems, still!

Posted by Joerg Heinicke <jo...@gmx.de>.
On 28.10.2004 21:35, Tuomo L wrote:

> We're having some serious encoding problems. This happens only with the 
> @href attributes in html, when using characters like å, ä and ö (in 
> Finnish alphabet). Form encoding works just fine. I've gone through all 
> the threads concerning encoding (other people having encoding problems 
> too). No luck so far. Is this still an issue in Cocoon? Could someone 
> please tell what's wrong?

What's the page encoding? Do forms work as expected, and only the links 
fail? That usually points to a page encoding other than UTF-8, because 
link requests are encoded in UTF-8 while form requests are encoded in 
the page encoding. I don't think it is a Cocoon issue.
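
To illustrate the kind of mismatch meant here, a minimal Java sketch 
(not Cocoon code; the class and method names are hypothetical) that 
decodes a UTF-8 percent-encoded link value once with the matching 
charset and once with ISO-8859-1:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class LinkDecodingDemo {
    // Hypothetical helper: decode a percent-encoded value with the given charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String fromLink = "%C3%A4%C3%B6";  // "äö" percent-encoded as UTF-8
        String expected = "\u00e4\u00f6";  // "äö" as Unicode escapes

        // Matching charset: the original two characters come back.
        System.out.println(decode(fromLink, "UTF-8").equals(expected));      // true
        // Mismatched charset: each byte becomes its own character,
        // so four wrong characters come back instead of two.
        System.out.println(decode(fromLink, "ISO-8859-1").equals(expected)); // false
        System.out.println(decode(fromLink, "ISO-8859-1").length());         // 4
    }
}
```

If the receiving side decodes link parameters with a charset other than 
the one the link was encoded in, the result is this sort of garbling, 
even though form submissions from the same page still work.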

Joerg
