Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2002/04/28 00:07:38 UTC

The encoding nightmare with StreamGenerator

I have a browser that sends a POST request with:

  content-type: application/x-www-form-urlencoded

and the hidden field "content" is populated (using client-side
javascript) with some xml which looks like this

   <?xml version="1.0" encoding="UTF-8"?>
   <page>
    <title>Title</title>
    <abstract>Ã¨</abstract>
    ...
   </page>

the weird "Ã¨" text is the UTF-8 encoded value for [è] (depending on
your mail client you might not be seeing any of the above as I wrote
it, but that's exactly part of the encoding nightmare that UTF was
designed to fix... but there is still a long way to go)
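
To make the effect concrete, here is a minimal sketch (the class and method names are made up for illustration) that takes the UTF-8 bytes of è and decodes them as if they were ISO-8859-1. Unicode escapes are used so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon.
final class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E8 is e-grave; its UTF-8 form is the two bytes 0xC3 0xA8.
        String garbled = misdecode("\u00e8");
        // Each byte becomes its own character: U+00C3 followed by U+00A8.
        System.out.println(garbled.length());
        System.out.println(garbled.equals("\u00c3\u00a8"));
    }
}
```

One character goes in and two come out, which is the pattern visible in the sample above.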

Now, I use StreamGenerator to get this text, have it parsed and
feed my pipeline. So far so good.

The problem is that stupid StreamGenerator doesn't recognize the
encoding (because the content-type doesn't have the 'charset='
parameter defined, and IE can't be tweaked to emit that, AFAIK), so it
spits the characters out "as they are", as if they were ASCII encoded
(I used the LogTransformer to witness this: the same weird 'Ã¨'
appears in the logs, with no encoding translation taking place).

It seems that StreamGenerator (or the parser instance it instantiates)
fails to see that 'Ã¨' is not two 8-bit chars but a single character
encoded as two bytes.

I'm positive the bug resides in StreamGenerator: in fact, if I tweak the
javascript to fill the form content with

   <?xml version="1.0" encoding="BLAH"?>

the parser doesn't even trigger an error.
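
One possible explanation for the missing error, offered as an assumption and not as a diagnosis of StreamGenerator itself: a SAX parser that is handed a character stream reads it directly and disregards the encoding declaration, so a bogus name only matters when the parser is given raw bytes. The sketch below uses the standard JAXP parser; the class and method names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Cocoon.
final class EncodingDeclDemo {
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"BLAH\"?><page/>";

    // Returns true if the source parses without any exception.
    static boolean parses(InputSource source) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(source, new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Already-decoded characters: the declared encoding is not consulted.
    static boolean parsesFromCharacters() {
        return parses(new InputSource(new StringReader(XML)));
    }

    // Raw bytes: the parser has to honour the declaration to decode them.
    static boolean parsesFromBytes() {
        byte[] bytes = XML.getBytes(StandardCharsets.US_ASCII);
        return parses(new InputSource(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) {
        System.out.println("characters: " + parsesFromCharacters());
        System.out.println("bytes:      " + parsesFromBytes());
    }
}
```

With the parser bundled in the JDK, the character-stream variant is accepted and the byte-stream variant is rejected. If the text reaches the parser as an already-decoded string, that would match the behaviour described above.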

I'm going to investigate how to patch this since I need it badly! But if
you have any suggestions, I'm all ears.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: The encoding nightmare with StreamGenerator

Posted by Mathias Brökelmann <ma...@mathias.d2g.com>.
Hi,

I think the problem is the servlet engine, which parses the parameters
out of the request. StreamGenerator simply takes the parameters from the
request object.

Tomcat uses ISO-8859-1 as the character encoding if the browser (IE or
Netscape, for example) does not send the character encoding to the
server. The bad thing: this is hard-coded in Tomcat, so you cannot
configure the default encoding (see the Tomcat sources, method
getReader() of org.apache.catalina.connector.RequestBase).
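
If the container really did decode the parameter as ISO-8859-1 while the browser sent UTF-8 bytes (both are assumptions here), the original bytes can still be recovered, because ISO-8859-1 maps every byte to exactly one character. A minimal sketch of that repair, with a made-up helper name:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Cocoon or Tomcat.
final class ParameterRepair {
    // Undo an ISO-8859-1 decode and re-read the same bytes as UTF-8.
    static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00A8 is what the container would hand back for e-grave.
        String repaired = redecode("\u00c3\u00a8");
        System.out.println(repaired.equals("\u00e8"));
    }
}
```

This only works while the string contains nothing above U+00FF, and it silently corrupts input that was sent as ISO-8859-1 in the first place, so it is a workaround and not a fix.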

The only solution I have found is to send the POST not as
application/x-www-form-urlencoded but as multipart/form-data.

The result is that you get the content as binary, not already parsed
by the servlet engine. This should work especially well for XML streams,
because the <?xml version="1.0" encoding="UTF-8"?> declaration
identifies the encoding.
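
As a rough illustration of why raw bytes help: when a SAX parser is given a byte stream, it decodes the bytes itself according to the XML declaration. The sketch below uses the standard JAXP parser and a hypothetical helper that pulls the text of the abstract element out of the sample document. How the bytes would be obtained from a multipart request is not shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Cocoon.
final class ByteStreamParseDemo {
    // Parse raw XML bytes and return the text inside <abstract>.
    static String abstractText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inAbstract;

            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                inAbstract = "abstract".equals(qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                inAbstract = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inAbstract) {
                    text.append(ch, start, length);
                }
            }
        };
        try {
            // The parser sees bytes, so it applies the declared encoding.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(xmlBytes)),
                       handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<page><title>Title</title>"
            + "<abstract>\u00e8</abstract></page>";
        String result = abstractText(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(result.equals("\u00e8"));
    }
}
```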

Anyway, StreamGenerator does not seem to be able to handle
multipart/form-data as the Content-Type. Why?

Hope that helps.

Mathias Broekelmann

> -----Original Message-----
> From: Robert Koberg [mailto:rob@koberg.com]
> Sent: Sunday, 28 April 2002 00:28
> To: cocoon-dev@xml.apache.org
> Subject: Re: The encoding nightmare with StreamGenerator
> 
> Hi Stefano.
> 
> Is your xsl:output putting out utf-8 or iso?
> 
> We have the same problem not using cocoon. We use JS to pre-parse for
> these kinds of things - trial and error... :(
> 
> best,
> -Rob


Re: The encoding nightmare with StreamGenerator

Posted by Robert Koberg <ro...@koberg.com>.
Hi Stefano.

Is your xsl:output putting out utf-8 or iso?

We have the same problem not using cocoon. We use JS to pre-parse for 
these kinds of things - trial and error... :(

best,
-Rob

