Posted to users@tapestry.apache.org by Doug Hauge <do...@lithium.com> on 2007/07/07 02:43:52 UTC

Problem with non-ASCII form parameters in form containing upload component

I am running into an encoding problem with form parameter values that
contain non-ASCII characters in a form that contains an upload
component. Without the upload component everything works fine. The
problem appears to be in the handling of strings in
'multipart/form-data': both Firefox and Mozilla seem to send
strings encoded as UTF-8 without specifying a character set, and the
upload component interprets these as ISO-8859-1. This actually seems to
be the correct response to incorrect behavior by the browsers, but in
practice we need a workaround. I can't find a way around
this without modifying the tapestry-upload project. Can anyone
suggest a better solution, and if not, will Tapestry
itself deal with this in the future?

Our simple, hacky workaround is to modify
'MultipartDecoderImpl.processFileItems(...)' to call

wrapper.addParameter(item.getFieldName(), item.getString("UTF8"))

instead of

wrapper.addParameter(item.getFieldName(), item.getString())
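
To see the effect in isolation, here is a minimal sketch using only the
JDK. It assumes, as described above, that the browser sends UTF-8 bytes
and that the receiving side decodes them as ISO-8859-1. The class and
method names (MojibakeDemo, decodeAsLatin1, decodeAsUtf8) are made up
for illustration and are not part of Tapestry or commons-fileupload.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical names, not Tapestry or commons-fileupload API.
class MojibakeDemo {

    // Decodes the bytes as ISO-8859-1, the default the post attributes
    // to the parameterless getString().
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes as UTF-8, which is what the workaround requests.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // The accented letter occupies two bytes in UTF-8, so the
        // ISO-8859-1 decoding yields one extra character.
        System.out.println(decodeAsLatin1(raw).length()); // 5
        System.out.println(decodeAsUtf8(raw).length());   // 4
    }
}
```

Decoding as ISO-8859-1 maps each of the two UTF-8 bytes of the accented
letter to its own character, which is the corruption seen in the form
values.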

To improve this, I think we would need:

1) A way of passing in an appropriate default encoding to use.
We could contribute an 'HttpServletRequestHandler' that sets the
request's default character encoding, but is there a way to guarantee
that our handler would be called before the
'MultipartServletRequestFilter'?

2) Even if we did (1), we would need a way to use this encoding to parse
strings in multipart form fields. Passing the encoding to
'FileItem.getString()' is undesirable because it would not handle the
case where the part's 'charset' parameter was explicitly set. The
parameterless version of 'FileItem.getString()' cannot be used, however,
because it explicitly defaults to 'ISO-8859-1' if the character set is
not specified (i.e. it uses neither the request's character encoding nor
the header encoding set by 'FileUpload.setHeaderEncoding'). I can't find
a nice way to do this without duplicating some code in
'commons-fileupload' or relying on public methods of 'DiskFileItem' that
aren't in the 'FileItem' interface.
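
The behavior wanted in (2) can be sketched roughly as follows: use the
part's own 'charset' parameter when it is present, and fall back to a
caller-supplied default only when it is missing or unknown. This is a
JDK-only sketch under stated assumptions, not the Tapestry or
commons-fileupload implementation. PartDecoder, charsetOf and decode
are hypothetical names, and the Content-Type parsing is deliberately
naive (it splits on ';' and does not handle quoted values that contain
semicolons).

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper: not part of Tapestry or commons-fileupload.
class PartDecoder {

    // Returns the charset named in a Content-Type value such as
    // "text/plain; charset=UTF-8", or the fallback when the value is
    // null, has no charset parameter, or names an unsupported charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the default.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes a form field's raw bytes, honoring an explicit charset
    // on the part and using the fallback otherwise.
    static String decode(byte[] raw, String contentType, Charset fallback) {
        return new String(raw, charsetOf(contentType, fallback));
    }
}
```

With commons-fileupload, the inputs could presumably come from
'FileItem.get()' and 'FileItem.getContentType()', with the fallback
taken from the request's character encoding. That wiring is an
assumption and is not shown here.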

Thanks,
Doug



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tapestry.apache.org
For additional commands, e-mail: users-help@tapestry.apache.org


Re: Problem with non-ASCII form parameters in form containing upload component

Posted by Steven Coco <co...@stevencoco.com>.
Hi.

I have a couple of obvious thoughts:

Does the form contain an accept-charset attribute? If the form in the HTML 
document does not specify an accept-charset, the user agent is allowed to use 
the document's charset.

Did you double-check the charset being returned by the server (what's seen by 
the agent)? Also take into account any transcoding allowed by servers or 
agents along the way, and what the user agent itself has settled on as the 
document's effective encoding, given its own choices.

If the header is not present, no meta element is given (for 
Content-Type...charset=), and the element itself has no charset defined in 
some attribute, HTML actually specifies that the user agent does not have to 
follow any rules -- e.g. it doesn't fall back to 8859-1 as stated in HTTP -- 
the user agent may be using the user's chosen default encoding, which may be 
UTF-8 (or the platform encoding). It also may guess...

I still do not know what you are supposed to do if the agent has picked some 
arbitrary local encoding -- because it has not been given any charset it must 
presume -- and then it sends back something with no indication. It's very 
good to ensure the server sends clear headers; and it's nice to duplicate 
that in a meta element too; and also to include an attribute on the form. 
This at least gives the agent no excuses, and you can actually send back 
a "you have a buggy user agent" response with confidence.

I was prompted to at least comment because form handling in my opinion is 
still extremely messy! It can still bite you in corner cases. There are at 
least 6 cross-referenced RFCs for the encoding of each part of the multipart 
message. Each part is supposed to have a content-type; and that contains the 
charset parameter for text components. And you still see things being 
transmitted incorrectly. Plus the file name part is allowed to munge the file 
name into an approximation. I don't know if any of that is useful.

Good luck! I'd be curious to hear what you ultimately learn.

Ciao!
-Steev Coco.




