Posted to users@tomcat.apache.org by André Warnier <aw...@ice-sa.com> on 2009/03/16 14:54:34 UTC

form parameters

Hi.

I am about 99% sure of the following, but I would like to be 100% sure.

Referring to
HttpServletRequest.getParameter()
HttpServletRequest.getParameterValues()

If, inside an HTML page containing a tag such as

<meta content="text/html; charset=iso-8859-2" http-equiv="Content-Type">

there is a form section defined as follows :

<form name="form" method="post" enctype="multipart/form-data" 
action="(url of my webapp/servlet)">
<input name="param1" value="abc (+ some typical iso-latin-2 chars from 
the upper part of the table)">
...
</form>

then, if this form is submitted, within my servlet the line

String p1 = request.getParameter("param1");

would always store in p1 the proper internal Java Unicode string 
value of the input element "param1" of the form, properly decoded from 
its original iso-8859-2 encoding.
Yes ?


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Joseph Millet wrote:
> Maybe I'm missing something but from the little knowledge I have, I'd
> think an HTML form is posted encoded in the form enclosing HTML
> document charset specified in the sent Server headers. So that you
> settle a page encoded in iso-8859-2, you wouldn't expect a form
> present in that page to post unicode data, would you ?
> 
Maybe we need to restate the issue a bit differently.
Imagine a website on which there is a starting page with 3 links :
- formA.html
- formB.html
- formC.html
Each of these is an HTML page containing a tag '<form method="POST">'.
Now 3 users, each at his workstation, obtain this starting page from the 
server.
Then userA clicks on the link to formA.html and obtains the 
corresponding page.
Similarly, userB clicks on the second link etc..
The users fill in their respective forms, and submit their respective 
forms to the server (in any order).

The process on the server which handles the first submission (whether it 
is a servlet in Tomcat, or a cgi-bin under httpd etc.. doesn't matter), 
has no idea where this submit data comes from, right ? (It could even 
come from a page obtained from another server).
So the process in question has to evaluate this data, based only on what 
it gets in this specific POST.

What we are discussing here is how, based only on the data coming in 
from the browser POST, the server process determines the correct 
character encoding of what it receives.
And the answer so far is, it basically cannot be sure, because the 
browser does not send enough information with the POST, to allow the 
server process to determine this unambiguously.

Of course, if the server process is sure that the form originally came 
from itself, and that all the forms composing this application are 
defined such that the browser *should* always encode the data in a 
specific way, then the process could reasonably assume a charset and 
encoding.  But if one of the users uses a non-compliant browser that 
does not give a jot about what html is telling it to do, then ..

A separate but related point is that current browsers do not seem to 
follow the HTML specifications entirely: even for multipart/form-data 
submissions, they do not send the charset/encoding headers that would 
let the server know for sure, although they should.

To go back to your note above :
It is true that the browser, in the absence of other information, SHOULD 
consider that the data it is going to submit should be in the encoding 
of the page containing the <form>.
This /can/ be changed by using the "accept-charset" attribute of the 
<form> tag.
However, even if that is true and if the browser follows the 
specifications in that respect and does encode the data properly, it 
does not change what I mention above about the fact that the server is 
still really in the dark about what it gets.





Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Joseph,

On 3/19/2009 7:49 PM, Joseph Millet wrote:
> Maybe I'm missing something but from the little knowledge I have, I'd
> think an HTML form is posted encoded in the form enclosing HTML
> document charset specified in the sent Server headers.

It doesn't really matter what the client decides to do (they can submit
in a different charset for all I care) as long as it indicates in the
request headers what the charset is.

The problem is that many clients do /not/ indicate the charset in the
request, even though the spec requires them to do so. Even if the above
assertion generally holds (POST charset matches the form's enclosing
document's charset), you can't bet on it.

> So that you settle a page encoded in iso-8859-2, you wouldn't expect
> a form present in that page to post unicode data, would you ?

As I said, it's only a coincidence that the client sends the POST data
in a matching charset. The only surprise is that the client sends
something other than ISO-8859-1 (the default as per the spec) but does
not tell the server what it is. :(

-chris


Re: form parameters

Posted by Joseph Millet <jo...@gmail.com>.
Maybe I'm missing something, but from the little knowledge I have, I'd
think an HTML form is posted in the charset of the enclosing HTML
document, as specified in the headers sent by the server. So if you
serve a page encoded in iso-8859-2, you wouldn't expect a form
present in that page to post Unicode data, would you ?

On Tue, Mar 17, 2009 at 2:31 PM, Christopher Schultz
<ch...@christopherschultz.net> wrote:
> Pid,
>
> On 3/17/2009 6:52 AM, Pid wrote:
>> Does the Servlet Spec define the default value of the request encoding,
>> or is this a Tomcat feature?
>
> The servlet spec (section 3.9 "Request data encoding") specifies
> ISO-8859-1 as the default encoding for POST data when no charset has
> been specified. Although the servlet spec provides a default, I believe
> it is really inheriting this default from the HTTP spec.
>
>> If the latter, it would be a reasonable
>> candidate for a Connector parameter, perhaps.
>
> The <Connector> currently has both useBodyEncodingForURI and URIEncoding
> attributes for interpreting the URI, but nothing for the encoding of the
> body. Since this can easily be done using filters (whereas there is no
> way to fiddle with the URI encoding), I doubt it will be added to the
> <Connector>.
>
> - -chris


Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Pid,

On 3/17/2009 6:52 AM, Pid wrote:
> Does the Servlet Spec define the default value of the request encoding,
> or is this a Tomcat feature?

The servlet spec (section 3.9 "Request data encoding") specifies
ISO-8859-1 as the default encoding for POST data when no charset has
been specified. Although the servlet spec provides a default, I believe
it is really inheriting this default from the HTTP spec.
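
As a rough illustration of that fallback rule, here is a minimal sketch of how a
charset could be picked from a request's Content-Type header value, using
ISO-8859-1 when none is declared. The class and method names are hypothetical
and are not part of the Servlet API or of Tomcat; the parsing is deliberately
simplistic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of the Servlet API or Tomcat. */
final class RequestCharsets {

    /**
     * Returns the charset named in a Content-Type header value, or
     * ISO-8859-1 (the servlet spec default cited above) when the header
     * is missing, has no charset parameter, or names an unknown charset.
     */
    static Charset resolve(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive match on the "charset=" prefix
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed name: fall back
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        // Typical browser POST: no charset parameter, so the default applies
        System.out.println(resolve("application/x-www-form-urlencoded"));
        // A client that does declare its charset
        System.out.println(resolve("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```

With the header most browsers send, the first call falls through to the
default, which is exactly the case discussed in this thread.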

> If the latter, it would be a reasonable
> candidate for a Connector parameter, perhaps.

The <Connector> currently has both useBodyEncodingForURI and URIEncoding
attributes for interpreting the URI, but nothing for the encoding of the
body. Since this can easily be done using filters (whereas there is no
way to fiddle with the URI encoding), I doubt it will be added to the
<Connector>.

-chris


Re: form parameters

Posted by Pid <p...@pidster.com>.
André Warnier wrote:
> Christopher Schultz wrote:
>>
>> Quick question: multipart/form-data is typically used for file upload...
>> why not use application/x-www-form-urlencoded instead? I realize the
>> problem is that certain browsers do not send the proper charset in the
>> Content-Type, but I'd like to understand your affinity for
>> multipart/form-data.
>>
> This :
> http://www.w3.org/TR/html401/interact/forms.html#h-17.13
> See the note in green at the end of 17.13.1 Form submission method.
> 
> Plus, the fact that our applications (area : document management) very
> often do offer the possibility to upload a file from within forms.
> Plus, the fact that the same applications often do offer the possibility
> to submit very large non-USASCII text fields.
> Plus, the fact that most of my activity relates to users who are not
> mainly English-speaking and do not use a US keyboard to fill-in web forms.
> Plus, the fact that having seen HTTP/HTML being born, I remember the
> time when URL's were typically limited in size, in a manner inconsistent
> between platforms. That might still be the case.
> 
> Somewhat abusively I admit, I took an early aversion to
> application/x-www-form-urlencoded, as synonymous to GET, to non-capable
> of anything but US-ASCII (ok, iso-8859-1 at a stretch, but see the above
> green note) and to "nobody agrees as to the proper percent encoding and
> at what moment it should take place or not".
> 
> The multipart/form-data encoding does not have all of these
> connotations, and should be a foolproof way for a browser to send data
> to a server without any size limit or charset ambiguity.
> 
> It is therefore a big surprise and big disappointment to see that
> browser developers do not take advantage of this, for some reason I have
> trouble to fathom (because it's there, it is well-defined, it is easy to
> do, and it would save a lot of problems).
> 
> It is also a big disappointment to see (you are right, I checked) that
> the Servlet Spec does not foresee a simple method to get the parameter
> values if they are posted via the multipart/form-data encoding method.
> That is probably because for 10 years or so, I have been using this
> under Apache and perl without any problems at all : I just use the
> equivalent of GetParameter() there, without having to worry a jot about
> the request encoding; and why should I have to ?
> Read the body myself and parsing it ? in 2009 ?
> 
> Now come on, I am sure that there must exist some standard Java library
> usable in a servlet context, and which does that, no ?

Does the Servlet Spec define the default value of the request encoding,
or is this a Tomcat feature?  If the latter, it would be a reasonable
candidate for a Connector parameter, perhaps.

p





Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
André,

On 3/16/2009 8:30 PM, André Warnier wrote:
> Christopher Schultz wrote:
>> André,
>>
>> (Man, I need to get a keyboard mapping for "é". This copy-and-paste
>> thing is such a drag...)
> 
> Well, you can use Andre, I don't mind and I'm used to all kinds of
> spellings.  Or you can use André , the special form for people who
> haven't dominated their MIME charsets yet ;-)

Or for those whose charsets are mismatched (ha!).

> Well yes [size does matter], in a number of situations.  Think for example about
> webserver logs, where these things then appear as a very very long
> string, percent-escaped to boot.

Eh, so you'd get your data enlarged to some extent. Again, the exact
Content-Type is not really relevant, since the problem is the same in
either case. The only difference is whether the servlet spec says it'll
expose that data to you through getParameter and friends.

> There is no "Content-Type of the request".  Try it : make a GET request
> (or a POST with application/x-www-form-urlencoded), and look for a
> request Content-Type with a charset.
> For a GET there is no content-type (because there is no request body).
> For a POST there is a content-type, but without charset.

That's the browser's fault, not the spec's. A request /does/ have a
Content-Type, whether implied or explicit. The trouble starts when the
client encodes the POST body with a charset other than the default
and refuses to advertise it (which is the root of the problem).

> The gist of it is : for an "enctype=application/x-www-form-urlencoded"
> (whether explicit or by default), the URL is encoded in whatever charset
> the browser feels like encoding it. Which MAY depend on what the browser
> thinks the charset of the page is, which contains the <form>; or the
> "accept-charset" attribute of the form tag, or the user's preferences.
> But whatever the browser is in the end sending you, it does not say.

Agreed. My interpretation of the spec is that most clients are
non-compliant. When I use the filter attached to one of my other posts,
I most certainly *do* get POST content in UTF-8 encoding, yet the
browser fails to inform me with a Content-Type header.

If I POST "gregör", the POST body (again, without charset indicated in
the content-type) is this:

query=greg%C3%B6r

Note that if ISO-8859-1 had been used, the string should have been:

query=greg%F6r

So, the browser is patently violating the spec: it is using UTF-8 to
encode the request body yet not advertising it (RFC 2616 section 3.7.1).
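
The two encodings of that value can be reproduced with the standard
java.net.URLEncoder and URLDecoder classes. This is only a sketch of the byte
arithmetic, not of what any particular browser or container does internally;
the wrapper class and its method names are made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FormEncodingDemo {

    /** Percent-encodes a form value using the bytes of the given charset. */
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset, e);
        }
    }

    /** Decodes a percent-encoded value, interpreting its bytes in the given charset. */
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // "greg" + o-umlaut + "r", escaped so the source file encoding does not matter
        String name = "greg\u00f6r";
        System.out.println("query=" + encode(name, "UTF-8"));      // query=greg%C3%B6r
        System.out.println("query=" + encode(name, "ISO-8859-1")); // query=greg%F6r
        // Reading the UTF-8 body with the ISO-8859-1 default turns one
        // character into two, so the value grows from 6 to 7 characters.
        System.out.println(decode("greg%C3%B6r", "ISO-8859-1").length()); // 7
    }
}
```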

Technically speaking, there is /no/ default charset unless the primary
media type is "text". My interpretation of the HTTP spec is that both
multipart/form-data /and/ application/x-www-form-urlencoded /require/ a
charset to be declared, even if the charset is "raw" or something like
that (for binary files, for instance).

> But $filename is also ("magically") a /filehandle/, as
> soon as you treat it like one and read from it.  That filehandle is
> connected to a temporary file in which the module has already read and
> saved the file part as uploaded by the browser.

Yeah, this is commons-upload for Java peeps:
http://commons.apache.org/fileupload/

> So, no, it is not a 10 MB string in memory.
> If the programmer closes that filehandle, the file is automatically
> deleted from whatever temporary space it occupied.
> Keep reading, and don't miss the
>  $type = uploadInfo($filename)->{'Content-Type'};

Note that the encoding for a file upload should always be
application/octet-stream. Otherwise, you'll get things like newline
conversions such that md5(source) != md5(target). The Content-Type
should be the mime-type for the file.
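
To make the md5(source) != md5(target) point concrete, here is a small sketch
that hashes the same text with CRLF and with LF line endings. It only simulates
a newline conversion; the class name and sample strings are invented for the
example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class NewlineDigestDemo {

    /** Returns the MD5 digest of the given bytes as lowercase hex. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] source = "line one\r\nline two\r\n".getBytes(StandardCharsets.US_ASCII);
        // Simulate a text-mode transfer that rewrites CRLF as LF
        byte[] target = "line one\nline two\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(md5Hex(source).equals(md5Hex(target))); // false
    }
}
```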

> In our applications, we are the ones sending the forms to the client,
> and we know the type of encoding to expect from them.

If that's the case, why not simply force Java to always use a certain
encoding? That's essentially what you're doing in Perl, whether you know
it or not.

> Just to keep people honest, we also always include a hidden parameter
> containing a UTF-8 string with non-US-ASCII characters, and check the
> returned length (in bytes and in characters) when the form is submitted.
> If there is a discrepancy between them, we know that the form
> parameter's encoding is not what it should be, and reject the post.

That can easily be done in Java, too.
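
A minimal sketch of such a check, assuming the hidden field's value has already
been obtained as a String (for example from getParameter). The canary value and
all names here are hypothetical; comparing the whole string with equals() would
be a stricter variant of the same idea.

```java
import java.nio.charset.StandardCharsets;

final class EncodingCanary {

    /**
     * Hypothetical canary: e-acute followed by the euro sign.
     * That is 2 characters, and 2 + 3 = 5 bytes in UTF-8.
     */
    static final String CANARY = "\u00e9\u20ac";

    /**
     * Returns true when the submitted hidden field came back with the
     * expected number of characters and of UTF-8 bytes.
     */
    static boolean looksIntact(String submitted) {
        if (submitted == null) {
            return false;
        }
        int expectedChars = CANARY.length();
        int expectedBytes = CANARY.getBytes(StandardCharsets.UTF_8).length;
        return submitted.length() == expectedChars
                && submitted.getBytes(StandardCharsets.UTF_8).length == expectedBytes;
    }

    public static void main(String[] args) {
        // Simulate a server that decoded the UTF-8 bytes as ISO-8859-1
        String misdecoded = new String(
                CANARY.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(looksIntact(CANARY));     // true
        System.out.println(looksIntact(misdecoded)); // false
    }
}
```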

> It doesn't [currently fail] because so far I am not processing form posts in Java servlets.
> This discussion started because I need to do it now, in relation with
> the same external application for which I posted some questions about
> BufferedInputStreamReader's and such a while ago.

Yup, I remember.

> Now I have the problem in reverse : the application gets input from an
> iso-8859-2 form, in iso-8859-2, but is interpreting it as iso-8859-1.
> I was just wondering if by changing the form to use the
> multipart/form-data encoding type, the servlet would "magically" realise
> the errors of his ways, and read the data properly.
> Apparently however, browsers and HTTP and Servlet Specs conspire to make
> my life difficult.

Yeah. Hey, if you're sure the data will be in ISO-8859-2, then I would
just use a filter like the one I posted (and you've already played with)
and call it a day. You can rant and rave about the specs all you want,
but it's not going to solve your problem :)
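
For what it is worth, the symptom described above (ISO-8859-2 bytes read as
ISO-8859-1) can be reproduced and undone with plain String conversions. This is
only a sketch: it assumes the JVM ships an ISO-8859-2 charset, the names are
invented, and repairing strings after the fact is a fallback. Telling the
container the right charset before the parameters are parsed, for instance with
request.setCharacterEncoding() in a filter, avoids the problem in the first place.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin2Repair {

    // Assumption: this JVM provides ISO-8859-2 (Java does not guarantee it).
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    /** Decodes raw ISO-8859-2 bytes the way an ISO-8859-1 default would. */
    static String misread(byte[] latin2Bytes) {
        return new String(latin2Bytes, StandardCharsets.ISO_8859_1);
    }

    /**
     * Recovers the intended text from a string wrongly decoded as
     * ISO-8859-1. This works because ISO-8859-1 maps every byte value to
     * exactly one character, so the original bytes can be recovered.
     */
    static String repair(String misread) {
        return new String(misread.getBytes(StandardCharsets.ISO_8859_1), LATIN2);
    }

    public static void main(String[] args) {
        // The city name Lodz with its Polish diacritics, written as escapes
        String original = "\u0141\u00f3d\u017a";
        String garbled = misread(original.getBytes(LATIN2));
        System.out.println(garbled.equals(original));         // false
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```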

Seriously, though, look into commons-fileupload if you want to actually
upload files (or even if you just really want to use multipart/form-data).

>> You can use commons-upload, which was intended to be used with file
>> uploads, and will probably read "simple" multipart/form-data fields as
>> well.
>>
> That's interesting, in a general sense.  I didn't know that one.  Where
> does it live ?

Sorry, I had the name (slightly) wrong and made an assumption that you'd
know what the heck "commons-foo" would mean. Apache commons is an Apache
site that hosts lots of small and super-useful Java libraries. The home
page is http://commons.apache.org/ (worth checking out everything they
have available) and commons-fileupload can be found here
http://commons.apache.org/fileupload/

> Unfortunately here, since I cannot modify the servlet, I'm stuck.
> But the setRequestCharacterEncoding filter will help in this case.

Hmm, if you can't modify the servlet you might be out of luck. Or, you
could always write a filter.... muhahaha!

> Ok, I found it. It is FileUpload, at http://commons.apache.org/fileupload/
> and it looks like Java may be as smart as perl after all ;-)

You could always switch to Python:

import Brain;

-chris


Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Christopher Schultz wrote:
> André,
> 
> (Man, I need to get a keyboard mapping for "é". This copy-and-paste
> thing is such a drag...)

Well, you can use Andre, I don't mind and I'm used to all kinds of 
spellings.  Or you can use André , the special form for people who 
haven't dominated their MIME charsets yet ;-)

>> Plus, the fact that the same applications often do offer the possibility
>> to submit very large non-USASCII text fields.
> 
> The size of the fields shouldn't be an issue, unless you want to stream
> the data yourself. 

Well yes it does, in a number of situations.  Think for example about 
webserver logs, where these things then appear as a very very long 
string, percent-escaped to boot.

[...]

> 
>> I took an early aversion to application/x-www-form-urlencoded, 
...
> No, it will work and it's better than GET because it's encoded using the
> Content-Type of the request, rather than God-knows-what given the
> browser settings.

There is no "Content-Type of the request".  Try it : make a GET request 
(or a POST with application/x-www-form-urlencoded), and look for a 
request Content-Type with a charset.
For a GET there is no content-type (because there is no request body).
For a POST there is a content-type, but without charset.

The gist of it is : for an "enctype=application/x-www-form-urlencoded" 
(whether explicit or by default), the URL is encoded in whatever charset 
the browser feels like encoding it. Which MAY depend on what the browser 
thinks the charset of the page is, which contains the <form>; or the 
"accept-charset" attribute of the form tag, or the user's preferences.
But whatever the browser is in the end sending you, it does not say.

> The only differences I see between multipart/form-data and
> application/x-www-form-urlencoded encoding types are the W3C's choice for
> the default and the servlet spec's requirement (both x-www) and the W3C's
> statement about <input type="file" />.
> 
http://www.w3.org/TR/html401/interact/forms.html#adef-enctype
says, quote :
The value "multipart/form-data" should be used in combination with the 
INPUT element, type="file".
unquote
Note that it does /not/ say that it should /not/ be used with something 
else. What it says is that if you upload a file, you SHOULD use the 
multipart/form-data content encoding, because of course it does not make 
any sense to send the whole file as a "&file=...(10MB)...." 
application/x-www-form-urlencoded string, percent-escaped to boot.

>> It is also a big disappointment to see (you are right, I checked) that
>> the Servlet Spec does not foresee a simple method to get the parameter
>> values if they are posted via the multipart/form-data encoding method.
> 
> This is because the implication of using multipart/form-data is that the
> app code will read its own stream. If you upload a 100MB file, do you
> want that whole thing in memory as a (useless) String value?

Let me introduce you to the hidden beauties of Perl, and of the CGI.pm 
module.  Read this :
http://cpan.uwinnipeg.ca/htdocs/CGI.pm/CGI.html#CREATING_A_FILE_UPLOAD_FIELD
You can skip the first part, which is about creating a file upload field 
when composing a form.  The second part, starting at this shaded box :
    $filename = param('uploaded_file');
explains what happens at the server side when reading such a request 
parameter.  Essentially you do :
$filename = param('name'); (Java : f = req.getParameter("name");)
In Perl, $filename is now a string containing the uploaded /filename/, 
as explained.  But $filename is also ("magically") a /filehandle/, as 
soon as you treat it like one and read from it.  That filehandle is 
connected to a temporary file in which the module has already read and 
saved the file part as uploaded by the browser.
So, no, it is not a 10 MB string in memory.
If the programmer closes that filehandle, the file is automatically 
deleted from whatever temporary space it occupied.
Keep reading, and don't miss the
  $type = uploadInfo($filename)->{'Content-Type'};

> 
> So what is Perl's default charset? I find it hard to believe that Perl
> just magically works with the same missing charset information.
> 
"magically" is a word full of connotations, in perl.
(Like "any sufficiently advanced technology..")
But you are right, even perl cannot magically determine the charset if 
the browser does not supply it.
In our applications, we are the ones sending the forms to the client, 
and we know the type of encoding to expect from them.
Just to keep people honest, we also always include a hidden parameter 
containing a UTF-8 string with non-US-ASCII characters, and check the 
returned length (in bytes and in characters) when the form is submitted.
If there is a discrepancy between them, we know that the form 
parameter's encoding is not what it should be, and reject the post.

>> Read the body myself and parsing it ? in 2009 ?
> 
> Yes, read it yourself. You told the servlet container that you wanted to
> do it. I'm actually surprised that getParameter() gets you any of your
> POST form data when using multipart/form-data. You never did say how it
> failed: do you get a bad String (misinterpreted) or do you get null
> because getParameter didn't parse the request?

It doesn't because so far I am not processing form posts in Java servlets.
This discussion started because I need to do it now, in relation with 
the same external application for which I posted some questions about 
BufferedInputStreamReader's and such a while ago. Then, it was related 
to the fact that this application was sending wrong output back to the 
browser (iso-8859-2 but with a iso-8859-1 output charset).  That, I had 
to fix using an output filter module back at the Apache level.
Now I have the problem in reverse : the application gets input from an 
iso-8859-2 form, in iso-8859-2, but is interpreting it as iso-8859-1.
I was just wondering if by changing the form to use the 
multipart/form-data encoding type, the servlet would "magically" realise 
the error of its ways, and read the data properly.
Apparently however, browsers and HTTP and Servlet Specs conspire to make 
my life difficult.

> 
>> Now come on, I am sure that there must exist some standard Java library
>> usable in a servlet context, and which does that, no ?
> 
> You can use commons-upload, which was intended to be used with file
> uploads, and will probably read "simple" multipart/form-data fields as well.
> 
That's interesting, in a general sense.  I didn't know that one.  Where 
does it live ?
Unfortunately here, since I cannot modify the servlet, I'm stuck.
But the setRequestCharacterEncoding filter will help in this case.


Ok, I found it. It is FileUpload, at http://commons.apache.org/fileupload/
and it looks like Java may be as smart as perl after all ;-)




Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
André,

(Man, I need to get a keyboard mapping for "é". This copy-and-paste
thing is such a drag...)

On 3/16/2009 6:09 PM, André Warnier wrote:
> Christopher Schultz wrote:
>>
>> Quick question: multipart/form-data is typically used for file upload...
>> why not use application/x-www-form-urlencoded instead? I realize the
>> problem is that certain browsers do not send the proper charset in the
>> Content-Type, but I'd like to understand your affinity for
>> multipart/form-data.
>>
> This :
> http://www.w3.org/TR/html401/interact/forms.html#h-17.13
> See the note in green at the end of 17.13.1 Form submission method.

The W3C docs for 'enctype' mention that multipart/form-data should be
used with <input type="file" />
http://www.w3.org/TR/html401/interact/forms.html#adef-enctype

> Plus, the fact that our applications (area : document management) very
> often do offer the possibility to upload a file from within forms.

Gotcha.

> Plus, the fact that the same applications often do offer the possibility
> to submit very large non-USASCII text fields.

The size of the fields shouldn't be an issue, unless you want to stream
the data yourself. But you mentioned using getParameter, so you're /not/
streaming the data yourself.

> Plus, the fact that most of my activity relates to users who are not
> mainly English-speaking and do not use a US keyboard to fill-in web forms.
> Plus, the fact that having seen HTTP/HTML being born, I remember the
> time when URL's were typically limited in size, in a manner inconsistent
> between platforms. That might still be the case.

It is :)

> Somewhat abusively I admit, I took an early aversion to
> application/x-www-form-urlencoded, as synonymous to GET, to non-capable
> of anything but US-ASCII (ok, iso-8859-1 at a stretch, but see the above
> green note) and to "nobody agrees as to the proper percent encoding and
> at what moment it should take place or not".

No, it will work and it's better than GET because it's encoded using the
Content-Type of the request, rather than God-knows-what given the
browser settings.

> The multipart/form-data encoding does not have all of these
> connotations, and should be a foolproof way for a browser to send data
> to a server without any size limit or charset ambiguity.

The only differences I see between multipart/form-data and
application/x-www-form-urlencoded encoding types are the W3C's choice for
the default and the servlet spec's requirement (both x-www) and the W3C's
statement about <input type="file" />.

> It is also a big disappointment to see (you are right, I checked) that
> the Servlet Spec does not foresee a simple method to get the parameter
> values if they are posted via the multipart/form-data encoding method.

This is because the implication of using multipart/form-data is that the
app code will read its own stream. If you upload a 100MB file, do you
want that whole thing in memory as a (useless) String value?

> That is probably because for 10 years or so, I have been using this
> under Apache and perl without any problems at all : I just use the
> equivalent of GetParameter() there, without having to worry a jot about
> the request encoding; and why should I have to ?

So what is Perl's default charset? I find it hard to believe that Perl
just magically works with the same missing charset information.

> Read the body myself and parsing it ? in 2009 ?

Yes, read it yourself. You told the servlet container that you wanted to
do it. I'm actually surprised that getParameter() gets you any of your
POST form data when using multipart/form-data. You never did say how it
failed: do you get a bad String (misinterpreted) or do you get null
because getParameter didn't parse the request?

> Now come on, I am sure that there must exist some standard Java library
> usable in a servlet context, and which does that, no ?

You can use commons-upload, which was intended to be used with file
uploads, and will probably read "simple" multipart/form-data fields as well.

-chris


Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Christopher Schultz wrote:
> 
> Quick question: multipart/form-data is typically used for file upload...
> why not use application/x-www-form-urlencoded instead? I realize the
> problem is that certain browsers do not send the proper charset in the
> Content-Type, but I'd like to understand your affinity for
> multipart/form-data.
> 
This :
http://www.w3.org/TR/html401/interact/forms.html#h-17.13
See the note in green at the end of 17.13.1 Form submission method.

Plus, the fact that our applications (area : document management) very 
often do offer the possibility to upload a file from within forms.
Plus, the fact that the same applications often do offer the possibility 
to submit very large non-USASCII text fields.
Plus, the fact that most of my activity relates to users who are not 
mainly English-speaking and do not use a US keyboard to fill in web forms.
Plus, the fact that having seen HTTP/HTML being born, I remember the 
time when URLs were typically limited in size, in a manner inconsistent 
between platforms. That might still be the case.

Somewhat abusively I admit, I took an early aversion to 
application/x-www-form-urlencoded, which I saw as synonymous with GET, as 
incapable of anything but US-ASCII (ok, iso-8859-1 at a stretch, but see 
the above green note) and as a case of "nobody agrees as to the proper 
percent encoding and at what moment it should take place or not".

The multipart/form-data encoding does not have all of these 
connotations, and should be a foolproof way for a browser to send data 
to a server without any size limit or charset ambiguity.

It is therefore a big surprise and a big disappointment to see that 
browser developers do not take advantage of this, for some reason I have 
trouble fathoming (because it's there, it is well-defined, it is easy to 
do, and it would save a lot of problems).

It is also a big disappointment to see (you are right, I checked) that 
the Servlet Spec does not provide a simple method to get the parameter 
values if they are posted via the multipart/form-data encoding method.
My disappointment is probably because for 10 years or so, I have been 
using this under Apache and perl without any problems at all : I just use 
the equivalent of getParameter() there, without having to worry a jot 
about the request encoding; and why should I have to ?
Read the body myself and parse it ? in 2009 ?

Now come on, I am sure that there must exist some standard Java library 
usable in a servlet context, and which does that, no ?



Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

FYI After logging, this seems to be one of the most-discussed topics on
the list.

On 3/16/2009 9:54 AM, André Warnier wrote:
> I am about 99% sure of the following, but I would like to be 100% sure.

To sum up:

1. Using <meta> to set the Content-Type of the page to
   charset ISO-8859-2
2. Submitting a POST form with non-ASCII characters (those that
   will only work properly when respecting ISO-8859-2)
   and enctype="multipart/form-data"
3. Trying to use HttpServletRequest.getParameter()

> then, if this form is submitted, within my servlet the line
> 
> String p1 = request.getParameter("param1");
> 
> would always return into p1, the proper internal Java Unicode string
> value of the input element "param1" of the form, properly decoded from
> it's original iso-8859-2 encoding.
> Yes ?

No. The servlet spec (SRV 3.1.1) states that POST data will only be read
from the request when the following conditions are true (note #3):

"
1. The request is an HTTP or HTTPS request.
2. The HTTP method is POST.
3. The content type is application/x-www-form-urlencoded.
4. The servlet has made an initial call of any of the getParameter
   family of methods on the request object.
"

Since you are using multipart/form-data, Tomcat isn't supposed to read
the POST parameters. You will have to do this yourself. If your client
is not sending a Content-Type including a character encoding, then you
have a client who isn't playing nicely. :( Most people give up and just
set everything to UTF-8 and be done with it.
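
As a rough illustration of the rule quoted above, conditions 2 and 3 boil
down to a check on the method and the media type. This is only a sketch
using the JDK alone, not Tomcat's actual code; the class and method names
are made up for the example.

```java
import java.util.Locale;

final class PostParameterRule {
    /**
     * Hypothetical helper: true when a container following the quoted rule
     * would parse the request body into getParameter() values.
     */
    static boolean bodyIsParsedForParameters(String method, String contentType) {
        if (!"POST".equals(method) || contentType == null) {
            return false;
        }
        // Drop any parameters such as "; charset=..." or "; boundary=...".
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon < 0 ? contentType : contentType.substring(0, semicolon))
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("application/x-www-form-urlencoded");
    }

    public static void main(String[] args) {
        System.out.println(bodyIsParsedForParameters(
                "POST", "application/x-www-form-urlencoded; charset=UTF-8")); // true
        System.out.println(bodyIsParsedForParameters(
                "POST", "multipart/form-data; boundary=x"));                  // false
    }
}
```

With enctype="multipart/form-data" the check fails, which is why the body
has to be read by the application (or by a library on its behalf).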

Mikolaj's experience suggests that his client doesn't send the right
Content-Type (charset, really) and so Tomcat defaults to ISO-8859-1.
Most people use a filter that checks to see what the character encoding
is and, if there is none, sets the default to whatever pages advertise
themselves as (often UTF-8, in your case ISO-8859-2). This fixes 90% of
the POST encoding problems.

GET is another issue. :(

You asked how the server asks the client to encode a request. There's
really no provision for that in the HTTP spec. Anecdotal evidence
suggests that request (N + 1) is sent using the encoding of response N,
meaning that the client tends to use the encoding of the server's last
response.

Your statement about GET requests being (not) covered under a
shortcoming of the HTTP and URL specs is spot on: you basically can't
count on correct non-ISO-8859-1 characters in a URL. The solution? Use POST.

Quick question: multipart/form-data is typically used for file upload...
why not use application/x-www-form-urlencoded instead? I realize the
problem is that certain browsers do not send the proper charset in the
Content-Type, but I'd like to understand your affinity for
multipart/form-data.

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkm+w9kACgkQ9CaO5/Lv0PCXDgCdHi/cBwJgafNE5yR636FaXyHi
w24An0AMx7XXG8PRpjszGFmWM6KNWlnc
=Mtww
-----END PGP SIGNATURE-----


Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 3/16/2009 6:59 PM, André Warnier wrote:
> The real fix would be HTTP 1.2, specifying once and for all that the
> default encoding for query parameters is Unicode/UTF-8

Yes. Given that HTTP/1.1 clients should include Content-Type yet don't,
how long do you think adoption of HTTP/1.2 will take? ;)

> I fail to understand why the powers-that-be did not reach that
> conclusion several years ago already.

Because it was the swingin' '90s, baby! Honestly, I'm very surprised
that CERN, being located in a country with so many languages, was
content to stick with ISO-8859-1.

> This being an English-speaking list, I also assume that whatever time is
> seen here being spent discussing it, is only a biased view of the
> overall situation.

Yup. I can tell you that in the US, non-ASCII characters are usually an
afterthought. I've seen many sites that don't accept (or don't properly
handle) anything but [a-zA-Z0-9]. :(

> Since the Apache CON Europe is upcoming, and since I'm planning to
> attend, I wonder if a bit of stirring up matters there would help.

Go to MozillaCON or OperaCON or something. Drop a huge metal W3C on
Microsoft's front lawn
(http://home.snafu.de/tilman/mozilla/stomps.html). /That's/ where you
need to complain; it's the browsers that are very conservative, which
makes sense given that it will take a while for everyone on the planet
to upgrade to HTTP/1.2-compatible server software.

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkm++X8ACgkQ9CaO5/Lv0PC3hACeNdLiXAdi9hQQ/ZNtToispZ9T
G8AAn1E2trLhbvIS+g1ULmlDcmoNNa4B
=UuI/
-----END PGP SIGNATURE-----


Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> André,
> 
> On 3/16/2009 11:53 AM, André Warnier wrote:
>> As far as I understand the HTTP specs, something like
>> request.setCharacterEncoding() should only be used (with a charset
>> different from iso-8859-1) when a request comes in without a charset
>> specification (which also indicates a broken client).
> 
> This is my interpretation of the spec.
> 
> This is my implementation of a fix:
> 
[...]
Thanks.
I already found that one, and it is a fix.
But it's a miserable one in the wider context of a multilingual WWW.
The real fix would be HTTP 1.2, specifying once and for all that the 
default encoding for query parameters is Unicode/UTF-8 + 
percent-encoding when needed (as far as I can think, only when appended 
to the URL).
Since Unicode/UTF-8 can represent all characters known to man and more, 
I fail to understand why the powers-that-be did not reach that 
conclusion several years ago already.

As you yourself mention in another post, these issues occupy a 
significant portion of the bandwidth of anything to do with the WWW, and 
probably cause the annual loss of thousands of work hours.
This being an English-speaking list, I also assume that whatever time is 
seen here being spent discussing it, is only a biased view of the 
overall situation.

Since the Apache CON Europe is upcoming, and since I'm planning to 
attend, I wonder if a bit of stirring up matters there would help.


Re: form parameters

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 3/16/2009 11:53 AM, André Warnier wrote:
> As far as I understand the HTTP specs, something like
> request.setCharacterEncoding() should only be used (with a charset
> different from iso-8859-1) when a request comes in without a charset
> specification (which also indicates a broken client).

This is my interpretation of the spec.

This is my implementation of a fix:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterConfig;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

/**
 * A filter to ensure that a valid character encoding is available to the
 * request.<p>
 *
 * @author Chris Schultz
 * @version $Revision: 1.2 $ $Date: 2006-07-14 14:23:43 $
 */
public class EncodingFilter
    implements Filter
{
    public static final String DEFAULT_ENCODING = "UTF-8";

    private String _encoding;

    /**
     * Called by the servlet container to indicate to a filter that it is
     * being put into service.<p>
     *
     * @param config The Filter configuration.
     */
    public void init(FilterConfig config)
    {
        _encoding = config.getInitParameter("encoding");
        if(null == _encoding)
            _encoding = DEFAULT_ENCODING;
    }

    protected String getDefaultEncoding()
    {
        return _encoding;
    }

    /**
     * Performs the filtering operation provided by this filter.<p>
     *
     * This filter performs the following:
     *
     * Sets the character encoding on the request to that specified in the
     * init parameters, but only if the request does not already have
     * a specified encoding.
     *
     * @param request The request being made to the server.
     * @param response The response object prepared for the client.
     * @param chain The chain of filters providing request services.
     */
    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
        throws IOException, ServletException
    {
        if(null == request.getCharacterEncoding())
            request.setCharacterEncoding(this.getDefaultEncoding());

        chain.doFilter(request, response);
    }

    /**
     * Called by the servlet container to indicate that a filter is being
     * taken out of service.<p>
     */
    public void destroy()
    {
    }
}

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkm+xZkACgkQ9CaO5/Lv0PCTvwCeL5ppLbBLpFyF+FZKYtEumfhE
t1wAn2V0bm8OCkQo6EyHJ9WQXhgvCoWf
=XR+P
-----END PGP SIGNATURE-----


Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Gregor Schneider wrote:
> On Mon, Mar 16, 2009 at 3:10 PM, Mikolaj Rydzewski <mi...@ceti.pl> wrote:
>> It doesn't work for me. By default Tomcat uses ISO-8859-1 encoding. And it
>> will try this encoding to parse input parameters.
>>
> 
> That's true, I'm doing the same here for German Umlaute, however:
> 
> One link in the Wiki is pointing to HTTP specification section 3.4.1,
> however, there's something that I  do not understand:
> 
> The specs say in 3.4.1:
> 
> <quote>
> HTTP/1.1 recipients MUST respect the
>    charset label provided by the sender; and those user agents that have
>    a provision to "guess" a charset MUST use the charset from the
>    content-type field if they support that charset, rather than the
>    recipient's preference, when initially displaying a document. See
>    section 3.7.1.
> </quote>
> 
> So, as a non-native English speaker, I understand it to mean that
> the charset label you send must be honoured - or do I get it wrong
> here? So, if UTF-8 is specified as the charset, why isn't it
> accepted then?
> 

+1.

In other words, according to the HTTP specs (and the Servlet Specs SRV 
4.9), if the client sends a form content using the "multipart/form-data" 
encoding, and specifies a charset for one of the parts, then the servlet 
engine should decode it that way.

And if the client sends a form content using the "multipart/form-data" 
encoding but does not specify a charset for any given part, then the 
servlet engine should consider that it is iso-8859-1, this being the 
default HTTP encoding.

As far as I understand the HTTP specs, something like 
request.setCharacterEncoding() should only be used (with a charset 
different from iso-8859-1) when a request comes in without a charset 
specification (which also indicates a broken client).
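
To make the rule in the paragraphs above concrete, here is a minimal
sketch of "use the charset the client sent, otherwise fall back to a
default". It uses only the JDK; the class and method names are
hypothetical, and a real container does considerably more validation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RequestCharset {
    /** Returns the charset named in a Content-Type value, or the fallback if none is given. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] pieces = contentType.split(";");
        // pieces[0] is the media type; the rest are parameters.
        for (int i = 1; i < pieces.length; i++) {
            String param = pieces[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim();
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1); // strip optional quotes
                }
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset httpDefault = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType(
                "application/x-www-form-urlencoded; charset=UTF-8", httpDefault));
        System.out.println(fromContentType(
                "multipart/form-data; boundary=abc", httpDefault));
    }
}
```

The second call has no charset parameter, so the fallback is returned.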

Comments anyone ?



Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Gregor Schneider wrote:
> I found this one:
> 
> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
> 
> Actually, to me it's not clear why Tomcat should assume the input
> is encoded in ISO-8859-1, when one can give detailed information
> about how the form data is encoded.
> 
> If I understand it correctly, one can even *force* any client (as long
> as the client is following the specs) to encode the form data using
> the "accept-charset" attribute of the <form> element.
> 
> IOW:
> 
> Setting accept-charset="UTF-8" should solve the problems.
> 
> Comments, anyone?
> 
Yes.
But no, it does not seem to work.
I was under the same impression as you indicate above, and I already 
knew about the <form accept-charset=..>
But I just tested this in Firefox 2 and in IE 6, and it does not work as 
expected.

This is my test :

1) I created a html page as follows :
-- begin --
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<form name="f1" action="http://mira.wissensbank.com/pcgi/printenv.pl" 
method="POST"
  enctype="multipart/form-data" accept-charset="UTF-8">
  First param: <input name="param1" type="text" value="andré"><br/>
  Second param: <input name="param2" type="text" value="gregör"><br/>
  <input name="go" type="submit" value="GO"><br/>
</form>
</body>
</html>
-- end --

The above file is created with a UTF-8 aware editor, and the characters 
in it (in "andré" and "gregör")(the umlaut is mine, as a test), are 
encoded as UTF-8. I saved the file as UTF-8 without BOM.  As you can 
see, the document contains a <meta> tag indicating the page encoding, 
and the form contains an "accept-charset" attribute with the same value.

2) I opened this file in Firefox 2.0 and clicked the GO button.
Since I open this as a local file, there is no "Content-Type" header 
coming from the server to confuse things.
In Firefox, I have the LiveHttpHeaders plugin installed, which allows me 
to see the request as sent to the server, and save a copy of it, which I 
did.  This is the result :

-- begin --
POST /pcgi/printenv.pl HTTP/1.1
Host: mira.wissensbank.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.15) 
Gecko/20080623 Firefox/2.0.0.15
Accept: 
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.7,de-de;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: multipart/form-data; 
boundary=---------------------------218302158314236
Content-Length: 350
-----------------------------218302158314236
Content-Disposition: form-data; name="param1"

andré
-----------------------------218302158314236
Content-Disposition: form-data; name="param2"

gregör
-----------------------------218302158314236
Content-Disposition: form-data; name="go"

GO
-----------------------------218302158314236--
-- end --

3) I did the same in Internet Explorer 6.0, which has another plugin of 
similar functionality (Fiddler), with which I can capture the whole request.
Here it is :
-- begin --
POST /pcgi/printenv.pl HTTP/1.1
Accept: */*
Accept-Language: de
Content-Type: multipart/form-data; 
boundary=---------------------------7d98c5bb072c
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET 
CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
Host: mira.wissensbank.com
Content-Length: 338
Connection: Keep-Alive
Pragma: no-cache

-----------------------------7d98c5bb072c
Content-Disposition: form-data; name="param1"

andré
-----------------------------7d98c5bb072c
Content-Disposition: form-data; name="param2"

gregör
-----------------------------7d98c5bb072c
Content-Disposition: form-data; name="go"

GO
-----------------------------7d98c5bb072c--
-- end --

So, as anyone can see, neither one of these browsers is adding any 
charset information to the POST.  Which I personally find very strange, 
and rather on the bad side of the HTTP specs.

Which tends to confirm the note in SRV 4.9 of the Servlet Specs 2.4/2.5 :
"Currently, many browsers do not send a char encoding qualifier with the 
Content-Type header, leaving open the determination of the character 
encoding for reading HTTP requests."

Which also seems to contradict the HTML specs which you mention :
http://www.w3.org/TR/html401/interact/forms.html#h-17.13
and following paragraphs. (Note by the way the "Note" at the end of 17.13.1)
In particular, this one from section "17.13.4 Form content types" :
As with all multipart MIME types, each part has an optional 
"Content-Type" header that defaults to "text/plain". User agents should 
supply the "Content-Type" header, accompanied by a "charset" parameter.

Well, Firefox 2.0 and IE 6.0 don't supply a "Content-Type", let alone 
a charset.
In the case of IE 6.0, I am not really surprised, but in the case of 
Firefox, who would have thunk ?
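
To show what the missing charset means on the receiving side, here is a
minimal sketch of decoding simple text fields from a multipart/form-data
body like the two captured above. It is an illustration under
assumptions, not a complete parser: it uses only the JDK, ignores file
parts and header folding, and the class and method names are invented
for the example. A part-level charset is used if present; otherwise the
caller has to supply one, which is exactly the guess the server is left
with.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultipartTextFields {
    private static final Pattern NAME = Pattern.compile("\\bname=\"([^\"]*)\"");
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    static Map<String, String> parse(byte[] body, String boundary, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the structure can
        // be scanned as text and the original bytes recovered afterwards.
        String raw = new String(body, StandardCharsets.ISO_8859_1);
        Map<String, String> fields = new LinkedHashMap<>();
        for (String segment : raw.split(Pattern.quote("--" + boundary))) {
            if (segment.startsWith("--")) {
                break; // closing delimiter "--boundary--"
            }
            int headerEnd = segment.indexOf("\r\n\r\n");
            if (headerEnd < 0) {
                continue; // preamble or malformed part
            }
            String headers = segment.substring(0, headerEnd);
            String content = segment.substring(headerEnd + 4);
            if (content.endsWith("\r\n")) {
                content = content.substring(0, content.length() - 2);
            }
            Matcher name = NAME.matcher(headers);
            if (!name.find()) {
                continue;
            }
            Matcher declared = CHARSET.matcher(headers);
            Charset charset = declared.find() ? Charset.forName(declared.group(1)) : fallback;
            byte[] valueBytes = content.getBytes(StandardCharsets.ISO_8859_1);
            fields.put(name.group(1), new String(valueBytes, charset));
        }
        return fields;
    }

    public static void main(String[] args) {
        // A UTF-8 encoded field with no per-part Content-Type, as in the captures.
        byte[] body = ("--XyZ\r\n"
                + "Content-Disposition: form-data; name=\"param1\"\r\n"
                + "\r\n"
                + "andr\u00e9\r\n"
                + "--XyZ--\r\n").getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(body, "XyZ", StandardCharsets.UTF_8));
        System.out.println(parse(body, "XyZ", StandardCharsets.ISO_8859_1));
    }
}
```

The same bytes give the intended value with the right fallback and a
garbled one with the wrong fallback; nothing in the request says which
is which.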


Anyway, it kind of puts a spin on what I posted here before : even in 
the case of a html form which should have everything in it to leave no 
choice to the browser, the servlet engine still does not get any 
information about the real character set of the data sent by the browser.

Which personally, in our day and age, I find absolutely terrible.

I will now try to re-test this with Firefox 3 and IE 7.

Update : just tested with Firefox 3.1 beta, it does not send a 
Content-Type or charset either.
I am puzzled as to why.



Re: form parameters

Posted by Mark Thomas <ma...@apache.org>.
Gregor Schneider wrote:
> I found this one:
> 
> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
> 
> Actually, to me it's not clear why Tomcat should assume the input
> is encoded in ISO-8859-1, when one can give detailed information
> about how the form data is encoded.
> 
> If I understand it correctly, one can even *force* any client (as long
> as the client is following the specs) to encode the form data using
> the "accept-charset" attribute of the <form> element.
> 
> IOW:
> 
> Setting accept-charset="UTF-8" should solve the problems.
> 
> Comments, anyone?

Yes it should work but it won't. Tomcat will honour it if sent but the browsers
don't send it. See http://markmail.org/message/zozxd3iqp47ciisw

Mark

> 
> Rgds
> 
> Gregor



Re: form parameters

Posted by Gregor Schneider <rc...@googlemail.com>.
I found this one:

http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset

Actually, to me it's not clear why Tomcat should assume the input
is encoded in ISO-8859-1, when one can give detailed information
about how the form data is encoded.

If I understand it correctly, one can even *force* any client (as long
as the client is following the specs) to encode the form data using
the "accept-charset" attribute of the <form> element.

IOW:

Setting accept-charset="UTF-8" should solve the problems.

Comments, anyone?

Rgds

Gregor
-- 
just because your paranoid, doesn't mean they're not after you...
gpgp-fp: 79A84FA526807026795E4209D3B3FE028B3170B2
gpgp-key available @ http://pgpkeys.pca.dfn.de:11371


Re: form parameters

Posted by André Warnier <aw...@ice-sa.com>.
Joseph Millet wrote:
> The thing is, you've got an HTML form that you tell browsers is
> ISO-8859-2, so when they post it to the form's target URL, it gets sent
> encoded as ISO-8859-2. It is then your responsibility to parse
> incoming queries 

Sorry, but I think this is incorrect.
According to the HTTP specs, the client should specify a character 
encoding, and the server should respect the specified character encoding 
indicated by the client, and not guess.

...
 > in the encoding you asked it to be encoded.

How does the server ask the client to encode his request ?

> 
> Depending upon your requirements, UTF-8 will fit most of any languages
> needs

True, but irrelevant here until the HTTP specs are revised.

There is one confusing case : a GET request, with the parameters encoded 
in the URL, because in that case the client has no defined way to 
specify the real character encoding of the request parameters.
That is a shortcoming of the HTTP and URL specs.
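
A small sketch of that ambiguity, using the JDK's URLEncoder and
URLDecoder (the wrapper class and its method names are made up for the
example): the same typed value produces different query strings
depending on the charset the client happened to use, and the URL itself
carries no hint as to which one it was.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class QueryStringCharset {
    /** Percent-encodes a value using the named charset. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Percent-decodes a value, interpreting the bytes in the named charset. */
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String typed = "andr\u00e9"; // "andré"
        System.out.println(encode(typed, "ISO-8859-1")); // andr%E9
        System.out.println(encode(typed, "UTF-8"));      // andr%C3%A9
        // A server that guesses the wrong charset gets a different string back.
        String guessedWrong = decode(encode(typed, "UTF-8"), "ISO-8859-1");
        System.out.println(guessedWrong.equals(typed));  // false
    }
}
```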


Re: form parameters

Posted by Joseph Millet <jo...@gmail.com>.
The thing is, you've got an HTML form that you tell browsers is
ISO-8859-2, so when they post it to the form's target URL, it gets sent
encoded as ISO-8859-2. It is then your responsibility to parse
incoming queries in the encoding you asked it to be encoded.

Depending upon your requirements, UTF-8 will fit the needs of most
languages, but there are cases where you want to store some languages in
specific charsets, as converting from specific charsets to Unicode
is reasonably reliable whereas converting from Unicode to
specific charsets can be tricky in some cases.

However, in your case the data is posted in ISO-8859-2, so you'll need to
convert it if you want to manipulate it as Unicode, using
something similar to this:

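String value = request.getParameter("mytext");
try {
    // Assumption: the container decoded the raw bytes as ISO-8859-1 (its
    // default), so recover them with that charset, then decode them as
    // ISO-8859-2. A bare getBytes() would use the platform default instead.
    if (value != null) {
        value = new String(value.getBytes("ISO-8859-1"), "ISO-8859-2");
    }
} catch (java.io.UnsupportedEncodingException ex) {
    System.err.println(ex);
}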

But there might be some easier method and I'm not a JSP Guru ...
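
For what it's worth, the round trip behind this kind of conversion can
be tried outside a servlet with the JDK alone. The sketch below assumes
the container decoded the bytes as ISO-8859-1 while the browser really
sent ISO-8859-2; the class and method names are invented for the
example. It names the intermediate charset explicitly rather than
relying on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin2Recovery {
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    /** Re-decodes a string that was wrongly decoded as ISO-8859-1. */
    static String redecode(String misdecoded, Charset actual) {
        // ISO-8859-1 maps bytes to chars one-to-one, so this step recovers
        // exactly the bytes that arrived on the wire.
        byte[] wireBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wireBytes, actual);
    }

    public static void main(String[] args) {
        String typed = "\u0142\u00f3d\u017a"; // Polish "lodz" with diacritics
        byte[] wire = typed.getBytes(LATIN2); // what the browser sends
        String asContainerSeesIt = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(asContainerSeesIt.equals(typed));                // false
        System.out.println(redecode(asContainerSeesIt, LATIN2).equals(typed)); // true
    }
}
```

This only works because the first, wrong decoding was lossless; if the
container had decoded as UTF-8 instead, the original bytes could not be
recovered this way.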

- Joseph

On Mon, Mar 16, 2009 at 3:40 PM, Gregor Schneider <rc...@googlemail.com> wrote:
> On Mon, Mar 16, 2009 at 3:10 PM, Mikolaj Rydzewski <mi...@ceti.pl> wrote:
>>
>> It doesn't work for me. By default Tomcat uses ISO-8859-1 encoding. And it
>> will try this encoding to parse input parameters.
>>
>
> That's true, I'm doing the same here for German Umlaute, however:
>
> One link in the Wiki is pointing to HTTP specification section 3.4.1,
> however, there's something that I  do not understand:
>
> The specs say in 3.4.1:
>
> <quote>
> HTTP/1.1 recipients MUST respect the
>   charset label provided by the sender; and those user agents that have
>   a provision to "guess" a charset MUST use the charset from the
>   content-type field if they support that charset, rather than the
>   recipient's preference, when initially displaying a document. See
>   section 3.7.1.
> </quote>
>
> So, as a non-native English speaker, I understand it to mean that
> the charset label you send must be honoured - or do I get it wrong
> here? So, if UTF-8 is specified as the charset, why isn't it
> accepted then?
>
> Rgds
>
> Gregor
> --
> just because your paranoid, doesn't mean they're not after you...
> gpgp-fp: 79A84FA526807026795E4209D3B3FE028B3170B2
> gpgp-key available @ http://pgpkeys.pca.dfn.de:11371
>


Re: form parameters

Posted by Gregor Schneider <rc...@googlemail.com>.
On Mon, Mar 16, 2009 at 3:10 PM, Mikolaj Rydzewski <mi...@ceti.pl> wrote:
>
> It doesn't work for me. By default Tomcat uses ISO-8859-1 encoding. And it
> will try this encoding to parse input parameters.
>

That's true, I'm doing the same here for German Umlaute, however:

One link in the Wiki is pointing to HTTP specification section 3.4.1,
however, there's something that I  do not understand:

The specs say in 3.4.1:

<quote>
HTTP/1.1 recipients MUST respect the
   charset label provided by the sender; and those user agents that have
   a provision to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather than the
   recipient's preference, when initially displaying a document. See
   section 3.7.1.
</quote>

So, as a non-native English speaker, I understand it to mean that
the charset label you send must be honoured - or do I get it wrong
here? So, if UTF-8 is specified as the charset, why isn't it
accepted then?

Rgds

Gregor
-- 
just because your paranoid, doesn't mean they're not after you...
gpgp-fp: 79A84FA526807026795E4209D3B3FE028B3170B2
gpgp-key available @ http://pgpkeys.pca.dfn.de:11371


Re: form parameters

Posted by Mikolaj Rydzewski <mi...@ceti.pl>.
André Warnier wrote:
>
> If, inside a html page containing a tag such as
>
> <meta content="text/html; charset=iso-8859-2" http-equiv="Content-Type">
[...]
>
> would always return into p1, the proper internal Java Unicode string 
> value of the input element "param1" of the form, properly decoded from 
> it's original iso-8859-2 encoding.
> Yes ?
Hi,

It doesn't work for me. By default Tomcat uses ISO-8859-1 encoding. And 
it will try this encoding to parse input parameters.

You have to call request.setCharacterEncoding(...) before reading 
parameter values from the request.

The best solution is to have a servlet filter that sets the request 
encoding to some value. I always use UTF-8 and it works with no problems.

http://wiki.apache.org/tomcat/Tomcat/UTF-8

-- 
Mikolaj Rydzewski <mi...@ceti.pl>