Posted to users@tomcat.apache.org by Garret Wilson <ga...@globalmentor.com> on 2019/01/08 21:31:19 UTC
distinction between resource charset and format octet decoding
I have a question (using Tomcat 9.0.12 on Windows 10), and I'd like
someone on the Tomcat development team to clarify a distinction for me
regarding resource charsets and octet decoding in a particular format. I
am not a newbie, and although the answer to my question may seem
obvious, it goes to a critical issue that I believe to be a fundamental
bug in Tomcat encoding processing.
Let's say that as an HTTP client I retrieve a resource `readme.txt` from
Tomcat, and Tomcat clearly indicates via the HTTP response headers that
the `Content-Type` is `text/plain; charset=ISO-8859-1`. That file
contains, among other things, a line that says:
See https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9
for more info.
I want to parse the text file and present a live link to the user (email
clients do this all the time), but I want to make the link "pretty" by
decoding the URL. The question is: do I decode the octets using UTF-8,
to show `…fullName=Flávio+José`, or do I use ISO-8859-1 to decode the
octets, so that I show `…fullName=FlÃ¡vio+JosÃ©`? (Flávio José is a
famous Brazilian forró singer and musician, by the way.)
The content type encoding of `readme.txt` is ISO-8859-1, so I must use
ISO-8859-1 to decode the octets in `Fl%C3%A1vio+Jos%C3%A9`, yielding
`…fullName=FlÃ¡vio+JosÃ©`, right??!
No, of course not. The decoding of the octet sequence is independent of
the resource encoding, and represents a separate layer of encoding _on
top_ of the resource encoding. It wouldn't matter whether the text file
were encoded in UTF-8, ISO-8859-1, or US-ASCII—the URL would still be
https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9, and its
octets should still be decoded using UTF-8 as per RFC 3986.
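To make the two layers concrete, here is a minimal Java sketch. It is only an illustration under my own assumptions: the class name `PercentDecodeDemo` and its `decode` helper are made up, and the only real library call is the standard `java.net.URLDecoder.decode(String, String)`. It decodes the same ASCII text with each candidate charset, whatever charset the surrounding text file was served in:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; only URLDecoder is a real library API here.
class PercentDecodeDemo {

    // Decodes a form-style value: '+' becomes a space and each %XX becomes
    // one octet; the octets are then interpreted in the given charset.
    public static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "Fl%C3%A1vio+Jos%C3%A9";
        // Octets C3 A1 and C3 A9 read as UTF-8: one character each (U+00E1, U+00E9).
        System.out.println(decode(value, "UTF-8"));
        // The same octets read as ISO-8859-1: two characters each
        // (U+00C3 U+00A1 and U+00C3 U+00A9), which is the garbled form.
        System.out.println(decode(value, "ISO-8859-1"));
    }
}
```

The input string is pure ASCII in both calls; only the interpretation of the decoded octets changes.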
I'll get right to the point; the above was a rhetorical question used as
an analogy.
The Tomcat FAQ at
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9 indicates that
the default encoding for an HTTP POST is ISO-8859-1. That is true.
However, Tomcat then goes further and assumes that it should decode
_the octets of `application/x-www-form-urlencoded`_ using ISO-8859-1 as
well! This is simply wrong; the octets should be interpreted as a
sequence of UTF-8 octets; see
https://url.spec.whatwg.org/#concept-urlencoded-serializer . This means
if my browser sends a POST with content `fullName=Fl%C3%A1vio+Jos%C3%A9`
using `application/x-www-form-urlencoded`, Tomcat will interpret this
request parameter as `FlÃ¡vio JosÃ©` in my servlet/JSP, when it should
interpret it as `Flávio José`. (Tomcat correctly decodes the octets when
used as a query parameter rather than a POST parameter.)
Now it may be that the FAQ is simply out of date; it still seems to
think that encoded URI octets should not be interpreted as UTF-8,
completely ignoring RFC 3986. If so, it is long out of date; RFC 3986
came out in 2005. (And indeed, Tomcat works with UTF-8 octets in URIs.)
But out of date or not, the FAQ at
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 then recommends
that to force Tomcat to interpret the
`application/x-www-form-urlencoded` octets as UTF-8, I must set the
`org.apache.catalina.filters.SetCharacterEncodingFilter` filter (in some
`web.xml` file) to `UTF-8`. (I can alternatively put `<%
request.setCharacterEncoding("UTF-8"); %>` in my JSP.) And sure enough,
it fixes the problem.
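Why that setting changes the outcome can be sketched in plain Java. This is a simplified model under stated assumptions, not Tomcat's actual code: the class `FormBodySketch`, the `parse` method, and the `LEGACY_DEFAULT` constant are all hypothetical, and the `requestEncoding` argument merely stands in for whatever `request.setCharacterEncoding(...)` would have set.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a container's form parsing; NOT Tomcat's implementation.
class FormBodySketch {

    // Assumed fallback when the application sets no request encoding.
    static final String LEGACY_DEFAULT = "ISO-8859-1";

    // requestEncoding models the value set via request.setCharacterEncoding(...);
    // null means nothing was set, so the legacy default applies.
    public static Map<String, String> parse(String body, String requestEncoding) {
        String charset = (requestEncoding != null) ? requestEncoding : LEGACY_DEFAULT;
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate empty segments such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = (eq >= 0) ? pair.substring(0, eq) : pair;
            String value = (eq >= 0) ? pair.substring(eq + 1) : "";
            try {
                params.put(URLDecoder.decode(name, charset),
                           URLDecoder.decode(value, charset));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unsupported charset: " + charset, e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "fullName=Fl%C3%A1vio+Jos%C3%A9";
        System.out.println(parse(body, null).get("fullName"));    // garbled form
        System.out.println(parse(body, "UTF-8").get("fullName")); // intended name
    }
}
```

In this model the percent-encoded octets are tied to the request's character encoding, which is exactly the coupling I am objecting to.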
But as discussed above, this is completely wrong: the resource character
encoding of a request sent in `application/x-www-form-urlencoded` should
have absolutely no bearing on how the encoded octets within that
resource are decoded. They must be decoded as UTF-8, irrespective of
what "character encoding" Tomcat assumes the content to be. Tomcat has
updated the way it decodes URIs to support UTF-8; it is time Tomcat does
the same for `application/x-www-form-urlencoded` values. The current
approach is broken in the context of the modern web, and the workaround
is simply wrong.
I also raised this at https://stackoverflow.com/q/54094982/421049 .
I would have filed a Tomcat Bugzilla issue, but the bug report form
indicated I should report the problem on this list first.
Thank you in advance for your attention to this matter.
Garret Wilson
GlobalMentor, Inc.
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 2/6/2020 10:44 AM, Mark Thomas wrote:
> …
> As of Tomcat 10, conf/web.xml contains the following:
>
> <!--
> Set the default request and response character encodings to UTF-8.
> -->
> <request-character-encoding>UTF-8</request-character-encoding>
> <response-character-encoding>UTF-8</response-character-encoding>
>
> That *should* have the effect you are looking for but I confess I
> haven't tested it in any great detail.
Yes! Oh, that is so wonderful. Thank you!
I brought this issue up on the list over a year ago, and I have since
published my entire comprehensive software development course (still
being expanded).
https://www.globalmentor.com/courses/softdev/
The course is centered around Tomcat as the server, and the lesson on
HTML forms contains a section warning to use `<request-character-encoding>`.
https://www.globalmentor.com/courses/softdev/html-forms
Once Tomcat 10 is released I'll be able to update this note as well.
Thanks again!
Garret
Re: [OT] distinction between resource charset and format octet
decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 2/6/2020 12:43 PM, Christopher Schultz wrote:
> …
>> * Therefore `web.xml` settings, HTTP headers, etc. are all
>> irrelevant, as this is an issue dealing with the file format
>> itself, and the latest spec for the file format says to use UTF-8,
>> so everyone should use UTF-8 already.
> Except for everyone who already uses something else and expects
> everything to be backward-compatible.
I think there comes a time where we have to move forward after some
critical level of usage is reached. I think we've passed that point.
Modern browsers in the sense that you mention are not
backwards-compatible for `application/x-www-form-urlencoded`. So what
are we being compatible with by not using UTF-8 decoding? Do we have
anything besides browsers consuming output from legacy JSP apps? As
noted the browsers break when we try to be "backwards-compatible" in the
sense you mention.
> The problem is that you don't get to declare what's "best" for
> everyone and then the whole world does what you want.
But here I would imagine that everyone already agrees on what's best; the
debate is whether we should do something different from what we know is best because of some
outdated specs. (And I say that as a huge proponent of following standards.)
I'll give you an example that is directly relevant. Over 10 years ago I
strongly advocated to the RDF group that the Internet should abandon the
outdated practice of requiring that `text/*` media types default to
US-ASCII; otherwise there would be no point in using `text/*` for
anything going forward! (That's why we went through a sad phase where
everyone was using `application/*` for text formats because they wanted
to default to something other than US-ASCII.)
* https://www.w3.org/2008/01/rdf-media-types
* https://lists.w3.org/Archives/Public/www-archive/2007Dec/0059.html
Sure enough, eventually someone saw the light (I won't claim I had
anything to do with it, but it is exactly what I was arguing for) and
created https://tools.ietf.org/html/rfc6657, which says that individual
`text/*` types can choose a default other than ASCII. Finally we're not
stuck in the past anymore!
I would say that someone needs to create an updated
`application/x-www-form-urlencoded` specification prescribing UTF-8
decoding of encoded octets, except that the WhatWG has already done
that! So I'm not declaring that everyone should do it "my" way. I'm
saying everyone should follow the latest spec which already exists.
Anyway, thanks for listening. I think it's a fun discussion, and I
wasn't being combative---I just wanted to tell a bit of the story. I
need to get back to work now. :)
Thanks again for the change in Tomcat 10!
Garret
Re: [OT] distinction between resource charset and format octet
decoding
Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Garret,
On 2/6/20 10:25 AM, Garret Wilson wrote:
> On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
>> …
>>> As of Tomcat 10, conf/web.xml contains the following:
>>>
>>> <!--
>>> Set the default request and response character encodings to UTF-8.
>>> -->
>>> <request-character-encoding>UTF-8</request-character-encoding>
>>> <response-character-encoding>UTF-8</response-character-encoding>
>>>
>>> That *should* have the effect you are looking for but I confess I
>>> haven't tested it in any great detail.
>>>
>>
>> As I am sure many people (Christopher included) would agree, the
>> real solution would be for browsers and other HTTP clients to
>> indicate clearly in the request, the charset/encoding of each
>> text parameter that they are sending. There are even HTTP headers
>> already defined for that.
>
>
> Which HTTP headers are you referring to? `Content-Type`? It is my
> opinion that this is irrelevant and not applicable.
>
> As I explained (extensively) in my original post for this thread
> back on 2019-01-08, the issue is not the charset of
> `application/x-www-form-urlencoded`. That media type is made up of
> ASCII characters. It doesn't matter whether you say it's ASCII,
> ISO-8859-1, UTF-8, or whatever, the actual characters stay 100% the
> same.
Hmm. Not always. While it may be true that:
1. ASCII, ISO-8859-1, and UTF-8 are very common
2. ASCII, ISO-8859-1, and UTF-8 share the first 128 code points
It is not true that:
3. All character encodings share the first 128 code points.
UTF-16 doesn't follow that pattern.
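A quick way to see the difference at the byte level is the small Java sketch below. The class name `AsciiCompatDemo` and the helper `sameBytes` are made up for illustration; the charsets come from the standard `java.nio.charset.StandardCharsets`. It encodes the same ASCII-only form body several ways and compares the resulting octets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class comparing the byte form of ASCII-only text.
class AsciiCompatDemo {

    // True when the text encodes to identical octets in both charsets.
    public static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        String body = "fullName=Fl%C3%A1vio+Jos%C3%A9"; // ASCII characters only

        // ASCII, ISO-8859-1 and UTF-8 agree octet for octet on this text.
        System.out.println(sameBytes(body, StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1));
        System.out.println(sameBytes(body, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));

        // UTF-16 uses two octets per character here, so the bytes differ.
        System.out.println(sameBytes(body, StandardCharsets.US_ASCII, StandardCharsets.UTF_16BE));
    }
}
```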
> At issue is when certain octets are encoded (as specified by the
> `application/x-www-form-urlencoded` media type itself), what
> charset to use when decoding them. This is independent of the
> encoding of the media type itself; rather this is defined by the
> specification for the format.
Correct. And there is a lack of agreement for URLs, so browsers decided
to make it up. It's not possible to guess what the browser has chosen
because it does not advertise it in any way (absent a standard). The
only 100% reliable way to do it would be to add a parameter to every
request which has a known-correct value that can be unambiguously
decoded. You just keep re-decoding the whole URL until that parameter
value matches the known-correct value. Sounds like a lot of fun to
implement across a whole application, right?
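For what it's worth, the core of that trick might look something like the Java sketch below. Everything in it is an assumption for illustration: the class `SentinelCharsetDetector`, the `detect` method, and the choice of a single e-acute (U+00E9) as the known-correct sentinel value. Only `java.net.URLDecoder` is a real library API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: guesses the client's charset from a sentinel parameter.
class SentinelCharsetDetector {

    // Assumed known-correct value the form always submits (U+00E9, e-acute).
    static final String EXPECTED = "\u00e9";

    // Re-decodes the raw (still percent-encoded) sentinel with each candidate
    // charset and returns the first one that reproduces the expected value.
    public static Optional<String> detect(String rawSentinel, List<String> candidates) {
        for (String charset : candidates) {
            try {
                if (EXPECTED.equals(URLDecoder.decode(rawSentinel, charset))) {
                    return Optional.of(charset);
                }
            } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                // Unknown charset or malformed escape: try the next candidate.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("UTF-8", "ISO-8859-1");
        System.out.println(detect("%C3%A9", candidates)); // sentinel sent as UTF-8
        System.out.println(detect("%E9", candidates));    // sentinel sent as ISO-8859-1
    }
}
```

The rest of the request would then be decoded with the detected charset, and a real application would still have to decide what to do when nothing matches.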
> Unfortunately https://tools.ietf.org/html/rfc1866 actually says we
> should use ASCII when decoding the octets, but this is severely
> antiquated and doesn't fit with modern practice. The WhatWG
> essentially redefines the format to say that the octets must be
> interpreted as UTF-8:
>
> https://url.spec.whatwg.org/#application/x-www-form-urlencoded
>
> So to summarize my view:
>
> * The decoding of the `application/x-www-form-urlencoded` media
> type encoded octets is completely independent of the charset
> indicated in the `Content-Type` header, and rather goes to the
> specification of the format itself.
It's strange, because Content-Type can contain a charset parameter,
but MIME specifically says that "charset" parameters are only
appropriate for "text/*" MIME types. So for
application/x-www-form-urlencoded, you "shouldn't" add that parameter.
But there's no particular reason NOT to include it (it doesn't
actually violate any spec) and adding it COMPLETELY AND UNAMBIGUOUSLY
indicates what the browser chose as the encoding.
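If a client did send the parameter, reading it on the server side is simple. The Java sketch below is a deliberately simplified parser under my own assumptions: the class `ContentTypeCharset` and the method `charsetOf` are hypothetical, and it does not handle semicolons inside quoted parameter values.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical, simplified Content-Type parser; not a full header grammar.
class ContentTypeCharset {

    // Returns the value of the "charset" parameter, if one is present.
    public static Optional<String> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? Optional.empty() : Optional.of(value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```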
> * RFC 1866 is severely out of date and out of step, and the
> WhatWG's specification of the `application/x-www-form-urlencoded`
> media type should be used instead. (Modern browser practice would
> seem to agree with me.)
RFC 1866 has been very much superseded. Also, HTML specs shouldn't be
defining HTTP semantics. So ignore whatever is in RFC 1866 on multiple
grounds.
> * Therefore `web.xml` settings, HTTP headers, etc. are all
> irrelevant, as this is an issue dealing with the file format
> itself, and the latest spec for the file format says to use UTF-8,
> so everyone should use UTF-8 already.
Except for everyone who already uses something else and expects
everything to be backward-compatible.
The problem is that you don't get to declare what's "best" for
everyone and then the whole world does what you want. I happen to
agree with you (Everyone should move to UTF-8 for everything.
Everywhere. Forever.), but you have to recognize that there is history
and entrenched systems, environments, and mindsets.
> The new default `web.xml` in Tomcat 10 is a wonderful step in the
> right direction.
+1
- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl48NAoACgkQHPApP6U8
pFgJ6A/+JSArcUkqm3P6n0awICXTuqIx0TU1oIf9bzivpAI/Na9fr//ebnwzmvoy
EXpbnn97B7Sy8uZ1wvT0+PQLbmwVmM/f7zBk4q+7Ba/ogkmrSHeLlsCIbLAlXOLD
kr/xDE4ftxrwR2+ZwuQwxH0muFH+4rq2SBFWTQnGORCQDqRRK7eQoQYHWE0HIAxj
cAJmwkQEQyi+YHdgaUo0L4BU7lvgPGk7JyjbzWBiigFYy/1Du1caE7PzYLa5G3wZ
BrYDA6QoQA+nUmXHn/ayUVXvsZc2l/nU/uM5m68Tp1iEVxdgp4u8XtHuqgv0Nzda
IeQq9HOP8wd7l27/dk2DvlZBmSWt2XDOI5ig+NoLPT1ixyQIqVJ2K8SyayGdUHW9
XJi/mqVqHF1h1okTgystt4mNTTBYFqFfwfBUWFK1T+9sUot8aJ2y6P20058mv5ds
iQbEP0K0VJsUGSD+JJd+lvm6gI+54jNhnNgS1bFndbC5p4afNToCCKl8EBBENtbK
64xiolpux4VLFrgmzyG6gfbiSurJz+s3hgH29JJGfml/zdNS5QMI+fhsgOFThDrr
38Ul/QA4fRJehINAqqnsBFhJlymgvO/3PMGCDYCvWfq0cyBDOoKzWH2lscq5cXnz
AMNiKU9roV1YdvUQPscSY7iPyDNq4JFDUdHa4pi7gp9JfXMlL7s=
=Igs0
-----END PGP SIGNATURE-----
Re: [OT] distinction between resource charset and format octet
decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
> …
>> As of Tomcat 10, conf/web.xml contains the following:
>>
>> <!--
>> Set the default request and response character encodings to UTF-8.
>> -->
>> <request-character-encoding>UTF-8</request-character-encoding>
>> <response-character-encoding>UTF-8</response-character-encoding>
>>
>> That *should* have the effect you are looking for but I confess I
>> haven't tested it in any great detail.
>>
>
> As I am sure many people (Christopher included) would agree, the real
> solution would be for browsers and other HTTP clients to indicate
> clearly in the request, the charset/encoding of each text parameter
> that they are sending.
> There are even HTTP headers already defined for that.
Which HTTP headers are you referring to? `Content-Type`? It is my
opinion that this is irrelevant and not applicable.
As I explained (extensively) in my original post for this thread back on
2019-01-08, the issue is not the charset of
`application/x-www-form-urlencoded`. That media type is made up of ASCII
characters. It doesn't matter whether you say it's ASCII, ISO-8859-1,
UTF-8, or whatever, the actual characters stay 100% the same. At issue
is when certain octets are encoded (as specified by the
`application/x-www-form-urlencoded` media type itself), what charset to
use when decoding them. This is independent of the encoding of the media
type itself; rather this is defined by the specification for the format.
Unfortunately https://tools.ietf.org/html/rfc1866 actually says we
should use ASCII when decoding the octets, but this is severely
antiquated and doesn't fit with modern practice. The WhatWG essentially
redefines the format to say that the octets must be interpreted as UTF-8:
https://url.spec.whatwg.org/#application/x-www-form-urlencoded
So to summarize my view:
* The decoding of the `application/x-www-form-urlencoded` media type
encoded octets is completely independent of the charset indicated in
the `Content-Type` header, and rather goes to the specification of
the format itself.
* RFC 1866 is severely out of date and out of step, and the WhatWG's
specification of the `application/x-www-form-urlencoded` media type
should be used instead. (Modern browser practice would seem to agree
with me.)
* Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant,
as this is an issue dealing with the file format itself, and the
latest spec for the file format says to use UTF-8, so everyone
should use UTF-8 already.
The new default `web.xml` in Tomcat 10 is a wonderful step in the right
direction.
See my original post for more in-depth explanation.
Garret
Re: [OT] distinction between resource charset and format octet
decoding
Posted by "André Warnier (tomcat/perl)" <aw...@ice-sa.com>.
On 06.02.2020 14:44, Mark Thomas wrote:
> On 06/02/2020 13:39, Garret Wilson wrote:
>> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>>> …
>>>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>>>> by default is something that should probably be discussed for Tomcat
>>>>> 10. Given the current state of the web, there is a reasonable case for
>>>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>>> Is this still on the list for discussion for Tomcat 10?
>>> No, because it has already been implemented for Tomcat 10 and is in the
>>> milestone release currently being voted on.
>>
>> Waitasec. I'm not used to good news, so I want to make sure I understand
>> what you're saying. Are you saying that the proposed Tomcat 10
>> implementation already interprets encoded octets in web form submissions
>> using UTF-8 by default?!! :O
>
> As of Tomcat 10, conf/web.xml contains the following:
>
> <!--
> Set the default request and response character encodings to UTF-8.
> -->
> <request-character-encoding>UTF-8</request-character-encoding>
> <response-character-encoding>UTF-8</response-character-encoding>
>
> That *should* have the effect you are looking for but I confess I
> haven't tested it in any great detail.
>
As I am sure many people (Christopher included) would agree, the real solution would be
for browsers and other HTTP clients to indicate clearly in the request, the
charset/encoding of each text parameter that they are sending.
There are even HTTP headers already defined for that.
(Nowadays the default could be Unicode/UTF-8).
The problem is that browsers and other agents don't do that, although they undoubtedly
always know themselves, and although it would solve a series of issues that have literally
been there forever at the server and application level (*).
I have often wondered if/why the Apache Foundation does not pack enough influence over the
HTTP/HTML specifications process and over browser producers, to achieve that.
(And if not the Apache Foundation, then who ?)
(*) My own guess is that this basic thing (or lack of it) has cost over the years many
thousands of lines of unnecessary code and many thousands of unproductive developer hours.
As a tiny example, just consider the above web.xml parameters, and how much time in total
was dedicated to their definition and implementation. Never mind all the previous related
filters and valves and their discussions on this list. And that's only for Tomcat.
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 06/02/2020 13:39, Garret Wilson wrote:
> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>> …
>>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>>> by default is something that should probably be discussed for Tomcat
>>>> 10. Given the current state of the web, there is a reasonable case for
>>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>> Is this still on the list for discussion for Tomcat 10?
>> No, because it has already been implemented for Tomcat 10 and is in the
>> milestone release currently being voted on.
>
> Waitasec. I'm not used to good news, so I want to make sure I understand
> what you're saying. Are you saying that the proposed Tomcat 10
> implementation already interprets encoded octets in web form submissions
> using UTF-8 by default?!! :O
As of Tomcat 10, conf/web.xml contains the following:
<!--
Set the default request and response character encodings to UTF-8.
-->
<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>
That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.
Mark
>
> It will be a joy to update the FAQ when this is released.
>
> Garret
>
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 2/6/2020 10:36 AM, Mark Thomas wrote:
> …
>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>> by default is something that should probably be discussed for Tomcat
>>> 10. Given the current state of the web, there is a reasonable case for
>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>> Is this still on the list for discussion for Tomcat 10?
> No, because it has already been implemented for Tomcat 10 and is in the
> milestone release currently being voted on.
Waitasec. I'm not used to good news, so I want to make sure I understand
what you're saying. Are you saying that the proposed Tomcat 10
implementation already interprets encoded octets in web form submissions
using UTF-8 by default?!! :O
It will be a joy to update the FAQ when this is released.
Garret
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 06/02/2020 13:30, Garret Wilson wrote:
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> …
>>
>> Yes, this default is now very out-dated. That is a side-effect of:
>> …
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>> …
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
> Is this still on the list for discussion for Tomcat 10?
No, because it has already been implemented for Tomcat 10 and is in the
milestone release currently being voted on.
Mark
>
> In my opinion it would be a real shame if Tomcat 10 ships with a web
> form encoding default that goes against the WhatWG specifications and
> corrupts non ISO-8859-1 content under modern browsers.
>
> Garret
>
> P.S. Mark, please ignore the other email from my personal email address.
> Because the Tomcat users list doesn't include my name in the "To:"
> header, my email client didn't know to use the correct reply address.
>
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 1/8/2019 9:57 PM, Mark Thomas wrote:
> …
>
> Yes, this default is now very out-dated. That is a side-effect of:
> …
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
> …
>
> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.
Is this still on the list for discussion for Tomcat 10?
In my opinion it would be a real shame if Tomcat 10 ships with a web
form encoding default that goes against the WhatWG specifications and
corrupts non ISO-8859-1 content under modern browsers.
Garret
P.S. Mark, please ignore the other email from my personal email address.
Because the Tomcat users list doesn't include my name in the "To:"
header, my email client didn't know to use the correct reply address.
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 2/1/2019 9:38 AM, Christopher Schultz wrote:
>> Amazing. A close reading of RFC 3986 reveals that there is no
>> clear mandate for UTF-8 in existing URI schemes, even though
>> recommended for new schemes. Anyway, everyone seems to have settled
>> on UTF-8 (Tomcat included), so I'll try to indicate that.
> Wait... are you saying that _it's the Wild West out there?_ ;)
>
> Yes. The web is indeed held together with duct-tape and baling wire.
> It's amazing that it works as well as it does.
Hahaha. I'm /so/ happy someone agrees with me! Here's to improving
things with a little JB Weld once in a while. (That's what my
grandparents used on the farm when the baling wire and duct tape
couldn't handle it.)
Garret
Re: distinction between resource charset and format octet decoding
Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Garret,
On 2/1/19 11:08, Garret Wilson wrote:
> On 2/1/2019 7:23 AM, Garret Wilson wrote:
>> … * "There /is no default encoding for URIs/ specified anywhere,
>> which is why there is a lot of confusion when it comes to
>> decoding these values." Sheesh, this is ancient. I'll correct
>> it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
>
>
> Amazing. A close reading of RFC 3986 reveals that there is no
> clear mandate for UTF-8 in existing URI schemes, even though
> recommended for new schemes. Anyway, everyone seems to have settled
> on UTF-8 (Tomcat included), so I'll try to indicate that.
Wait... are you saying that _it's the Wild West out there?_ ;)
Yes. The web is indeed held together with duct-tape and baling wire.
It's amazing that it works as well as it does.
- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlxUhDEACgkQHPApP6U8
pFhWlA/8Cxr6xzT8+cw5Mu/a8cH788p+ucK4QtO9Qlm6EBhhX2sW9BelWpk2ftOX
xypZkwW155D2hlz58eUTGSoFl92rgFZNXmXBoIXd+MDgNS/b0zgabb7N7wlHswzj
LJArA9GtXNjRy5vJc4Bpe37ZpiqcV9f/sbQhSO31ZrJYvnVuOOYszzfp2g6UWlg5
+OAgfi2L99uMxJdqc81eIVsL6mmmhlkJYe6ejAZjb/EQ2Lk74MKlgCUfaoasCdYd
hqdQJIBpRGvUnx6UEoq+sdEilBAXTJocGv8cyOFQY5rHcaTy7WIQ9mIWilTjBb6O
gxWJbgRfX+uOVhTT5mo7LoE+YVLQZ3QPAM21SEXtX3PR5Vuk4hB8SYj3/er7S7v2
/kPL0d5K2DsO8034PoZQBturIV8pkiF5jqr2nSTND/B0nFK9hcZu27qY9RigHF95
8owMY7/hdMsK2PlYOwyj6dZSMx94Iy5mWDCrF3GUFCbEN9u3/6HoRYuJZOpCv8h1
aZHZmiYDEtxzxL8OkXNqyuBu4k+HJ58/ABMelpXOjxMVHuFXkqny6XiqrzyWac+z
yW1otX/uLKgqKI9PL3O8MfzVS5LZ6XVtprkZUDhCBvsA8vQTZYBRVQu3DiGMPojj
U4STB1VBJSV4I67bBhkQaAZnsqIgeNi/qzHC+5h6hbHl+Me1lRg=
=Z4XG
-----END PGP SIGNATURE-----
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 2/1/2019 7:23 AM, Garret Wilson wrote:
> …
> * "There /is no default encoding for URIs/ specified anywhere, which
> is why there is a lot of confusion when it comes to decoding these
> values." Sheesh, this is ancient. I'll correct it as per
> https://tools.ietf.org/html/rfc3986#section-2.5 .
Amazing. A close reading of RFC 3986 reveals that there is no clear
mandate for UTF-8 in existing URI schemes, even though recommended for
new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat
included), so I'll try to indicate that.
Garret
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
Good morning, I'm just getting to the editing. I'm going to list some
thoughts I have as I go through this, so you can verify things:
* The servlet spec links are way out of date. I'll update them.
* "There /is no default encoding for URIs/ specified anywhere, which
is why there is a lot of confusion when it comes to decoding these
values." Sheesh, this is ancient. I'll correct it as per
https://tools.ietf.org/html/rfc3986#section-2.5 .
* "Most of the web uses ISO-8859-1 as the default for query strings."
Is this still true?! In light of the above, I would think it is not
true, but I wanted to ask, as you know better about what you've seen
"in the wild".
Garret
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 01/02/2019 17:58, Garret Wilson wrote:
> OK, Mark, I've made my initial edits to the
> https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
> them over!_ This is my first edit to the wiki.
>
> That page has a lot of legacy information, some of which had to do with
> internal Tomcat stuff, and some of which had to do with minute details
> of obsolete RFCs and evolution of browser behavior. I didn't want to
> spend the entire day (week?) on this, so I tried to surgically
> address only the sections relating to POST of
> application/x-www-form-urlencoded and how percent-encoded octets are
> interpreted. I couldn't resist updating the specification links and
> changing just a little prose about URL percent encoding.
>
> There is the risk now that other sections of the page are still outdated
> and conflict with my changes, but most importantly the FAQ should
> provide more complete information on how Tomcat web applications can be
> made to work with modern browsers.
>
> Please let me know if I bungled anything or if I need to clarify something.
LGTM.
> Thanks for letting me participate.
No need to thank us. We should be thanking you. Thank you.
So, what do you want to work on next? ;)
Cheers,
Mark
>
> Garret
>
> On 1/23/2019 12:26 AM, Mark Thomas wrote:
>> On 23/01/2019 05:07, Garret Wilson wrote:
>>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>>> …
>>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>>> Usernames are created in this form so references to the user
>>>> automatically become links to that user's page in the wiki.
>>>
>>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>>> overloading, but as this is my first wiki account anywhere, I'm guessing
>>> it's typical with whatever software you're using.
>>>
>>> Anyway my account is created, with username `GarretWilson`. After I get
>>> permissions I'll update the info on octet encoding for
>>> application/x-www-form-urlencoded in relation to the servlet spec. It
>>> may not be immediately, but I'll slowly but surely get to it.
>> Karma granted. Happy editing.
>>
>> Cheers,
>>
>> Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
OK, Mark, I've made my initial edits to the
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
them over!_ This is my first edit to the wiki.
That page has a lot of legacy information, some of which had to do with
internal Tomcat stuff, and some of which had to do with minute details
of obsolete RFCs and evolution of browser behavior. I didn't want to
spend the entire day (week?) on this, so I tried to surgically address
only the sections relating to POST of
application/x-www-form-urlencoded and how percent-encoded octets are
interpreted. I couldn't resist updating the specification links and
changing just a little prose about URL percent encoding.
There is the risk now that other sections of the page are still outdated
and conflict with my changes, but most importantly the FAQ should
provide more complete information on how Tomcat web applications can be
made to work with modern browsers.
Please let me know if I bungled anything or if I need to clarify something.
Thanks for letting me participate.
Garret
On 1/23/2019 12:26 AM, Mark Thomas wrote:
> On 23/01/2019 05:07, Garret Wilson wrote:
>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>> …
>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>> Usernames are created in this form so references to the user
>>> automatically become links to that user's page in the wiki.
>>
>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>> overloading, but as this is my first wiki account anywhere, I'm guessing
>> it's typical with whatever software you're using.
>>
>> Anyway my account is created, with username `GarretWilson`. After I get
>> permissions I'll update the info on octet encoding for
>> application/x-www-form-urlencoded in relation to the servlet spec. It
>> may not be immediately, but I'll slowly but surely get to it.
> Karma granted. Happy editing.
>
> Cheers,
>
> Mark
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 23/01/2019 05:07, Garret Wilson wrote:
> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>> …
>> Anything in PascalCase becomes a link to a wiki page of that name.
>> Usernames are created in this form so references to the user
>> automatically become links to that user's page in the wiki.
>
>
> Ah, OK, that explains it. Very good to know. Maybe a little semantic
> overloading, but as this is my first wiki account anywhere, I'm guessing
> it's typical with whatever software you're using.
>
> Anyway my account is created, with username `GarretWilson`. After I get
> permissions I'll update the info on octet encoding for
> application/x-www-form-urlencoded in relation to the servlet spec. It
> may not be immediately, but I'll slowly but surely get to it.
Karma granted. Happy editing.
Cheers,
Mark
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 1/15/2019 3:20 AM, Mark Thomas wrote:
> …
> Anything in PascalCase becomes a link to a wiki page of that name.
> Usernames are created in this form so references to the user
> automatically become links to that user's page in the wiki.
Ah, OK, that explains it. Very good to know. Maybe a little semantic
overloading, but as this is my first wiki account anywhere, I'm guessing
it's typical with whatever software you're using.
Anyway my account is created, with username `GarretWilson`. After I get
permissions I'll update the info on octet encoding for
application/x-www-form-urlencoded in relation to the servlet spec. It
may not be immediately, but I'll slowly but surely get to it.
Cheers,
Garret
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 15/01/2019 03:39, Garret Wilson wrote:
> On 1/9/2019 2:30 AM, Mark Thomas wrote:
>> …
>> Create yourself an account at https://wiki.apache.org/tomcat (click
>> login then create an account) and let the list know your ID. Then one of
>> the admins can add you to the allowed editors.
>
>
> I was just ready to create an account, but I want to verify the details
> so I don't screw things up.
>
> * It asks for a "Name". Is this a username, I suppose? So we don't
> maintain our "name" separate from our "login username"?
Yes, it is your username. Any linkage from that to your "public name"
would be maintained on your user page - if you wish.
> * It says to use "FirstnameLastName". Are you literally wanting us to
> use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
> one who works with protocols all the time, I automatically assume
> this stuff is important. But I prefer to use lowercase on my
> usernames; I'm a little confused about why this would want
> PascalCase for a login username. (I can't think of another system
> that I use that requires PascalCase usernames.)
Think of it as a SHOULD rather than a MUST.
> My guess is that it's trying to maintain a "human name" and a "username"
> but combine them both into one field or something. I can't say this
> approach is typical…
Anything in PascalCase becomes a link to a wiki page of that name.
Usernames are created in this form so references to the user
automatically become links to that user's page in the wiki.
It isn't a feature we use much at the moment. A quick check shows that
most, but not all, contributors have created their user name in PascalCase.
For example, take a look at https://wiki.apache.org/tomcat/AndrewCarr
Mark
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
On 1/9/2019 2:30 AM, Mark Thomas wrote:
> …
> Create yourself an account at https://wiki.apache.org/tomcat (click
> login then create an account) and let the list know your ID. Then one of
> the admins can add you to the allowed editors.
I was just ready to create an account, but I want to verify the details
so I don't screw things up.
* It asks for a "Name". Is this a username, I suppose? So we don't
maintain our "name" separate from our "login username"?
* It says to use "FirstnameLastName". Are you literally wanting us to
use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
one who works with protocols all the time, I automatically assume
this stuff is important. But I prefer to use lowercase on my
usernames; I'm a little confused about why this would want
PascalCase for a login username. (I can't think of another system
that I use that requires PascalCase usernames.)
My guess is that it's trying to maintain a "human name" and a "username"
but combine them both into one field or something. I can't say this
approach is typical…
Garret
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 09/01/2019 00:50, Garret Wilson wrote:
> Hi, Mark, and thanks for the quick response. You provided some info I
> wasn't aware of. Some responses below:
>
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> On 08/01/2019 21:31, Garret Wilson wrote:
>>
>> <snip/>
>>
>>> But as discussed above, this is completely wrong: the resource
>>> character encoding of a request sent in
>>> `application/x-www-form-urlencoded` should have absolutely no bearing
>>> on how the encoded octets within that resource are decoded.
>>
>> That is not the correct interpretation of section 3.12 of the Servlet
>> 4.0 specification (note the section numbers do vary between spec
>> versions). Tomcat implements the correct interpretation - i.e. the
>> charset from the request content-type defines how encoded octets are
>> decoded and, if none is specified, ISO-8859-1 is used as the default.
>
>
> Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
> is correctly following the spec, but I would still say the servlet spec
> is wrong to make any linkage at all between resource encoding and %nn
> interpretation. In fact reading the prose it's not clear to me that the
> servlet spec is even strongly tying the %nn interpretation to the
> encoding. It just seems to say that, unless otherwise specified, the %nn
> interpretation should be ISO-8859-1. And actually that's a step up from
> the HTML 4.01 spec, which in
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
> that they should be interpreted as US-ASCII codes. :(
>
> You indicate that this is all out of date, and I think we're in
> agreement there. We really, really need to get the next servlet
> specification to remove this part. In fact the servlet specification
> should defer to the official `application/x-www-form-urlencoded`
> specification, which at this point I think is the W3C HTML5 spec, which
> in turn defers to the WHATWG spec (which clearly says that UTF-8 should
> be used). What makes all of this more of a mess is that there seems to be
> no way to work around this from the client side, e.g. by putting
> something in the HTML to indicate UTF-8, as
> `application/x-www-form-urlencoded` doesn't support a `charset` parameter.
>
> Anyway if there are any openings on the committee to update the servlet
> spec, let me know.
That has moved to Eclipse. The process to update the spec is still being
defined. The Jakarta EE Servlet API project is the project to get
involved in.
>> ...
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>>
>> <request-character-encoding>UTF-8</request-character-encoding>
>
> Oh, that is really good to know, thanks!! Still I say that the request
> character encoding is orthogonal to the %nn encoding, but, still, it's
> good to have an implementation-agnostic way to do it.
>
>>
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
>
> Yes please! If I can help in any way, let me know.
>
>
>>
>> The Tomcat Wiki also needs to be updated to take account of this new
>> configuration option (and the related <response-character-encoding>).
>> Since it is a wiki and this is clearly an issue you care about would
>> you like to tackle that?
>
>
> Yes, I'd love to. Let me know what permissions I need, etc.
Create yourself an account at https://wiki.apache.org/tomcat (click
login then create an account) and let the list know your ID. Then one of
the admins can add you to the allowed editors.
Apologies for the hoop jumping required but without the manual approval
step for new accounts, the ASF project wikis were being deluged with spam.
Mark
>
> I have an international flight boarding right now so I have to go, and I
> may not reply for the next few hours, but definitely sign me up.
>
> Thanks,
>
> Garret
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
Hi, Mark, and thanks for the quick response. You provided some info I
wasn't aware of. Some responses below:
On 1/8/2019 9:57 PM, Mark Thomas wrote:
> On 08/01/2019 21:31, Garret Wilson wrote:
>
> <snip/>
>
>> But as discussed above, this is completely wrong: the resource
>> character encoding of a request sent in
>> `application/x-www-form-urlencoded` should have absolutely no bearing
>> on how the encoded octets within that resource are decoded.
>
> That is not the correct interpretation of section 3.12 of the Servlet
> 4.0 specification (note the section numbers do vary between spec
> versions). Tomcat implements the correct interpretation - i.e. the
> charset from the request content-type defines how encoded octets are
> decoded and, if none is specified, ISO-8859-1 is used as the default.
Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
is correctly following the spec, but I would still say the servlet spec
is wrong to make any linkage at all between resource encoding and %nn
interpretation. In fact reading the prose it's not clear to me that the
servlet spec is even strongly tying the %nn interpretation to the
encoding. It just seems to say that, unless otherwise specified, the %nn
interpretation should be ISO-8859-1. And actually that's a step up from
the HTML 4.01 spec, which in
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
that they should be interpreted as US-ASCII codes. :(
You indicate that this is all out of date, and I think we're in
agreement there. We really, really need to get the next servlet
specification to remove this part. In fact the servlet specification
should defer to the official `application/x-www-form-urlencoded`
specification, which at this point I think is the W3C HTML5 spec, which
in turn defers to the WHATWG spec (which clearly says that UTF-8 should
be used). What makes all of this more of a mess is that there seems to be
no way to work around this from the client side, e.g. by putting
something in the HTML to indicate UTF-8, as
`application/x-www-form-urlencoded` doesn't support a `charset` parameter.
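To make the client side concrete, here is a small sketch of what a UTF-8 form submission produces, using the JDK's own `java.net.URLEncoder`. The class name and the `encodeField` helper are invented for the example; only `URLEncoder.encode(String, String)` is a real API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only.
class FormEncodingSketch {
    /** Builds one name=value pair the way a UTF-8 form submission would encode it. */
    static String encodeField(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeField("fullName", "Fl\u00E1vio Jos\u00E9"));
    }
}
```

The result is `fullName=Fl%C3%A1vio+Jos%C3%A9`, and nothing in that body tells the server which charset the escaped octets belong to.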
Anyway if there are any openings on the committee to update the servlet
spec, let me know.
> ...
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>
Oh, that is really good to know, thanks!! Still I say that the request
character encoding is orthogonal to the %nn encoding, but, still, it's
good to have an implementation-agnostic way to do it.
>
>
> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.
Yes please! If I can help in any way, let me know.
>
> The Tomcat Wiki also needs to be updated to take account of this new
> configuration option (and the related <response-character-encoding>).
> Since it is a wiki and this is clearly an issue you care about would
> you like to tackle that?
Yes, I'd love to. Let me know what permissions I need, etc.
I have an international flight boarding right now so I have to go, and I
may not reply for the next few hours, but definitely sign me up.
Thanks,
Garret
Re: distinction between resource charset and format octet decoding
Posted by Garret Wilson <ga...@globalmentor.com>.
Sorry to bring up the non-UTF-8 escaped octets form POST problem again,
but …
On 1/8/2019 3:57 PM, Mark Thomas wrote:
> …
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>
>
> If you add it to conf/web.xml it applies to every web application
> deployed to Tomcat.
>
> Tomcat 9 uses this in the examples, manager and host-manager
> applications in place of the SetCharacterEncodingFilter.
As you know I've already updated the Tomcat FAQ with the options for
forcing Tomcat to interpret form POSTs with any escaped characters using
UTF-8 octet sequences (as modern browsers send, and as HTML5 requires)
instead of ISO-8859-1 (as the Servlet 4 spec says).
But the problem is worse with the Spring community. If someone is using
Spring Boot to create an executable JAR/WAR using embedded Tomcat,
Spring Boot does something to configure Tomcat to decode the POSTs
correctly (that is, as the modern web likes it, not like the Servlet 4
spec says). Unfortunately, if I use Spring Boot to make a WAR which is
both a self-contained executable WAR /and/ a WAR deployable on Tomcat,
when I deploy the WAR on Tomcat the escaped octets are decoded as
ISO-8859-1, so my web app breaks. Yes, the WAR runs differently
if using Spring Boot embedded Tomcat or deployed on standalone Tomcat as
a WAR.
Spring Boot ignores any `web.xml` file. I guess I could create a
`web.xml` file only for standalone Tomcat, but then this freezes Eclipse
(as I posted elsewhere) because Eclipse doesn't understand
`<request-character-encoding>`. So like so many things on the web, this
is a mess.
This is a serious issue, in my opinion. The Servlet 4 specification is
out of step with everything else in the ecosystem!
> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.
Yes, can I just re-second (third?) that motion, and underscore the need
for this to be changed in Tomcat 10?
Thanks,
Garret
Re: distinction between resource charset and format octet decoding
Posted by Mark Thomas <ma...@apache.org>.
On 08/01/2019 21:31, Garret Wilson wrote:
<snip/>
> But as discussed above, this is completely wrong: the resource character
> encoding of a request sent in `application/x-www-form-urlencoded` should
> have absolutely no bearing on how the encoded octets within that
> resource are decoded.
That is not the correct interpretation of section 3.12 of the Servlet
4.0 specification (note the section numbers do vary between spec
versions). Tomcat implements the correct interpretation - i.e. the
charset from the request content-type defines how encoded octets are
decoded and, if none is specified, ISO-8859-1 is used as the default.
Yes, this default is now very outdated. That is a side-effect of:
- how long the Servlet specification has been around
- the very conservative approach taken by Java EE in terms of backwards
compatibility (once set, defaults are very rarely - if ever - changed)
- arguably missed opportunities to address this issue prior to
Servlet 4.0
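To illustrate what the choice of charset does to the example from the original question, the following sketch decodes the same escaped octets twice with the JDK's `java.net.URLDecoder`. It is only a demonstration of the effect, not how Tomcat parses parameters internally; the class and method names are invented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, for illustration only (not Tomcat's parameter parser).
class OctetDecodingSketch {
    /** Decodes a form-encoded value, interpreting the %nn octets in the given charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "Fl%C3%A1vio+Jos%C3%A9";
        System.out.println(decode(encoded, "UTF-8"));      // what the browser meant
        System.out.println(decode(encoded, "ISO-8859-1")); // what the old default yields
    }
}
```

With UTF-8 the value comes back as "Flávio José"; with ISO-8859-1 each octet becomes its own character, giving "FlÃ¡vio JosÃ©".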
As of Servlet 4.0 there is a specification compliant configuration
option to change this default to any encoding of your choice. Obviously,
UTF-8 is one of the options. You can do this by adding the following to
your web.xml:
<request-character-encoding>UTF-8</request-character-encoding>
If you add it to conf/web.xml it applies to every web application
deployed to Tomcat.
Tomcat 9 uses this in the examples, manager and host-manager
applications in place of the SetCharacterEncodingFilter.
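The order of precedence described above can be summarised in a simplified sketch: an explicit charset on the request Content-Type wins, otherwise the configured `<request-character-encoding>` applies, otherwise ISO-8859-1. This models only the rule as stated in this message; it is not Tomcat's actual code, and `resolveCharset` and its parameters are hypothetical names.

```java
import java.util.Locale;

// Hypothetical model of the fallback rule, for illustration only.
class RequestCharsetSketch {
    /**
     * @param contentType       the request Content-Type header value, or null
     * @param configuredDefault the value of request-character-encoding, or null if not set
     */
    static String resolveCharset(String contentType, String configuredDefault) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String value = param.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    if (!value.isEmpty()) {
                        return value; // explicit charset on the request wins
                    }
                }
            }
        }
        // No charset on the request: use the configured default, else the spec default.
        return configuredDefault != null ? configuredDefault : "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(resolveCharset("application/x-www-form-urlencoded", null));
        System.out.println(resolveCharset("application/x-www-form-urlencoded", "UTF-8"));
    }
}
```

Since browsers typically send `application/x-www-form-urlencoded` without a charset parameter, the configured default is what decides the outcome in practice.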
Whether Tomcat should ship with this setting present in conf/web.xml by
default is something that should probably be discussed for Tomcat 10.
Given the current state of the web, there is a reasonable case for doing
so. I'll add that to the TOMCAT-NEXT discussion list.
The Tomcat Wiki also needs to be updated to take account of this new
configuration option (and the related <response-character-encoding>).
Since it is a wiki and this is clearly an issue you care about would you
like to tackle that?
Mark