Posted to users@tomcat.apache.org by Garret Wilson <ga...@globalmentor.com> on 2019/01/08 21:31:19 UTC

distinction between resource charset and format octet decoding

I have a question (using Tomcat 9.0.12 on Windows 10), and I'd like 
someone on the Tomcat development team to clarify a distinction for me 
regarding resource charsets and octet decoding in a particular format. I 
am not a newbie, and although the answer to my question may seem 
obvious, it goes to a critical issue that I believe to be a fundamental 
bug in Tomcat encoding processing.

Let's say that as an HTTP client I retrieve a resource `readme.txt` from 
Tomcat, and Tomcat clearly indicates via the HTTP response headers that 
the `Content-Type` is `text/plain; charset=ISO-8859-1`. That file 
contains, among other things, a line that says:

     See https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9 for more info.

I want to parse the text file and present a live link to the user (email 
clients do this all the time), but I want to make the link "pretty" by 
decoding the URL. The question is: do I decode the octets using UTF-8, 
to show `…fullName=Flávio+José`, or do I use ISO-8859-1 to decode the 
octets, so that I show `…fullName=FlÃ¡vio+JosÃ©`? (Flávio José is a 
famous Brazilian forró singer and musician, by the way.)

The content type encoding of `readme.txt` is ISO-8859-1, so I must use 
ISO-8859-1 to decode the octets in `Fl%C3%A1vio+Jos%C3%A9`, yielding 
`…fullName=FlÃ¡vio+JosÃ©`, right??!

No, of course not. The decoding of the octet sequence is independent of 
the resource encoding, and represents a separate layer of encoding _on 
top_ of the resource encoding. It wouldn't matter whether the text file 
were encoded in UTF-8, ISO-8859-1, or US-ASCII—the URL would still be 
https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9, and its 
octets should still be decoded using UTF-8 as per RFC 3986.
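
To make the two candidate decodings concrete, here is a minimal Java sketch that decodes the same percent-encoded value with both charsets. The class and method names are made up for illustration, and `URLDecoder.decode(String, Charset)` needs Java 10 or later. Note that `URLDecoder` follows form-encoding rules, so it also turns `+` into a space.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PercentDecodeDemo {

    // Decodes percent-encoded octets (and '+' as a space) using the given charset.
    public static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String value = "Fl%C3%A1vio+Jos%C3%A9";
        // The octets C3 A1 and C3 A9 each form one character when read as UTF-8.
        System.out.println(decode(value, StandardCharsets.UTF_8));
        // Read as ISO-8859-1, every octet becomes its own character instead.
        System.out.println(decode(value, StandardCharsets.ISO_8859_1));
    }
}
```

The charset of the file that carried the URL never enters into it; only the charset handed to the decoder changes the result.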

I'll get right to the point; the above was a rhetorical question used as 
an analogy.

The Tomcat FAQ at 
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9 indicates that 
the default encoding for an HTTP POST is ISO-8859-1. That is true. 
However, Tomcat then goes further and assumes that it should decode 
_the octets of `application/x-www-form-urlencoded`_ using ISO-8859-1 as 
well! This is simply wrong; the octets should be interpreted as a 
sequence of UTF-8 octets; see 
https://url.spec.whatwg.org/#concept-urlencoded-serializer . This means 
if my browser sends a POST with content `fullName=Fl%C3%A1vio+Jos%C3%A9` 
using `application/x-www-form-urlencoded`, Tomcat will interpret this 
request parameter as `FlÃ¡vio JosÃ©` in my servlet/JSP, when it should 
interpret it as `Flávio José`. (Tomcat correctly decodes the octets when 
used as a query parameter rather than a POST parameter.)
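
The garbling described above can be reproduced, and reversed, with nothing but the JDK. The sketch below is only an illustration (the class and method names are hypothetical, and this is not something Tomcat does internally): because ISO-8859-1 maps every octet to exactly one character, re-encoding the garbled string as ISO-8859-1 recovers the original octets, which can then be decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Simulates a server reading UTF-8 octets as if they were ISO-8859-1.
    public static String misdecode(String text) {
        byte[] utf8Octets = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Octets, StandardCharsets.ISO_8859_1);
    }

    // Recovers the intended text by undoing the wrong decoding step.
    public static String repair(String garbled) {
        byte[] originalOctets = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalOctets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("Fl\u00e1vio Jos\u00e9");
        System.out.println(garbled);
        System.out.println(repair(garbled));
    }
}
```

A repair step like this is a stopgap at best; it only works when the wrong charset was a lossless single-byte one such as ISO-8859-1.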

Now it may be that the FAQ is simply out of date; it still seems to 
think that encoded URI octets should not be interpreted as UTF-8, 
completely ignoring RFC 3986. If so, it is long out of date; RFC 3986 
came out in 2005. (And indeed, Tomcat works with UTF-8 octets in URIs.) 
But out of date or not, the FAQ at 
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 then recommends 
that to force Tomcat to interpret the 
`application/x-www-form-urlencoded` octets as UTF-8, I must set the 
`org.apache.catalina.filters.SetCharacterEncodingFilter` filter (in some 
`web.xml` file) to `UTF-8`. (I can alternatively put `<% 
request.setCharacterEncoding("UTF-8"); %>` in my JSP.) And sure enough, 
it fixes the problem.

But as discussed above, this is completely wrong: the resource character 
encoding of a request sent in `application/x-www-form-urlencoded` should 
have absolutely no bearing on how the encoded octets within that 
resource are decoded. They must be decoded as UTF-8, irrespective of 
what "character encoding" Tomcat assumes the content to be. Tomcat has 
updated the way it decodes URIs to support UTF-8; it is time Tomcat does 
the same for `application/x-www-form-urlencoded` values. The current 
approach is broken in the context of the modern web, and the workaround 
is simply wrong.
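
As a rough sketch of the behavior I am arguing for, the following parses an `application/x-www-form-urlencoded` body and always decodes the percent-encoded octets as UTF-8. `FormBodyParser` is a hypothetical name, not a Tomcat class, and the sketch simplifies things: a repeated name keeps only its last value, and `URLDecoder` throws `IllegalArgumentException` on a malformed `%` sequence rather than passing it through.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodyParser {

    // Parses "name=value&name=value" pairs, decoding the octets as UTF-8.
    public static Map<String, String> parse(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as a trailing '&'
            }
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(rawName, StandardCharsets.UTF_8),
                       URLDecoder.decode(rawValue, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parse("fullName=Fl%C3%A1vio+Jos%C3%A9&lang=pt"));
    }
}
```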

I also raised this at https://stackoverflow.com/q/54094982/421049 .

I would have filed a Tomcat Bugzilla issue, but the bug report form 
indicated I should report the problem on this list first.

Thank you in advance for your attention to this matter.

Garret Wilson
GlobalMentor, Inc.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 2/6/2020 10:44 AM, Mark Thomas wrote:
> …
> As of Tomcat 10, conf/web.xml contains the following:
>
> <!--
>    Set the default request and response character encodings to UTF-8.
> -->
> <request-character-encoding>UTF-8</request-character-encoding>
> <response-character-encoding>UTF-8</response-character-encoding>
>
> That *should* have the effect you are looking for but I confess I
> haven't tested it in any great detail.

Yes! Oh, that is so wonderful. Thank you!

I brought this issue up on the list over a year ago, and I have since 
published my entire comprehensive software development course (still 
being expanded).

https://www.globalmentor.com/courses/softdev/

The course is centered around Tomcat as the server, and the lesson on 
HTML forms contains a section warning to use `<request-character-encoding>`.

https://www.globalmentor.com/courses/softdev/html-forms

Once Tomcat 10 is released I'll be able to update this note as well.

Thanks again!

Garret


Re: [OT] distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 2/6/2020 12:43 PM, Christopher Schultz wrote:
> …
>> * Therefore `web.xml` settings, HTTP headers, etc. are all
>> irrelevant, as this is an issue dealing with the file format
>> itself, and the latest spec for the file format says to use UTF-8,
>> so everyone should use UTF-8 already.
> Except for everyone who already uses something else and expects
> everything to be backward-compatible.

I think there comes a time where we have to move forward after some 
critical level of usage is reached. I think we've passed that point.

Modern browsers in the sense that you mention are not 
backwards-compatible for `application/x-www-form-urlencoded`. So what 
are we being compatible with by not using UTF-8 decoding? Do we have 
anything besides browsers consuming output from legacy JSP apps? As 
noted the browsers break when we try to be "backwards-compatible" in the 
sense you mention.

> The problem is that you don't get to declare what's "best" for
> everyone and then the whole world does what you want.

But here I would imagine that everyone already agrees on what's best; the debate is 
whether we should do something different from what we know is best because of some 
outdated specs. (And I say that as a huge proponent of following standards.)

I'll give you an example that is directly relevant. Over 10 years ago I 
strongly advocated to the RDF group that the Internet should abandon the 
outdated practice of requiring that `text/*` media types default to 
US-ASCII; otherwise there would be no point in using `text/*` for 
anything going forward! (That's why we went through a sad phase where 
everyone was using `application/*` for text formats because they wanted 
to default to something other than US-ASCII.)

  * https://www.w3.org/2008/01/rdf-media-types
  * https://lists.w3.org/Archives/Public/www-archive/2007Dec/0059.html

Sure enough, eventually someone saw the light (I won't claim I had 
anything to do with it, but it is exactly what I was arguing for) and 
created https://tools.ietf.org/html/rfc6657, which says that individual 
`text/*` types can choose a default other than ASCII. Finally we're not 
stuck in the past anymore!

I would say that someone needs to create an updated 
`application/x-www-form-urlencoded` specification prescribing UTF-8 
decoding of encoded octets, except that the WhatWG has already done 
that! So I'm not declaring that everyone should do it "my" way. I'm 
saying everyone should follow the latest spec which already exists.

Anyway, thanks for listening. I think it's a fun discussion, and I 
wasn't being combative---I just wanted to tell a bit of the story. I 
need to get back to work now. :)

Thanks again for the change in Tomcat 10!

Garret

Re: [OT] distinction between resource charset and format octet decoding

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Garret,

On 2/6/20 10:25 AM, Garret Wilson wrote:
> On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
>> …
>>> As of Tomcat 10, conf/web.xml contains the following:
>>> 
>>> <!-- Set the default request and response character encodings
>>> to UTF-8. --> 
>>> <request-character-encoding>UTF-8</request-character-encoding> 
>>> <response-character-encoding>UTF-8</response-character-encoding>
>>>
>>> That *should* have the effect you are looking for but I confess I
>>> haven't tested it in any great detail.
>>> 
>> 
>> As I am sure many people (Christopher included) would agree, the
>> real solution would be for browsers and other HTTP clients to
>> indicate clearly in the request, the charset/encoding of each
>> text parameter that they are sending. There are even HTTP headers
>> already defined for that.
> 
> 
> Which HTTP headers are you referring to? `Content-Type`? It is my 
> opinion that this is irrelevant and not applicable.
> 
> As I explained (extensively) in my original post for this thread
> back on 2019-01-08, the issue is not the charset of 
> `application/x-www-form-urlencoded`. That media type is made up of
> ASCII characters. It doesn't matter whether you say it's ASCII,
> ISO-8859-1, UTF-8, or whatever, the actual characters stay 100% the
> same.

Hmm. Not always. While it may be true that:

1. ASCII, ISO-8859-1, and UTF-8 are very common
2. ASCII, ISO-8859-1, and UTF-8 share the first 128 code points

It is not true that:

3. All character encodings share the first 128 code points.

UTF-16 doesn't follow that pattern.
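
A quick way to check this is to compare the bytes a pure-ASCII form body produces under each encoding. The sketch below is illustrative only; `AsciiCompatDemo` and `sameBytes` are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatDemo {

    // Returns true when the text encodes to identical bytes under both charsets.
    public static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        // A form body consists of ASCII characters only, even when it carries non-ASCII data.
        String body = "fullName=Fl%C3%A1vio+Jos%C3%A9";
        System.out.println(sameBytes(body, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
        System.out.println(sameBytes(body, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
        // UTF-16 uses two bytes per character here, so the byte sequences differ.
        System.out.println(sameBytes(body, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE));
    }
}
```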

> At issue is when certain octets are encoded (as specified by the 
> `application/x-www-form-urlencoded` media type itself), what
> charset to use when decoding them. This is independent of the
> encoding of the media type itself; rather this is defined by the
> specification for the format.
Correct. And there is a lack of agreement for URLs, so browsers decided
to make it up. It's not possible to guess what the browser has chosen
because it does not advertise it in any way (absent a standard). The
only 100% reliable way to do it would be to add a parameter to every
request which has a known-correct value that can be unambiguously
decoded. You just keep re-decoding the whole URL until that parameter
value matches the known-correct value. Sounds like a lot of fun to
implement across a whole application, right?
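
For the curious, the sentinel-parameter trick could look roughly like the sketch below. Everything in it is hypothetical (the class name, the method, and the choice of sentinel value); it simply re-decodes a raw parameter with each candidate charset until the result matches the known-correct value.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

public class SentinelCharsetDetector {

    // Tries each candidate charset in order and returns the first one that
    // decodes the raw sentinel parameter to the expected value.
    public static Optional<Charset> detect(String rawSentinel, String expected,
                                           List<Charset> candidates) {
        for (Charset candidate : candidates) {
            if (URLDecoder.decode(rawSentinel, candidate).equals(expected)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Charset> candidates =
                List.of(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // The page is assumed to have sent a hidden field whose value is U+00E9.
        System.out.println(detect("%C3%A9", "\u00e9", candidates));
        System.out.println(detect("%E9", "\u00e9", candidates));
    }
}
```

The sentinel has to contain at least one non-ASCII character; an ASCII-only value decodes identically under every candidate and tells you nothing.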

> Unfortunately https://tools.ietf.org/html/rfc1866 actually says we 
> should use ASCII when decoding the octets, but this is severely 
> antiquated and doesn't fit with modern practice. The WhatWG
> essentially redefines the format to say that the octets must be
> interpreted as UTF-8:
> 
> https://url.spec.whatwg.org/#application/x-www-form-urlencoded
> 
> So to summarize my view:
> 
> * The decoding of the `application/x-www-form-urlencoded` media
> type encoded octets is completely independent of the charset
> indicated in the `Content-Type` header, and rather goes to the
> specification of the format itself.

It's strange, because Content-Type can contain a charset parameter,
but MIME specifically says that "charset" parameters are only
appropriate for "text/*" MIME types. So for
application/x-www-form-urlencoded, you "shouldn't" add that parameter.
But there's no particular reason NOT to include it (it doesn't
actually violate any spec) and adding it COMPLETELY AND UNAMBIGUOUSLY
indicates what the browser chose as the encoding.
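
If a client did send the parameter, pulling it out is straightforward. Here is a simplified sketch (the `ContentTypeCharset` class is hypothetical, and it does not handle quoted strings that contain semicolons) that extracts a `charset` parameter from a `Content-Type` header value:

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

public class ContentTypeCharset {

    // Returns the charset named in the header value, if any and if supported.
    public static Optional<Charset> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; the parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat it as absent.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```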

> * RFC 1866 is severely out of date and out of step, and the
> WhatWG's specification of the `application/x-www-form-urlencoded`
> media type should be used instead. (Modern browser practice would
> seem to agree with me.)

RFC 1866 has been very much superseded. Also, HTML specs shouldn't be
defining HTTP semantics. So ignore whatever is in RFC 1866 on multiple
grounds.

> * Therefore `web.xml` settings, HTTP headers, etc. are all
> irrelevant, as this is an issue dealing with the file format
> itself, and the latest spec for the file format says to use UTF-8,
> so everyone should use UTF-8 already.

Except for everyone who already uses something else and expects
everything to be backward-compatible.

The problem is that you don't get to declare what's "best" for
everyone and then the whole world does what you want. I happen to
agree with you (Everyone should move to UTF-8 for everything.
Everywhere. Forever.), but you have to recognize that there is history
and entrenched systems, environments, and mindsets.

> The new default `web.xml` in Tomcat 10 is a wonderful step in the
> right direction.

+1

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl48NAoACgkQHPApP6U8
pFgJ6A/+JSArcUkqm3P6n0awICXTuqIx0TU1oIf9bzivpAI/Na9fr//ebnwzmvoy
EXpbnn97B7Sy8uZ1wvT0+PQLbmwVmM/f7zBk4q+7Ba/ogkmrSHeLlsCIbLAlXOLD
kr/xDE4ftxrwR2+ZwuQwxH0muFH+4rq2SBFWTQnGORCQDqRRK7eQoQYHWE0HIAxj
cAJmwkQEQyi+YHdgaUo0L4BU7lvgPGk7JyjbzWBiigFYy/1Du1caE7PzYLa5G3wZ
BrYDA6QoQA+nUmXHn/ayUVXvsZc2l/nU/uM5m68Tp1iEVxdgp4u8XtHuqgv0Nzda
IeQq9HOP8wd7l27/dk2DvlZBmSWt2XDOI5ig+NoLPT1ixyQIqVJ2K8SyayGdUHW9
XJi/mqVqHF1h1okTgystt4mNTTBYFqFfwfBUWFK1T+9sUot8aJ2y6P20058mv5ds
iQbEP0K0VJsUGSD+JJd+lvm6gI+54jNhnNgS1bFndbC5p4afNToCCKl8EBBENtbK
64xiolpux4VLFrgmzyG6gfbiSurJz+s3hgH29JJGfml/zdNS5QMI+fhsgOFThDrr
38Ul/QA4fRJehINAqqnsBFhJlymgvO/3PMGCDYCvWfq0cyBDOoKzWH2lscq5cXnz
AMNiKU9roV1YdvUQPscSY7iPyDNq4JFDUdHa4pi7gp9JfXMlL7s=
=Igs0
-----END PGP SIGNATURE-----


Re: [OT] distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
> …
>> As of Tomcat 10, conf/web.xml contains the following:
>>
>> <!--
>>    Set the default request and response character encodings to UTF-8.
>> -->
>> <request-character-encoding>UTF-8</request-character-encoding>
>> <response-character-encoding>UTF-8</response-character-encoding>
>>
>> That *should* have the effect you are looking for but I confess I
>> haven't tested it in any great detail.
>>
>
> As I am sure many people (Christopher included) would agree, the real 
> solution would be for browsers and other HTTP clients to indicate 
> clearly in the request, the charset/encoding of each text parameter 
> that they are sending.
> There are even HTTP headers already defined for that.

Which HTTP headers are you referring to? `Content-Type`? It is my 
opinion that this is irrelevant and not applicable.

As I explained (extensively) in my original post for this thread back on 
2019-01-08, the issue is not the charset of 
`application/x-www-form-urlencoded`. That media type is made up of ASCII 
characters. It doesn't matter whether you say it's ASCII, ISO-8859-1, 
UTF-8, or whatever, the actual characters stay 100% the same. At issue 
is when certain octets are encoded (as specified by the 
`application/x-www-form-urlencoded` media type itself), what charset to 
use when decoding them. This is independent of the encoding of the media 
type itself; rather this is defined by the specification for the format.

Unfortunately https://tools.ietf.org/html/rfc1866 actually says we 
should use ASCII when decoding the octets, but this is severely 
antiquated and doesn't fit with modern practice. The WhatWG essentially 
redefines the format to say that the octets must be interpreted as UTF-8:

https://url.spec.whatwg.org/#application/x-www-form-urlencoded

So to summarize my view:

  * The decoding of the `application/x-www-form-urlencoded` media type
    encoded octets is completely independent of the charset indicated in
    the `Content-Type` header, and rather goes to the specification of
    the format itself.
  * RFC 1866 is severely out of date and out of step, and the WhatWG's
    specification of the `application/x-www-form-urlencoded` media type
    should be used instead. (Modern browser practice would seem to agree
    with me.)
  * Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant,
    as this is an issue dealing with the file format itself, and the
    latest spec for the file format says to use UTF-8, so everyone
    should use UTF-8 already.

The new default `web.xml` in Tomcat 10 is a wonderful step in the right 
direction.

See my original post for more in-depth explanation.

Garret

Re: [OT] distinction between resource charset and format octet decoding

Posted by "André Warnier (tomcat/perl)" <aw...@ice-sa.com>.

On 06.02.2020 14:44, Mark Thomas wrote:
> On 06/02/2020 13:39, Garret Wilson wrote:
>> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>>> …
>>>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>>>> by default is something that should probably be discussed for Tomcat
>>>>> 10. Given the current state of the web, there is a reasonable case for
>>>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>>> Is this still on the list for discussion for Tomcat 10?
>>> No, because it has already been implemented for Tomcat 10 and is in the
>>> milestone release currently being voted on.
>>
>> Waitasec. I'm not used to good news, so I want to make sure I understand
>> what you're saying. Are you saying that the proposed Tomcat 10
>> implementation already interprets encoded octets in web form submissions
>> using UTF-8 by default?!! :O
> 
> As of Tomcat 10, conf/web.xml contains the following:
> 
> <!--
>    Set the default request and response character encodings to UTF-8.
> -->
> <request-character-encoding>UTF-8</request-character-encoding>
> <response-character-encoding>UTF-8</response-character-encoding>
> 
> That *should* have the effect you are looking for but I confess I
> haven't tested it in any great detail.
> 

As I am sure many people (Christopher included) would agree, the real solution would be 
for browsers and other HTTP clients to indicate clearly in the request, the 
charset/encoding of each text parameter that they are sending.
There are even HTTP headers already defined for that.
(Nowadays the default could be Unicode/UTF-8).

The problem is that browsers and other agents don't do that, although they undoubtedly 
always know themselves, and although it would solve a series of issues that have literally 
been there forever at the server and application level (*).

I have often wondered if/why the Apache Foundation does not pack enough influence over the 
HTTP/HTML specifications process and over browser producers, to achieve that.
(And if not the Apache Foundation, then who ?)

(*) My own guess is that this basic thing (or lack of it) has cost over the years many 
thousands of lines of unnecessary code and many thousands of unproductive developer hours. 
As a tiny example, just consider the above web.xml parameters, and how much time in total 
was dedicated to their definition and implementation. Never mind all the previous related 
filters and valves and their discussions on this list. And that's only for Tomcat.


Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 06/02/2020 13:39, Garret Wilson wrote:
> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>> …
>>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>>> by default is something that should probably be discussed for Tomcat
>>>> 10. Given the current state of the web, there is a reasonable case for
>>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>> Is this still on the list for discussion for Tomcat 10?
>> No, because it has already been implemented for Tomcat 10 and is in the
>> milestone release currently being voted on.
> 
> Waitasec. I'm not used to good news, so I want to make sure I understand
> what you're saying. Are you saying that the proposed Tomcat 10
> implementation already interprets encoded octets in web form submissions
> using UTF-8 by default?!! :O

As of Tomcat 10, conf/web.xml contains the following:

<!--
  Set the default request and response character encodings to UTF-8.
-->
<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>

That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.
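
As a simplified model of what such a default means (this is an assumption about the general fallback order, not Tomcat's actual code, and all names are hypothetical): a charset stated by the request wins, then the configured default, and only then the legacy ISO-8859-1 fallback discussed earlier in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RequestEncodingModel {

    // Picks the charset used to decode request parameters.
    // Either argument may be null, meaning "not specified".
    public static Charset effectiveCharset(Charset fromRequest, Charset configuredDefault) {
        if (fromRequest != null) {
            return fromRequest;       // stated by the request itself
        }
        if (configuredDefault != null) {
            return configuredDefault; // e.g. a request-character-encoding setting
        }
        return StandardCharsets.ISO_8859_1; // legacy fallback
    }

    public static void main(String[] args) {
        System.out.println(effectiveCharset(null, StandardCharsets.UTF_8));
        System.out.println(effectiveCharset(null, null));
    }
}
```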

Mark


> 
> It will be a joy to update the FAQ when this is released.
> 
> Garret
> 
> 
> 



Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 2/6/2020 10:36 AM, Mark Thomas wrote:
> …
>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>> by default is something that should probably be discussed for Tomcat
>>> 10. Given the current state of the web, there is a reasonable case for
>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>> Is this still on the list for discussion for Tomcat 10?
> No, because it has already been implemented for Tomcat 10 and is in the
> milestone release currently being voted on.

Waitasec. I'm not used to good news, so I want to make sure I understand 
what you're saying. Are you saying that the proposed Tomcat 10 
implementation already interprets encoded octets in web form submissions 
using UTF-8 by default?!! :O

It will be a joy to update the FAQ when this is released.

Garret



Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 06/02/2020 13:30, Garret Wilson wrote:
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> …
>>
>> Yes, this default is now very out-dated. That is a side-effect of:
>> …
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>> …
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
> 
> Is this still on the list for discussion for Tomcat 10?

No, because it has already been implemented for Tomcat 10 and is in the
milestone release currently being voted on.

Mark


> 
> In my opinion it would be a real shame if Tomcat 10 ships with a web
> form encoding default that goes against the WhatWG specifications and
> corrupts non ISO-8859-1 content under modern browsers.
> 
> Garret
> 
> P.S. Mark, please ignore the other email from my personal email address.
> Because the Tomcat users list doesn't include my name in the "To:"
> header, my email client didn't know to use the correct reply address.
> 
> 
> 



Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 1/8/2019 9:57 PM, Mark Thomas wrote:
> …
>
> Yes, this default is now very out-dated. That is a side-effect of:
> …
> As of Servlet 4.0 there is a specification compliant configuration 
> option to change this default to any encoding of your choice. 
> Obviously, UTF-8 is one of the options. You can do this by adding the 
> following to your web.xml:
> …
>
> Whether Tomcat should ship with this setting present in conf/web.xml 
> by default is something that should probably be discussed for Tomcat 
> 10. Given the current state of the web, there is a reasonable case for 
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Is this still on the list for discussion for Tomcat 10?

In my opinion it would be a real shame if Tomcat 10 ships with a web 
form encoding default that goes against the WhatWG specifications and 
corrupts non ISO-8859-1 content under modern browsers.

Garret

P.S. Mark, please ignore the other email from my personal email address. 
Because the Tomcat users list doesn't include my name in the "To:" 
header, my email client didn't know to use the correct reply address.


Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 2/1/2019 9:38 AM, Christopher Schultz wrote:
>> Amazing. A close reading of RFC 3986 reveals that there is no
>> clear mandate for UTF-8 in existing URI schemes, even though
>> recommended for new schemes. Anyway, everyone seems to have settled
>> on UTF-8 (Tomcat included), so I'll try to indicate that.
> Wait... are you saying that _it's the Wild West out there?_ ;)
>
> Yes. The web is indeed held together with duct-tape and baling wire.
> It's amazing that it works as well as it does.

Hahaha. I'm /so/ happy someone agrees with me! Here's to improving 
things with a little JB Weld once in a while. (That's what my 
grandparents used on the farm when the baling wire and duct tape 
couldn't handle it.)

Garret

Re: distinction between resource charset and format octet decoding

Posted by Christopher Schultz <ch...@christopherschultz.net>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Garret,

On 2/1/19 11:08, Garret Wilson wrote:
> On 2/1/2019 7:23 AM, Garret Wilson wrote:
>> … * "There /is no default encoding for URIs/ specified anywhere,
>> which is why there is a lot of confusion when it comes to
>> decoding these values." Sheesh, this is ancient. I'll correct
>> it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
> 
> 
> Amazing. A close reading of RFC 3986 reveals that there is no
> clear mandate for UTF-8 in existing URI schemes, even though
> recommended for new schemes. Anyway, everyone seems to have settled
> on UTF-8 (Tomcat included), so I'll try to indicate that.

Wait... are you saying that _it's the Wild West out there?_ ;)

Yes. The web is indeed held together with duct-tape and baling wire.
It's amazing that it works as well as it does.

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlxUhDEACgkQHPApP6U8
pFhWlA/8Cxr6xzT8+cw5Mu/a8cH788p+ucK4QtO9Qlm6EBhhX2sW9BelWpk2ftOX
xypZkwW155D2hlz58eUTGSoFl92rgFZNXmXBoIXd+MDgNS/b0zgabb7N7wlHswzj
LJArA9GtXNjRy5vJc4Bpe37ZpiqcV9f/sbQhSO31ZrJYvnVuOOYszzfp2g6UWlg5
+OAgfi2L99uMxJdqc81eIVsL6mmmhlkJYe6ejAZjb/EQ2Lk74MKlgCUfaoasCdYd
hqdQJIBpRGvUnx6UEoq+sdEilBAXTJocGv8cyOFQY5rHcaTy7WIQ9mIWilTjBb6O
gxWJbgRfX+uOVhTT5mo7LoE+YVLQZ3QPAM21SEXtX3PR5Vuk4hB8SYj3/er7S7v2
/kPL0d5K2DsO8034PoZQBturIV8pkiF5jqr2nSTND/B0nFK9hcZu27qY9RigHF95
8owMY7/hdMsK2PlYOwyj6dZSMx94Iy5mWDCrF3GUFCbEN9u3/6HoRYuJZOpCv8h1
aZHZmiYDEtxzxL8OkXNqyuBu4k+HJ58/ABMelpXOjxMVHuFXkqny6XiqrzyWac+z
yW1otX/uLKgqKI9PL3O8MfzVS5LZ6XVtprkZUDhCBvsA8vQTZYBRVQu3DiGMPojj
U4STB1VBJSV4I67bBhkQaAZnsqIgeNi/qzHC+5h6hbHl+Me1lRg=
=Z4XG
-----END PGP SIGNATURE-----


Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 2/1/2019 7:23 AM, Garret Wilson wrote:
> …
>  * "There /is no default encoding for URIs/ specified anywhere, which
>    is why there is a lot of confusion when it comes to decoding these
>    values." Sheesh, this is ancient. I'll correct it as per
>    https://tools.ietf.org/html/rfc3986#section-2.5 .

Amazing. A close reading of RFC 3986 reveals that there is no clear 
mandate for UTF-8 in existing URI schemes, even though recommended for 
new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat 
included), so I'll try to indicate that.
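
For illustration, here is what "settling on UTF-8" amounts to when producing a URI component: convert the text to UTF-8 octets, then percent-encode every octet that is not an RFC 3986 unreserved character. This is a minimal sketch with hypothetical names, not code from Tomcat.

```java
import java.nio.charset.StandardCharsets;

public class Utf8PercentEncoder {

    // RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
    private static boolean isUnreserved(int octet) {
        return (octet >= 'A' && octet <= 'Z')
                || (octet >= 'a' && octet <= 'z')
                || (octet >= '0' && octet <= '9')
                || octet == '-' || octet == '.' || octet == '_' || octet == '~';
    }

    // Percent-encodes the UTF-8 octets of the text, leaving unreserved characters as-is.
    public static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF; // treat the byte as unsigned
            if (isUnreserved(octet)) {
                out.append((char) octet);
            } else {
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Fl\u00e1vio Jos\u00e9"));
    }
}
```

Unlike form encoding, this writes a space as `%20` rather than `+`.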

Garret


Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

Good morning, I'm just getting to the editing. I'm going to list some 
thoughts I have as I go through this, so you can verify things:

  * The servlet spec links are way out of date. I'll update them.
  * "There /is no default encoding for URIs/ specified anywhere, which
    is why there is a lot of confusion when it comes to decoding these
    values." Sheesh, this is ancient. I'll correct it as per
    https://tools.ietf.org/html/rfc3986#section-2.5 .
  * "Most of the web uses ISO-8859-1 as the default for query strings."
    Is this still true?! In light of the above, I would think it is not
    true, but I wanted to ask, as you know better about what you've seen
    "in the wild".

Garret

Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 01/02/2019 17:58, Garret Wilson wrote:
> OK, Mark, I've made my initial edits to the
> https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
> them over!_ This is my first edit to the wiki.
> 
> That page has a lot of legacy information, some of which had to do with
> internal Tomcat stuff, and some of which had to do with minute details
> of obsolete RFCs and evolution of browser behavior. I didn't want to
> spend the entire day (week?) on this, so I tried to surgically address
> only the sections relating to POST of
> application/x-www-form-urlencoded and how percent-encoded octets are
> interpreted. I couldn't resist updating the specification links and
> changing just a little prose about URL percent encoding.
> 
> There is the risk now that other sections of the page are still outdated
> and conflict with my changes, but most importantly the FAQ should
> provide more complete information on how Tomcat web applications can be
> made to work with modern browsers.
> 
> Please let me know if I bungled anything or if I need to clarify something.

LGTM.

> Thanks for letting me participate.

No need to thank us. We should be thanking you. Thank you.

So, what do you want to work on next? ;)

Cheers,

Mark


> 
> Garret
> 
> On 1/23/2019 12:26 AM, Mark Thomas wrote:
>> On 23/01/2019 05:07, Garret Wilson wrote:
>>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>>> …
>>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>>> Usernames are created in this form so references to the user
>>>> automatically become links to that user's page in the wiki.
>>>
>>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>>> overloading, but as this is my first wiki account anywhere, I'm guessing
>>> it's typical with whatever software you're using.
>>>
>>> Anyway my account is created, with username `GarretWilson`. After I get
>>> permissions I'll update the info on octet encoding for
>>> application/x-www-form-urlencoded in relation to the servlet spec. It
>>> may not be immediately, but I'll slowly but surely get to it.
>> Karma granted. Happy editing.
>>
>> Cheers,
>>
>> Mark
>>

Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

OK, Mark, I've made my initial edits to the 
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check 
them over!_ This is my first edit to the wiki.

That page has a lot of legacy information, some of which had to do with 
internal Tomcat stuff, and some of which had to do with minute details 
of obsolete RFCs and evolution of browser behavior. I didn't want to 
spend the entire day (week?) on this, so I tried to surgically address 
only the sections relating to POST of 
application/x-www-form-urlencoded and how percent-encoded octets are 
interpreted. I couldn't resist updating the specification links and 
changing just a little prose about URL percent encoding.

There is the risk now that other sections of the page are still outdated 
and conflict with my changes, but most importantly the FAQ should 
provide more complete information on how Tomcat web applications can be 
made to work with modern browsers.

Please let me know if I bungled anything or if I need to clarify something.

Thanks for letting me participate.

Garret

On 1/23/2019 12:26 AM, Mark Thomas wrote:
> On 23/01/2019 05:07, Garret Wilson wrote:
>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>> …
>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>> Usernames are created in this form so references to the user
>>> automatically become links to that user's page in the wiki.
>>
>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>> overloading, but as this is my first wiki account anywhere, I'm guessing
>> it's typical with whatever software you're using.
>>
>> Anyway my account is created, with username `GarretWilson`. After I get
>> permissions I'll update the info on octet encoding for
>> application/x-www-form-urlencoded in relation to the servlet spec. It
>> may not be immediately, but I'll slowly but surely get to it.
> Karma granted. Happy editing.
>
> Cheers,
>
> Mark
>

Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 23/01/2019 05:07, Garret Wilson wrote:
> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>> …
>> Anything in PascalCase becomes a link to a wiki page of that name.
>> Usernames are created in this form so references to the user
>> automatically become links to that user's page in the wiki.
> 
> 
> Ah, OK, that explains it. Very good to know. Maybe a little semantic
> overloading, but as this is my first wiki account anywhere, I'm guessing
> it's typical with whatever software you're using.
> 
> Anyway my account is created, with username `GarretWilson`. After I get
> permissions I'll update the info on octet encoding for
> application/x-www-form-urlencoded in relation to the servlet spec. It
> may not be immediately, but I'll slowly but surely get to it.

Karma granted. Happy editing.

Cheers,

Mark

Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 1/15/2019 3:20 AM, Mark Thomas wrote:
> …
> Anything in PascalCase becomes a link to a wiki page of that name.
> Usernames are created in this form so references to the user
> automatically become links to that user's page in the wiki.

Ah, OK, that explains it. Very good to know. Maybe a little semantic 
overloading, but as this is my first wiki account anywhere, I'm guessing 
it's typical with whatever software you're using.

Anyway my account is created, with username `GarretWilson`. After I get 
permissions I'll update the info on octet encoding for 
application/x-www-form-urlencoded in relation to the servlet spec. It 
may not be immediately, but I'll slowly but surely get to it.

Cheers,

Garret

Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 15/01/2019 03:39, Garret Wilson wrote:
> On 1/9/2019 2:30 AM, Mark Thomas wrote:
>> …
>> Create yourself an account at https://wiki.apache.org/tomcat (click
>> login then create an account) and let the list know your ID. Then one of
>> the admins can add you to the allowed editors.
> 
> 
> I was just ready to create an account, but I want to verify the details
> so I don't screw things up.
> 
>  * It asks for a "Name". Is this a username, I suppose? So we don't
>    maintain our "name" separate from our "login username"?

Yes, it is your username. Any linkage from that to your "public name"
would be maintained on your user page - if you wish.

>  * It says to use "FirstnameLastName". Are you literally wanting us to
>    use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
>    one who works with protocols all the time, I automatically assume
>    this stuff is important. But I prefer to use lowercase on my
>    usernames; I'm a little confused about why this would want
>    PascalCase for a login username. (I can't think of another system
>    that I use that requires PascalCase usernames.)

Think of it as a SHOULD rather than a MUST.

> My guess is that it's trying to maintain a "human name" and a "username"
> but combine them both into one field or something. I can't say this
> approach is typical…

Anything in PascalCase becomes a link to a wiki page of that name.
Usernames are created in this form so references to the user
automatically become links to that user's page in the wiki.

It isn't a feature we use much at the moment. A quick check shows that
most, but not all, contributors have created their user name in PascalCase.

For example, take a look at https://wiki.apache.org/tomcat/AndrewCarr

Mark

Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

On 1/9/2019 2:30 AM, Mark Thomas wrote:
> …
> Create yourself an account at https://wiki.apache.org/tomcat (click
> login then create an account) and let the list know your ID. Then one of
> the admins can add you to the allowed editors.


I was just ready to create an account, but I want to verify the details 
so I don't screw things up.

  * It asks for a "Name". Is this a username, I suppose? So we don't
    maintain our "name" separate from our "login username"?
  * It says to use "FirstnameLastName". Are you literally wanting us to
    use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
    one who works with protocols all the time, I automatically assume
    this stuff is important. But I prefer to use lowercase on my
    usernames; I'm a little confused about why this would want
    PascalCase for a login username. (I can't think of another system
    that I use that requires PascalCase usernames.)

My guess is that it's trying to maintain a "human name" and a "username" 
but combine them both into one field or something. I can't say this 
approach is typical…

Garret

Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 09/01/2019 00:50, Garret Wilson wrote:
> Hi, Mark, and thanks for the quick response. You provided some info I
> wasn't aware of. Some responses below:
> 
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> On 08/01/2019 21:31, Garret Wilson wrote:
>>
>> <snip/>
>>
>>> But as discussed above, this is completely wrong: the resource
>>> character encoding of a request sent in
>>> `application/x-www-form-urlencoded` should have absolutely no bearing
>>> on how the encoded octets within that resource are decoded.
>>
>> That is not the correct interpretation of section 3.12 of the Servlet
>> 4.0 specification (note the section numbers do vary between spec
>> versions). Tomcat implements the correct interpretation - i.e. the
>> charset from the request content-type defines how encoded octets are
>> decoded and, if none is specified, ISO-8859-1 is used as the default.
> 
> 
> Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
> is correctly following the spec, but I would still say the servlet spec
> is wrong to make any linkage at all between resource encoding and %nn
> interpretation. In fact reading the prose it's not clear to me that the
> servlet spec is even strongly tying the %nn interpretation to the
> encoding. It just seems to say that, unless otherwise specified, the %nn
> interpretation should be ISO-8859-1. And actually that's a step up from
> the HTML 4.0.1 spec, which in
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
> that they should be interpreted as US-ASCII codes. :(
> 
> You indicate that this is all out of date, and I think we're in
> agreement there. We really, really need to get the next servlet
> specification to remove this part. In fact the servlet specification
> should defer to the official `application/x-www-form-urlencoded`
> specification, which at this point I think is the W3C HTML5 spec, which
> in turn defers to the WHATWG spec (which clearly says that UTF-8 should
> be used). What makes all of this more of a mess is that there seems to be
> no way to work around this from the client side, e.g. by putting
> something in the HTML to indicate UTF-8, as
> `application/x-www-form-urlencoded` doesn't support a `charset` parameter.
> 
> Anyway if there are any openings on the committee to update the servlet
> spec, let me know.

That has moved to Eclipse. The process to update the spec is still being
defined. The Jakarta EE Servlet API project is the project to get
involved in.


>> ...
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>>
>> <request-character-encoding>UTF-8</request-character-encoding>
> 
> Oh, that is really good to know, thanks!! Still I say that the request
> character encoding is orthogonal to the %nn encoding, but, still, it's
> good to have an implementation-agnostic way to do it.
> 
>>
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
> 
> 
> Yes please! If I can help in any way, let me know.
> 
> 
>>
>> The Tomcat Wiki also needs to be updated to take account of this new
>> configuration option (and the related <response-character-encoding>).
>> Since it is a wiki and this is clearly an issue you care about would
>> you like to tackle that?
> 
> 
> Yes, I'd love to. Let me know what permissions I need, etc.

Create yourself an account at https://wiki.apache.org/tomcat (click
login then create an account) and let the list know your ID. Then one of
the admins can add you to the allowed editors.

Apologies for the hoop jumping required but without the manual approval
step for new accounts, the ASF project wikis were being deluged with spam.

Mark

> 
> I have an international flight boarding right now so I have to go, and I
> may not reply for the next few hours, but definitely sign me up.
> 
> Thanks,
> 
> Garret
> 
> 

Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

Hi, Mark, and thanks for the quick response. You provided some info I 
wasn't aware of. Some responses below:

On 1/8/2019 9:57 PM, Mark Thomas wrote:
> On 08/01/2019 21:31, Garret Wilson wrote:
>
> <snip/>
>
>> But as discussed above, this is completely wrong: the resource 
>> character encoding of a request sent in 
>> `application/x-www-form-urlencoded` should have absolutely no bearing 
>> on how the encoded octets within that resource are decoded.
>
> That is not the correct interpretation of section 3.12 of the Servlet 
> 4.0 specification (note the section numbers do vary between spec 
> versions). Tomcat implements the correct interpretation - i.e. the 
> charset from the request content-type defines how encoded octets are 
> decoded and, if none is specified, ISO-8859-1 is used as the default.

Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat 
is correctly following the spec, but I would still say the servlet spec 
is wrong to make any linkage at all between resource encoding and %nn 
interpretation. In fact reading the prose it's not clear to me that the 
servlet spec is even strongly tying the %nn interpretation to the 
encoding. It just seems to say that, unless otherwise specified, the %nn 
interpretation should be ISO-8859-1. And actually that's a step up from 
the HTML 4.0.1 spec, which in 
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates 
that they should be interpreted as US-ASCII codes. :(

You indicate that this is all out of date, and I think we're in 
agreement there. We really, really need to get the next servlet 
specification to remove this part. In fact the servlet specification 
should defer to the official `application/x-www-form-urlencoded` 
specification, which at this point I think is the W3C HTML5 spec, which 
in turn defers to the WHATWG spec (which clearly says that UTF-8 should 
be used). What makes all of this more of a mess is that there seems to be 
no way to work around this from the client side, e.g. by putting 
something in the HTML to indicate UTF-8, as 
`application/x-www-form-urlencoded` doesn't support a `charset` parameter.

Anyway if there are any openings on the committee to update the servlet 
spec, let me know.

> ...
> As of Servlet 4.0 there is a specification compliant configuration 
> option to change this default to any encoding of your choice. 
> Obviously, UTF-8 is one of the options. You can do this by adding the 
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>

Oh, that is really good to know, thanks!! Still I say that the request 
character encoding is orthogonal to the %nn encoding, but, still, it's 
good to have an implementation-agnostic way to do it.

>
>
> Whether Tomcat should ship with this setting present in conf/web.xml 
> by default is something that should probably be discussed for Tomcat 
> 10. Given the current state of the web, there is a reasonable case for 
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes please! If I can help in any way, let me know.

>
> The Tomcat Wiki also needs to be updated to take account of this new 
> configuration option (and the related <response-character-encoding>). 
> Since it is a wiki and this is clearly an issue you care about would 
> you like to tackle that?

Yes, I'd love to. Let me know what permissions I need, etc.

I have an international flight boarding right now so I have to go, and I 
may not reply for the next few hours, but definitely sign me up.

Thanks,

Garret

Re: distinction between resource charset and format octet decoding

Posted by Garret Wilson <ga...@globalmentor.com>.

Sorry to bring up the non-UTF-8 escaped octets form POST problem again, 
but …

On 1/8/2019 3:57 PM, Mark Thomas wrote:
> …
> As of Servlet 4.0 there is a specification compliant configuration 
> option to change this default to any encoding of your choice. 
> Obviously, UTF-8 is one of the options. You can do this by adding the 
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>
>
> If you add it to conf/web.xml it applies to every web application 
> deployed to Tomcat.
>
> Tomcat 9 uses this in the examples, manager and host-manager 
> applications in place of the SetCharacterEncodingFilter.

As you know I've already updated the Tomcat FAQ with the options for 
forcing Tomcat to interpret form POSTs with any escaped characters using 
UTF-8 octet sequences (as modern browsers send, and as HTML5 requires) 
instead of ISO-8859-1 (as the Servlet 4 spec says).
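
To make the mismatch concrete, here is a small Java sketch of the sending side. It shows how the same form value is percent-encoded under each charset. The class `FormEncodingDemo` and its `encode` helper are hypothetical names used only for illustration; the sketch assumes nothing beyond `java.net.URLEncoder` from the JDK and does not reproduce any browser's or Tomcat's actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat, Spring Boot, or the Servlet API.
final class FormEncodingDemo {

    // Encodes text as application/x-www-form-urlencoded, writing the
    // characters as %nn octets in the given charset (space becomes '+').
    static String encode(String text, Charset charset) {
        try {
            return URLEncoder.encode(text, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: the name comes from an existing Charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Flavio Jose" with U+00E1 and U+00E9, written with escapes so the
        // source file itself stays pure ASCII.
        String name = "Fl\u00e1vio Jos\u00e9";
        System.out.println(encode(name, StandardCharsets.UTF_8));      // Fl%C3%A1vio+Jos%C3%A9
        System.out.println(encode(name, StandardCharsets.ISO_8859_1)); // Fl%E1vio+Jos%E9
    }
}
```

A client that sends the first form while the server decodes with the second charset (or the other way around) produces the garbled parameters described here.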

But the problem is worse in the Spring community. If someone is using 
Spring Boot to create an executable JAR/WAR using embedded Tomcat, 
Spring Boot does something to configure Tomcat to interpret the POSTs 
correctly (that is, as the modern web likes it, not as the Servlet 4 
spec says). Unfortunately, if I use Spring Boot to make a WAR which is 
both a self-contained executable WAR /and/ a WAR deployable on Tomcat, 
when I deploy the WAR on Tomcat the escaped octets are decoded as 
ISO-8859-1, so my web app breaks. Yes, the same WAR behaves differently 
depending on whether it runs on Spring Boot's embedded Tomcat or is 
deployed on standalone Tomcat.

Spring Boot ignores any `web.xml` file. I guess I could create a 
`web.xml` file only for standalone Tomcat, but then this freezes Eclipse 
(as I posted elsewhere) because Eclipse doesn't understand 
`<request-character-encoding>`. So like so many things on the web, this 
is a mess.

This is a serious issue, in my opinion. The Servlet 4 specification is 
out of step with everything else in the ecosystem!

> Whether Tomcat should ship with this setting present in conf/web.xml 
> by default is something that should probably be discussed for Tomcat 
> 10. Given the current state of the web, there is a reasonable case for 
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes, can I just re-second (third?) that motion, and underscore the need 
for this to be changed in Tomcat 10?

Thanks,

Garret

Re: distinction between resource charset and format octet decoding

Posted by Mark Thomas <ma...@apache.org>.

On 08/01/2019 21:31, Garret Wilson wrote:

<snip/>

> But as discussed above, this is completely wrong: the resource character 
> encoding of a request sent in `application/x-www-form-urlencoded` should 
> have absolutely no bearing on how the encoded octets within that 
> resource are decoded.

That is not the correct interpretation of section 3.12 of the Servlet 
4.0 specification (note the section numbers do vary between spec 
versions). Tomcat implements the correct interpretation - i.e. the 
charset from the request content-type defines how encoded octets are 
decoded and, if none is specified, ISO-8859-1 is used as the default.
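
To illustrate the difference this default makes, here is a minimal Java sketch that decodes the example value from the start of this thread with each charset. It is not Tomcat's actual code. The class `PercentDecodingDemo` and its `decode` helper are hypothetical names; only `java.net.URLDecoder` from the JDK is assumed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the Servlet API.
final class PercentDecodingDemo {

    // Decodes application/x-www-form-urlencoded text, interpreting the
    // %nn octets in the given charset ('+' becomes a space).
    static String decode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: the name comes from an existing Charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String value = "Fl%C3%A1vio+Jos%C3%A9";
        // Same octets, two interpretations.
        System.out.println(decode(value, StandardCharsets.UTF_8));
        System.out.println(decode(value, StandardCharsets.ISO_8859_1));
    }
}
```

Decoded as UTF-8 the value is `Flávio José`; decoded as ISO-8859-1 it is `FlÃ¡vio JosÃ©`, which is what a web application sees when a UTF-8 form POST is read with the ISO-8859-1 default.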

Yes, this default is now very outdated. That is a side-effect of:
- how long the Servlet specification has been around
- the very conservative approach taken by Java EE in terms of backwards
   compatibility (once set, defaults are very rarely - if ever - changed)
- arguably missed opportunities to address this issue prior to
   Servlet 4.0

As of Servlet 4.0 there is a specification compliant configuration 
option to change this default to any encoding of your choice. Obviously, 
UTF-8 is one of the options. You can do this by adding the following to 
your web.xml:

<request-character-encoding>UTF-8</request-character-encoding>

If you add it to conf/web.xml it applies to every web application 
deployed to Tomcat.

Tomcat 9 uses this in the examples, manager and host-manager 
applications in place of the SetCharacterEncodingFilter.
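
The precedence described above (charset from the request content-type first, then the configured default, then ISO-8859-1) can be sketched in plain Java as follows. This is only an illustration of the rule as stated in this message, not Tomcat's implementation. `RequestCharsetResolver`, `resolve`, and the `configuredDefault` parameter (standing in for the `<request-character-encoding>` setting) are hypothetical, and falling back when the charset name is unknown is an assumption of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; not Tomcat's implementation.
final class RequestCharsetResolver {

    // Returns the charset used to decode escaped octets:
    // 1. the charset parameter of the Content-Type header, if present;
    // 2. otherwise configuredDefault (null means "not configured");
    // 3. otherwise ISO-8859-1.
    static Charset resolve(String contentType, Charset configuredDefault) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; the rest are parameters.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    if (!name.isEmpty()) {
                        try {
                            return Charset.forName(name);
                        } catch (IllegalArgumentException e) {
                            // Assumption of this sketch: an illegal or unsupported
                            // name is ignored and the defaults below apply.
                        }
                    }
                }
            }
        }
        return configuredDefault != null ? configuredDefault : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        String form = "application/x-www-form-urlencoded";
        System.out.println(resolve(form, null));                       // ISO-8859-1
        System.out.println(resolve(form, StandardCharsets.UTF_8));     // UTF-8
        System.out.println(resolve(form + "; charset=UTF-8", null));   // UTF-8
    }
}
```

With no charset on the request and nothing configured, the result is ISO-8859-1; supplying UTF-8 as the configured default changes only that fallback and leaves an explicit request charset untouched.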

Whether Tomcat should ship with this setting present in conf/web.xml by 
default is something that should probably be discussed for Tomcat 10. 
Given the current state of the web, there is a reasonable case for doing 
so. I'll add that to the TOMCAT-NEXT discussion list.

The Tomcat Wiki also needs to be updated to take account of this new 
configuration option (and the related <response-character-encoding>). 
Since it is a wiki and this is clearly an issue you care about, would 
you like to tackle that?

Mark
