Posted to users@tomcat.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2016/11/17 14:21:23 UTC

Sanity Check

All,

I've got a problem with a vendor and I'd like another opinion just to
make sure I'm not crazy. The vendor and I have a difference of opinion
about how a character should be encoded in an HTTP POST request.

The vendor's API officially should accept requests in UTF-8 encoding.
We are using application/x-www-form-urlencoded content type.

I'm trying to send a message with a non-ASCII character -- for
example, a ® (that's (R), the registered trademark symbol).

The Java code being used to package up this POST looks something like
this:

OutputStream out = httpurlconnection.getOutputStream();
// OutputStream has no print(); wrap it in a Writer. US-ASCII is safe here
// because the percent-encoded form data is pure ASCII.
Writer w = new OutputStreamWriter(out, "US-ASCII");
w.write("notetext=");
w.write(URLEncoder.encode("test®", "UTF-8"));
w.close();

So the POST payload ends up being notetext=test%C2%AE or, on the wire,
the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43 32 25 41 45.

The final bytes 25 43 32 25 41 45 are the characters % C 2 % A E.
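
For reference, here is a minimal, self-contained check of those values (it
only re-derives the bytes quoted above):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeCheck {
    public static void main(String[] args) throws Exception {
        // URLEncoder percent-encodes the UTF-8 bytes of its input
        System.out.println(URLEncoder.encode("test®", "UTF-8")); // test%C2%AE

        // The UTF-8 encoding of ® is the two bytes 0xC2 0xAE
        for (byte b : "®".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b); // c2 ae
        }
    }
}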

Can someone verify that I'm encoding everything correctly?

The vendor is claiming that ® can be sent "directly" like one might do
using curl:

$ curl -d 'notetext=®' [url]

and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae (note
that c2 and ae are "bare" and not %-encoded).
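
(For what it's worth, curl can also produce the percent-encoded form itself
via its --data-urlencode option.) The difference is easy to see on the
receiving side: the percent-encoded form decodes cleanly under UTF-8, while
the vendor's "bare" form depends on the server guessing the body's encoding.
A minimal sketch of the percent-encoded round trip:

import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        // Percent-decoding with UTF-8 recovers the original string
        System.out.println(URLDecoder.decode("notetext=test%C2%AE", "UTF-8"));
        // prints: notetext=test®
    }
}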

Thanks,
-chris



Re: Sanity Check

Posted by "Igal @ Lucee.org" <ig...@lucee.org>.
On 11/21/2016 9:09 AM, Christopher Schultz wrote:
> André,
>
> :)

Cute, very cute.



Re: Sanity Check

Posted by Christopher Schultz <ch...@christopherschultz.net>.

André,

On 11/22/16 5:51 AM, André Warnier (tomcat) wrote:
> So now, considering that such a thing would seem to have overall
> an overwhelming positive effect and no negative effect that we can
> think of, how would one go about proposing it ? For one, which
> would be the proper instance(s) to approach and how ?
> 
> I mean, having it adopted by acclamation would be very nice, but I
> am sure that there are some additional formalities to respect
> here.

Lobby individual products, I suspect.

httpd evidently already respects a charset in the
application/x-www-form-urlencoded content-type. I'd have to check, but
Tomcat might actually do that as well.

Now you just have to convince e.g. Firefox and Chrome to start doing it.
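
For a hand-rolled client, at least, nothing stops us from sending that
parameter today. A minimal sketch (the URL is a placeholder, and the charset
parameter is the one the registry technically disallows):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class CharsetPost {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://example.com/notes").openConnection(); // placeholder
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Advertise the charset explicitly; httpd (and maybe Tomcat) honors it
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        try (Writer w = new OutputStreamWriter(conn.getOutputStream(), "US-ASCII")) {
            w.write("notetext=" + URLEncoder.encode("test®", "UTF-8"));
        }
        System.out.println(conn.getResponseCode());
    }
}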

-chris



Re: Sanity Check

Posted by "André Warnier (tomcat)" <aw...@ice-sa.com>.
On 21.11.2016 18:09, Christopher Schultz wrote:
> André,
>
> :)
>
> On 11/19/16 12:31 PM, André Warnier (tomcat) wrote:
>> With respect, this is not only "André's problem".
>
> Agreed. I apologize if it seemed like I was suggesting that you are
> the only one complaining.
>
>> I would also posit that this being an English-language forum, the
>> posters here would tend to be predominantly English-speaking
>> developers, who are quite likely not the ones most affected by such
>> issues. So the above numbers are quite likely to be
>> unrepresentative of the number of people really affected by such
>> matters.
>
> Also agreed: we are a self-selected group. But while we are
> predominantly English-speaking (even as a second or third language),
> we are all serving user populations that fall outside of that realm.
>
> For instance, my software is overwhelmingly deployed in the United
> States, but we have full support for Simplified and Traditional
> Chinese script (except for top-to-bottom and right-to-left rendering,
> which we don't do quite yet).
>
> So ISO-8859-1 has basically never worked for us, and we've been UTF-8
> since roughly the beginning.
>
>> And one could also look at the amount of code in applications and
>> in Tomcat e.g., which is dedicated to working around linked
>> issues. (Think "UseBodyEncodingForURL",
>> "org.apache.catalina.filters.AddDefaultCharsetFilter" etc.)
>>
>> Basically what I'm saying is that this
>> "posted-parameters-encoding-issue" is far from being "licked",
>> despite the fact that native English-speaking developers may have a
>> tendency to believe that it is.
>
> Aah, I meant that *my* problem with *this* vendor is now an
> open-and-shut case: they are squarely in violation of the
> specifications. They may decide not to change, but at least we know
> the truth of the matter and can move forward from there.
>
> When it's unclear which party is at fault, the party with the bigger
> bank account wins. (In that case, it's the vendor who has all the
> money, not me :) But being able to claim that they advertise support
> for this specification and clearly do not correctly-support it means
> that really THEY should be making a change to their software, not me.
>
>>> The only problem now is that it's not clear how to turn %C2%AE
>>> into a character because you have to know that UTF-8 and not
>>> Shift-JIS or whatever is being used.
>>>
>>>> -> Required parameters : No parameters -> Optional parameters :
>>>> No parameters
>>>>
>>>> OK. So no charset= parameter is allowed. My advice to specify
>>>> the charset parameter was wrong.
>>
>> No, it wasn't, not really.  I believe that you were on a good track
>> there. It is the spec that is wrong, really.
>>
>> One is allowed to question a spec if it appears wrong, or ? After
>> all, RFC means "Request For Comment".
>
> Sure. The problem is that the app can only do so much, especially when
> the browsers behave in a very weird way... specifically by flatly
> refusing to provide a charset parameter to the Content-Type when it's
> appropriate.
>
> Being allowed (spec-wise) to include a charset along with that
> Content-Type would be nice. An alternative would be to keep the spec
> intact and add a new spec that introduces a new header, e.g.
> Encoded-Content-Type, that would be a stand-in for the missing
> "charset" parameter for a/xwfu.
>
>>> Agreed: it is always against the spec(s) to specify a charset for
>>> any MIME type that is not text/*.
>>
>> Agreed. It just makes no sense for data that is not fundamentally
>> "text". (Whether some such text data has or not a MIME type whose
>> designation starts with "text/" is quite another matter. For
>> example : the MIME type "application/ecmascript" refers to text
>> data (javascript code) - and allows a charset attribute - even
>> though its type name does not start with "text/"; there are many
>> other types like that).
>
> I think the real problem is that many application/* MIME types really
> should be text/* types instead. Javascript is another good example.
> a/xwfu is also, by definition, text. If you want to upload binaries,
> you use application/binary or multipart/form-data with a subtype of
> application/binary.
>
>>>> Apache Tomcat supports the use of charset parameter with
>>>> Content-Type application/x-www-form-urlencoded in POST
>>>> requests.
>>>
>>
>> Good for Tomcat.  That /is/ the intelligent thing to do, MIME-type
>> notwithstanding. Because if ever, clients such as standard web
>> browsers would come to pay more attention and apply this attribute,
>> much of the current confusion would go away.
>>
>> Even better would be, if the RFC for
>> "application/x-www-form-urlencoded" would be amended, to specify
>> that this charset attribute SHOULD be provided, and that by default
>> its value would be "ISO-8859-1" (for now; but there is a good case
>> to make it UTF-8 nowadays).
>
> Weirdly, the current behavior of web browsers is to:
>
> a) Use the charset of the page that presented the form
> and
> b) Not report it to the server when submitting the POST request
>
> So everybody loses, and you can't just claim "the standard should be
> X". The standard default should be "undefined" :)
>
>> In fact, if Tomcat was to strictly respect the MIME type definition
>> of "application/x-www-form-urlencoded" and thus, after
>> percent-decoding the POST body, interpret any byte of the resulting
>> string strictly as being a character in the US-ASCII character set,
>> that /would/ instantly break thousands of applications.
>
> It would break everything, and I don't think it would be a "strict"
> following of the spec. There is a hole in the spec because the server
> can't (per spec) know the intended character encoding of the text
> after it has been url-decoded.
>
> I'm saying that the a/xwfu raw body itself must be (per spec)
> US-ASCII. But once url-decoded, those bytes can be interpreted as
> pretty much anything, UTF-8 being the most sensible these days, but
> evidently ISO-8859-1 gets used a lot. Hence your "André" problem.
> Again, not YOUR problem. :)
>
>> it would now seem (unless I misinterpret, which is a distinct
>> possibility) that the content of a
>> "application/x-www-form-urlencoded" POST, *after*
>> URL-percent-decoding, *may* be a UTF-8 encoded Unicode string (it
>> may also be something else). (There is even a provision for
>> including a hidden "_charset_" parameter naming the
>> charset/encoding. Yet another muddle ?) (This also applies only to
>> HTML 5 <form> documents, but let's skip this for a moment).
>>
>> Still, as far as I can tell, there is still to some extent the
>> same "chicken-and-egg" problem, in the sense that in order to parse
>> the above parameter, one would first have to decode the
>> "application/x-www-form-urlencoded" POST body, using some character
>> set. For which one would need to know ditto character set before
>> decoding.
>
> The _charset_ thing is a horrible hack. It's worse than XML, but at
> least the XML parser can prove to itself that the character set of the
> bytes it's looking for are fairly close to the beginning of the
> stream. There's no requirement that the _charset_ parameter, for
> example, be the first parameter sent in the body of the request. :(
>
>> Pretty much the same solution applies to POSTs in the
>> "multipart/form-data" format, where each posted parameter already
>> has its own section with a MIME header.  Whenever one of these
>> parameters is text, it should specify a charset. (And if it
>> doesn't, then the current muddle applies).
>
> The problem is that most of these parts don't have a text/* MIME type.
> That's what I meant when I said you've "moved the problem" because
> a/xwfu can still hide in there and nothing has been solved.
>
>> The only remaining muddle is with the parameters passed inside the
>> URL, as a query-string.
>
> +1
>
>> But for those, one could apply for example the same mechanism as
>> is already applied for non-ASCII email header values (see
>> https://tools.ietf.org/html/rfc2047). This is not really ideal in
>> terms of simplicity, but 1) the code exists and works and 2) it
>> would certainly be preferable to the current muddled situation and
>> recurrent parameter encoding problems. (And again, for clients
>> which do not use this, then the current muddle applies).
>
> UTF-8 is pretty much the agreed-upon standard these days, except where
> it isn't :)
>
>> Altogether, to me it looks like there are 2 bodies of experts, one
>> on the HTML-and-client side and one on the HTTP-and-webserver side
>> (or maybe these are 4 bodies), who have not really been talking to
>> each other constructively on this issue for years.
>
> Yes and, oddly enough, they are all working under the W3C umbrella.
>
>> The result being that instead of agreeing on some simple rules,
>> each one of them kind of patched together its own separate set of
>> rules (and a lot of complex software), to obtain finally something
>> which still does not really solve the interoperability problem
>> fundamentally.
>>
>> The current situation is nothing short of ridiculous : - there are
>> many character sets/encodings in use, but most/all of them are
>> clearly defined and named - there are millions of webservers, and
>> billions of web clients But fundamentally : - currently, a client
>> has no way to know for sure what character set/encoding it should
>> use, when it first tries to send some piece of text data to a
>> webserver - currently, a webserver has no way to know for sure in
>> what character set/encoding a client is sending text data to it
>
> All true.
>
>> I'm sure that we can do better.  But someone somewhere has to take
>> the initiative.  And who better than an open-source software
>> foundation whose products already dominate the worldwide webserver
>> market ?
>
> https://xkcd.com/927/
>

Yep, that is what I would be afraid of. (Great site by the way, thanks for the pointer).

But thus, it seems that
- we are agreed that this is (still) a problem, for web users and web developers worldwide
- we are agreed that solving it, would not break existing applications/webservers
- we are agreed that solving it would not require fundamentally new rules or RFCs, just
maybe "tweaking" a couple of them (for example: allow a "charset" attribute for some
additional MIME types which fundamentally concern text data, such as
"application/x-www-form-urlencoded"; or allow a query-string to include an encoding as
per RFC 2047, which does not per se require tweaking any RFC)
- we are agreed that solving the problem would not require the writing of a lot of new
code, as such code already exists in a much-used and debugged form
- we are agreed that solving this issue would save many people a lot of aggravation, and
save the web-developer community a lot of basically superfluous coding, code maintenance,
documentation and on-line support, which together could reasonably be evaluated at
thousands of man-hours per year
- and we are agreed that we cannot think of any organisation or group of persons that such 
a proposal could actually hurt or even inconvenience

So now, considering that such a thing would seem to have overall an overwhelming positive 
effect and no negative effect that we can think of, how would one go about proposing it ?
For one, which would be the proper instance(s) to approach and how ?

I mean, having it adopted by acclamation would be very nice, but I am sure that there are 
some additional formalities to respect here.





Re: Sanity Check

Posted by Christopher Schultz <ch...@christopherschultz.net>.

André,

:)

On 11/19/16 12:31 PM, André Warnier (tomcat) wrote:
> With respect, this is not only "André's problem".

Agreed. I apologize if it seemed like I was suggesting that you are
the only one complaining.

> I would also posit that this being an English-language forum, the 
> posters here would tend to be predominantly English-speaking
> developers, who are quite likely not the ones most affected by such
> issues. So the above numbers are quite likely to be
> unrepresentative of the number of people really affected by such
> matters.

Also agreed: we are a self-selected group. But while we are
predominantly English-speaking (even as a second or third language),
we are all serving user populations that fall outside of that realm.

For instance, my software is overwhelmingly deployed in the United
States, but we have full support for Simplified and Traditional
Chinese script (except for top-to-bottom and right-to-left rendering,
which we don't do quite yet).

So ISO-8859-1 has basically never worked for us, and we've been UTF-8
since roughly the beginning.

> And one could also look at the amount of code in applications and
> in Tomcat e.g., which is dedicated to working around linked
> issues. (Think "UseBodyEncodingForURL", 
> "org.apache.catalina.filters.AddDefaultCharsetFilter" etc.)
> 
> Basically what I'm saying is that this 
> "posted-parameters-encoding-issue" is far from being "licked",
> despite the fact that native English-speaking developers may have a
> tendency to believe that it is.

Aah, I meant that *my* problem with *this* vendor is now an
open-and-shut case: they are squarely in violation of the
specifications. They may decide not to change, but at least we know
the truth of the matter and can move forward from there.

When it's unclear which party is at fault, the party with the bigger
bank account wins. (In that case, it's the vendor who has all the
money, not me :) But being able to claim that they advertise support
for this specification and clearly do not correctly-support it means
that really THEY should be making a change to their software, not me.

>> The only problem now is that it's not clear how to turn %C2%AE
>> into a character because you have to know that UTF-8 and not
>> Shift-JIS or whatever is being used.
>> 
>>> -> Required parameters : No parameters -> Optional parameters :
>>> No parameters
>>> 
>>> OK. So no charset= parameter is allowed. My advice to specify
>>> the charset parameter was wrong.
> 
> No, it wasn't, not really.  I believe that you were on a good track
> there. It is the spec that is wrong, really.
> 
> One is allowed to question a spec if it appears wrong, or ? After
> all, RFC means "Request For Comment".

Sure. The problem is that the app can only do so much, especially when
the browsers behave in a very weird way... specifically by flatly
refusing to provide a charset parameter to the Content-Type when it's
appropriate.

Being allowed (spec-wise) to include a charset along with that
Content-Type would be nice. An alternative would be to keep the spec
intact and add a new spec that introduces a new header, e.g.
Encoded-Content-Type, that would be a stand-in for the missing
"charset" parameter for a/xwfu.

>> Agreed: it is always against the spec(s) to specify a charset for
>> any MIME type that is not text/*.
> 
> Agreed. It just makes no sense for data that is not fundamentally
> "text". (Whether some such text data has or not a MIME type whose
> designation starts with "text/" is quite another matter. For
> example : the MIME type "application/ecmascript" refers to text
> data (javascript code) - and allows a charset attribute - even
> though its type name does not start with "text/"; there are many
> other types like that).

I think the real problem is that many application/* MIME types really
should be text/* types instead. Javascript is another good example.
a/xwfu is also, by definition, text. If you want to upload binaries,
you use application/binary or multipart/form-data with a subtype of
application/binary.

>>> Apache Tomcat supports the use of charset parameter with 
>>> Content-Type application/x-www-form-urlencoded in POST
>>> requests.
>> 
> 
> Good for Tomcat.  That /is/ the intelligent thing to do, MIME-type 
> notwithstanding. Because if ever, clients such as standard web
> browsers would come to pay more attention and apply this attribute,
> much of the current confusion would go away.
> 
> Even better would be, if the RFC for
> "application/x-www-form-urlencoded" would be amended, to specify
> that this charset attribute SHOULD be provided, and that by default
> its value would be "ISO-8859-1" (for now; but there is a good case
> to make it UTF-8 nowadays).

Weirdly, the current behavior of web browsers is to:

a) Use the charset of the page that presented the form
and
b) Not report it to the server when submitting the POST request

So everybody loses, and you can't just claim "the standard should be
X". The standard default should be "undefined" :)

> In fact, if Tomcat was to strictly respect the MIME type definition
> of "application/x-www-form-urlencoded" and thus, after
> percent-decoding the POST body, interpret any byte of the resulting
> string strictly as being a character in the US-ASCII character set,
> that /would/ instantly break thousands of applications.

It would break everything, and I don't think it would be a "strict"
following of the spec. There is a hole in the spec because the server
can't (per spec) know the intended character encoding of the text
after it has been url-decoded.

I'm saying that the a/xwfu raw body itself must be (per spec)
US-ASCII. But once url-decoded, those bytes can be interpreted as
pretty much anything, UTF-8 being the most sensible these days, but
evidently ISO-8859-1 gets used a lot. Hence your "André" problem.
Again, not YOUR problem. :)
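
The ambiguity is easy to demonstrate: the very same url-decoded bytes yield
different strings depending on which charset the server guesses. A minimal
sketch:

import java.nio.charset.StandardCharsets;

public class GuessDemo {
    public static void main(String[] args) {
        // The url-decoded bytes of %C2%AE
        byte[] decoded = { (byte) 0xC2, (byte) 0xAE };
        System.out.println(new String(decoded, StandardCharsets.UTF_8));      // ®
        System.out.println(new String(decoded, StandardCharsets.ISO_8859_1)); // Â®
    }
}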

> it would now seem (unless I misinterpret, which is a distinct 
> possibility) that the content of a
> "application/x-www-form-urlencoded" POST, *after*
> URL-percent-decoding, *may* be a UTF-8 encoded Unicode string (it
> may also be something else). (There is even a provision for
> including a hidden "_charset_" parameter naming the
> charset/encoding. Yet another muddle ?) (This also applies only to
> HTML 5 <form> documents, but let's skip this for a moment).
> 
> Still, as far as I can tell, there is still to some extent the
> same "chicken-and-egg" problem, in the sense that in order to parse
> the above parameter, one would first have to decode the 
> "application/x-www-form-urlencoded" POST body, using some character
> set. For which one would need to know ditto character set before
> decoding.

The _charset_ thing is a horrible hack. It's worse than XML, but at
least the XML parser can prove to itself that the character set of the
bytes it's looking for are fairly close to the beginning of the
stream. There's no requirement that the _charset_ parameter, for
example, be the first parameter sent in the body of the request. :(
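
The one saving grace is that the raw body is ASCII, so a server determined
to honor _charset_ can make a first pass over the raw text before decoding
any values. A sketch of that two-pass idea (the parameter layout here is my
own invention, not anything the spec requires):

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class CharsetHack {
    public static void main(String[] args) throws Exception {
        byte[] body = "a=%C2%AE&_charset_=UTF-8".getBytes(StandardCharsets.US_ASCII);

        // Pass 1: scan the (pure ASCII) raw body for _charset_
        String raw = new String(body, StandardCharsets.US_ASCII);
        String charset = "ISO-8859-1"; // fallback when _charset_ is absent
        for (String pair : raw.split("&")) {
            if (pair.startsWith("_charset_=")) {
                charset = pair.substring("_charset_=".length());
            }
        }

        // Pass 2: decode every value using the charset found in pass 1
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            System.out.println(URLDecoder.decode(pair.substring(0, eq), charset)
                    + " = " + URLDecoder.decode(pair.substring(eq + 1), charset));
        }
    }
}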

> Pretty much the same solution applies to POSTs in the 
> "multipart/form-data" format, where each posted parameter already
> has its own section with a MIME header.  Whenever one of these
> parameters is text, it should specify a charset. (And if it
> doesn't, then the current muddle applies).

The problem is that most of these parts don't have a text/* MIME type.
That's what I meant when I said you've "moved the problem" because
a/xwfu can still hide in there and nothing has been solved.

> The only remaining muddle is with the parameters passed inside the
> URL, as a query-string.

+1

> But for those, one could apply for example the same mechanism as
> is already applied for non-ASCII email header values (see 
> https://tools.ietf.org/html/rfc2047). This is not really ideal in
> terms of simplicity, but 1) the code exists and works and 2) it
> would certainly be preferable to the current muddled situation and
> recurrent parameter encoding problems. (And again, for clients
> which do not use this, then the current muddle applies).

UTF-8 is pretty much the agreed-upon standard these days, except where
it isn't :)

> Altogether, to me it looks like there are 2 bodies of experts, one
> on the HTML-and-client side and one on the HTTP-and-webserver side
> (or maybe these are 4 bodies), who have not really been talking to
> each other constructively on this issue for years.

Yes and, oddly enough, they are all working under the W3C umbrella.

> The result being that instead of agreeing on some simple rules,
> each one of them kind of patched together its own separate set of
> rules (and a lot of complex software), to obtain finally something
> which still does not really solve the interoperability problem
> fundamentally.
> 
> The current situation is nothing short of ridiculous : - there are
> many character sets/encodings in use, but most/all of them are
> clearly defined and named - there are millions of webservers, and
> billions of web clients But fundamentally : - currently, a client
> has no way to know for sure what character set/encoding it should
> use, when it first tries to send some piece of text data to a
> webserver - currently, a webserver has no way to know for sure in
> what character set/encoding a client is sending text data to it

All true.

> I'm sure that we can do better.  But someone somewhere has to take
> the initiative.  And who better than an open-source software
> foundation whose products already dominate the worldwide webserver
> market ?

https://xkcd.com/927/

-chris



Re: Sanity Check

Posted by "André Warnier (tomcat)" <aw...@ice-sa.com>.
On 18.11.2016 20:27, Christopher Schultz wrote:
> Konstantin,
>
> On 11/18/16 2:10 PM, Konstantin Kolinko wrote:
>> One more authority, that I forgot to mention in my mail: IANA
>> registry of mime types
>>
>> Registry:
>> https://www.iana.org/assignments/media-types/media-types.xhtml
>>
>> Registration entry for "application/x-www-form-urlencoded"
>> https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
>>
>>   -> Encoding considerations : 7bit
>>
>> According to RFC defining this registry, it means that the data is
>> 7-bit ASCII only. https://tools.ietf.org/html/rfc6838#section-4.8
>
> Oh, that's the nail in the coffin.
>
> application/x-www-form-urlencoded from W3C says "if the character
> doesn't fit into the encoding of the message, it must be %-encoded"
> but it never says what "the encoding of the message" actually is. My
> worry was that it was mutable, and that UTF-8 was a valid encoding,
> meaning that 0xc2 0xae on the wire would have been acceptable (rather
> than %C2%AE).
>
> If application/x-www-form-urlencoded is *absolutely* supposed to be
> 7-bit ASCII, then nothing above 0x7f can ever be legally transferred
> across the wire when using that content-type.
>
> This solves André's problem with this content-type where he wanted to
> specify the charset to be used. It seems the standard defines the
> character set: US-ASCII.

With respect, this is not only "André's problem".
This is a general problem (not only with Tomcat), which affects any and all users and web 
application programmers and webserver developers, as soon as they are dealing with the 
World at large, which effectively uses a lot of languages which cannot be represented by 
the iso-latin-1 character set, and much less even by the US-ASCII character set.
It affects users, because many users still regularly see the data that they enter into web
application pages and submit to a server, being misinterpreted.  (I cannot tell you how
many times, even nowadays, I fill in my name in a web form, only to have it echoed back to
me as some variation of "andré".)
As for web application and webserver developers, one only has to look at the archives of a 
forum such as Tomcat's, to see how often and how regularly such issues come up, and keep 
coming up over the years :

Sample from marc.info, tomcat-user :
period : 2016-02-08 / 2016-11-19
Total messages : 3582
Messages mentioning "encoding" : 164
Messages mentioning "character set" : 41

for comparison :
Messages mentioning "NIO" : 90
Messages mentioning "AJP" : 201
Messages mentioning "memory" : 258

Granted, this is not a very fine-grained analysis.  But all in all, it would tend to suggest that
this is not a "minor" issue : for Tomcat alone, it comes up just about as often as the 
"memory usage" topic, and more often than either Connector above.
I would also posit that this being an English-language forum, the posters here would tend 
to be predominantly English-speaking developers, who are quite likely not the ones most
affected by such issues. So the above numbers are quite likely to be unrepresentative of 
the number of people really affected by such matters.

And one could also look at the amount of code in applications and in Tomcat e.g., which is 
dedicated to working around linked issues.
(Think "UseBodyEncodingForURL", "org.apache.catalina.filters.AddDefaultCharsetFilter" etc.)

Basically what I'm saying is that this "posted-parameters-encoding-issue" is far from 
being "licked", despite the fact that native English-speaking developers may have a 
tendency to believe that it is.

>
> The only problem now is that it's not clear how to turn %C2%AE into a
> character because you have to know that UTF-8 and not Shift-JIS or
> whatever is being used.
>
>> -> Required parameters : No parameters -> Optional parameters :  No
>> parameters
>>
>> OK. So no charset= parameter is allowed. My advice to specify the
>> charset parameter was wrong.

No, it wasn't, not really.  I believe that you were on a good track there.
It is the spec that is wrong, really.

One is allowed to question a spec if it appears wrong, or ?
After all, RFC means "Request For Comment".

>
> Agreed: it is always against the spec(s) to specify a charset for any
> MIME type that is not text/*.

Agreed. It just makes no sense for data that is not fundamentally "text".
(Whether some such text data has or not a MIME type whose designation starts with "text/" 
is quite another matter. For example : the MIME type "application/ecmascript" refers to 
text data (javascript code) - and allows a charset attribute - even though its type name 
does not start with "text/"; there are many other types like that).

>
>> Though historically ~10 years ago I saw
>> "application/x-www-form-urlencoded;charset=UTF-8" Content-Type in
>> the wild.
>
> Oh, I'm sure you saw it. I even tossed that into my client to see if
> it would make a difference. Not surprisingly, it did not.
>
>> It was a web site authored in WML (Wireless Markup Language) and
>> accessed via WAP protocol by mobile phones.
>>
>> (Specification reference for this WML/WAP usage:
>> http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf
>>
>>   Document title: WAP WML WAP-191-WML 19 February 2000
>>
>> Wireless Application Protocol Wireless Markup Language
>> Specification Version 1.3
>>
>> -> Page 30 of 110 (in Section "9.5.1 The Go Element"): There is a
>> table, where the following line is relevant:
>>
>> Method: post Enctype: application/x-www-form-urlencoded Process:
>> [...] The Content-Type header must include the charset parameter to
>> indicate the character encoding.
>>
>> I suspect that the above URL is not the official location of the
>> document. I found it through Googling. Official location should be
>> http://www.wapforum.org/what/technical.htm )
>>
>>
>> Apache Tomcat supports the use of charset parameter with
>> Content-Type application/x-www-form-urlencoded in POST requests.
>

Good for Tomcat.  That /is/ the intelligent thing to do, MIME-type notwithstanding.
Because if ever, clients such as standard web browsers would come to pay more attention 
and apply this attribute, much of the current confusion would go away.

Even better would be, if the RFC for "application/x-www-form-urlencoded" would be amended, 
to specify that this charset attribute SHOULD be provided, and that by default its value 
would be "ISO-8859-1" (for now; but there is a good case to make it UTF-8 nowadays).
And the justification for this would be that, undoubtedly, in practice this MIME type
applies exclusively for *text* data anyway, and that at numerous other places in the HTTP 
and WWW-related specifications, it already indicates that for text data, the character 
set/encoding should be clearly specified.

I mean, quite obviously, the current definition saying that this MIME type, which is used 
in millions of places to pass named text values from HTML <form>s to webservers, is to be 
composed of character codes belonging to the US-ASCII alphabet exclusively, is hopelessly 
out-of-date and is, in the real world, violated millions of times every day.
Or is there someone who would claim that there are not hundreds of thousands of web
forms being submitted every day to webservers in Germany, France, Spain, etc using POSTs 
with a Content-type "application/x-www-form-urlencoded", and that no parameter passed in 
this way ever contains more than US-ASCII characters ?

In fact, if Tomcat was to strictly respect the MIME type definition of 
"application/x-www-form-urlencoded" and thus, after percent-decoding the POST body, 
interpret any byte of the resulting string strictly as being a character in the US-ASCII 
character set, that /would/ instantly break thousands of applications.


> Interesting. I suspect that's because there are practical situations
> where "being liberal with what you accept" is more appropriate than
> angrily demanding that all clients be 100% spec-compliant :)
>
> The (illegal) charset parameter can only mean one thing: the character
> encoding to use to assemble url-decoded bytes into an actual string
> value (e.g. %C2%AE -> 0xc2 0xae -> "®" when using UTF-8).
>
> Thanks for that final reference; it really does close the case on this
> whole thing.
>

It does not really. That would just brush it under the carpet, again.

Addendum :
It seems that HTML 5 is (finally) trying to do something about this muddle :
- Starting from the MIME type registry of "application/x-www-form-urlencoded", in
   http://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
- which says :
"
Interoperability considerations :
Rules for generating and processing application/x-www-form-urlencoded payloads are defined 
in the HTML specification.

Published specification :
http://www.w3.org/TR/html is the relevant specification. Algorithms for encoding and 
decoding are defined.
"
- and thus going to http://www.w3.org/TR/html ...
- which somehow leads to : 
https://www.w3.org/TR/html/sec-forms.html#application-x-www-form-urlencoded-encoding-algorithm
- and from there to :
https://url.spec.whatwg.org/#concept-urlencoded-serializer

it would now seem (unless I misinterpret, which is a distinct possibility) that the 
content of a "application/x-www-form-urlencoded" POST, *after* URL-percent-decoding,
*may* be a UTF-8 encoded Unicode string (it may also be something else).
(There is even a provision for including a hidden "_charset_" parameter naming the 
charset/encoding. Yet another muddle ?)
(This also applies only to HTML 5 <form> documents, but let's skip this for a moment).

Still, as far as I can tell, there is still to some extent the same "chicken-and-egg" 
problem, in the sense that in order to parse the above parameter, one would first have to 
decode the "application/x-www-form-urlencoded" POST body, using some character set.
For which one would need to know ditto character set before decoding.

To summarise :
In a POST in the "application/x-www-form-urlencoded" format, there is a body. This body 
has a single part, and it cannot be other than text (it is in fact a "query-string" 
composed of name/value pairs; only, it is put in the body of the request, instead of being 
appended to the URL).
So the Content-Type header of the POST request would be the perfect logical place to add a 
"charset" parameter, which would lift any uncertainty about the content of this 
query-string, character-set wise. And by default for now it could be ISO-8859-1, to match 
the majority of the rest of the WWW-related specs. (But it would *allow* the usage of any 
other encoding).
I do not believe that this would break anything. For clients which do not provide this
charset attribute, the current muddled logic would still apply.
And it would certainly be simpler to implement, than the logic described in the HTML-5 
document.

Pretty much the same solution applies to POSTs in the "multipart/form-data" format, where 
each posted parameter already has its own section with a MIME header.  Whenever one of 
these parameters is text, it should specify a charset. (And if it doesn't, then the 
current muddle applies).
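
Concretely, a part that followed that recommendation would look something
like this on the wire (boundary and field name invented for illustration):

Content-Type: multipart/form-data; boundary=----xyz

------xyz
Content-Disposition: form-data; name="notetext"
Content-Type: text/plain; charset=UTF-8

test®
------xyz--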

The only remaining muddle is with the parameters passed inside the URL, as a query-string.
But for those, one could apply for example the same mechanism as is already applied for 
non-ASCII email header values (see https://tools.ietf.org/html/rfc2047). This is not 
really ideal in terms of simplicity, but 1) the code exists and works and 2) it would 
certainly be preferable to the current muddled situation and recurrent parameter encoding 
problems. (And again, for clients which do not use this, then the current muddle applies).
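
That code does indeed exist in the Java world; a minimal sketch, assuming
the JavaMail classes are on the classpath:

import javax.mail.internet.MimeUtility;

public class Rfc2047Demo {
    public static void main(String[] args) throws Exception {
        // An RFC 2047 "encoded-word" carries its charset and payload together
        System.out.println(MimeUtility.decodeText("=?UTF-8?Q?test=C2=AE?="));
        // prints: test®
    }
}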

Altogether, to me it looks like there are 2 bodies of experts, one on the HTML-and-client 
side and one on the HTTP-and-webserver side (or maybe these are 4 bodies), who have not 
really been talking to each other constructively on this issue for years.
The result being that instead of agreeing on some simple rules, each one of them kind of 
patched together its own separate set of rules (and a lot of complex software), to obtain 
finally something which still does not really solve the interoperability problem 
fundamentally.

The current situation is nothing short of ridiculous :
- there are many character sets/encodings in use, but most/all of them are clearly defined 
and named
- there are millions of webservers, and billions of web clients
But fundamentally :
- currently, a client has no way to know for sure what character set/encoding it should 
use, when it first tries to send some piece of text data to a webserver
- currently, a webserver has no way to know for sure in what character set/encoding a 
client is sending text data to it

I'm sure that we can do better.  But someone somewhere has to take the initiative.  And 
who better than an open-source software foundation whose products already dominate the
worldwide webserver market ?




Re: Sanity Check

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Konstantin,

On 11/18/16 2:10 PM, Konstantin Kolinko wrote:
> One more authority, that I forgot to mention in my mail: IANA
> registry of mime types
> 
> Registry: 
> https://www.iana.org/assignments/media-types/media-types.xhtml
> 
> Registration entry for "application/x-www-form-urlencoded" 
> https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
>
>  -> Encoding considerations : 7bit
> 
> According to RFC defining this registry, it means that the data is 
> 7-bit ASCII only. https://tools.ietf.org/html/rfc6838#section-4.8

Oh, that's the nail in the coffin.

application/x-www-form-urlencoded from W3C says "if the character
doesn't fit into the encoding of the message, it must be %-encoded"
but it never says what "the encoding of the message" actually is. My
worry was that it was mutable, and that UTF-8 was a valid encoding,
meaning that 0xc2 0xae on the wire would have been acceptable (rather
than %C2%AE).

If application/x-www-form-urlencoded is *absolutely* supposed to be
7-bit ASCII, then nothing above 0x7f can ever be legally transferred
across the wire when using that content-type.

This solves André's problem with this content-type where he wanted to
specify the charset to be used. It seems the standard defines the
character set: US-ASCII.

The only problem now is that it's not clear how to turn %C2%AE into a
character because you have to know that UTF-8 and not Shift-JIS or
whatever is being used.

> -> Required parameters : No parameters -> Optional parameters :  No
> parameters
> 
> OK. So no charset= parameter is allowed. My advice to specify the
> charset parameter was wrong.

Agreed: it is always against the spec(s) to specify a charset for any
MIME type that is not text/*.

> Though historically ~10 years ago I saw 
> "application/x-www-form-urlencoded;charset=UTF-8" Content-Type in
> the wild.

Oh, I'm sure you saw it. I even tossed that into my client to see if
it would make a difference. Not surprisingly, it did not.

> It was a web site authored in WML (Wireless Markup Language) and 
> accessed via WAP protocol by mobile phones.
> 
> (Specification reference for this WML/WAP usage: 
> http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf
>
>  Document title: WAP WML WAP-191-WML 19 February 2000
> 
> Wireless Application Protocol Wireless Markup Language
> Specification Version 1.3
> 
> -> Page 30 of 110 (in Section "9.5.1 The Go Element"): There is a
> table, where the following line is relevant:
> 
> Method: post Enctype: application/x-www-form-urlencoded Process:
> [...] The Content-Type header must include the charset parameter to
> indicate the character encoding.
> 
> I suspect that the above URL is not the official location of the 
> document. I found it through Googling. Official location should be
> http://www.wapforum.org/what/technical.htm )
> 
> 
> Apache Tomcat supports the use of charset parameter with
> Content-Type application/x-www-form-urlencoded in POST requests.

Interesting. I suspect that's because there are practical situations
where "being liberal with what you accept" is more appropriate than
angrily demanding that all clients be 100% spec-compliant :)

The (illegal) charset parameter can only mean one thing: the character
encoding to use to assemble url-decoded bytes into an actual string
value (e.g. %C2%AE -> 0xc2 0xae -> "®" when using UTF-8).
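
Spelled out in code, that two-step decode is (a minimal sketch, doing the
percent-decoding by hand to make the intermediate bytes visible):

import java.nio.charset.StandardCharsets;

public class TwoStepDecode {
    public static void main(String[] args) {
        String wire = "%C2%AE";
        // Step 1: url-decode to raw bytes (each %XX becomes one byte)
        byte[] bytes = {
                (byte) Integer.parseInt(wire.substring(1, 3), 16),
                (byte) Integer.parseInt(wire.substring(4, 6), 16) };
        // Step 2: interpret those bytes in the (guessed or declared) charset
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // ®
    }
}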

Thanks for that final reference; it really does close the case on this
whole thing.

-chris



Re: Sanity Check

Posted by Konstantin Kolinko <kn...@gmail.com>.
2016-11-18 19:02 GMT+03:00 Christopher Schultz <ch...@christopherschultz.net>:
> André,
>
> On 11/18/16 3:50 AM, André Warnier (tomcat) wrote:
>> On 18.11.2016 05:56, Christopher Schultz wrote:
>>> Since UTF-8 is supposed to be the "official" character encoding,
>>
>> Now where is that specified ?  As far as I know, the default
>> charset for everything HTTP and HTML-wise is still iso-8859-1, no ?
>> (and unfortunately so).
>
> I apologize for the sloppy language: this particular vendor's service
> claims that UTF-8 is the standard *for their service*. Not for HTTP in
> general.
>
>>> The vendor has responded with (paraphrasing) "it seems we don't
>>> completely follow this standard; we're considering what to do
>>> next, which may include no change". This is a big vendor with
>>> *lots* of software clients, so maintaining backward compatibility
>>> is going to be a big deal for them. I've got some tricks up my
>>> sleeve if they decide not to change anything. Hooray for specs.
>>> :(
>>
>> What I never understood in all that, is why browsers and other
>> clients never seem to respect (and servers do not seem to enforce)
>> what is indicated here :
>>
>> https://www.ietf.org/rfc/rfc2388.txt 4.5 Charset of text in form
>> data
>>
>> This would be a simple way to get rid of umpteen character
>> set/encoding issues encountered when trying to interpret <form>
>> data POSTed to web applications.
>
> The problem is that application/x-www-form-urlencoded doesn't give a
> client a natural way to specify the character encoding, and a/xwfu can
> be used inside of a multipart/form-data package as well. You've just
> moved the problem from the Content-Type of the request to the
> Content-Type of the *part* of the multi-part request. Nothing has been
> solved by using multipart/form-data.
>
> And browsers certainly DO use that, but almost exclusively for things
> like file-upload, since files tend to be very big already, and
> urlencoding a bunch of binary bytes makes the file size increase quite
> a bit.
>
>> It seems to me contrary to common sense that in our day and age,
>> the rules for this could not be set once and for all to something
>> like :
>>
>> 1) the default character set/encoding of HTTP and HTML is
>> Unicode/UTF-8 (instead of the current really archaic iso-8859-1) 2)
>> URLs (including query-strings) should be by default interpreted as
>> Unicode/UTF-8, encoded as per
>> https://tools.ietf.org/html/rfc3986#section-2 3) for POST requests
>> : - for the Content-type "application/x-www-form-urlencoded",
>> there SHOULD be a charset attribute indicating the charset and
>> encoding. By default, this is "text/plain; charset=UTF-8"
>
> Don't forget, charset == encoding. The text/plain is the MIME type,
> and that's already been defined as application/x-www-form-urlencoded.
> Somewhere it should just explicitly say "a/xwfu" must contain only
> ASCII bytes, and always encodes a text blob in UTF-8 encoding.
>
> But it will never happen (see below).

One more authority, that I forgot to mention in my mail:
IANA registry of mime types

Registry:
https://www.iana.org/assignments/media-types/media-types.xhtml

Registration entry for "application/x-www-form-urlencoded"
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded

-> Encoding considerations : 7bit

According to RFC defining this registry, it means that the data is
7-bit ASCII only.
https://tools.ietf.org/html/rfc6838#section-4.8

-> Required parameters : No parameters
-> Optional parameters :  No parameters

OK. So no charset= parameter is allowed.
My advice to specify the charset parameter was wrong.

Though historically ~10 years ago I saw
"application/x-www-form-urlencoded;charset=UTF-8" Content-Type in the
wild.

It was a web site authored in WML (Wireless Markup Language) and
accessed via WAP protocol by mobile phones.

(Specification reference for this WML/WAP usage:
http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf

Document title:
WAP WML
WAP-191-WML
19 February 2000

Wireless Application Protocol
Wireless Markup Language Specification
Version 1.3

-> Page 30 of 110 (in Section "9.5.1 The Go Element"):
There is a table, where the following line is relevant:

Method: post
Enctype: application/x-www-form-urlencoded
Process: [...] The Content-Type header must include the charset
parameter to indicate the character encoding.

I suspect that the above URL is not the official location of the
document. I found it through Googling.
Official location should be http://www.wapforum.org/what/technical.htm
)


Apache Tomcat supports the use of charset parameter with Content-Type
application/x-www-form-urlencoded in POST requests.
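
In servlet terms, when the client supplies no charset the portable fallback
is to set one explicitly before the first parameter is read. A minimal
sketch:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class NoteServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Only effective before any parameter is read; a client-supplied
        // charset parameter (which Tomcat honors) takes precedence
        if (req.getCharacterEncoding() == null) {
            req.setCharacterEncoding("UTF-8");
        }
        String note = req.getParameter("notetext");
        resp.setContentType("text/plain; charset=UTF-8");
        resp.getWriter().println(note);
    }
}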

>> - for the Content-type "multipart/form-data", each "part" MUST have
>> a Content-type header.  If this Content-type is a "text" type, then
>> the Content-type header SHOULD contain a charset attribute. If
>> omitted, by default this is "charset=UTF-8".
>>
>> and be done with it once and for all.
>
> Right: once and for all, for new clients who implement the spec. All
> old clients, servers, proxies, etc. be damned. It's just not
> possible due to the need to be backward-compatible with really weird
> stuff like "smart" toasters and refrigerators, WebTV (remember that?)
> and all manner of embedded devices that will never be updated.
>
> What we really need is a new header that says "here's everything you
> need to know about encoding for this request" and clients and servers
> who both support that header can use it. All other uses need to
> fall-back to this old and nasty heuristic.
>
> -chris


Best regards,
Konstantin Kolinko



Re: Sanity Check

Posted by "André Warnier (tomcat)" <aw...@ice-sa.com>.
On 18.11.2016 17:02, Christopher Schultz wrote:
> André,
>
> On 11/18/16 3:50 AM, André Warnier (tomcat) wrote:
>> On 18.11.2016 05:56, Christopher Schultz wrote:
>>> Since UTF-8 is supposed to be the "official" character encoding,
>>
>> Now where is that specified ?  As far as I know, the default
>> charset for everything HTTP and HTML-wise is still iso-8859-1, no ?
>> (and unfortunately so).
>
> I apologize for the sloppy language: this particular vendor's service
> claims that UTF-8 is the standard *for their service*. Not for HTTP in
> general.
>
>>> The vendor has responded with (paraphrasing) "it seems we don't
>>> completely follow this standard; we're considering what to do
>>> next, which may include no change". This is a big vendor with
>>> *lots* of software clients, so maintaining backward compatibility
>>> is going to be a big deal for them. I've got some tricks up my
>>> sleeve if they decide not to change anything. Hooray for specs.
>>> :(
>>
>> What I never understood in all that, is why browsers and other
>> clients never seem to respect (and servers do not seem to enforce)
>> what is indicated here :
>>
>> https://www.ietf.org/rfc/rfc2388.txt 4.5 Charset of text in form
>> data
>>
>> This would be a simple way to get rid of umpteen character
>> set/encoding issues encountered when trying to interpret <form>
>> data POSTed to web applications.
>
> The problem is that application/x-www-form-urlencoded doesn't give a
> client a natural way to specify the character encoding,

Yes, it does.  In the case of this content-type, the whole list of posted parameters is 
provided as one big chunk of text, in the body of the request.
The content-type "application/x-www-form-urlencoded" implies text, because there is no 
good way in that format to include any post parameter which is not text.
Since it is text, there is no good reason why the (single) Content-type header of the POST 
could not provide a charset attribute.

> and a/xwfu can
> be used inside of a multipart/form-data package as well. You've just
> moved the problem from the Content-Type of the request to the
> Content-Type of the *part* of the multi-part request. Nothing has been
> solved by using multipart/form-data.

I have not changed or moved anything. I have just added the requirement that if any
of these parts is a text-type part, it SHOULD also contain a charset attribute.

This is precisely what browsers do not do, for whatever reason which is beyond my 
comprehension.  The parts already have a Content-type. It is just the charset attribute 
*for the parts which are text* that is missing, despite what the rfc2388 recommendation says.

>
> And browsers certainly DO use that, but almost exclusively for things
> like file-upload, since files tend to be very big already, and
> urlencoding a bunch of binary bytes makes the file size increase quite
> a bit.
>
>> It seems to me contrary to common sense that in our day and age,
>> the rules for this could not be set once and for all to something
>> like :
>>
>> 1) the default character set/encoding of HTTP and HTML is
>> Unicode/UTF-8 (instead of the current really archaic iso-8859-1)

>> 2)
>> URLs (including query-strings) should be by default interpreted as
>> Unicode/UTF-8, encoded as per
>> https://tools.ietf.org/html/rfc3986#section-2

>> 3) for POST requests
>> : - for the Content-type "application/x-www-form-urlencoded",
>> there SHOULD be a charset attribute indicating the charset and
>> encoding. By default, this is "text/plain; charset=UTF-8"
>
> Don't forget, charset == encoding. The text/plain is the MIME type,
> and that's already been defined as application/x-www-form-urlencoded.

I made a mistake here. Scratch the "text/plain;" part above. The charset attribute should 
be added to the existing Content-type header.
In other words, the header should be :
Content-type: application/x-www-form-urlencoded; charset=xxxx
The MIME type "x-www-form-urlencoded" already *implies* that this is text, URL-encoded.
It just fails to specify what charset/encoding the query string was encoded with, *before* 
it was URL-encoded.
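
Parsing that attribute out of the header is also trivial; a sketch
(deliberately naive about quoting and whitespace):

public class CharsetParam {
    // Returns the charset parameter of a Content-type value, or the fallback
    static String charsetOf(String contentType, String fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase().startsWith("charset=")) {
                return p.substring("charset=".length());
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf(
                "application/x-www-form-urlencoded; charset=UTF-8", "ISO-8859-1"));
        // prints: UTF-8
    }
}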

> Somewhere it should just explicitly say "a/xwfu" must contain only
> ASCII bytes, and always encodes a text blob in UTF-8 encoding.
>
> But it will never happen (see below).
>
>> - for the Content-type "multipart/form-data", each "part" MUST have
>> a Content-type header.  If this Content-type is a "text" type, then
>> the Content-type header SHOULD contain a charset attribute. If
>> omitted, by default this is "charset=UTF-8".
>>
>> and be done with it once and for all.
>
> Right: once and for all, for new clients who implement the spec. All
> old clients, servers, proxies, etc. be damned. It's just not
> possible due to the need to be backward-compatible with really weird
> stuff like "smart" toasters and refrigerators, WebTV (remember that?)
> and all manner of embedded devices that will never be updated.
>
> What we really need is a new header that says "here's everything you
> need to know about encoding for this request"

There is no need for a new header. The existing "Content-type" header is perfectly 
adequate in all cases. It is the fact that it is not being used properly and consistently 
that is the problem.

The backward-compatibility issue is also not a real one, as you mention yourself below.

> and clients and servers
> who both support that header can use it. All other uses need to
> fall-back to this old and nasty heuristic.
>

Indeed. And this would not be the first time, by far, that sloppy behaviour of clients is 
penalised by tighter interpretation of the rules by webservers.
But it probably falls upon webservers to initiate the movement.
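
(On the receiving side, a Tomcat webapp does not have to wait for the clients: a filter 
can opt in to a sane default. A minimal sketch, assuming it runs before anything reads 
the request parameters:

public void doFilter(ServletRequest request, ServletResponse response,
        FilterChain chain) throws IOException, ServletException {
    // only a fallback: an explicit charset attribute from the client still wins
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("UTF-8");
    }
    chain.doFilter(request, response);
}
)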


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Sanity Check

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

André,

On 11/18/16 3:50 AM, André Warnier (tomcat) wrote:
> On 18.11.2016 05:56, Christopher Schultz wrote:
>> Since UTF-8 is supposed to be the "official" character encoding,
> 
> Now where is that specified ?  As far as I know, the default
> charset for everything HTTP and HTML-wise is still iso-8859-1, no ?
> (and unfortunately so).

I apologize for the sloppy language: this particular vendor's service
claims that UTF-8 is the standard *for their service*. Not for HTTP in
general.

>> The vendor has responded with (paraphrasing) "it seems we don't 
>> completely follow this standard; we're considering what to do
>> next, which may include no change". This is a big vendor with
>> *lots* of software clients, so maintaining backward compatibility
>> is going to be a big deal for them. I've got some tricks up my
>> sleeve if they decide not to change anything. Hooray for specs.
>> :(
> 
> What I never understood in all that, is why browsers and other
> clients never seem to respect (and servers do not seem to enforce)
> what is indicated here :
> 
> https://www.ietf.org/rfc/rfc2388.txt 4.5 Charset of text in form
> data
> 
> This would be a simple way to get rid of umpteen character
> set/encoding issues encountered when trying to interpret <form>
> data POSTed to web applications.

The problem is that application/x-www-form-urlencoded doesn't give a
client a natural way to specify the character encoding, and a/xwfu can
be used inside of a multipart/form-data package as well. You've just
moved the problem from the Content-Type of the request to the
Content-Type of the *part* of the multi-part request. Nothing has been
solved by using multipart/form-data.

And browsers certainly DO use that, but almost exclusively for things
like file-upload, since files tend to be very big already, and
urlencoding a bunch of binary bytes makes the file size increase quite
a bit.

> It seems to me contrary to common sense that in our day and age,
> the rules for this could not be set once and for all to something
> like :
> 
> 1) the default character set/encoding of HTTP and HTML is
> Unicode/UTF-8 (instead of the current really archaic iso-8859-1) 2)
> URLs (including query-strings) should be by default interpreted as 
> Unicode/UTF-8, encoded as per
> https://tools.ietf.org/html/rfc3986#section-2 3) for POST requests
> : - for the Content-type "application/x-www-form-urlencoded",
> there SHOULD be a charset attribute indicating the charset and
> encoding. By default, this is "text/plain; charset=UTF-8"

Don't forget, charset == encoding. The text/plain is the MIME type,
and that's already been defined as application/x-www-form-urlencoded.
Somewhere it should just explicitly say "a/xwfu" must contain only
ASCII bytes, and always encodes a text blob in UTF-8 encoding.

But it will never happen (see below).

> - for the Content-type "multipart/form-data", each "part" MUST have
> a Content-type header.  If this Content-type is a "text" type, then
> the Content-type header SHOULD contain a charset attribute. If
> omitted, by default this is "charset=UTF-8".
> 
> and be done with it once and for all.

Right: once and for all, for new clients who implement the spec. All
old clients, servers, proxies, etc. be damned. It's just not
possible due to the need to be backward-compatible with really weird
stuff like "smart" toasters and refrigerators, WebTV (remember that?)
and all manner of embedded devices that will never be updated.

What we really need is a new header that says "here's everything you
need to know about encoding for this request" and clients and servers
who both support that header can use it. All other uses need to
fall-back to this old and nasty heuristic.

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBCAAGBQJYLyYAAAoJEBzwKT+lPKRYF90QAJNOyadgrG7DDyWLSuFfKkep
VAoc5yziddaHoTKpcExGrEB+LV5gJ35XR2Q+CiOCNoTR1O3oOJyflk2s8e+lqeZ9
2rqIlauOwwWC13dfwpcOENkeC3eyHn85d3NkuuFsqvqRl+Wuv4qvqRiv/kos723i
cKmgqbAE9zRjNxuIqym3J8m6BhwzJGN3HqtiUueTYphChW81V10hc8XElJEPDbAH
eGpdunp8eu4pbi36RZV5r2nZU2yHZVDd+HJnTFG4WJ/NvHODuJsR39fB+GANI0QJ
+OHS9b7Wpcl2eCPs8geVTSqe57vDBrhymFjIUorPuQeW0SxrwDJMdTJ4zYtqnY2B
fD7u9Lvo+RT/eskIcdFGVq5xUEBr2OIfx2XO2V7VlA52x+WJ421TLFRUQq67Un40
yDsPXEBHMVar2cyG2wOJsb/t6ndlCY30b1FPOD2zrg1XFxxzjaOCwUtZXqgX7sfu
H1Dalbg4S/8vPS5Yrd7ZHk4RgYr5GGMBcK01KC07Q/TrOFkw9ssqvfQTyl30jxZ/
/x74KMRAbJVsUuhJ0i8QLM0KqPMpJ9wP9jwQF4YFUFwTDp6xBa/FRVAXCmJQxKom
JFCky4YhVvOGVOK2iwDDQRJee1ahz0V+maJii1fSHVYMCrWrzGNZ6LMeuZAsovs0
ZjotO2X+XAPpLwczn6tI
=7oxR
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Sanity Check

Posted by "André Warnier (tomcat)" <aw...@ice-sa.com>.
On 18.11.2016 05:56, Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Konstantin,
>
> On 11/17/16 4:58 PM, Konstantin Kolinko wrote:
>> 2016-11-17 17:21 GMT+03:00 Christopher Schultz
>> <ch...@christopherschultz.net>:
>>> All,
>>>
>>> I've got a problem with a vendor and I'd like another opinion
>>> just to make sure I'm not crazy. The vendor and I have a
>>> difference of opinion about how a character should be encoded in
>>> an HTTP POST request.
>>>
>>> The vendor's API officially should accept requests in UTF-8
>>> encoding. We are using application/x-www-form-urlencoded content
>>> type.
>>>
>>> I'm trying to send a message with a non-ASCII character -- for
>>> example, a ® (that's (R), the registered trademark symbol).
>>>
>>> The Java code being used to package-up this POST looks something
>>> like this:
>>>
>>> OutputStream out = httpurlconnection.getOutputStream();
>>> out.print("notetext="); out.print(URLEncoder.encode("test®",
>>> "UTF-8")); out.close();
>>>
>>> So the POST payload ends up being notetext=test%C2%AE or, on the
>>> wire, the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43
>>> 32 25 41 45.
>>>
>>> The final bytes 25 43 32 25 41 45 are the characters % C 2 % A
>>> E.
>>>
>>> Can someone verify that I'm encoding everything correctly?
>>>
>>> The vendor is claiming that ® can be sent "directly" like one
>>> might do using curl:
>>>
>>> $ curl -d 'notetext=®' [url]
>>>
>>> and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae
>>> (note that c2 and ae are "bare" and not %-encoded).
>>
>> 1. That is a wrong way to use curl.  The manual says that the
>> argument to -d should be properly urlencoded. The above value is an
>> incorrect one.
>>
>> https://curl.haxx.se/docs/manual.html See "POST (HTTP)" and below.
>
> +1
>
> The curl manual says that -d is the same as --data-ascii, which is
> totally wrong here if they are accepting UTF-8.
>
>> 2. If you are submitting data programmatically, I wonder why you
>> are using simple "application/x-www-form-urlencoded".
>>
>> I think it would be better to use explicit charset argument in the
>> Content-Type value, as it is easy to do so with Java clients.
>
> Their API expects application/x-www-form-urlencoded. Everything else
> they do is in JSON... I have no idea why they don't accept JSON as
> input, but that's the deal.
>
> MIME types that aren't text/* aren't supposed to have Content-Type
> parameters.

Maybe more precisely : there SHOULD be a Content-type header; but a "charset" attribute 
only makes sense if the content type is, generally speaking, "text".
("text/plain" certainly qualifies; but one may argue about "text/html" and its variants, 
since these formats may have their own embedded charset indications)

>
>> 3. The application/x-www-form-urlencoded encoding was originally
>> specified in HTML specification.
>>
>> Current specification:
>> https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data
>>
>> It defers details to
>> https://url.spec.whatwg.org/#concept-urlencoded-serializer
>>
>> Historic, HTML 4.01:
>> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
>
> All true, but the spec argues with itself over the character encoding,
> and browsers make this worse with their stupid "I'll use whatever
> character encoding was used to load the page containing the form"
> behavior. With a software-client API, there basically is no spec.
>
> Their assertion is that their character encoding "is UTF-8". But it
> looks like they aren't doing it right.
>
>> My opinion is that the correct value on the wire is 25 43 32 25 41
>> 45 = % C 2 % A E.
>
> So, the same bytes as I had, right?
>
>> If a vendor accepts non-encoded "c2 ae": it technically may work
>> (in some versions of some software), but this is not a standard
>> feature and one had better not rely on it.
>>
>> Technically, if non-encoded bytes ("c2 ae") are accepted, they
>> won't be confused with special characters ("=", "&", "+", "%",
>> CRLF), as all multi-byte UTF-8 characters have the 0x80 bit set.
>
> Their non-%-encoded bytes could be considered legitimate, because the
> application/x-www-form-urlencoded rules say that any character "in the
> character set of the request" can be dropped-into the request without
> being %-encoded. But then we are back to the problem of not knowing
> what the encoding of the request is.
>
> Since UTF-8 is supposed to be the "official" character encoding,

Now where is that specified ?  As far as I know, the default charset for everything HTTP 
and HTML-wise is still iso-8859-1, no ? (and unfortunately so).

> I
> would expect that a properly-encoded request would contain nothing but
> valid ASCII characters, which means that 0xc2 0xae need to be
> %-encoded to become "%c2%ae".
>
>> 4. Your code fragment is broken and won't compile: there are no
>> "print" methods in java.io.OutputStream.
>>
>> OutputStream works with byte[] and the method name is "write".
>
> Yes, it was hastily-typed from memory. The true code compiles and runs
> as expected.
>
>> 5. Wikipedia:
>> https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type
>>
>>   Wikipedia mentions XForms spec, ->
>> https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode
>
> Thanks for the XForms reference... it's nice that it has a real
> example (including a non-ASCII character) instead of the usual trivial
> examples in the HTTP and HTML specs, for instance.
>
>> 6. You can test with real browsers.
>
> I will certainly be doing that.
>
> https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode
>
> The vendor has responded with (paraphrasing) "it seems we don't
> completely follow this standard; we're considering what to do next,
> which may include no change". This is a big vendor with *lots* of
> software clients, so maintaining backward compatibility is going to be
> a big deal for them. I've got some tricks up my sleeve if they decide
> not to change anything. Hooray for specs. :(
>

What I never understood in all that, is why browsers and other clients never seem to 
respect (and servers do not seem to enforce) what is indicated here :

https://www.ietf.org/rfc/rfc2388.txt
4.5 Charset of text in form data

This would be a simple way to get rid of umpteen character set/encoding issues encountered 
when trying to interpret <form> data POSTed to web applications.

It seems to me contrary to common sense that in our day and age, the rules for this could 
not be set once and for all to something like :

1) the default character set/encoding of HTTP and HTML is Unicode/UTF-8
    (instead of the current really archaic iso-8859-1)
2) URLs (including query-strings) should be by default interpreted as Unicode/UTF-8, 
encoded as per https://tools.ietf.org/html/rfc3986#section-2
3) for POST requests :
    - for the Content-type "application/x-www-form-urlencoded", there SHOULD be a charset 
attribute indicating the charset and encoding. By default, this is "text/plain; charset=UTF-8"
    - for the Content-type "multipart/form-data", each "part" MUST have a Content-type 
header.  If this Content-type is a "text" type, then the Content-type header SHOULD 
contain a charset attribute. If omitted, by default this is "charset=UTF-8".

and be done with it once and for all.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Sanity Check

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Konstantin,

On 11/17/16 4:58 PM, Konstantin Kolinko wrote:
> 2016-11-17 17:21 GMT+03:00 Christopher Schultz
> <ch...@christopherschultz.net>:
>> All,
>> 
>> I've got a problem with a vendor and I'd like another opinion
>> just to make sure I'm not crazy. The vendor and I have a
>> difference of opinion about how a character should be encoded in
>> an HTTP POST request.
>> 
>> The vendor's API officially should accept requests in UTF-8
>> encoding. We are using application/x-www-form-urlencoded content
>> type.
>> 
>> I'm trying to send a message with a non-ASCII character -- for 
>> example, a ® (that's (R), the registered trademark symbol).
>> 
>> The Java code being used to package-up this POST looks something
>> like this:
>> 
>> OutputStream out = httpurlconnection.getOutputStream(); 
>> out.print("notetext="); out.print(URLEncoder.encode("test®",
>> "UTF-8")); out.close();
>> 
>> So the POST payload ends up being notetext=test%C2%AE or, on the
>> wire, the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43
>> 32 25 41 45.
>> 
>> The final bytes 25 43 32 25 41 45 are the characters % C 2 % A
>> E.
>> 
>> Can someone verify that I'm encoding everything correctly?
>> 
>> The vendor is claiming that ® can be sent "directly" like one
>> might do using curl:
>> 
>> $ curl -d 'notetext=®' [url]
>> 
>> and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae
>> (note that c2 and ae are "bare" and not %-encoded).
> 
> 1. That is a wrong way to use curl.  The manual says that the
> argument to -d should be properly urlencoded. The above value is an
> incorrect one.
> 
> https://curl.haxx.se/docs/manual.html See "POST (HTTP)" and below.

+1

The curl manual says that -d is the same as --data-ascii, which is
totally wrong here if they are accepting UTF-8.

> 2. If you are submitting data programmatically, I wonder why you
> are using simple "application/x-www-form-urlencoded".
> 
> I think it would be better to use explicit charset argument in the 
> Content-Type value, as it is easy to do so with Java clients.

Their API expects application/x-www-form-urlencoded. Everything else
they do is in JSON... I have no idea why they don't accept JSON as
input, but that's the deal.

MIME types that aren't text/* aren't supposed to have Content-Type
parameters.

> 3. The application/x-www-form-urlencoded encoding was originally 
> specified in HTML specification.
> 
> Current specification: 
> https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data
> 
> It defers details to 
> https://url.spec.whatwg.org/#concept-urlencoded-serializer
> 
> Historic, HTML 4.01: 
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

All true, but the spec argues with itself over the character encoding,
and browsers make this worse with their stupid "I'll use whatever
character encoding was used to load the page containing the form"
behavior. With a software-client API, there basically is no spec.

Their assertion is that their character encoding "is UTF-8". But it
looks like they aren't doing it right.

> My opinion is that the correct value on the wire is 25 43 32 25 41
> 45 = % C 2 % A E.

So, the same bytes as I had, right?

> If a vendor accepts non-encoded "c2 ae": it technically may work
> (in some versions of some software), but this is not a standard
> feature and one had better not rely on it.
> 
> Technically, if non-encoded bytes ("c2 ae") are accepted, they
> won't be confused with special characters ("=", "&", "+", "%",
> CRLF), as all multi-byte UTF-8 characters have the 0x80 bit set.

Their non-%-encoded bytes could be considered legitimate, because the
application/x-www-form-urlencoded rules say that any character "in the
character set of the request" can be dropped-into the request without
being %-encoded. But then we are back to the problem of not knowing
what the encoding of the request is.

Since UTF-8 is supposed to be the "official" character encoding, I
would expect that a properly-encoded request would contain nothing but
valid ASCII characters, which means that 0xc2 0xae need to be
%-encoded to become "%c2%ae".

> 4. Your code fragment is broken and won't compile: there are no 
> "print" methods in java.io.OutputStream.
> 
> OutputStream works with byte[] and the method name is "write".

Yes, it was hastily-typed from memory. The true code compiles and runs
as expected.

> 5. Wikipedia: 
> https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type
>
>  Wikipedia mentions XForms spec, ->
> https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

Thanks for the XForms reference... it's nice that it has a real
example (including a non-ASCII character) instead of the usual trivial
examples in the HTTP and HTML specs, for instance.

> 6. You can test with real browsers.

I will certainly be doing that.

https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

The vendor has responded with (paraphrasing) "it seems we don't
completely follow this standard; we're considering what to do next,
which may include no change". This is a big vendor with *lots* of
software clients, so maintaining backward compatibility is going to be
a big deal for them. I've got some tricks up my sleeve if they decide
not to change anything. Hooray for specs. :(

Thanks,
- -chris
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBCAAGBQJYLooBAAoJEBzwKT+lPKRYnJ0P/1rWnTVK2fCTgTvdXCWwJk1j
fU36e2FBoEf+DEB7CuIGD0Yxoegkd09oMD5O7oKeK9Z0c8O9UTJbiF1hK2FXtFxM
gTA+PJNMlqglYvKOecdp9x7xmuNB1MBhZDTuqob16qBBD4ujChvns2SnANrDxdO8
zsZBTivT/LJxKnH2Q4tEe65trFjreplCHq1RnAkEYcDjQ85FkjE3+Msc9Wc3TUSX
4FAbeRjdKRn2NzzjYUeZdjKQ/aP+VeCHnWjvhVTZuY8H7fTMOq/Z6IbT3SqB1Pnt
endFVkV0czn2LbvK2F6Y6Mg0swwbKuw0nUnidvAtaxUQE3qobRehP0Anv4mdJlH9
yMS8EunQZqhgTNRzzVF6wsleEG6DciJBJQaCeWo9/964x2Y7+k9sf8lE0/jpDgdH
H++HtFny+FE8QzNp5tmq/g3ai1ivIGWCzZl7KaPLI2rpXH0W6gbXeDlwpBhHjkEn
IPgNBVnb+CCDAvbzogvi6Bv79Dr2WqYE9fdoQfH+X0q1i+LY6mkaHzZKCr7B7vWi
Vk3FXmVoz5P8YyT1AZg9bGWkKRhuMJcd+yFm2Xtc/KE+5N48Swt3B2isrAZ9jSdS
pUVc6tIAxLuoxXp9tP/RVyNWrVAu6iPPwLuSg4vgAp38+wl5ohAIjRd9dZEBOkM1
lm1cJrg8T8Xim39Z54Du
=i5zQ
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Sanity Check

Posted by Konstantin Kolinko <kn...@gmail.com>.
 2016-11-17 17:21 GMT+03:00 Christopher Schultz <ch...@christopherschultz.net>:
> All,
>
> I've got a problem with a vendor and I'd like another opinion just to
> make sure I'm not crazy. The vendor and I have a difference of opinion
> about how a character should be encoded in an HTTP POST request.
>
> The vendor's API officially should accept requests in UTF-8 encoding.
> We are using application/x-www-form-urlencoded content type.
>
> I'm trying to send a message with a non-ASCII character -- for
> example, a ® (that's (R), the registered trademark symbol).
>
> The Java code being used to package-up this POST looks something like
> this:
>
> OutputStream out = httpurlconnection.getOutputStream();
> out.print("notetext=");
> out.print(URLEncoder.encode("test®", "UTF-8"));
> out.close();
>
> So the POST payload ends up being notetext=test%C2%AE or, on the wire,
> the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43 32 25 41 45.
>
> The final bytes 25 43 32 25 41 45 are the characters % C 2 % A E.
>
> Can someone verify that I'm encoding everything correctly?
>
> The vendor is claiming that ® can be sent "directly" like one might do
> using curl:
>
> $ curl -d 'notetext=®' [url]
>
> and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae (note
> that c2 and ae are "bare" and not %-encoded).

1. That is a wrong way to use curl.  The manual says that the argument
to -d should be properly urlencoded. The above value is an incorrect
one.

https://curl.haxx.se/docs/manual.html
See "POST (HTTP)" and below.

2. If you are submitting data programmatically, I wonder why you are
using simple "application/x-www-form-urlencoded".

I think it would be better to use explicit charset argument in the
Content-Type value, as it is easy to do so with Java clients.

3. The application/x-www-form-urlencoded encoding was originally
specified in HTML specification.

Current specification:
https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data

It defers details to
https://url.spec.whatwg.org/#concept-urlencoded-serializer


Historic, HTML 4.01:
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1


My opinion is that the correct value on the wire is
25 43 32 25 41 45 = % C 2 % A E.


If a vendor accepts non-encoded "c2 ae":
it technically may work (in some versions of some software), but this
is not a standard feature and one had better not rely on it.

Technically, if non-encoded bytes ("c2 ae") are accepted, they won't
be confused with special characters ("=", "&", "+", "%", CRLF), as all
multi-byte UTF-8 characters have the 0x80 bit set.


4. Your code fragment is broken and won't compile: there are no
"print" methods in java.io.OutputStream.

OutputStream works with byte[] and the method name is "write".
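
(Something like the following should compile; a sketch only, keeping the names from the 
original fragment:

OutputStream out = httpurlconnection.getOutputStream();
out.write(("notetext=" + URLEncoder.encode("test®", "UTF-8"))
        .getBytes("US-ASCII")); // the URL-encoded form is pure ASCII
out.close();
)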


5. Wikipedia:
https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type

Wikipedia mentions XForms spec,
-> https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

6. You can test with real browsers.

Best regards,
Konstantin Kolinko

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org