Posted to dev@tomcat.apache.org by Mark Thomas <ma...@apache.org> on 2014/08/26 21:53:57 UTC

RFC6265, cookie parsing and UTF-8

One of the aims of the proposed cookie changes [1] was to deal with the
HTML 5 changes that mean UTF-8 can appear in cookie headers.

This has some potentially large implications for Tomcat.

Currently, Tomcat handles cookies as MessageBytes, processing everything
in bytes and only converting to String when necessary. This is largely
possible because of the assumption that everything is ASCII.

Introduce UTF-8 and processing everything in bytes gets a whole lot
harder. You essentially have to decode to UTF-8 to ensure that you have
valid data - at which point why not just use Strings anyway?

I am currently leaning towards removing a lot of the current cookie
header caching/recycling and doing something along the following lines:
- Lazy parsing as currently (but unless cookie based session tracking is
  disabled this is going to run on every request)
- Convert headers to UTF-8 strings
- Parse them with a new parser along the lines of o.a.t.u.http.parser
- Have that parser return an array of javax.servlet.http.Cookie objects
- Pass those to the app if/when requested
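
The steps above can be sketched roughly as follows. This is only an
illustration under simplifying assumptions: SimpleCookie and
Utf8CookieParseSketch are hypothetical names (a stand-in for
javax.servlet.http.Cookie and for whatever the new parser ends up being),
and the header is assumed to be a plain "name=value; name=value" list with
no quoting or validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for javax.servlet.http.Cookie so the sketch is self-contained.
final class SimpleCookie {
    final String name;
    final String value;

    SimpleCookie(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

final class Utf8CookieParseSketch {

    // Decode the raw header bytes as UTF-8, then split into name=value pairs.
    static SimpleCookie[] parse(byte[] headerBytes) {
        String header = new String(headerBytes, StandardCharsets.UTF_8);
        List<SimpleCookie> cookies = new ArrayList<>();
        for (String pair : header.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // no '=' at all; a real parser would be stricter
            }
            String name = pair.substring(0, eq).trim();
            if (name.isEmpty()) {
                continue; // skip pairs with an empty name
            }
            cookies.add(new SimpleCookie(name, pair.substring(eq + 1).trim()));
        }
        return cookies.toArray(new SimpleCookie[0]);
    }

    public static void main(String[] args) {
        byte[] raw = "JSESSIONID=ABC123; city=Z\u00fcrich".getBytes(StandardCharsets.UTF_8);
        for (SimpleCookie c : parse(raw)) {
            System.out.println(c.name + " -> " + c.value);
        }
    }
}
```

Note that the whole header is turned into a String before any parsing
happens, which is exactly the cost the rest of this thread is concerned
about.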

In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
share as much code as possible and switch between them based on the
cookie header with the expectation that 99.9% of cookies will be parsed
by the RFC6265 parser. We could add some options to this switching to
enable other parsers (e.g. a Netscape parser) to be used.
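
One possible way to do the switching, sketched under an assumption that
is not settled in this mail: RFC 2109 puts a $Version attribute at the
start of the Cookie header, so its presence is used as the trigger. The
class, enum and method names are hypothetical.

```java
import java.util.Locale;

final class CookieParserSwitchSketch {

    // Hypothetical parser kinds; only the idea of two parsers comes from the proposal.
    enum ParserKind { RFC6265, RFC2109 }

    // Assumption: a header that starts with $Version is treated as RFC 2109,
    // everything else is handed to the RFC 6265 parser.
    static ParserKind select(String cookieHeader) {
        String trimmed = cookieHeader.trim().toLowerCase(Locale.ENGLISH);
        return trimmed.startsWith("$version") ? ParserKind.RFC2109 : ParserKind.RFC6265;
    }

    public static void main(String[] args) {
        System.out.println(select("$Version=\"1\"; Customer=\"WILE_E_COYOTE\"; $Path=\"/acme\""));
        System.out.println(select("JSESSIONID=ABC123; theme=dark"));
    }
}
```

Further parser kinds (e.g. Netscape) would be additional enum values
selected by configuration rather than by inspecting the header.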

I'd also like to keep the current cookie parsing implementation for now.
Until we are happy with the new parsing, the current implementation will
be the default. Once we are happy with the new parsing we can change the
default. We can add an option to switch between the current and the new
parsing.

Thoughts?


Mark


[1] https://wiki.apache.org/tomcat/Cookies

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org

Re: RFC6265, cookie parsing and UTF-8

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Mark,

On 8/26/14, 3:53 PM, Mark Thomas wrote:
> One of the aims of the proposed cookie changes [1] was to deal with the
> HTML 5 changes that mean UTF-8 can appear in cookie headers.
> 
> This has some potentially large implications for Tomcat.
> 
> Currently, Tomcat handles cookies as MessageBytes, processing everything
> in bytes and only converting to String when necessary. This is largely
> possible because of the assumption that everything is ASCII.
> 
> Introduce UTF-8 and processing everything in bytes gets a whole lot
> harder. You essentially have to decode to UTF-8 to ensure that you have
> valid data - at which point why not just use Strings anyway?

I've always wondered why we bothered backing everything with
MessageBytes when the APIs are all String-bound anyway.

> I am currently leaning towards removing a lot of the current cookie
> header caching/recycling and doing something along the following lines:
> - Lazy parsing as currently (but unless cookie based session tracking is
>   disabled this is going to run on every request)
> - Convert headers to UTF-8 strings
> - Parse them with a new parser along the lines of o.a.t.u.http.parser
> - Have that parser return an array of javax.servlet.http.Cookie objects
> - Pass those to the app if/when requested
> 
> In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
> share as much code as possible and switch between them based on the
> cookie header with the expectation that 99.9% of cookies will be parsed
> by the RFC6265 parser. We could add some options to this switching to
> enable other parsers (e.g. a Netscape parser) to be used.
> 
> I'd also like to keep the current cookie parsing implementation for now.
> Until we are happy with the new parsing, the current implementation will
> be the default. Once we are happy with the new parsing we can change the
> default. We can add an option to switch between the current and the new
> parsing.
> 
> Thoughts?

+1 to everything above

-chris

Re: RFC6265, cookie parsing and UTF-8

Posted by Mark Thomas <ma...@apache.org>.

On 26/08/2014 23:09, Rémy Maucherat wrote:
> 2014-08-26 21:53 GMT+02:00 Mark Thomas <ma...@apache.org>:
> 
>> One of the aims of the proposed cookie changes [1] was to deal with the
>> HTML 5 changes that mean UTF-8 can appear in cookie headers.
>>
>> This has some potentially large implications for Tomcat.
>>
>> Currently, Tomcat handles cookies as MessageBytes, processing everything
>> in bytes and only converting to String when necessary. This is largely
>> possible because of the assumption that everything is ASCII.
>>
>> Introduce UTF-8 and processing everything in bytes gets a whole lot
>> harder. You essentially have to decode to UTF-8 to ensure that you have
>> valid data - at which point why not just use Strings anyway?
>>
>> I am currently leaning towards removing a lot of the current cookie
>> header caching/recycling and doing something along the following lines:
>> - Lazy parsing as currently (but unless cookie based session tracking is
>>   disabled this is going to run on every request)
>> - Convert headers to UTF-8 strings
>> - Parse them with a new parser along the lines of o.a.t.u.http.parser
>> - Have that parser return an array of javax.servlet.http.Cookie objects
>> - Pass those to the app if/when requested
>>
>> In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
>> share as much code as possible and switch between them based on the
>> cookie header with the expectation that 99.9% of cookies will be parsed
>> by the RFC6265 parser. We could add some options to this switching to
>> enable other parsers (e.g. a Netscape parser) to be used.
>>
>> I'd also like to keep the current cookie parsing implementation for now.
>> Until we are happy with the new parsing, the current implementation will
>> be the default. Once we are happy with the new parsing we can change the
>> default. We can add an option to switch between the current and the new
>> parsing.
>>
>> Thoughts?
>>
> 
> As far as I am concerned, this could turn out badly.

I agree. I remember the last time I made changes to the cookie parsing
to improve spec compliance as a result of some security issues. It broke
a lot of stuff and the fallout lasted for months. I don't want to
repeat that.

> String manipulation is
> consistently the slowest thing overall other than IO, and rather often
> webapps use a massive amount of cookies [to the point they get errors
> because the HTTP header size is too small by default].

I agree the new code is going to have to keep a careful eye on performance.

> So the current processing should probably be the default [as proposed],
> then remain an option until it can be demonstrated this is not slower
> [which IMO is not possible, so it would have to remain].

The problem is that the current approach simply can't work for UTF-8
cookie values. I intend to start with some performance tests so we can
see what the difference really is. I'm expecting that we will need to
trade a little performance to be able to handle UTF-8. Whether or not
that trade is a reasonable one will depend on the performance figures. I
suggest we hold off on that debate until we have some hard numbers to
work with.

Mark


Re: RFC6265, cookie parsing and UTF-8

Posted by Rémy Maucherat <re...@apache.org>.

2014-08-26 21:53 GMT+02:00 Mark Thomas <ma...@apache.org>:

> One of the aims of the proposed cookie changes [1] was to deal with the
> HTML 5 changes that mean UTF-8 can appear in cookie headers.
>
> This has some potentially large implications for Tomcat.
>
> Currently, Tomcat handles cookies as MessageBytes, processing everything
> in bytes and only converting to String when necessary. This is largely
> possible because of the assumption that everything is ASCII.
>
> Introduce UTF-8 and processing everything in bytes gets a whole lot
> harder. You essentially have to decode to UTF-8 to ensure that you have
> valid data - at which point why not just use Strings anyway?
>
> I am currently leaning towards removing a lot of the current cookie
> header caching/recycling and doing something along the following lines:
> - Lazy parsing as currently (but unless cookie based session tracking is
>   disabled this is going to run on every request)
> - Convert headers to UTF-8 strings
> - Parse them with a new parser along the lines of o.a.t.u.http.parser
> - Have that parser return an array of javax.servlet.http.Cookie objects
> - Pass those to the app if/when requested
>
> In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
> share as much code as possible and switch between them based on the
> cookie header with the expectation that 99.9% of cookies will be parsed
> by the RFC6265 parser. We could add some options to this switching to
> enable other parsers (e.g. a Netscape parser) to be used.
>
> I'd also like to keep the current cookie parsing implementation for now.
> Until we are happy with the new parsing, the current implementation will
> be the default. Once we are happy with the new parsing we can change the
> default. We can add an option to switch between the current and the new
> parsing.
>
> Thoughts?
>

As far as I am concerned, this could turn out badly. String manipulation is
consistently the slowest thing overall other than IO, and rather often
webapps use a massive amount of cookies [to the point they get errors
because the HTTP header size is too small by default].

So the current processing should probably be the default [as proposed],
then remain an option until it can be demonstrated this is not slower
[which IMO is not possible, so it would have to remain].

Rémy

Re: RFC6265, cookie parsing and UTF-8

Posted by Mark Thomas <ma...@apache.org>.

On 27/08/2014 10:58, Mark Thomas wrote:
> On 27/08/2014 10:38, Konstantin Kolinko wrote:
>> 2014-08-27 13:29 GMT+04:00 Mark Thomas <ma...@apache.org>:
>>>>
>>>
>>> Bad news: The issue is that if there is a chance of UTF-8 in the header
>>> then you can't simply split the header into individual cookies based on
>>> the separator byte since you can't tell (without decoding to characters)
>>> if a byte represents the separator or is part of a sequence of several
>>> bytes representing some other character.
>>>
>>
>> You can. All separator bytes are 7-bit US-ASCII.
>>
>> BTW, There is also a feature in UTF-8 that you can split it into
>> characters without actually decoding them.
>>
>> I mean "Character boundaries are easily found from anywhere in an
>> octet stream." as said in "1. Introduction" of
>> http://tools.ietf.org/html/rfc3629
> 
> Doh. Thanks for the correction. That gives us rather more options (if we
> want/need them).
> 
> I had in the back of my mind an old UTF-8 related security issue where
> multi-byte characters were being incorrectly processed and the remaining
> bytes were incorrectly being treated as single-byte characters in the range
> 0-127. I need to re-read through that issue to remind myself exactly
> what was going on as with UTF-8 that simply should not be possible.

For the record it was CVE-2008-2938 and what was happening was that a
character that should have been encoded in 1 byte was encoded in
multiple bytes (so the checks for that character didn't see it) and the
UTF-8 decoder at the time failed to reject it as it was required to do
by the spec.
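
To illustrate the kind of input involved, here is a small sketch using
the JDK's UTF-8 decoder configured to report errors. The byte pair
0xC0 0xAF (an over-long two-byte form of '/') is chosen purely as an
example of an over-long encoding; the class and method names are
hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class OverlongUtf8Sketch {

    // Returns true only if the bytes are well-formed UTF-8. The decoder is set to
    // REPORT errors rather than silently replacing malformed input.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] plainSlash = { 0x2F };                        // '/' in its 1-byte form
        byte[] overlongSlash = { (byte) 0xC0, (byte) 0xAF }; // '/' over-encoded in 2 bytes
        System.out.println(isWellFormedUtf8(plainSlash));
        System.out.println(isWellFormedUtf8(overlongSlash));
    }
}
```

A check for '/' on the raw bytes never sees 0x2F in the second input,
which is why the decoder has to be the one to refuse it.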

Mark


> On a related topic... Since ISO-8859-1 is valid for use in a cookie
> value (BZ 55917) we are going to have to provide an option somewhere to
> select the encoding to use to decode cookie values.
> 
> Mark
> 
> 



Re: RFC6265, cookie parsing and UTF-8

Posted by Mark Thomas <ma...@apache.org>.

On 27/08/2014 10:38, Konstantin Kolinko wrote:
> 2014-08-27 13:29 GMT+04:00 Mark Thomas <ma...@apache.org>:
>>>
>>
>> Bad news: The issue is that if there is a chance of UTF-8 in the header
>> then you can't simply split the header into individual cookies based on
>> the separator byte since you can't tell (without decoding to characters)
>> if a byte represents the separator or is part of a sequence of several
>> bytes representing some other character.
>>
> 
> You can. All separator bytes are 7-bit US-ASCII.
> 
> BTW, There is also a feature in UTF-8 that you can split it into
> characters without actually decoding them.
> 
> I mean "Character boundaries are easily found from anywhere in an
> octet stream." as said in "1. Introduction" of
> http://tools.ietf.org/html/rfc3629

Doh. Thanks for the correction. That gives us rather more options (if we
want/need them).

I had in the back of my mind an old UTF-8 related security issue where
multi-byte characters were being incorrectly processed and the remaining
bytes were incorrectly being treated as single-byte characters in the range
0-127. I need to re-read through that issue to remind myself exactly
what was going on as with UTF-8 that simply should not be possible.

On a related topic... Since ISO-8859-1 is valid for use in a cookie
value (BZ 55917) we are going to have to provide an option somewhere to
select the encoding to use to decode cookie values.
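
A minimal sketch of why such an option is needed, assuming the option
boils down to a configured java.nio.charset.Charset (the class and
method names are hypothetical, and where the option would live is not
decided here):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CookieCharsetOptionSketch {

    // Hypothetical option: the charset used to turn cookie value bytes into a String.
    static String decodeValue(byte[] valueBytes, Charset configured) {
        return new String(valueBytes, configured);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, once as ISO-8859-1 bytes and once as UTF-8 bytes.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };
        byte[] utf8 = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };
        System.out.println(decodeValue(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(decodeValue(utf8, StandardCharsets.UTF_8));
        // Decoding with the wrong charset yields a different String, hence the option.
        System.out.println(decodeValue(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

The bytes alone do not say which encoding the client used, so the
choice has to come from configuration.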

Mark


Re: RFC6265, cookie parsing and UTF-8

Posted by Konstantin Kolinko <kn...@gmail.com>.

2014-08-27 13:29 GMT+04:00 Mark Thomas <ma...@apache.org>:
>>
>
> Bad news: The issue is that if there is a chance of UTF-8 in the header
> then you can't simply split the header into individual cookies based on
> the separator byte since you can't tell (without decoding to characters)
> if a byte represents the separator or is part of a sequence of several
> bytes representing some other character.
>

You can. All separator bytes are 7-bit US-ASCII.

BTW, There is also a feature in UTF-8 that you can split it into
characters without actually decoding them.

I mean "Character boundaries are easily found from anywhere in an
octet stream." as said in "1. Introduction" of
http://tools.ietf.org/html/rfc3629
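
A small sketch of both points, assuming the input is well-formed UTF-8
and using ';' as the separator. The class and method names are
hypothetical and this is not Tomcat code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8ByteSplitSketch {

    // Split raw header bytes on ';' (0x3B) without decoding. In UTF-8 every byte
    // of a multi-byte sequence is >= 0x80, so a 0x3B byte is always a real ';'.
    static List<byte[]> splitOnSemicolon(byte[] header) {
        List<byte[]> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= header.length; i++) {
            if (i == header.length || header[i] == ';') {
                byte[] part = new byte[i - start];
                System.arraycopy(header, start, part, 0, part.length);
                parts.add(part);
                start = i + 1;
            }
        }
        return parts;
    }

    // Count code points without decoding: continuation bytes look like 10xxxxxx,
    // so every byte that is not a continuation byte starts a new character.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] header = "a=Z\u00fcrich;b=\u65e5\u672c".getBytes(StandardCharsets.UTF_8);
        for (byte[] part : splitOnSemicolon(header)) {
            System.out.println(new String(part, StandardCharsets.UTF_8)
                    + " (" + countCodePoints(part) + " code points)");
        }
    }
}
```

The decoding in main is only there for printing; the split and the
count both work on bytes alone.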

Best regards,
Konstantin Kolinko


Re: RFC6265, cookie parsing and UTF-8

Posted by Mark Thomas <ma...@apache.org>.

On 26/08/2014 22:52, Filip Hanik wrote:
> On Tue, Aug 26, 2014 at 12:53 PM, Mark Thomas <ma...@apache.org> wrote:
> 
>> One of the aims of the proposed cookie changes [1] was to deal with the
>> HTML 5 changes that mean UTF-8 can appear in cookie headers.
>>
>> This has some potentially large implications for Tomcat.
>>
> 
> Since we already are in the 8.0.x release cycle, I, as an end user/system
> administrator, would expect parsing would remain 100% backwards compatible
> for version 8.0.x+n (n=1...)

+1

>> Currently, Tomcat handles cookies as MessageBytes, processing everything
>> in bytes and only converting to String when necessary. This is largely
>> possible because of the assumption that everything is ASCII.
>>
>> Introduce UTF-8 and processing everything in bytes gets a whole lot
>> harder. You essentially have to decode to UTF-8 to ensure that you have
>> valid data - at which point why not just use Strings anyway?
>>
>> I am currently leaning towards removing a lot of the current cookie
>> header caching/recycling and doing something along the following lines:
>>
> 
> all that caching/recycling is to avoid GC cycles and was in the past a
> crucial performance optimization.
> back in those days, with the hardware that was available in 06-07, we were
> pushing a single Tomcat instance to 60k requests per second.
> creating new objects was painfully expensive at that rate.

I've done some work on reducing GC when Tomcat was being hammered with
large numbers of requests fairly recently so I agree this is an issue we
still need to keep an eye on.

>> - Lazy parsing as currently (but unless cookie based session tracking is
>>   disabled this is going to run on every request)
>>
> 
> but our cookie, JSESSIONID, doesn't have to be UTF-8, does it?
> this goes hand in hand with the SessionIdGenerator that Rainer just did,
> can that return UTF-8 values?
> So the lazy part can apply to all other cookies, meaning, don't parse it
> until the app requests it, just store the bytes and move on.

Good news: I don't believe the session IDs are UTF-8.

Bad news: The issue is that if there is a chance of UTF-8 in the header
then you can't simply split the header into individual cookies based on
the separator byte since you can't tell (without decoding to characters)
if a byte represents the separator or is part of a sequence of several
bytes representing some other character.

Aside: I think putting UTF-8 into HTTP headers is a crazy idea but that
ship has sailed and we have to deal with it.

>> - Convert headers to UTF-8 strings
>> - Parse them with a new parser along the lines of o.a.t.u.http.parser
>> - Have that parser return an array of javax.servlet.http.Cookie objects
>> - Pass those to the app if/when requested
>>
>> In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
>> share as much code as possible and switch between them based on the
>> cookie header with the expectation that 99.9% of cookies will be parsed
>> by the RFC6265 parser. We could add some options to this switching to
>> enable other parsers (e.g. a Netscape parser) to be used.
>>
> 
> I like the idea of swappable parsers, with the default being the exact
> behavior you see now. I can see changing the default after some
> stabilization.

Same here.


>> I'd also like to keep the current cookie parsing implementation for now.
>> Until we are happy with the new parsing, the current implementation will
>> be the default. Once we are happy with the new parsing we can change the
>> default. We can add an option to switch between the current and the new
>> parsing.
>>
>> Thoughts?
>>
> 
> knock it out.

That is the plan :)

Mark



Re: RFC6265, cookie parsing and UTF-8

Posted by Rainer Jung <ra...@kippdata.de>.

Am 26.08.2014 um 23:52 schrieb Filip Hanik:

> but our cookie, JSESSIONID, doesn't have to be UTF-8, does it?
> this goes hand in hand with the SessionIdGenerator that Rainer just did,
> can that return UTF-8 values?

We currently only bundle one impl of that and that impl hasn't changed, 
so it still uses random bytes encoded in hex digits.

But: as we know it appends the jvmRoute if set. A user could try to set
that to a UTF-8 value. But I guess it is extremely unlikely because the
jvmRoute is often also used in other legacy config files which don't
support UTF-8.
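
As a rough illustration of the above (not the bundled implementation):
random bytes rendered as hex digits, with the route appended. The class
name, the 16-byte length and the "." separator are assumptions made for
the sketch.

```java
import java.security.SecureRandom;

final class HexSessionIdSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();
    private static final SecureRandom RANDOM = new SecureRandom();

    // Random bytes rendered as hex digits, with an optional route suffix.
    // The "." separator is an assumption for illustration.
    static String generate(int byteLength, String jvmRoute) {
        byte[] bytes = new byte[byteLength];
        RANDOM.nextBytes(bytes);
        StringBuilder sb = new StringBuilder(byteLength * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        if (jvmRoute != null && !jvmRoute.isEmpty()) {
            sb.append('.').append(jvmRoute);
        }
        return sb.toString();
    }

    // True if every character is 7-bit ASCII.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String plain = generate(16, "node1");
        String odd = generate(16, "n\u00f6de");
        System.out.println(plain + " ascii=" + isAscii(plain));
        System.out.println(odd + " ascii=" + isAscii(odd));
    }
}
```

The hex part can never leave ASCII; only the route suffix can, which
is the one case described above.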

A custom implementation of SessionIdGenerator currently would be free to 
return any string it likes. We can still change the API or docs though, 
it hasn't yet had any release.

I personally would find it bad practice to generate session IDs with 
non-ASCII characters or even characters from the reserved set because 
the correct handling of that in all cases (cookie, uri encoded; load 
balancers, proxies etc.) would be unnecessarily fragile. Should I add 
something along those lines to the SessionIdGenerator docs?

Regards,

Rainer




Re: RFC6265, cookie parsing and UTF-8

Posted by Filip Hanik <fi...@hanik.com>.

On Tue, Aug 26, 2014 at 12:53 PM, Mark Thomas <ma...@apache.org> wrote:

> One of the aims of the proposed cookie changes [1] was to deal with the
> HTML 5 changes that mean UTF-8 can appear in cookie headers.
>
> This has some potentially large implications for Tomcat.
>

Since we already are in the 8.0.x release cycle, I, as an end user/system
administrator, would expect parsing would remain 100% backwards compatible
for version 8.0.x+n (n=1...)



>
> Currently, Tomcat handles cookies as MessageBytes, processing everything
> in bytes and only converting to String when necessary. This is largely
> possible because of the assumption that everything is ASCII.
>
> Introduce UTF-8 and processing everything in bytes gets a whole lot
> harder. You essentially have to decode to UTF-8 to ensure that you have
> valid data - at which point why not just use Strings anyway?
>
> I am currently leaning towards removing a lot of the current cookie
> header caching/recycling and doing something along the following lines:
>

all that caching/recycling is to avoid GC cycles and was in the past a
crucial performance optimization.
back in those days, with the hardware that was available in 06-07, we were
pushing a single Tomcat instance to 60k requests per second.
creating new objects was painfully expensive at that rate.
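
For anyone not familiar with the pattern, a generic sketch of what
recycling means here. This is not the MessageBytes code; the class name
and the 64-byte starting size are made up for illustration.

```java
final class RecyclableBufferSketch {

    private byte[] buf = new byte[64];
    private int length;

    // Copy new content into the existing array, growing it only when needed.
    void set(byte[] src) {
        if (src.length > buf.length) {
            buf = new byte[src.length];
        }
        System.arraycopy(src, 0, buf, 0, src.length);
        length = src.length;
    }

    // Reset for the next request without releasing the backing array.
    void recycle() {
        length = 0;
    }

    int length() {
        return length;
    }

    int capacity() {
        return buf.length;
    }

    public static void main(String[] args) {
        RecyclableBufferSketch holder = new RecyclableBufferSketch();
        for (int request = 0; request < 3; request++) {
            holder.set(("cookie" + request + "=value").getBytes());
            System.out.println("length=" + holder.length() + " capacity=" + holder.capacity());
            holder.recycle(); // same object and array are reused by the next request
        }
    }
}
```

The point is that steady-state request handling allocates nothing new,
which is what converting every header to a String would give up.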


> - Lazy parsing as currently (but unless cookie based session tracking is
>   disabled this is going to run on every request)
>

but our cookie, JSESSIONID, doesn't have to be UTF-8, does it?
this goes hand in hand with the SessionIdGenerator that Rainer just did,
can that return UTF-8 values?
So the lazy part can apply to all other cookies, meaning, don't parse it
until the app requests it, just store the bytes and move on.
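
Something along these lines, as a sketch only. The class name is
hypothetical, the parsing is deliberately naive (plain "name=value"
pairs split on ';'), and it is not thread-safe.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class LazyCookiesSketch {

    private final byte[] rawHeader;      // stored untouched until needed
    private Map<String, String> parsed;  // filled in on first access
    private int parseCount;              // only here to show parsing happens once

    LazyCookiesSketch(byte[] rawHeader) {
        this.rawHeader = rawHeader;
    }

    // Parse on first request only; later calls reuse the result.
    Map<String, String> getCookies() {
        if (parsed == null) {
            parseCount++;
            parsed = new LinkedHashMap<>();
            String header = new String(rawHeader, StandardCharsets.UTF_8);
            for (String pair : header.split(";")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    parsed.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
                }
            }
        }
        return parsed;
    }

    int getParseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        LazyCookiesSketch lazy =
                new LazyCookiesSketch("theme=dark; lang=de".getBytes(StandardCharsets.UTF_8));
        System.out.println("parses before access: " + lazy.getParseCount());
        System.out.println(lazy.getCookies());
        lazy.getCookies();
        System.out.println("parses after two accesses: " + lazy.getParseCount());
    }
}
```

If the app never asks for cookies, the header is never decoded at all.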


> - Convert headers to UTF-8 strings
> - Parse them with a new parser along the lines of o.a.t.u.http.parser
> - Have that parser return an array of javax.servlet.http.Cookie objects
> - Pass those to the app if/when requested
>
> In terms of handling RFC6265 and RFC2109 my plan is to have two parsers,
> share as much code as possible and switch between them based on the
> cookie header with the expectation that 99.9% of cookies will be parsed
> by the RFC6265 parser. We could add some options to this switching to
> enable other parsers (e.g. a Netscape parser) to be used.
>

I like the idea of swappable parsers, with the default being the exact
behavior you see now. I can see changing the default after some
stabilization.



>
> I'd also like to keep the current cookie parsing implementation for now.
> Until we are happy with the new parsing, the current implementation will
> be the default. Once we are happy with the new parsing we can change the
> default. We can add an option to switch between the current and the new
> parsing.
>
> Thoughts?
>

knock it out. 



>
>
> Mark
>
>
> [1] https://wiki.apache.org/tomcat/Cookies
>
>
>