Posted to dev@kafka.apache.org by radai <ra...@gmail.com> on 2016/11/07 21:08:44 UTC

[VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Hi,

I would like to initiate a vote on KIP-72:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests

The KIP allows specifying a limit on the amount of memory allocated for
reading incoming requests. This is useful for "sizing" a broker and
avoiding OOMEs under heavy load (as actually happens occasionally at
LinkedIn).

I believe I've addressed most (all?) concerns brought up during the
discussion.

To the best of my understanding this vote is about the goal and
public-facing changes related to the new proposed behavior, but as for
the implementation, I have the code up here:

https://github.com/radai-rosenblatt/kafka/tree/broker-memory-pool-with-muting

and I've stress-tested it to work properly (meaning it chugs along and
throttles under loads that would DOS 0.10.1.0 code).

I also believe that the primitives and patterns introduced in this KIP
(namely the notion of a buffer pool, and retrieving from / releasing to said
pool instead of allocating memory) are generally useful beyond the scope of
this KIP, both for performance issues (allocating lots of short-lived large
buffers is a performance bottleneck) and for other areas where memory limits
are a problem (KIP-81).
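
To illustrate the pattern, a minimal sketch of such a bounded pool could look
like the following (class and method names here - tryAllocate, release,
availableMemory - are illustrative assumptions, not necessarily the API in the
actual patch):

    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch of a bounded buffer pool for incoming requests.
    public class BoundedMemoryPool {
        private final long capacityBytes;          // e.g. queued.max.bytes
        private final AtomicLong availableBytes;

        public BoundedMemoryPool(long capacityBytes) {
            this.capacityBytes = capacityBytes;
            this.availableBytes = new AtomicLong(capacityBytes);
        }

        // Returns a buffer of the requested size, or null if the pool is
        // exhausted; callers are expected to mute the socket and retry later.
        public ByteBuffer tryAllocate(int sizeBytes) {
            while (true) {
                long avail = availableBytes.get();
                if (avail < sizeBytes)
                    return null;
                if (availableBytes.compareAndSet(avail, avail - sizeBytes))
                    return ByteBuffer.allocate(sizeBytes);
            }
        }

        // Must be called exactly once for every buffer handed out above.
        public void release(ByteBuffer buffer) {
            availableBytes.addAndGet(buffer.capacity());
        }

        public long availableMemory() {
            return availableBytes.get();
        }

        public long capacity() {
            return capacityBytes;
        }
    }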

Thank you,

Radai.

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks for the KIP, +1 from me.

I have a few comments with regard to the method names chosen, but since
none of the classes in question are public API, I'll comment directly in
the PR.

Ismael

On Mon, Nov 7, 2016 at 9:08 PM, radai <ra...@gmail.com> wrote:

> Hi,
>
> I would like to initiate a vote on KIP-72:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 72%3A+Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
>
> The kip allows specifying a limit on the amount of memory allocated for
> reading incoming requests into. This is useful for "sizing" a broker and
> avoiding OOMEs under heavy load (as actually happens occasionally at
> linkedin).
>
> I believe I've addressed most (all?) concerns brought up during the
> discussion.
>
> To the best of my understanding this vote is about the goal and
> public-facing changes related to the new proposed behavior, but as for
> implementation, i have the code up here:
>
> https://github.com/radai-rosenblatt/kafka/tree/broker-
> memory-pool-with-muting
>
> and I've stress-tested it to work properly (meaning it chugs along and
> throttles under loads that would DOS 10.0.1.0 code).
>
> I also believe that the primitives and "pattern"s introduced in this KIP
> (namely the notion of a buffer pool and retrieving from / releasing to said
> pool instead of allocating memory) are generally useful beyond the scope of
> this KIP for both performance issues (allocating lots of short-lived large
> buffers is a performance bottleneck) and other areas where memory limits
> are a problem (KIP-81)
>
> Thank you,
>
> Radai.
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Joel Koshy <jj...@gmail.com>.
+1 on the KIP.
I'll comment more on the formal PR. Also, can you link a JIRA for this
from the KIP?

Thanks,

Joel

On Tue, Jan 3, 2017 at 11:14 AM, radai <ra...@gmail.com> wrote:

> I've just re-validated the functionality works - broker throttles under
> stress instead of OOMs.
>
> at this point my branch (
> https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> -pool-with-muting)
> is "code complete" and somewhat tested and im waiting on the voting process
> to come to a conclusion before moving forward.
>
> On Fri, Dec 16, 2016 at 4:46 PM, radai <ra...@gmail.com> wrote:
>
> > I've added the 3 new metrics/sensors i've implemented to the KIP.
> >
> > at this point I would need to re-validate the functionality (which i
> > expect to do early january).
> >
> > code reviews welcome ;-)
> >
> > On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com>
> > wrote:
> >
> >> will do (only added a single one so far, the rest TBD)
> >>
> >> On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
> >>
> >>> Hi, Radai,
> >>>
> >>> Could you add a high level description of the newly added metrics to
> the
> >>> KIP wiki?
> >>>
> >>> Thanks,
> >>>
> >>> Jun
> >>>
> >>> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi Jun,
> >>> >
> >>> > I've added the sensor you requested (or at least I think I did ....)
> >>> >
> >>> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
> >>> >
> >>> > > KafkaRequestHandlerPool
> >>> >
> >>>
> >>
> >>
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
I've just re-validated that the functionality works - the broker throttles
under stress instead of OOMing.

At this point my branch (
https://github.com/radai-rosenblatt/kafka/tree/broker-memory-pool-with-muting)
is "code complete" and somewhat tested, and I'm waiting on the voting process
to come to a conclusion before moving forward.

On Fri, Dec 16, 2016 at 4:46 PM, radai <ra...@gmail.com> wrote:

> I've added the 3 new metrics/sensors i've implemented to the KIP.
>
> at this point I would need to re-validate the functionality (which i
> expect to do early january).
>
> code reviews welcome ;-)
>
> On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com>
> wrote:
>
>> will do (only added a single one so far, the rest TBD)
>>
>> On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
>>
>>> Hi, Radai,
>>>
>>> Could you add a high level description of the newly added metrics to the
>>> KIP wiki?
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com>
>>> wrote:
>>>
>>> > Hi Jun,
>>> >
>>> > I've added the sensor you requested (or at least I think I did ....)
>>> >
>>> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
>>> >
>>> > > KafkaRequestHandlerPool
>>> >
>>>
>>
>>
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
JIRA up - https://issues.apache.org/jira/browse/KAFKA-4602
PR up - https://github.com/apache/kafka/pull/2330
KIP wiki has been updated.



On Fri, Jan 6, 2017 at 8:16 AM, radai <ra...@gmail.com> wrote:

> Will do (sorry for the delay).
> and thank you.
>
> On Fri, Jan 6, 2017 at 7:56 AM, Ismael Juma <is...@juma.me.uk> wrote:
>
>> Radai, you have more than enough votes to declare the vote successful.
>> Maybe it's time to do so. :) Also, once you have done that, it would be
>> good to move this KIP to the adopted table in the wiki.
>>
>> Thanks!
>>
>> Ismael
>>
>> On Fri, Jan 6, 2017 at 2:10 AM, Jun Rao <ju...@confluent.io> wrote:
>>
>> > Hi, Radai,
>> >
>> > The new metrics look good. +1 on the KIP.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Fri, Dec 16, 2016 at 4:46 PM, radai <ra...@gmail.com>
>> wrote:
>> >
>> > > I've added the 3 new metrics/sensors i've implemented to the KIP.
>> > >
>> > > at this point I would need to re-validate the functionality (which i
>> > expect
>> > > to do early january).
>> > >
>> > > code reviews welcome ;-)
>> > >
>> > > On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com>
>> > > wrote:
>> > >
>> > > > will do (only added a single one so far, the rest TBD)
>> > > >
>> > > > On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
>> > > >
>> > > >> Hi, Radai,
>> > > >>
>> > > >> Could you add a high level description of the newly added metrics
>> to
>> > the
>> > > >> KIP wiki?
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> Jun
>> > > >>
>> > > >> On Wed, Nov 23, 2016 at 3:45 PM, radai <radai.rosenblatt@gmail.com
>> >
>> > > >> wrote:
>> > > >>
>> > > >> > Hi Jun,
>> > > >> >
>> > > >> > I've added the sensor you requested (or at least I think I did
>> ....)
>> > > >> >
>> > > >> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io>
>> wrote:
>> > > >> >
>> > > >> > > KafkaRequestHandlerPool
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Will do (sorry for the delay).
And thank you.

On Fri, Jan 6, 2017 at 7:56 AM, Ismael Juma <is...@juma.me.uk> wrote:

> Radai, you have more than enough votes to declare the vote successful.
> Maybe it's time to do so. :) Also, once you have done that, it would be
> good to move this KIP to the adopted table in the wiki.
>
> Thanks!
>
> Ismael
>
> On Fri, Jan 6, 2017 at 2:10 AM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Radai,
> >
> > The new metrics look good. +1 on the KIP.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Dec 16, 2016 at 4:46 PM, radai <ra...@gmail.com>
> wrote:
> >
> > > I've added the 3 new metrics/sensors i've implemented to the KIP.
> > >
> > > at this point I would need to re-validate the functionality (which i
> > expect
> > > to do early january).
> > >
> > > code reviews welcome ;-)
> > >
> > > On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com>
> > > wrote:
> > >
> > > > will do (only added a single one so far, the rest TBD)
> > > >
> > > > On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > >> Hi, Radai,
> > > >>
> > > >> Could you add a high level description of the newly added metrics to
> > the
> > > >> KIP wiki?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jun
> > > >>
> > > >> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Jun,
> > > >> >
> > > >> > I've added the sensor you requested (or at least I think I did
> ....)
> > > >> >
> > > >> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io>
> wrote:
> > > >> >
> > > >> > > KafkaRequestHandlerPool
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Ismael Juma <is...@juma.me.uk>.
Radai, you have more than enough votes to declare the vote successful.
Maybe it's time to do so. :) Also, once you have done that, it would be
good to move this KIP to the adopted table in the wiki.

Thanks!

Ismael

On Fri, Jan 6, 2017 at 2:10 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Radai,
>
> The new metrics look good. +1 on the KIP.
>
> Thanks,
>
> Jun
>
> On Fri, Dec 16, 2016 at 4:46 PM, radai <ra...@gmail.com> wrote:
>
> > I've added the 3 new metrics/sensors i've implemented to the KIP.
> >
> > at this point I would need to re-validate the functionality (which i
> expect
> > to do early january).
> >
> > code reviews welcome ;-)
> >
> > On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com>
> > wrote:
> >
> > > will do (only added a single one so far, the rest TBD)
> > >
> > > On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > >> Hi, Radai,
> > >>
> > >> Could you add a high level description of the newly added metrics to
> the
> > >> KIP wiki?
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi Jun,
> > >> >
> > >> > I've added the sensor you requested (or at least I think I did ....)
> > >> >
> > >> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
> > >> >
> > >> > > KafkaRequestHandlerPool
> > >> >
> > >>
> > >
> > >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

The new metrics look good. +1 on the KIP.

Thanks,

Jun

On Fri, Dec 16, 2016 at 4:46 PM, radai <ra...@gmail.com> wrote:

> I've added the 3 new metrics/sensors i've implemented to the KIP.
>
> at this point I would need to re-validate the functionality (which i expect
> to do early january).
>
> code reviews welcome ;-)
>
> On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com>
> wrote:
>
> > will do (only added a single one so far, the rest TBD)
> >
> > On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> >> Hi, Radai,
> >>
> >> Could you add a high level description of the newly added metrics to the
> >> KIP wiki?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com>
> >> wrote:
> >>
> >> > Hi Jun,
> >> >
> >> > I've added the sensor you requested (or at least I think I did ....)
> >> >
> >> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
> >> >
> >> > > KafkaRequestHandlerPool
> >> >
> >>
> >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
I've added the 3 new metrics/sensors I've implemented to the KIP.

At this point I would need to re-validate the functionality (which I expect
to do early January).

Code reviews welcome ;-)

On Mon, Nov 28, 2016 at 10:37 AM, radai <ra...@gmail.com> wrote:

> will do (only added a single one so far, the rest TBD)
>
> On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:
>
>> Hi, Radai,
>>
>> Could you add a high level description of the newly added metrics to the
>> KIP wiki?
>>
>> Thanks,
>>
>> Jun
>>
>> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com>
>> wrote:
>>
>> > Hi Jun,
>> >
>> > I've added the sensor you requested (or at least I think I did ....)
>> >
>> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
>> >
>> > > KafkaRequestHandlerPool
>> >
>>
>
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Will do (only added a single one so far, the rest TBD).

On Mon, Nov 28, 2016 at 10:04 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Radai,
>
> Could you add a high level description of the newly added metrics to the
> KIP wiki?
>
> Thanks,
>
> Jun
>
> On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > I've added the sensor you requested (or at least I think I did ....)
> >
> > On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > KafkaRequestHandlerPool
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

Could you add a high level description of the newly added metrics to the
KIP wiki?

Thanks,

Jun

On Wed, Nov 23, 2016 at 3:45 PM, radai <ra...@gmail.com> wrote:

> Hi Jun,
>
> I've added the sensor you requested (or at least I think I did ....)
>
> On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > KafkaRequestHandlerPool
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Hi Jun,

I've added the sensor you requested (or at least I think I did ....)

On Fri, Nov 18, 2016 at 12:37 PM, Jun Rao <ju...@confluent.io> wrote:

> KafkaRequestHandlerPool

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

3. Having a gauge of MemoryAvailable is useful. One issue with that, though,
is that if one only collects the metrics, say, every minute, one doesn't know
what has happened in between. We could additionally track the fraction of
the time when requested memory can't be served. Every time a request can't
be honored, we mark the starting time in the memory pool. Every time a request
is honored, we stop the timer. We can then expose that accumulated fraction
of time as a Rate (similar to RequestHandlerAvgIdlePercent
in KafkaRequestHandlerPool). This will be a value between 0 and 1. The
higher the value, the more memory pressure.
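
A minimal sketch of that accumulation, shown as a self-contained class for
clarity (in the actual patch this would feed a Kafka Rate sensor; the class
and method names are illustrative assumptions):

    // Hypothetical sketch: track the fraction of time the pool was depleted.
    public class DepletedTimeTracker {
        private long depletedNanos = 0;     // accumulated time spent depleted
        private long depletedSince = -1;    // -1 means "not currently depleted"
        private long windowStartNanos = System.nanoTime();

        // Called when an allocation request cannot be honored.
        public synchronized void markDepleted(long nowNanos) {
            if (depletedSince < 0)
                depletedSince = nowNanos;
        }

        // Called when memory is released and requests can be honored again.
        public synchronized void markAvailable(long nowNanos) {
            if (depletedSince >= 0) {
                depletedNanos += nowNanos - depletedSince;
                depletedSince = -1;
            }
        }

        // Fraction of time spent depleted since the last reading; 0 to 1.
        public synchronized double readAndReset(long nowNanos) {
            if (depletedSince >= 0) {       // count the still-open interval
                depletedNanos += nowNanos - depletedSince;
                depletedSince = nowNanos;
            }
            long window = Math.max(1, nowNanos - windowStartNanos);
            double fraction = (double) depletedNanos / window;
            depletedNanos = 0;
            windowStartNanos = nowNanos;
            return fraction;
        }
    }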

Thanks,

Jun

On Fri, Nov 18, 2016 at 8:35 AM, radai <ra...@gmail.com> wrote:

> Hi Jun,
>
> 3. will (also :-) ) do. do you have ideas for appropriate names/metrics?
> I'm thinking along the lines of "MemoryAvailable" (current snapshot value
> from pool) and "Throttles" (some moving-avg of how often does throttling
> due to no mem kicks in). maybe also "BuffersOutstanding" ?
>
> On Thu, Nov 17, 2016 at 7:01 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Radai,
> >
> > 2. Yes, on the server side, the timeout is hardcoded at 300ms. That's not
> > too bad. We can just leave it as it is.
> >
> > 3. Another thing. Do we plan to expose some JMX metrics so that we can
> > monitor if there is any memory pressure in the pool?
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Nov 17, 2016 at 8:57 AM, radai <ra...@gmail.com>
> wrote:
> >
> > > Hi Jun,
> > >
> > > 1. will do.
> > >
> > > 2. true. for several reasons:
> > >    2.1. which selector? there's a single pool but 16 selectors
> (linkedin
> > > typical, num.network.threads defaults to 3)
> > >    2.2. even if i could figure out which selector (all?) the better
> thing
> > > to do would be resume reading not when any memory becomes available
> > > (because worst case its not enough for anything) but when some "low
> > > watermark" of available memory is hit - so mute when @100% mem, unmute
> > when
> > > back down to 90%?
> > >    2.3. on the broker side (which is the current concern for my patch)
> > this
> > > max wait time is a hardcoded 300 ms (SocketServer.Processor.poll()),
> > which
> > > i think is acceptable and definitely not arbitrary or configurable.
> > >
> > >    if you still think this needs to be addressed (and you are right
> that
> > in
> > > the general case the timeout param could be arbitrary) i can implement
> > the
> > > watermark approach + pool.waitForLowWatermark(timeout) or something,
> and
> > > make Selector.poll() wait for low watermark at the end of poll() if no
> > work
> > > has been done (so as not to wait on memory needlessly for requests that
> > > done require it, as rajini suggested earlier)
> > >
> > > On Wed, Nov 16, 2016 at 9:04 AM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Radai,
> > > >
> > > > Thanks for the updated proposal. +1 overall. A couple of comments
> > below.
> > > >
> > > > 1. Our current convention is to avoid using getters. Could you change
> > > > getSize and getAvailableMemory accordingly? Also, size is bit
> > ambiguous,
> > > > could we use sth like capacity?
> > > >
> > > > 2. This is more on the implementation details. I didn't see any code
> to
> > > > wake up the selector when memory is released from the pool. For
> > example,
> > > > suppose that all socket keys are muted since the pool is full. The
> > > > selector.poll() call will wait for the timeout, which could be
> > > arbitrarily
> > > > long. Now, if some memory is released, it seems that we should wake
> up
> > > the
> > > > selector early instead of waiting for the timeout.
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Mon, Nov 14, 2016 at 11:41 AM, Rajini Sivaram <
> > > > rajinisivaram@googlemail.com> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Thank you for the KIP, Radai.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 6:07 PM, Mickael Maison <
> > > > mickael.maison@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1. We've also been hit by OOMs on the broker because we were not
> > > able
> > > > > > to properly bound its memory usage.
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 5:56 PM, radai <
> radai.rosenblatt@gmail.com
> > >
> > > > > wrote:
> > > > > > > @rajini - fixed the hasBytesBuffered() method. also updated
> > poll()
> > > so
> > > > > > that
> > > > > > > no latency is added for picking up data stuck in ssl buffers
> > > (timeout
> > > > > is
> > > > > > > set to 0, just like with immediately connected keys and staged
> > > > > receives).
> > > > > > > thank you for pointing these out.
> > > > > > > added ssl (re) testing to the KIP testing plan.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> > > > > > > rajinisivaram@googlemail.com> wrote:
> > > > > > >
> > > > > > >> Open point 1. I would just retain the current long value that
> > > > > specifies
> > > > > > >> queued.max.bytes as long and not as %heap since it is simple
> and
> > > > easy
> > > > > to
> > > > > > >> use. And keeps it consistent with other ".bytes" configs.
> > > > > > >>
> > > > > > >> Point 3. ssl buffers - I am not quite sure the implementation
> > > looks
> > > > > > >> correct. hasBytesBuffered() is checking position() of buffers
> ==
> > > 0.
> > > > > And
> > > > > > the
> > > > > > >> code checks this only when poll with a timeout returns
> (adding a
> > > > delay
> > > > > > when
> > > > > > >> there is nothing else to read).
> > > > > > >> But since this and open point 2 (optimization) are
> > implementation
> > > > > > details,
> > > > > > >> they can be looked at during PR review.
> > > > > > >>
> > > > > > >> It will be good to add SSL testing to the test plan as well,
> > since
> > > > > > there is
> > > > > > >> additional code to test for SSL.
> > > > > > >>
> > > > > > >>
> > > > > > >> On Fri, Nov 11, 2016 at 9:03 PM, radai <
> > > radai.rosenblatt@gmail.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> > ok, i've made the following changes:
> > > > > > >> >
> > > > > > >> > 1. memory.pool.class.name has been removed
> > > > > > >> > 2. the code now only uses SimpleMemoryPool. the gc variant
> is
> > > left
> > > > > > >> (unused)
> > > > > > >> > as a developement aid and is unsettable via configuration.
> > > > > > >> > 3. I've resolved the issue of stale data getting stuck in
> > > > > intermediate
> > > > > > >> > (ssl) buffers.
> > > > > > >> > 4. default value for queued.max.bytes is -1, so off by
> > default.
> > > > any
> > > > > > <=0
> > > > > > >> > value is interpreted as off by the underlying code.
> > > > > > >> >
> > > > > > >> > open points:
> > > > > > >> >
> > > > > > >> > 1. the kafka config framework doesnt allow a value to be
> > either
> > > > long
> > > > > > or
> > > > > > >> > double, so in order to pull off the queued.max.bytes =
> 1000000
> > > or
> > > > > > >> > queued.max.bytes = 0.3 thing i'd need to define the config
> as
> > > type
> > > > > > >> string,
> > > > > > >> > which is ugly to me. do we want to support setting
> > > > queued.max.bytes
> > > > > > to %
> > > > > > >> of
> > > > > > >> > heap ? if so, by way of making queued.max.bytes of type
> > string,
> > > or
> > > > > by
> > > > > > way
> > > > > > >> > of a 2nd config param (with the resulting
> > > either/all/combination?
> > > > > > >> > validation). my personal opinion is string because i think a
> > > > single
> > > > > > >> > queued.max.bytes with overloaded meaning is more
> > understandable
> > > to
> > > > > > users.
> > > > > > >> > i'll await other people's opinions before doing anything.
> > > > > > >> > 2. i still need to evaluate rajini's optimization. sounds
> > > doable.
> > > > > > >> >
> > > > > > >> > asides:
> > > > > > >> >
> > > > > > >> > 1. i think you guys misunderstood the intent behind the gc
> > pool.
> > > > it
> > > > > > was
> > > > > > >> > never meant to be a magic pool that automatically releases
> > > buffers
> > > > > > >> (because
> > > > > > >> > just as rajini stated the performance implications would be
> > > > > > horrible). it
> > > > > > >> > was meant to catch leaks early. since that is indeed a
> > dev-only
> > > > > > concern
> > > > > > >> it
> > > > > > >> > wont ever get used in production.
> > > > > > >> > 2. i said this on some other kip discussion: i think the
> nice
> > > > thing
> > > > > > about
> > > > > > >> > the pool API is it "scales" from just keeping a memory bound
> > to
> > > > > > actually
> > > > > > >> > re-using buffers without changing the calling code. i think
> > > > > > >> actuallypooling
> > > > > > >> > large buffers will result in a significant performance
> impact,
> > > but
> > > > > > thats
> > > > > > >> > outside the scope of this kip. at that point i think more
> pool
> > > > > > >> > implementations (that actually pool) would be written. i
> agree
> > > > with
> > > > > > the
> > > > > > >> > ideal of exposing as few knobs as possible, but switching
> > pools
> > > > (or
> > > > > > pool
> > > > > > >> > params) for tuning may happen at some later point.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> > > > > > >> > rajinisivaram@googlemail.com> wrote:
> > > > > > >> >
> > > > > > >> > > 13. At the moment, I think channels are not muted if:
> > > > > > >> > >     channel.receive != null && channel.receive.buffer !=
> > null
> > > > > > >> > > This mutes all channels that aren't holding onto a
> > incomplete
> > > > > > buffer.
> > > > > > >> > They
> > > > > > >> > > may or may not have read the 4-byte size.
> > > > > > >> > >
> > > > > > >> > > I was thinking you could avoid muting channels if:
> > > > > > >> > >     channel.receive == null || channel.receive.size.
> > > remaining()
> > > > > > >> > > This will not mute channels that are holding onto a buffer
> > (as
> > > > > > above).
> > > > > > >> In
> > > > > > >> > > addition, it will not mute channels that haven't read the
> > > 4-byte
> > > > > > size.
> > > > > > >> A
> > > > > > >> > > client that is closed gracefully while the pool is full
> will
> > > not
> > > > > be
> > > > > > >> muted
> > > > > > >> > > in this case and the server can process close without
> > waiting
> > > > for
> > > > > > the
> > > > > > >> > pool
> > > > > > >> > > to free up. Once the 4-byte size is read, the channel will
> > be
> > > > > muted
> > > > > > if
> > > > > > >> > the
> > > > > > >> > > pool is still out of memory - for each channel, at most
> one
> > > > failed
> > > > > > read
> > > > > > >> > > attempt would be made while the pool is out of memory. I
> > think
> > > > > this
> > > > > > >> would
> > > > > > >> > > also delay muting of SSL channels since they can continue
> to
> > > > read
> > > > > > into
> > > > > > >> > > their (already allocated) network buffers and unwrap the
> > data
> > > > and
> > > > > > block
> > > > > > >> > > only when they need to allocate a buffer from the pool.
> > > > > > >> > >
> > > > > > >> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <
> > jay@confluent.io>
> > > > > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Hey Radai,
> > > > > > >> > > >
> > > > > > >> > > > +1 on deprecating and eventually removing the old
> config.
> > > The
> > > > > > >> intention
> > > > > > >> > > was
> > > > > > >> > > > absolutely bounding memory usage. I think having two
> ways
> > of
> > > > > doing
> > > > > > >> > this,
> > > > > > >> > > > one that gives a crisp bound on memory and one that is
> > hard
> > > to
> > > > > > reason
> > > > > > >> > > about
> > > > > > >> > > > is pretty confusing. I think people will really
> appreciate
> > > > > having
> > > > > > one
> > > > > > >> > > > config which instead lets them directly control the
> thing
> > > they
> > > > > > >> actually
> > > > > > >> > > > care about (memory).
> > > > > > >> > > >
> > > > > > >> > > > I also want to second Jun's concern on the complexity of
> > the
> > > > > > >> self-GCing
> > > > > > >> > > > memory pool. I wrote the memory pool for the producer.
> In
> > > that
> > > > > > area
> > > > > > >> the
> > > > > > >> > > > pooling of messages is the single biggest factor in
> > > > performance
> > > > > of
> > > > > > >> the
> > > > > > >> > > > client so I believed it was worth some
> > > > sophistication/complexity
> > > > > > if
> > > > > > >> > there
> > > > > > >> > > > was performance payoff. All the same, the complexity of
> > that
> > > > > code
> > > > > > has
> > > > > > >> > > made
> > > > > > >> > > > it VERY hard to keep correct (it gets broken roughly
> every
> > > > other
> > > > > > time
> > > > > > >> > > > someone makes a change). Over time I came to feel a lot
> > less
> > > > > > proud of
> > > > > > >> > my
> > > > > > >> > > > cleverness. I learned something interesting reading your
> > > > > > self-GCing
> > > > > > >> > > memory
> > > > > > >> > > > pool, but I wonder if the complexity is worth the payoff
> > in
> > > > this
> > > > > > >> case?
> > > > > > >> > > >
> > > > > > >> > > > Philosophically we've tried really hard to avoid
> > needlessly
> > > > > > >> "pluggable"
> > > > > > >> > > > implementations. That is, when there is a temptation to
> > > give a
> > > > > > config
> > > > > > >> > > that
> > > > > > >> > > > plugs in different Java classes at run time for
> > > implementation
> > > > > > >> choices,
> > > > > > >> > > we
> > > > > > >> > > > should instead think of how to give the user the good
> > > behavior
> > > > > > >> > > > automatically. I think the use case for configuring a
> the
> > > > GCing
> > > > > > pool
> > > > > > >> > > would
> > > > > > >> > > > be if you discovered a bug in which memory leaked. But
> > this
> > > > > isn't
> > > > > > >> > > something
> > > > > > >> > > > the user should have to think about right? If there is a
> > bug
> > > > we
> > > > > > >> should
> > > > > > >> > > find
> > > > > > >> > > > and fix it.
> > > > > > >> > > >
> > > > > > >> > > > -Jay
> > > > > > >> > > >
> > > > > > >> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <
> > > > > > radai.rosenblatt@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > jun's #1 + rajini's #11 - the new config param is to
> > > enable
> > > > > > >> changing
> > > > > > >> > > the
> > > > > > >> > > > > pool implentation class. as i said in my response to
> > jun i
> > > > > will
> > > > > > >> make
> > > > > > >> > > the
> > > > > > >> > > > > default pool impl be the simple one, and this param is
> > to
> > > > > allow
> > > > > > a
> > > > > > >> > user
> > > > > > >> > > > > (more likely a dev) to change it.
> > > > > > >> > > > > both the simple pool and the "gc pool" make basically
> > just
> > > > an
> > > > > > >> > > > > AtomicLong.get() + (hashmap.put for gc) calls before
> > > > > returning a
> > > > > > >> > > buffer.
> > > > > > >> > > > > there is absolutely no dependency on GC times in
> > > allocating
> > > > > (or
> > > > > > >> not).
> > > > > > >> > > the
> > > > > > >> > > > > extra background thread in the gc pool is forever
> asleep
> > > > > unless
> > > > > > >> there
> > > > > > >> > > are
> > > > > > >> > > > > bugs (==leaks) so the extra cost is basically nothing
> > > > (backed
> > > > > by
> > > > > > >> > > > > benchmarks). let me re-itarate again - ANY BUFFER
> > > ALLOCATED
> > > > > MUST
> > > > > > >> > ALWAYS
> > > > > > >> > > > BE
> > > > > > >> > > > > RELEASED - so the gc pool should not rely on gc for
> > > > reclaiming
> > > > > > >> > buffers.
> > > > > > >> > > > its
> > > > > > >> > > > > a bug detector, not a feature and is definitely not
> > > intended
> > > > > to
> > > > > > >> hide
> > > > > > >> > > > bugs -
> > > > > > >> > > > > the exact opposite - its meant to expose them sooner.
> > i've
> > > > > > cleaned
> > > > > > >> up
> > > > > > >> > > the
> > > > > > >> > > > > docs to avoid this confusion. i also like the fail on
> > > leak.
> > > > > will
> > > > > > >> do.
> > > > > > >> > > > > as for the gap between pool size and heap size -
> thats a
> > > > valid
> > > > > > >> > > argument.
> > > > > > >> > > > > may allow also sizing the pool as % of heap size? so
> > > > > > >> > queued.max.bytes =
> > > > > > >> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of
> > > > > available
> > > > > > >> > heap?
> > > > > > >> > > > >
> > > > > > >> > > > > jun's 2.2 - queued.max.bytes +
> socket.request.max.bytes
> > > > still
> > > > > > >> holds,
> > > > > > >> > > > > assuming the ssl-related buffers are small. the
> largest
> > > > > > weakness in
> > > > > > >> > > this
> > > > > > >> > > > > claim has to do with decompression rather than
> anything
> > > > > > >> ssl-related.
> > > > > > >> > so
> > > > > > >> > > > yes
> > > > > > >> > > > > there is an O(#ssl connections * sslEngine packet
> size)
> > > > > > component,
> > > > > > >> > but
> > > > > > >> > > i
> > > > > > >> > > > > think its small. again - decompression should be the
> > > > concern.
> > > > > > >> > > > >
> > > > > > >> > > > > rajini's #13 - interesting optimization. the problem
> is
> > > > > there's
> > > > > > no
> > > > > > >> > > > knowing
> > > > > > >> > > > > in advance what the _next_ request to come out of a
> > socket
> > > > is,
> > > > > > so
> > > > > > >> > this
> > > > > > >> > > > > would mute just those sockets that are 1. mutable and
> 2.
> > > > have
> > > > > a
> > > > > > >> > > > > buffer-demanding request for which we could not
> > allocate a
> > > > > > buffer.
> > > > > > >> > > > downside
> > > > > > >> > > > > is that as-is this would cause the busy-loop on poll()
> > > that
> > > > > the
> > > > > > >> mutes
> > > > > > >> > > > were
> > > > > > >> > > > > supposed to prevent - or code would need to be added
> to
> > > > > > ad-hocmute
> > > > > > >> a
> > > > > > >> > > > > connection that was so-far unmuted but has now
> > generated a
> > > > > > >> > > > memory-demanding
> > > > > > >> > > > > request?
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > > > > >> > > > > rajinisivaram@googlemail.com> wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Radai,
> > > > > > >> > > > > >
> > > > > > >> > > > > > 11. The KIP talks about a new server configuration
> > > > parameter
> > > > > > >> > > > > > *memory.pool.class.name
> > > > > > >> > > > > > <http://memory.pool.class.name> *which is not in
> the
> > > > > > >> > implementation.
> > > > > > >> > > > Is
> > > > > > >> > > > > it
> > > > > > >> > > > > > still the case that the pool will be configurable?
> > > > > > >> > > > > >
> > > > > > >> > > > > > 12. Personally I would prefer not to have a garbage
> > > > > collected
> > > > > > >> pool
> > > > > > >> > > that
> > > > > > >> > > > > > hides bugs as well. Apart from the added code
> > complexity
> > > > and
> > > > > > >> extra
> > > > > > >> > > > thread
> > > > > > >> > > > > > to handle collections, I am also concerned about the
> > > > > > >> > > non-deterministic
> > > > > > >> > > > > > nature of GC timings. The KIP introduces delays in
> > > > > processing
> > > > > > >> > > requests
> > > > > > >> > > > > > based on the configuration parameter
> > *queued.max.bytes.
> > > > > *This
> > > > > > in
> > > > > > >> > > > > unrelated
> > > > > > >> > > > > > to the JVM heap size and hence pool can be full when
> > > there
> > > > > is
> > > > > > no
> > > > > > >> > > > pressure
> > > > > > >> > > > > > on the JVM to garbage collect. The KIP does not
> > prevent
> > > > > other
> > > > > > >> > > timeouts
> > > > > > >> > > > in
> > > > > > >> > > > > > the broker (eg. consumer session timeout) because it
> > is
> > > > > > relying
> > > > > > >> on
> > > > > > >> > > the
> > > > > > >> > > > > pool
> > > > > > >> > > > > > to be managed in a deterministic, timely manner.
> > Since a
> > > > > > garbage
> > > > > > >> > > > > collected
> > > > > > >> > > > > > pool cannot provide that guarantee, wouldn't it be
> > > better
> > > > to
> > > > > > run
> > > > > > >> > > tests
> > > > > > >> > > > > with
> > > > > > >> > > > > > a GC-pool that perhaps fails with a fatal error if
> it
> > > > > > encounters
> > > > > > >> a
> > > > > > >> > > > buffer
> > > > > > >> > > > > > that was not released?
> > > > > > >> > > > > >
> > > > > > >> > > > > > 13. The implementation currently mutes all channels
> > that
> > > > > don't
> > > > > > >> > have a
> > > > > > >> > > > > > receive buffer allocated. Would it make sense to
> mute
> > > only
> > > > > the
> > > > > > >> > > channels
> > > > > > >> > > > > > that need a buffer (i.e. allow channels to read the
> > > 4-byte
> > > > > > size
> > > > > > >> > that
> > > > > > >> > > is
> > > > > > >> > > > > not
> > > > > > >> > > > > > read using the pool) so that normal client
> connection
> > > > > close()
> > > > > > is
> > > > > > >> > > > handled
> > > > > > >> > > > > > even when the pool is full? Since the extra 4-bytes
> > may
> > > > > > already
> > > > > > >> be
> > > > > > >> > > > > > allocated for some connections, the total request
> > memory
> > > > has
> > > > > > to
> > > > > > >> > take
> > > > > > >> > > > into
> > > > > > >> > > > > > account *4*numConnections* bytes anyway.
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <
> > > > jun@confluent.io
> > > > > >
> > > > > > >> > wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > > Hi, Radai,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 1. Yes, I am concerned about the trickiness of
> > having
> > > to
> > > > > > deal
> > > > > > >> > with
> > > > > > >> > > > > wreak
> > > > > > >> > > > > > > refs. I think it's simpler to just have the simple
> > > > version
> > > > > > >> > > > instrumented
> > > > > > >> > > > > > > with enough debug/trace logging and do enough
> stress
> > > > > > testing.
> > > > > > >> > Since
> > > > > > >> > > > we
> > > > > > >> > > > > > > still have queued.max.requests, one can always
> fall
> > > back
> > > > > to
> > > > > > >> that
> > > > > > >> > > if a
> > > > > > >> > > > > > > memory leak issue is identified. We could also
> label
> > > the
> > > > > > >> feature
> > > > > > >> > as
> > > > > > >> > > > > beta
> > > > > > >> > > > > > if
> > > > > > >> > > > > > > we don't think this is production ready.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 2.2 I am just wondering after we fix that issue
> > > whether
> > > > > the
> > > > > > >> claim
> > > > > > >> > > > that
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > request memory is bounded by  queued.max.bytes +
> > > > > > >> > > > > socket.request.max.bytes
> > > > > > >> > > > > > > is still true.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Thanks,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Jun
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> > > > > > >> > radai.rosenblatt@gmail.com>
> > > > > > >> > > > > > wrote:
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Hi Jun,
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Thank you for taking the time to review this.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 1. short version - yes, the concern is bugs, but
> > the
> > > > > cost
> > > > > > is
> > > > > > >> > tiny
> > > > > > >> > > > and
> > > > > > >> > > > > > > worth
> > > > > > >> > > > > > > > it, and its a common pattern. long version:
> > > > > > >> > > > > > > >    1.1 detecting these types of bugs (leaks)
> > cannot
> > > be
> > > > > > easily
> > > > > > >> > > done
> > > > > > >> > > > > with
> > > > > > >> > > > > > > > simple testing, but requires stress/stability
> > tests
> > > > that
> > > > > > run
> > > > > > >> > for
> > > > > > >> > > a
> > > > > > >> > > > > long
> > > > > > >> > > > > > > > time (long enough to hit OOM, depending on leak
> > size
> > > > and
> > > > > > >> > > available
> > > > > > >> > > > > > > memory).
> > > > > > >> > > > > > > > this is why some sort of leak detector is
> > "standard
> > > > > > practice"
> > > > > > >> > > .for
> > > > > > >> > > > > > > example
> > > > > > >> > > > > > > > look at netty (http://netty.io/wiki/
> > > > > > >> reference-counted-objects.
> > > > > > >> > > > > > > > html#leak-detection-levels)
> > > > > > >> > > > > > > > <http://netty.io/wiki/reference-counted-objects
> .
> > > > > > >> > > > > > > html#leak-detection-levels
> > > > > > >> > > > > > > > >-
> > > > > > >> > > > > > > > they have way more complicated built-in leak
> > > detection
> > > > > > >> enabled
> > > > > > >> > by
> > > > > > >> > > > > > > default.
> > > > > > >> > > > > > > > as a concrete example - during development i did
> > not
> > > > > > properly
> > > > > > >> > > > dispose
> > > > > > >> > > > > > of
> > > > > > >> > > > > > > > in-progress KafkaChannel.receive when a
> connection
> > > was
> > > > > > >> abruptly
> > > > > > >> > > > > closed
> > > > > > >> > > > > > > and
> > > > > > >> > > > > > > > I only found it because of the log msg printed
> by
> > > the
> > > > > > pool.
> > > > > > >> > > > > > > >    1.2 I have a benchmark suite showing the
> > > > performance
> > > > > > cost
> > > > > > >> of
> > > > > > >> > > the
> > > > > > >> > > > > gc
> > > > > > >> > > > > > > pool
> > > > > > >> > > > > > > > is absolutely negligible -
> > > > > > >> > > > > > > > https://github.com/radai-
> > > rosenblatt/kafka-benchmarks/
> > > > > > >> > > > > > > > tree/master/memorypool-benchmarks
> > > > > > >> > > > > > > >    1.3 as for the complexity of the impl - its
> > just
> > > > ~150
> > > > > > >> lines
> > > > > > >> > > and
> > > > > > >> > > > > > pretty
> > > > > > >> > > > > > > > straight forward. i think the main issue is that
> > not
> > > > > many
> > > > > > >> > people
> > > > > > >> > > > are
> > > > > > >> > > > > > > > familiar with weak refs and ref queues.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >    how about making the pool impl class a config
> > > param
> > > > > > >> > (generally
> > > > > > >> > > > > good
> > > > > > >> > > > > > > > going forward), make the default be the simple
> > pool,
> > > > and
> > > > > > keep
> > > > > > >> > the
> > > > > > >> > > > GC
> > > > > > >> > > > > > one
> > > > > > >> > > > > > > as
> > > > > > >> > > > > > > > a dev/debug/triage aid?
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 2. the KIP itself doesnt specifically treat SSL
> at
> > > > all -
> > > > > > its
> > > > > > >> an
> > > > > > >> > > > > > > > implementation detail. as for my current patch,
> it
> > > has
> > > > > > some
> > > > > > >> > > minimal
> > > > > > >> > > > > > > > treatment of SSL - just enough to not mute SSL
> > > sockets
> > > > > > >> > > > mid-handshake
> > > > > > >> > > > > -
> > > > > > >> > > > > > > but
> > > > > > >> > > > > > > > the code in SslTransportLayer still allocates
> > > buffers
> > > > > > itself.
> > > > > > >> > it
> > > > > > >> > > is
> > > > > > >> > > > > my
> > > > > > >> > > > > > > > understanding that netReadBuffer/appReadBuffer
> > > > shouldn't
> > > > > > grow
> > > > > > >> > > > beyond
> > > > > > >> > > > > 2
> > > > > > >> > > > > > x
> > > > > > >> > > > > > > > sslEngine.getSession().getPacketBufferSize(),
> > > which i
> > > > > > assume
> > > > > > >> > to
> > > > > > >> > > be
> > > > > > >> > > > > > > small.
> > > > > > >> > > > > > > > they are also long lived (they live for the
> > duration
> > > > of
> > > > > > the
> > > > > > >> > > > > connection)
> > > > > > >> > > > > > > > which makes a poor fit for pooling. the bigger
> > fish
> > > to
> > > > > > fry i
> > > > > > >> > > think
> > > > > > >> > > > is
> > > > > > >> > > > > > > > decompression - you could read a 1MB blob into a
> > > > > > >> pool-provided
> > > > > > >> > > > buffer
> > > > > > >> > > > > > and
> > > > > > >> > > > > > > > then decompress it into 10MB of heap allocated
> on
> > > the
> > > > > spot
> > > > > > >> :-)
> > > > > > >> > > > also,
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > ssl code is extremely tricky.
> > > > > > >> > > > > > > >    2.2 just to make sure, youre talking about
> > > > > > Selector.java:
> > > > > > >> > > while
> > > > > > >> > > > > > > > ((networkReceive = channel.read()) != null)
> > > > > > >> > > > > > addToStagedReceives(channel,
> > > > > > >> > > > > > > > networkReceive); ? if so youre right, and i'll
> fix
> > > > that
> > > > > > >> > (probably
> > > > > > >> > > > by
> > > > > > >> > > > > > > > something similar to immediatelyConnectedKeys,
> not
> > > > sure
> > > > > > yet)
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll
> add
> > > > > > javadocs
> > > > > > >> and
> > > > > > >> > > > > update
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > wiki). isLowOnMem is basically the point where I
> > > start
> > > > > > >> > > randomizing
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > selection key handling order to avoid potential
> > > > > > starvation.
> > > > > > >> its
> > > > > > >> > > > > rather
> > > > > > >> > > > > > > > arbitrary and now that i think of it should
> > probably
> > > > not
> > > > > > >> exist
> > > > > > >> > > and
> > > > > > >> > > > be
> > > > > > >> > > > > > > > entirely contained in Selector (where the
> > shuffling
> > > > > takes
> > > > > > >> > place).
> > > > > > >> > > > > will
> > > > > > >> > > > > > > fix.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 4. will do.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or
> > > basically
> > > > > > >> anything
> > > > > > >> > > > <=0).
> > > > > > >> > > > > > > > Long.MAX_VALUE would still create a pool, that
> > would
> > > > > still
> > > > > > >> > waste
> > > > > > >> > > > time
> > > > > > >> > > > > > > > tracking resources. I dont really mind though if
> > you
> > > > > have
> > > > > > a
> > > > > > >> > > > preferred
> > > > > > >> > > > > > > magic
> > > > > > >> > > > > > > > value for off.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <
> > > > > jun@confluent.io
> > > > > > >
> > > > > > >> > > wrote:
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > > Hi, Radai,
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Thanks for the KIP. Some comments below.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 1. The KIP says "to facilitate faster
> > > implementation
> > > > > > (as a
> > > > > > >> > > safety
> > > > > > >> > > > > > net)
> > > > > > >> > > > > > > > the
> > > > > > >> > > > > > > > > pool will be implemented in such a way that
> > memory
> > > > > that
> > > > > > was
> > > > > > >> > not
> > > > > > >> > > > > > > > release()ed
> > > > > > >> > > > > > > > > (but still garbage collected) would be
> detected
> > > and
> > > > > > >> > > "reclaimed".
> > > > > > >> > > > > this
> > > > > > >> > > > > > > is
> > > > > > >> > > > > > > > to
> > > > > > >> > > > > > > > > prevent "leaks" in case of code paths that
> fail
> > to
> > > > > > >> release()
> > > > > > >> > > > > > > properly.".
> > > > > > >> > > > > > > > > What are the cases that could cause memory
> > leaks?
> > > If
> > > > > we
> > > > > > are
> > > > > > >> > > > > concerned
> > > > > > >> > > > > > > > about
> > > > > > >> > > > > > > > > bugs, it seems that it's better to just do
> more
> > > > > testing
> > > > > > to
> > > > > > >> > make
> > > > > > >> > > > > sure
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > usage of the simple implementation
> > > > (SimpleMemoryPool)
> > > > > is
> > > > > > >> > solid
> > > > > > >> > > > > > instead
> > > > > > >> > > > > > > of
> > > > > > >> > > > > > > > > adding more complicated logic
> > > > > > (GarbageCollectedMemoryPool)
> > > > > > >> to
> > > > > > >> > > > hide
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > > potential bugs.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 2. I am wondering how much this KIP covers the
> > SSL
> > > > > > channel
> > > > > > >> > > > > > > > implementation.
> > > > > > >> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> > > > > > >> > netWriteBuffer,
> > > > > > >> > > > > > > > > appReadBuffer per socket. Should those memory
> be
> > > > > > accounted
> > > > > > >> > for
> > > > > > >> > > in
> > > > > > >> > > > > > > memory
> > > > > > >> > > > > > > > > pool?
> > > > > > >> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> > > > > > >> > > > KafkaChannel.read(),
> > > > > > >> > > > > > > it's
> > > > > > >> > > > > > > > > possible for multiple NetworkReceives to be
> > > returned
> > > > > > since
> > > > > > >> > > > multiple
> > > > > > >> > > > > > > > > requests' data could be encrypted together by
> > SSL.
> > > > To
> > > > > > deal
> > > > > > >> > with
> > > > > > >> > > > > this,
> > > > > > >> > > > > > > we
> > > > > > >> > > > > > > > > stash those NetworkReceives in
> > > > Selector.stagedReceives
> > > > > > and
> > > > > > >> > give
> > > > > > >> > > > it
> > > > > > >> > > > > > back
> > > > > > >> > > > > > > > to
> > > > > > >> > > > > > > > > the poll() call one NetworkReceive at a time.
> > What
> > > > > this
> > > > > > >> means
> > > > > > >> > > is
> > > > > > >> > > > > > that,
> > > > > > >> > > > > > > if
> > > > > > >> > > > > > > > > we stop reading from KafkaChannel in the
> middle
> > > > > because
> > > > > > >> > memory
> > > > > > >> > > > pool
> > > > > > >> > > > > > is
> > > > > > >> > > > > > > > > full, this channel's key may never get
> selected
> > > for
> > > > > > reads
> > > > > > >> > (even
> > > > > > >> > > > > after
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > read interest is turned on), but there are
> still
> > > > > pending
> > > > > > >> data
> > > > > > >> > > for
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > channel, which will never get processed.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 3. The code has the following two methods in
> > > > > MemoryPool,
> > > > > > >> > which
> > > > > > >> > > > are
> > > > > > >> > > > > > not
> > > > > > >> > > > > > > > > described in the KIP. Could you explain how
> they
> > > are
> > > > > > used
> > > > > > >> in
> > > > > > >> > > the
> > > > > > >> > > > > > wiki?
> > > > > > >> > > > > > > > > isLowOnMemory()
> > > > > > >> > > > > > > > > isOutOfMemory()
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 4. Could you also describe in the KIP at the
> > high
> > > > > level,
> > > > > > >> how
> > > > > > >> > > the
> > > > > > >> > > > > read
> > > > > > >> > > > > > > > > interest bit for the socket is turned on/off
> > with
> > > > > > respect
> > > > > > >> to
> > > > > > >> > > > > > > MemoryPool?
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> > > > > > >> Long.MAX_VALUE?
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Thanks,
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Jun
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > > > > > >> > > > radai.rosenblatt@gmail.com>
> > > > > > >> > > > > > > > wrote:
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > > Hi,
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > I would like to initiate a vote on KIP-72:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > https://cwiki.apache.org/
> > > > > > confluence/display/KAFKA/KIP-
> > > > > > >> > 72%3A+
> > > > > > >> > > > > > > > > > Allow+putting+a+bound+on+memor
> > > > > y+consumed+by+Incoming+
> > > > > > >> > > requests
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > The kip allows specifying a limit on the
> > amount
> > > of
> > > > > > memory
> > > > > > >> > > > > allocated
> > > > > > >> > > > > > > for
> > > > > > >> > > > > > > > > > reading incoming requests into. This is
> useful
> > > for
> > > > > > >> > "sizing" a
> > > > > > >> > > > > > broker
> > > > > > >> > > > > > > > and
> > > > > > >> > > > > > > > > > avoiding OOMEs under heavy load (as actually
> > > > happens
> > > > > > >> > > > occasionally
> > > > > > >> > > > > > at
> > > > > > >> > > > > > > > > > linkedin).
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > I believe I've addressed most (all?)
> concerns
> > > > > brought
> > > > > > up
> > > > > > >> > > during
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > > > discussion.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > To the best of my understanding this vote is
> > > about
> > > > > the
> > > > > > >> goal
> > > > > > >> > > and
> > > > > > >> > > > > > > > > > public-facing changes related to the new
> > > proposed
> > > > > > >> behavior,
> > > > > > >> > > but
> > > > > > >> > > > > as
> > > > > > >> > > > > > > for
> > > > > > >> > > > > > > > > > implementation, i have the code up here:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > https://github.com/radai-
> > > > > > rosenblatt/kafka/tree/broker-
> > > > > > >> > memory
> > > > > > >> > > > > > > > > > -pool-with-muting
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > and I've stress-tested it to work properly
> > > > (meaning
> > > > > it
> > > > > > >> > chugs
> > > > > > >> > > > > along
> > > > > > >> > > > > > > and
> > > > > > >> > > > > > > > > > throttles under loads that would DOS
> 10.0.1.0
> > > > code).
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > I also believe that the primitives and
> > > "pattern"s
> > > > > > >> > introduced
> > > > > > >> > > in
> > > > > > >> > > > > > this
> > > > > > >> > > > > > > > KIP
> > > > > > >> > > > > > > > > > (namely the notion of a buffer pool and
> > > retrieving
> > > > > > from /
> > > > > > >> > > > > releasing
> > > > > > >> > > > > > > to
> > > > > > >> > > > > > > > > said
> > > > > > >> > > > > > > > > > pool instead of allocating memory) are
> > generally
> > > > > > useful
> > > > > > >> > > beyond
> > > > > > >> > > > > the
> > > > > > >> > > > > > > > scope
> > > > > > >> > > > > > > > > of
> > > > > > >> > > > > > > > > > this KIP for both performance issues
> > (allocating
> > > > > lots
> > > > > > of
> > > > > > >> > > > > > short-lived
> > > > > > >> > > > > > > > > large
> > > > > > >> > > > > > > > > > buffers is a performance bottleneck) and
> other
> > > > areas
> > > > > > >> where
> > > > > > >> > > > memory
> > > > > > >> > > > > > > > limits
> > > > > > >> > > > > > > > > > are a problem (KIP-81)
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Thank you,
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Radai.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > --
> > > > > > >> > > > > > Regards,
> > > > > > >> > > > > >
> > > > > > >> > > > > > Rajini
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > --
> > > > > > >> > > Regards,
> > > > > > >> > >
> > > > > > >> > > Rajini
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Regards,
> > > > > > >>
> > > > > > >> Rajini
> > > > > > >>
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Hi Jun,

3. Will (also :-) ) do. Do you have ideas for appropriate names/metrics?
I'm thinking along the lines of "MemoryAvailable" (the current snapshot value
from the pool) and "Throttles" (some moving average of how often throttling
due to lack of memory kicks in). Maybe also "BuffersOutstanding"?
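
A rough sketch of how those could be wired up, assuming the
org.apache.kafka.common.metrics API of the time (Metrics#addMetric, Sensor,
Rate) and the hypothetical BoundedMemoryPool sketched earlier in the thread;
all names here are illustrative, not necessarily the ones in the actual patch:

    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.Rate;

    // Hypothetical wiring of the metric names discussed above; the group,
    // metric names and pool accessor (availableMemory()) are assumptions.
    public class MemoryPoolMetrics {
        private final Sensor throttles;

        public MemoryPoolMetrics(Metrics metrics, BoundedMemoryPool pool) {
            // "MemoryAvailable": gauge of the pool's current free bytes.
            metrics.addMetric(
                metrics.metricName("MemoryAvailable", "memory-pool-metrics",
                    "bytes currently available in the request memory pool"),
                (config, now) -> pool.availableMemory());

            // "Throttles": rate of allocation attempts deferred for lack of memory.
            throttles = metrics.sensor("Throttles");
            throttles.add(
                metrics.metricName("ThrottleRate", "memory-pool-metrics",
                    "rate of allocations deferred because the pool was exhausted"),
                new Rate(TimeUnit.SECONDS));
        }

        // Call whenever an allocation is deferred because the pool is exhausted.
        public void recordThrottle() {
            throttles.record();
        }
    }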

On Thu, Nov 17, 2016 at 7:01 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Radai,
>
> 2. Yes, on the server side, the timeout is hardcoded at 300ms. That's not
> too bad. We can just leave it as it is.
>
> 3. Another thing. Do we plan to expose some JMX metrics so that we can
> monitor if there is any memory pressure in the pool?
>
> Thanks,
>
> Jun
>
> On Thu, Nov 17, 2016 at 8:57 AM, radai <ra...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > 1. will do.
> >
> > 2. true. for several reasons:
> >    2.1. which selector? there's a single pool but 16 selectors (linkedin
> > typical, num.network.threads defaults to 3)
> >    2.2. even if i could figure out which selector (all?) the better thing
> > to do would be resume reading not when any memory becomes available
> > (because worst case its not enough for anything) but when some "low
> > watermark" of available memory is hit - so mute when @100% mem, unmute
> when
> > back down to 90%?
> >    2.3. on the broker side (which is the current concern for my patch)
> this
> > max wait time is a hardcoded 300 ms (SocketServer.Processor.poll()),
> which
> > i think is acceptable and definitely not arbitrary or configurable.
> >
> >    if you still think this needs to be addressed (and you are right that
> in
> > the general case the timeout param could be arbitrary) i can implement
> the
> > watermark approach + pool.waitForLowWatermark(timeout) or something, and
> > make Selector.poll() wait for low watermark at the end of poll() if no
> work
> > has been done (so as not to wait on memory needlessly for requests that
> > don't require it, as rajini suggested earlier)
> >
> > On Wed, Nov 16, 2016 at 9:04 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Radai,
> > >
> > > Thanks for the updated proposal. +1 overall. A couple of comments
> below.
> > >
> > > 1. Our current convention is to avoid using getters. Could you change
> > > getSize and getAvailableMemory accordingly? Also, size is bit
> ambiguous,
> > > could we use sth like capacity?
> > >
> > > 2. This is more on the implementation details. I didn't see any code to
> > > wake up the selector when memory is released from the pool. For
> example,
> > > suppose that all socket keys are muted since the pool is full. The
> > > selector.poll() call will wait for the timeout, which could be
> > arbitrarily
> > > long. Now, if some memory is released, it seems that we should wake up
> > the
> > > selector early instead of waiting for the timeout.
> > >
> > > Jun
> > >
> > >
> > > On Mon, Nov 14, 2016 at 11:41 AM, Rajini Sivaram <
> > > rajinisivaram@googlemail.com> wrote:
> > >
> > > > +1
> > > >
> > > > Thank you for the KIP, Radai.
> > > >
> > > > On Mon, Nov 14, 2016 at 6:07 PM, Mickael Maison <
> > > mickael.maison@gmail.com>
> > > > wrote:
> > > >
> > > > > +1. We've also been hit by OOMs on the broker because we were not
> > able
> > > > > to properly bound its memory usage.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 5:56 PM, radai <radai.rosenblatt@gmail.com
> >
> > > > wrote:
> > > > > > @rajini - fixed the hasBytesBuffered() method. also updated
> poll()
> > so
> > > > > that
> > > > > > no latency is added for picking up data stuck in ssl buffers
> > (timeout
> > > > is
> > > > > > set to 0, just like with immediately connected keys and staged
> > > > receives).
> > > > > > thank you for pointing these out.
> > > > > > added ssl (re) testing to the KIP testing plan.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> > > > > > rajinisivaram@googlemail.com> wrote:
> > > > > >
> > > > > >> Open point 1. I would just retain the current long value that
> > > > specifies
> > > > > >> queued.max.bytes as long and not as %heap since it is simple and
> > > easy
> > > > to
> > > > > >> use. And keeps it consistent with other ".bytes" configs.
> > > > > >>
> > > > > >> Point 3. ssl buffers - I am not quite sure the implementation
> > looks
> > > > > >> correct. hasBytesBuffered() is checking position() of buffers ==
> > 0.
> > > > And
> > > > > the
> > > > > >> code checks this only when poll with a timeout returns (adding a
> > > delay
> > > > > when
> > > > > >> there is nothing else to read).
> > > > > >> But since this and open point 2 (optimization) are
> implementation
> > > > > details,
> > > > > >> they can be looked at during PR review.
> > > > > >>
> > > > > >> It will be good to add SSL testing to the test plan as well,
> since
> > > > > there is
> > > > > >> additional code to test for SSL.
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Nov 11, 2016 at 9:03 PM, radai <
> > radai.rosenblatt@gmail.com>
> > > > > wrote:
> > > > > >>
> > > > > >> > ok, i've made the following changes:
> > > > > >> >
> > > > > >> > 1. memory.pool.class.name has been removed
> > > > > >> > 2. the code now only uses SimpleMemoryPool. the gc variant is
> > left
> > > > > >> (unused)
> > > > > >> > as a development aid and is unsettable via configuration.
> > > > > >> > 3. I've resolved the issue of stale data getting stuck in
> > > > intermediate
> > > > > >> > (ssl) buffers.
> > > > > >> > 4. default value for queued.max.bytes is -1, so off by
> default.
> > > any
> > > > > <=0
> > > > > >> > value is interpreted as off by the underlying code.
> > > > > >> >
> > > > > >> > open points:
> > > > > >> >
> > > > > >> > 1. the kafka config framework doesnt allow a value to be
> either
> > > long
> > > > > or
> > > > > >> > double, so in order to pull off the queued.max.bytes = 1000000
> > or
> > > > > >> > queued.max.bytes = 0.3 thing i'd need to define the config as
> > type
> > > > > >> string,
> > > > > >> > which is ugly to me. do we want to support setting
> > > queued.max.bytes
> > > > > to %
> > > > > >> of
> > > > > >> > heap ? if so, by way of making queued.max.bytes of type
> string,
> > or
> > > > by
> > > > > way
> > > > > >> > of a 2nd config param (with the resulting
> > either/all/combination?
> > > > > >> > validation). my personal opinion is string because i think a
> > > single
> > > > > >> > queued.max.bytes with overloaded meaning is more
> understandable
> > to
> > > > > users.
> > > > > >> > i'll await other people's opinions before doing anything.
> > > > > >> > 2. i still need to evaluate rajini's optimization. sounds
> > doable.
> > > > > >> >
> > > > > >> > asides:
> > > > > >> >
> > > > > >> > 1. i think you guys misunderstood the intent behind the gc
> pool.
> > > it
> > > > > was
> > > > > >> > never meant to be a magic pool that automatically releases
> > buffers
> > > > > >> (because
> > > > > >> > just as rajini stated the performance implications would be
> > > > > horrible). it
> > > > > >> > was meant to catch leaks early. since that is indeed a
> dev-only
> > > > > concern
> > > > > >> it
> > > > > >> > wont ever get used in production.
> > > > > >> > 2. i said this on some other kip discussion: i think the nice
> > > thing
> > > > > about
> > > > > >> > the pool API is it "scales" from just keeping a memory bound
> to
> > > > > actually
> > > > > >> > re-using buffers without changing the calling code. i think
> > > > > >> actually pooling
> > > > > >> > large buffers will result in a significant performance impact,
> > but
> > > > > thats
> > > > > >> > outside the scope of this kip. at that point i think more pool
> > > > > >> > implementations (that actually pool) would be written. i agree
> > > with
> > > > > the
> > > > > >> > ideal of exposing as few knobs as possible, but switching
> pools
> > > (or
> > > > > pool
> > > > > >> > params) for tuning may happen at some later point.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> > > > > >> > rajinisivaram@googlemail.com> wrote:
> > > > > >> >
> > > > > >> > > 13. At the moment, I think channels are not muted if:
> > > > > >> > >     channel.receive != null && channel.receive.buffer !=
> null
> > > > > >> > > This mutes all channels that aren't holding onto a
> incomplete
> > > > > buffer.
> > > > > >> > They
> > > > > >> > > may or may not have read the 4-byte size.
> > > > > >> > >
> > > > > >> > > I was thinking you could avoid muting channels if:
> > > > > >> > >     channel.receive == null || channel.receive.size.
> > remaining()
> > > > > >> > > This will not mute channels that are holding onto a buffer
> (as
> > > > > above).
> > > > > >> In
> > > > > >> > > addition, it will not mute channels that haven't read the
> > 4-byte
> > > > > size.
> > > > > >> A
> > > > > >> > > client that is closed gracefully while the pool is full will
> > not
> > > > be
> > > > > >> muted
> > > > > >> > > in this case and the server can process close without
> waiting
> > > for
> > > > > the
> > > > > >> > pool
> > > > > >> > > to free up. Once the 4-byte size is read, the channel will
> be
> > > > muted
> > > > > if
> > > > > >> > the
> > > > > >> > > pool is still out of memory - for each channel, at most one
> > > failed
> > > > > read
> > > > > >> > > attempt would be made while the pool is out of memory. I
> think
> > > > this
> > > > > >> would
> > > > > >> > > also delay muting of SSL channels since they can continue to
> > > read
> > > > > into
> > > > > >> > > their (already allocated) network buffers and unwrap the
> data
> > > and
> > > > > block
> > > > > >> > > only when they need to allocate a buffer from the pool.
> > > > > >> > >
> > > > > >> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <
> jay@confluent.io>
> > > > > wrote:
> > > > > >> > >
> > > > > >> > > > Hey Radai,
> > > > > >> > > >
> > > > > >> > > > +1 on deprecating and eventually removing the old config.
> > The
> > > > > >> intention
> > > > > >> > > was
> > > > > >> > > > absolutely bounding memory usage. I think having two ways
> of
> > > > doing
> > > > > >> > this,
> > > > > >> > > > one that gives a crisp bound on memory and one that is
> hard
> > to
> > > > > reason
> > > > > >> > > about
> > > > > >> > > > is pretty confusing. I think people will really appreciate
> > > > having
> > > > > one
> > > > > >> > > > config which instead lets them directly control the thing
> > they
> > > > > >> actually
> > > > > >> > > > care about (memory).
> > > > > >> > > >
> > > > > >> > > > I also want to second Jun's concern on the complexity of
> the
> > > > > >> self-GCing
> > > > > >> > > > memory pool. I wrote the memory pool for the producer. In
> > that
> > > > > area
> > > > > >> the
> > > > > >> > > > pooling of messages is the single biggest factor in
> > > performance
> > > > of
> > > > > >> the
> > > > > >> > > > client so I believed it was worth some
> > > sophistication/complexity
> > > > > if
> > > > > >> > there
> > > > > >> > > > was performance payoff. All the same, the complexity of
> that
> > > > code
> > > > > has
> > > > > >> > > made
> > > > > >> > > > it VERY hard to keep correct (it gets broken roughly every
> > > other
> > > > > time
> > > > > >> > > > someone makes a change). Over time I came to feel a lot
> less
> > > > > proud of
> > > > > >> > my
> > > > > >> > > > cleverness. I learned something interesting reading your
> > > > > self-GCing
> > > > > >> > > memory
> > > > > >> > > > pool, but I wonder if the complexity is worth the payoff
> in
> > > this
> > > > > >> case?
> > > > > >> > > >
> > > > > >> > > > Philosophically we've tried really hard to avoid
> needlessly
> > > > > >> "pluggable"
> > > > > >> > > > implementations. That is, when there is a temptation to
> > give a
> > > > > config
> > > > > >> > > that
> > > > > >> > > > plugs in different Java classes at run time for
> > implementation
> > > > > >> choices,
> > > > > >> > > we
> > > > > >> > > > should instead think of how to give the user the good
> > behavior
> > > > > >> > > > automatically. I think the use case for configuring the
> > > GCing
> > > > > pool
> > > > > >> > > would
> > > > > >> > > > be if you discovered a bug in which memory leaked. But
> this
> > > > isn't
> > > > > >> > > something
> > > > > >> > > > the user should have to think about right? If there is a
> bug
> > > we
> > > > > >> should
> > > > > >> > > find
> > > > > >> > > > and fix it.
> > > > > >> > > >
> > > > > >> > > > -Jay
> > > > > >> > > >
> > > > > >> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <
> > > > > radai.rosenblatt@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > > jun's #1 + rajini's #11 - the new config param is to
> > enable
> > > > > >> changing
> > > > > >> > > the
> > > > > >> > > > > pool implentation class. as i said in my response to
> jun i
> > > > will
> > > > > >> make
> > > > > >> > > the
> > > > > >> > > > > default pool impl be the simple one, and this param is
> to
> > > > allow
> > > > > a
> > > > > >> > user
> > > > > >> > > > > (more likely a dev) to change it.
> > > > > >> > > > > both the simple pool and the "gc pool" make basically
> just
> > > an
> > > > > >> > > > > AtomicLong.get() + (hashmap.put for gc) calls before
> > > > returning a
> > > > > >> > > buffer.
> > > > > >> > > > > there is absolutely no dependency on GC times in
> > allocating
> > > > (or
> > > > > >> not).
> > > > > >> > > the
> > > > > >> > > > > extra background thread in the gc pool is forever asleep
> > > > unless
> > > > > >> there
> > > > > >> > > are
> > > > > >> > > > > bugs (==leaks) so the extra cost is basically nothing
> > > (backed
> > > > by
> > > > > >> > > > > benchmarks). let me re-itarate again - ANY BUFFER
> > ALLOCATED
> > > > MUST
> > > > > >> > ALWAYS
> > > > > >> > > > BE
> > > > > >> > > > > RELEASED - so the gc pool should not rely on gc for
> > > reclaiming
> > > > > >> > buffers.
> > > > > >> > > > its
> > > > > >> > > > > a bug detector, not a feature and is definitely not
> > intended
> > > > to
> > > > > >> hide
> > > > > >> > > > bugs -
> > > > > >> > > > > the exact opposite - its meant to expose them sooner.
> i've
> > > > > cleaned
> > > > > >> up
> > > > > >> > > the
> > > > > >> > > > > docs to avoid this confusion. i also like the fail on
> > leak.
> > > > will
> > > > > >> do.
> > > > > >> > > > > as for the gap between pool size and heap size - thats a
> > > valid
> > > > > >> > > argument.
> > > > > >> > > > > may allow also sizing the pool as % of heap size? so
> > > > > >> > queued.max.bytes =
> > > > > >> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of
> > > > available
> > > > > >> > heap?
> > > > > >> > > > >
> > > > > >> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes
> > > still
> > > > > >> holds,
> > > > > >> > > > > assuming the ssl-related buffers are small. the largest
> > > > > weakness in
> > > > > >> > > this
> > > > > >> > > > > claim has to do with decompression rather than anything
> > > > > >> ssl-related.
> > > > > >> > so
> > > > > >> > > > yes
> > > > > >> > > > > there is an O(#ssl connections * sslEngine packet size)
> > > > > component,
> > > > > >> > but
> > > > > >> > > i
> > > > > >> > > > > think its small. again - decompression should be the
> > > concern.
> > > > > >> > > > >
> > > > > >> > > > > rajini's #13 - interesting optimization. the problem is
> > > > there's
> > > > > no
> > > > > >> > > > knowing
> > > > > >> > > > > in advance what the _next_ request to come out of a
> socket
> > > is,
> > > > > so
> > > > > >> > this
> > > > > >> > > > > would mute just those sockets that are 1. mutable and 2.
> > > have
> > > > a
> > > > > >> > > > > buffer-demanding request for which we could not
> allocate a
> > > > > buffer.
> > > > > >> > > > downside
> > > > > >> > > > > is that as-is this would cause the busy-loop on poll()
> > that
> > > > the
> > > > > >> mutes
> > > > > >> > > > were
> > > > > >> > > > > supposed to prevent - or code would need to be added to
> > > > > ad-hoc mute
> > > > > >> a
> > > > > >> > > > > connection that was so-far unmuted but has now
> generated a
> > > > > >> > > > memory-demanding
> > > > > >> > > > > request?
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > > > >> > > > > rajinisivaram@googlemail.com> wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Radai,
> > > > > >> > > > > >
> > > > > >> > > > > > 11. The KIP talks about a new server configuration
> > > parameter
> > > > > >> > > > > > *memory.pool.class.name
> > > > > >> > > > > > <http://memory.pool.class.name> *which is not in the
> > > > > >> > implementation.
> > > > > >> > > > Is
> > > > > >> > > > > it
> > > > > >> > > > > > still the case that the pool will be configurable?
> > > > > >> > > > > >
> > > > > >> > > > > > 12. Personally I would prefer not to have a garbage
> > > > collected
> > > > > >> pool
> > > > > >> > > that
> > > > > >> > > > > > hides bugs as well. Apart from the added code
> complexity
> > > and
> > > > > >> extra
> > > > > >> > > > thread
> > > > > >> > > > > > to handle collections, I am also concerned about the
> > > > > >> > > non-deterministic
> > > > > >> > > > > > nature of GC timings. The KIP introduces delays in
> > > > processing
> > > > > >> > > requests
> > > > > >> > > > > > based on the configuration parameter
> *queued.max.bytes.
> > > > *This
> > > > > in
> > > > > >> > > > > unrelated
> > > > > >> > > > > > to the JVM heap size and hence pool can be full when
> > there
> > > > is
> > > > > no
> > > > > >> > > > pressure
> > > > > >> > > > > > on the JVM to garbage collect. The KIP does not
> prevent
> > > > other
> > > > > >> > > timeouts
> > > > > >> > > > in
> > > > > >> > > > > > the broker (eg. consumer session timeout) because it
> is
> > > > > relying
> > > > > >> on
> > > > > >> > > the
> > > > > >> > > > > pool
> > > > > >> > > > > > to be managed in a deterministic, timely manner.
> Since a
> > > > > garbage
> > > > > >> > > > > collected
> > > > > >> > > > > > pool cannot provide that guarantee, wouldn't it be
> > better
> > > to
> > > > > run
> > > > > >> > > tests
> > > > > >> > > > > with
> > > > > >> > > > > > a GC-pool that perhaps fails with a fatal error if it
> > > > > encounters
> > > > > >> a
> > > > > >> > > > buffer
> > > > > >> > > > > > that was not released?
> > > > > >> > > > > >
> > > > > >> > > > > > 13. The implementation currently mutes all channels
> that
> > > > don't
> > > > > >> > have a
> > > > > >> > > > > > receive buffer allocated. Would it make sense to mute
> > only
> > > > the
> > > > > >> > > channels
> > > > > >> > > > > > that need a buffer (i.e. allow channels to read the
> > 4-byte
> > > > > size
> > > > > >> > that
> > > > > >> > > is
> > > > > >> > > > > not
> > > > > >> > > > > > read using the pool) so that normal client connection
> > > > close()
> > > > > is
> > > > > >> > > > handled
> > > > > >> > > > > > even when the pool is full? Since the extra 4-bytes
> may
> > > > > already
> > > > > >> be
> > > > > >> > > > > > allocated for some connections, the total request
> memory
> > > has
> > > > > to
> > > > > >> > take
> > > > > >> > > > into
> > > > > >> > > > > > account *4*numConnections* bytes anyway.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <
> > > jun@confluent.io
> > > > >
> > > > > >> > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi, Radai,
> > > > > >> > > > > > >
> > > > > >> > > > > > > 1. Yes, I am concerned about the trickiness of
> having
> > to
> > > > > deal
> > > > > >> > with
> > > > > >> > > > > weak
> > > > > >> > > > > > > refs. I think it's simpler to just have the simple
> > > version
> > > > > >> > > > instrumented
> > > > > >> > > > > > > with enough debug/trace logging and do enough stress
> > > > > testing.
> > > > > >> > Since
> > > > > >> > > > we
> > > > > >> > > > > > > still have queued.max.requests, one can always fall
> > back
> > > > to
> > > > > >> that
> > > > > >> > > if a
> > > > > >> > > > > > > memory leak issue is identified. We could also label
> > the
> > > > > >> feature
> > > > > >> > as
> > > > > >> > > > > beta
> > > > > >> > > > > > if
> > > > > >> > > > > > > we don't think this is production ready.
> > > > > >> > > > > > >
> > > > > >> > > > > > > 2.2 I am just wondering after we fix that issue
> > whether
> > > > the
> > > > > >> claim
> > > > > >> > > > that
> > > > > >> > > > > > the
> > > > > >> > > > > > > request memory is bounded by  queued.max.bytes +
> > > > > >> > > > > socket.request.max.bytes
> > > > > >> > > > > > > is still true.
> > > > > >> > > > > > >
> > > > > >> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks,
> > > > > >> > > > > > >
> > > > > >> > > > > > > Jun
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> > > > > >> > radai.rosenblatt@gmail.com>
> > > > > >> > > > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Hi Jun,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thank you for taking the time to review this.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 1. short version - yes, the concern is bugs, but
> the
> > > > cost
> > > > > is
> > > > > >> > tiny
> > > > > >> > > > and
> > > > > >> > > > > > > worth
> > > > > >> > > > > > > > it, and its a common pattern. long version:
> > > > > >> > > > > > > >    1.1 detecting these types of bugs (leaks)
> cannot
> > be
> > > > > easily
> > > > > >> > > done
> > > > > >> > > > > with
> > > > > >> > > > > > > > simple testing, but requires stress/stability
> tests
> > > that
> > > > > run
> > > > > >> > for
> > > > > >> > > a
> > > > > >> > > > > long
> > > > > >> > > > > > > > time (long enough to hit OOM, depending on leak
> size
> > > and
> > > > > >> > > available
> > > > > >> > > > > > > memory).
> > > > > >> > > > > > > > this is why some sort of leak detector is
> "standard
> > > > > practice"
> > > > > >> > > .for
> > > > > >> > > > > > > example
> > > > > >> > > > > > > > look at netty (http://netty.io/wiki/
> > > > > >> reference-counted-objects.
> > > > > >> > > > > > > > html#leak-detection-levels)
> > > > > >> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
> > > > > >> > > > > > > html#leak-detection-levels
> > > > > >> > > > > > > > >-
> > > > > >> > > > > > > > they have way more complicated built-in leak
> > detection
> > > > > >> enabled
> > > > > >> > by
> > > > > >> > > > > > > default.
> > > > > >> > > > > > > > as a concrete example - during development i did
> not
> > > > > properly
> > > > > >> > > > dispose
> > > > > >> > > > > > of
> > > > > >> > > > > > > > in-progress KafkaChannel.receive when a connection
> > was
> > > > > >> abruptly
> > > > > >> > > > > closed
> > > > > >> > > > > > > and
> > > > > >> > > > > > > > I only found it because of the log msg printed by
> > the
> > > > > pool.
> > > > > >> > > > > > > >    1.2 I have a benchmark suite showing the
> > > performance
> > > > > cost
> > > > > >> of
> > > > > >> > > the
> > > > > >> > > > > gc
> > > > > >> > > > > > > pool
> > > > > >> > > > > > > > is absolutely negligible -
> > > > > >> > > > > > > > https://github.com/radai-
> > rosenblatt/kafka-benchmarks/
> > > > > >> > > > > > > > tree/master/memorypool-benchmarks
> > > > > >> > > > > > > >    1.3 as for the complexity of the impl - its
> just
> > > ~150
> > > > > >> lines
> > > > > >> > > and
> > > > > >> > > > > > pretty
> > > > > >> > > > > > > > straight forward. i think the main issue is that
> not
> > > > many
> > > > > >> > people
> > > > > >> > > > are
> > > > > >> > > > > > > > familiar with weak refs and ref queues.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >    how about making the pool impl class a config
> > param
> > > > > >> > (generally
> > > > > >> > > > > good
> > > > > >> > > > > > > > going forward), make the default be the simple
> pool,
> > > and
> > > > > keep
> > > > > >> > the
> > > > > >> > > > GC
> > > > > >> > > > > > one
> > > > > >> > > > > > > as
> > > > > >> > > > > > > > a dev/debug/triage aid?
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at
> > > all -
> > > > > its
> > > > > >> an
> > > > > >> > > > > > > > implementation detail. as for my current patch, it
> > has
> > > > > some
> > > > > >> > > minimal
> > > > > >> > > > > > > > treatment of SSL - just enough to not mute SSL
> > sockets
> > > > > >> > > > mid-handshake
> > > > > >> > > > > -
> > > > > >> > > > > > > but
> > > > > >> > > > > > > > the code in SslTransportLayer still allocates
> > buffers
> > > > > itself.
> > > > > >> > it
> > > > > >> > > is
> > > > > >> > > > > my
> > > > > >> > > > > > > > understanding that netReadBuffer/appReadBuffer
> > > shouldn't
> > > > > grow
> > > > > >> > > > beyond
> > > > > >> > > > > 2
> > > > > >> > > > > > x
> > > > > >> > > > > > > > sslEngine.getSession().getPacketBufferSize(),
> > which i
> > > > > assume
> > > > > >> > to
> > > > > >> > > be
> > > > > >> > > > > > > small.
> > > > > >> > > > > > > > they are also long lived (they live for the
> duration
> > > of
> > > > > the
> > > > > >> > > > > connection)
> > > > > >> > > > > > > > which makes a poor fit for pooling. the bigger
> fish
> > to
> > > > > fry i
> > > > > >> > > think
> > > > > >> > > > is
> > > > > >> > > > > > > > decompression - you could read a 1MB blob into a
> > > > > >> pool-provided
> > > > > >> > > > buffer
> > > > > >> > > > > > and
> > > > > >> > > > > > > > then decompress it into 10MB of heap allocated on
> > the
> > > > spot
> > > > > >> :-)
> > > > > >> > > > also,
> > > > > >> > > > > > the
> > > > > >> > > > > > > > ssl code is extremely tricky.
> > > > > >> > > > > > > >    2.2 just to make sure, youre talking about
> > > > > Selector.java:
> > > > > >> > > while
> > > > > >> > > > > > > > ((networkReceive = channel.read()) != null)
> > > > > >> > > > > > addToStagedReceives(channel,
> > > > > >> > > > > > > > networkReceive); ? if so youre right, and i'll fix
> > > that
> > > > > >> > (probably
> > > > > >> > > > by
> > > > > >> > > > > > > > something similar to immediatelyConnectedKeys, not
> > > sure
> > > > > yet)
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add
> > > > > javadocs
> > > > > >> and
> > > > > >> > > > > update
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > wiki). isLowOnMem is basically the point where I
> > start
> > > > > >> > > randomizing
> > > > > >> > > > > the
> > > > > >> > > > > > > > selection key handling order to avoid potential
> > > > > starvation.
> > > > > >> its
> > > > > >> > > > > rather
> > > > > >> > > > > > > > arbitrary and now that i think of it should
> probably
> > > not
> > > > > >> exist
> > > > > >> > > and
> > > > > >> > > > be
> > > > > >> > > > > > > > entirely contained in Selector (where the
> shuffling
> > > > takes
> > > > > >> > place).
> > > > > >> > > > > will
> > > > > >> > > > > > > fix.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 4. will do.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or
> > basically
> > > > > >> anything
> > > > > >> > > > <=0).
> > > > > >> > > > > > > > Long.MAX_VALUE would still create a pool, that
> would
> > > > still
> > > > > >> > waste
> > > > > >> > > > time
> > > > > >> > > > > > > > tracking resources. I dont really mind though if
> you
> > > > have
> > > > > a
> > > > > >> > > > preferred
> > > > > >> > > > > > > magic
> > > > > >> > > > > > > > value for off.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <
> > > > jun@confluent.io
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > Hi, Radai,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks for the KIP. Some comments below.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 1. The KIP says "to facilitate faster
> > implementation
> > > > > (as a
> > > > > >> > > safety
> > > > > >> > > > > > net)
> > > > > >> > > > > > > > the
> > > > > >> > > > > > > > > pool will be implemented in such a way that
> memory
> > > > that
> > > > > was
> > > > > >> > not
> > > > > >> > > > > > > > release()ed
> > > > > >> > > > > > > > > (but still garbage collected) would be detected
> > and
> > > > > >> > > "reclaimed".
> > > > > >> > > > > this
> > > > > >> > > > > > > is
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > > prevent "leaks" in case of code paths that fail
> to
> > > > > >> release()
> > > > > >> > > > > > > properly.".
> > > > > >> > > > > > > > > What are the cases that could cause memory
> leaks?
> > If
> > > > we
> > > > > are
> > > > > >> > > > > concerned
> > > > > >> > > > > > > > about
> > > > > >> > > > > > > > > bugs, it seems that it's better to just do more
> > > > testing
> > > > > to
> > > > > >> > make
> > > > > >> > > > > sure
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > usage of the simple implementation
> > > (SimpleMemoryPool)
> > > > is
> > > > > >> > solid
> > > > > >> > > > > > instead
> > > > > >> > > > > > > of
> > > > > >> > > > > > > > > adding more complicated logic
> > > > > (GarbageCollectedMemoryPool)
> > > > > >> to
> > > > > >> > > > hide
> > > > > >> > > > > > the
> > > > > >> > > > > > > > > potential bugs.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 2. I am wondering how much this KIP covers the
> SSL
> > > > > channel
> > > > > >> > > > > > > > implementation.
> > > > > >> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> > > > > >> > netWriteBuffer,
> > > > > >> > > > > > > > > appReadBuffer per socket. Should those memory be
> > > > > accounted
> > > > > >> > for
> > > > > >> > > in
> > > > > >> > > > > > > memory
> > > > > >> > > > > > > > > pool?
> > > > > >> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> > > > > >> > > > KafkaChannel.read(),
> > > > > >> > > > > > > it's
> > > > > >> > > > > > > > > possible for multiple NetworkReceives to be
> > returned
> > > > > since
> > > > > >> > > > multiple
> > > > > >> > > > > > > > > requests' data could be encrypted together by
> SSL.
> > > To
> > > > > deal
> > > > > >> > with
> > > > > >> > > > > this,
> > > > > >> > > > > > > we
> > > > > >> > > > > > > > > stash those NetworkReceives in
> > > Selector.stagedReceives
> > > > > and
> > > > > >> > give
> > > > > >> > > > it
> > > > > >> > > > > > back
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > > the poll() call one NetworkReceive at a time.
> What
> > > > this
> > > > > >> means
> > > > > >> > > is
> > > > > >> > > > > > that,
> > > > > >> > > > > > > if
> > > > > >> > > > > > > > > we stop reading from KafkaChannel in the middle
> > > > because
> > > > > >> > memory
> > > > > >> > > > pool
> > > > > >> > > > > > is
> > > > > >> > > > > > > > > full, this channel's key may never get selected
> > for
> > > > > reads
> > > > > >> > (even
> > > > > >> > > > > after
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > read interest is turned on), but there are still
> > > > pending
> > > > > >> data
> > > > > >> > > for
> > > > > >> > > > > the
> > > > > >> > > > > > > > > channel, which will never get processed.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 3. The code has the following two methods in
> > > > MemoryPool,
> > > > > >> > which
> > > > > >> > > > are
> > > > > >> > > > > > not
> > > > > >> > > > > > > > > described in the KIP. Could you explain how they
> > are
> > > > > used
> > > > > >> in
> > > > > >> > > the
> > > > > >> > > > > > wiki?
> > > > > >> > > > > > > > > isLowOnMemory()
> > > > > >> > > > > > > > > isOutOfMemory()
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 4. Could you also describe in the KIP at the
> high
> > > > level,
> > > > > >> how
> > > > > >> > > the
> > > > > >> > > > > read
> > > > > >> > > > > > > > > interest bit for the socket is turned on/off
> with
> > > > > respect
> > > > > >> to
> > > > > >> > > > > > > MemoryPool?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> > > > > >> Long.MAX_VALUE?
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Jun
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > > > > >> > > > radai.rosenblatt@gmail.com>
> > > > > >> > > > > > > > wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > > Hi,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > I would like to initiate a vote on KIP-72:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > https://cwiki.apache.org/
> > > > > confluence/display/KAFKA/KIP-
> > > > > >> > 72%3A+
> > > > > >> > > > > > > > > > Allow+putting+a+bound+on+memor
> > > > y+consumed+by+Incoming+
> > > > > >> > > requests
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > The kip allows specifying a limit on the
> amount
> > of
> > > > > memory
> > > > > >> > > > > allocated
> > > > > >> > > > > > > for
> > > > > >> > > > > > > > > > reading incoming requests into. This is useful
> > for
> > > > > >> > "sizing" a
> > > > > >> > > > > > broker
> > > > > >> > > > > > > > and
> > > > > >> > > > > > > > > > avoiding OOMEs under heavy load (as actually
> > > happens
> > > > > >> > > > occasionally
> > > > > >> > > > > > at
> > > > > >> > > > > > > > > > linkedin).
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > I believe I've addressed most (all?) concerns
> > > > brought
> > > > > up
> > > > > >> > > during
> > > > > >> > > > > the
> > > > > >> > > > > > > > > > discussion.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > To the best of my understanding this vote is
> > about
> > > > the
> > > > > >> goal
> > > > > >> > > and
> > > > > >> > > > > > > > > > public-facing changes related to the new
> > proposed
> > > > > >> behavior,
> > > > > >> > > but
> > > > > >> > > > > as
> > > > > >> > > > > > > for
> > > > > >> > > > > > > > > > implementation, i have the code up here:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > https://github.com/radai-
> > > > > rosenblatt/kafka/tree/broker-
> > > > > >> > memory
> > > > > >> > > > > > > > > > -pool-with-muting
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > and I've stress-tested it to work properly
> > > (meaning
> > > > it
> > > > > >> > chugs
> > > > > >> > > > > along
> > > > > >> > > > > > > and
> > > > > >> > > > > > > > > > throttles under loads that would DOS 10.0.1.0
> > > code).
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > I also believe that the primitives and
> > "pattern"s
> > > > > >> > introduced
> > > > > >> > > in
> > > > > >> > > > > > this
> > > > > >> > > > > > > > KIP
> > > > > >> > > > > > > > > > (namely the notion of a buffer pool and
> > retrieving
> > > > > from /
> > > > > >> > > > > releasing
> > > > > >> > > > > > > to
> > > > > >> > > > > > > > > said
> > > > > >> > > > > > > > > > pool instead of allocating memory) are
> generally
> > > > > useful
> > > > > >> > > beyond
> > > > > >> > > > > the
> > > > > >> > > > > > > > scope
> > > > > >> > > > > > > > > of
> > > > > >> > > > > > > > > > this KIP for both performance issues
> (allocating
> > > > lots
> > > > > of
> > > > > >> > > > > > short-lived
> > > > > >> > > > > > > > > large
> > > > > >> > > > > > > > > > buffers is a performance bottleneck) and other
> > > areas
> > > > > >> where
> > > > > >> > > > memory
> > > > > >> > > > > > > > limits
> > > > > >> > > > > > > > > > are a problem (KIP-81)
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Thank you,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Radai.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > --
> > > > > >> > > > > > Regards,
> > > > > >> > > > > >
> > > > > >> > > > > > Rajini
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > > Regards,
> > > > > >> > >
> > > > > >> > > Rajini
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Regards,
> > > > > >>
> > > > > >> Rajini
> > > > > >>
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

2. Yes, on the server side, the timeout is hardcoded at 300ms. That's not
too bad. We can just leave it as it is.

3. Another thing. Do we plan to expose some JMX metrics so that we can
monitor if there is any memory pressure in the pool?

Thanks,

Jun
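
For illustration, a rough sketch of the low/high watermark idea that comes up
in radai's reply quoted just below (mute channels when the pool hits 100% used,
let readers resume once usage drops back to ~90%). The class, field and method
names here are invented for the sketch - only waitForLowWatermark() and the 90%
figure come from the thread itself - and none of this is the actual patch.

import java.util.concurrent.TimeUnit;

// Simplified pool: reserve/release byte counts, and let a caller block until
// used memory falls back below a low watermark. Not the real SimpleMemoryPool.
public class WatermarkedPool {
    private final long capacityBytes;
    private final long lowWatermarkUsedBytes;   // e.g. 90% of capacity
    private final Object lock = new Object();
    private long availableBytes;

    public WatermarkedPool(long capacityBytes, double lowWatermarkFraction) {
        this.capacityBytes = capacityBytes;
        this.lowWatermarkUsedBytes = (long) (capacityBytes * lowWatermarkFraction);
        this.availableBytes = capacityBytes;
    }

    // returns false when the pool is exhausted; the caller would mute the channel
    public boolean tryReserve(long bytes) {
        synchronized (lock) {
            if (bytes > availableBytes)
                return false;
            availableBytes -= bytes;
            return true;
        }
    }

    public void release(long bytes) {
        synchronized (lock) {
            availableBytes = Math.min(capacityBytes, availableBytes + bytes);
            if (usedBytes() <= lowWatermarkUsedBytes)
                lock.notifyAll();   // wake anyone parked in waitForLowWatermark()
        }
    }

    // block (up to timeoutMs) until used memory is back below the low watermark
    public void waitForLowWatermark(long timeoutMs) throws InterruptedException {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (usedBytes() > lowWatermarkUsedBytes) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
                if (remainingMs <= 0)
                    return;         // timed out; caller falls back to its normal poll loop
                lock.wait(remainingMs);
            }
        }
    }

    private long usedBytes() {
        return capacityBytes - availableBytes;
    }
}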

On Thu, Nov 17, 2016 at 8:57 AM, radai <ra...@gmail.com> wrote:

> Hi Jun,
>
> 1. will do.
>
> 2. true. for several reasons:
>    2.1. which selector? there's a single pool but 16 selectors (linkedin
> typical, num.network.threads defaults to 3)
>    2.2. even if i could figure out which selector (all?) the better thing
> to do would be resume reading not when any memory becomes available
> (because worst case its not enough for anything) but when some "low
> watermark" of available memory is hit - so mute when @100% mem, unmute when
> back down to 90%?
>    2.3. on the broker side (which is the current concern for my patch) this
> max wait time is a hardcoded 300 ms (SocketServer.Processor.poll()), which
> i think is acceptable and definitely not arbitrary or configurable.
>
>    if you still think this needs to be addressed (and you are right that in
> the general case the timeout param could be arbitrary) i can implement the
> watermark approach + pool.waitForLowWatermark(timeout) or something, and
> make Selector.poll() wait for low watermark at the end of poll() if no work
> has been done (so as not to wait on memory needlessly for requests that
> don't require it, as rajini suggested earlier)
>
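
For illustration, the poll()-side integration described in the paragraph above
might look roughly like the following, reusing the WatermarkedPool sketch shown
earlier in this thread. The class, method and parameter names are assumptions
for the sketch (plain java.nio selection, not the real Selector/SocketServer
code):

import java.io.IOException;
import java.nio.channels.Selector;

// Sketch of one processor iteration: do normal selection work if there is any,
// otherwise wait for the pool to drain below the low watermark instead of spinning.
final class PollLoopSketch {
    static void pollOnce(Selector nioSelector, WatermarkedPool pool, long pollTimeoutMs)
            throws IOException, InterruptedException {
        int readyKeys = nioSelector.select(pollTimeoutMs);
        if (readyKeys > 0) {
            // ... handle reads/writes here, reserving receive buffers via pool.tryReserve() ...
            return;
        }
        // no work was done this round; if channels were muted because the pool filled up,
        // park here until memory is released (or the timeout elapses)
        pool.waitForLowWatermark(pollTimeoutMs);
    }
}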
> On Wed, Nov 16, 2016 at 9:04 AM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Radai,
> >
> > Thanks for the updated proposal. +1 overall. A couple of comments below.
> >
> > 1. Our current convention is to avoid using getters. Could you change
> > getSize and getAvailableMemory accordingly? Also, size is bit ambiguous,
> > could we use sth like capacity?
> >
> > 2. This is more on the implementation details. I didn't see any code to
> > wake up the selector when memory is released from the pool. For example,
> > suppose that all socket keys are muted since the pool is full. The
> > selector.poll() call will wait for the timeout, which could be
> arbitrarily
> > long. Now, if some memory is released, it seems that we should wake up
> the
> > selector early instead of waiting for the timeout.
> >
> > Jun
> >
> >
> > On Mon, Nov 14, 2016 at 11:41 AM, Rajini Sivaram <
> > rajinisivaram@googlemail.com> wrote:
> >
> > > +1
> > >
> > > Thank you for the KIP, Radai.
> > >
> > > On Mon, Nov 14, 2016 at 6:07 PM, Mickael Maison <
> > mickael.maison@gmail.com>
> > > wrote:
> > >
> > > > +1. We've also been hit by OOMs on the broker because we were not
> able
> > > > to properly bound its memory usage.
> > > >
> > > > On Mon, Nov 14, 2016 at 5:56 PM, radai <ra...@gmail.com>
> > > wrote:
> > > > > @rajini - fixed the hasBytesBuffered() method. also updated poll()
> so
> > > > that
> > > > > no latency is added for picking up data stuck in ssl buffers
> (timeout
> > > is
> > > > > set to 0, just like with immediately connected keys and staged
> > > receives).
> > > > > thank you for pointing these out.
> > > > > added ssl (re) testing to the KIP testing plan.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> > > > > rajinisivaram@googlemail.com> wrote:
> > > > >
> > > > >> Open point 1. I would just retain the current long value that
> > > specifies
> > > > >> queued.max.bytes as long and not as %heap since it is simple and
> > easy
> > > to
> > > > >> use. And keeps it consistent with other ".bytes" configs.
> > > > >>
> > > > >> Point 3. ssl buffers - I am not quite sure the implementation
> looks
> > > > >> correct. hasBytesBuffered() is checking position() of buffers ==
> 0.
> > > And
> > > > the
> > > > >> code checks this only when poll with a timeout returns (adding a
> > delay
> > > > when
> > > > >> there is nothing else to read).
> > > > >> But since this and open point 2 (optimization) are implementation
> > > > details,
> > > > >> they can be looked at during PR review.
> > > > >>
> > > > >> It will be good to add SSL testing to the test plan as well, since
> > > > there is
> > > > >> additional code to test for SSL.
> > > > >>
> > > > >>
> > > > >> On Fri, Nov 11, 2016 at 9:03 PM, radai <
> radai.rosenblatt@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> > ok, i've made the following changes:
> > > > >> >
> > > > >> > 1. memory.pool.class.name has been removed
> > > > >> > 2. the code now only uses SimpleMemoryPool. the gc variant is
> left
> > > > >> (unused)
> > > > >> > as a development aid and is unsettable via configuration.
> > > > >> > 3. I've resolved the issue of stale data getting stuck in
> > > intermediate
> > > > >> > (ssl) buffers.
> > > > >> > 4. default value for queued.max.bytes is -1, so off by default.
> > any
> > > > <=0
> > > > >> > value is interpreted as off by the underlying code.
> > > > >> >
> > > > >> > open points:
> > > > >> >
> > > > >> > 1. the kafka config framework doesnt allow a value to be either
> > long
> > > > or
> > > > >> > double, so in order to pull off the queued.max.bytes = 1000000
> or
> > > > >> > queued.max.bytes = 0.3 thing i'd need to define the config as
> type
> > > > >> string,
> > > > >> > which is ugly to me. do we want to support setting
> > queued.max.bytes
> > > > to %
> > > > >> of
> > > > >> > heap ? if so, by way of making queued.max.bytes of type string,
> or
> > > by
> > > > way
> > > > >> > of a 2nd config param (with the resulting
> either/all/combination?
> > > > >> > validation). my personal opinion is string because i think a
> > single
> > > > >> > queued.max.bytes with overloaded meaning is more understandable
> to
> > > > users.
> > > > >> > i'll await other people's opinions before doing anything.
> > > > >> > 2. i still need to evaluate rajini's optimization. sounds
> doable.
> > > > >> >
> > > > >> > asides:
> > > > >> >
> > > > >> > 1. i think you guys misunderstood the intent behind the gc pool.
> > it
> > > > was
> > > > >> > never meant to be a magic pool that automatically releases
> buffers
> > > > >> (because
> > > > >> > just as rajini stated the performance implications would be
> > > > horrible). it
> > > > >> > was meant to catch leaks early. since that is indeed a dev-only
> > > > concern
> > > > >> it
> > > > >> > wont ever get used in production.
> > > > >> > 2. i said this on some other kip discussion: i think the nice
> > thing
> > > > about
> > > > >> > the pool API is it "scales" from just keeping a memory bound to
> > > > actually
> > > > >> > re-using buffers without changing the calling code. i think
> > > > >> actually pooling
> > > > >> > large buffers will result in a significant performance impact,
> but
> > > > thats
> > > > >> > outside the scope of this kip. at that point i think more pool
> > > > >> > implementations (that actually pool) would be written. i agree
> > with
> > > > the
> > > > >> > ideal of exposing as few knobs as possible, but switching pools
> > (or
> > > > pool
> > > > >> > params) for tuning may happen at some later point.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> > > > >> > rajinisivaram@googlemail.com> wrote:
> > > > >> >
> > > > >> > > 13. At the moment, I think channels are not muted if:
> > > > >> > >     channel.receive != null && channel.receive.buffer != null
> > > > >> > > This mutes all channels that aren't holding onto a incomplete
> > > > buffer.
> > > > >> > They
> > > > >> > > may or may not have read the 4-byte size.
> > > > >> > >
> > > > >> > > I was thinking you could avoid muting channels if:
> > > > >> > >     channel.receive == null || channel.receive.size.
> remaining()
> > > > >> > > This will not mute channels that are holding onto a buffer (as
> > > > above).
> > > > >> In
> > > > >> > > addition, it will not mute channels that haven't read the
> 4-byte
> > > > size.
> > > > >> A
> > > > >> > > client that is closed gracefully while the pool is full will
> not
> > > be
> > > > >> muted
> > > > >> > > in this case and the server can process close without waiting
> > for
> > > > the
> > > > >> > pool
> > > > >> > > to free up. Once the 4-byte size is read, the channel will be
> > > muted
> > > > if
> > > > >> > the
> > > > >> > > pool is still out of memory - for each channel, at most one
> > failed
> > > > read
> > > > >> > > attempt would be made while the pool is out of memory. I think
> > > this
> > > > >> would
> > > > >> > > also delay muting of SSL channels since they can continue to
> > read
> > > > into
> > > > >> > > their (already allocated) network buffers and unwrap the data
> > and
> > > > block
> > > > >> > > only when they need to allocate a buffer from the pool.
> > > > >> > >
> > > > >> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io>
> > > > wrote:
> > > > >> > >
> > > > >> > > > Hey Radai,
> > > > >> > > >
> > > > >> > > > +1 on deprecating and eventually removing the old config.
> The
> > > > >> intention
> > > > >> > > was
> > > > >> > > > absolutely bounding memory usage. I think having two ways of
> > > doing
> > > > >> > this,
> > > > >> > > > one that gives a crisp bound on memory and one that is hard
> to
> > > > reason
> > > > >> > > about
> > > > >> > > > is pretty confusing. I think people will really appreciate
> > > having
> > > > one
> > > > >> > > > config which instead lets them directly control the thing
> they
> > > > >> actually
> > > > >> > > > care about (memory).
> > > > >> > > >
> > > > >> > > > I also want to second Jun's concern on the complexity of the
> > > > >> self-GCing
> > > > >> > > > memory pool. I wrote the memory pool for the producer. In
> that
> > > > area
> > > > >> the
> > > > >> > > > pooling of messages is the single biggest factor in
> > performance
> > > of
> > > > >> the
> > > > >> > > > client so I believed it was worth some
> > sophistication/complexity
> > > > if
> > > > >> > there
> > > > >> > > > was performance payoff. All the same, the complexity of that
> > > code
> > > > has
> > > > >> > > made
> > > > >> > > > it VERY hard to keep correct (it gets broken roughly every
> > other
> > > > time
> > > > >> > > > someone makes a change). Over time I came to feel a lot less
> > > > proud of
> > > > >> > my
> > > > >> > > > cleverness. I learned something interesting reading your
> > > > self-GCing
> > > > >> > > memory
> > > > >> > > > pool, but I wonder if the complexity is worth the payoff in
> > this
> > > > >> case?
> > > > >> > > >
> > > > >> > > > Philosophically we've tried really hard to avoid needlessly
> > > > >> "pluggable"
> > > > >> > > > implementations. That is, when there is a temptation to
> give a
> > > > config
> > > > >> > > that
> > > > >> > > > plugs in different Java classes at run time for
> implementation
> > > > >> choices,
> > > > >> > > we
> > > > >> > > > should instead think of how to give the user the good
> behavior
> > > > >> > > > automatically. I think the use case for configuring the
> > GCing
> > > > pool
> > > > >> > > would
> > > > >> > > > be if you discovered a bug in which memory leaked. But this
> > > isn't
> > > > >> > > something
> > > > >> > > > the user should have to think about right? If there is a bug
> > we
> > > > >> should
> > > > >> > > find
> > > > >> > > > and fix it.
> > > > >> > > >
> > > > >> > > > -Jay
> > > > >> > > >
> > > > >> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <
> > > > radai.rosenblatt@gmail.com>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > jun's #1 + rajini's #11 - the new config param is to
> enable
> > > > >> changing
> > > > >> > > the
> > > > >> > > > > pool implentation class. as i said in my response to jun i
> > > will
> > > > >> make
> > > > >> > > the
> > > > >> > > > > default pool impl be the simple one, and this param is to
> > > allow
> > > > a
> > > > >> > user
> > > > >> > > > > (more likely a dev) to change it.
> > > > >> > > > > both the simple pool and the "gc pool" make basically just
> > an
> > > > >> > > > > AtomicLong.get() + (hashmap.put for gc) calls before
> > > returning a
> > > > >> > > buffer.
> > > > >> > > > > there is absolutely no dependency on GC times in
> allocating
> > > (or
> > > > >> not).
> > > > >> > > the
> > > > >> > > > > extra background thread in the gc pool is forever asleep
> > > unless
> > > > >> there
> > > > >> > > are
> > > > >> > > > > bugs (==leaks) so the extra cost is basically nothing
> > (backed
> > > by
> > > > >> > > > > benchmarks). let me re-itarate again - ANY BUFFER
> ALLOCATED
> > > MUST
> > > > >> > ALWAYS
> > > > >> > > > BE
> > > > >> > > > > RELEASED - so the gc pool should not rely on gc for
> > reclaiming
> > > > >> > buffers.
> > > > >> > > > its
> > > > >> > > > > a bug detector, not a feature and is definitely not
> intended
> > > to
> > > > >> hide
> > > > >> > > > bugs -
> > > > >> > > > > the exact opposite - its meant to expose them sooner. i've
> > > > cleaned
> > > > >> up
> > > > >> > > the
> > > > >> > > > > docs to avoid this confusion. i also like the fail on
> leak.
> > > will
> > > > >> do.
> > > > >> > > > > as for the gap between pool size and heap size - thats a
> > valid
> > > > >> > > argument.
> > > > >> > > > > may allow also sizing the pool as % of heap size? so
> > > > >> > queued.max.bytes =
> > > > >> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of
> > > available
> > > > >> > heap?
> > > > >> > > > >
> > > > >> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes
> > still
> > > > >> holds,
> > > > >> > > > > assuming the ssl-related buffers are small. the largest
> > > > weakness in
> > > > >> > > this
> > > > >> > > > > claim has to do with decompression rather than anything
> > > > >> ssl-related.
> > > > >> > so
> > > > >> > > > yes
> > > > >> > > > > there is an O(#ssl connections * sslEngine packet size)
> > > > component,
> > > > >> > but
> > > > >> > > i
> > > > >> > > > > think its small. again - decompression should be the
> > concern.
> > > > >> > > > >
> > > > >> > > > > rajini's #13 - interesting optimization. the problem is
> > > there's
> > > > no
> > > > >> > > > knowing
> > > > >> > > > > in advance what the _next_ request to come out of a socket
> > is,
> > > > so
> > > > >> > this
> > > > >> > > > > would mute just those sockets that are 1. mutable and 2.
> > have
> > > a
> > > > >> > > > > buffer-demanding request for which we could not allocate a
> > > > buffer.
> > > > >> > > > downside
> > > > >> > > > > is that as-is this would cause the busy-loop on poll()
> that
> > > the
> > > > >> mutes
> > > > >> > > > were
> > > > >> > > > > supposed to prevent - or code would need to be added to
> > > > ad-hoc mute
> > > > >> a
> > > > >> > > > > connection that was so-far unmuted but has now generated a
> > > > >> > > > memory-demanding
> > > > >> > > > > request?
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > > >> > > > > rajinisivaram@googlemail.com> wrote:
> > > > >> > > > >
> > > > >> > > > > > Radai,
> > > > >> > > > > >
> > > > >> > > > > > 11. The KIP talks about a new server configuration
> > parameter
> > > > >> > > > > > *memory.pool.class.name
> > > > >> > > > > > <http://memory.pool.class.name> *which is not in the
> > > > >> > implementation.
> > > > >> > > > Is
> > > > >> > > > > it
> > > > >> > > > > > still the case that the pool will be configurable?
> > > > >> > > > > >
> > > > >> > > > > > 12. Personally I would prefer not to have a garbage
> > > collected
> > > > >> pool
> > > > >> > > that
> > > > >> > > > > > hides bugs as well. Apart from the added code complexity
> > and
> > > > >> extra
> > > > >> > > > thread
> > > > >> > > > > > to handle collections, I am also concerned about the
> > > > >> > > non-deterministic
> > > > >> > > > > > nature of GC timings. The KIP introduces delays in
> > > processing
> > > > >> > > requests
> > > > >> > > > > > based on the configuration parameter *queued.max.bytes.
> > > *This
> > > > in
> > > > >> > > > > unrelated
> > > > >> > > > > > to the JVM heap size and hence pool can be full when
> there
> > > is
> > > > no
> > > > >> > > > pressure
> > > > >> > > > > > on the JVM to garbage collect. The KIP does not prevent
> > > other
> > > > >> > > timeouts
> > > > >> > > > in
> > > > >> > > > > > the broker (eg. consumer session timeout) because it is
> > > > relying
> > > > >> on
> > > > >> > > the
> > > > >> > > > > pool
> > > > >> > > > > > to be managed in a deterministic, timely manner. Since a
> > > > garbage
> > > > >> > > > > collected
> > > > >> > > > > > pool cannot provide that guarantee, wouldn't it be
> better
> > to
> > > > run
> > > > >> > > tests
> > > > >> > > > > with
> > > > >> > > > > > a GC-pool that perhaps fails with a fatal error if it
> > > > encounters
> > > > >> a
> > > > >> > > > buffer
> > > > >> > > > > > that was not released?
> > > > >> > > > > >
> > > > >> > > > > > 13. The implementation currently mutes all channels that
> > > don't
> > > > >> > have a
> > > > >> > > > > > receive buffer allocated. Would it make sense to mute
> only
> > > the
> > > > >> > > channels
> > > > >> > > > > > that need a buffer (i.e. allow channels to read the
> 4-byte
> > > > size
> > > > >> > that
> > > > >> > > is
> > > > >> > > > > not
> > > > >> > > > > > read using the pool) so that normal client connection
> > > close()
> > > > is
> > > > >> > > > handled
> > > > >> > > > > > even when the pool is full? Since the extra 4-bytes may
> > > > already
> > > > >> be
> > > > >> > > > > > allocated for some connections, the total request memory
> > has
> > > > to
> > > > >> > take
> > > > >> > > > into
> > > > >> > > > > > account *4*numConnections* bytes anyway.
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <
> > jun@confluent.io
> > > >
> > > > >> > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Hi, Radai,
> > > > >> > > > > > >
> > > > >> > > > > > > 1. Yes, I am concerned about the trickiness of having
> to
> > > > deal
> > > > >> > with
> > > > >> > > > > weak
> > > > >> > > > > > > refs. I think it's simpler to just have the simple
> > version
> > > > >> > > > instrumented
> > > > >> > > > > > > with enough debug/trace logging and do enough stress
> > > > testing.
> > > > >> > Since
> > > > >> > > > we
> > > > >> > > > > > > still have queued.max.requests, one can always fall
> back
> > > to
> > > > >> that
> > > > >> > > if a
> > > > >> > > > > > > memory leak issue is identified. We could also label
> the
> > > > >> feature
> > > > >> > as
> > > > >> > > > > beta
> > > > >> > > > > > if
> > > > >> > > > > > > we don't think this is production ready.
> > > > >> > > > > > >
> > > > >> > > > > > > 2.2 I am just wondering after we fix that issue
> whether
> > > the
> > > > >> claim
> > > > >> > > > that
> > > > >> > > > > > the
> > > > >> > > > > > > request memory is bounded by  queued.max.bytes +
> > > > >> > > > > socket.request.max.bytes
> > > > >> > > > > > > is still true.
> > > > >> > > > > > >
> > > > >> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks,
> > > > >> > > > > > >
> > > > >> > > > > > > Jun
> > > > >> > > > > > >
> > > > >> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> > > > >> > radai.rosenblatt@gmail.com>
> > > > >> > > > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Hi Jun,
> > > > >> > > > > > > >
> > > > >> > > > > > > > Thank you for taking the time to review this.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 1. short version - yes, the concern is bugs, but the
> > > cost
> > > > is
> > > > >> > tiny
> > > > >> > > > and
> > > > >> > > > > > > worth
> > > > >> > > > > > > > it, and its a common pattern. long version:
> > > > >> > > > > > > >    1.1 detecting these types of bugs (leaks) cannot
> be
> > > > easily
> > > > >> > > done
> > > > >> > > > > with
> > > > >> > > > > > > > simple testing, but requires stress/stability tests
> > that
> > > > run
> > > > >> > for
> > > > >> > > a
> > > > >> > > > > long
> > > > >> > > > > > > > time (long enough to hit OOM, depending on leak size
> > and
> > > > >> > > available
> > > > >> > > > > > > memory).
> > > > >> > > > > > > > this is why some sort of leak detector is "standard
> > > > practice"
> > > > >> > > .for
> > > > >> > > > > > > example
> > > > >> > > > > > > > look at netty (http://netty.io/wiki/
> > > > >> reference-counted-objects.
> > > > >> > > > > > > > html#leak-detection-levels)
> > > > >> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
> > > > >> > > > > > > html#leak-detection-levels
> > > > >> > > > > > > > >-
> > > > >> > > > > > > > they have way more complicated built-in leak
> detection
> > > > >> enabled
> > > > >> > by
> > > > >> > > > > > > default.
> > > > >> > > > > > > > as a concrete example - during development i did not
> > > > properly
> > > > >> > > > dispose
> > > > >> > > > > > of
> > > > >> > > > > > > > in-progress KafkaChannel.receive when a connection
> was
> > > > >> abruptly
> > > > >> > > > > closed
> > > > >> > > > > > > and
> > > > >> > > > > > > > I only found it because of the log msg printed by
> the
> > > > pool.
> > > > >> > > > > > > >    1.2 I have a benchmark suite showing the
> > performance
> > > > cost
> > > > >> of
> > > > >> > > the
> > > > >> > > > > gc
> > > > >> > > > > > > pool
> > > > >> > > > > > > > is absolutely negligible -
> > > > >> > > > > > > > https://github.com/radai-
> rosenblatt/kafka-benchmarks/
> > > > >> > > > > > > > tree/master/memorypool-benchmarks
> > > > >> > > > > > > >    1.3 as for the complexity of the impl - its just
> > ~150
> > > > >> lines
> > > > >> > > and
> > > > >> > > > > > pretty
> > > > >> > > > > > > > straight forward. i think the main issue is that not
> > > many
> > > > >> > people
> > > > >> > > > are
> > > > >> > > > > > > > familiar with weak refs and ref queues.
> > > > >> > > > > > > >
> > > > >> > > > > > > >    how about making the pool impl class a config
> param
> > > > >> > (generally
> > > > >> > > > > good
> > > > >> > > > > > > > going forward), make the default be the simple pool,
> > and
> > > > keep
> > > > >> > the
> > > > >> > > > GC
> > > > >> > > > > > one
> > > > >> > > > > > > as
> > > > >> > > > > > > > a dev/debug/triage aid?
> > > > >> > > > > > > >
> > > > >> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at
> > all -
> > > > its
> > > > >> an
> > > > >> > > > > > > > implementation detail. as for my current patch, it
> has
> > > > some
> > > > >> > > minimal
> > > > >> > > > > > > > treatment of SSL - just enough to not mute SSL
> sockets
> > > > >> > > > mid-handshake
> > > > >> > > > > -
> > > > >> > > > > > > but
> > > > >> > > > > > > > the code in SslTransportLayer still allocates
> buffers
> > > > itself.
> > > > >> > it
> > > > >> > > is
> > > > >> > > > > my
> > > > >> > > > > > > > understanding that netReadBuffer/appReadBuffer
> > shouldn't
> > > > grow
> > > > >> > > > beyond
> > > > >> > > > > 2
> > > > >> > > > > > x
> > > > >> > > > > > > > sslEngine.getSession().getPacketBufferSize(),
> which i
> > > > assume
> > > > >> > to
> > > > >> > > be
> > > > >> > > > > > > small.
> > > > >> > > > > > > > they are also long lived (they live for the duration
> > of
> > > > the
> > > > >> > > > > connection)
> > > > >> > > > > > > > which makes a poor fit for pooling. the bigger fish
> to
> > > > fry i
> > > > >> > > think
> > > > >> > > > is
> > > > >> > > > > > > > decompression - you could read a 1MB blob into a
> > > > >> pool-provided
> > > > >> > > > buffer
> > > > >> > > > > > and
> > > > >> > > > > > > > then decompress it into 10MB of heap allocated on
> the
> > > spot
> > > > >> :-)
> > > > >> > > > also,
> > > > >> > > > > > the
> > > > >> > > > > > > > ssl code is extremely tricky.
> > > > >> > > > > > > >    2.2 just to make sure, youre talking about
> > > > Selector.java:
> > > > >> > > while
> > > > >> > > > > > > > ((networkReceive = channel.read()) != null)
> > > > >> > > > > > addToStagedReceives(channel,
> > > > >> > > > > > > > networkReceive); ? if so youre right, and i'll fix
> > that
> > > > >> > (probably
> > > > >> > > > by
> > > > >> > > > > > > > something similar to immediatelyConnectedKeys, not
> > sure
> > > > yet)
> > > > >> > > > > > > >
> > > > >> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add
> > > > javadocs
> > > > >> and
> > > > >> > > > > update
> > > > >> > > > > > > the
> > > > >> > > > > > > > wiki). isLowOnMem is basically the point where I
> start
> > > > >> > > randomizing
> > > > >> > > > > the
> > > > >> > > > > > > > selection key handling order to avoid potential
> > > > starvation.
> > > > >> its
> > > > >> > > > > rather
> > > > >> > > > > > > > arbitrary and now that i think of it should probably
> > not
> > > > >> exist
> > > > >> > > and
> > > > >> > > > be
> > > > >> > > > > > > > entirely contained in Selector (where the shuffling
> > > takes
> > > > >> > place).
> > > > >> > > > > will
> > > > >> > > > > > > fix.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 4. will do.
> > > > >> > > > > > > >
> > > > >> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or
> basically
> > > > >> anything
> > > > >> > > > <=0).
> > > > >> > > > > > > > Long.MAX_VALUE would still create a pool, that would
> > > still
> > > > >> > waste
> > > > >> > > > time
> > > > >> > > > > > > > tracking resources. I dont really mind though if you
> > > have
> > > > a
> > > > >> > > > preferred
> > > > >> > > > > > > magic
> > > > >> > > > > > > > value for off.
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <
> > > jun@confluent.io
> > > > >
> > > > >> > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hi, Radai,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks for the KIP. Some comments below.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 1. The KIP says "to facilitate faster
> implementation
> > > > (as a
> > > > >> > > safety
> > > > >> > > > > > net)
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > pool will be implemented in such a way that memory
> > > that
> > > > was
> > > > >> > not
> > > > >> > > > > > > > release()ed
> > > > >> > > > > > > > > (but still garbage collected) would be detected
> and
> > > > >> > > "reclaimed".
> > > > >> > > > > this
> > > > >> > > > > > > is
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > prevent "leaks" in case of code paths that fail to
> > > > >> release()
> > > > >> > > > > > > properly.".
> > > > >> > > > > > > > > What are the cases that could cause memory leaks?
> If
> > > we
> > > > are
> > > > >> > > > > concerned
> > > > >> > > > > > > > about
> > > > >> > > > > > > > > bugs, it seems that it's better to just do more
> > > testing
> > > > to
> > > > >> > make
> > > > >> > > > > sure
> > > > >> > > > > > > the
> > > > >> > > > > > > > > usage of the simple implementation
> > (SimpleMemoryPool)
> > > is
> > > > >> > solid
> > > > >> > > > > > instead
> > > > >> > > > > > > of
> > > > >> > > > > > > > > adding more complicated logic
> > > > (GarbageCollectedMemoryPool)
> > > > >> to
> > > > >> > > > hide
> > > > >> > > > > > the
> > > > >> > > > > > > > > potential bugs.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 2. I am wondering how much this KIP covers the SSL
> > > > channel
> > > > >> > > > > > > > implementation.
> > > > >> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> > > > >> > netWriteBuffer,
> > > > >> > > > > > > > > appReadBuffer per socket. Should those memory be
> > > > accounted
> > > > >> > for
> > > > >> > > in
> > > > >> > > > > > > memory
> > > > >> > > > > > > > > pool?
> > > > >> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> > > > >> > > > KafkaChannel.read(),
> > > > >> > > > > > > it's
> > > > >> > > > > > > > > possible for multiple NetworkReceives to be
> returned
> > > > since
> > > > >> > > > multiple
> > > > >> > > > > > > > > requests' data could be encrypted together by SSL.
> > To
> > > > deal
> > > > >> > with
> > > > >> > > > > this,
> > > > >> > > > > > > we
> > > > >> > > > > > > > > stash those NetworkReceives in
> > Selector.stagedReceives
> > > > and
> > > > >> > give
> > > > >> > > > it
> > > > >> > > > > > back
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > the poll() call one NetworkReceive at a time. What
> > > this
> > > > >> means
> > > > >> > > is
> > > > >> > > > > > that,
> > > > >> > > > > > > if
> > > > >> > > > > > > > > we stop reading from KafkaChannel in the middle
> > > because
> > > > >> > memory
> > > > >> > > > pool
> > > > >> > > > > > is
> > > > >> > > > > > > > > full, this channel's key may never get selected
> for
> > > > reads
> > > > >> > (even
> > > > >> > > > > after
> > > > >> > > > > > > the
> > > > >> > > > > > > > > read interest is turned on), but there are still
> > > pending
> > > > >> data
> > > > >> > > for
> > > > >> > > > > the
> > > > >> > > > > > > > > channel, which will never get processed.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 3. The code has the following two methods in
> > > MemoryPool,
> > > > >> > which
> > > > >> > > > are
> > > > >> > > > > > not
> > > > >> > > > > > > > > described in the KIP. Could you explain how they
> are
> > > > used
> > > > >> in
> > > > >> > > the
> > > > >> > > > > > wiki?
> > > > >> > > > > > > > > isLowOnMemory()
> > > > >> > > > > > > > > isOutOfMemory()
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 4. Could you also describe in the KIP at the high
> > > level,
> > > > >> how
> > > > >> > > the
> > > > >> > > > > read
> > > > >> > > > > > > > > interest bit for the socket is turned on/off with
> > > > respect
> > > > >> to
> > > > >> > > > > > > MemoryPool?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> > > > >> Long.MAX_VALUE?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Jun
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > > > >> > > > radai.rosenblatt@gmail.com>
> > > > >> > > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Hi,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > I would like to initiate a vote on KIP-72:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > https://cwiki.apache.org/
> > > > confluence/display/KAFKA/KIP-
> > > > >> > 72%3A+
> > > > >> > > > > > > > > > Allow+putting+a+bound+on+memor
> > > y+consumed+by+Incoming+
> > > > >> > > requests
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > The kip allows specifying a limit on the amount
> of
> > > > memory
> > > > >> > > > > allocated
> > > > >> > > > > > > for
> > > > >> > > > > > > > > > reading incoming requests into. This is useful
> for
> > > > >> > "sizing" a
> > > > >> > > > > > broker
> > > > >> > > > > > > > and
> > > > >> > > > > > > > > > avoiding OOMEs under heavy load (as actually
> > happens
> > > > >> > > > occasionally
> > > > >> > > > > > at
> > > > >> > > > > > > > > > linkedin).
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > I believe I've addressed most (all?) concerns
> > > brought
> > > > up
> > > > >> > > during
> > > > >> > > > > the
> > > > >> > > > > > > > > > discussion.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > To the best of my understanding this vote is
> about
> > > the
> > > > >> goal
> > > > >> > > and
> > > > >> > > > > > > > > > public-facing changes related to the new
> proposed
> > > > >> behavior,
> > > > >> > > but
> > > > >> > > > > as
> > > > >> > > > > > > for
> > > > >> > > > > > > > > > implementation, i have the code up here:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > https://github.com/radai-
> > > > rosenblatt/kafka/tree/broker-
> > > > >> > memory
> > > > >> > > > > > > > > > -pool-with-muting
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > and I've stress-tested it to work properly
> > (meaning
> > > it
> > > > >> > chugs
> > > > >> > > > > along
> > > > >> > > > > > > and
> > > > >> > > > > > > > > > throttles under loads that would DOS 10.0.1.0
> > code).
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > I also believe that the primitives and
> "pattern"s
> > > > >> > introduced
> > > > >> > > in
> > > > >> > > > > > this
> > > > >> > > > > > > > KIP
> > > > >> > > > > > > > > > (namely the notion of a buffer pool and
> retrieving
> > > > from /
> > > > >> > > > > releasing
> > > > >> > > > > > > to
> > > > >> > > > > > > > > said
> > > > >> > > > > > > > > > pool instead of allocating memory) are generally
> > > > useful
> > > > >> > > beyond
> > > > >> > > > > the
> > > > >> > > > > > > > scope
> > > > >> > > > > > > > > of
> > > > >> > > > > > > > > > this KIP for both performance issues (allocating
> > > lots
> > > > of
> > > > >> > > > > > short-lived
> > > > >> > > > > > > > > large
> > > > >> > > > > > > > > > buffers is a performance bottleneck) and other
> > areas
> > > > >> where
> > > > >> > > > memory
> > > > >> > > > > > > > limits
> > > > >> > > > > > > > > > are a problem (KIP-81)
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Thank you,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Radai.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > --
> > > > >> > > > > > Regards,
> > > > >> > > > > >
> > > > >> > > > > > Rajini
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > --
> > > > >> > > Regards,
> > > > >> > >
> > > > >> > > Rajini
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Regards,
> > > > >>
> > > > >> Rajini
> > > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Rajini
> > >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Hi Jun,

1. Will do.

2. True, for several reasons:
   2.1. Which selector? There's a single pool but 16 selectors (typical at
LinkedIn; num.network.threads defaults to 3).
   2.2. Even if I could figure out which selector (all of them?), the better
thing to do would be to resume reading not when any memory becomes available
(because in the worst case it may not be enough for anything) but when some
"low watermark" of available memory is hit - so mute at 100% memory used,
unmute when back down to 90%?
   2.3. On the broker side (which is the current concern for my patch) this
max wait time is a hardcoded 300 ms (SocketServer.Processor.poll()), which I
think is acceptable and definitely not arbitrary or configurable.

   If you still think this needs to be addressed (and you are right that in
the general case the timeout param could be arbitrary), I can implement the
watermark approach plus pool.waitForLowWatermark(timeout) or something
similar, and make Selector.poll() wait for the low watermark at the end of
poll() if no work has been done (so as not to wait on memory needlessly for
requests that don't require it, as Rajini suggested earlier).
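
To make the watermark idea concrete, here is a minimal sketch (illustrative
only - the class and method names are hypothetical and not part of the KIP or
the current patch) of a pool that hands out buffers until full and lets an
idle poll() block until usage drops back below a low watermark:

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class WatermarkPool {
    private final long capacityBytes;
    private final long lowWatermarkBytes;   // e.g. 90% of capacity
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition belowLowWatermark = lock.newCondition();
    private long usedBytes = 0;

    public WatermarkPool(long capacityBytes, double lowWatermarkFraction) {
        this.capacityBytes = capacityBytes;
        this.lowWatermarkBytes = (long) (capacityBytes * lowWatermarkFraction);
    }

    // Hand out a buffer, or null if the pool is exhausted (the caller would
    // then mute the channel).
    public ByteBuffer tryAllocate(int sizeBytes) {
        lock.lock();
        try {
            if (usedBytes + sizeBytes > capacityBytes)
                return null;
            usedBytes += sizeBytes;
            return ByteBuffer.allocate(sizeBytes);
        } finally {
            lock.unlock();
        }
    }

    // Return a buffer to the pool; wake waiters only once usage drops to the
    // low watermark, not on every release.
    public void release(ByteBuffer buffer) {
        lock.lock();
        try {
            usedBytes -= buffer.capacity();
            if (usedBytes <= lowWatermarkBytes)
                belowLowWatermark.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Block (up to timeoutMs) until usage is at or below the low watermark.
    public boolean waitForLowWatermark(long timeoutMs) throws InterruptedException {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (usedBytes > lowWatermarkBytes) {
                long remainingNanos = deadlineNanos - System.nanoTime();
                if (remainingNanos <= 0)
                    return false;
                belowLowWatermark.awaitNanos(remainingNanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}

A poll() that did no work could then call pool.waitForLowWatermark(remainingMs)
before returning, instead of sleeping out its full timeout.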

On Wed, Nov 16, 2016 at 9:04 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Radai,
>
> Thanks for the updated proposal. +1 overall. A couple of comments below.
>
> 1. Our current convention is to avoid using getters. Could you change
> getSize and getAvailableMemory accordingly? Also, size is bit ambiguous,
> could we use sth like capacity?
>
> 2. This is more on the implementation details. I didn't see any code to
> wake up the selector when memory is released from the pool. For example,
> suppose that all socket keys are muted since the pool is full. The
> selector.poll() call will wait for the timeout, which could be arbitrarily
> long. Now, if some memory is released, it seems that we should wake up the
> selector early instead of waiting for the timeout.
>
> Jun
>
>
> On Mon, Nov 14, 2016 at 11:41 AM, Rajini Sivaram <
> rajinisivaram@googlemail.com> wrote:
>
> > +1
> >
> > Thank you for the KIP, Radai.
> >
> > On Mon, Nov 14, 2016 at 6:07 PM, Mickael Maison <
> mickael.maison@gmail.com>
> > wrote:
> >
> > > +1. We've also been hit by OOMs on the broker because we were not able
> > > to properly bound its memory usage.
> > >
> > > On Mon, Nov 14, 2016 at 5:56 PM, radai <ra...@gmail.com>
> > wrote:
> > > > @rajini - fixed the hasBytesBuffered() method. also updated poll() so
> > > that
> > > > no latency is added for picking up data stuck in ssl buffers (timeout
> > is
> > > > set to 0, just like with immediately connected keys and staged
> > receives).
> > > > thank you for pointing these out.
> > > > added ssl (re) testing to the KIP testing plan.
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> > > > rajinisivaram@googlemail.com> wrote:
> > > >
> > > >> Open point 1. I would just retain the current long value that
> > specifies
> > > >> queued.max.bytes as long and not as %heap since it is simple and
> easy
> > to
> > > >> use. And keeps it consistent with other ".bytes" configs.
> > > >>
> > > >> Point 3. ssl buffers - I am not quite sure the implementation looks
> > > >> correct. hasBytesBuffered() is checking position() of buffers == 0.
> > And
> > > the
> > > >> code checks this only when poll with a timeout returns (adding a
> delay
> > > when
> > > >> there is nothing else to read).
> > > >> But since this and open point 2 (optimization) are implementation
> > > details,
> > > >> they can be looked at during PR review.
> > > >>
> > > >> It will be good to add SSL testing to the test plan as well, since
> > > there is
> > > >> additional code to test for SSL.
> > > >>
> > > >>
> > > >> On Fri, Nov 11, 2016 at 9:03 PM, radai <ra...@gmail.com>
> > > wrote:
> > > >>
> > > >> > ok, i've made the following changes:
> > > >> >
> > > >> > 1. memory.pool.class.name has been removed
> > > >> > 2. the code now only uses SimpleMemoryPool. the gc variant is left
> > > >> (unused)
> > > >> > as a developement aid and is unsettable via configuration.
> > > >> > 3. I've resolved the issue of stale data getting stuck in
> > intermediate
> > > >> > (ssl) buffers.
> > > >> > 4. default value for queued.max.bytes is -1, so off by default.
> any
> > > <=0
> > > >> > value is interpreted as off by the underlying code.
> > > >> >
> > > >> > open points:
> > > >> >
> > > >> > 1. the kafka config framework doesnt allow a value to be either
> long
> > > or
> > > >> > double, so in order to pull off the queued.max.bytes = 1000000 or
> > > >> > queued.max.bytes = 0.3 thing i'd need to define the config as type
> > > >> string,
> > > >> > which is ugly to me. do we want to support setting
> queued.max.bytes
> > > to %
> > > >> of
> > > >> > heap ? if so, by way of making queued.max.bytes of type string, or
> > by
> > > way
> > > >> > of a 2nd config param (with the resulting either/all/combination?
> > > >> > validation). my personal opinion is string because i think a
> single
> > > >> > queued.max.bytes with overloaded meaning is more understandable to
> > > users.
> > > >> > i'll await other people's opinions before doing anything.
> > > >> > 2. i still need to evaluate rajini's optimization. sounds doable.
> > > >> >
> > > >> > asides:
> > > >> >
> > > >> > 1. i think you guys misunderstood the intent behind the gc pool.
> it
> > > was
> > > >> > never meant to be a magic pool that automatically releases buffers
> > > >> (because
> > > >> > just as rajini stated the performance implications would be
> > > horrible). it
> > > >> > was meant to catch leaks early. since that is indeed a dev-only
> > > concern
> > > >> it
> > > >> > wont ever get used in production.
> > > >> > 2. i said this on some other kip discussion: i think the nice
> thing
> > > about
> > > >> > the pool API is it "scales" from just keeping a memory bound to
> > > actually
> > > >> > re-using buffers without changing the calling code. i think
> > > >> actuallypooling
> > > >> > large buffers will result in a significant performance impact, but
> > > thats
> > > >> > outside the scope of this kip. at that point i think more pool
> > > >> > implementations (that actually pool) would be written. i agree
> with
> > > the
> > > >> > ideal of exposing as few knobs as possible, but switching pools
> (or
> > > pool
> > > >> > params) for tuning may happen at some later point.
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> > > >> > rajinisivaram@googlemail.com> wrote:
> > > >> >
> > > >> > > 13. At the moment, I think channels are not muted if:
> > > >> > >     channel.receive != null && channel.receive.buffer != null
> > > >> > > This mutes all channels that aren't holding onto a incomplete
> > > buffer.
> > > >> > They
> > > >> > > may or may not have read the 4-byte size.
> > > >> > >
> > > >> > > I was thinking you could avoid muting channels if:
> > > >> > >     channel.receive == null || channel.receive.size.remaining()
> > > >> > > This will not mute channels that are holding onto a buffer (as
> > > above).
> > > >> In
> > > >> > > addition, it will not mute channels that haven't read the 4-byte
> > > size.
> > > >> A
> > > >> > > client that is closed gracefully while the pool is full will not
> > be
> > > >> muted
> > > >> > > in this case and the server can process close without waiting
> for
> > > the
> > > >> > pool
> > > >> > > to free up. Once the 4-byte size is read, the channel will be
> > muted
> > > if
> > > >> > the
> > > >> > > pool is still out of memory - for each channel, at most one
> failed
> > > read
> > > >> > > attempt would be made while the pool is out of memory. I think
> > this
> > > >> would
> > > >> > > also delay muting of SSL channels since they can continue to
> read
> > > into
> > > >> > > their (already allocated) network buffers and unwrap the data
> and
> > > block
> > > >> > > only when they need to allocate a buffer from the pool.
> > > >> > >
> > > >> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io>
> > > wrote:
> > > >> > >
> > > >> > > > Hey Radai,
> > > >> > > >
> > > >> > > > +1 on deprecating and eventually removing the old config. The
> > > >> intention
> > > >> > > was
> > > >> > > > absolutely bounding memory usage. I think having two ways of
> > doing
> > > >> > this,
> > > >> > > > one that gives a crisp bound on memory and one that is hard to
> > > reason
> > > >> > > about
> > > >> > > > is pretty confusing. I think people will really appreciate
> > having
> > > one
> > > >> > > > config which instead lets them directly control the thing they
> > > >> actually
> > > >> > > > care about (memory).
> > > >> > > >
> > > >> > > > I also want to second Jun's concern on the complexity of the
> > > >> self-GCing
> > > >> > > > memory pool. I wrote the memory pool for the producer. In that
> > > area
> > > >> the
> > > >> > > > pooling of messages is the single biggest factor in
> performance
> > of
> > > >> the
> > > >> > > > client so I believed it was worth some
> sophistication/complexity
> > > if
> > > >> > there
> > > >> > > > was performance payoff. All the same, the complexity of that
> > code
> > > has
> > > >> > > made
> > > >> > > > it VERY hard to keep correct (it gets broken roughly every
> other
> > > time
> > > >> > > > someone makes a change). Over time I came to feel a lot less
> > > proud of
> > > >> > my
> > > >> > > > cleverness. I learned something interesting reading your
> > > self-GCing
> > > >> > > memory
> > > >> > > > pool, but I wonder if the complexity is worth the payoff in
> this
> > > >> case?
> > > >> > > >
> > > >> > > > Philosophically we've tried really hard to avoid needlessly
> > > >> "pluggable"
> > > >> > > > implementations. That is, when there is a temptation to give a
> > > config
> > > >> > > that
> > > >> > > > plugs in different Java classes at run time for implementation
> > > >> choices,
> > > >> > > we
> > > >> > > > should instead think of how to give the user the good behavior
> > > >> > > > automatically. I think the use case for configuring a the
> GCing
> > > pool
> > > >> > > would
> > > >> > > > be if you discovered a bug in which memory leaked. But this
> > isn't
> > > >> > > something
> > > >> > > > the user should have to think about right? If there is a bug
> we
> > > >> should
> > > >> > > find
> > > >> > > > and fix it.
> > > >> > > >
> > > >> > > > -Jay
> > > >> > > >
> > > >> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <
> > > radai.rosenblatt@gmail.com>
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > > jun's #1 + rajini's #11 - the new config param is to enable
> > > >> changing
> > > >> > > the
> > > >> > > > > pool implentation class. as i said in my response to jun i
> > will
> > > >> make
> > > >> > > the
> > > >> > > > > default pool impl be the simple one, and this param is to
> > allow
> > > a
> > > >> > user
> > > >> > > > > (more likely a dev) to change it.
> > > >> > > > > both the simple pool and the "gc pool" make basically just
> an
> > > >> > > > > AtomicLong.get() + (hashmap.put for gc) calls before
> > returning a
> > > >> > > buffer.
> > > >> > > > > there is absolutely no dependency on GC times in allocating
> > (or
> > > >> not).
> > > >> > > the
> > > >> > > > > extra background thread in the gc pool is forever asleep
> > unless
> > > >> there
> > > >> > > are
> > > >> > > > > bugs (==leaks) so the extra cost is basically nothing
> (backed
> > by
> > > >> > > > > benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED
> > MUST
> > > >> > ALWAYS
> > > >> > > > BE
> > > >> > > > > RELEASED - so the gc pool should not rely on gc for
> reclaiming
> > > >> > buffers.
> > > >> > > > its
> > > >> > > > > a bug detector, not a feature and is definitely not intended
> > to
> > > >> hide
> > > >> > > > bugs -
> > > >> > > > > the exact opposite - its meant to expose them sooner. i've
> > > cleaned
> > > >> up
> > > >> > > the
> > > >> > > > > docs to avoid this confusion. i also like the fail on leak.
> > will
> > > >> do.
> > > >> > > > > as for the gap between pool size and heap size - thats a
> valid
> > > >> > > argument.
> > > >> > > > > may allow also sizing the pool as % of heap size? so
> > > >> > queued.max.bytes =
> > > >> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of
> > available
> > > >> > heap?
> > > >> > > > >
> > > >> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes
> still
> > > >> holds,
> > > >> > > > > assuming the ssl-related buffers are small. the largest
> > > weakness in
> > > >> > > this
> > > >> > > > > claim has to do with decompression rather than anything
> > > >> ssl-related.
> > > >> > so
> > > >> > > > yes
> > > >> > > > > there is an O(#ssl connections * sslEngine packet size)
> > > component,
> > > >> > but
> > > >> > > i
> > > >> > > > > think its small. again - decompression should be the
> concern.
> > > >> > > > >
> > > >> > > > > rajini's #13 - interesting optimization. the problem is
> > there's
> > > no
> > > >> > > > knowing
> > > >> > > > > in advance what the _next_ request to come out of a socket
> is,
> > > so
> > > >> > this
> > > >> > > > > would mute just those sockets that are 1. mutable and 2.
> have
> > a
> > > >> > > > > buffer-demanding request for which we could not allocate a
> > > buffer.
> > > >> > > > downside
> > > >> > > > > is that as-is this would cause the busy-loop on poll() that
> > the
> > > >> mutes
> > > >> > > > were
> > > >> > > > > supposed to prevent - or code would need to be added to
> > > ad-hocmute
> > > >> a
> > > >> > > > > connection that was so-far unmuted but has now generated a
> > > >> > > > memory-demanding
> > > >> > > > > request?
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > >> > > > > rajinisivaram@googlemail.com> wrote:
> > > >> > > > >
> > > >> > > > > > Radai,
> > > >> > > > > >
> > > >> > > > > > 11. The KIP talks about a new server configuration
> parameter
> > > >> > > > > > *memory.pool.class.name
> > > >> > > > > > <http://memory.pool.class.name> *which is not in the
> > > >> > implementation.
> > > >> > > > Is
> > > >> > > > > it
> > > >> > > > > > still the case that the pool will be configurable?
> > > >> > > > > >
> > > >> > > > > > 12. Personally I would prefer not to have a garbage
> > collected
> > > >> pool
> > > >> > > that
> > > >> > > > > > hides bugs as well. Apart from the added code complexity
> and
> > > >> extra
> > > >> > > > thread
> > > >> > > > > > to handle collections, I am also concerned about the
> > > >> > > non-deterministic
> > > >> > > > > > nature of GC timings. The KIP introduces delays in
> > processing
> > > >> > > requests
> > > >> > > > > > based on the configuration parameter *queued.max.bytes.
> > *This
> > > in
> > > >> > > > > unrelated
> > > >> > > > > > to the JVM heap size and hence pool can be full when there
> > is
> > > no
> > > >> > > > pressure
> > > >> > > > > > on the JVM to garbage collect. The KIP does not prevent
> > other
> > > >> > > timeouts
> > > >> > > > in
> > > >> > > > > > the broker (eg. consumer session timeout) because it is
> > > relying
> > > >> on
> > > >> > > the
> > > >> > > > > pool
> > > >> > > > > > to be managed in a deterministic, timely manner. Since a
> > > garbage
> > > >> > > > > collected
> > > >> > > > > > pool cannot provide that guarantee, wouldn't it be better
> to
> > > run
> > > >> > > tests
> > > >> > > > > with
> > > >> > > > > > a GC-pool that perhaps fails with a fatal error if it
> > > encounters
> > > >> a
> > > >> > > > buffer
> > > >> > > > > > that was not released?
> > > >> > > > > >
> > > >> > > > > > 13. The implementation currently mutes all channels that
> > don't
> > > >> > have a
> > > >> > > > > > receive buffer allocated. Would it make sense to mute only
> > the
> > > >> > > channels
> > > >> > > > > > that need a buffer (i.e. allow channels to read the 4-byte
> > > size
> > > >> > that
> > > >> > > is
> > > >> > > > > not
> > > >> > > > > > read using the pool) so that normal client connection
> > close()
> > > is
> > > >> > > > handled
> > > >> > > > > > even when the pool is full? Since the extra 4-bytes may
> > > already
> > > >> be
> > > >> > > > > > allocated for some connections, the total request memory
> has
> > > to
> > > >> > take
> > > >> > > > into
> > > >> > > > > > account *4*numConnections* bytes anyway.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <
> jun@confluent.io
> > >
> > > >> > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi, Radai,
> > > >> > > > > > >
> > > >> > > > > > > 1. Yes, I am concerned about the trickiness of having to
> > > deal
> > > >> > with
> > > >> > > > > weak
> > > >> > > > > > > refs. I think it's simpler to just have the simple
> version
> > > >> > > > instrumented
> > > >> > > > > > > with enough debug/trace logging and do enough stress
> > > testing.
> > > >> > Since
> > > >> > > > we
> > > >> > > > > > > still have queued.max.requests, one can always fall back
> > to
> > > >> that
> > > >> > > if a
> > > >> > > > > > > memory leak issue is identified. We could also label the
> > > >> feature
> > > >> > as
> > > >> > > > > beta
> > > >> > > > > > if
> > > >> > > > > > > we don't think this is production ready.
> > > >> > > > > > >
> > > >> > > > > > > 2.2 I am just wondering after we fix that issue whether
> > the
> > > >> claim
> > > >> > > > that
> > > >> > > > > > the
> > > >> > > > > > > request memory is bounded by  queued.max.bytes +
> > > >> > > > > socket.request.max.bytes
> > > >> > > > > > > is still true.
> > > >> > > > > > >
> > > >> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> > > >> > > > > > >
> > > >> > > > > > > Thanks,
> > > >> > > > > > >
> > > >> > > > > > > Jun
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> > > >> > radai.rosenblatt@gmail.com>
> > > >> > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Hi Jun,
> > > >> > > > > > > >
> > > >> > > > > > > > Thank you for taking the time to review this.
> > > >> > > > > > > >
> > > >> > > > > > > > 1. short version - yes, the concern is bugs, but the
> > cost
> > > is
> > > >> > tiny
> > > >> > > > and
> > > >> > > > > > > worth
> > > >> > > > > > > > it, and its a common pattern. long version:
> > > >> > > > > > > >    1.1 detecting these types of bugs (leaks) cannot be
> > > easily
> > > >> > > done
> > > >> > > > > with
> > > >> > > > > > > > simple testing, but requires stress/stability tests
> that
> > > run
> > > >> > for
> > > >> > > a
> > > >> > > > > long
> > > >> > > > > > > > time (long enough to hit OOM, depending on leak size
> and
> > > >> > > available
> > > >> > > > > > > memory).
> > > >> > > > > > > > this is why some sort of leak detector is "standard
> > > practice"
> > > >> > > .for
> > > >> > > > > > > example
> > > >> > > > > > > > look at netty (http://netty.io/wiki/
> > > >> reference-counted-objects.
> > > >> > > > > > > > html#leak-detection-levels)
> > > >> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
> > > >> > > > > > > html#leak-detection-levels
> > > >> > > > > > > > >-
> > > >> > > > > > > > they have way more complicated built-in leak detection
> > > >> enabled
> > > >> > by
> > > >> > > > > > > default.
> > > >> > > > > > > > as a concrete example - during development i did not
> > > properly
> > > >> > > > dispose
> > > >> > > > > > of
> > > >> > > > > > > > in-progress KafkaChannel.receive when a connection was
> > > >> abruptly
> > > >> > > > > closed
> > > >> > > > > > > and
> > > >> > > > > > > > I only found it because of the log msg printed by the
> > > pool.
> > > >> > > > > > > >    1.2 I have a benchmark suite showing the
> performance
> > > cost
> > > >> of
> > > >> > > the
> > > >> > > > > gc
> > > >> > > > > > > pool
> > > >> > > > > > > > is absolutely negligible -
> > > >> > > > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > >> > > > > > > > tree/master/memorypool-benchmarks
> > > >> > > > > > > >    1.3 as for the complexity of the impl - its just
> ~150
> > > >> lines
> > > >> > > and
> > > >> > > > > > pretty
> > > >> > > > > > > > straight forward. i think the main issue is that not
> > many
> > > >> > people
> > > >> > > > are
> > > >> > > > > > > > familiar with weak refs and ref queues.
> > > >> > > > > > > >
> > > >> > > > > > > >    how about making the pool impl class a config param
> > > >> > (generally
> > > >> > > > > good
> > > >> > > > > > > > going forward), make the default be the simple pool,
> and
> > > keep
> > > >> > the
> > > >> > > > GC
> > > >> > > > > > one
> > > >> > > > > > > as
> > > >> > > > > > > > a dev/debug/triage aid?
> > > >> > > > > > > >
> > > >> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at
> all -
> > > its
> > > >> an
> > > >> > > > > > > > implementation detail. as for my current patch, it has
> > > some
> > > >> > > minimal
> > > >> > > > > > > > treatment of SSL - just enough to not mute SSL sockets
> > > >> > > > mid-handshake
> > > >> > > > > -
> > > >> > > > > > > but
> > > >> > > > > > > > the code in SslTransportLayer still allocates buffers
> > > itself.
> > > >> > it
> > > >> > > is
> > > >> > > > > my
> > > >> > > > > > > > understanding that netReadBuffer/appReadBuffer
> shouldn't
> > > grow
> > > >> > > > beyond
> > > >> > > > > 2
> > > >> > > > > > x
> > > >> > > > > > > > sslEngine.getSession().getPacketBufferSize(), which i
> > > assume
> > > >> > to
> > > >> > > be
> > > >> > > > > > > small.
> > > >> > > > > > > > they are also long lived (they live for the duration
> of
> > > the
> > > >> > > > > connection)
> > > >> > > > > > > > which makes a poor fit for pooling. the bigger fish to
> > > fry i
> > > >> > > think
> > > >> > > > is
> > > >> > > > > > > > decompression - you could read a 1MB blob into a
> > > >> pool-provided
> > > >> > > > buffer
> > > >> > > > > > and
> > > >> > > > > > > > then decompress it into 10MB of heap allocated on the
> > spot
> > > >> :-)
> > > >> > > > also,
> > > >> > > > > > the
> > > >> > > > > > > > ssl code is extremely tricky.
> > > >> > > > > > > >    2.2 just to make sure, youre talking about
> > > Selector.java:
> > > >> > > while
> > > >> > > > > > > > ((networkReceive = channel.read()) != null)
> > > >> > > > > > addToStagedReceives(channel,
> > > >> > > > > > > > networkReceive); ? if so youre right, and i'll fix
> that
> > > >> > (probably
> > > >> > > > by
> > > >> > > > > > > > something similar to immediatelyConnectedKeys, not
> sure
> > > yet)
> > > >> > > > > > > >
> > > >> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add
> > > javadocs
> > > >> and
> > > >> > > > > update
> > > >> > > > > > > the
> > > >> > > > > > > > wiki). isLowOnMem is basically the point where I start
> > > >> > > randomizing
> > > >> > > > > the
> > > >> > > > > > > > selection key handling order to avoid potential
> > > starvation.
> > > >> its
> > > >> > > > > rather
> > > >> > > > > > > > arbitrary and now that i think of it should probably
> not
> > > >> exist
> > > >> > > and
> > > >> > > > be
> > > >> > > > > > > > entirely contained in Selector (where the shuffling
> > takes
> > > >> > place).
> > > >> > > > > will
> > > >> > > > > > > fix.
> > > >> > > > > > > >
> > > >> > > > > > > > 4. will do.
> > > >> > > > > > > >
> > > >> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically
> > > >> anything
> > > >> > > > <=0).
> > > >> > > > > > > > Long.MAX_VALUE would still create a pool, that would
> > still
> > > >> > waste
> > > >> > > > time
> > > >> > > > > > > > tracking resources. I dont really mind though if you
> > have
> > > a
> > > >> > > > preferred
> > > >> > > > > > > magic
> > > >> > > > > > > > value for off.
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <
> > jun@confluent.io
> > > >
> > > >> > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi, Radai,
> > > >> > > > > > > > >
> > > >> > > > > > > > > Thanks for the KIP. Some comments below.
> > > >> > > > > > > > >
> > > >> > > > > > > > > 1. The KIP says "to facilitate faster implementation
> > > (as a
> > > >> > > safety
> > > >> > > > > > net)
> > > >> > > > > > > > the
> > > >> > > > > > > > > pool will be implemented in such a way that memory
> > that
> > > was
> > > >> > not
> > > >> > > > > > > > release()ed
> > > >> > > > > > > > > (but still garbage collected) would be detected and
> > > >> > > "reclaimed".
> > > >> > > > > this
> > > >> > > > > > > is
> > > >> > > > > > > > to
> > > >> > > > > > > > > prevent "leaks" in case of code paths that fail to
> > > >> release()
> > > >> > > > > > > properly.".
> > > >> > > > > > > > > What are the cases that could cause memory leaks? If
> > we
> > > are
> > > >> > > > > concerned
> > > >> > > > > > > > about
> > > >> > > > > > > > > bugs, it seems that it's better to just do more
> > testing
> > > to
> > > >> > make
> > > >> > > > > sure
> > > >> > > > > > > the
> > > >> > > > > > > > > usage of the simple implementation
> (SimpleMemoryPool)
> > is
> > > >> > solid
> > > >> > > > > > instead
> > > >> > > > > > > of
> > > >> > > > > > > > > adding more complicated logic
> > > (GarbageCollectedMemoryPool)
> > > >> to
> > > >> > > > hide
> > > >> > > > > > the
> > > >> > > > > > > > > potential bugs.
> > > >> > > > > > > > >
> > > >> > > > > > > > > 2. I am wondering how much this KIP covers the SSL
> > > channel
> > > >> > > > > > > > implementation.
> > > >> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> > > >> > netWriteBuffer,
> > > >> > > > > > > > > appReadBuffer per socket. Should those memory be
> > > accounted
> > > >> > for
> > > >> > > in
> > > >> > > > > > > memory
> > > >> > > > > > > > > pool?
> > > >> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> > > >> > > > KafkaChannel.read(),
> > > >> > > > > > > it's
> > > >> > > > > > > > > possible for multiple NetworkReceives to be returned
> > > since
> > > >> > > > multiple
> > > >> > > > > > > > > requests' data could be encrypted together by SSL.
> To
> > > deal
> > > >> > with
> > > >> > > > > this,
> > > >> > > > > > > we
> > > >> > > > > > > > > stash those NetworkReceives in
> Selector.stagedReceives
> > > and
> > > >> > give
> > > >> > > > it
> > > >> > > > > > back
> > > >> > > > > > > > to
> > > >> > > > > > > > > the poll() call one NetworkReceive at a time. What
> > this
> > > >> means
> > > >> > > is
> > > >> > > > > > that,
> > > >> > > > > > > if
> > > >> > > > > > > > > we stop reading from KafkaChannel in the middle
> > because
> > > >> > memory
> > > >> > > > pool
> > > >> > > > > > is
> > > >> > > > > > > > > full, this channel's key may never get selected for
> > > reads
> > > >> > (even
> > > >> > > > > after
> > > >> > > > > > > the
> > > >> > > > > > > > > read interest is turned on), but there are still
> > pending
> > > >> data
> > > >> > > for
> > > >> > > > > the
> > > >> > > > > > > > > channel, which will never get processed.
> > > >> > > > > > > > >
> > > >> > > > > > > > > 3. The code has the following two methods in
> > MemoryPool,
> > > >> > which
> > > >> > > > are
> > > >> > > > > > not
> > > >> > > > > > > > > described in the KIP. Could you explain how they are
> > > used
> > > >> in
> > > >> > > the
> > > >> > > > > > wiki?
> > > >> > > > > > > > > isLowOnMemory()
> > > >> > > > > > > > > isOutOfMemory()
> > > >> > > > > > > > >
> > > >> > > > > > > > > 4. Could you also describe in the KIP at the high
> > level,
> > > >> how
> > > >> > > the
> > > >> > > > > read
> > > >> > > > > > > > > interest bit for the socket is turned on/off with
> > > respect
> > > >> to
> > > >> > > > > > > MemoryPool?
> > > >> > > > > > > > >
> > > >> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> > > >> Long.MAX_VALUE?
> > > >> > > > > > > > >
> > > >> > > > > > > > > Thanks,
> > > >> > > > > > > > >
> > > >> > > > > > > > > Jun
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > > >> > > > radai.rosenblatt@gmail.com>
> > > >> > > > > > > > wrote:
> > > >> > > > > > > > >
> > > >> > > > > > > > > > Hi,
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > I would like to initiate a vote on KIP-72:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > https://cwiki.apache.org/
> > > confluence/display/KAFKA/KIP-
> > > >> > 72%3A+
> > > >> > > > > > > > > > Allow+putting+a+bound+on+memor
> > y+consumed+by+Incoming+
> > > >> > > requests
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > The kip allows specifying a limit on the amount of
> > > memory
> > > >> > > > > allocated
> > > >> > > > > > > for
> > > >> > > > > > > > > > reading incoming requests into. This is useful for
> > > >> > "sizing" a
> > > >> > > > > > broker
> > > >> > > > > > > > and
> > > >> > > > > > > > > > avoiding OOMEs under heavy load (as actually
> happens
> > > >> > > > occasionally
> > > >> > > > > > at
> > > >> > > > > > > > > > linkedin).
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > I believe I've addressed most (all?) concerns
> > brought
> > > up
> > > >> > > during
> > > >> > > > > the
> > > >> > > > > > > > > > discussion.
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > To the best of my understanding this vote is about
> > the
> > > >> goal
> > > >> > > and
> > > >> > > > > > > > > > public-facing changes related to the new proposed
> > > >> behavior,
> > > >> > > but
> > > >> > > > > as
> > > >> > > > > > > for
> > > >> > > > > > > > > > implementation, i have the code up here:
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > https://github.com/radai-
> > > rosenblatt/kafka/tree/broker-
> > > >> > memory
> > > >> > > > > > > > > > -pool-with-muting
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > and I've stress-tested it to work properly
> (meaning
> > it
> > > >> > chugs
> > > >> > > > > along
> > > >> > > > > > > and
> > > >> > > > > > > > > > throttles under loads that would DOS 10.0.1.0
> code).
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > I also believe that the primitives and "pattern"s
> > > >> > introduced
> > > >> > > in
> > > >> > > > > > this
> > > >> > > > > > > > KIP
> > > >> > > > > > > > > > (namely the notion of a buffer pool and retrieving
> > > from /
> > > >> > > > > releasing
> > > >> > > > > > > to
> > > >> > > > > > > > > said
> > > >> > > > > > > > > > pool instead of allocating memory) are generally
> > > useful
> > > >> > > beyond
> > > >> > > > > the
> > > >> > > > > > > > scope
> > > >> > > > > > > > > of
> > > >> > > > > > > > > > this KIP for both performance issues (allocating
> > lots
> > > of
> > > >> > > > > > short-lived
> > > >> > > > > > > > > large
> > > >> > > > > > > > > > buffers is a performance bottleneck) and other
> areas
> > > >> where
> > > >> > > > memory
> > > >> > > > > > > > limits
> > > >> > > > > > > > > > are a problem (KIP-81)
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Thank you,
> > > >> > > > > > > > > >
> > > >> > > > > > > > > > Radai.
> > > >> > > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Regards,
> > > >> > > > > >
> > > >> > > > > > Rajini
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Regards,
> > > >> > >
> > > >> > > Rajini
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Regards,
> > > >>
> > > >> Rajini
> > > >>
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Rajini
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

Thanks for the updated proposal. +1 overall. A couple of comments below.

1. Our current convention is to avoid using getters. Could you change
getSize and getAvailableMemory accordingly? Also, "size" is a bit ambiguous;
could we use something like capacity?
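
(For illustration only - a hypothetical shape the pool interface could take
after such a rename; these names are suggestions, not the final public API:)

import java.nio.ByteBuffer;

public interface MemoryPool {
    ByteBuffer tryAllocate(int sizeBytes); // null when the pool is exhausted
    void release(ByteBuffer previouslyAllocated);
    long size();             // total capacity; "capacity()" may be clearer
    long availableMemory();  // bytes currently available for allocation
    boolean isOutOfMemory();
}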

2. This is more on the implementation details. I didn't see any code to
wake up the selector when memory is released from the pool. For example,
suppose that all socket keys are muted since the pool is full. The
selector.poll() call will wait for the timeout, which could be arbitrarily
long. Now, if some memory is released, it seems that we should wake up the
selector early instead of waiting for the timeout.
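
(One possible wiring, sketched with hypothetical names - illustrative only,
not code from the patch: the pool could notify a listener whenever memory is
released, and each network Processor could register a listener that wakes its
selector so it does not sleep out the full poll timeout.)

import java.nio.channels.Selector;

// Hypothetical callback the pool would invoke from release().
interface MemoryReleaseListener {
    void onMemoryReleased(long bytesNowAvailable);
}

// Registered by each Processor against its own selector; wakeup() is safe to
// call from any thread and returns immediately.
class WakeupOnRelease implements MemoryReleaseListener {
    private final Selector selector;

    WakeupOnRelease(Selector selector) {
        this.selector = selector;
    }

    @Override
    public void onMemoryReleased(long bytesNowAvailable) {
        selector.wakeup(); // interrupts a blocked select(timeout) early
    }
}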

Jun


On Mon, Nov 14, 2016 at 11:41 AM, Rajini Sivaram <
rajinisivaram@googlemail.com> wrote:

> +1
>
> Thank you for the KIP, Radai.
>
> On Mon, Nov 14, 2016 at 6:07 PM, Mickael Maison <mi...@gmail.com>
> wrote:
>
> > +1. We've also been hit by OOMs on the broker because we were not able
> > to properly bound its memory usage.
> >
> > On Mon, Nov 14, 2016 at 5:56 PM, radai <ra...@gmail.com>
> wrote:
> > > @rajini - fixed the hasBytesBuffered() method. also updated poll() so
> > that
> > > no latency is added for picking up data stuck in ssl buffers (timeout
> is
> > > set to 0, just like with immediately connected keys and staged
> receives).
> > > thank you for pointing these out.
> > > added ssl (re) testing to the KIP testing plan.
> > >
> > >
> > >
> > >
> > > On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> > > rajinisivaram@googlemail.com> wrote:
> > >
> > >> Open point 1. I would just retain the current long value that
> specifies
> > >> queued.max.bytes as long and not as %heap since it is simple and easy
> to
> > >> use. And keeps it consistent with other ".bytes" configs.
> > >>
> > >> Point 3. ssl buffers - I am not quite sure the implementation looks
> > >> correct. hasBytesBuffered() is checking position() of buffers == 0.
> And
> > the
> > >> code checks this only when poll with a timeout returns (adding a delay
> > when
> > >> there is nothing else to read).
> > >> But since this and open point 2 (optimization) are implementation
> > details,
> > >> they can be looked at during PR review.
> > >>
> > >> It will be good to add SSL testing to the test plan as well, since
> > there is
> > >> additional code to test for SSL.
> > >>
> > >>
> > >> On Fri, Nov 11, 2016 at 9:03 PM, radai <ra...@gmail.com>
> > wrote:
> > >>
> > >> > ok, i've made the following changes:
> > >> >
> > >> > 1. memory.pool.class.name has been removed
> > >> > 2. the code now only uses SimpleMemoryPool. the gc variant is left
> > >> (unused)
> > >> > as a developement aid and is unsettable via configuration.
> > >> > 3. I've resolved the issue of stale data getting stuck in
> intermediate
> > >> > (ssl) buffers.
> > >> > 4. default value for queued.max.bytes is -1, so off by default. any
> > <=0
> > >> > value is interpreted as off by the underlying code.
> > >> >
> > >> > open points:
> > >> >
> > >> > 1. the kafka config framework doesnt allow a value to be either long
> > or
> > >> > double, so in order to pull off the queued.max.bytes = 1000000 or
> > >> > queued.max.bytes = 0.3 thing i'd need to define the config as type
> > >> string,
> > >> > which is ugly to me. do we want to support setting queued.max.bytes
> > to %
> > >> of
> > >> > heap ? if so, by way of making queued.max.bytes of type string, or
> by
> > way
> > >> > of a 2nd config param (with the resulting either/all/combination?
> > >> > validation). my personal opinion is string because i think a single
> > >> > queued.max.bytes with overloaded meaning is more understandable to
> > users.
> > >> > i'll await other people's opinions before doing anything.
> > >> > 2. i still need to evaluate rajini's optimization. sounds doable.
> > >> >
> > >> > asides:
> > >> >
> > >> > 1. i think you guys misunderstood the intent behind the gc pool. it
> > was
> > >> > never meant to be a magic pool that automatically releases buffers
> > >> (because
> > >> > just as rajini stated the performance implications would be
> > horrible). it
> > >> > was meant to catch leaks early. since that is indeed a dev-only
> > concern
> > >> it
> > >> > wont ever get used in production.
> > >> > 2. i said this on some other kip discussion: i think the nice thing
> > about
> > >> > the pool API is it "scales" from just keeping a memory bound to
> > actually
> > >> > re-using buffers without changing the calling code. i think
> > >> actuallypooling
> > >> > large buffers will result in a significant performance impact, but
> > thats
> > >> > outside the scope of this kip. at that point i think more pool
> > >> > implementations (that actually pool) would be written. i agree with
> > the
> > >> > ideal of exposing as few knobs as possible, but switching pools (or
> > pool
> > >> > params) for tuning may happen at some later point.
> > >> >
> > >> >
> > >> >
> > >> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> > >> > rajinisivaram@googlemail.com> wrote:
> > >> >
> > >> > > 13. At the moment, I think channels are not muted if:
> > >> > >     channel.receive != null && channel.receive.buffer != null
> > >> > > This mutes all channels that aren't holding onto a incomplete
> > buffer.
> > >> > They
> > >> > > may or may not have read the 4-byte size.
> > >> > >
> > >> > > I was thinking you could avoid muting channels if:
> > >> > >     channel.receive == null || channel.receive.size.remaining()
> > >> > > This will not mute channels that are holding onto a buffer (as
> > above).
> > >> In
> > >> > > addition, it will not mute channels that haven't read the 4-byte
> > size.
> > >> A
> > >> > > client that is closed gracefully while the pool is full will not
> be
> > >> muted
> > >> > > in this case and the server can process close without waiting for
> > the
> > >> > pool
> > >> > > to free up. Once the 4-byte size is read, the channel will be
> muted
> > if
> > >> > the
> > >> > > pool is still out of memory - for each channel, at most one failed
> > read
> > >> > > attempt would be made while the pool is out of memory. I think
> this
> > >> would
> > >> > > also delay muting of SSL channels since they can continue to read
> > into
> > >> > > their (already allocated) network buffers and unwrap the data and
> > block
> > >> > > only when they need to allocate a buffer from the pool.
> > >> > >
> > >> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io>
> > wrote:
> > >> > >
> > >> > > > Hey Radai,
> > >> > > >
> > >> > > > +1 on deprecating and eventually removing the old config. The
> > >> intention
> > >> > > was
> > >> > > > absolutely bounding memory usage. I think having two ways of
> doing
> > >> > this,
> > >> > > > one that gives a crisp bound on memory and one that is hard to
> > reason
> > >> > > about
> > >> > > > is pretty confusing. I think people will really appreciate
> having
> > one
> > >> > > > config which instead lets them directly control the thing they
> > >> actually
> > >> > > > care about (memory).
> > >> > > >
> > >> > > > I also want to second Jun's concern on the complexity of the
> > >> self-GCing
> > >> > > > memory pool. I wrote the memory pool for the producer. In that
> > area
> > >> the
> > >> > > > pooling of messages is the single biggest factor in performance
> of
> > >> the
> > >> > > > client so I believed it was worth some sophistication/complexity
> > if
> > >> > there
> > >> > > > was performance payoff. All the same, the complexity of that
> code
> > has
> > >> > > made
> > >> > > > it VERY hard to keep correct (it gets broken roughly every other
> > time
> > >> > > > someone makes a change). Over time I came to feel a lot less
> > proud of
> > >> > my
> > >> > > > cleverness. I learned something interesting reading your
> > self-GCing
> > >> > > memory
> > >> > > > pool, but I wonder if the complexity is worth the payoff in this
> > >> case?
> > >> > > >
> > >> > > > Philosophically we've tried really hard to avoid needlessly
> > >> "pluggable"
> > >> > > > implementations. That is, when there is a temptation to give a
> > config
> > >> > > that
> > >> > > > plugs in different Java classes at run time for implementation
> > >> choices,
> > >> > > we
> > >> > > > should instead think of how to give the user the good behavior
> > >> > > > automatically. I think the use case for configuring a the GCing
> > pool
> > >> > > would
> > >> > > > be if you discovered a bug in which memory leaked. But this
> isn't
> > >> > > something
> > >> > > > the user should have to think about right? If there is a bug we
> > >> should
> > >> > > find
> > >> > > > and fix it.
> > >> > > >
> > >> > > > -Jay
> > >> > > >
> > >> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <
> > radai.rosenblatt@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > > jun's #1 + rajini's #11 - the new config param is to enable
> > >> changing
> > >> > > the
> > >> > > > > pool implentation class. as i said in my response to jun i
> will
> > >> make
> > >> > > the
> > >> > > > > default pool impl be the simple one, and this param is to
> allow
> > a
> > >> > user
> > >> > > > > (more likely a dev) to change it.
> > >> > > > > both the simple pool and the "gc pool" make basically just an
> > >> > > > > AtomicLong.get() + (hashmap.put for gc) calls before
> returning a
> > >> > > buffer.
> > >> > > > > there is absolutely no dependency on GC times in allocating
> (or
> > >> not).
> > >> > > the
> > >> > > > > extra background thread in the gc pool is forever asleep
> unless
> > >> there
> > >> > > are
> > >> > > > > bugs (==leaks) so the extra cost is basically nothing (backed
> by
> > >> > > > > benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED
> MUST
> > >> > ALWAYS
> > >> > > > BE
> > >> > > > > RELEASED - so the gc pool should not rely on gc for reclaiming
> > >> > buffers.
> > >> > > > its
> > >> > > > > a bug detector, not a feature and is definitely not intended
> to
> > >> hide
> > >> > > > bugs -
> > >> > > > > the exact opposite - its meant to expose them sooner. i've
> > cleaned
> > >> up
> > >> > > the
> > >> > > > > docs to avoid this confusion. i also like the fail on leak.
> will
> > >> do.
> > >> > > > > as for the gap between pool size and heap size - thats a valid
> > >> > > argument.
> > >> > > > > may allow also sizing the pool as % of heap size? so
> > >> > queued.max.bytes =
> > >> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of
> available
> > >> > heap?
> > >> > > > >
> > >> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still
> > >> holds,
> > >> > > > > assuming the ssl-related buffers are small. the largest
> > weakness in
> > >> > > this
> > >> > > > > claim has to do with decompression rather than anything
> > >> ssl-related.
> > >> > so
> > >> > > > yes
> > >> > > > > there is an O(#ssl connections * sslEngine packet size)
> > component,
> > >> > but
> > >> > > i
> > >> > > > > think its small. again - decompression should be the concern.
> > >> > > > >
> > >> > > > > rajini's #13 - interesting optimization. the problem is
> there's
> > no
> > >> > > > knowing
> > >> > > > > in advance what the _next_ request to come out of a socket is,
> > so
> > >> > this
> > >> > > > > would mute just those sockets that are 1. mutable and 2. have
> a
> > >> > > > > buffer-demanding request for which we could not allocate a
> > buffer.
> > >> > > > downside
> > >> > > > > is that as-is this would cause the busy-loop on poll() that
> the
> > >> mutes
> > >> > > > were
> > >> > > > > supposed to prevent - or code would need to be added to
> > ad-hocmute
> > >> a
> > >> > > > > connection that was so-far unmuted but has now generated a
> > >> > > > memory-demanding
> > >> > > > > request?
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > >> > > > > rajinisivaram@googlemail.com> wrote:
> > >> > > > >
> > >> > > > > > Radai,
> > >> > > > > >
> > >> > > > > > 11. The KIP talks about a new server configuration parameter
> > >> > > > > > *memory.pool.class.name
> > >> > > > > > <http://memory.pool.class.name> *which is not in the
> > >> > implementation.
> > >> > > > Is
> > >> > > > > it
> > >> > > > > > still the case that the pool will be configurable?
> > >> > > > > >
> > >> > > > > > 12. Personally I would prefer not to have a garbage
> collected
> > >> pool
> > >> > > that
> > >> > > > > > hides bugs as well. Apart from the added code complexity and
> > >> extra
> > >> > > > thread
> > >> > > > > > to handle collections, I am also concerned about the
> > >> > > non-deterministic
> > >> > > > > > nature of GC timings. The KIP introduces delays in
> processing
> > >> > > requests
> > >> > > > > > based on the configuration parameter *queued.max.bytes.
> *This
> > in
> > >> > > > > unrelated
> > >> > > > > > to the JVM heap size and hence pool can be full when there
> is
> > no
> > >> > > > pressure
> > >> > > > > > on the JVM to garbage collect. The KIP does not prevent
> other
> > >> > > timeouts
> > >> > > > in
> > >> > > > > > the broker (eg. consumer session timeout) because it is
> > relying
> > >> on
> > >> > > the
> > >> > > > > pool
> > >> > > > > > to be managed in a deterministic, timely manner. Since a
> > garbage
> > >> > > > > collected
> > >> > > > > > pool cannot provide that guarantee, wouldn't it be better to
> > run
> > >> > > tests
> > >> > > > > with
> > >> > > > > > a GC-pool that perhaps fails with a fatal error if it
> > encounters
> > >> a
> > >> > > > buffer
> > >> > > > > > that was not released?
> > >> > > > > >
> > >> > > > > > 13. The implementation currently mutes all channels that
> don't
> > >> > have a
> > >> > > > > > receive buffer allocated. Would it make sense to mute only
> the
> > >> > > channels
> > >> > > > > > that need a buffer (i.e. allow channels to read the 4-byte
> > size
> > >> > that
> > >> > > is
> > >> > > > > not
> > >> > > > > > read using the pool) so that normal client connection
> close()
> > is
> > >> > > > handled
> > >> > > > > > even when the pool is full? Since the extra 4-bytes may
> > already
> > >> be
> > >> > > > > > allocated for some connections, the total request memory has
> > to
> > >> > take
> > >> > > > into
> > >> > > > > > account *4*numConnections* bytes anyway.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <jun@confluent.io
> >
> > >> > wrote:
> > >> > > > > >
> > >> > > > > > > Hi, Radai,
> > >> > > > > > >
> > >> > > > > > > 1. Yes, I am concerned about the trickiness of having to
> > deal
> > >> > with
> > >> > > > > weak
> > >> > > > > > > refs. I think it's simpler to just have the simple version
> > >> > > > instrumented
> > >> > > > > > > with enough debug/trace logging and do enough stress
> > testing.
> > >> > Since
> > >> > > > we
> > >> > > > > > > still have queued.max.requests, one can always fall back
> to
> > >> that
> > >> > > if a
> > >> > > > > > > memory leak issue is identified. We could also label the
> > >> feature
> > >> > as
> > >> > > > > beta
> > >> > > > > > if
> > >> > > > > > > we don't think this is production ready.
> > >> > > > > > >
> > >> > > > > > > 2.2 I am just wondering after we fix that issue whether
> the
> > >> claim
> > >> > > > that
> > >> > > > > > the
> > >> > > > > > > request memory is bounded by  queued.max.bytes +
> > >> > > > > socket.request.max.bytes
> > >> > > > > > > is still true.
> > >> > > > > > >
> > >> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > >
> > >> > > > > > > Jun
> > >> > > > > > >
> > >> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> > >> > radai.rosenblatt@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Jun,
> > >> > > > > > > >
> > >> > > > > > > > Thank you for taking the time to review this.
> > >> > > > > > > >
> > >> > > > > > > > 1. short version - yes, the concern is bugs, but the
> cost
> > is
> > >> > tiny
> > >> > > > and
> > >> > > > > > > worth
> > >> > > > > > > > it, and its a common pattern. long version:
> > >> > > > > > > >    1.1 detecting these types of bugs (leaks) cannot be
> > easily
> > >> > > done
> > >> > > > > with
> > >> > > > > > > > simple testing, but requires stress/stability tests that
> > run
> > >> > for
> > >> > > a
> > >> > > > > long
> > >> > > > > > > > time (long enough to hit OOM, depending on leak size and
> > >> > > available
> > >> > > > > > > memory).
> > >> > > > > > > > this is why some sort of leak detector is "standard
> > practice"
> > >> > > .for
> > >> > > > > > > example
> > >> > > > > > > > look at netty (http://netty.io/wiki/
> > >> reference-counted-objects.
> > >> > > > > > > > html#leak-detection-levels)
> > >> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
> > >> > > > > > > html#leak-detection-levels
> > >> > > > > > > > >-
> > >> > > > > > > > they have way more complicated built-in leak detection
> > >> enabled
> > >> > by
> > >> > > > > > > default.
> > >> > > > > > > > as a concrete example - during development i did not
> > properly
> > >> > > > dispose
> > >> > > > > > of
> > >> > > > > > > > in-progress KafkaChannel.receive when a connection was
> > >> abruptly
> > >> > > > > closed
> > >> > > > > > > and
> > >> > > > > > > > I only found it because of the log msg printed by the
> > pool.
> > >> > > > > > > >    1.2 I have a benchmark suite showing the performance
> > cost
> > >> of
> > >> > > the
> > >> > > > > gc
> > >> > > > > > > pool
> > >> > > > > > > > is absolutely negligible -
> > >> > > > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > >> > > > > > > > tree/master/memorypool-benchmarks
> > >> > > > > > > >    1.3 as for the complexity of the impl - its just ~150
> > >> lines
> > >> > > and
> > >> > > > > > pretty
> > >> > > > > > > > straight forward. i think the main issue is that not
> many
> > >> > people
> > >> > > > are
> > >> > > > > > > > familiar with weak refs and ref queues.
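(For readers unfamiliar with the pattern referred to in 1.3, a rough sketch of a
leak-detecting pool built on weak references and a reference queue; class and
method names here are assumptions for illustration, not the actual
GarbageCollectedMemoryPool code:)

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: every outstanding buffer is tracked via a WeakReference registered
    // with a ReferenceQueue. A buffer that gets garbage collected without having
    // been release()d shows up on the queue, i.e. a leak was found.
    final class LeakDetectingPoolSketch {
        private final AtomicLong availableBytes;
        private final ReferenceQueue<ByteBuffer> collected = new ReferenceQueue<>();
        private final Map<Reference<ByteBuffer>, Integer> outstanding = new ConcurrentHashMap<>();

        LeakDetectingPoolSketch(long maxBytes) {
            this.availableBytes = new AtomicLong(maxBytes);
        }

        ByteBuffer tryAllocate(int sizeBytes) {
            if (availableBytes.get() <= 0)
                return null;
            availableBytes.addAndGet(-sizeBytes);
            ByteBuffer buffer = ByteBuffer.allocate(sizeBytes);
            outstanding.put(new WeakReference<>(buffer, collected), sizeBytes);
            return buffer;
        }

        void release(ByteBuffer buffer) {
            availableBytes.addAndGet(buffer.capacity());
            // O(n) lookup keeps the sketch short; a real pool would use an
            // identity-keyed structure instead of scanning.
            for (Reference<ByteBuffer> ref : outstanding.keySet()) {
                if (ref.get() == buffer) {
                    outstanding.remove(ref);
                    ref.clear(); // properly released, so it must never look like a leak
                    break;
                }
            }
        }

        // background thread: stays blocked forever unless a leak actually happens
        void watchForLeaks() throws InterruptedException {
            while (true) {
                Reference<? extends ByteBuffer> ref = collected.remove();
                Integer size = outstanding.remove(ref);
                if (size != null) {
                    System.err.println("leak: " + size + " bytes were never release()d");
                    availableBytes.addAndGet(size); // reclaim so the pool does not shrink forever
                }
            }
        }
    }

(The reference queue only ever receives entries for buffers that were collected
without release() being called, so the watcher thread costs nothing unless there
is a bug, which is exactly the "bug detector, not a feature" point above.)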
> > >> > > > > > > >
> > >> > > > > > > >    how about making the pool impl class a config param
> > >> > (generally
> > >> > > > > good
> > >> > > > > > > > going forward), make the default be the simple pool, and
> > keep
> > >> > the
> > >> > > > GC
> > >> > > > > > one
> > >> > > > > > > as
> > >> > > > > > > > a dev/debug/triage aid?
> > >> > > > > > > >
> > >> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at all -
> > its
> > >> an
> > >> > > > > > > > implementation detail. as for my current patch, it has
> > some
> > >> > > minimal
> > >> > > > > > > > treatment of SSL - just enough to not mute SSL sockets
> > >> > > > mid-handshake
> > >> > > > > -
> > >> > > > > > > but
> > >> > > > > > > > the code in SslTransportLayer still allocates buffers
> > itself.
> > >> > it
> > >> > > is
> > >> > > > > my
> > >> > > > > > > > understanding that netReadBuffer/appReadBuffer shouldn't
> > grow
> > >> > > > beyond
> > >> > > > > 2
> > >> > > > > > x
> > >> > > > > > > > sslEngine.getSession().getPacketBufferSize(), which i
> > assume
> > >> > to
> > >> > > be
> > >> > > > > > > small.
> > >> > > > > > > > they are also long lived (they live for the duration of
> > the
> > >> > > > > connection)
> > >> > > > > > > > which makes a poor fit for pooling. the bigger fish to
> > fry i
> > >> > > think
> > >> > > > is
> > >> > > > > > > > decompression - you could read a 1MB blob into a
> > >> pool-provided
> > >> > > > buffer
> > >> > > > > > and
> > >> > > > > > > > then decompress it into 10MB of heap allocated on the
> spot
> > >> :-)
> > >> > > > also,
> > >> > > > > > the
> > >> > > > > > > > ssl code is extremely tricky.
> > >> > > > > > > >    2.2 just to make sure, youre talking about
> > Selector.java:
> > >> > > while
> > >> > > > > > > > ((networkReceive = channel.read()) != null)
> > >> > > > > > addToStagedReceives(channel,
> > >> > > > > > > > networkReceive); ? if so youre right, and i'll fix that
> > >> > (probably
> > >> > > > by
> > >> > > > > > > > something similar to immediatelyConnectedKeys, not sure
> > yet)
> > >> > > > > > > >
> > >> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add
> > javadocs
> > >> and
> > >> > > > > update
> > >> > > > > > > the
> > >> > > > > > > > wiki). isLowOnMem is basically the point where I start
> > >> > > randomizing
> > >> > > > > the
> > >> > > > > > > > selection key handling order to avoid potential
> > starvation.
> > >> its
> > >> > > > > rather
> > >> > > > > > > > arbitrary and now that i think of it should probably not
> > >> exist
> > >> > > and
> > >> > > > be
> > >> > > > > > > > entirely contained in Selector (where the shuffling
> takes
> > >> > place).
> > >> > > > > will
> > >> > > > > > > fix.
> > >> > > > > > > >
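(A small sketch of the shuffling mentioned in point 3: randomize the order in
which ready keys are handled while memory is scarce, so the same connections are
not always served first while others starve. Names are assumptions, not the
actual Selector change:)

    import java.nio.channels.SelectionKey;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;

    final class KeyOrderSketch {
        // decide the processing order for this poll() iteration
        static List<SelectionKey> processingOrder(Set<SelectionKey> readyKeys, boolean lowOnMemory) {
            List<SelectionKey> ordered = new ArrayList<>(readyKeys);
            if (lowOnMemory)
                Collections.shuffle(ordered); // under memory pressure, don't favour the same sockets
            return ordered;                   // otherwise keep the selector's natural order
        }
    }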
> > >> > > > > > > > 4. will do.
> > >> > > > > > > >
> > >> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically
> > >> anything
> > >> > > > <=0).
> > >> > > > > > > > Long.MAX_VALUE would still create a pool, that would
> still
> > >> > waste
> > >> > > > time
> > >> > > > > > > > tracking resources. I dont really mind though if you
> have
> > a
> > >> > > > preferred
> > >> > > > > > > magic
> > >> > > > > > > > value for off.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <
> jun@confluent.io
> > >
> > >> > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi, Radai,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for the KIP. Some comments below.
> > >> > > > > > > > >
> > >> > > > > > > > > 1. The KIP says "to facilitate faster implementation
> > (as a
> > >> > > safety
> > >> > > > > > net)
> > >> > > > > > > > the
> > >> > > > > > > > > pool will be implemented in such a way that memory
> that
> > was
> > >> > not
> > >> > > > > > > > release()ed
> > >> > > > > > > > > (but still garbage collected) would be detected and
> > >> > > "reclaimed".
> > >> > > > > this
> > >> > > > > > > is
> > >> > > > > > > > to
> > >> > > > > > > > > prevent "leaks" in case of code paths that fail to
> > >> release()
> > >> > > > > > > properly.".
> > >> > > > > > > > > What are the cases that could cause memory leaks? If
> we
> > are
> > >> > > > > concerned
> > >> > > > > > > > about
> > >> > > > > > > > > bugs, it seems that it's better to just do more
> testing
> > to
> > >> > make
> > >> > > > > sure
> > >> > > > > > > the
> > >> > > > > > > > > usage of the simple implementation (SimpleMemoryPool)
> is
> > >> > solid
> > >> > > > > > instead
> > >> > > > > > > of
> > >> > > > > > > > > adding more complicated logic
> > (GarbageCollectedMemoryPool)
> > >> to
> > >> > > > hide
> > >> > > > > > the
> > >> > > > > > > > > potential bugs.
> > >> > > > > > > > >
> > >> > > > > > > > > 2. I am wondering how much this KIP covers the SSL
> > channel
> > >> > > > > > > > implementation.
> > >> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> > >> > netWriteBuffer,
> > >> > > > > > > > > appReadBuffer per socket. Should those memory be
> > accounted
> > >> > for
> > >> > > in
> > >> > > > > > > memory
> > >> > > > > > > > > pool?
> > >> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> > >> > > > KafkaChannel.read(),
> > >> > > > > > > it's
> > >> > > > > > > > > possible for multiple NetworkReceives to be returned
> > since
> > >> > > > multiple
> > >> > > > > > > > > requests' data could be encrypted together by SSL. To
> > deal
> > >> > with
> > >> > > > > this,
> > >> > > > > > > we
> > >> > > > > > > > > stash those NetworkReceives in Selector.stagedReceives
> > and
> > >> > give
> > >> > > > it
> > >> > > > > > back
> > >> > > > > > > > to
> > >> > > > > > > > > the poll() call one NetworkReceive at a time. What
> this
> > >> means
> > >> > > is
> > >> > > > > > that,
> > >> > > > > > > if
> > >> > > > > > > > > we stop reading from KafkaChannel in the middle
> because
> > >> > memory
> > >> > > > pool
> > >> > > > > > is
> > >> > > > > > > > > full, this channel's key may never get selected for
> > reads
> > >> > (even
> > >> > > > > after
> > >> > > > > > > the
> > >> > > > > > > > > read interest is turned on), but there are still
> pending
> > >> data
> > >> > > for
> > >> > > > > the
> > >> > > > > > > > > channel, which will never get processed.
> > >> > > > > > > > >
> > >> > > > > > > > > 3. The code has the following two methods in
> MemoryPool,
> > >> > which
> > >> > > > are
> > >> > > > > > not
> > >> > > > > > > > > described in the KIP. Could you explain how they are
> > used
> > >> in
> > >> > > the
> > >> > > > > > wiki?
> > >> > > > > > > > > isLowOnMemory()
> > >> > > > > > > > > isOutOfMemory()
> > >> > > > > > > > >
> > >> > > > > > > > > 4. Could you also describe in the KIP at the high
> level,
> > >> how
> > >> > > the
> > >> > > > > read
> > >> > > > > > > > > interest bit for the socket is turned on/off with
> > respect
> > >> to
> > >> > > > > > > MemoryPool?
> > >> > > > > > > > >
> > >> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> > >> Long.MAX_VALUE?
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks,
> > >> > > > > > > > >
> > >> > > > > > > > > Jun
> > >> > > > > > > > >
> > >> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > >> > > > radai.rosenblatt@gmail.com>
> > >> > > > > > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hi,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I would like to initiate a vote on KIP-72:
> > >> > > > > > > > > >
> > >> > > > > > > > > > https://cwiki.apache.org/
> > confluence/display/KAFKA/KIP-
> > >> > 72%3A+
> > >> > > > > > > > > > Allow+putting+a+bound+on+memor
> y+consumed+by+Incoming+
> > >> > > requests
> > >> > > > > > > > > >
> > >> > > > > > > > > > The kip allows specifying a limit on the amount of
> > memory
> > >> > > > > allocated
> > >> > > > > > > for
> > >> > > > > > > > > > reading incoming requests into. This is useful for
> > >> > "sizing" a
> > >> > > > > > broker
> > >> > > > > > > > and
> > >> > > > > > > > > > avoiding OOMEs under heavy load (as actually happens
> > >> > > > occasionally
> > >> > > > > > at
> > >> > > > > > > > > > linkedin).
> > >> > > > > > > > > >
> > >> > > > > > > > > > I believe I've addressed most (all?) concerns
> brought
> > up
> > >> > > during
> > >> > > > > the
> > >> > > > > > > > > > discussion.
> > >> > > > > > > > > >
> > >> > > > > > > > > > To the best of my understanding this vote is about
> the
> > >> goal
> > >> > > and
> > >> > > > > > > > > > public-facing changes related to the new proposed
> > >> behavior,
> > >> > > but
> > >> > > > > as
> > >> > > > > > > for
> > >> > > > > > > > > > implementation, i have the code up here:
> > >> > > > > > > > > >
> > >> > > > > > > > > > https://github.com/radai-
> > rosenblatt/kafka/tree/broker-
> > >> > memory
> > >> > > > > > > > > > -pool-with-muting
> > >> > > > > > > > > >
> > >> > > > > > > > > > and I've stress-tested it to work properly (meaning
> it
> > >> > chugs
> > >> > > > > along
> > >> > > > > > > and
> > >> > > > > > > > > > throttles under loads that would DOS 10.0.1.0 code).
> > >> > > > > > > > > >
> > >> > > > > > > > > > I also believe that the primitives and "pattern"s
> > >> > introduced
> > >> > > in
> > >> > > > > > this
> > >> > > > > > > > KIP
> > >> > > > > > > > > > (namely the notion of a buffer pool and retrieving
> > from /
> > >> > > > > releasing
> > >> > > > > > > to
> > >> > > > > > > > > said
> > >> > > > > > > > > > pool instead of allocating memory) are generally
> > useful
> > >> > > beyond
> > >> > > > > the
> > >> > > > > > > > scope
> > >> > > > > > > > > of
> > >> > > > > > > > > > this KIP for both performance issues (allocating
> lots
> > of
> > >> > > > > > short-lived
> > >> > > > > > > > > large
> > >> > > > > > > > > > buffers is a performance bottleneck) and other areas
> > >> where
> > >> > > > memory
> > >> > > > > > > > limits
> > >> > > > > > > > > > are a problem (KIP-81)
> > >> > > > > > > > > >
> > >> > > > > > > > > > Thank you,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Radai.
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Regards,
> > >> > > > > >
> > >> > > > > > Rajini
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Regards,
> > >> > >
> > >> > > Rajini
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Regards,
> > >>
> > >> Rajini
> > >>
> >
>
>
>
> --
> Regards,
>
> Rajini
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Rajini Sivaram <ra...@googlemail.com>.
+1

Thank you for the KIP, Radai.

On Mon, Nov 14, 2016 at 6:07 PM, Mickael Maison <mi...@gmail.com>
wrote:

> +1. We've also been hit by OOMs on the broker because we were not able
> to properly bound its memory usage.
>
> On Mon, Nov 14, 2016 at 5:56 PM, radai <ra...@gmail.com> wrote:
> > @rajini - fixed the hasBytesBuffered() method. also updated poll() so
> that
> > no latency is added for picking up data stuck in ssl buffers (timeout is
> > set to 0, just like with immediately connected keys and staged receives).
> > thank you for pointing these out.
> > added ssl (re) testing to the KIP testing plan.
> >
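(Roughly, the "timeout is set to 0" behaviour described above can be sketched
like this; the interface and names are assumptions for illustration, not the
actual Selector/KafkaChannel code:)

    import java.util.Collection;

    final class PollTimeoutSketch {
        interface BufferedChannel {
            // e.g. plaintext already unwrapped into the ssl application/network buffers
            boolean hasBytesBuffered();
        }

        // if any channel already has data sitting in its buffers, poll with a zero
        // timeout so that data is picked up immediately instead of waiting
        static long effectiveTimeout(long requestedMs, Collection<? extends BufferedChannel> channels) {
            for (BufferedChannel channel : channels) {
                if (channel.hasBytesBuffered())
                    return 0L;
            }
            return requestedMs;
        }
    }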
> >
> >
> >
> > On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> > rajinisivaram@googlemail.com> wrote:
> >
> >> Open point 1. I would just retain the current long value that specifies
> >> queued.max.bytes as long and not as %heap since it is simple and easy to
> >> use. And keeps it consistent with other ".bytes" configs.
> >>
> >> Point 3. ssl buffers - I am not quite sure the implementation looks
> >> correct. hasBytesBuffered() is checking position() of buffers == 0. And
> the
> >> code checks this only when poll with a timeout returns (adding a delay
> when
> >> there is nothing else to read).
> >> But since this and open point 2 (optimization) are implementation
> details,
> >> they can be looked at during PR review.
> >>
> >> It will be good to add SSL testing to the test plan as well, since
> there is
> >> additional code to test for SSL.
> >>
> >>
> >> On Fri, Nov 11, 2016 at 9:03 PM, radai <ra...@gmail.com>
> wrote:
> >>
> >> > ok, i've made the following changes:
> >> >
> >> > 1. memory.pool.class.name has been removed
> >> > 2. the code now only uses SimpleMemoryPool. the gc variant is left
> >> (unused)
> >> > as a development aid and is unsettable via configuration.
> >> > 3. I've resolved the issue of stale data getting stuck in intermediate
> >> > (ssl) buffers.
> >> > 4. default value for queued.max.bytes is -1, so off by default. any
> <=0
> >> > value is interpreted as off by the underlying code.
> >> >
> >> > open points:
> >> >
> >> > 1. the kafka config framework doesnt allow a value to be either long
> or
> >> > double, so in order to pull off the queued.max.bytes = 1000000 or
> >> > queued.max.bytes = 0.3 thing i'd need to define the config as type
> >> string,
> >> > which is ugly to me. do we want to support setting queued.max.bytes
> to %
> >> of
> >> > heap ? if so, by way of making queued.max.bytes of type string, or by
> way
> >> > of a 2nd config param (with the resulting either/all/combination?
> >> > validation). my personal opinion is string because i think a single
> >> > queued.max.bytes with overloaded meaning is more understandable to
> users.
> >> > i'll await other people's opinions before doing anything.
> >> > 2. i still need to evaluate rajini's optimization. sounds doable.
> >> >
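(On open point 1, a sketch of how a single string-typed queued.max.bytes could be
interpreted either as an absolute byte count or as a fraction of the heap; the
parsing rules shown are assumptions for illustration, not a settled proposal:)

    // "1000000" -> 1000000 bytes, "0.25" -> 25% of the max heap, <= 0 -> disabled
    final class QueuedMaxBytesSketch {
        static long resolve(String configuredValue) {
            double parsed = Double.parseDouble(configuredValue.trim());
            if (parsed > 0 && parsed < 1)
                return (long) (parsed * Runtime.getRuntime().maxMemory()); // fraction of heap
            return (long) parsed; // absolute bytes, or <= 0 meaning "off"
        }
    }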
> >> > asides:
> >> >
> >> > 1. i think you guys misunderstood the intent behind the gc pool. it
> was
> >> > never meant to be a magic pool that automatically releases buffers
> >> (because
> >> > just as rajini stated the performance implications would be
> horrible). it
> >> > was meant to catch leaks early. since that is indeed a dev-only
> concern
> >> it
> >> > wont ever get used in production.
> >> > 2. i said this on some other kip discussion: i think the nice thing
> about
> >> > the pool API is it "scales" from just keeping a memory bound to
> actually
> >> > re-using buffers without changing the calling code. i think
> >> actually pooling
> >> > large buffers will result in a significant performance impact, but
> thats
> >> > outside the scope of this kip. at that point i think more pool
> >> > implementations (that actually pool) would be written. i agree with
> the
> >> > ideal of exposing as few knobs as possible, but switching pools (or
> pool
> >> > params) for tuning may happen at some later point.
> >> >
> >> >
> >> >
> >> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> >> > rajinisivaram@googlemail.com> wrote:
> >> >
> >> > > 13. At the moment, I think channels are not muted if:
> >> > >     channel.receive != null && channel.receive.buffer != null
> >> > > This mutes all channels that aren't holding onto a incomplete
> buffer.
> >> > They
> >> > > may or may not have read the 4-byte size.
> >> > >
> >> > > I was thinking you could avoid muting channels if:
> >> > >     channel.receive == null || channel.receive.size.remaining()
> >> > > This will not mute channels that are holding onto a buffer (as
> above).
> >> In
> >> > > addition, it will not mute channels that haven't read the 4-byte
> size.
> >> A
> >> > > client that is closed gracefully while the pool is full will not be
> >> muted
> >> > > in this case and the server can process close without waiting for
> the
> >> > pool
> >> > > to free up. Once the 4-byte size is read, the channel will be muted
> if
> >> > the
> >> > > pool is still out of memory - for each channel, at most one failed
> read
> >> > > attempt would be made while the pool is out of memory. I think this
> >> would
> >> > > also delay muting of SSL channels since they can continue to read
> into
> >> > > their (already allocated) network buffers and unwrap the data and
> block
> >> > > only when they need to allocate a buffer from the pool.
> >> > >
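(A sketch of the muting rule being discussed in point 13: a channel only needs to
be muted once it has read the 4-byte size prefix but could not get a pool buffer
for the payload. Field and method names are assumptions, not the actual
KafkaChannel/NetworkReceive code:)

    import java.nio.ByteBuffer;

    final class MutingRuleSketch {
        static final class InProgressReceive {
            ByteBuffer size;    // the 4-byte length prefix, read without the pool
            ByteBuffer payload; // allocated from the pool once the size is known
        }

        // true = leave the channel readable, false = mute it until memory frees up
        static boolean mayKeepReading(InProgressReceive receive) {
            if (receive == null)
                return true;  // nothing in progress, so a graceful close is still noticed
            if (receive.size.remaining() > 0)
                return true;  // still reading the size prefix, no pool memory needed yet
            return receive.payload != null; // size known: readable only if the pool gave us a buffer
        }
    }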
> >> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io>
> wrote:
> >> > >
> >> > > > Hey Radai,
> >> > > >
> >> > > > +1 on deprecating and eventually removing the old config. The
> >> intention
> >> > > was
> >> > > > absolutely bounding memory usage. I think having two ways of doing
> >> > this,
> >> > > > one that gives a crisp bound on memory and one that is hard to
> reason
> >> > > about
> >> > > > is pretty confusing. I think people will really appreciate having
> one
> >> > > > config which instead lets them directly control the thing they
> >> actually
> >> > > > care about (memory).
> >> > > >
> >> > > > I also want to second Jun's concern on the complexity of the
> >> self-GCing
> >> > > > memory pool. I wrote the memory pool for the producer. In that
> area
> >> the
> >> > > > pooling of messages is the single biggest factor in performance of
> >> the
> >> > > > client so I believed it was worth some sophistication/complexity
> if
> >> > there
> >> > > > was performance payoff. All the same, the complexity of that code
> has
> >> > > made
> >> > > > it VERY hard to keep correct (it gets broken roughly every other
> time
> >> > > > someone makes a change). Over time I came to feel a lot less
> proud of
> >> > my
> >> > > > cleverness. I learned something interesting reading your
> self-GCing
> >> > > memory
> >> > > > pool, but I wonder if the complexity is worth the payoff in this
> >> case?
> >> > > >
> >> > > > Philosophically we've tried really hard to avoid needlessly
> >> "pluggable"
> >> > > > implementations. That is, when there is a temptation to give a
> config
> >> > > that
> >> > > > plugs in different Java classes at run time for implementation
> >> choices,
> >> > > we
> >> > > > should instead think of how to give the user the good behavior
> >> > > > automatically. I think the use case for configuring a the GCing
> pool
> >> > > would
> >> > > > be if you discovered a bug in which memory leaked. But this isn't
> >> > > something
> >> > > > the user should have to think about right? If there is a bug we
> >> should
> >> > > find
> >> > > > and fix it.
> >> > > >
> >> > > > -Jay
> >> > > >
> >> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <
> radai.rosenblatt@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > jun's #1 + rajini's #11 - the new config param is to enable
> >> changing
> >> > > the
> >> > > > > pool implementation class. as i said in my response to jun i will
> >> make
> >> > > the
> >> > > > > default pool impl be the simple one, and this param is to allow
> a
> >> > user
> >> > > > > (more likely a dev) to change it.
> >> > > > > both the simple pool and the "gc pool" make basically just an
> >> > > > > AtomicLong.get() + (hashmap.put for gc) calls before returning a
> >> > > buffer.
> >> > > > > there is absolutely no dependency on GC times in allocating (or
> >> not).
> >> > > the
> >> > > > > extra background thread in the gc pool is forever asleep unless
> >> there
> >> > > are
> >> > > > > bugs (==leaks) so the extra cost is basically nothing (backed by
> >> > > > > benchmarks). let me re-iterate again - ANY BUFFER ALLOCATED MUST
> >> > ALWAYS
> >> > > > BE
> >> > > > > RELEASED - so the gc pool should not rely on gc for reclaiming
> >> > buffers.
> >> > > > its
> >> > > > > a bug detector, not a feature and is definitely not intended to
> >> hide
> >> > > > bugs -
> >> > > > > the exact opposite - its meant to expose them sooner. i've
> cleaned
> >> up
> >> > > the
> >> > > > > docs to avoid this confusion. i also like the fail on leak. will
> >> do.
> >> > > > > as for the gap between pool size and heap size - thats a valid
> >> > > argument.
> >> > > > > may allow also sizing the pool as % of heap size? so
> >> > queued.max.bytes =
> >> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available
> >> > heap?
> >> > > > >
> >> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still
> >> holds,
> >> > > > > assuming the ssl-related buffers are small. the largest
> weakness in
> >> > > this
> >> > > > > claim has to do with decompression rather than anything
> >> ssl-related.
> >> > so
> >> > > > yes
> >> > > > > there is an O(#ssl connections * sslEngine packet size)
> component,
> >> > but
> >> > > i
> >> > > > > think its small. again - decompression should be the concern.
> >> > > > >
> >> > > > > rajini's #13 - interesting optimization. the problem is there's
> no
> >> > > > knowing
> >> > > > > in advance what the _next_ request to come out of a socket is,
> so
> >> > this
> >> > > > > would mute just those sockets that are 1. mutable and 2. have a
> >> > > > > buffer-demanding request for which we could not allocate a
> buffer.
> >> > > > downside
> >> > > > > is that as-is this would cause the busy-loop on poll() that the
> >> mutes
> >> > > > were
> >> > > > > supposed to prevent - or code would need to be added to
> ad-hoc mute
> >> a
> >> > > > > connection that was so-far unmuted but has now generated a
> >> > > > memory-demanding
> >> > > > > request?
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> >> > > > > rajinisivaram@googlemail.com> wrote:
> >> > > > >
> >> > > > > > Radai,
> >> > > > > >
> >> > > > > > 11. The KIP talks about a new server configuration parameter
> >> > > > > > *memory.pool.class.name
> >> > > > > > <http://memory.pool.class.name> *which is not in the
> >> > implementation.
> >> > > > Is
> >> > > > > it
> >> > > > > > still the case that the pool will be configurable?
> >> > > > > >
> >> > > > > > 12. Personally I would prefer not to have a garbage collected
> >> pool
> >> > > that
> >> > > > > > hides bugs as well. Apart from the added code complexity and
> >> extra
> >> > > > thread
> >> > > > > > to handle collections, I am also concerned about the
> >> > > non-deterministic
> >> > > > > > nature of GC timings. The KIP introduces delays in processing
> >> > > requests
> >> > > > > > based on the configuration parameter *queued.max.bytes. *This
> in
> >> > > > > unrelated
> >> > > > > > to the JVM heap size and hence pool can be full when there is
> no
> >> > > > pressure
> >> > > > > > on the JVM to garbage collect. The KIP does not prevent other
> >> > > timeouts
> >> > > > in
> >> > > > > > the broker (eg. consumer session timeout) because it is
> relying
> >> on
> >> > > the
> >> > > > > pool
> >> > > > > > to be managed in a deterministic, timely manner. Since a
> garbage
> >> > > > > collected
> >> > > > > > pool cannot provide that guarantee, wouldn't it be better to
> run
> >> > > tests
> >> > > > > with
> >> > > > > > a GC-pool that perhaps fails with a fatal error if it
> encounters
> >> a
> >> > > > buffer
> >> > > > > > that was not released?
> >> > > > > >
> >> > > > > > 13. The implementation currently mutes all channels that don't
> >> > have a
> >> > > > > > receive buffer allocated. Would it make sense to mute only the
> >> > > channels
> >> > > > > > that need a buffer (i.e. allow channels to read the 4-byte
> size
> >> > that
> >> > > is
> >> > > > > not
> >> > > > > > read using the pool) so that normal client connection close()
> is
> >> > > > handled
> >> > > > > > even when the pool is full? Since the extra 4-bytes may
> already
> >> be
> >> > > > > > allocated for some connections, the total request memory has
> to
> >> > take
> >> > > > into
> >> > > > > > account *4*numConnections* bytes anyway.
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io>
> >> > wrote:
> >> > > > > >
> >> > > > > > > Hi, Radai,
> >> > > > > > >
> >> > > > > > > 1. Yes, I am concerned about the trickiness of having to
> deal
> >> > with
> >> > > > > weak
> >> > > > > > > refs. I think it's simpler to just have the simple version
> >> > > > instrumented
> >> > > > > > > with enough debug/trace logging and do enough stress
> testing.
> >> > Since
> >> > > > we
> >> > > > > > > still have queued.max.requests, one can always fall back to
> >> that
> >> > > if a
> >> > > > > > > memory leak issue is identified. We could also label the
> >> feature
> >> > as
> >> > > > > beta
> >> > > > > > if
> >> > > > > > > we don't think this is production ready.
> >> > > > > > >
> >> > > > > > > 2.2 I am just wondering after we fix that issue whether the
> >> claim
> >> > > > that
> >> > > > > > the
> >> > > > > > > request memory is bounded by  queued.max.bytes +
> >> > > > > socket.request.max.bytes
> >> > > > > > > is still true.
> >> > > > > > >
> >> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > > Jun
> >> > > > > > >
> >> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> >> > radai.rosenblatt@gmail.com>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi Jun,
> >> > > > > > > >
> >> > > > > > > > Thank you for taking the time to review this.
> >> > > > > > > >
> >> > > > > > > > 1. short version - yes, the concern is bugs, but the cost
> is
> >> > tiny
> >> > > > and
> >> > > > > > > worth
> >> > > > > > > > it, and its a common pattern. long version:
> >> > > > > > > >    1.1 detecting these types of bugs (leaks) cannot be
> easily
> >> > > done
> >> > > > > with
> >> > > > > > > > simple testing, but requires stress/stability tests that
> run
> >> > for
> >> > > a
> >> > > > > long
> >> > > > > > > > time (long enough to hit OOM, depending on leak size and
> >> > > available
> >> > > > > > > memory).
> >> > > > > > > > this is why some sort of leak detector is "standard
> practice"
> >> > > .for
> >> > > > > > > example
> >> > > > > > > > look at netty (http://netty.io/wiki/
> >> reference-counted-objects.
> >> > > > > > > > html#leak-detection-levels)
> >> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
> >> > > > > > > html#leak-detection-levels
> >> > > > > > > > >-
> >> > > > > > > > they have way more complicated built-in leak detection
> >> enabled
> >> > by
> >> > > > > > > default.
> >> > > > > > > > as a concrete example - during development i did not
> properly
> >> > > > dispose
> >> > > > > > of
> >> > > > > > > > in-progress KafkaChannel.receive when a connection was
> >> abruptly
> >> > > > > closed
> >> > > > > > > and
> >> > > > > > > > I only found it because of the log msg printed by the
> pool.
> >> > > > > > > >    1.2 I have a benchmark suite showing the performance
> cost
> >> of
> >> > > the
> >> > > > > gc
> >> > > > > > > pool
> >> > > > > > > > is absolutely negligible -
> >> > > > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> >> > > > > > > > tree/master/memorypool-benchmarks
> >> > > > > > > >    1.3 as for the complexity of the impl - its just ~150
> >> lines
> >> > > and
> >> > > > > > pretty
> >> > > > > > > > straight forward. i think the main issue is that not many
> >> > people
> >> > > > are
> >> > > > > > > > familiar with weak refs and ref queues.
> >> > > > > > > >
> >> > > > > > > >    how about making the pool impl class a config param
> >> > (generally
> >> > > > > good
> >> > > > > > > > going forward), make the default be the simple pool, and
> keep
> >> > the
> >> > > > GC
> >> > > > > > one
> >> > > > > > > as
> >> > > > > > > > a dev/debug/triage aid?
> >> > > > > > > >
> >> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at all -
> its
> >> an
> >> > > > > > > > implementation detail. as for my current patch, it has
> some
> >> > > minimal
> >> > > > > > > > treatment of SSL - just enough to not mute SSL sockets
> >> > > > mid-handshake
> >> > > > > -
> >> > > > > > > but
> >> > > > > > > > the code in SslTransportLayer still allocates buffers
> itself.
> >> > it
> >> > > is
> >> > > > > my
> >> > > > > > > > understanding that netReadBuffer/appReadBuffer shouldn't
> grow
> >> > > > beyond
> >> > > > > 2
> >> > > > > > x
> >> > > > > > > > sslEngine.getSession().getPacketBufferSize(), which i
> assume
> >> > to
> >> > > be
> >> > > > > > > small.
> >> > > > > > > > they are also long lived (they live for the duration of
> the
> >> > > > > connection)
> >> > > > > > > > which makes a poor fit for pooling. the bigger fish to
> fry i
> >> > > think
> >> > > > is
> >> > > > > > > > decompression - you could read a 1MB blob into a
> >> pool-provided
> >> > > > buffer
> >> > > > > > and
> >> > > > > > > > then decompress it into 10MB of heap allocated on the spot
> >> :-)
> >> > > > also,
> >> > > > > > the
> >> > > > > > > > ssl code is extremely tricky.
> >> > > > > > > >    2.2 just to make sure, youre talking about
> Selector.java:
> >> > > while
> >> > > > > > > > ((networkReceive = channel.read()) != null)
> >> > > > > > addToStagedReceives(channel,
> >> > > > > > > > networkReceive); ? if so youre right, and i'll fix that
> >> > (probably
> >> > > > by
> >> > > > > > > > something similar to immediatelyConnectedKeys, not sure
> yet)
> >> > > > > > > >
> >> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add
> javadocs
> >> and
> >> > > > > update
> >> > > > > > > the
> >> > > > > > > > wiki). isLowOnMem is basically the point where I start
> >> > > randomizing
> >> > > > > the
> >> > > > > > > > selection key handling order to avoid potential
> starvation.
> >> its
> >> > > > > rather
> >> > > > > > > > arbitrary and now that i think of it should probably not
> >> exist
> >> > > and
> >> > > > be
> >> > > > > > > > entirely contained in Selector (where the shuffling takes
> >> > place).
> >> > > > > will
> >> > > > > > > fix.
> >> > > > > > > >
> >> > > > > > > > 4. will do.
> >> > > > > > > >
> >> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically
> >> anything
> >> > > > <=0).
> >> > > > > > > > Long.MAX_VALUE would still create a pool, that would still
> >> > waste
> >> > > > time
> >> > > > > > > > tracking resources. I dont really mind though if you have
> a
> >> > > > preferred
> >> > > > > > > magic
> >> > > > > > > > value for off.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <jun@confluent.io
> >
> >> > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi, Radai,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for the KIP. Some comments below.
> >> > > > > > > > >
> >> > > > > > > > > 1. The KIP says "to facilitate faster implementation
> (as a
> >> > > safety
> >> > > > > > net)
> >> > > > > > > > the
> >> > > > > > > > > pool will be implemented in such a way that memory that
> was
> >> > not
> >> > > > > > > > release()ed
> >> > > > > > > > > (but still garbage collected) would be detected and
> >> > > "reclaimed".
> >> > > > > this
> >> > > > > > > is
> >> > > > > > > > to
> >> > > > > > > > > prevent "leaks" in case of code paths that fail to
> >> release()
> >> > > > > > > properly.".
> >> > > > > > > > > What are the cases that could cause memory leaks? If we
> are
> >> > > > > concerned
> >> > > > > > > > about
> >> > > > > > > > > bugs, it seems that it's better to just do more testing
> to
> >> > make
> >> > > > > sure
> >> > > > > > > the
> >> > > > > > > > > usage of the simple implementation (SimpleMemoryPool) is
> >> > solid
> >> > > > > > instead
> >> > > > > > > of
> >> > > > > > > > > adding more complicated logic
> (GarbageCollectedMemoryPool)
> >> to
> >> > > > hide
> >> > > > > > the
> >> > > > > > > > > potential bugs.
> >> > > > > > > > >
> >> > > > > > > > > 2. I am wondering how much this KIP covers the SSL
> channel
> >> > > > > > > > implementation.
> >> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> >> > netWriteBuffer,
> >> > > > > > > > > appReadBuffer per socket. Should those memory be
> accounted
> >> > for
> >> > > in
> >> > > > > > > memory
> >> > > > > > > > > pool?
> >> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> >> > > > KafkaChannel.read(),
> >> > > > > > > it's
> >> > > > > > > > > possible for multiple NetworkReceives to be returned
> since
> >> > > > multiple
> >> > > > > > > > > requests' data could be encrypted together by SSL. To
> deal
> >> > with
> >> > > > > this,
> >> > > > > > > we
> >> > > > > > > > > stash those NetworkReceives in Selector.stagedReceives
> and
> >> > give
> >> > > > it
> >> > > > > > back
> >> > > > > > > > to
> >> > > > > > > > > the poll() call one NetworkReceive at a time. What this
> >> means
> >> > > is
> >> > > > > > that,
> >> > > > > > > if
> >> > > > > > > > > we stop reading from KafkaChannel in the middle because
> >> > memory
> >> > > > pool
> >> > > > > > is
> >> > > > > > > > > full, this channel's key may never get selected for
> reads
> >> > (even
> >> > > > > after
> >> > > > > > > the
> >> > > > > > > > > read interest is turned on), but there are still pending
> >> data
> >> > > for
> >> > > > > the
> >> > > > > > > > > channel, which will never get processed.
> >> > > > > > > > >
> >> > > > > > > > > 3. The code has the following two methods in MemoryPool,
> >> > which
> >> > > > are
> >> > > > > > not
> >> > > > > > > > > described in the KIP. Could you explain how they are
> used
> >> in
> >> > > the
> >> > > > > > wiki?
> >> > > > > > > > > isLowOnMemory()
> >> > > > > > > > > isOutOfMemory()
> >> > > > > > > > >
> >> > > > > > > > > 4. Could you also describe in the KIP at the high level,
> >> how
> >> > > the
> >> > > > > read
> >> > > > > > > > > interest bit for the socket is turned on/off with
> respect
> >> to
> >> > > > > > > MemoryPool?
> >> > > > > > > > >
> >> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> >> Long.MAX_VALUE?
> >> > > > > > > > >
> >> > > > > > > > > Thanks,
> >> > > > > > > > >
> >> > > > > > > > > Jun
> >> > > > > > > > >
> >> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> >> > > > radai.rosenblatt@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hi,
> >> > > > > > > > > >
> >> > > > > > > > > > I would like to initiate a vote on KIP-72:
> >> > > > > > > > > >
> >> > > > > > > > > > https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> >> > 72%3A+
> >> > > > > > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+
> >> > > requests
> >> > > > > > > > > >
> >> > > > > > > > > > The kip allows specifying a limit on the amount of
> memory
> >> > > > > allocated
> >> > > > > > > for
> >> > > > > > > > > > reading incoming requests into. This is useful for
> >> > "sizing" a
> >> > > > > > broker
> >> > > > > > > > and
> >> > > > > > > > > > avoiding OOMEs under heavy load (as actually happens
> >> > > > occasionally
> >> > > > > > at
> >> > > > > > > > > > linkedin).
> >> > > > > > > > > >
> >> > > > > > > > > > I believe I've addressed most (all?) concerns brought
> up
> >> > > during
> >> > > > > the
> >> > > > > > > > > > discussion.
> >> > > > > > > > > >
> >> > > > > > > > > > To the best of my understanding this vote is about the
> >> goal
> >> > > and
> >> > > > > > > > > > public-facing changes related to the new proposed
> >> behavior,
> >> > > but
> >> > > > > as
> >> > > > > > > for
> >> > > > > > > > > > implementation, i have the code up here:
> >> > > > > > > > > >
> >> > > > > > > > > > https://github.com/radai-
> rosenblatt/kafka/tree/broker-
> >> > memory
> >> > > > > > > > > > -pool-with-muting
> >> > > > > > > > > >
> >> > > > > > > > > > and I've stress-tested it to work properly (meaning it
> >> > chugs
> >> > > > > along
> >> > > > > > > and
> >> > > > > > > > > > throttles under loads that would DOS 10.0.1.0 code).
> >> > > > > > > > > >
> >> > > > > > > > > > I also believe that the primitives and "pattern"s
> >> > introduced
> >> > > in
> >> > > > > > this
> >> > > > > > > > KIP
> >> > > > > > > > > > (namely the notion of a buffer pool and retrieving
> from /
> >> > > > > releasing
> >> > > > > > > to
> >> > > > > > > > > said
> >> > > > > > > > > > pool instead of allocating memory) are generally
> useful
> >> > > beyond
> >> > > > > the
> >> > > > > > > > scope
> >> > > > > > > > > of
> >> > > > > > > > > > this KIP for both performance issues (allocating lots
> of
> >> > > > > > short-lived
> >> > > > > > > > > large
> >> > > > > > > > > > buffers is a performance bottleneck) and other areas
> >> where
> >> > > > memory
> >> > > > > > > > limits
> >> > > > > > > > > > are a problem (KIP-81)
> >> > > > > > > > > >
> >> > > > > > > > > > Thank you,
> >> > > > > > > > > >
> >> > > > > > > > > > Radai.
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Regards,
> >> > > > > >
> >> > > > > > Rajini
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Regards,
> >> > >
> >> > > Rajini
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Regards,
> >>
> >> Rajini
> >>
>



-- 
Regards,

Rajini

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Mickael Maison <mi...@gmail.com>.
+1. We've also been hit by OOMs on the broker because we were not able
to properly bound its memory usage.

On Mon, Nov 14, 2016 at 5:56 PM, radai <ra...@gmail.com> wrote:
> @rajini - fixed the hasBytesBuffered() method. also updated poll() so that
> no latency is added for picking up data stuck in ssl buffers (timeout is
> set to 0, just like with immediately connected keys and staged receives).
> thank you for pointing these out.
> added ssl (re) testing to the KIP testing plan.
>
>
>
>
> On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
> rajinisivaram@googlemail.com> wrote:
>
>> Open point 1. I would just retain the current long value that specifies
>> queued.max.bytes as long and not as %heap since it is simple and easy to
>> use. And keeps it consistent with other ".bytes" configs.
>>
>> Point 3. ssl buffers - I am not quite sure the implementation looks
>> correct. hasBytesBuffered() is checking position() of buffers == 0. And the
>> code checks this only when poll with a timeout returns (adding a delay when
>> there is nothing else to read).
>> But since this and open point 2 (optimization) are implementation details,
>> they can be looked at during PR review.
>>
>> It will be good to add SSL testing to the test plan as well, since there is
>> additional code to test for SSL.
>>
>>
>> On Fri, Nov 11, 2016 at 9:03 PM, radai <ra...@gmail.com> wrote:
>>
>> > ok, i've made the following changes:
>> >
>> > 1. memory.pool.class.name has been removed
>> > 2. the code now only uses SimpleMemoryPool. the gc variant is left
>> (unused)
>> > as a development aid and is unsettable via configuration.
>> > 3. I've resolved the issue of stale data getting stuck in intermediate
>> > (ssl) buffers.
>> > 4. default value for queued.max.bytes is -1, so off by default. any <=0
>> > value is interpreted as off by the underlying code.
>> >
>> > open points:
>> >
>> > 1. the kafka config framework doesnt allow a value to be either long or
>> > double, so in order to pull off the queued.max.bytes = 1000000 or
>> > queued.max.bytes = 0.3 thing i'd need to define the config as type
>> string,
>> > which is ugly to me. do we want to support setting queued.max.bytes to %
>> of
>> > heap ? if so, by way of making queued.max.bytes of type string, or by way
>> > of a 2nd config param (with the resulting either/all/combination?
>> > validation). my personal opinion is string because i think a single
>> > queued.max.bytes with overloaded meaning is more understandable to users.
>> > i'll await other people's opinions before doing anything.
>> > 2. i still need to evaluate rajini's optimization. sounds doable.
>> >
>> > asides:
>> >
>> > 1. i think you guys misunderstood the intent behind the gc pool. it was
>> > never meant to be a magic pool that automatically releases buffers
>> (because
>> > just as rajini stated the performance implications would be horrible). it
>> > was meant to catch leaks early. since that is indeed a dev-only concern
>> it
>> > wont ever get used in production.
>> > 2. i said this on some other kip discussion: i think the nice thing about
>> > the pool API is it "scales" from just keeping a memory bound to actually
>> > re-using buffers without changing the calling code. i think
>> actually pooling
>> > large buffers will result in a significant performance impact, but thats
>> > outside the scope of this kip. at that point i think more pool
>> > implementations (that actually pool) would be written. i agree with the
>> > ideal of exposing as few knobs as possible, but switching pools (or pool
>> > params) for tuning may happen at some later point.
>> >
>> >
>> >
>> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
>> > rajinisivaram@googlemail.com> wrote:
>> >
>> > > 13. At the moment, I think channels are not muted if:
>> > >     channel.receive != null && channel.receive.buffer != null
>> > > This mutes all channels that aren't holding onto an incomplete buffer.
>> > They
>> > > may or may not have read the 4-byte size.
>> > >
>> > > I was thinking you could avoid muting channels if:
>> > >     channel.receive == null || channel.receive.size.remaining()
>> > > This will not mute channels that are holding onto a buffer (as above).
>> In
>> > > addition, it will not mute channels that haven't read the 4-byte size.
>> A
>> > > client that is closed gracefully while the pool is full will not be
>> muted
>> > > in this case and the server can process close without waiting for the
>> > pool
>> > > to free up. Once the 4-byte size is read, the channel will be muted if
>> > the
>> > > pool is still out of memory - for each channel, at most one failed read
>> > > attempt would be made while the pool is out of memory. I think this
>> would
>> > > also delay muting of SSL channels since they can continue to read into
>> > > their (already allocated) network buffers and unwrap the data and block
>> > > only when they need to allocate a buffer from the pool.
>> > >
>> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io> wrote:
>> > >
>> > > > Hey Radai,
>> > > >
>> > > > +1 on deprecating and eventually removing the old config. The
>> intention
>> > > was
>> > > > absolutely bounding memory usage. I think having two ways of doing
>> > this,
>> > > > one that gives a crisp bound on memory and one that is hard to reason
>> > > about
>> > > > is pretty confusing. I think people will really appreciate having one
>> > > > config which instead lets them directly control the thing they
>> actually
>> > > > care about (memory).
>> > > >
>> > > > I also want to second Jun's concern on the complexity of the
>> self-GCing
>> > > > memory pool. I wrote the memory pool for the producer. In that area
>> the
>> > > > pooling of messages is the single biggest factor in performance of
>> the
>> > > > client so I believed it was worth some sophistication/complexity if
>> > there
>> > > > was performance payoff. All the same, the complexity of that code has
>> > > made
>> > > > it VERY hard to keep correct (it gets broken roughly every other time
>> > > > someone makes a change). Over time I came to feel a lot less proud of
>> > my
>> > > > cleverness. I learned something interesting reading your self-GCing
>> > > memory
>> > > > pool, but I wonder if the complexity is worth the payoff in this
>> case?
>> > > >
>> > > > Philosophically we've tried really hard to avoid needlessly
>> "pluggable"
>> > > > implementations. That is, when there is a temptation to give a config
>> > > that
>> > > > plugs in different Java classes at run time for implementation
>> choices,
>> > > we
>> > > > should instead think of how to give the user the good behavior
>> > > > automatically. I think the use case for configuring a the GCing pool
>> > > would
>> > > > be if you discovered a bug in which memory leaked. But this isn't
>> > > something
>> > > > the user should have to think about right? If there is a bug we
>> should
>> > > find
>> > > > and fix it.
>> > > >
>> > > > -Jay
>> > > >
>> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <ra...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > jun's #1 + rajini's #11 - the new config param is to enable
>> changing
>> > > the
>> > > > > pool implementation class. as i said in my response to jun i will
>> make
>> > > the
>> > > > > default pool impl be the simple one, and this param is to allow a
>> > user
>> > > > > (more likely a dev) to change it.
>> > > > > both the simple pool and the "gc pool" make basically just an
>> > > > > AtomicLong.get() + (hashmap.put for gc) calls before returning a
>> > > buffer.
>> > > > > there is absolutely no dependency on GC times in allocating (or
>> not).
>> > > the
>> > > > > extra background thread in the gc pool is forever asleep unless
>> there
>> > > are
>> > > > > bugs (==leaks) so the extra cost is basically nothing (backed by
>> > > > > benchmarks). let me re-iterate again - ANY BUFFER ALLOCATED MUST
>> > ALWAYS
>> > > > BE
>> > > > > RELEASED - so the gc pool should not rely on gc for reclaiming
>> > buffers.
>> > > > its
>> > > > > a bug detector, not a feature and is definitely not intended to
>> hide
>> > > > bugs -
>> > > > > the exact opposite - its meant to expose them sooner. i've cleaned
>> up
>> > > the
>> > > > > docs to avoid this confusion. i also like the fail on leak. will
>> do.
>> > > > > as for the gap between pool size and heap size - thats a valid
>> > > argument.
>> > > > > may allow also sizing the pool as % of heap size? so
>> > queued.max.bytes =
>> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available
>> > heap?
>> > > > >
>> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still
>> holds,
>> > > > > assuming the ssl-related buffers are small. the largest weakness in
>> > > this
>> > > > > claim has to do with decompression rather than anything
>> ssl-related.
>> > so
>> > > > yes
>> > > > > there is an O(#ssl connections * sslEngine packet size) component,
>> > but
>> > > i
>> > > > > think its small. again - decompression should be the concern.
>> > > > >
>> > > > > rajini's #13 - interesting optimization. the problem is there's no
>> > > > knowing
>> > > > > in advance what the _next_ request to come out of a socket is, so
>> > this
>> > > > > would mute just those sockets that are 1. mutable and 2. have a
>> > > > > buffer-demanding request for which we could not allocate a buffer.
>> > > > downside
>> > > > > is that as-is this would cause the busy-loop on poll() that the
>> mutes
>> > > > were
>> > > > > supposed to prevent - or code would need to be added to ad-hoc mute
>> a
>> > > > > connection that was so-far unmuted but has now generated a
>> > > > memory-demanding
>> > > > > request?
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
>> > > > > rajinisivaram@googlemail.com> wrote:
>> > > > >
>> > > > > > Radai,
>> > > > > >
>> > > > > > 11. The KIP talks about a new server configuration parameter
>> > > > > > *memory.pool.class.name
>> > > > > > <http://memory.pool.class.name> *which is not in the
>> > implementation.
>> > > > Is
>> > > > > it
>> > > > > > still the case that the pool will be configurable?
>> > > > > >
>> > > > > > 12. Personally I would prefer not to have a garbage collected
>> pool
>> > > that
>> > > > > > hides bugs as well. Apart from the added code complexity and
>> extra
>> > > > thread
>> > > > > > to handle collections, I am also concerned about the
>> > > non-deterministic
>> > > > > > nature of GC timings. The KIP introduces delays in processing
>> > > requests
>> > > > > > based on the configuration parameter *queued.max.bytes. *This is
>> > > > > unrelated
>> > > > > > to the JVM heap size and hence pool can be full when there is no
>> > > > pressure
>> > > > > > on the JVM to garbage collect. The KIP does not prevent other
>> > > timeouts
>> > > > in
>> > > > > > the broker (eg. consumer session timeout) because it is relying
>> on
>> > > the
>> > > > > pool
>> > > > > > to be managed in a deterministic, timely manner. Since a garbage
>> > > > > collected
>> > > > > > pool cannot provide that guarantee, wouldn't it be better to run
>> > > tests
>> > > > > with
>> > > > > > a GC-pool that perhaps fails with a fatal error if it encounters
>> a
>> > > > buffer
>> > > > > > that was not released?
>> > > > > >
>> > > > > > 13. The implementation currently mutes all channels that don't
>> > have a
>> > > > > > receive buffer allocated. Would it make sense to mute only the
>> > > channels
>> > > > > > that need a buffer (i.e. allow channels to read the 4-byte size
>> > that
>> > > is
>> > > > > not
>> > > > > > read using the pool) so that normal client connection close() is
>> > > > handled
>> > > > > > even when the pool is full? Since the extra 4-bytes may already
>> be
>> > > > > > allocated for some connections, the total request memory has to
>> > take
>> > > > into
>> > > > > > account *4*numConnections* bytes anyway.
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io>
>> > wrote:
>> > > > > >
>> > > > > > > Hi, Radai,
>> > > > > > >
>> > > > > > > 1. Yes, I am concerned about the trickiness of having to deal
>> > with
>> > > > > weak
>> > > > > > > refs. I think it's simpler to just have the simple version
>> > > > instrumented
>> > > > > > > with enough debug/trace logging and do enough stress testing.
>> > Since
>> > > > we
>> > > > > > > still have queued.max.requests, one can always fall back to
>> that
>> > > if a
>> > > > > > > memory leak issue is identified. We could also label the
>> feature
>> > as
>> > > > > beta
>> > > > > > if
>> > > > > > > we don't think this is production ready.
>> > > > > > >
>> > > > > > > 2.2 I am just wondering after we fix that issue whether the
>> claim
>> > > > that
>> > > > > > the
>> > > > > > > request memory is bounded by  queued.max.bytes +
>> > > > > socket.request.max.bytes
>> > > > > > > is still true.
>> > > > > > >
>> > > > > > > 5. Ok, leaving the default as -1 is fine then.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
>> > radai.rosenblatt@gmail.com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Jun,
>> > > > > > > >
>> > > > > > > > Thank you for taking the time to review this.
>> > > > > > > >
>> > > > > > > > 1. short version - yes, the concern is bugs, but the cost is
>> > tiny
>> > > > and
>> > > > > > > worth
>> > > > > > > > it, and its a common pattern. long version:
>> > > > > > > >    1.1 detecting these types of bugs (leaks) cannot be easily
>> > > done
>> > > > > with
>> > > > > > > > simple testing, but requires stress/stability tests that run
>> > for
>> > > a
>> > > > > long
>> > > > > > > > time (long enough to hit OOM, depending on leak size and
>> > > available
>> > > > > > > memory).
>> > > > > > > > this is why some sort of leak detector is "standard practice"
>> > > .for
>> > > > > > > example
>> > > > > > > > look at netty (http://netty.io/wiki/
>> reference-counted-objects.
>> > > > > > > > html#leak-detection-levels)
>> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
>> > > > > > > html#leak-detection-levels
>> > > > > > > > >-
>> > > > > > > > they have way more complicated built-in leak detection
>> enabled
>> > by
>> > > > > > > default.
>> > > > > > > > as a concrete example - during development i did not properly
>> > > > dispose
>> > > > > > of
>> > > > > > > > in-progress KafkaChannel.receive when a connection was
>> abruptly
>> > > > > closed
>> > > > > > > and
>> > > > > > > > I only found it because of the log msg printed by the pool.
>> > > > > > > >    1.2 I have a benchmark suite showing the performance cost
>> of
>> > > the
>> > > > > gc
>> > > > > > > pool
>> > > > > > > > is absolutely negligible -
>> > > > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
>> > > > > > > > tree/master/memorypool-benchmarks
>> > > > > > > >    1.3 as for the complexity of the impl - its just ~150
>> lines
>> > > and
>> > > > > > pretty
>> > > > > > > > straight forward. i think the main issue is that not many
>> > people
>> > > > are
>> > > > > > > > familiar with weak refs and ref queues.
>> > > > > > > >
>> > > > > > > >    how about making the pool impl class a config param
>> > (generally
>> > > > > good
>> > > > > > > > going forward), make the default be the simple pool, and keep
>> > the
>> > > > GC
>> > > > > > one
>> > > > > > > as
>> > > > > > > > a dev/debug/triage aid?
>> > > > > > > >
>> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at all - its
>> an
>> > > > > > > > implementation detail. as for my current patch, it has some
>> > > minimal
>> > > > > > > > treatment of SSL - just enough to not mute SSL sockets
>> > > > mid-handshake
>> > > > > -
>> > > > > > > but
>> > > > > > > > the code in SslTransportLayer still allocates buffers itself.
>> > it
>> > > is
>> > > > > my
>> > > > > > > > understanding that netReadBuffer/appReadBuffer shouldn't grow
>> > > > beyond
>> > > > > 2
>> > > > > > x
>> > > > > > > > sslEngine.getSession().getPacketBufferSize(), which i assume
>> > to
>> > > be
>> > > > > > > small.
>> > > > > > > > they are also long lived (they live for the duration of the
>> > > > > connection)
>> > > > > > > > which makes a poor fit for pooling. the bigger fish to fry i
>> > > think
>> > > > is
>> > > > > > > > decompression - you could read a 1MB blob into a
>> pool-provided
>> > > > buffer
>> > > > > > and
>> > > > > > > > then decompress it into 10MB of heap allocated on the spot
>> :-)
>> > > > also,
>> > > > > > the
>> > > > > > > > ssl code is extremely tricky.
>> > > > > > > >    2.2 just to make sure, youre talking about Selector.java:
>> > > while
>> > > > > > > > ((networkReceive = channel.read()) != null)
>> > > > > > addToStagedReceives(channel,
>> > > > > > > > networkReceive); ? if so youre right, and i'll fix that
>> > (probably
>> > > > by
>> > > > > > > > something similar to immediatelyConnectedKeys, not sure yet)
>> > > > > > > >
>> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add javadocs
>> and
>> > > > > update
>> > > > > > > the
>> > > > > > > > wiki). isLowOnMem is basically the point where I start
>> > > randomizing
>> > > > > the
>> > > > > > > > selection key handling order to avoid potential starvation.
>> its
>> > > > > rather
>> > > > > > > > arbitrary and now that i think of it should probably not
>> exist
>> > > and
>> > > > be
>> > > > > > > > entirely contained in Selector (where the shuffling takes
>> > place).
>> > > > > will
>> > > > > > > fix.
>> > > > > > > >
>> > > > > > > > 4. will do.
>> > > > > > > >
>> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically
>> anything
>> > > > <=0).
>> > > > > > > > Long.MAX_VALUE would still create a pool, that would still
>> > waste
>> > > > time
>> > > > > > > > tracking resources. I dont really mind though if you have a
>> > > > preferred
>> > > > > > > magic
>> > > > > > > > value for off.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io>
>> > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi, Radai,
>> > > > > > > > >
>> > > > > > > > > Thanks for the KIP. Some comments below.
>> > > > > > > > >
>> > > > > > > > > 1. The KIP says "to facilitate faster implementation (as a
>> > > safety
>> > > > > > net)
>> > > > > > > > the
>> > > > > > > > > pool will be implemented in such a way that memory that was
>> > not
>> > > > > > > > release()ed
>> > > > > > > > > (but still garbage collected) would be detected and
>> > > "reclaimed".
>> > > > > this
>> > > > > > > is
>> > > > > > > > to
>> > > > > > > > > prevent "leaks" in case of code paths that fail to
>> release()
>> > > > > > > properly.".
>> > > > > > > > > What are the cases that could cause memory leaks? If we are
>> > > > > concerned
>> > > > > > > > about
>> > > > > > > > > bugs, it seems that it's better to just do more testing to
>> > make
>> > > > > sure
>> > > > > > > the
>> > > > > > > > > usage of the simple implementation (SimpleMemoryPool) is
>> > solid
>> > > > > > instead
>> > > > > > > of
>> > > > > > > > > adding more complicated logic (GarbageCollectedMemoryPool)
>> to
>> > > > hide
>> > > > > > the
>> > > > > > > > > potential bugs.
>> > > > > > > > >
>> > > > > > > > > 2. I am wondering how much this KIP covers the SSL channel
>> > > > > > > > implementation.
>> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
>> > netWriteBuffer,
>> > > > > > > > > appReadBuffer per socket. Should those memory be accounted
>> > for
>> > > in
>> > > > > > > memory
>> > > > > > > > > pool?
>> > > > > > > > > 2.2 One tricky thing with SSL is that during a
>> > > > KafkaChannel.read(),
>> > > > > > > it's
>> > > > > > > > > possible for multiple NetworkReceives to be returned since
>> > > > multiple
>> > > > > > > > > requests' data could be encrypted together by SSL. To deal
>> > with
>> > > > > this,
>> > > > > > > we
>> > > > > > > > > stash those NetworkReceives in Selector.stagedReceives and
>> > give
>> > > > it
>> > > > > > back
>> > > > > > > > to
>> > > > > > > > > the poll() call one NetworkReceive at a time. What this
>> means
>> > > is
>> > > > > > that,
>> > > > > > > if
>> > > > > > > > > we stop reading from KafkaChannel in the middle because
>> > memory
>> > > > pool
>> > > > > > is
>> > > > > > > > > full, this channel's key may never get selected for reads
>> > (even
>> > > > > after
>> > > > > > > the
>> > > > > > > > > read interest is turned on), but there are still pending
>> data
>> > > for
>> > > > > the
>> > > > > > > > > channel, which will never get processed.
>> > > > > > > > >
>> > > > > > > > > 3. The code has the following two methods in MemoryPool,
>> > which
>> > > > are
>> > > > > > not
>> > > > > > > > > described in the KIP. Could you explain how they are used
>> in
>> > > the
>> > > > > > wiki?
>> > > > > > > > > isLowOnMemory()
>> > > > > > > > > isOutOfMemory()
>> > > > > > > > >
>> > > > > > > > > 4. Could you also describe in the KIP at the high level,
>> how
>> > > the
>> > > > > read
>> > > > > > > > > interest bit for the socket is turned on/off with respect
>> to
>> > > > > > > MemoryPool?
>> > > > > > > > >
>> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
>> Long.MAX_VALUE?
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > >
>> > > > > > > > > Jun
>> > > > > > > > >
>> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
>> > > > radai.rosenblatt@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi,
>> > > > > > > > > >
>> > > > > > > > > > I would like to initiate a vote on KIP-72:
>> > > > > > > > > >
>> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > 72%3A+
>> > > > > > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+
>> > > requests
>> > > > > > > > > >
>> > > > > > > > > > The kip allows specifying a limit on the amount of memory
>> > > > > allocated
>> > > > > > > for
>> > > > > > > > > > reading incoming requests into. This is useful for
>> > "sizing" a
>> > > > > > broker
>> > > > > > > > and
>> > > > > > > > > > avoiding OOMEs under heavy load (as actually happens
>> > > > occasionally
>> > > > > > at
>> > > > > > > > > > linkedin).
>> > > > > > > > > >
>> > > > > > > > > > I believe I've addressed most (all?) concerns brought up
>> > > during
>> > > > > the
>> > > > > > > > > > discussion.
>> > > > > > > > > >
>> > > > > > > > > > To the best of my understanding this vote is about the
>> goal
>> > > and
>> > > > > > > > > > public-facing changes related to the new proposed
>> behavior,
>> > > but
>> > > > > as
>> > > > > > > for
>> > > > > > > > > > implementation, i have the code up here:
>> > > > > > > > > >
>> > > > > > > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-
>> > memory
>> > > > > > > > > > -pool-with-muting
>> > > > > > > > > >
>> > > > > > > > > > and I've stress-tested it to work properly (meaning it
>> > chugs
>> > > > > along
>> > > > > > > and
>> > > > > > > > > > throttles under loads that would DOS 10.0.1.0 code).
>> > > > > > > > > >
>> > > > > > > > > > I also believe that the primitives and "pattern"s
>> > introduced
>> > > in
>> > > > > > this
>> > > > > > > > KIP
>> > > > > > > > > > (namely the notion of a buffer pool and retrieving from /
>> > > > > releasing
>> > > > > > > to
>> > > > > > > > > said
>> > > > > > > > > > pool instead of allocating memory) are generally useful
>> > > beyond
>> > > > > the
>> > > > > > > > scope
>> > > > > > > > > of
>> > > > > > > > > > this KIP for both performance issues (allocating lots of
>> > > > > > short-lived
>> > > > > > > > > large
>> > > > > > > > > > buffers is a performance bottleneck) and other areas
>> where
>> > > > memory
>> > > > > > > > limits
>> > > > > > > > > > are a problem (KIP-81)
>> > > > > > > > > >
>> > > > > > > > > > Thank you,
>> > > > > > > > > >
>> > > > > > > > > > Radai.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Regards,
>> > > > > >
>> > > > > > Rajini
>> > > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Regards,
>> > >
>> > > Rajini
>> > >
>> >
>>
>>
>>
>> --
>> Regards,
>>
>> Rajini
>>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
@rajini - fixed the hasBytesBuffered() method. Also updated poll() so that
no latency is added for picking up data stuck in SSL buffers (the timeout is
set to 0, just like with immediately-connected keys and staged receives).
Thank you for pointing these out.
Added SSL (re-)testing to the KIP test plan.
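
To make the poll() change concrete, here is a minimal, self-contained sketch of
the idea (not the actual patch - the Channel interface and method names below
are simplified stand-ins): if any channel still has decrypted-but-unprocessed
data sitting in its SSL buffers, poll() must not block, otherwise that data
waits a full poll timeout before being picked up.

import java.util.List;

// Illustration only - simplified stand-ins, not the actual Selector/KafkaChannel code.
class ZeroTimeoutPollSketch {

    interface Channel {
        boolean hasBytesBuffered(); // data already read off the socket, not yet consumed
    }

    // Mirrors the treatment of immediately-connected keys and staged receives:
    // when work is already pending in memory, use a zero timeout.
    static long pollTimeout(List<Channel> channels, long requestedTimeoutMs) {
        boolean dataInBuffers = channels.stream().anyMatch(Channel::hasBytesBuffered);
        return dataInBuffers ? 0L : requestedTimeoutMs;
    }

    public static void main(String[] args) {
        Channel idle = () -> false;
        Channel buffered = () -> true;
        System.out.println(pollTimeout(List.of(idle), 300L));           // 300
        System.out.println(pollTimeout(List.of(idle, buffered), 300L)); // 0
    }
}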




On Mon, Nov 14, 2016 at 7:24 AM, Rajini Sivaram <
rajinisivaram@googlemail.com> wrote:

> Open point 1. I would just retain the current long value that specifies
> queued.max.bytes as long and not as %heap since it is simple and easy to
> use. And keeps it consistent with other ".bytes" configs.
>
> Point 3. ssl buffers - I am not quite sure the implementation looks
> correct. hasBytesBuffered() is checking position() of buffers == 0. And the
> code checks this only when poll with a timeout returns (adding a delay when
> there is nothing else to read).
> But since this and open point 2 (optimization) are implementation details,
> they can be looked at during PR review.
>
> It will be good to add SSL testing to the test plan as well, since there is
> additional code to test for SSL.
>
>
> On Fri, Nov 11, 2016 at 9:03 PM, radai <ra...@gmail.com> wrote:
>
> > ok, i've made the following changes:
> >
> > 1. memory.pool.class.name has been removed
> > 2. the code now only uses SimpleMemoryPool. the gc variant is left
> (unused)
> > as a developement aid and is unsettable via configuration.
> > 3. I've resolved the issue of stale data getting stuck in intermediate
> > (ssl) buffers.
> > 4. default value for queued.max.bytes is -1, so off by default. any <=0
> > value is interpreted as off by the underlying code.
> >
> > open points:
> >
> > 1. the kafka config framework doesnt allow a value to be either long or
> > double, so in order to pull off the queued.max.bytes = 1000000 or
> > queued.max.bytes = 0.3 thing i'd need to define the config as type
> string,
> > which is ugly to me. do we want to support setting queued.max.bytes to %
> of
> > heap ? if so, by way of making queued.max.bytes of type string, or by way
> > of a 2nd config param (with the resulting either/all/combination?
> > validation). my personal opinion is string because i think a single
> > queued.max.bytes with overloaded meaning is more understandable to users.
> > i'll await other people's opinions before doing anything.
> > 2. i still need to evaluate rajini's optimization. sounds doable.
> >
> > asides:
> >
> > 1. i think you guys misunderstood the intent behind the gc pool. it was
> > never meant to be a magic pool that automatically releases buffers
> (because
> > just as rajini stated the performance implications would be horrible). it
> > was meant to catch leaks early. since that is indeed a dev-only concern
> it
> > wont ever get used in production.
> > 2. i said this on some other kip discussion: i think the nice thing about
> > the pool API is it "scales" from just keeping a memory bound to actually
> > re-using buffers without changing the calling code. i think
> actuallypooling
> > large buffers will result in a significant performance impact, but thats
> > outside the scope of this kip. at that point i think more pool
> > implementations (that actually pool) would be written. i agree with the
> > ideal of exposing as few knobs as possible, but switching pools (or pool
> > params) for tuning may happen at some later point.
> >
> >
> >
> > On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> > rajinisivaram@googlemail.com> wrote:
> >
> > > 13. At the moment, I think channels are not muted if:
> > >     channel.receive != null && channel.receive.buffer != null
> > > This mutes all channels that aren't holding onto a incomplete buffer.
> > They
> > > may or may not have read the 4-byte size.
> > >
> > > I was thinking you could avoid muting channels if:
> > >     channel.receive == null || channel.receive.size.remaining()
> > > This will not mute channels that are holding onto a buffer (as above).
> In
> > > addition, it will not mute channels that haven't read the 4-byte size.
> A
> > > client that is closed gracefully while the pool is full will not be
> muted
> > > in this case and the server can process close without waiting for the
> > pool
> > > to free up. Once the 4-byte size is read, the channel will be muted if
> > the
> > > pool is still out of memory - for each channel, at most one failed read
> > > attempt would be made while the pool is out of memory. I think this
> would
> > > also delay muting of SSL channels since they can continue to read into
> > > their (already allocated) network buffers and unwrap the data and block
> > > only when they need to allocate a buffer from the pool.
> > >
> > > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io> wrote:
> > >
> > > > Hey Radai,
> > > >
> > > > +1 on deprecating and eventually removing the old config. The
> intention
> > > was
> > > > absolutely bounding memory usage. I think having two ways of doing
> > this,
> > > > one that gives a crisp bound on memory and one that is hard to reason
> > > about
> > > > is pretty confusing. I think people will really appreciate having one
> > > > config which instead lets them directly control the thing they
> actually
> > > > care about (memory).
> > > >
> > > > I also want to second Jun's concern on the complexity of the
> self-GCing
> > > > memory pool. I wrote the memory pool for the producer. In that area
> the
> > > > pooling of messages is the single biggest factor in performance of
> the
> > > > client so I believed it was worth some sophistication/complexity if
> > there
> > > > was performance payoff. All the same, the complexity of that code has
> > > made
> > > > it VERY hard to keep correct (it gets broken roughly every other time
> > > > someone makes a change). Over time I came to feel a lot less proud of
> > my
> > > > cleverness. I learned something interesting reading your self-GCing
> > > memory
> > > > pool, but I wonder if the complexity is worth the payoff in this
> case?
> > > >
> > > > Philosophically we've tried really hard to avoid needlessly
> "pluggable"
> > > > implementations. That is, when there is a temptation to give a config
> > > that
> > > > plugs in different Java classes at run time for implementation
> choices,
> > > we
> > > > should instead think of how to give the user the good behavior
> > > > automatically. I think the use case for configuring a the GCing pool
> > > would
> > > > be if you discovered a bug in which memory leaked. But this isn't
> > > something
> > > > the user should have to think about right? If there is a bug we
> should
> > > find
> > > > and fix it.
> > > >
> > > > -Jay
> > > >
> > > > On Fri, Nov 11, 2016 at 9:21 AM, radai <ra...@gmail.com>
> > > wrote:
> > > >
> > > > > jun's #1 + rajini's #11 - the new config param is to enable
> changing
> > > the
> > > > > pool implentation class. as i said in my response to jun i will
> make
> > > the
> > > > > default pool impl be the simple one, and this param is to allow a
> > user
> > > > > (more likely a dev) to change it.
> > > > > both the simple pool and the "gc pool" make basically just an
> > > > > AtomicLong.get() + (hashmap.put for gc) calls before returning a
> > > buffer.
> > > > > there is absolutely no dependency on GC times in allocating (or
> not).
> > > the
> > > > > extra background thread in the gc pool is forever asleep unless
> there
> > > are
> > > > > bugs (==leaks) so the extra cost is basically nothing (backed by
> > > > > benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED MUST
> > ALWAYS
> > > > BE
> > > > > RELEASED - so the gc pool should not rely on gc for reclaiming
> > buffers.
> > > > its
> > > > > a bug detector, not a feature and is definitely not intended to
> hide
> > > > bugs -
> > > > > the exact opposite - its meant to expose them sooner. i've cleaned
> up
> > > the
> > > > > docs to avoid this confusion. i also like the fail on leak. will
> do.
> > > > > as for the gap between pool size and heap size - thats a valid
> > > argument.
> > > > > may allow also sizing the pool as % of heap size? so
> > queued.max.bytes =
> > > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available
> > heap?
> > > > >
> > > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still
> holds,
> > > > > assuming the ssl-related buffers are small. the largest weakness in
> > > this
> > > > > claim has to do with decompression rather than anything
> ssl-related.
> > so
> > > > yes
> > > > > there is an O(#ssl connections * sslEngine packet size) component,
> > but
> > > i
> > > > > think its small. again - decompression should be the concern.
> > > > >
> > > > > rajini's #13 - interesting optimization. the problem is there's no
> > > > knowing
> > > > > in advance what the _next_ request to come out of a socket is, so
> > this
> > > > > would mute just those sockets that are 1. mutable and 2. have a
> > > > > buffer-demanding request for which we could not allocate a buffer.
> > > > downside
> > > > > is that as-is this would cause the busy-loop on poll() that the
> mutes
> > > > were
> > > > > supposed to prevent - or code would need to be added to ad-hocmute
> a
> > > > > connection that was so-far unmuted but has now generated a
> > > > memory-demanding
> > > > > request?
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > > > rajinisivaram@googlemail.com> wrote:
> > > > >
> > > > > > Radai,
> > > > > >
> > > > > > 11. The KIP talks about a new server configuration parameter
> > > > > > *memory.pool.class.name
> > > > > > <http://memory.pool.class.name> *which is not in the
> > implementation.
> > > > Is
> > > > > it
> > > > > > still the case that the pool will be configurable?
> > > > > >
> > > > > > 12. Personally I would prefer not to have a garbage collected
> pool
> > > that
> > > > > > hides bugs as well. Apart from the added code complexity and
> extra
> > > > thread
> > > > > > to handle collections, I am also concerned about the
> > > non-deterministic
> > > > > > nature of GC timings. The KIP introduces delays in processing
> > > requests
> > > > > > based on the configuration parameter *queued.max.bytes. *This in
> > > > > unrelated
> > > > > > to the JVM heap size and hence pool can be full when there is no
> > > > pressure
> > > > > > on the JVM to garbage collect. The KIP does not prevent other
> > > timeouts
> > > > in
> > > > > > the broker (eg. consumer session timeout) because it is relying
> on
> > > the
> > > > > pool
> > > > > > to be managed in a deterministic, timely manner. Since a garbage
> > > > > collected
> > > > > > pool cannot provide that guarantee, wouldn't it be better to run
> > > tests
> > > > > with
> > > > > > a GC-pool that perhaps fails with a fatal error if it encounters
> a
> > > > buffer
> > > > > > that was not released?
> > > > > >
> > > > > > 13. The implementation currently mutes all channels that don't
> > have a
> > > > > > receive buffer allocated. Would it make sense to mute only the
> > > channels
> > > > > > that need a buffer (i.e. allow channels to read the 4-byte size
> > that
> > > is
> > > > > not
> > > > > > read using the pool) so that normal client connection close() is
> > > > handled
> > > > > > even when the pool is full? Since the extra 4-bytes may already
> be
> > > > > > allocated for some connections, the total request memory has to
> > take
> > > > into
> > > > > > account *4*numConnections* bytes anyway.
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > >
> > > > > > > Hi, Radai,
> > > > > > >
> > > > > > > 1. Yes, I am concerned about the trickiness of having to deal
> > with
> > > > > wreak
> > > > > > > refs. I think it's simpler to just have the simple version
> > > > instrumented
> > > > > > > with enough debug/trace logging and do enough stress testing.
> > Since
> > > > we
> > > > > > > still have queued.max.requests, one can always fall back to
> that
> > > if a
> > > > > > > memory leak issue is identified. We could also label the
> feature
> > as
> > > > > beta
> > > > > > if
> > > > > > > we don't think this is production ready.
> > > > > > >
> > > > > > > 2.2 I am just wondering after we fix that issue whether the
> claim
> > > > that
> > > > > > the
> > > > > > > request memory is bounded by  queued.max.bytes +
> > > > > socket.request.max.bytes
> > > > > > > is still true.
> > > > > > >
> > > > > > > 5. Ok, leaving the default as -1 is fine then.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> > radai.rosenblatt@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jun,
> > > > > > > >
> > > > > > > > Thank you for taking the time to review this.
> > > > > > > >
> > > > > > > > 1. short version - yes, the concern is bugs, but the cost is
> > tiny
> > > > and
> > > > > > > worth
> > > > > > > > it, and its a common pattern. long version:
> > > > > > > >    1.1 detecting these types of bugs (leaks) cannot be easily
> > > done
> > > > > with
> > > > > > > > simple testing, but requires stress/stability tests that run
> > for
> > > a
> > > > > long
> > > > > > > > time (long enough to hit OOM, depending on leak size and
> > > available
> > > > > > > memory).
> > > > > > > > this is why some sort of leak detector is "standard practice"
> > > .for
> > > > > > > example
> > > > > > > > look at netty (http://netty.io/wiki/
> reference-counted-objects.
> > > > > > > > html#leak-detection-levels)
> > > > > > > > <http://netty.io/wiki/reference-counted-objects.
> > > > > > > html#leak-detection-levels
> > > > > > > > >-
> > > > > > > > they have way more complicated built-in leak detection
> enabled
> > by
> > > > > > > default.
> > > > > > > > as a concrete example - during development i did not properly
> > > > dispose
> > > > > > of
> > > > > > > > in-progress KafkaChannel.receive when a connection was
> abruptly
> > > > > closed
> > > > > > > and
> > > > > > > > I only found it because of the log msg printed by the pool.
> > > > > > > >    1.2 I have a benchmark suite showing the performance cost
> of
> > > the
> > > > > gc
> > > > > > > pool
> > > > > > > > is absolutely negligible -
> > > > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > > > > > > tree/master/memorypool-benchmarks
> > > > > > > >    1.3 as for the complexity of the impl - its just ~150
> lines
> > > and
> > > > > > pretty
> > > > > > > > straight forward. i think the main issue is that not many
> > people
> > > > are
> > > > > > > > familiar with weak refs and ref queues.
> > > > > > > >
> > > > > > > >    how about making the pool impl class a config param
> > (generally
> > > > > good
> > > > > > > > going forward), make the default be the simple pool, and keep
> > the
> > > > GC
> > > > > > one
> > > > > > > as
> > > > > > > > a dev/debug/triage aid?
> > > > > > > >
> > > > > > > > 2. the KIP itself doesnt specifically treat SSL at all - its
> an
> > > > > > > > implementation detail. as for my current patch, it has some
> > > minimal
> > > > > > > > treatment of SSL - just enough to not mute SSL sockets
> > > > mid-handshake
> > > > > -
> > > > > > > but
> > > > > > > > the code in SslTransportLayer still allocates buffers itself.
> > it
> > > is
> > > > > my
> > > > > > > > understanding that netReadBuffer/appReadBuffer shouldn't grow
> > > > beyond
> > > > > 2
> > > > > > x
> > > > > > > > sslEngine.getSession().getPacketBufferSize(), which i assume
> > to
> > > be
> > > > > > > small.
> > > > > > > > they are also long lived (they live for the duration of the
> > > > > connection)
> > > > > > > > which makes a poor fit for pooling. the bigger fish to fry i
> > > think
> > > > is
> > > > > > > > decompression - you could read a 1MB blob into a
> pool-provided
> > > > buffer
> > > > > > and
> > > > > > > > then decompress it into 10MB of heap allocated on the spot
> :-)
> > > > also,
> > > > > > the
> > > > > > > > ssl code is extremely tricky.
> > > > > > > >    2.2 just to make sure, youre talking about Selector.java:
> > > while
> > > > > > > > ((networkReceive = channel.read()) != null)
> > > > > > addToStagedReceives(channel,
> > > > > > > > networkReceive); ? if so youre right, and i'll fix that
> > (probably
> > > > by
> > > > > > > > something similar to immediatelyConnectedKeys, not sure yet)
> > > > > > > >
> > > > > > > > 3. isOutOfMemory is self explanatory (and i'll add javadocs
> and
> > > > > update
> > > > > > > the
> > > > > > > > wiki). isLowOnMem is basically the point where I start
> > > randomizing
> > > > > the
> > > > > > > > selection key handling order to avoid potential starvation.
> its
> > > > > rather
> > > > > > > > arbitrary and now that i think of it should probably not
> exist
> > > and
> > > > be
> > > > > > > > entirely contained in Selector (where the shuffling takes
> > place).
> > > > > will
> > > > > > > fix.
> > > > > > > >
> > > > > > > > 4. will do.
> > > > > > > >
> > > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically
> anything
> > > > <=0).
> > > > > > > > Long.MAX_VALUE would still create a pool, that would still
> > waste
> > > > time
> > > > > > > > tracking resources. I dont really mind though if you have a
> > > > preferred
> > > > > > > magic
> > > > > > > > value for off.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Radai,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. Some comments below.
> > > > > > > > >
> > > > > > > > > 1. The KIP says "to facilitate faster implementation (as a
> > > safety
> > > > > > net)
> > > > > > > > the
> > > > > > > > > pool will be implemented in such a way that memory that was
> > not
> > > > > > > > release()ed
> > > > > > > > > (but still garbage collected) would be detected and
> > > "reclaimed".
> > > > > this
> > > > > > > is
> > > > > > > > to
> > > > > > > > > prevent "leaks" in case of code paths that fail to
> release()
> > > > > > > properly.".
> > > > > > > > > What are the cases that could cause memory leaks? If we are
> > > > > concerned
> > > > > > > > about
> > > > > > > > > bugs, it seems that it's better to just do more testing to
> > make
> > > > > sure
> > > > > > > the
> > > > > > > > > usage of the simple implementation (SimpleMemoryPool) is
> > solid
> > > > > > instead
> > > > > > > of
> > > > > > > > > adding more complicated logic (GarbageCollectedMemoryPool)
> to
> > > > hide
> > > > > > the
> > > > > > > > > potential bugs.
> > > > > > > > >
> > > > > > > > > 2. I am wondering how much this KIP covers the SSL channel
> > > > > > > > implementation.
> > > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> > netWriteBuffer,
> > > > > > > > > appReadBuffer per socket. Should those memory be accounted
> > for
> > > in
> > > > > > > memory
> > > > > > > > > pool?
> > > > > > > > > 2.2 One tricky thing with SSL is that during a
> > > > KafkaChannel.read(),
> > > > > > > it's
> > > > > > > > > possible for multiple NetworkReceives to be returned since
> > > > multiple
> > > > > > > > > requests' data could be encrypted together by SSL. To deal
> > with
> > > > > this,
> > > > > > > we
> > > > > > > > > stash those NetworkReceives in Selector.stagedReceives and
> > give
> > > > it
> > > > > > back
> > > > > > > > to
> > > > > > > > > the poll() call one NetworkReceive at a time. What this
> means
> > > is
> > > > > > that,
> > > > > > > if
> > > > > > > > > we stop reading from KafkaChannel in the middle because
> > memory
> > > > pool
> > > > > > is
> > > > > > > > > full, this channel's key may never get selected for reads
> > (even
> > > > > after
> > > > > > > the
> > > > > > > > > read interest is turned on), but there are still pending
> data
> > > for
> > > > > the
> > > > > > > > > channel, which will never get processed.
> > > > > > > > >
> > > > > > > > > 3. The code has the following two methods in MemoryPool,
> > which
> > > > are
> > > > > > not
> > > > > > > > > described in the KIP. Could you explain how they are used
> in
> > > the
> > > > > > wiki?
> > > > > > > > > isLowOnMemory()
> > > > > > > > > isOutOfMemory()
> > > > > > > > >
> > > > > > > > > 4. Could you also describe in the KIP at the high level,
> how
> > > the
> > > > > read
> > > > > > > > > interest bit for the socket is turned on/off with respect
> to
> > > > > > > MemoryPool?
> > > > > > > > >
> > > > > > > > > 5. Should queued.max.bytes defaults to -1 or
> Long.MAX_VALUE?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > > > radai.rosenblatt@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I would like to initiate a vote on KIP-72:
> > > > > > > > > >
> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 72%3A+
> > > > > > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+
> > > requests
> > > > > > > > > >
> > > > > > > > > > The kip allows specifying a limit on the amount of memory
> > > > > allocated
> > > > > > > for
> > > > > > > > > > reading incoming requests into. This is useful for
> > "sizing" a
> > > > > > broker
> > > > > > > > and
> > > > > > > > > > avoiding OOMEs under heavy load (as actually happens
> > > > occasionally
> > > > > > at
> > > > > > > > > > linkedin).
> > > > > > > > > >
> > > > > > > > > > I believe I've addressed most (all?) concerns brought up
> > > during
> > > > > the
> > > > > > > > > > discussion.
> > > > > > > > > >
> > > > > > > > > > To the best of my understanding this vote is about the
> goal
> > > and
> > > > > > > > > > public-facing changes related to the new proposed
> behavior,
> > > but
> > > > > as
> > > > > > > for
> > > > > > > > > > implementation, i have the code up here:
> > > > > > > > > >
> > > > > > > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-
> > memory
> > > > > > > > > > -pool-with-muting
> > > > > > > > > >
> > > > > > > > > > and I've stress-tested it to work properly (meaning it
> > chugs
> > > > > along
> > > > > > > and
> > > > > > > > > > throttles under loads that would DOS 10.0.1.0 code).
> > > > > > > > > >
> > > > > > > > > > I also believe that the primitives and "pattern"s
> > introduced
> > > in
> > > > > > this
> > > > > > > > KIP
> > > > > > > > > > (namely the notion of a buffer pool and retrieving from /
> > > > > releasing
> > > > > > > to
> > > > > > > > > said
> > > > > > > > > > pool instead of allocating memory) are generally useful
> > > beyond
> > > > > the
> > > > > > > > scope
> > > > > > > > > of
> > > > > > > > > > this KIP for both performance issues (allocating lots of
> > > > > > short-lived
> > > > > > > > > large
> > > > > > > > > > buffers is a performance bottleneck) and other areas
> where
> > > > memory
> > > > > > > > limits
> > > > > > > > > > are a problem (KIP-81)
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > >
> > > > > > > > > > Radai.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Rajini
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Rajini
> > >
> >
>
>
>
> --
> Regards,
>
> Rajini
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Rajini Sivaram <ra...@googlemail.com>.
Open point 1. I would just retain queued.max.bytes as the current long value
(bytes) and not as a % of heap, since it is simple and easy to use. It also
keeps it consistent with the other ".bytes" configs.

Point 3. SSL buffers - I am not quite sure the implementation looks correct.
hasBytesBuffered() is checking whether the buffers' position() == 0 (see the
small illustration below), and the code checks this only after poll() with a
timeout returns (adding a delay when there is nothing else to read).
But since this and open point 2 (optimization) are implementation details,
they can be looked at during PR review.
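
To illustrate the concern (buffer names here are placeholders, not the actual
Kafka fields): after unwrap() writes decrypted bytes into a buffer, position()
> 0 means data is still pending, so a check against == 0 answers "is it empty"
rather than "does it have bytes buffered". Something like:

import java.nio.ByteBuffer;

// Illustration only - assumed buffer names, not the actual SslTransportLayer fields.
class HasBytesBufferedSketch {

    // A buffer that has been written into has position() > 0 until it is drained.
    static boolean hasBytesBuffered(ByteBuffer netReadBuffer, ByteBuffer appReadBuffer) {
        return netReadBuffer.position() > 0 || appReadBuffer.position() > 0;
    }

    public static void main(String[] args) {
        ByteBuffer net = ByteBuffer.allocate(64);
        ByteBuffer app = ByteBuffer.allocate(64);
        System.out.println(hasBytesBuffered(net, app)); // false - nothing pending

        app.put(new byte[]{1, 2, 3});                    // unconsumed decrypted bytes
        System.out.println(hasBytesBuffered(net, app)); // true - should poll with timeout 0
    }
}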

It will be good to add SSL testing to the test plan as well, since there is
additional code to test for SSL.


On Fri, Nov 11, 2016 at 9:03 PM, radai <ra...@gmail.com> wrote:

> ok, i've made the following changes:
>
> 1. memory.pool.class.name has been removed
> 2. the code now only uses SimpleMemoryPool. the gc variant is left (unused)
> as a developement aid and is unsettable via configuration.
> 3. I've resolved the issue of stale data getting stuck in intermediate
> (ssl) buffers.
> 4. default value for queued.max.bytes is -1, so off by default. any <=0
> value is interpreted as off by the underlying code.
>
> open points:
>
> 1. the kafka config framework doesnt allow a value to be either long or
> double, so in order to pull off the queued.max.bytes = 1000000 or
> queued.max.bytes = 0.3 thing i'd need to define the config as type string,
> which is ugly to me. do we want to support setting queued.max.bytes to % of
> heap ? if so, by way of making queued.max.bytes of type string, or by way
> of a 2nd config param (with the resulting either/all/combination?
> validation). my personal opinion is string because i think a single
> queued.max.bytes with overloaded meaning is more understandable to users.
> i'll await other people's opinions before doing anything.
> 2. i still need to evaluate rajini's optimization. sounds doable.
>
> asides:
>
> 1. i think you guys misunderstood the intent behind the gc pool. it was
> never meant to be a magic pool that automatically releases buffers (because
> just as rajini stated the performance implications would be horrible). it
> was meant to catch leaks early. since that is indeed a dev-only concern it
> wont ever get used in production.
> 2. i said this on some other kip discussion: i think the nice thing about
> the pool API is it "scales" from just keeping a memory bound to actually
> re-using buffers without changing the calling code. i think actuallypooling
> large buffers will result in a significant performance impact, but thats
> outside the scope of this kip. at that point i think more pool
> implementations (that actually pool) would be written. i agree with the
> ideal of exposing as few knobs as possible, but switching pools (or pool
> params) for tuning may happen at some later point.
>
>
>
> On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
> rajinisivaram@googlemail.com> wrote:
>
> > 13. At the moment, I think channels are not muted if:
> >     channel.receive != null && channel.receive.buffer != null
> > This mutes all channels that aren't holding onto a incomplete buffer.
> They
> > may or may not have read the 4-byte size.
> >
> > I was thinking you could avoid muting channels if:
> >     channel.receive == null || channel.receive.size.remaining()
> > This will not mute channels that are holding onto a buffer (as above). In
> > addition, it will not mute channels that haven't read the 4-byte size. A
> > client that is closed gracefully while the pool is full will not be muted
> > in this case and the server can process close without waiting for the
> pool
> > to free up. Once the 4-byte size is read, the channel will be muted if
> the
> > pool is still out of memory - for each channel, at most one failed read
> > attempt would be made while the pool is out of memory. I think this would
> > also delay muting of SSL channels since they can continue to read into
> > their (already allocated) network buffers and unwrap the data and block
> > only when they need to allocate a buffer from the pool.
> >
> > On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io> wrote:
> >
> > > Hey Radai,
> > >
> > > +1 on deprecating and eventually removing the old config. The intention
> > was
> > > absolutely bounding memory usage. I think having two ways of doing
> this,
> > > one that gives a crisp bound on memory and one that is hard to reason
> > about
> > > is pretty confusing. I think people will really appreciate having one
> > > config which instead lets them directly control the thing they actually
> > > care about (memory).
> > >
> > > I also want to second Jun's concern on the complexity of the self-GCing
> > > memory pool. I wrote the memory pool for the producer. In that area the
> > > pooling of messages is the single biggest factor in performance of the
> > > client so I believed it was worth some sophistication/complexity if
> there
> > > was performance payoff. All the same, the complexity of that code has
> > made
> > > it VERY hard to keep correct (it gets broken roughly every other time
> > > someone makes a change). Over time I came to feel a lot less proud of
> my
> > > cleverness. I learned something interesting reading your self-GCing
> > memory
> > > pool, but I wonder if the complexity is worth the payoff in this case?
> > >
> > > Philosophically we've tried really hard to avoid needlessly "pluggable"
> > > implementations. That is, when there is a temptation to give a config
> > that
> > > plugs in different Java classes at run time for implementation choices,
> > we
> > > should instead think of how to give the user the good behavior
> > > automatically. I think the use case for configuring a the GCing pool
> > would
> > > be if you discovered a bug in which memory leaked. But this isn't
> > something
> > > the user should have to think about right? If there is a bug we should
> > find
> > > and fix it.
> > >
> > > -Jay
> > >
> > > On Fri, Nov 11, 2016 at 9:21 AM, radai <ra...@gmail.com>
> > wrote:
> > >
> > > > jun's #1 + rajini's #11 - the new config param is to enable changing
> > the
> > > > pool implentation class. as i said in my response to jun i will make
> > the
> > > > default pool impl be the simple one, and this param is to allow a
> user
> > > > (more likely a dev) to change it.
> > > > both the simple pool and the "gc pool" make basically just an
> > > > AtomicLong.get() + (hashmap.put for gc) calls before returning a
> > buffer.
> > > > there is absolutely no dependency on GC times in allocating (or not).
> > the
> > > > extra background thread in the gc pool is forever asleep unless there
> > are
> > > > bugs (==leaks) so the extra cost is basically nothing (backed by
> > > > benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED MUST
> ALWAYS
> > > BE
> > > > RELEASED - so the gc pool should not rely on gc for reclaiming
> buffers.
> > > its
> > > > a bug detector, not a feature and is definitely not intended to hide
> > > bugs -
> > > > the exact opposite - its meant to expose them sooner. i've cleaned up
> > the
> > > > docs to avoid this confusion. i also like the fail on leak. will do.
> > > > as for the gap between pool size and heap size - thats a valid
> > argument.
> > > > may allow also sizing the pool as % of heap size? so
> queued.max.bytes =
> > > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available
> heap?
> > > >
> > > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still holds,
> > > > assuming the ssl-related buffers are small. the largest weakness in
> > this
> > > > claim has to do with decompression rather than anything ssl-related.
> so
> > > yes
> > > > there is an O(#ssl connections * sslEngine packet size) component,
> but
> > i
> > > > think its small. again - decompression should be the concern.
> > > >
> > > > rajini's #13 - interesting optimization. the problem is there's no
> > > knowing
> > > > in advance what the _next_ request to come out of a socket is, so
> this
> > > > would mute just those sockets that are 1. mutable and 2. have a
> > > > buffer-demanding request for which we could not allocate a buffer.
> > > downside
> > > > is that as-is this would cause the busy-loop on poll() that the mutes
> > > were
> > > > supposed to prevent - or code would need to be added to ad-hocmute a
> > > > connection that was so-far unmuted but has now generated a
> > > memory-demanding
> > > > request?
> > > >
> > > >
> > > >
> > > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > > rajinisivaram@googlemail.com> wrote:
> > > >
> > > > > Radai,
> > > > >
> > > > > 11. The KIP talks about a new server configuration parameter
> > > > > *memory.pool.class.name
> > > > > <http://memory.pool.class.name> *which is not in the
> implementation.
> > > Is
> > > > it
> > > > > still the case that the pool will be configurable?
> > > > >
> > > > > 12. Personally I would prefer not to have a garbage collected pool
> > that
> > > > > hides bugs as well. Apart from the added code complexity and extra
> > > thread
> > > > > to handle collections, I am also concerned about the
> > non-deterministic
> > > > > nature of GC timings. The KIP introduces delays in processing
> > requests
> > > > > based on the configuration parameter *queued.max.bytes. *This in
> > > > unrelated
> > > > > to the JVM heap size and hence pool can be full when there is no
> > > pressure
> > > > > on the JVM to garbage collect. The KIP does not prevent other
> > timeouts
> > > in
> > > > > the broker (eg. consumer session timeout) because it is relying on
> > the
> > > > pool
> > > > > to be managed in a deterministic, timely manner. Since a garbage
> > > > collected
> > > > > pool cannot provide that guarantee, wouldn't it be better to run
> > tests
> > > > with
> > > > > a GC-pool that perhaps fails with a fatal error if it encounters a
> > > buffer
> > > > > that was not released?
> > > > >
> > > > > 13. The implementation currently mutes all channels that don't
> have a
> > > > > receive buffer allocated. Would it make sense to mute only the
> > channels
> > > > > that need a buffer (i.e. allow channels to read the 4-byte size
> that
> > is
> > > > not
> > > > > read using the pool) so that normal client connection close() is
> > > handled
> > > > > even when the pool is full? Since the extra 4-bytes may already be
> > > > > allocated for some connections, the total request memory has to
> take
> > > into
> > > > > account *4*numConnections* bytes anyway.
> > > > >
> > > > >
> > > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io>
> wrote:
> > > > >
> > > > > > Hi, Radai,
> > > > > >
> > > > > > 1. Yes, I am concerned about the trickiness of having to deal
> with
> > > > wreak
> > > > > > refs. I think it's simpler to just have the simple version
> > > instrumented
> > > > > > with enough debug/trace logging and do enough stress testing.
> Since
> > > we
> > > > > > still have queued.max.requests, one can always fall back to that
> > if a
> > > > > > memory leak issue is identified. We could also label the feature
> as
> > > > beta
> > > > > if
> > > > > > we don't think this is production ready.
> > > > > >
> > > > > > 2.2 I am just wondering after we fix that issue whether the claim
> > > that
> > > > > the
> > > > > > request memory is bounded by  queued.max.bytes +
> > > > socket.request.max.bytes
> > > > > > is still true.
> > > > > >
> > > > > > 5. Ok, leaving the default as -1 is fine then.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <
> radai.rosenblatt@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > Thank you for taking the time to review this.
> > > > > > >
> > > > > > > 1. short version - yes, the concern is bugs, but the cost is
> tiny
> > > and
> > > > > > worth
> > > > > > > it, and its a common pattern. long version:
> > > > > > >    1.1 detecting these types of bugs (leaks) cannot be easily
> > done
> > > > with
> > > > > > > simple testing, but requires stress/stability tests that run
> for
> > a
> > > > long
> > > > > > > time (long enough to hit OOM, depending on leak size and
> > available
> > > > > > memory).
> > > > > > > this is why some sort of leak detector is "standard practice"
> > .for
> > > > > > example
> > > > > > > look at netty (http://netty.io/wiki/reference-counted-objects.
> > > > > > > html#leak-detection-levels)
> > > > > > > <http://netty.io/wiki/reference-counted-objects.
> > > > > > html#leak-detection-levels
> > > > > > > >-
> > > > > > > they have way more complicated built-in leak detection enabled
> by
> > > > > > default.
> > > > > > > as a concrete example - during development i did not properly
> > > dispose
> > > > > of
> > > > > > > in-progress KafkaChannel.receive when a connection was abruptly
> > > > closed
> > > > > > and
> > > > > > > I only found it because of the log msg printed by the pool.
> > > > > > >    1.2 I have a benchmark suite showing the performance cost of
> > the
> > > > gc
> > > > > > pool
> > > > > > > is absolutely negligible -
> > > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > > > > > tree/master/memorypool-benchmarks
> > > > > > >    1.3 as for the complexity of the impl - its just ~150 lines
> > and
> > > > > pretty
> > > > > > > straight forward. i think the main issue is that not many
> people
> > > are
> > > > > > > familiar with weak refs and ref queues.
> > > > > > >
> > > > > > >    how about making the pool impl class a config param
> (generally
> > > > good
> > > > > > > going forward), make the default be the simple pool, and keep
> the
> > > GC
> > > > > one
> > > > > > as
> > > > > > > a dev/debug/triage aid?
> > > > > > >
> > > > > > > 2. the KIP itself doesnt specifically treat SSL at all - its an
> > > > > > > implementation detail. as for my current patch, it has some
> > minimal
> > > > > > > treatment of SSL - just enough to not mute SSL sockets
> > > mid-handshake
> > > > -
> > > > > > but
> > > > > > > the code in SslTransportLayer still allocates buffers itself.
> it
> > is
> > > > my
> > > > > > > understanding that netReadBuffer/appReadBuffer shouldn't grow
> > > beyond
> > > > 2
> > > > > x
> > > > > > > sslEngine.getSession().getPacketBufferSize(), which i assume
> to
> > be
> > > > > > small.
> > > > > > > they are also long lived (they live for the duration of the
> > > > connection)
> > > > > > > which makes a poor fit for pooling. the bigger fish to fry i
> > think
> > > is
> > > > > > > decompression - you could read a 1MB blob into a pool-provided
> > > buffer
> > > > > and
> > > > > > > then decompress it into 10MB of heap allocated on the spot :-)
> > > also,
> > > > > the
> > > > > > > ssl code is extremely tricky.
> > > > > > >    2.2 just to make sure, youre talking about Selector.java:
> > while
> > > > > > > ((networkReceive = channel.read()) != null)
> > > > > addToStagedReceives(channel,
> > > > > > > networkReceive); ? if so youre right, and i'll fix that
> (probably
> > > by
> > > > > > > something similar to immediatelyConnectedKeys, not sure yet)
> > > > > > >
> > > > > > > 3. isOutOfMemory is self explanatory (and i'll add javadocs and
> > > > update
> > > > > > the
> > > > > > > wiki). isLowOnMem is basically the point where I start
> > randomizing
> > > > the
> > > > > > > selection key handling order to avoid potential starvation. its
> > > > rather
> > > > > > > arbitrary and now that i think of it should probably not exist
> > and
> > > be
> > > > > > > entirely contained in Selector (where the shuffling takes
> place).
> > > > will
> > > > > > fix.
> > > > > > >
> > > > > > > 4. will do.
> > > > > > >
> > > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything
> > > <=0).
> > > > > > > Long.MAX_VALUE would still create a pool, that would still
> waste
> > > time
> > > > > > > tracking resources. I dont really mind though if you have a
> > > preferred
> > > > > > magic
> > > > > > > value for off.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > > >
> > > > > > > > Hi, Radai,
> > > > > > > >
> > > > > > > > Thanks for the KIP. Some comments below.
> > > > > > > >
> > > > > > > > 1. The KIP says "to facilitate faster implementation (as a
> > safety
> > > > > net)
> > > > > > > the
> > > > > > > > pool will be implemented in such a way that memory that was
> not
> > > > > > > release()ed
> > > > > > > > (but still garbage collected) would be detected and
> > "reclaimed".
> > > > this
> > > > > > is
> > > > > > > to
> > > > > > > > prevent "leaks" in case of code paths that fail to release()
> > > > > > properly.".
> > > > > > > > What are the cases that could cause memory leaks? If we are
> > > > concerned
> > > > > > > about
> > > > > > > > bugs, it seems that it's better to just do more testing to
> make
> > > > sure
> > > > > > the
> > > > > > > > usage of the simple implementation (SimpleMemoryPool) is
> solid
> > > > > instead
> > > > > > of
> > > > > > > > adding more complicated logic (GarbageCollectedMemoryPool) to
> > > hide
> > > > > the
> > > > > > > > potential bugs.
> > > > > > > >
> > > > > > > > 2. I am wondering how much this KIP covers the SSL channel
> > > > > > > implementation.
> > > > > > > > 2.1 SslTransportLayer maintains netReadBuffer,
> netWriteBuffer,
> > > > > > > > appReadBuffer per socket. Should those memory be accounted
> for
> > in
> > > > > > memory
> > > > > > > > pool?
> > > > > > > > 2.2 One tricky thing with SSL is that during a
> > > KafkaChannel.read(),
> > > > > > it's
> > > > > > > > possible for multiple NetworkReceives to be returned since
> > > multiple
> > > > > > > > requests' data could be encrypted together by SSL. To deal
> with
> > > > this,
> > > > > > we
> > > > > > > > stash those NetworkReceives in Selector.stagedReceives and
> give
> > > it
> > > > > back
> > > > > > > to
> > > > > > > > the poll() call one NetworkReceive at a time. What this means
> > is
> > > > > that,
> > > > > > if
> > > > > > > > we stop reading from KafkaChannel in the middle because
> memory
> > > pool
> > > > > is
> > > > > > > > full, this channel's key may never get selected for reads
> (even
> > > > after
> > > > > > the
> > > > > > > > read interest is turned on), but there are still pending data
> > for
> > > > the
> > > > > > > > channel, which will never get processed.
> > > > > > > >
> > > > > > > > 3. The code has the following two methods in MemoryPool,
> which
> > > are
> > > > > not
> > > > > > > > described in the KIP. Could you explain how they are used in
> > the
> > > > > wiki?
> > > > > > > > isLowOnMemory()
> > > > > > > > isOutOfMemory()
> > > > > > > >
> > > > > > > > 4. Could you also describe in the KIP at the high level, how
> > the
> > > > read
> > > > > > > > interest bit for the socket is turned on/off with respect to
> > > > > > MemoryPool?
> > > > > > > >
> > > > > > > > 5. Should queued.max.bytes defaults to -1 or Long.MAX_VALUE?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > > radai.rosenblatt@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I would like to initiate a vote on KIP-72:
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 72%3A+
> > > > > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+
> > requests
> > > > > > > > >
> > > > > > > > > The kip allows specifying a limit on the amount of memory
> > > > allocated
> > > > > > for
> > > > > > > > > reading incoming requests into. This is useful for
> "sizing" a
> > > > > broker
> > > > > > > and
> > > > > > > > > avoiding OOMEs under heavy load (as actually happens
> > > occasionally
> > > > > at
> > > > > > > > > linkedin).
> > > > > > > > >
> > > > > > > > > I believe I've addressed most (all?) concerns brought up
> > during
> > > > the
> > > > > > > > > discussion.
> > > > > > > > >
> > > > > > > > > To the best of my understanding this vote is about the goal
> > and
> > > > > > > > > public-facing changes related to the new proposed behavior,
> > but
> > > > as
> > > > > > for
> > > > > > > > > implementation, i have the code up here:
> > > > > > > > >
> > > > > > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-
> memory
> > > > > > > > > -pool-with-muting
> > > > > > > > >
> > > > > > > > > and I've stress-tested it to work properly (meaning it
> chugs
> > > > along
> > > > > > and
> > > > > > > > > throttles under loads that would DOS 10.0.1.0 code).
> > > > > > > > >
> > > > > > > > > I also believe that the primitives and "pattern"s
> introduced
> > in
> > > > > this
> > > > > > > KIP
> > > > > > > > > (namely the notion of a buffer pool and retrieving from /
> > > > releasing
> > > > > > to
> > > > > > > > said
> > > > > > > > > pool instead of allocating memory) are generally useful
> > beyond
> > > > the
> > > > > > > scope
> > > > > > > > of
> > > > > > > > > this KIP for both performance issues (allocating lots of
> > > > > short-lived
> > > > > > > > large
> > > > > > > > > buffers is a performance bottleneck) and other areas where
> > > memory
> > > > > > > limits
> > > > > > > > > are a problem (KIP-81)
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > Radai.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Rajini
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Rajini
> >
>



-- 
Regards,

Rajini

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
OK, I've made the following changes:

1. memory.pool.class.name has been removed.
2. The code now only uses SimpleMemoryPool. The GC variant is left (unused)
as a development aid and cannot be selected via configuration.
3. I've resolved the issue of stale data getting stuck in intermediate
(SSL) buffers.
4. The default value for queued.max.bytes is -1, so the feature is off by
default. Any value <= 0 is interpreted as off by the underlying code.

open points:

1. the Kafka config framework doesn't allow a value to be either long or
double, so in order to support both queued.max.bytes = 1000000 and
queued.max.bytes = 0.3 I'd need to define the config as type string, which
is ugly to me. Do we want to support setting queued.max.bytes to a % of
heap? If so, by way of making queued.max.bytes of type string, or by way
of a 2nd config param (with the resulting either/all/combination
validation)? My personal opinion is string, because I think a single
queued.max.bytes with an overloaded meaning is more understandable to
users. I'll await other people's opinions before doing anything (a minimal
parsing sketch follows after this list).
2. I still need to evaluate Rajini's optimization. Sounds doable.
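
For illustration only, here is a minimal sketch (in Java) of how a
string-typed queued.max.bytes could be interpreted as either an absolute
byte count or a fraction of the heap. The class and method names are
hypothetical and not part of the KIP:

    public final class QueuedMaxBytesSketch {
        // Hypothetical parser for a string-typed queued.max.bytes:
        //   "-1" or "0" -> feature off
        //   "0.25"      -> 25% of the JVM max heap
        //   "1000000"   -> 1000000 bytes
        public static long parse(String value) {
            double parsed = Double.parseDouble(value.trim());
            if (parsed <= 0)
                return -1L;                                       // off
            if (parsed < 1)
                return (long) (Runtime.getRuntime().maxMemory() * parsed);
            return (long) parsed;                                 // absolute byte count
        }
    }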

asides:

1. I think you guys misunderstood the intent behind the GC pool. It was
never meant to be a magic pool that automatically releases buffers (because,
just as Rajini stated, the performance implications would be horrible). It
was meant to catch leaks early. Since that is indeed a dev-only concern it
won't ever get used in production.
2. I said this on some other KIP discussion: I think the nice thing about
the pool API is that it "scales" from just keeping a memory bound to actually
re-using buffers without changing the calling code. I think actually pooling
large buffers will have a significant performance impact, but that's
outside the scope of this KIP. At that point I think more pool
implementations (that actually pool) would be written. I agree with the
ideal of exposing as few knobs as possible, but switching pools (or pool
params) for tuning may happen at some later point (a rough sketch of the
pool contract follows below).
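
To make the "same API, different guarantees" point concrete, here is a rough
sketch of such a pool contract, assembled from the method names that came up
in this thread (tryAllocate/release, isLowOnMemory, isOutOfMemory). Treat it
as illustrative rather than the final public interface:

    import java.nio.ByteBuffer;

    // Sketch of the pool contract: callers try to obtain a buffer of a given
    // size and must give it back. A bound-only implementation hands out fresh
    // ByteBuffers and merely tracks outstanding bytes; an implementation that
    // actually recycles buffers could be dropped in later without changing
    // the calling code.
    public interface MemoryPoolSketch {

        // Returns a buffer of sizeBytes, or null if the pool cannot satisfy
        // the request right now.
        ByteBuffer tryAllocate(int sizeBytes);

        // Returns a previously allocated buffer. Every successful
        // tryAllocate() must be matched by exactly one release().
        void release(ByteBuffer previouslyAllocated);

        // True when available memory drops below a low-water mark (used to
        // start shuffling selection-key handling order to avoid starvation).
        boolean isLowOnMemory();

        // True when no memory is currently available for allocation.
        boolean isOutOfMemory();
    }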



On Fri, Nov 11, 2016 at 11:44 AM, Rajini Sivaram <
rajinisivaram@googlemail.com> wrote:

> 13. At the moment, I think channels are not muted if:
>     channel.receive != null && channel.receive.buffer != null
> This mutes all channels that aren't holding onto a incomplete buffer. They
> may or may not have read the 4-byte size.
>
> I was thinking you could avoid muting channels if:
>     channel.receive == null || channel.receive.size.remaining()
> This will not mute channels that are holding onto a buffer (as above). In
> addition, it will not mute channels that haven't read the 4-byte size. A
> client that is closed gracefully while the pool is full will not be muted
> in this case and the server can process close without waiting for the pool
> to free up. Once the 4-byte size is read, the channel will be muted if the
> pool is still out of memory - for each channel, at most one failed read
> attempt would be made while the pool is out of memory. I think this would
> also delay muting of SSL channels since they can continue to read into
> their (already allocated) network buffers and unwrap the data and block
> only when they need to allocate a buffer from the pool.
>
> On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io> wrote:
>
> > Hey Radai,
> >
> > +1 on deprecating and eventually removing the old config. The intention
> was
> > absolutely bounding memory usage. I think having two ways of doing this,
> > one that gives a crisp bound on memory and one that is hard to reason
> about
> > is pretty confusing. I think people will really appreciate having one
> > config which instead lets them directly control the thing they actually
> > care about (memory).
> >
> > I also want to second Jun's concern on the complexity of the self-GCing
> > memory pool. I wrote the memory pool for the producer. In that area the
> > pooling of messages is the single biggest factor in performance of the
> > client so I believed it was worth some sophistication/complexity if there
> > was performance payoff. All the same, the complexity of that code has
> made
> > it VERY hard to keep correct (it gets broken roughly every other time
> > someone makes a change). Over time I came to feel a lot less proud of my
> > cleverness. I learned something interesting reading your self-GCing
> memory
> > pool, but I wonder if the complexity is worth the payoff in this case?
> >
> > Philosophically we've tried really hard to avoid needlessly "pluggable"
> > implementations. That is, when there is a temptation to give a config
> that
> > plugs in different Java classes at run time for implementation choices,
> we
> > should instead think of how to give the user the good behavior
> > automatically. I think the use case for configuring a the GCing pool
> would
> > be if you discovered a bug in which memory leaked. But this isn't
> something
> > the user should have to think about right? If there is a bug we should
> find
> > and fix it.
> >
> > -Jay
> >
> > On Fri, Nov 11, 2016 at 9:21 AM, radai <ra...@gmail.com>
> wrote:
> >
> > > jun's #1 + rajini's #11 - the new config param is to enable changing
> the
> > > pool implentation class. as i said in my response to jun i will make
> the
> > > default pool impl be the simple one, and this param is to allow a user
> > > (more likely a dev) to change it.
> > > both the simple pool and the "gc pool" make basically just an
> > > AtomicLong.get() + (hashmap.put for gc) calls before returning a
> buffer.
> > > there is absolutely no dependency on GC times in allocating (or not).
> the
> > > extra background thread in the gc pool is forever asleep unless there
> are
> > > bugs (==leaks) so the extra cost is basically nothing (backed by
> > > benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED MUST ALWAYS
> > BE
> > > RELEASED - so the gc pool should not rely on gc for reclaiming buffers.
> > its
> > > a bug detector, not a feature and is definitely not intended to hide
> > bugs -
> > > the exact opposite - its meant to expose them sooner. i've cleaned up
> the
> > > docs to avoid this confusion. i also like the fail on leak. will do.
> > > as for the gap between pool size and heap size - thats a valid
> argument.
> > > may allow also sizing the pool as % of heap size? so queued.max.bytes =
> > > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available heap?
> > >
> > > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still holds,
> > > assuming the ssl-related buffers are small. the largest weakness in
> this
> > > claim has to do with decompression rather than anything ssl-related. so
> > yes
> > > there is an O(#ssl connections * sslEngine packet size) component, but
> i
> > > think its small. again - decompression should be the concern.
> > >
> > > rajini's #13 - interesting optimization. the problem is there's no
> > knowing
> > > in advance what the _next_ request to come out of a socket is, so this
> > > would mute just those sockets that are 1. mutable and 2. have a
> > > buffer-demanding request for which we could not allocate a buffer.
> > downside
> > > is that as-is this would cause the busy-loop on poll() that the mutes
> > were
> > > supposed to prevent - or code would need to be added to ad-hocmute a
> > > connection that was so-far unmuted but has now generated a
> > memory-demanding
> > > request?
> > >
> > >
> > >
> > > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > > rajinisivaram@googlemail.com> wrote:
> > >
> > > > Radai,
> > > >
> > > > 11. The KIP talks about a new server configuration parameter
> > > > *memory.pool.class.name
> > > > <http://memory.pool.class.name> *which is not in the implementation.
> > Is
> > > it
> > > > still the case that the pool will be configurable?
> > > >
> > > > 12. Personally I would prefer not to have a garbage collected pool
> that
> > > > hides bugs as well. Apart from the added code complexity and extra
> > thread
> > > > to handle collections, I am also concerned about the
> non-deterministic
> > > > nature of GC timings. The KIP introduces delays in processing
> requests
> > > > based on the configuration parameter *queued.max.bytes. *This in
> > > unrelated
> > > > to the JVM heap size and hence pool can be full when there is no
> > pressure
> > > > on the JVM to garbage collect. The KIP does not prevent other
> timeouts
> > in
> > > > the broker (eg. consumer session timeout) because it is relying on
> the
> > > pool
> > > > to be managed in a deterministic, timely manner. Since a garbage
> > > collected
> > > > pool cannot provide that guarantee, wouldn't it be better to run
> tests
> > > with
> > > > a GC-pool that perhaps fails with a fatal error if it encounters a
> > buffer
> > > > that was not released?
> > > >
> > > > 13. The implementation currently mutes all channels that don't have a
> > > > receive buffer allocated. Would it make sense to mute only the
> channels
> > > > that need a buffer (i.e. allow channels to read the 4-byte size that
> is
> > > not
> > > > read using the pool) so that normal client connection close() is
> > handled
> > > > even when the pool is full? Since the extra 4-bytes may already be
> > > > allocated for some connections, the total request memory has to take
> > into
> > > > account *4*numConnections* bytes anyway.
> > > >
> > > >
> > > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Radai,
> > > > >
> > > > > 1. Yes, I am concerned about the trickiness of having to deal with
> > > wreak
> > > > > refs. I think it's simpler to just have the simple version
> > instrumented
> > > > > with enough debug/trace logging and do enough stress testing. Since
> > we
> > > > > still have queued.max.requests, one can always fall back to that
> if a
> > > > > memory leak issue is identified. We could also label the feature as
> > > beta
> > > > if
> > > > > we don't think this is production ready.
> > > > >
> > > > > 2.2 I am just wondering after we fix that issue whether the claim
> > that
> > > > the
> > > > > request memory is bounded by  queued.max.bytes +
> > > socket.request.max.bytes
> > > > > is still true.
> > > > >
> > > > > 5. Ok, leaving the default as -1 is fine then.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <ra...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thank you for taking the time to review this.
> > > > > >
> > > > > > 1. short version - yes, the concern is bugs, but the cost is tiny
> > and
> > > > > worth
> > > > > > it, and its a common pattern. long version:
> > > > > >    1.1 detecting these types of bugs (leaks) cannot be easily
> done
> > > with
> > > > > > simple testing, but requires stress/stability tests that run for
> a
> > > long
> > > > > > time (long enough to hit OOM, depending on leak size and
> available
> > > > > memory).
> > > > > > this is why some sort of leak detector is "standard practice"
> .for
> > > > > example
> > > > > > look at netty (http://netty.io/wiki/reference-counted-objects.
> > > > > > html#leak-detection-levels)
> > > > > > <http://netty.io/wiki/reference-counted-objects.
> > > > > html#leak-detection-levels
> > > > > > >-
> > > > > > they have way more complicated built-in leak detection enabled by
> > > > > default.
> > > > > > as a concrete example - during development i did not properly
> > dispose
> > > > of
> > > > > > in-progress KafkaChannel.receive when a connection was abruptly
> > > closed
> > > > > and
> > > > > > I only found it because of the log msg printed by the pool.
> > > > > >    1.2 I have a benchmark suite showing the performance cost of
> the
> > > gc
> > > > > pool
> > > > > > is absolutely negligible -
> > > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > > > > tree/master/memorypool-benchmarks
> > > > > >    1.3 as for the complexity of the impl - its just ~150 lines
> and
> > > > pretty
> > > > > > straight forward. i think the main issue is that not many people
> > are
> > > > > > familiar with weak refs and ref queues.
> > > > > >
> > > > > >    how about making the pool impl class a config param (generally
> > > good
> > > > > > going forward), make the default be the simple pool, and keep the
> > GC
> > > > one
> > > > > as
> > > > > > a dev/debug/triage aid?
> > > > > >
> > > > > > 2. the KIP itself doesnt specifically treat SSL at all - its an
> > > > > > implementation detail. as for my current patch, it has some
> minimal
> > > > > > treatment of SSL - just enough to not mute SSL sockets
> > mid-handshake
> > > -
> > > > > but
> > > > > > the code in SslTransportLayer still allocates buffers itself. it
> is
> > > my
> > > > > > understanding that netReadBuffer/appReadBuffer shouldn't grow
> > beyond
> > > 2
> > > > x
> > > > > > sslEngine.getSession().getPacketBufferSize(), which i assume to
> be
> > > > > small.
> > > > > > they are also long lived (they live for the duration of the
> > > connection)
> > > > > > which makes a poor fit for pooling. the bigger fish to fry i
> think
> > is
> > > > > > decompression - you could read a 1MB blob into a pool-provided
> > buffer
> > > > and
> > > > > > then decompress it into 10MB of heap allocated on the spot :-)
> > also,
> > > > the
> > > > > > ssl code is extremely tricky.
> > > > > >    2.2 just to make sure, youre talking about Selector.java:
> while
> > > > > > ((networkReceive = channel.read()) != null)
> > > > addToStagedReceives(channel,
> > > > > > networkReceive); ? if so youre right, and i'll fix that (probably
> > by
> > > > > > something similar to immediatelyConnectedKeys, not sure yet)
> > > > > >
> > > > > > 3. isOutOfMemory is self explanatory (and i'll add javadocs and
> > > update
> > > > > the
> > > > > > wiki). isLowOnMem is basically the point where I start
> randomizing
> > > the
> > > > > > selection key handling order to avoid potential starvation. its
> > > rather
> > > > > > arbitrary and now that i think of it should probably not exist
> and
> > be
> > > > > > entirely contained in Selector (where the shuffling takes place).
> > > will
> > > > > fix.
> > > > > >
> > > > > > 4. will do.
> > > > > >
> > > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything
> > <=0).
> > > > > > Long.MAX_VALUE would still create a pool, that would still waste
> > time
> > > > > > tracking resources. I dont really mind though if you have a
> > preferred
> > > > > magic
> > > > > > value for off.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io>
> wrote:
> > > > > >
> > > > > > > Hi, Radai,
> > > > > > >
> > > > > > > Thanks for the KIP. Some comments below.
> > > > > > >
> > > > > > > 1. The KIP says "to facilitate faster implementation (as a
> safety
> > > > net)
> > > > > > the
> > > > > > > pool will be implemented in such a way that memory that was not
> > > > > > release()ed
> > > > > > > (but still garbage collected) would be detected and
> "reclaimed".
> > > this
> > > > > is
> > > > > > to
> > > > > > > prevent "leaks" in case of code paths that fail to release()
> > > > > properly.".
> > > > > > > What are the cases that could cause memory leaks? If we are
> > > concerned
> > > > > > about
> > > > > > > bugs, it seems that it's better to just do more testing to make
> > > sure
> > > > > the
> > > > > > > usage of the simple implementation (SimpleMemoryPool) is solid
> > > > instead
> > > > > of
> > > > > > > adding more complicated logic (GarbageCollectedMemoryPool) to
> > hide
> > > > the
> > > > > > > potential bugs.
> > > > > > >
> > > > > > > 2. I am wondering how much this KIP covers the SSL channel
> > > > > > implementation.
> > > > > > > 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> > > > > > > appReadBuffer per socket. Should those memory be accounted for
> in
> > > > > memory
> > > > > > > pool?
> > > > > > > 2.2 One tricky thing with SSL is that during a
> > KafkaChannel.read(),
> > > > > it's
> > > > > > > possible for multiple NetworkReceives to be returned since
> > multiple
> > > > > > > requests' data could be encrypted together by SSL. To deal with
> > > this,
> > > > > we
> > > > > > > stash those NetworkReceives in Selector.stagedReceives and give
> > it
> > > > back
> > > > > > to
> > > > > > > the poll() call one NetworkReceive at a time. What this means
> is
> > > > that,
> > > > > if
> > > > > > > we stop reading from KafkaChannel in the middle because memory
> > pool
> > > > is
> > > > > > > full, this channel's key may never get selected for reads (even
> > > after
> > > > > the
> > > > > > > read interest is turned on), but there are still pending data
> for
> > > the
> > > > > > > channel, which will never get processed.
> > > > > > >
> > > > > > > 3. The code has the following two methods in MemoryPool, which
> > are
> > > > not
> > > > > > > described in the KIP. Could you explain how they are used in
> the
> > > > wiki?
> > > > > > > isLowOnMemory()
> > > > > > > isOutOfMemory()
> > > > > > >
> > > > > > > 4. Could you also describe in the KIP at the high level, how
> the
> > > read
> > > > > > > interest bit for the socket is turned on/off with respect to
> > > > > MemoryPool?
> > > > > > >
> > > > > > > 5. Should queued.max.bytes defaults to -1 or Long.MAX_VALUE?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> > radai.rosenblatt@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I would like to initiate a vote on KIP-72:
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > > > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+
> requests
> > > > > > > >
> > > > > > > > The kip allows specifying a limit on the amount of memory
> > > allocated
> > > > > for
> > > > > > > > reading incoming requests into. This is useful for "sizing" a
> > > > broker
> > > > > > and
> > > > > > > > avoiding OOMEs under heavy load (as actually happens
> > occasionally
> > > > at
> > > > > > > > linkedin).
> > > > > > > >
> > > > > > > > I believe I've addressed most (all?) concerns brought up
> during
> > > the
> > > > > > > > discussion.
> > > > > > > >
> > > > > > > > To the best of my understanding this vote is about the goal
> and
> > > > > > > > public-facing changes related to the new proposed behavior,
> but
> > > as
> > > > > for
> > > > > > > > implementation, i have the code up here:
> > > > > > > >
> > > > > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > > > > > > > -pool-with-muting
> > > > > > > >
> > > > > > > > and I've stress-tested it to work properly (meaning it chugs
> > > along
> > > > > and
> > > > > > > > throttles under loads that would DOS 10.0.1.0 code).
> > > > > > > >
> > > > > > > > I also believe that the primitives and "pattern"s introduced
> in
> > > > this
> > > > > > KIP
> > > > > > > > (namely the notion of a buffer pool and retrieving from /
> > > releasing
> > > > > to
> > > > > > > said
> > > > > > > > pool instead of allocating memory) are generally useful
> beyond
> > > the
> > > > > > scope
> > > > > > > of
> > > > > > > > this KIP for both performance issues (allocating lots of
> > > > short-lived
> > > > > > > large
> > > > > > > > buffers is a performance bottleneck) and other areas where
> > memory
> > > > > > limits
> > > > > > > > are a problem (KIP-81)
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > Radai.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > >
> >
>
>
>
> --
> Regards,
>
> Rajini
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Rajini Sivaram <ra...@googlemail.com>.
13. At the moment, I think channels are not muted if:
    channel.receive != null && channel.receive.buffer != null
This mutes all channels that aren't holding onto an incomplete buffer. They
may or may not have read the 4-byte size.

I was thinking you could avoid muting channels if:
    channel.receive == null || channel.receive.size.remaining() > 0
This will not mute channels that are holding onto a buffer (as above). In
addition, it will not mute channels that haven't read the 4-byte size. A
client that is closed gracefully while the pool is full will not be muted
in this case and the server can process close without waiting for the pool
to free up. Once the 4-byte size is read, the channel will be muted if the
pool is still out of memory - for each channel, at most one failed read
attempt would be made while the pool is out of memory. I think this would
also delay muting of SSL channels since they can continue to read into
their (already allocated) network buffers and unwrap the data and block
only when they need to allocate a buffer from the pool.
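
A minimal, self-contained sketch of the muting rule proposed above; the field
names (sizeBuffer, payloadBuffer) are hypothetical stand-ins for the real
NetworkReceive internals, and the point is only the shape of the condition:

    import java.nio.ByteBuffer;

    final class MutingRuleSketch {

        static final class InProgressReceive {
            final ByteBuffer sizeBuffer = ByteBuffer.allocate(4); // the 4-byte size prefix
            ByteBuffer payloadBuffer;                             // allocated from the pool later
        }

        // Leave a channel readable while it has no receive in progress or has
        // not yet finished reading the 4-byte size; mute it only once a payload
        // buffer is actually needed and the pool cannot supply one.
        static boolean shouldMute(InProgressReceive receive, boolean poolOutOfMemory) {
            boolean stillReadingSize = receive == null || receive.sizeBuffer.hasRemaining();
            return !stillReadingSize && receive.payloadBuffer == null && poolOutOfMemory;
        }
    }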

On Fri, Nov 11, 2016 at 6:00 PM, Jay Kreps <ja...@confluent.io> wrote:

> Hey Radai,
>
> +1 on deprecating and eventually removing the old config. The intention was
> absolutely bounding memory usage. I think having two ways of doing this,
> one that gives a crisp bound on memory and one that is hard to reason about
> is pretty confusing. I think people will really appreciate having one
> config which instead lets them directly control the thing they actually
> care about (memory).
>
> I also want to second Jun's concern on the complexity of the self-GCing
> memory pool. I wrote the memory pool for the producer. In that area the
> pooling of messages is the single biggest factor in performance of the
> client so I believed it was worth some sophistication/complexity if there
> was performance payoff. All the same, the complexity of that code has made
> it VERY hard to keep correct (it gets broken roughly every other time
> someone makes a change). Over time I came to feel a lot less proud of my
> cleverness. I learned something interesting reading your self-GCing memory
> pool, but I wonder if the complexity is worth the payoff in this case?
>
> Philosophically we've tried really hard to avoid needlessly "pluggable"
> implementations. That is, when there is a temptation to give a config that
> plugs in different Java classes at run time for implementation choices, we
> should instead think of how to give the user the good behavior
> automatically. I think the use case for configuring a the GCing pool would
> be if you discovered a bug in which memory leaked. But this isn't something
> the user should have to think about right? If there is a bug we should find
> and fix it.
>
> -Jay
>
> On Fri, Nov 11, 2016 at 9:21 AM, radai <ra...@gmail.com> wrote:
>
> > jun's #1 + rajini's #11 - the new config param is to enable changing the
> > pool implentation class. as i said in my response to jun i will make the
> > default pool impl be the simple one, and this param is to allow a user
> > (more likely a dev) to change it.
> > both the simple pool and the "gc pool" make basically just an
> > AtomicLong.get() + (hashmap.put for gc) calls before returning a buffer.
> > there is absolutely no dependency on GC times in allocating (or not). the
> > extra background thread in the gc pool is forever asleep unless there are
> > bugs (==leaks) so the extra cost is basically nothing (backed by
> > benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED MUST ALWAYS
> BE
> > RELEASED - so the gc pool should not rely on gc for reclaiming buffers.
> its
> > a bug detector, not a feature and is definitely not intended to hide
> bugs -
> > the exact opposite - its meant to expose them sooner. i've cleaned up the
> > docs to avoid this confusion. i also like the fail on leak. will do.
> > as for the gap between pool size and heap size - thats a valid argument.
> > may allow also sizing the pool as % of heap size? so queued.max.bytes =
> > 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available heap?
> >
> > jun's 2.2 - queued.max.bytes + socket.request.max.bytes still holds,
> > assuming the ssl-related buffers are small. the largest weakness in this
> > claim has to do with decompression rather than anything ssl-related. so
> yes
> > there is an O(#ssl connections * sslEngine packet size) component, but i
> > think its small. again - decompression should be the concern.
> >
> > rajini's #13 - interesting optimization. the problem is there's no
> knowing
> > in advance what the _next_ request to come out of a socket is, so this
> > would mute just those sockets that are 1. mutable and 2. have a
> > buffer-demanding request for which we could not allocate a buffer.
> downside
> > is that as-is this would cause the busy-loop on poll() that the mutes
> were
> > supposed to prevent - or code would need to be added to ad-hocmute a
> > connection that was so-far unmuted but has now generated a
> memory-demanding
> > request?
> >
> >
> >
> > On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> > rajinisivaram@googlemail.com> wrote:
> >
> > > Radai,
> > >
> > > 11. The KIP talks about a new server configuration parameter
> > > *memory.pool.class.name
> > > <http://memory.pool.class.name> *which is not in the implementation.
> Is
> > it
> > > still the case that the pool will be configurable?
> > >
> > > 12. Personally I would prefer not to have a garbage collected pool that
> > > hides bugs as well. Apart from the added code complexity and extra
> thread
> > > to handle collections, I am also concerned about the non-deterministic
> > > nature of GC timings. The KIP introduces delays in processing requests
> > > based on the configuration parameter *queued.max.bytes. *This in
> > unrelated
> > > to the JVM heap size and hence pool can be full when there is no
> pressure
> > > on the JVM to garbage collect. The KIP does not prevent other timeouts
> in
> > > the broker (eg. consumer session timeout) because it is relying on the
> > pool
> > > to be managed in a deterministic, timely manner. Since a garbage
> > collected
> > > pool cannot provide that guarantee, wouldn't it be better to run tests
> > with
> > > a GC-pool that perhaps fails with a fatal error if it encounters a
> buffer
> > > that was not released?
> > >
> > > 13. The implementation currently mutes all channels that don't have a
> > > receive buffer allocated. Would it make sense to mute only the channels
> > > that need a buffer (i.e. allow channels to read the 4-byte size that is
> > not
> > > read using the pool) so that normal client connection close() is
> handled
> > > even when the pool is full? Since the extra 4-bytes may already be
> > > allocated for some connections, the total request memory has to take
> into
> > > account *4*numConnections* bytes anyway.
> > >
> > >
> > > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Radai,
> > > >
> > > > 1. Yes, I am concerned about the trickiness of having to deal with
> > wreak
> > > > refs. I think it's simpler to just have the simple version
> instrumented
> > > > with enough debug/trace logging and do enough stress testing. Since
> we
> > > > still have queued.max.requests, one can always fall back to that if a
> > > > memory leak issue is identified. We could also label the feature as
> > beta
> > > if
> > > > we don't think this is production ready.
> > > >
> > > > 2.2 I am just wondering after we fix that issue whether the claim
> that
> > > the
> > > > request memory is bounded by  queued.max.bytes +
> > socket.request.max.bytes
> > > > is still true.
> > > >
> > > > 5. Ok, leaving the default as -1 is fine then.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Wed, Nov 9, 2016 at 6:01 PM, radai <ra...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thank you for taking the time to review this.
> > > > >
> > > > > 1. short version - yes, the concern is bugs, but the cost is tiny
> and
> > > > worth
> > > > > it, and its a common pattern. long version:
> > > > >    1.1 detecting these types of bugs (leaks) cannot be easily done
> > with
> > > > > simple testing, but requires stress/stability tests that run for a
> > long
> > > > > time (long enough to hit OOM, depending on leak size and available
> > > > memory).
> > > > > this is why some sort of leak detector is "standard practice" .for
> > > > example
> > > > > look at netty (http://netty.io/wiki/reference-counted-objects.
> > > > > html#leak-detection-levels)
> > > > > <http://netty.io/wiki/reference-counted-objects.
> > > > html#leak-detection-levels
> > > > > >-
> > > > > they have way more complicated built-in leak detection enabled by
> > > > default.
> > > > > as a concrete example - during development i did not properly
> dispose
> > > of
> > > > > in-progress KafkaChannel.receive when a connection was abruptly
> > closed
> > > > and
> > > > > I only found it because of the log msg printed by the pool.
> > > > >    1.2 I have a benchmark suite showing the performance cost of the
> > gc
> > > > pool
> > > > > is absolutely negligible -
> > > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > > > tree/master/memorypool-benchmarks
> > > > >    1.3 as for the complexity of the impl - its just ~150 lines and
> > > pretty
> > > > > straight forward. i think the main issue is that not many people
> are
> > > > > familiar with weak refs and ref queues.
> > > > >
> > > > >    how about making the pool impl class a config param (generally
> > good
> > > > > going forward), make the default be the simple pool, and keep the
> GC
> > > one
> > > > as
> > > > > a dev/debug/triage aid?
> > > > >
> > > > > 2. the KIP itself doesnt specifically treat SSL at all - its an
> > > > > implementation detail. as for my current patch, it has some minimal
> > > > > treatment of SSL - just enough to not mute SSL sockets
> mid-handshake
> > -
> > > > but
> > > > > the code in SslTransportLayer still allocates buffers itself. it is
> > my
> > > > > understanding that netReadBuffer/appReadBuffer shouldn't grow
> beyond
> > 2
> > > x
> > > > > sslEngine.getSession().getPacketBufferSize(), which i assume to be
> > > > small.
> > > > > they are also long lived (they live for the duration of the
> > connection)
> > > > > which makes a poor fit for pooling. the bigger fish to fry i think
> is
> > > > > decompression - you could read a 1MB blob into a pool-provided
> buffer
> > > and
> > > > > then decompress it into 10MB of heap allocated on the spot :-)
> also,
> > > the
> > > > > ssl code is extremely tricky.
> > > > >    2.2 just to make sure, youre talking about Selector.java: while
> > > > > ((networkReceive = channel.read()) != null)
> > > addToStagedReceives(channel,
> > > > > networkReceive); ? if so youre right, and i'll fix that (probably
> by
> > > > > something similar to immediatelyConnectedKeys, not sure yet)
> > > > >
> > > > > 3. isOutOfMemory is self explanatory (and i'll add javadocs and
> > update
> > > > the
> > > > > wiki). isLowOnMem is basically the point where I start randomizing
> > the
> > > > > selection key handling order to avoid potential starvation. its
> > rather
> > > > > arbitrary and now that i think of it should probably not exist and
> be
> > > > > entirely contained in Selector (where the shuffling takes place).
> > will
> > > > fix.
> > > > >
> > > > > 4. will do.
> > > > >
> > > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything
> <=0).
> > > > > Long.MAX_VALUE would still create a pool, that would still waste
> time
> > > > > tracking resources. I dont really mind though if you have a
> preferred
> > > > magic
> > > > > value for off.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Radai,
> > > > > >
> > > > > > Thanks for the KIP. Some comments below.
> > > > > >
> > > > > > 1. The KIP says "to facilitate faster implementation (as a safety
> > > net)
> > > > > the
> > > > > > pool will be implemented in such a way that memory that was not
> > > > > release()ed
> > > > > > (but still garbage collected) would be detected and "reclaimed".
> > this
> > > > is
> > > > > to
> > > > > > prevent "leaks" in case of code paths that fail to release()
> > > > properly.".
> > > > > > What are the cases that could cause memory leaks? If we are
> > concerned
> > > > > about
> > > > > > bugs, it seems that it's better to just do more testing to make
> > sure
> > > > the
> > > > > > usage of the simple implementation (SimpleMemoryPool) is solid
> > > instead
> > > > of
> > > > > > adding more complicated logic (GarbageCollectedMemoryPool) to
> hide
> > > the
> > > > > > potential bugs.
> > > > > >
> > > > > > 2. I am wondering how much this KIP covers the SSL channel
> > > > > implementation.
> > > > > > 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> > > > > > appReadBuffer per socket. Should those memory be accounted for in
> > > > memory
> > > > > > pool?
> > > > > > 2.2 One tricky thing with SSL is that during a
> KafkaChannel.read(),
> > > > it's
> > > > > > possible for multiple NetworkReceives to be returned since
> multiple
> > > > > > requests' data could be encrypted together by SSL. To deal with
> > this,
> > > > we
> > > > > > stash those NetworkReceives in Selector.stagedReceives and give
> it
> > > back
> > > > > to
> > > > > > the poll() call one NetworkReceive at a time. What this means is
> > > that,
> > > > if
> > > > > > we stop reading from KafkaChannel in the middle because memory
> pool
> > > is
> > > > > > full, this channel's key may never get selected for reads (even
> > after
> > > > the
> > > > > > read interest is turned on), but there are still pending data for
> > the
> > > > > > channel, which will never get processed.
> > > > > >
> > > > > > 3. The code has the following two methods in MemoryPool, which
> are
> > > not
> > > > > > described in the KIP. Could you explain how they are used in the
> > > wiki?
> > > > > > isLowOnMemory()
> > > > > > isOutOfMemory()
> > > > > >
> > > > > > 4. Could you also describe in the KIP at the high level, how the
> > read
> > > > > > interest bit for the socket is turned on/off with respect to
> > > > MemoryPool?
> > > > > >
> > > > > > 5. Should queued.max.bytes defaults to -1 or Long.MAX_VALUE?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <
> radai.rosenblatt@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to initiate a vote on KIP-72:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> > > > > > >
> > > > > > > The kip allows specifying a limit on the amount of memory
> > allocated
> > > > for
> > > > > > > reading incoming requests into. This is useful for "sizing" a
> > > broker
> > > > > and
> > > > > > > avoiding OOMEs under heavy load (as actually happens
> occasionally
> > > at
> > > > > > > linkedin).
> > > > > > >
> > > > > > > I believe I've addressed most (all?) concerns brought up during
> > the
> > > > > > > discussion.
> > > > > > >
> > > > > > > To the best of my understanding this vote is about the goal and
> > > > > > > public-facing changes related to the new proposed behavior, but
> > as
> > > > for
> > > > > > > implementation, i have the code up here:
> > > > > > >
> > > > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > > > > > > -pool-with-muting
> > > > > > >
> > > > > > > and I've stress-tested it to work properly (meaning it chugs
> > along
> > > > and
> > > > > > > throttles under loads that would DOS 10.0.1.0 code).
> > > > > > >
> > > > > > > I also believe that the primitives and "pattern"s introduced in
> > > this
> > > > > KIP
> > > > > > > (namely the notion of a buffer pool and retrieving from /
> > releasing
> > > > to
> > > > > > said
> > > > > > > pool instead of allocating memory) are generally useful beyond
> > the
> > > > > scope
> > > > > > of
> > > > > > > this KIP for both performance issues (allocating lots of
> > > short-lived
> > > > > > large
> > > > > > > buffers is a performance bottleneck) and other areas where
> memory
> > > > > limits
> > > > > > > are a problem (KIP-81)
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > Radai.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Rajini
> > >
> >
>



-- 
Regards,

Rajini

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jay Kreps <ja...@confluent.io>.
Hey Radai,

+1 on deprecating and eventually removing the old config. The intention was
absolutely bounding memory usage. I think having two ways of doing this,
one that gives a crisp bound on memory and one that is hard to reason about
is pretty confusing. I think people will really appreciate having one
config which instead lets them directly control the thing they actually
care about (memory).

I also want to second Jun's concern on the complexity of the self-GCing
memory pool. I wrote the memory pool for the producer. In that area the
pooling of messages is the single biggest factor in performance of the
client so I believed it was worth some sophistication/complexity if there
was performance payoff. All the same, the complexity of that code has made
it VERY hard to keep correct (it gets broken roughly every other time
someone makes a change). Over time I came to feel a lot less proud of my
cleverness. I learned something interesting reading your self-GCing memory
pool, but I wonder if the complexity is worth the payoff in this case?

Philosophically we've tried really hard to avoid needlessly "pluggable"
implementations. That is, when there is a temptation to give a config that
plugs in different Java classes at run time for implementation choices, we
should instead think of how to give the user the good behavior
automatically. I think the use case for configuring the GCing pool would
be if you discovered a bug in which memory leaked. But this isn't something
the user should have to think about, right? If there is a bug we should find
and fix it.

-Jay

On Fri, Nov 11, 2016 at 9:21 AM, radai <ra...@gmail.com> wrote:

> jun's #1 + rajini's #11 - the new config param is to enable changing the
> pool implentation class. as i said in my response to jun i will make the
> default pool impl be the simple one, and this param is to allow a user
> (more likely a dev) to change it.
> both the simple pool and the "gc pool" make basically just an
> AtomicLong.get() + (hashmap.put for gc) calls before returning a buffer.
> there is absolutely no dependency on GC times in allocating (or not). the
> extra background thread in the gc pool is forever asleep unless there are
> bugs (==leaks) so the extra cost is basically nothing (backed by
> benchmarks). let me re-itarate again - ANY BUFFER ALLOCATED MUST ALWAYS BE
> RELEASED - so the gc pool should not rely on gc for reclaiming buffers. its
> a bug detector, not a feature and is definitely not intended to hide bugs -
> the exact opposite - its meant to expose them sooner. i've cleaned up the
> docs to avoid this confusion. i also like the fail on leak. will do.
> as for the gap between pool size and heap size - thats a valid argument.
> may allow also sizing the pool as % of heap size? so queued.max.bytes =
> 1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available heap?
>
> jun's 2.2 - queued.max.bytes + socket.request.max.bytes still holds,
> assuming the ssl-related buffers are small. the largest weakness in this
> claim has to do with decompression rather than anything ssl-related. so yes
> there is an O(#ssl connections * sslEngine packet size) component, but i
> think its small. again - decompression should be the concern.
>
> rajini's #13 - interesting optimization. the problem is there's no knowing
> in advance what the _next_ request to come out of a socket is, so this
> would mute just those sockets that are 1. mutable and 2. have a
> buffer-demanding request for which we could not allocate a buffer. downside
> is that as-is this would cause the busy-loop on poll() that the mutes were
> supposed to prevent - or code would need to be added to ad-hocmute a
> connection that was so-far unmuted but has now generated a memory-demanding
> request?
>
>
>
> On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
> rajinisivaram@googlemail.com> wrote:
>
> > Radai,
> >
> > 11. The KIP talks about a new server configuration parameter
> > *memory.pool.class.name
> > <http://memory.pool.class.name> *which is not in the implementation. Is
> it
> > still the case that the pool will be configurable?
> >
> > 12. Personally I would prefer not to have a garbage collected pool that
> > hides bugs as well. Apart from the added code complexity and extra thread
> > to handle collections, I am also concerned about the non-deterministic
> > nature of GC timings. The KIP introduces delays in processing requests
> > based on the configuration parameter *queued.max.bytes. *This in
> unrelated
> > to the JVM heap size and hence pool can be full when there is no pressure
> > on the JVM to garbage collect. The KIP does not prevent other timeouts in
> > the broker (eg. consumer session timeout) because it is relying on the
> pool
> > to be managed in a deterministic, timely manner. Since a garbage
> collected
> > pool cannot provide that guarantee, wouldn't it be better to run tests
> with
> > a GC-pool that perhaps fails with a fatal error if it encounters a buffer
> > that was not released?
> >
> > 13. The implementation currently mutes all channels that don't have a
> > receive buffer allocated. Would it make sense to mute only the channels
> > that need a buffer (i.e. allow channels to read the 4-byte size that is
> not
> > read using the pool) so that normal client connection close() is handled
> > even when the pool is full? Since the extra 4-bytes may already be
> > allocated for some connections, the total request memory has to take into
> > account *4*numConnections* bytes anyway.
> >
> >
> > On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Radai,
> > >
> > > 1. Yes, I am concerned about the trickiness of having to deal with
> wreak
> > > refs. I think it's simpler to just have the simple version instrumented
> > > with enough debug/trace logging and do enough stress testing. Since we
> > > still have queued.max.requests, one can always fall back to that if a
> > > memory leak issue is identified. We could also label the feature as
> beta
> > if
> > > we don't think this is production ready.
> > >
> > > 2.2 I am just wondering after we fix that issue whether the claim that
> > the
> > > request memory is bounded by  queued.max.bytes +
> socket.request.max.bytes
> > > is still true.
> > >
> > > 5. Ok, leaving the default as -1 is fine then.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Nov 9, 2016 at 6:01 PM, radai <ra...@gmail.com>
> > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thank you for taking the time to review this.
> > > >
> > > > 1. short version - yes, the concern is bugs, but the cost is tiny and
> > > worth
> > > > it, and its a common pattern. long version:
> > > >    1.1 detecting these types of bugs (leaks) cannot be easily done
> with
> > > > simple testing, but requires stress/stability tests that run for a
> long
> > > > time (long enough to hit OOM, depending on leak size and available
> > > memory).
> > > > this is why some sort of leak detector is "standard practice" .for
> > > example
> > > > look at netty (http://netty.io/wiki/reference-counted-objects.
> > > > html#leak-detection-levels)
> > > > <http://netty.io/wiki/reference-counted-objects.
> > > html#leak-detection-levels
> > > > >-
> > > > they have way more complicated built-in leak detection enabled by
> > > default.
> > > > as a concrete example - during development i did not properly dispose
> > of
> > > > in-progress KafkaChannel.receive when a connection was abruptly
> closed
> > > and
> > > > I only found it because of the log msg printed by the pool.
> > > >    1.2 I have a benchmark suite showing the performance cost of the
> gc
> > > pool
> > > > is absolutely negligible -
> > > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > > tree/master/memorypool-benchmarks
> > > >    1.3 as for the complexity of the impl - its just ~150 lines and
> > pretty
> > > > straight forward. i think the main issue is that not many people are
> > > > familiar with weak refs and ref queues.
> > > >
> > > >    how about making the pool impl class a config param (generally
> good
> > > > going forward), make the default be the simple pool, and keep the GC
> > one
> > > as
> > > > a dev/debug/triage aid?
> > > >
> > > > 2. the KIP itself doesnt specifically treat SSL at all - its an
> > > > implementation detail. as for my current patch, it has some minimal
> > > > treatment of SSL - just enough to not mute SSL sockets mid-handshake
> -
> > > but
> > > > the code in SslTransportLayer still allocates buffers itself. it is
> my
> > > > understanding that netReadBuffer/appReadBuffer shouldn't grow beyond
> 2
> > x
> > > > sslEngine.getSession().getPacketBufferSize(), which i assume to be
> > > small.
> > > > they are also long lived (they live for the duration of the
> connection)
> > > > which makes a poor fit for pooling. the bigger fish to fry i think is
> > > > decompression - you could read a 1MB blob into a pool-provided buffer
> > and
> > > > then decompress it into 10MB of heap allocated on the spot :-) also,
> > the
> > > > ssl code is extremely tricky.
> > > >    2.2 just to make sure, youre talking about Selector.java: while
> > > > ((networkReceive = channel.read()) != null)
> > addToStagedReceives(channel,
> > > > networkReceive); ? if so youre right, and i'll fix that (probably by
> > > > something similar to immediatelyConnectedKeys, not sure yet)
> > > >
> > > > 3. isOutOfMemory is self explanatory (and i'll add javadocs and
> update
> > > the
> > > > wiki). isLowOnMem is basically the point where I start randomizing
> the
> > > > selection key handling order to avoid potential starvation. its
> rather
> > > > arbitrary and now that i think of it should probably not exist and be
> > > > entirely contained in Selector (where the shuffling takes place).
> will
> > > fix.
> > > >
> > > > 4. will do.
> > > >
> > > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything <=0).
> > > > Long.MAX_VALUE would still create a pool, that would still waste time
> > > > tracking resources. I dont really mind though if you have a preferred
> > > magic
> > > > value for off.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Radai,
> > > > >
> > > > > Thanks for the KIP. Some comments below.
> > > > >
> > > > > 1. The KIP says "to facilitate faster implementation (as a safety
> > net)
> > > > the
> > > > > pool will be implemented in such a way that memory that was not
> > > > release()ed
> > > > > (but still garbage collected) would be detected and "reclaimed".
> this
> > > is
> > > > to
> > > > > prevent "leaks" in case of code paths that fail to release()
> > > properly.".
> > > > > What are the cases that could cause memory leaks? If we are
> concerned
> > > > about
> > > > > bugs, it seems that it's better to just do more testing to make
> sure
> > > the
> > > > > usage of the simple implementation (SimpleMemoryPool) is solid
> > instead
> > > of
> > > > > adding more complicated logic (GarbageCollectedMemoryPool) to hide
> > the
> > > > > potential bugs.
> > > > >
> > > > > 2. I am wondering how much this KIP covers the SSL channel
> > > > implementation.
> > > > > 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> > > > > appReadBuffer per socket. Should those memory be accounted for in
> > > memory
> > > > > pool?
> > > > > 2.2 One tricky thing with SSL is that during a KafkaChannel.read(),
> > > it's
> > > > > possible for multiple NetworkReceives to be returned since multiple
> > > > > requests' data could be encrypted together by SSL. To deal with
> this,
> > > we
> > > > > stash those NetworkReceives in Selector.stagedReceives and give it
> > back
> > > > to
> > > > > the poll() call one NetworkReceive at a time. What this means is
> > that,
> > > if
> > > > > we stop reading from KafkaChannel in the middle because memory pool
> > is
> > > > > full, this channel's key may never get selected for reads (even
> after
> > > the
> > > > > read interest is turned on), but there are still pending data for
> the
> > > > > channel, which will never get processed.
> > > > >
> > > > > 3. The code has the following two methods in MemoryPool, which are
> > not
> > > > > described in the KIP. Could you explain how they are used in the
> > wiki?
> > > > > isLowOnMemory()
> > > > > isOutOfMemory()
> > > > >
> > > > > 4. Could you also describe in the KIP at the high level, how the
> read
> > > > > interest bit for the socket is turned on/off with respect to
> > > MemoryPool?
> > > > >
> > > > > 5. Should queued.max.bytes defaults to -1 or Long.MAX_VALUE?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to initiate a vote on KIP-72:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> > > > > >
> > > > > > The kip allows specifying a limit on the amount of memory
> allocated
> > > for
> > > > > > reading incoming requests into. This is useful for "sizing" a
> > broker
> > > > and
> > > > > > avoiding OOMEs under heavy load (as actually happens occasionally
> > at
> > > > > > linkedin).
> > > > > >
> > > > > > I believe I've addressed most (all?) concerns brought up during
> the
> > > > > > discussion.
> > > > > >
> > > > > > To the best of my understanding this vote is about the goal and
> > > > > > public-facing changes related to the new proposed behavior, but
> as
> > > for
> > > > > > implementation, i have the code up here:
> > > > > >
> > > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > > > > > -pool-with-muting
> > > > > >
> > > > > > and I've stress-tested it to work properly (meaning it chugs
> along
> > > and
> > > > > > throttles under loads that would DOS 10.0.1.0 code).
> > > > > >
> > > > > > I also believe that the primitives and "pattern"s introduced in
> > this
> > > > KIP
> > > > > > (namely the notion of a buffer pool and retrieving from /
> releasing
> > > to
> > > > > said
> > > > > > pool instead of allocating memory) are generally useful beyond
> the
> > > > scope
> > > > > of
> > > > > > this KIP for both performance issues (allocating lots of
> > short-lived
> > > > > large
> > > > > > buffers is a performance bottleneck) and other areas where memory
> > > > limits
> > > > > > are a problem (KIP-81)
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > Radai.
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Rajini
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Jun's #1 + Rajini's #11 - the new config param is to enable changing the
pool implementation class. As I said in my response to Jun, I will make the
default pool impl be the simple one, and this param is to allow a user
(more likely a dev) to change it.
Both the simple pool and the "GC pool" make basically just an
AtomicLong.get() + (hashmap.put for GC) call before returning a buffer (a
rough sketch of the simple allocation path is below).
There is absolutely no dependency on GC times in allocating (or not). The
extra background thread in the GC pool is forever asleep unless there are
bugs (== leaks), so the extra cost is basically nothing (backed by
benchmarks). Let me reiterate again - ANY BUFFER ALLOCATED MUST ALWAYS BE
RELEASED - so the GC pool should not rely on GC for reclaiming buffers. It's
a bug detector, not a feature, and is definitely not intended to hide bugs -
the exact opposite - it's meant to expose them sooner. I've cleaned up the
docs to avoid this confusion. I also like the fail-on-leak idea; will do.
As for the gap between pool size and heap size - that's a valid argument.
Maybe also allow sizing the pool as a % of heap size? So queued.max.bytes =
1000000 for 1MB and queued.max.bytes = 0.25 for 25% of available heap?
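
As a rough illustration of that allocation path (not the actual
SimpleMemoryPool code), a bound-only pool around a single AtomicLong could
look like this:

    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch only: one AtomicLong tracks the bytes still available, so
    // tryAllocate() costs a CAS loop plus a ByteBuffer.allocate().
    final class SimplePoolSketch {
        private final AtomicLong availableBytes;

        SimplePoolSketch(long poolSizeBytes) {
            this.availableBytes = new AtomicLong(poolSizeBytes);
        }

        ByteBuffer tryAllocate(int requestBytes) {
            while (true) {
                long available = availableBytes.get();
                if (available < requestBytes)
                    return null;                                  // not enough memory right now
                if (availableBytes.compareAndSet(available, available - requestBytes))
                    return ByteBuffer.allocate(requestBytes);
            }
        }

        // Every successful tryAllocate() must be matched by exactly one release().
        void release(ByteBuffer buffer) {
            availableBytes.addAndGet(buffer.capacity());
        }

        boolean isOutOfMemory() {
            return availableBytes.get() <= 0;
        }
    }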

Jun's 2.2 - the queued.max.bytes + socket.request.max.bytes bound still holds,
assuming the SSL-related buffers are small. The largest weakness in this
claim has to do with decompression rather than anything SSL-related. So yes,
there is an O(#ssl connections * sslEngine packet size) component, but I
think it's small. Again - decompression should be the concern.

Rajini's #13 - interesting optimization. The problem is there's no knowing
in advance what the _next_ request to come out of a socket is, so this
would mute just those sockets that are 1. mutable and 2. have a
buffer-demanding request for which we could not allocate a buffer. The downside
is that, as-is, this would cause the busy-loop on poll() that the mutes were
supposed to prevent - or code would need to be added to ad-hoc mute a
connection that was so far unmuted but has now generated a memory-demanding
request?



On Fri, Nov 11, 2016 at 5:02 AM, Rajini Sivaram <
rajinisivaram@googlemail.com> wrote:

> Radai,
>
> 11. The KIP talks about a new server configuration parameter
> *memory.pool.class.name
> <http://memory.pool.class.name> *which is not in the implementation. Is it
> still the case that the pool will be configurable?
>
> 12. Personally I would prefer not to have a garbage collected pool that
> hides bugs as well. Apart from the added code complexity and extra thread
> to handle collections, I am also concerned about the non-deterministic
> nature of GC timings. The KIP introduces delays in processing requests
> based on the configuration parameter *queued.max.bytes. *This in unrelated
> to the JVM heap size and hence pool can be full when there is no pressure
> on the JVM to garbage collect. The KIP does not prevent other timeouts in
> the broker (eg. consumer session timeout) because it is relying on the pool
> to be managed in a deterministic, timely manner. Since a garbage collected
> pool cannot provide that guarantee, wouldn't it be better to run tests with
> a GC-pool that perhaps fails with a fatal error if it encounters a buffer
> that was not released?
>
> 13. The implementation currently mutes all channels that don't have a
> receive buffer allocated. Would it make sense to mute only the channels
> that need a buffer (i.e. allow channels to read the 4-byte size that is not
> read using the pool) so that normal client connection close() is handled
> even when the pool is full? Since the extra 4-bytes may already be
> allocated for some connections, the total request memory has to take into
> account *4*numConnections* bytes anyway.
>
>
> On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Radai,
> >
> > 1. Yes, I am concerned about the trickiness of having to deal with wreak
> > refs. I think it's simpler to just have the simple version instrumented
> > with enough debug/trace logging and do enough stress testing. Since we
> > still have queued.max.requests, one can always fall back to that if a
> > memory leak issue is identified. We could also label the feature as beta
> if
> > we don't think this is production ready.
> >
> > 2.2 I am just wondering after we fix that issue whether the claim that
> the
> > request memory is bounded by  queued.max.bytes + socket.request.max.bytes
> > is still true.
> >
> > 5. Ok, leaving the default as -1 is fine then.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Nov 9, 2016 at 6:01 PM, radai <ra...@gmail.com>
> wrote:
> >
> > > Hi Jun,
> > >
> > > Thank you for taking the time to review this.
> > >
> > > 1. short version - yes, the concern is bugs, but the cost is tiny and
> > worth
> > > it, and its a common pattern. long version:
> > >    1.1 detecting these types of bugs (leaks) cannot be easily done with
> > > simple testing, but requires stress/stability tests that run for a long
> > > time (long enough to hit OOM, depending on leak size and available
> > memory).
> > > this is why some sort of leak detector is "standard practice" .for
> > example
> > > look at netty (http://netty.io/wiki/reference-counted-objects.
> > > html#leak-detection-levels)
> > > <http://netty.io/wiki/reference-counted-objects.
> > html#leak-detection-levels
> > > >-
> > > they have way more complicated built-in leak detection enabled by
> > default.
> > > as a concrete example - during development i did not properly dispose
> of
> > > in-progress KafkaChannel.receive when a connection was abruptly closed
> > and
> > > I only found it because of the log msg printed by the pool.
> > >    1.2 I have a benchmark suite showing the performance cost of the gc
> > pool
> > > is absolutely negligible -
> > > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > > tree/master/memorypool-benchmarks
> > >    1.3 as for the complexity of the impl - its just ~150 lines and
> pretty
> > > straight forward. i think the main issue is that not many people are
> > > familiar with weak refs and ref queues.
> > >
> > >    how about making the pool impl class a config param (generally good
> > > going forward), make the default be the simple pool, and keep the GC
> one
> > as
> > > a dev/debug/triage aid?
> > >
> > > 2. the KIP itself doesnt specifically treat SSL at all - its an
> > > implementation detail. as for my current patch, it has some minimal
> > > treatment of SSL - just enough to not mute SSL sockets mid-handshake -
> > but
> > > the code in SslTransportLayer still allocates buffers itself. it is my
> > > understanding that netReadBuffer/appReadBuffer shouldn't grow beyond 2
> x
> > > sslEngine.getSession().getPacketBufferSize(), which i assume to be
> > small.
> > > they are also long lived (they live for the duration of the connection)
> > > which makes a poor fit for pooling. the bigger fish to fry i think is
> > > decompression - you could read a 1MB blob into a pool-provided buffer
> and
> > > then decompress it into 10MB of heap allocated on the spot :-) also,
> the
> > > ssl code is extremely tricky.
> > >    2.2 just to make sure, youre talking about Selector.java: while
> > > ((networkReceive = channel.read()) != null)
> addToStagedReceives(channel,
> > > networkReceive); ? if so youre right, and i'll fix that (probably by
> > > something similar to immediatelyConnectedKeys, not sure yet)
> > >
> > > 3. isOutOfMemory is self explanatory (and i'll add javadocs and update
> > the
> > > wiki). isLowOnMem is basically the point where I start randomizing the
> > > selection key handling order to avoid potential starvation. its rather
> > > arbitrary and now that i think of it should probably not exist and be
> > > entirely contained in Selector (where the shuffling takes place). will
> > fix.
> > >
> > > 4. will do.
> > >
> > > 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything <=0).
> > > Long.MAX_VALUE would still create a pool, that would still waste time
> > > tracking resources. I dont really mind though if you have a preferred
> > magic
> > > value for off.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Radai,
> > > >
> > > > Thanks for the KIP. Some comments below.
> > > >
> > > > 1. The KIP says "to facilitate faster implementation (as a safety
> net)
> > > the
> > > > pool will be implemented in such a way that memory that was not
> > > release()ed
> > > > (but still garbage collected) would be detected and "reclaimed". this
> > is
> > > to
> > > > prevent "leaks" in case of code paths that fail to release()
> > properly.".
> > > > What are the cases that could cause memory leaks? If we are concerned
> > > about
> > > > bugs, it seems that it's better to just do more testing to make sure
> > the
> > > > usage of the simple implementation (SimpleMemoryPool) is solid
> instead
> > of
> > > > adding more complicated logic (GarbageCollectedMemoryPool) to hide
> the
> > > > potential bugs.
> > > >
> > > > 2. I am wondering how much this KIP covers the SSL channel
> > > implementation.
> > > > 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> > > > appReadBuffer per socket. Should that memory be accounted for in the
> > memory
> > > > pool?
> > > > 2.2 One tricky thing with SSL is that during a KafkaChannel.read(),
> > it's
> > > > possible for multiple NetworkReceives to be returned since multiple
> > > > requests' data could be encrypted together by SSL. To deal with this,
> > we
> > > > stash those NetworkReceives in Selector.stagedReceives and give it
> back
> > > to
> > > > the poll() call one NetworkReceive at a time. What this means is
> that,
> > if
> > > > we stop reading from KafkaChannel in the middle because memory pool
> is
> > > > full, this channel's key may never get selected for reads (even after
> > the
> > > > read interest is turned on), but there are still pending data for the
> > > > channel, which will never get processed.
> > > >
> > > > 3. The code has the following two methods in MemoryPool, which are
> not
> > > > described in the KIP. Could you explain how they are used in the
> wiki?
> > > > isLowOnMemory()
> > > > isOutOfMemory()
> > > >
> > > > 4. Could you also describe in the KIP at the high level, how the read
> > > > interest bit for the socket is turned on/off with respect to
> > MemoryPool?
> > > >
> > > > 5. Should queued.max.bytes default to -1 or Long.MAX_VALUE?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would like to initiate a vote on KIP-72:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> > > > >
> > > > > The kip allows specifying a limit on the amount of memory allocated
> > for
> > > > > reading incoming requests into. This is useful for "sizing" a
> broker
> > > and
> > > > > avoiding OOMEs under heavy load (as actually happens occasionally
> at
> > > > > linkedin).
> > > > >
> > > > > I believe I've addressed most (all?) concerns brought up during the
> > > > > discussion.
> > > > >
> > > > > To the best of my understanding this vote is about the goal and
> > > > > public-facing changes related to the new proposed behavior, but as
> > for
> > > > > implementation, i have the code up here:
> > > > >
> > > > > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > > > > -pool-with-muting
> > > > >
> > > > > and I've stress-tested it to work properly (meaning it chugs along
> > and
> > > > > throttles under loads that would DOS 10.0.1.0 code).
> > > > >
> > > > > I also believe that the primitives and "pattern"s introduced in
> this
> > > KIP
> > > > > (namely the notion of a buffer pool and retrieving from / releasing
> > to
> > > > said
> > > > > pool instead of allocating memory) are generally useful beyond the
> > > scope
> > > > of
> > > > > this KIP for both performance issues (allocating lots of
> short-lived
> > > > large
> > > > > buffers is a performance bottleneck) and other areas where memory
> > > limits
> > > > > are a problem (KIP-81)
> > > > >
> > > > > Thank you,
> > > > >
> > > > > Radai.
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Regards,
>
> Rajini
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Rajini Sivaram <ra...@googlemail.com>.
Radai,

11. The KIP talks about a new server configuration parameter
*memory.pool.class.name* which is not in the implementation. Is it
still the case that the pool will be configurable?

12. Personally I would prefer not to have a garbage collected pool that
hides bugs as well. Apart from the added code complexity and extra thread
to handle collections, I am also concerned about the non-deterministic
nature of GC timings. The KIP introduces delays in processing requests
based on the configuration parameter *queued.max.bytes*. This is unrelated
to the JVM heap size and hence the pool can be full when there is no pressure
on the JVM to garbage collect. The KIP does not prevent other timeouts in
the broker (eg. consumer session timeout) because it is relying on the pool
to be managed in a deterministic, timely manner. Since a garbage collected
pool cannot provide that guarantee, wouldn't it be better to run tests with
a GC-pool that perhaps fails with a fatal error if it encounters a buffer
that was not released?

13. The implementation currently mutes all channels that don't have a
receive buffer allocated. Would it make sense to mute only the channels
that need a buffer (i.e. allow channels to read the 4-byte size that is not
read using the pool) so that normal client connection close() is handled
even when the pool is full? Since the extra 4-bytes may already be
allocated for some connections, the total request memory has to take into
account *4*numConnections* bytes anyway.
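
As a rough illustration of 13 (the names below are made up for the sketch
and are not the actual NetworkReceive/KafkaChannel code), the 4-byte size
could always be read into a small heap buffer, with only the payload taken
from the pool:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// hypothetical minimal pool contract, only what this sketch needs
interface MemoryPool {
    ByteBuffer tryAllocate(int sizeBytes); // returns null when the pool is exhausted
}

// the 4-byte size prefix is read from a small heap buffer that never comes from
// the pool; only the payload buffer is requested from the pool, so a closed
// connection is still noticed (read() returns -1 on the size buffer) even when
// the pool is full
final class SizePrefixedReceive {
    private final ByteBuffer sizeBuffer = ByteBuffer.allocate(4); // the per-connection 4 bytes
    private ByteBuffer payloadBuffer;                             // pool-provided, may be absent

    /** returns true once the complete payload has been read */
    boolean readFrom(ReadableByteChannel channel, MemoryPool pool) throws IOException {
        if (sizeBuffer.hasRemaining()) {
            if (channel.read(sizeBuffer) < 0)
                throw new IOException("connection closed by peer");
            if (sizeBuffer.hasRemaining())
                return false; // size prefix not complete yet
        }
        if (payloadBuffer == null) {
            int size = sizeBuffer.getInt(0);
            payloadBuffer = pool.tryAllocate(size);
            if (payloadBuffer == null)
                return false; // no memory right now; retry on a later poll without muting
        }
        if (channel.read(payloadBuffer) < 0)
            throw new IOException("connection closed mid-payload");
        return !payloadBuffer.hasRemaining();
    }
}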


On Thu, Nov 10, 2016 at 11:51 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Radai,
>
> 1. Yes, I am concerned about the trickiness of having to deal with weak
> refs. I think it's simpler to just have the simple version instrumented
> with enough debug/trace logging and do enough stress testing. Since we
> still have queued.max.requests, one can always fall back to that if a
> memory leak issue is identified. We could also label the feature as beta if
> we don't think this is production ready.
>
> 2.2 I am just wondering after we fix that issue whether the claim that the
> request memory is bounded by  queued.max.bytes + socket.request.max.bytes
> is still true.
>
> 5. Ok, leaving the default as -1 is fine then.
>
> Thanks,
>
> Jun
>
> On Wed, Nov 9, 2016 at 6:01 PM, radai <ra...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > Thank you for taking the time to review this.
> >
> > 1. short version - yes, the concern is bugs, but the cost is tiny and
> worth
> > it, and its a common pattern. long version:
> >    1.1 detecting these types of bugs (leaks) cannot be easily done with
> > simple testing, but requires stress/stability tests that run for a long
> > time (long enough to hit OOM, depending on leak size and available
> memory).
> > this is why some sort of leak detector is "standard practice". For example
> > look at netty (http://netty.io/wiki/reference-counted-objects.html#leak-detection-levels) -
> > they have way more complicated built-in leak detection enabled by default.
> > as a concrete example - during development i did not properly dispose of
> > in-progress KafkaChannel.receive when a connection was abruptly closed
> and
> > I only found it because of the log msg printed by the pool.
> >    1.2 I have a benchmark suite showing the performance cost of the gc
> pool
> > is absolutely negligible -
> > https://github.com/radai-rosenblatt/kafka-benchmarks/
> > tree/master/memorypool-benchmarks
> >    1.3 as for the complexity of the impl - its just ~150 lines and pretty
> > straight forward. i think the main issue is that not many people are
> > familiar with weak refs and ref queues.
> >
> >    how about making the pool impl class a config param (generally good
> > going forward), make the default be the simple pool, and keep the GC one
> as
> > a dev/debug/triage aid?
> >
> > 2. the KIP itself doesnt specifically treat SSL at all - its an
> > implementation detail. as for my current patch, it has some minimal
> > treatment of SSL - just enough to not mute SSL sockets mid-handshake -
> but
> > the code in SslTransportLayer still allocates buffers itself. it is my
> > understanding that netReadBuffer/appReadBuffer shouldn't grow beyond 2 x
> > sslEngine.getSession().getPacketBufferSize(), which i assume to be
> small.
> > they are also long lived (they live for the duration of the connection)
> > which makes a poor fit for pooling. the bigger fish to fry i think is
> > decompression - you could read a 1MB blob into a pool-provided buffer and
> > then decompress it into 10MB of heap allocated on the spot :-) also, the
> > ssl code is extremely tricky.
> >    2.2 just to make sure, youre talking about Selector.java: while
> > ((networkReceive = channel.read()) != null) addToStagedReceives(channel,
> > networkReceive); ? if so youre right, and i'll fix that (probably by
> > something similar to immediatelyConnectedKeys, not sure yet)
> >
> > 3. isOutOfMemory is self explanatory (and i'll add javadocs and update
> the
> > wiki). isLowOnMem is basically the point where I start randomizing the
> > selection key handling order to avoid potential starvation. its rather
> > arbitrary and now that i think of it should probably not exist and be
> > entirely contained in Selector (where the shuffling takes place). will
> fix.
> >
> > 4. will do.
> >
> > 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything <=0).
> > Long.MAX_VALUE would still create a pool, that would still waste time
> > tracking resources. I dont really mind though if you have a preferred
> magic
> > value for off.
> >
> >
> >
> >
> >
> > On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Radai,
> > >
> > > Thanks for the KIP. Some comments below.
> > >
> > > 1. The KIP says "to facilitate faster implementation (as a safety net)
> > the
> > > pool will be implemented in such a way that memory that was not
> > release()ed
> > > (but still garbage collected) would be detected and "reclaimed". this
> is
> > to
> > > prevent "leaks" in case of code paths that fail to release()
> properly.".
> > > What are the cases that could cause memory leaks? If we are concerned
> > about
> > > bugs, it seems that it's better to just do more testing to make sure
> the
> > > usage of the simple implementation (SimpleMemoryPool) is solid instead
> of
> > > adding more complicated logic (GarbageCollectedMemoryPool) to hide the
> > > potential bugs.
> > >
> > > 2. I am wondering how much this KIP covers the SSL channel
> > implementation.
> > > 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> > > appReadBuffer per socket. Should that memory be accounted for in the
> memory
> > > pool?
> > > 2.2 One tricky thing with SSL is that during a KafkaChannel.read(),
> it's
> > > possible for multiple NetworkReceives to be returned since multiple
> > > requests' data could be encrypted together by SSL. To deal with this,
> we
> > > stash those NetworkReceives in Selector.stagedReceives and give it back
> > to
> > > the poll() call one NetworkReceive at a time. What this means is that,
> if
> > > we stop reading from KafkaChannel in the middle because memory pool is
> > > full, this channel's key may never get selected for reads (even after
> the
> > > read interest is turned on), but there are still pending data for the
> > > channel, which will never get processed.
> > >
> > > 3. The code has the following two methods in MemoryPool, which are not
> > > described in the KIP. Could you explain how they are used in the wiki?
> > > isLowOnMemory()
> > > isOutOfMemory()
> > >
> > > 4. Could you also describe in the KIP at the high level, how the read
> > > interest bit for the socket is turned on/off with respect to
> MemoryPool?
> > >
> > > 5. Should queued.max.bytes default to -1 or Long.MAX_VALUE?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to initiate a vote on KIP-72:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> > > >
> > > > The kip allows specifying a limit on the amount of memory allocated
> for
> > > > reading incoming requests into. This is useful for "sizing" a broker
> > and
> > > > avoiding OOMEs under heavy load (as actually happens occasionally at
> > > > linkedin).
> > > >
> > > > I believe I've addressed most (all?) concerns brought up during the
> > > > discussion.
> > > >
> > > > To the best of my understanding this vote is about the goal and
> > > > public-facing changes related to the new proposed behavior, but as
> for
> > > > implementation, i have the code up here:
> > > >
> > > > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > > > -pool-with-muting
> > > >
> > > > and I've stress-tested it to work properly (meaning it chugs along
> and
> > > > throttles under loads that would DOS 10.0.1.0 code).
> > > >
> > > > I also believe that the primitives and "pattern"s introduced in this
> > KIP
> > > > (namely the notion of a buffer pool and retrieving from / releasing
> to
> > > said
> > > > pool instead of allocating memory) are generally useful beyond the
> > scope
> > > of
> > > > this KIP for both performance issues (allocating lots of short-lived
> > > large
> > > > buffers is a performance bottleneck) and other areas where memory
> > limits
> > > > are a problem (KIP-81)
> > > >
> > > > Thank you,
> > > >
> > > > Radai.
> > > >
> > >
> >
>



-- 
Regards,

Rajini

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

1. Yes, I am concerned about the trickiness of having to deal with weak
refs. I think it's simpler to just have the simple version instrumented
with enough debug/trace logging and do enough stress testing. Since we
still have queued.max.requests, one can always fall back to that if a
memory leak issue is identified. We could also label the feature as beta if
we don't think this is production ready.

2.2 I am just wondering after we fix that issue whether the claim that the
request memory is bounded by  queued.max.bytes + socket.request.max.bytes
is still true.

5. Ok, leaving the default as -1 is fine then.

Thanks,

Jun

On Wed, Nov 9, 2016 at 6:01 PM, radai <ra...@gmail.com> wrote:

> Hi Jun,
>
> Thank you for taking the time to review this.
>
> 1. short version - yes, the concern is bugs, but the cost is tiny and worth
> it, and its a common pattern. long version:
>    1.1 detecting these types of bugs (leaks) cannot be easily done with
> simple testing, but requires stress/stability tests that run for a long
> time (long enough to hit OOM, depending on leak size and available memory).
> this is why some sort of leak detector is "standard practice". For example
> look at netty (http://netty.io/wiki/reference-counted-objects.html#leak-detection-levels) -
> they have way more complicated built-in leak detection enabled by default.
> as a concrete example - during development i did not properly dispose of
> in-progress KafkaChannel.receive when a connection was abruptly closed and
> I only found it because of the log msg printed by the pool.
>    1.2 I have a benchmark suite showing the performance cost of the gc pool
> is absolutely negligible -
> https://github.com/radai-rosenblatt/kafka-benchmarks/
> tree/master/memorypool-benchmarks
>    1.3 as for the complexity of the impl - its just ~150 lines and pretty
> straight forward. i think the main issue is that not many people are
> familiar with weak refs and ref queues.
>
>    how about making the pool impl class a config param (generally good
> going forward), make the default be the simple pool, and keep the GC one as
> a dev/debug/triage aid?
>
> 2. the KIP itself doesnt specifically treat SSL at all - its an
> implementation detail. as for my current patch, it has some minimal
> treatment of SSL - just enough to not mute SSL sockets mid-handshake - but
> the code in SslTransportLayer still allocates buffers itself. it is my
> understanding that netReadBuffer/appReadBuffer shouldn't grow beyond 2 x
> sslEngine.getSession().getPacketBufferSize(), which i assume to be small.
> they are also long lived (they live for the duration of the connection)
> which makes a poor fit for pooling. the bigger fish to fry i think is
> decompression - you could read a 1MB blob into a pool-provided buffer and
> then decompress it into 10MB of heap allocated on the spot :-) also, the
> ssl code is extremely tricky.
>    2.2 just to make sure, youre talking about Selector.java: while
> ((networkReceive = channel.read()) != null) addToStagedReceives(channel,
> networkReceive); ? if so youre right, and i'll fix that (probably by
> something similar to immediatelyConnectedKeys, not sure yet)
>
> 3. isOutOfMemory is self explanatory (and i'll add javadocs and update the
> wiki). isLowOnMem is basically the point where I start randomizing the
> selection key handling order to avoid potential starvation. its rather
> arbitrary and now that i think of it should probably not exist and be
> entirely contained in Selector (where the shuffling takes place). will fix.
>
> 4. will do.
>
> 5. I prefer -1 or 0 as an explicit "OFF" (or basically anything <=0).
> Long.MAX_VALUE would still create a pool, that would still waste time
> tracking resources. I dont really mind though if you have a preferred magic
> value for off.
>
>
>
>
>
> On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Radai,
> >
> > Thanks for the KIP. Some comments below.
> >
> > 1. The KIP says "to facilitate faster implementation (as a safety net)
> the
> > pool will be implemented in such a way that memory that was not
> release()ed
> > (but still garbage collected) would be detected and "reclaimed". this is
> to
> > prevent "leaks" in case of code paths that fail to release() properly.".
> > What are the cases that could cause memory leaks? If we are concerned
> about
> > bugs, it seems that it's better to just do more testing to make sure the
> > usage of the simple implementation (SimpleMemoryPool) is solid instead of
> > adding more complicated logic (GarbageCollectedMemoryPool) to hide the
> > potential bugs.
> >
> > 2. I am wondering how much this KIP covers the SSL channel
> implementation.
> > 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> > appReadBuffer per socket. Should that memory be accounted for in the memory
> > pool?
> > 2.2 One tricky thing with SSL is that during a KafkaChannel.read(), it's
> > possible for multiple NetworkReceives to be returned since multiple
> > requests' data could be encrypted together by SSL. To deal with this, we
> > stash those NetworkReceives in Selector.stagedReceives and give it back
> to
> > the poll() call one NetworkReceive at a time. What this means is that, if
> > we stop reading from KafkaChannel in the middle because memory pool is
> > full, this channel's key may never get selected for reads (even after the
> > read interest is turned on), but there are still pending data for the
> > channel, which will never get processed.
> >
> > 3. The code has the following two methods in MemoryPool, which are not
> > described in the KIP. Could you explain how they are used in the wiki?
> > isLowOnMemory()
> > isOutOfMemory()
> >
> > 4. Could you also describe in the KIP at the high level, how the read
> > interest bit for the socket is turned on/off with respect to MemoryPool?
> >
> > 5. Should queued.max.bytes default to -1 or Long.MAX_VALUE?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > I would like to initiate a vote on KIP-72:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > > Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> > >
> > > The kip allows specifying a limit on the amount of memory allocated for
> > > reading incoming requests into. This is useful for "sizing" a broker
> and
> > > avoiding OOMEs under heavy load (as actually happens occasionally at
> > > linkedin).
> > >
> > > I believe I've addressed most (all?) concerns brought up during the
> > > discussion.
> > >
> > > To the best of my understanding this vote is about the goal and
> > > public-facing changes related to the new proposed behavior, but as for
> > > implementation, i have the code up here:
> > >
> > > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > > -pool-with-muting
> > >
> > > and I've stress-tested it to work properly (meaning it chugs along and
> > > throttles under loads that would DOS 10.0.1.0 code).
> > >
> > > I also believe that the primitives and "pattern"s introduced in this
> KIP
> > > (namely the notion of a buffer pool and retrieving from / releasing to
> > said
> > > pool instead of allocating memory) are generally useful beyond the
> scope
> > of
> > > this KIP for both performance issues (allocating lots of short-lived
> > large
> > > buffers is a performance bottleneck) and other areas where memory
> limits
> > > are a problem (KIP-81)
> > >
> > > Thank you,
> > >
> > > Radai.
> > >
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
Hi Jun,

Thank you for taking the time to review this.

1. short version - yes, the concern is bugs, but the cost is tiny and worth
it, and its a common pattern. long version:
   1.1 detecting these types of bugs (leaks) cannot be easily done with
simple testing, but requires stress/stability tests that run for a long
time (long enough to hit OOM, depending on leak size and available memory).
this is why some sort of leak detector is "standard practice". For example
look at netty (http://netty.io/wiki/reference-counted-objects.html#leak-detection-levels) -
they have way more complicated built-in leak detection enabled by default.
as a concrete example - during development i did not properly dispose of
in-progress KafkaChannel.receive when a connection was abruptly closed and
I only found it because of the log msg printed by the pool.
   1.2 I have a benchmark suite showing the performance cost of the gc pool
is absolutely negligible -
https://github.com/radai-rosenblatt/kafka-benchmarks/tree/master/memorypool-benchmarks
   1.3 as for the complexity of the impl - its just ~150 lines and pretty
straight forward. i think the main issue is that not many people are
familiar with weak refs and ref queues (a minimal, generic sketch of the
pattern is included at the end of this message).

   how about making the pool impl class a config param (generally good
going forward), make the default be the simple pool, and keep the GC one as
a dev/debug/triage aid?

2. the KIP itself doesnt specifically treat SSL at all - its an
implementation detail. as for my current patch, it has some minimal
treatment of SSL - just enough to not mute SSL sockets mid-handshake - but
the code in SslTransportLayer still allocates buffers itself. it is my
understanding that netReadBuffer/appReadBuffer shouldn't grow beyond 2 x
sslEngine.getSession().getPacketBufferSize(), which i assume to be small.
they are also long lived (they live for the duration of the connection)
which makes a poor fit for pooling. the bigger fish to fry i think is
decompression - you could read a 1MB blob into a pool-provided buffer and
then decompress it into 10MB of heap allocated on the spot :-) also, the
ssl code is extremely tricky.
   2.2 just to make sure, youre talking about Selector.java: while
((networkReceive = channel.read()) != null) addToStagedReceives(channel,
networkReceive); ? if so youre right, and i'll fix that (probably by
something similar to immediatelyConnectedKeys, not sure yet)

3. isOutOfMemory is self explanatory (and i'll add javadocs and update the
wiki). isLowOnMem is basically the point where I start randomizing the
selection key handling order to avoid potential starvation. its rather
arbitrary and now that i think of it should probably not exist and be
entirely contained in Selector (where the shuffling takes place). will fix
(a small sketch of the shuffling idea is also at the end of this message).

4. will do.

5. I prefer -1 or 0 as an explicit "OFF" (or basically anything <=0).
Long.MAX_VALUE would still create a pool, that would still waste time
tracking resources. I dont really mind though if you have a preferred magic
value for off.
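
to make the weak-ref / ref-queue pattern from 1.1-1.3 concrete, here's a
minimal generic sketch (illustrative names only - this is not the actual
GarbageCollectedMemoryPool code):

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// every handed-out buffer is tracked via a WeakReference registered on a
// ReferenceQueue. if a buffer gets garbage collected without release() having
// been called, its reference shows up on the queue and the pool can log the
// leak and re-credit the lost bytes.
final class LeakDetectingPool {
    private final AtomicLong availableBytes;
    private final ReferenceQueue<ByteBuffer> collected = new ReferenceQueue<>();
    private final Map<Reference<ByteBuffer>, Integer> outstanding = new ConcurrentHashMap<>();

    LeakDetectingPool(long capacityBytes) {
        this.availableBytes = new AtomicLong(capacityBytes);
    }

    ByteBuffer tryAllocate(int sizeBytes) {
        if (availableBytes.addAndGet(-sizeBytes) < 0) {
            availableBytes.addAndGet(sizeBytes); // undo - not enough memory right now
            return null;
        }
        ByteBuffer buffer = ByteBuffer.allocate(sizeBytes);
        outstanding.put(new WeakReference<>(buffer, collected), sizeBytes);
        return buffer;
    }

    void release(ByteBuffer buffer) {
        for (Map.Entry<Reference<ByteBuffer>, Integer> e : outstanding.entrySet()) {
            if (e.getKey().get() == buffer) {   // identity match - this is the tracked buffer
                outstanding.remove(e.getKey());
                e.getKey().clear();             // a cleared ref will not surface on the queue
                availableBytes.addAndGet(e.getValue());
                return;
            }
        }
        throw new IllegalArgumentException("buffer not allocated from this pool (or already released)");
    }

    // called periodically (e.g. from a background thread) to reclaim leaked buffers
    void reclaimLeaks() {
        Reference<? extends ByteBuffer> ref;
        while ((ref = collected.poll()) != null) {
            Integer size = outstanding.remove(ref);
            if (size != null) {
                availableBytes.addAndGet(size);
                System.err.println("memory pool leak: " + size + " bytes were never release()ed");
            }
        }
    }
}

and a small sketch of the key-order shuffling mentioned in 3 (again
illustrative, not the actual Selector change):

import java.nio.channels.SelectionKey;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// when the pool is running low, shuffle the ready keys before handling them so
// the same early connections don't always win the remaining memory
final class KeyOrder {
    static List<SelectionKey> handlingOrder(Set<SelectionKey> selectedKeys, boolean lowOnMemory) {
        List<SelectionKey> keys = new ArrayList<>(selectedKeys);
        if (lowOnMemory)
            Collections.shuffle(keys); // randomize to avoid starving late keys
        return keys;
    }
}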





On Wed, Nov 9, 2016 at 9:28 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Radai,
>
> Thanks for the KIP. Some comments below.
>
> 1. The KIP says "to facilitate faster implementation (as a safety net) the
> pool will be implemented in such a way that memory that was not release()ed
> (but still garbage collected) would be detected and "reclaimed". this is to
> prevent "leaks" in case of code paths that fail to release() properly.".
> What are the cases that could cause memory leaks? If we are concerned about
> bugs, it seems that it's better to just do more testing to make sure the
> usage of the simple implementation (SimpleMemoryPool) is solid instead of
> adding more complicated logic (GarbageCollectedMemoryPool) to hide the
> potential bugs.
>
> 2. I am wondering how much this KIP covers the SSL channel implementation.
> 2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
> appReadBuffer per socket. Should that memory be accounted for in the memory
> pool?
> 2.2 One tricky thing with SSL is that during a KafkaChannel.read(), it's
> possible for multiple NetworkReceives to be returned since multiple
> requests' data could be encrypted together by SSL. To deal with this, we
> stash those NetworkReceives in Selector.stagedReceives and give it back to
> the poll() call one NetworkReceive at a time. What this means is that, if
> we stop reading from KafkaChannel in the middle because memory pool is
> full, this channel's key may never get selected for reads (even after the
> read interest is turned on), but there are still pending data for the
> channel, which will never get processed.
>
> 3. The code has the following two methods in MemoryPool, which are not
> described in the KIP. Could you explain how they are used in the wiki?
> isLowOnMemory()
> isOutOfMemory()
>
> 4. Could you also describe in the KIP at the high level, how the read
> interest bit for the socket is turned on/off with respect to MemoryPool?
>
> 5. Should queued.max.bytes default to -1 or Long.MAX_VALUE?
>
> Thanks,
>
> Jun
>
> On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to initiate a vote on KIP-72:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> > Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> >
> > The kip allows specifying a limit on the amount of memory allocated for
> > reading incoming requests into. This is useful for "sizing" a broker and
> > avoiding OOMEs under heavy load (as actually happens occasionally at
> > linkedin).
> >
> > I believe I've addressed most (all?) concerns brought up during the
> > discussion.
> >
> > To the best of my understanding this vote is about the goal and
> > public-facing changes related to the new proposed behavior, but as for
> > implementation, i have the code up here:
> >
> > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> > -pool-with-muting
> >
> > and I've stress-tested it to work properly (meaning it chugs along and
> > throttles under loads that would DOS 10.0.1.0 code).
> >
> > I also believe that the primitives and "pattern"s introduced in this KIP
> > (namely the notion of a buffer pool and retrieving from / releasing to
> said
> > pool instead of allocating memory) are generally useful beyond the scope
> of
> > this KIP for both performance issues (allocating lots of short-lived
> large
> > buffers is a performance bottleneck) and other areas where memory limits
> > are a problem (KIP-81)
> >
> > Thank you,
> >
> > Radai.
> >
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Jun Rao <ju...@confluent.io>.
Hi, Radai,

Thanks for the KIP. Some comments below.

1. The KIP says "to facilitate faster implementation (as a safety net) the
pool will be implemented in such a way that memory that was not release()ed
(but still garbage collected) would be detected and "reclaimed". this is to
prevent "leaks" in case of code paths that fail to release() properly.".
What are the cases that could cause memory leaks? If we are concerned about
bugs, it seems that it's better to just do more testing to make sure the
usage of the simple implementation (SimpleMemoryPool) is solid instead of
adding more complicated logic (GarbageCollectedMemoryPool) to hide the
potential bugs.

2. I am wondering how much this KIP covers the SSL channel implementation.
2.1 SslTransportLayer maintains netReadBuffer, netWriteBuffer,
appReadBuffer per socket. Should that memory be accounted for in the memory
pool?
2.2 One tricky thing with SSL is that during a KafkaChannel.read(), it's
possible for multiple NetworkReceives to be returned since multiple
requests' data could be encrypted together by SSL. To deal with this, we
stash those NetworkReceives in Selector.stagedReceives and give it back to
the poll() call one NetworkReceive at a time. What this means is that, if
we stop reading from KafkaChannel in the middle because memory pool is
full, this channel's key may never get selected for reads (even after the
read interest is turned on), but there are still pending data for the
channel, which will never get processed.

3. The code has the following two methods in MemoryPool, which are not
described in the KIP. Could you explain how they are used in the wiki?
isLowOnMemory()
isOutOfMemory()

4. Could you also describe in the KIP at the high level, how the read
interest bit for the socket is turned on/off with respect to MemoryPool?

5. Should queued.max.bytes default to -1 or Long.MAX_VALUE?

Thanks,

Jun

On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com> wrote:

> Hi,
>
> I would like to initiate a vote on KIP-72:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
> Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
>
> The kip allows specifying a limit on the amount of memory allocated for
> reading incoming requests into. This is useful for "sizing" a broker and
> avoiding OOMEs under heavy load (as actually happens occasionally at
> linkedin).
>
> I believe I've addressed most (all?) concerns brought up during the
> discussion.
>
> To the best of my understanding this vote is about the goal and
> public-facing changes related to the new proposed behavior, but as for
> implementation, i have the code up here:
>
> https://github.com/radai-rosenblatt/kafka/tree/broker-memory
> -pool-with-muting
>
> and I've stress-tested it to work properly (meaning it chugs along and
> throttles under loads that would DOS 10.0.1.0 code).
>
> I also believe that the primitives and "pattern"s introduced in this KIP
> (namely the notion of a buffer pool and retrieving from / releasing to said
> pool instead of allocating memory) are generally useful beyond the scope of
> this KIP for both performance issues (allocating lots of short-lived large
> buffers is a performance bottleneck) and other areas where memory limits
> are a problem (KIP-81)
>
> Thank you,
>
> Radai.
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Gwen Shapira <gw...@confluent.io>.
+1 (binding)

On Tue, Nov 8, 2016 at 10:26 AM, radai <ra...@gmail.com> wrote:
> I've updated the KIP page to specify the new config would co-exist with
> "queued.max.request" to minimize the impact on compatibility.
>
> On Tue, Nov 8, 2016 at 7:02 AM, radai <ra...@gmail.com> wrote:
>
>> My personal opinion on this is that control of memory was always the
>> intent behind queued.max.requests and so this KIP could completely obsolete
>> it.
>> For now its probably safest to leave it as-is (making memory-bound
>> "opt-in") and revisit this at a later date
>>
>> On Mon, Nov 7, 2016 at 2:32 PM, Gwen Shapira <gw...@confluent.io> wrote:
>>
>>> Hey Radai,
>>>
>>> Looking at the proposal, it looks like a major question is still
>>> unresolved?
>>> "This configuration parameter can either replace queued.max.requests
>>> completely, or co-exist with it (by way of either-or or respecting
>>> both bounds and not picking up new requests when either is hit)."
>>>
>>> On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I would like to initiate a vote on KIP-72:
>>> >
>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
>>> Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
>>> >
>>> > The kip allows specifying a limit on the amount of memory allocated for
>>> > reading incoming requests into. This is useful for "sizing" a broker and
>>> > avoiding OOMEs under heavy load (as actually happens occasionally at
>>> > linkedin).
>>> >
>>> > I believe I've addressed most (all?) concerns brought up during the
>>> > discussion.
>>> >
>>> > To the best of my understanding this vote is about the goal and
>>> > public-facing changes related to the new proposed behavior, but as for
>>> > implementation, i have the code up here:
>>> >
>>> > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
>>> -pool-with-muting
>>> >
>>> > and I've stress-tested it to work properly (meaning it chugs along and
>>> > throttles under loads that would DOS 10.0.1.0 code).
>>> >
>>> > I also believe that the primitives and "pattern"s introduced in this KIP
>>> > (namely the notion of a buffer pool and retrieving from / releasing to
>>> said
>>> > pool instead of allocating memory) are generally useful beyond the
>>> scope of
>>> > this KIP for both performance issues (allocating lots of short-lived
>>> large
>>> > buffers is a performance bottleneck) and other areas where memory limits
>>> > are a problem (KIP-81)
>>> >
>>> > Thank you,
>>> >
>>> > Radai.
>>>
>>>
>>>
>>> --
>>> Gwen Shapira
>>> Product Manager | Confluent
>>> 650.450.2760 | @gwenshap
>>> Follow us: Twitter | blog
>>>
>>
>>



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
I've updated the KIP page to specify the new config would co-exist with
"queued.max.request" to minimize the impact on compatibility.

On Tue, Nov 8, 2016 at 7:02 AM, radai <ra...@gmail.com> wrote:

> My personal opinion on this is that control of memory was always the
> intent behind queued.max.requests and so this KIP could completely obsolete
> it.
> For now its probably safest to leave it as-is (making memory-bound
> "opt-in") and revisit this at a later date
>
> On Mon, Nov 7, 2016 at 2:32 PM, Gwen Shapira <gw...@confluent.io> wrote:
>
>> Hey Radai,
>>
>> Looking at the proposal, it looks like a major question is still
>> unresolved?
>> "This configuration parameter can either replace queued.max.requests
>> completely, or co-exist with it (by way of either-or or respecting
>> both bounds and not picking up new requests when either is hit)."
>>
>> On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com> wrote:
>> > Hi,
>> >
>> > I would like to initiate a vote on KIP-72:
>> >
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+
>> Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
>> >
>> > The kip allows specifying a limit on the amount of memory allocated for
>> > reading incoming requests into. This is useful for "sizing" a broker and
>> > avoiding OOMEs under heavy load (as actually happens occasionally at
>> > linkedin).
>> >
>> > I believe I've addressed most (all?) concerns brought up during the
>> > discussion.
>> >
>> > To the best of my understanding this vote is about the goal and
>> > public-facing changes related to the new proposed behavior, but as for
>> > implementation, i have the code up here:
>> >
>> > https://github.com/radai-rosenblatt/kafka/tree/broker-memory
>> -pool-with-muting
>> >
>> > and I've stress-tested it to work properly (meaning it chugs along and
>> > throttles under loads that would DOS 10.0.1.0 code).
>> >
>> > I also believe that the primitives and "pattern"s introduced in this KIP
>> > (namely the notion of a buffer pool and retrieving from / releasing to
>> said
>> > pool instead of allocating memory) are generally useful beyond the
>> scope of
>> > this KIP for both performance issues (allocating lots of short-lived
>> large
>> > buffers is a performance bottleneck) and other areas where memory limits
>> > are a problem (KIP-81)
>> >
>> > Thank you,
>> >
>> > Radai.
>>
>>
>>
>> --
>> Gwen Shapira
>> Product Manager | Confluent
>> 650.450.2760 | @gwenshap
>> Follow us: Twitter | blog
>>
>
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by radai <ra...@gmail.com>.
My personal opinion on this is that control of memory was always the intent
behind queued.max.requests and so this KIP could completely obsolete it.
For now its probably safest to leave it as-is (making memory-bound
"opt-in") and revisit this at a later date

On Mon, Nov 7, 2016 at 2:32 PM, Gwen Shapira <gw...@confluent.io> wrote:

> Hey Radai,
>
> Looking at the proposal, it looks like a major question is still
> unresolved?
> "This configuration parameter can either replace queued.max.requests
> completely, or co-exist with it (by way of either-or or respecting
> both bounds and not picking up new requests when either is hit)."
>
> On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com> wrote:
> > Hi,
> >
> > I would like to initiate a vote on KIP-72:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 72%3A+Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
> >
> > The kip allows specifying a limit on the amount of memory allocated for
> > reading incoming requests into. This is useful for "sizing" a broker and
> > avoiding OOMEs under heavy load (as actually happens occasionally at
> > linkedin).
> >
> > I believe I've addressed most (all?) concerns brought up during the
> > discussion.
> >
> > To the best of my understanding this vote is about the goal and
> > public-facing changes related to the new proposed behavior, but as for
> > implementation, i have the code up here:
> >
> > https://github.com/radai-rosenblatt/kafka/tree/broker-
> memory-pool-with-muting
> >
> > and I've stress-tested it to work properly (meaning it chugs along and
> > throttles under loads that would DOS 10.0.1.0 code).
> >
> > I also believe that the primitives and "pattern"s introduced in this KIP
> > (namely the notion of a buffer pool and retrieving from / releasing to
> said
> > pool instead of allocating memory) are generally useful beyond the scope
> of
> > this KIP for both performance issues (allocating lots of short-lived
> large
> > buffers is a performance bottleneck) and other areas where memory limits
> > are a problem (KIP-81)
> >
> > Thank you,
> >
> > Radai.
>
>
>
> --
> Gwen Shapira
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>

Re: [VOTE] KIP-72 - Allow putting a bound on memory consumed by Incoming requests

Posted by Gwen Shapira <gw...@confluent.io>.
Hey Radai,

Looking at the proposal, it looks like a major question is still unresolved?
"This configuration parameter can either replace queued.max.requests
completely, or co-exist with it (by way of either-or or respecting
both bounds and not picking up new requests when either is hit)."

On Mon, Nov 7, 2016 at 1:08 PM, radai <ra...@gmail.com> wrote:
> Hi,
>
> I would like to initiate a vote on KIP-72:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-72%3A+Allow+putting+a+bound+on+memory+consumed+by+Incoming+requests
>
> The kip allows specifying a limit on the amount of memory allocated for
> reading incoming requests into. This is useful for "sizing" a broker and
> avoiding OOMEs under heavy load (as actually happens occasionally at
> linkedin).
>
> I believe I've addressed most (all?) concerns brought up during the
> discussion.
>
> To the best of my understanding this vote is about the goal and
> public-facing changes related to the new proposed behavior, but as for
> implementation, i have the code up here:
>
> https://github.com/radai-rosenblatt/kafka/tree/broker-memory-pool-with-muting
>
> and I've stress-tested it to work properly (meaning it chugs along and
> throttles under loads that would DOS 10.0.1.0 code).
>
> I also believe that the primitives and "pattern"s introduced in this KIP
> (namely the notion of a buffer pool and retrieving from / releasing to said
> pool instead of allocating memory) are generally useful beyond the scope of
> this KIP for both performance issues (allocating lots of short-lived large
> buffers is a performance bottleneck) and other areas where memory limits
> are a problem (KIP-81)
>
> Thank you,
>
> Radai.



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog