Posted to dev@hc.apache.org by Carter Kozak <ck...@ckozak.net> on 2020/07/30 17:28:00 UTC

Limiting the blast radius of connection leaks

Hello Friends,

I recently had a service crash due to an out of memory error as the result of a response leak. While this is caused by code that uses closeable resources incorrectly, and that code should be fixed, I'd like to discuss some ideas which could avoid catastrophic failure in this scenario.

A little background on my environment: I maintain an RPC library which wraps an hc5 client. There's no limit set on the hc5 connection pool (Integer.MAX_VALUE), and resource use is bounded using a variation of additive-increase/multiplicative-decrease limiters to scale based on load. Some endpoints are capable of returning a raw binary stream, which introduces some risk of resource leaks -- ideally we would take an InputStream consumer to fully own the lifecycle, but unfortunately that's not an API change we can make at this point. We return permits before passing the resulting InputStream to custom code to prevent leaks from blocking the entire service.
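
(Editorial note: a minimal sketch of the kind of additive-increase/multiplicative-decrease limiter described above. The class name, starting limit, and 0.9 backoff factor are assumptions for illustration, not the actual RPC library's code.)

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative AIMD concurrency limiter: the permit limit grows by one after a
// successful call and shrinks multiplicatively when the backend signals overload.
final class AimdLimiter {
    private volatile int limit = 20;                       // assumed starting limit
    private final AtomicInteger inFlight = new AtomicInteger();

    // Returns true if a request may proceed under the current limit.
    boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= limit) {
                return false;                              // shed load instead of queueing
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Release the permit and adjust the limit based on the outcome.
    // (Updates to limit are racy here; acceptable for a sketch.)
    void release(boolean overloaded) {
        inFlight.decrementAndGet();
        if (overloaded) {
            limit = Math.max(1, (int) (limit * 0.9));      // multiplicative decrease
        } else {
            limit = limit + 1;                             // additive increase
        }
    }
}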

Unfortunately leased connections are anchored directly to both StrictConnPool and LaxConnPool in order to validate that connections are returned to the pool where they originated, which prevents a relatively large swath of resources from being garbage collected in the event of a leak. I realize my case is a bit different due to the effectively unlimited pool size, where memory becomes the limiting factor rather than requests blocking against the pool limit, but I believe there are a few options that can help us to both detect and endure leaks.

A couple potential solutions:

Use weak references to track leased connections with a periodic sweep to detect leaks and reclaim per-route permits. This approach would be complex, but allows the default configuration to detect and handle rare response leaks without eventually blocking connection requests. Using this approach, a leaked connection wouldn't necessarily be closed when the pool is shut down, but may survive until a full GC. In most cases I expect clients to be long-lived, and I'm not sure what guarantees are made about leaked connections.
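
(Editorial note: a rough sketch of the weak-reference idea above. The generic sweeper type and RouteLimiter callback are placeholders, not actual hc5 classes.)

import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative leak sweep: leased entries are held only weakly, so a leaked
// entry eventually becomes unreachable, its WeakReference clears, and the
// per-route permit can be reclaimed on the next sweep.
final class LeasedEntrySweeper<R, E> {
    private final Map<WeakReference<E>, R> leased = new ConcurrentHashMap<>();

    void onLease(E entry, R route) {
        leased.put(new WeakReference<>(entry), route);
    }

    void onRelease(E entry) {
        leased.entrySet().removeIf(e -> e.getKey().get() == entry);
    }

    // Called periodically; returns permits for entries that were GC'ed while leased.
    void sweep(RouteLimiter<R> limiter) {
        Iterator<Map.Entry<WeakReference<E>, R>> it = leased.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<WeakReference<E>, R> e = it.next();
            if (e.getKey().get() == null) {      // entry leaked and collected
                limiter.reclaimPermit(e.getValue());
                it.remove();
            }
        }
    }

    interface RouteLimiter<R> {
        void reclaimPermit(R route);
    }
}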

Use a counter to track leased connections instead of tracking leased entries directly. This loses the safety guarantees that connections aren't already leased and that they're being returned to the pool where they originated; however, it eliminates the need for the connection pool's hash table and references to checked-out values. This solution would not be helpful when connections are leaked and the route limit is reached, so it may make more sense as a PoolConcurrencyPolicy.UNLIMITED option. This design would not be able to close leased connections when the pool is closed; they would remain open until they are released, at which point the closed pool could close the incoming connection.
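
(Editorial note: a rough sketch of the counter-based idea above; the types are placeholders rather than hc5 classes.)

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative counter-based accounting: the pool never references leased
// entries, so a leaked entry (and everything it anchors) stays collectible.
final class LeaseCounter<R> {
    private final Map<R, AtomicInteger> leasedPerRoute = new ConcurrentHashMap<>();

    int onLease(R route) {
        return leasedPerRoute.computeIfAbsent(route, r -> new AtomicInteger()).incrementAndGet();
    }

    void onRelease(R route) {
        AtomicInteger counter = leasedPerRoute.get(route);
        if (counter != null) {
            // No way to verify the entry was actually leased from this pool:
            // this is the safety guarantee the counter approach gives up.
            counter.decrementAndGet();
        }
    }

    int leased(R route) {
        AtomicInteger counter = leasedPerRoute.get(route);
        return counter == null ? 0 : counter.get();
    }
}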

Have you encountered similar problems? If you would consider a contribution with one of these designs or another I have not yet considered, I will open a ticket and plan to implement a solution. Any and all feedback is greatly appreciated.

Thanks,
Carter Kozak


Re: Limiting the blast radius of connection leaks

Posted by Carter Kozak <ck...@ckozak.net>.
Hi Julian,

Thank you for the description and examples from Sling. This looks a
lot like what I was thinking, and is very similar to the Cleaner utility provided
by newer Java versions. I had been planning to attempt to apply this
kind of protection at the pool level rather than only the client APIs
that aren't internally protected, but your approach would reduce
unnecessary overhead in cases where we're confident leaks cannot
occur.
As leak detection moves farther from the pool itself, it becomes less
clear whether the feature belongs in httpcomponents or should
be implemented by consumers.
I've been a little busy lately and haven't had an opportunity to sketch
out the design. I'll probably write a few variations to see which works
most naturally.
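
(Editorial note: for reference, a minimal sketch of how the java.lang.ref.Cleaner utility mentioned above (Java 9+) can guard a response-like resource. The ManagedResponse wrapper is hypothetical, not part of hc5 or the RPC library.)

import java.lang.ref.Cleaner;

// If the wrapper is GC'ed without close() being called, the registered action
// still runs and the underlying connection gets released.
final class ManagedResponse implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    private final Cleaner.Cleanable cleanable;

    ManagedResponse(AutoCloseable underlyingConnection) {
        // The cleanup action must only capture the underlying state, never the
        // wrapper itself, otherwise the wrapper would never become unreachable.
        this.cleanable = CLEANER.register(this, () -> {
            try {
                underlyingConnection.close();
            } catch (Exception e) {
                // best-effort cleanup of a leaked response
            }
        });
    }

    @Override
    public void close() {
        cleanable.clean();   // runs the action at most once
    }
}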

Carter

On Tue, Aug 4, 2020, at 04:42, Julian Sedding wrote:
> Hi
> 
> I don't know if Java's "reachability" of responses would be a good
> indicator for when a connection has erroneously not been returned to a
> pool. If that is the case (and the stated memory leak makes me think
> it might be), then the below information may be helpful.
> 
> I'll share some (positive) experiences from the Apache Sling project
> with an approach that may work here as well. Sling is a Web-Framework
> built on top of JCR (Java Content Repository, think of it as a
> hierarchical database). For each request, a ResourceResolver is
> created based on user credentials (often anonymous). The
> ResourceResolver (RR) needs to be closed at the end of the request. So
> far the framework can manage the lifecycle of the RR - no problem.
> However, it is also possible to obtain a RR in components that
> implement some backend logic, e.g. a scheduled task performing some
> maintenance or reporting. In these cases the responsibility for
> closing the RR is with the programmer implementing that piece of
> backend logic. Of course that is often forgotten and leads to a memory
> leak (leaked RR instances).
> 
> Solving this issue with a finalize method would work, but finalize is
> known to require synchronization and to be rather slow. Instead we
> implemented a mechanism to close unclosed RR instances using Java's
> java.lang.ref.ReferenceQueue and WeakReference, see
> https://github.com/apache/sling-org-apache-sling-resourceresolver/blob/30f7d91c6c99dc674fe361060aaa90f6e0252a6a/src/main/java/org/apache/sling/resourceresolver/impl/CommonResourceResolverFactoryImpl.java#L525
> 
> The trick is to encapsulate all required state in one object and then
> create another object that references the state, but has no (relevant)
> state of its own, a bit like a facade. When the facade gets GC'ed you
> get a WeakReference referencing the state object enqueued in the
> ReferenceQueue and all resources can be closed. In Sling we also give
> the possibility to log the "opening stacktrace" when an RR is closed
> by this fallback mechanism. This can help identify places in the
> source code where RRs aren't correctly closed.
> 
> Hope this may help.
> 
> Regards
> Julian
> 
> On Mon, Aug 3, 2020 at 7:39 PM Gary Gregory <ga...@gmail.com> wrote:
> >
> > Hi Carter,
> >
> > I don't have a super strong feeling for this either. I just would like us
> > to try and reuse a library before reinventing the wheel, but my time is
> > limited, like everyone else.
> >
> > Gary
> >
> > On Mon, Aug 3, 2020, 13:19 Carter Kozak <ck...@ckozak.net> wrote:
> >
> > > Hi Gary,
> > >
> > > That sounds reasonable, however there are some subtle distinctions
> > > between general purpose object pools and those used for http
> > > connections, perhaps the most obvious of which is http/2 multiplexed
> > > connections. It's certainly possible that it could work, but I imagine it
> > > could become a large project and may result in something very
> > > specific to http clients. Pools used by blocking and asynchronous
> > > clients may also differ substantially, so I'm not sure how general this
> > > implementation could be.
> > >
> > > I'd be happy to help in any way I can if folks more familiar than myself
> > > with the current hc pool implementations are supportive of your proposal,
> > > but I'm worried that it may become a blocker that prevents incremental
> > > improvements.
> > >
> > > In conclusion, I don't feel comfortable taking a stance on this and I trust
> > > the team to make the right decision.
> > >
> > > -ck
> > >
> > > On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> > > > What I would prefer we do is not reinvent yet another pooling feature,
> > > all
> > > > its potential bugs and added maintenance load when that wheel's been
> > > > invented over and over. For me personally, I would prefer that this
> > > project
> > > > adopt a FOSS pooling library and improve _it_ if that were needed,
> > > thereby
> > > > improving the whole ecosystem.
> > > >
> > > > Gary
> > > >
> > > > On Fri, Jul 31, 2020, 16:22 Carter Kozak <ck...@ckozak.net> wrote:
> > > >
> > > > > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > > > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <olegk@apache.org
> > > >
> > > > > wrote:
> > > > > >
> > > > > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > > > > Hello Friends,
> > > > > > > >
> > > > > > > > I recently had a service crash due to an out of memory error as
> > > the
> > > > > > > > result of a response leak. While this is caused by code that uses
> > > > > > > > closeable resources incorrectly, and that code should be fixed,
> > > I'd
> > > > > > > > like to discuss some ideas which could avoid catastrophic
> > > failure in
> > > > > > > > this scenario.
> > > > > > > >
> > > > > > > > A little background on my environment: I maintain an RPC library
> > > > > > > > which wraps an hc5 client. There's no limit set on the hc5
> > > connection
> > > > > > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > > > > > variation of additive-increase/multiplicative-decrease limiters
> > > to
> > > > > > > > scale based on load. Some endpoints are capable of returning a
> > > raw
> > > > > > > > binary stream, this introduces some risk of resource leaks --
> > > ideally
> > > > > > > > we would take an inputstream consumer to fully own the
> > > lifecycle, but
> > > > > > > > unfortunately that's not an API change we can make at this
> > > point. We
> > > > > > > > return permits before passing the resulting inputstream to custom
> > > > > > > > code to prevent leaks from blocking the entire service.
> > > > > > > >
> > > > > > > > Unfortunately leased connections are anchored directly to both
> > > > > > > > StrictConnPool and LaxConnPool in order to validate that
> > > connections
> > > > > > > > are returned to the pool where they originated, which prevents a
> > > > > > > > relatively large swath of resources from being garbage collected
> > > in
> > > > > > > > the event of a leak. I realize my case is a bit different due to
> > > the
> > > > > > > > effectively unlimited pool size, where memory becomes the
> > > limiting
> > > > > > > > factor rather than requests blocking against the pool limit, but
> > > I
> > > > > > > > believe there are a few options that can help us to both detect
> > > and
> > > > > > > > endure leaks.
> > > > > > > >
> > > > > > > > A couple potential solutions:
> > > > > > > >
> > > > > > > > Use weak references to track leased connections with a periodic
> > > sweep
> > > > > > > > to detect leaks and reclaim per-route permits. This approach
> > > would be
> > > > > > > > complex, but allows the default configuration to detect and
> > > handle
> > > > > > > > rare response leaks without eventually blocking connection
> > > requests.
> > > > > > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > > > > > closed when the pool is shut down, but may survive until a full
> > > GC.
> > > > > > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > > > > > what guarantees are made about leaked connections.
> > > > > > > >
> > > > > > > > Use a counter to track leased connections instead of tracking
> > > leased
> > > > > > > > entries directly. This loses the safety guarantees that
> > > connections
> > > > > > > > aren't already leased, and that they're being returned to the
> > > pool
> > > > > > > > where they originated, however it eliminates the need for the
> > > > > > > > connection pools hash table and references to checked-out values.
> > > > > > > > This solution would not be helpful when connections are leaked
> > > and
> > > > > > > > the route limit is reached, so it may make more sense as a
> > > > > > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> > > able
> > > > > > > > to close leased connections when the pool is closed, they would
> > > > > > > > remain open until they are released, when the closed pool could
> > > close
> > > > > > > > the incoming connection.
> > > > > > > >
> > > > > > > > Have you encountered similar problems? If you would consider a
> > > > > > > > contribution with one of these designs or another I have not yet
> > > > > > > > considered, I will open a ticket and plan to implement a
> > > solution.
> > > > > > > > Any and all feedback is greatly appreciated.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Carter Kozak
> > > > > > > >
> > > > > > >
> > > > > > > Hi Carter
> > > > > > >
> > > > > > > Automatic recovery of leaked connections has been actively
> > > discussed in
> > > > > > > HC 3.x and early HC 4.x days without ever producing any practical
> > > > > > > outcome. It is _massively_ easier to just release connections then
> > > jump
> > > > > > > through many hoops to try and reclaim what might or might not be
> > > some
> > > > > > > leaked connections.
> > > > > > >
> > > > > > > You are very welcome to give it a shot but I would kindly ask you
> > > to
> > > > > > > start it off as a completely new connection manager initially. Once
> > > > > > > there is a working solution we can decide how to fold it into the
> > > > > > > project and in what form.
> > > > > > >
> > > > > >
> > > > > > This might be made easier by using Apache Commons Pool which can
> > > track
> > > > > > borrowed but abandoned objects:
> > > > > >
> > > > > > The pool can also be configured to detect and remove "abandoned"
> > > objects,
> > > > > > i.e. objects that have been checked out of the pool but neither used
> > > nor
> > > > > > returned before the configured removeAbandonedTimeout
> > > > > > <
> > > > >
> > > https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout--
> > > > > >.
> > > > > > Abandoned object removal can be configured to happen when
> > > borrowObject is
> > > > > > invoked and the pool is close to starvation, or it can be executed
> > > by the
> > > > > > idle object evictor, or both. If pooled objects implement the
> > > TrackedUse
> > > > > > <
> > > > >
> > > https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html
> > > > > >
> > > > > > interface,
> > > > > > their last use will be queried using the getLastUsed method on that
> > > > > > interface; otherwise abandonment is determined by how long an object
> > > has
> > > > > > been checked out from the pool.
> > > > > >
> > > > > > See
> > > > > >
> > > > >
> > > https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > > > > >
> > > > > > Gary
> > > > > >
> > > > > > Cheers
> > > > > > >
> > > > > > > Oleg
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > > > > > > For additional commands, e-mail: dev-help@hc.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > Thank you both,
> > > > >
> > > > > Oleg, I agree that it's much easier to close resources than it is to
> > > write
> > > > > a pool that is capable of recovery, and we shouldn't take on undue
> > > > > complexity to force a solution to work. That said, the surface area of
> > > code
> > > > > that has the potential to leak (code that uses HC) is orders of
> > > magnitude
> > > > > larger than the connection pool, and a fix could reduce JIRA load on
> > > the
> > > > > project. I will plan to implement an initial experimental version as an
> > > > > alternative to PoolingHttpClientConnectionManager so we can avoid
> > > impacting
> > > > > users, and I can validate it in a high-load production environment
> > > before
> > > > > it considered for a release.
> > > > >
> > > > > Gary, thanks for pointing out the abandonment feature in commons-pool!
> > > I
> > > > > worry about detection based on a timeout because it's difficult to be
> > > > > certain an object is abandoned. While multi-hour requests waiting for
> > > > > response headers aren't a good idea, I've run into a few places that
> > > depend
> > > > > on them and wouldn't want to break users.
> > > > >
> > > > > Carter
> > > > >
> > > >
> > >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> For additional commands, e-mail: dev-help@hc.apache.org
> 
> 

Re: Limiting the blast radius of connection leaks

Posted by Julian Sedding <js...@gmail.com>.
Hi

I don't know if Java's "reachability" of responses would be a good
indicator for when a connection has erroneously not been returned to a
pool. If that is the case (and the stated memory leak makes me think
it might be), then the below information may be helpful.

I'll share some (positive) experiences from the Apache Sling project
with an approach that may work here as well. Sling is a web framework
built on top of JCR (Java Content Repository, think of it as a
hierarchical database). For each request, a ResourceResolver is
created based on user credentials (often anonymous). The
ResourceResolver (RR) needs to be closed at the end of the request. So
far the framework can manage the lifecycle of the RR - no problem.
However, it is also possible to obtain an RR in components that
implement some backend logic, e.g. a scheduled task performing some
maintenance or reporting. In these cases the responsibility for
closing the RR is with the programmer implementing that piece of
backend logic. Of course that is often forgotten and leads to a memory
leak (leaked RR instances).

Solving this issue with a finalize method would work, but finalize is
known to require synchronization and to be rather slow. Instead we
implemented a mechanism to close unclosed RR instances using Java's
java.lang.ref.ReferenceQueue and WeakReference, see
https://github.com/apache/sling-org-apache-sling-resourceresolver/blob/30f7d91c6c99dc674fe361060aaa90f6e0252a6a/src/main/java/org/apache/sling/resourceresolver/impl/CommonResourceResolverFactoryImpl.java#L525

The trick is to encapsulate all required state in one object and then
create another object that references the state, but has no (relevant)
state of its own, a bit like a facade. When the facade gets GC'ed you
get a WeakReference referencing the state object enqueued in the
ReferenceQueue and all resources can be closed. In Sling we also provide
the option to log the "opening stacktrace" when an RR is closed
by this fallback mechanism. This can help identify places in the
source code where RRs aren't correctly closed.
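
(Editorial note: a condensed sketch of the mechanism described above, not the actual Sling code; all names here are illustrative.)

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// All state lives in State, the Facade only points at it, and a periodic drain
// of the queue closes state whose facade was GC'ed without close() being called.
final class LeakAwareFactory {
    private final ReferenceQueue<Facade> queue = new ReferenceQueue<>();
    private final Map<WeakReference<Facade>, State> tracked = new ConcurrentHashMap<>();

    Facade open() {
        State state = new State(new Exception("opening stacktrace"));  // kept for leak reports
        Facade facade = new Facade(state);
        tracked.put(new WeakReference<>(facade, queue), state);
        return facade;
    }

    // Called periodically, or from a dedicated thread blocking on queue.remove().
    void drain() {
        Reference<? extends Facade> ref;
        while ((ref = queue.poll()) != null) {
            State state = tracked.remove(ref);
            if (state != null && !state.closed) {
                // The facade was collected without being closed: log state.openedAt
                // and release the underlying resources.
                state.close();
            }
        }
    }

    static final class Facade implements AutoCloseable {
        private final State state;
        Facade(State state) { this.state = state; }
        @Override public void close() { state.close(); }
    }

    static final class State {
        final Exception openedAt;
        volatile boolean closed;
        State(Exception openedAt) { this.openedAt = openedAt; }
        void close() { closed = true; /* close the real resources here */ }
    }
}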

Hope this may help.

Regards
Julian

On Mon, Aug 3, 2020 at 7:39 PM Gary Gregory <ga...@gmail.com> wrote:
>
> Hi Carter,
>
> I don't have a super strong feeling for this either. I just would like us
> to try and reuse a library before reinventing the wheel, but my time is
> limited, like everyone else.
>
> Gary
>
> On Mon, Aug 3, 2020, 13:19 Carter Kozak <ck...@ckozak.net> wrote:
>
> > Hi Gary,
> >
> > That sounds reasonable, however there are some subtle distinctions
> > between general purpose object pools and those used for http
> > connections, perhaps the most obvious of which is http/2 multiplexed
> > connections. It's certainly possible that it could work, but I imagine it
> > could become a large project and may result in something very
> > specific to http clients. Pools used by blocking and asynchronous
> > clients may also differ substantially, so I'm not sure how general this
> > implementation could be.
> >
> > I'd be happy to help in any way I can if folks more familiar than myself
> > with the current hc pool implementations are supportive of your proposal,
> > but I'm worried that it may become a blocker that prevents incremental
> > improvements.
> >
> > In conclusion, I don't feel comfortable taking a stance on this and I trust
> > the team to make the right decision.
> >
> > -ck
> >
> > On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> > > What I would prefer we do is not reinvent yet another pooling feature,
> > all
> > > its potential bugs and added maintenance load when that wheel's been
> > > invented over and over. For me personally, I would prefer that this
> > project
> > > adopt a FOSS pooling library and improve _it_ if that were needed,
> > thereby
> > > improving the whole ecosystem.
> > >
> > > Gary
> > >
> > > On Fri, Jul 31, 2020, 16:22 Carter Kozak <ck...@ckozak.net> wrote:
> > >
> > > > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <olegk@apache.org
> > >
> > > > wrote:
> > > > >
> > > > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > > > Hello Friends,
> > > > > > >
> > > > > > > I recently had a service crash due to an out of memory error as
> > the
> > > > > > > result of a response leak. While this is caused by code that uses
> > > > > > > closeable resources incorrectly, and that code should be fixed,
> > I'd
> > > > > > > like to discuss some ideas which could avoid catastrophic
> > failure in
> > > > > > > this scenario.
> > > > > > >
> > > > > > > A little background on my environment: I maintain an RPC library
> > > > > > > which wraps an hc5 client. There's no limit set on the hc5
> > connection
> > > > > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > > > > variation of additive-increase/multiplicative-decrease limiters
> > to
> > > > > > > scale based on load. Some endpoints are capable of returning a
> > raw
> > > > > > > binary stream, this introduces some risk of resource leaks --
> > ideally
> > > > > > > we would take an inputstream consumer to fully own the
> > lifecycle, but
> > > > > > > unfortunately that's not an API change we can make at this
> > point. We
> > > > > > > return permits before passing the resulting inputstream to custom
> > > > > > > code to prevent leaks from blocking the entire service.
> > > > > > >
> > > > > > > Unfortunately leased connections are anchored directly to both
> > > > > > > StrictConnPool and LaxConnPool in order to validate that
> > connections
> > > > > > > are returned to the pool where they originated, which prevents a
> > > > > > > relatively large swath of resources from being garbage collected
> > in
> > > > > > > the event of a leak. I realize my case is a bit different due to
> > the
> > > > > > > effectively unlimited pool size, where memory becomes the
> > limiting
> > > > > > > factor rather than requests blocking against the pool limit, but
> > I
> > > > > > > believe there are a few options that can help us to both detect
> > and
> > > > > > > endure leaks.
> > > > > > >
> > > > > > > A couple potential solutions:
> > > > > > >
> > > > > > > Use weak references to track leased connections with a periodic
> > sweep
> > > > > > > to detect leaks and reclaim per-route permits. This approach
> > would be
> > > > > > > complex, but allows the default configuration to detect and
> > handle
> > > > > > > rare response leaks without eventually blocking connection
> > requests.
> > > > > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > > > > closed when the pool is shut down, but may survive until a full
> > GC.
> > > > > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > > > > what guarantees are made about leaked connections.
> > > > > > >
> > > > > > > Use a counter to track leased connections instead of tracking
> > leased
> > > > > > > entries directly. This loses the safety guarantees that
> > connections
> > > > > > > aren't already leased, and that they're being returned to the
> > pool
> > > > > > > where they originated, however it eliminates the need for the
> > > > > > > connection pools hash table and references to checked-out values.
> > > > > > > This solution would not be helpful when connections are leaked
> > and
> > > > > > > the route limit is reached, so it may make more sense as a
> > > > > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> > able
> > > > > > > to close leased connections when the pool is closed, they would
> > > > > > > remain open until they are released, when the closed pool could
> > close
> > > > > > > the incoming connection.
> > > > > > >
> > > > > > > Have you encountered similar problems? If you would consider a
> > > > > > > contribution with one of these designs or another I have not yet
> > > > > > > considered, I will open a ticket and plan to implement a
> > solution.
> > > > > > > Any and all feedback is greatly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Carter Kozak
> > > > > > >
> > > > > >
> > > > > > Hi Carter
> > > > > >
> > > > > > Automatic recovery of leaked connections has been actively
> > discussed in
> > > > > > HC 3.x and early HC 4.x days without ever producing any practical
> > > > > > outcome. It is _massively_ easier to just release connections then
> > jump
> > > > > > through many hoops to try and reclaim what might or might not be
> > some
> > > > > > leaked connections.
> > > > > >
> > > > > > You are very welcome to give it a shot but I would kindly ask you
> > to
> > > > > > start it off as a completely new connection manager initially. Once
> > > > > > there is a working solution we can decide how to fold it into the
> > > > > > project and in what form.
> > > > > >
> > > > >
> > > > > This might be made easier by using Apache Commons Pool which can
> > track
> > > > > borrowed but abandoned objects:
> > > > >
> > > > > The pool can also be configured to detect and remove "abandoned"
> > objects,
> > > > > i.e. objects that have been checked out of the pool but neither used
> > nor
> > > > > returned before the configured removeAbandonedTimeout
> > > > > <
> > > >
> > https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout--
> > > > >.
> > > > > Abandoned object removal can be configured to happen when
> > borrowObject is
> > > > > invoked and the pool is close to starvation, or it can be executed
> > by the
> > > > > idle object evictor, or both. If pooled objects implement the
> > TrackedUse
> > > > > <
> > > >
> > https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html
> > > > >
> > > > > interface,
> > > > > their last use will be queried using the getLastUsed method on that
> > > > > interface; otherwise abandonment is determined by how long an object
> > has
> > > > > been checked out from the pool.
> > > > >
> > > > > See
> > > > >
> > > >
> > https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > > > >
> > > > > Gary
> > > > >
> > > > > Cheers
> > > > > >
> > > > > > Oleg
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > > > > > For additional commands, e-mail: dev-help@hc.apache.org
> > > > > >
> > > > > >
> > > > >
> > > >
> > > > Thank you both,
> > > >
> > > > Oleg, I agree that it's much easier to close resources than it is to
> > write
> > > > a pool that is capable of recovery, and we shouldn't take on undue
> > > > complexity to force a solution to work. That said, the surface area of
> > code
> > > > that has the potential to leak (code that uses HC) is orders of
> > magnitude
> > > > larger than the connection pool, and a fix could reduce JIRA load on
> > the
> > > > project. I will plan to implement an initial experimental version as an
> > > > alternative to PoolingHttpClientConnectionManager so we can avoid
> > impacting
> > > > users, and I can validate it in a high-load production environment
> > before
> > > > it considered for a release.
> > > >
> > > > Gary, thanks for pointing out the abandonment feature in commons-pool!
> > I
> > > > worry about detection based on a timeout because it's difficult to be
> > > > certain an object is abandoned. While multi-hour requests waiting for
> > > > response headers aren't a good idea, I've run into a few places that
> > depend
> > > > on them and wouldn't want to break users.
> > > >
> > > > Carter
> > > >
> > >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
For additional commands, e-mail: dev-help@hc.apache.org


Re: Limiting the blast radius of connection leaks

Posted by Gary Gregory <ga...@gmail.com>.
Hi Carter,

I don't have a super strong feeling for this either. I just would like us
to try and reuse a library before reinventing the wheel, but my time is
limited, like everyone else.

Gary

On Mon, Aug 3, 2020, 13:19 Carter Kozak <ck...@ckozak.net> wrote:

> Hi Gary,
>
> That sounds reasonable, however there are some subtle distinctions
> between general purpose object pools and those used for http
> connections, perhaps the most obvious of which is http/2 multiplexed
> connections. It's certainly possible that it could work, but I imagine it
> could become a large project and may result in something very
> specific to http clients. Pools used by blocking and asynchronous
> clients may also differ substantially, so I'm not sure how general this
> implementation could be.
>
> I'd be happy to help in any way I can if folks more familiar than myself
> with the current hc pool implementations are supportive of your proposal,
> but I'm worried that it may become a blocker that prevents incremental
> improvements.
>
> In conclusion, I don't feel comfortable taking a stance on this and I trust
> the team to make the right decision.
>
> -ck
>
> On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> > What I would prefer we do is not reinvent yet another pooling feature,
> all
> > its potential bugs and added maintenance load when that wheel's been
> > invented over and over. For me personally, I would prefer that this
> project
> > adopt a FOSS pooling library and improve _it_ if that were needed,
> thereby
> > improving the whole ecosystem.
> >
> > Gary
> >
> > On Fri, Jul 31, 2020, 16:22 Carter Kozak <ck...@ckozak.net> wrote:
> >
> > > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <olegk@apache.org
> >
> > > wrote:
> > > >
> > > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > > Hello Friends,
> > > > > >
> > > > > > I recently had a service crash due to an out of memory error as
> the
> > > > > > result of a response leak. While this is caused by code that uses
> > > > > > closeable resources incorrectly, and that code should be fixed,
> I'd
> > > > > > like to discuss some ideas which could avoid catastrophic
> failure in
> > > > > > this scenario.
> > > > > >
> > > > > > A little background on my environment: I maintain an RPC library
> > > > > > which wraps an hc5 client. There's no limit set on the hc5
> connection
> > > > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > > > variation of additive-increase/multiplicative-decrease limiters
> to
> > > > > > scale based on load. Some endpoints are capable of returning a
> raw
> > > > > > binary stream, this introduces some risk of resource leaks --
> ideally
> > > > > > we would take an inputstream consumer to fully own the
> lifecycle, but
> > > > > > unfortunately that's not an API change we can make at this
> point. We
> > > > > > return permits before passing the resulting inputstream to custom
> > > > > > code to prevent leaks from blocking the entire service.
> > > > > >
> > > > > > Unfortunately leased connections are anchored directly to both
> > > > > > StrictConnPool and LaxConnPool in order to validate that
> connections
> > > > > > are returned to the pool where they originated, which prevents a
> > > > > > relatively large swath of resources from being garbage collected
> in
> > > > > > the event of a leak. I realize my case is a bit different due to
> the
> > > > > > effectively unlimited pool size, where memory becomes the
> limiting
> > > > > > factor rather than requests blocking against the pool limit, but
> I
> > > > > > believe there are a few options that can help us to both detect
> and
> > > > > > endure leaks.
> > > > > >
> > > > > > A couple potential solutions:
> > > > > >
> > > > > > Use weak references to track leased connections with a periodic
> sweep
> > > > > > to detect leaks and reclaim per-route permits. This approach
> would be
> > > > > > complex, but allows the default configuration to detect and
> handle
> > > > > > rare response leaks without eventually blocking connection
> requests.
> > > > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > > > closed when the pool is shut down, but may survive until a full
> GC.
> > > > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > > > what guarantees are made about leaked connections.
> > > > > >
> > > > > > Use a counter to track leased connections instead of tracking
> leased
> > > > > > entries directly. This loses the safety guarantees that
> connections
> > > > > > aren't already leased, and that they're being returned to the
> pool
> > > > > > where they originated, however it eliminates the need for the
> > > > > > connection pools hash table and references to checked-out values.
> > > > > > This solution would not be helpful when connections are leaked
> and
> > > > > > the route limit is reached, so it may make more sense as a
> > > > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be
> able
> > > > > > to close leased connections when the pool is closed, they would
> > > > > > remain open until they are released, when the closed pool could
> close
> > > > > > the incoming connection.
> > > > > >
> > > > > > Have you encountered similar problems? If you would consider a
> > > > > > contribution with one of these designs or another I have not yet
> > > > > > considered, I will open a ticket and plan to implement a
> solution.
> > > > > > Any and all feedback is greatly appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Carter Kozak
> > > > > >
> > > > >
> > > > > Hi Carter
> > > > >
> > > > > Automatic recovery of leaked connections has been actively
> discussed in
> > > > > HC 3.x and early HC 4.x days without ever producing any practical
> > > > > outcome. It is _massively_ easier to just release connections then
> jump
> > > > > through many hoops to try and reclaim what might or might not be
> some
> > > > > leaked connections.
> > > > >
> > > > > You are very welcome to give it a shot but I would kindly ask you
> to
> > > > > start it off as a completely new connection manager initially. Once
> > > > > there is a working solution we can decide how to fold it into the
> > > > > project and in what form.
> > > > >
> > > >
> > > > This might be made easier by using Apache Commons Pool which can
> track
> > > > borrowed but abandoned objects:
> > > >
> > > > The pool can also be configured to detect and remove "abandoned"
> objects,
> > > > i.e. objects that have been checked out of the pool but neither used
> nor
> > > > returned before the configured removeAbandonedTimeout
> > > > <
> > >
> https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout--
> > > >.
> > > > Abandoned object removal can be configured to happen when
> borrowObject is
> > > > invoked and the pool is close to starvation, or it can be executed
> by the
> > > > idle object evictor, or both. If pooled objects implement the
> TrackedUse
> > > > <
> > >
> https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html
> > > >
> > > > interface,
> > > > their last use will be queried using the getLastUsed method on that
> > > > interface; otherwise abandonment is determined by how long an object
> has
> > > > been checked out from the pool.
> > > >
> > > > See
> > > >
> > >
> https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > > >
> > > > Gary
> > > >
> > > > Cheers
> > > > >
> > > > > Oleg
> > > > >
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > > > > For additional commands, e-mail: dev-help@hc.apache.org
> > > > >
> > > > >
> > > >
> > >
> > > Thank you both,
> > >
> > > Oleg, I agree that it's much easier to close resources than it is to
> write
> > > a pool that is capable of recovery, and we shouldn't take on undue
> > > complexity to force a solution to work. That said, the surface area of
> code
> > > that has the potential to leak (code that uses HC) is orders of
> magnitude
> > > larger than the connection pool, and a fix could reduce JIRA load on
> the
> > > project. I will plan to implement an initial experimental version as an
> > > alternative to PoolingHttpClientConnectionManager so we can avoid
> impacting
> > > users, and I can validate it in a high-load production environment
> before
> > > it considered for a release.
> > >
> > > Gary, thanks for pointing out the abandonment feature in commons-pool!
> I
> > > worry about detection based on a timeout because it's difficult to be
> > > certain an object is abandoned. While multi-hour requests waiting for
> > > response headers aren't a good idea, I've run into a few places that
> depend
> > > on them and wouldn't want to break users.
> > >
> > > Carter
> > >
> >
>

Re: Limiting the blast radius of connection leaks

Posted by Carter Kozak <ck...@ckozak.net>.
Hi Gary,

That sounds reasonable; however, there are some subtle distinctions
between general purpose object pools and those used for http
connections, perhaps the most obvious of which is http/2 multiplexed
connections. It's certainly possible that it could work, but I imagine it
could become a large project and may result in something very
specific to http clients. Pools used by blocking and asynchronous
clients may also differ substantially, so I'm not sure how general this
implementation could be.

I'd be happy to help in any way I can if folks more familiar than myself
with the current hc pool implementations are supportive of your proposal,
but I'm worried that it may become a blocker that prevents incremental
improvements.

In conclusion, I don't feel comfortable taking a stance on this and I trust
the team to make the right decision.

-ck

On Fri, Jul 31, 2020, at 17:38, Gary Gregory wrote:
> What I would prefer we do is not reinvent yet another pooling feature, all
> its potential bugs and added maintenance load when that wheel's been
> invented over and over. For me personally, I would prefer that this project
> adopt a FOSS pooling library and improve _it_ if that were needed, thereby
> improving the whole ecosystem.
> 
> Gary
> 
> On Fri, Jul 31, 2020, 16:22 Carter Kozak <ck...@ckozak.net> wrote:
> 
> > On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org>
> > wrote:
> > >
> > > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > > Hello Friends,
> > > > >
> > > > > I recently had a service crash due to an out of memory error as the
> > > > > result of a response leak. While this is caused by code that uses
> > > > > closeable resources incorrectly, and that code should be fixed, I'd
> > > > > like to discuss some ideas which could avoid catastrophic failure in
> > > > > this scenario.
> > > > >
> > > > > A little background on my environment: I maintain an RPC library
> > > > > which wraps an hc5 client. There's no limit set on the hc5 connection
> > > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > > variation of additive-increase/multiplicative-decrease limiters to
> > > > > scale based on load. Some endpoints are capable of returning a raw
> > > > > binary stream, this introduces some risk of resource leaks -- ideally
> > > > > we would take an inputstream consumer to fully own the lifecycle, but
> > > > > unfortunately that's not an API change we can make at this point. We
> > > > > return permits before passing the resulting inputstream to custom
> > > > > code to prevent leaks from blocking the entire service.
> > > > >
> > > > > Unfortunately leased connections are anchored directly to both
> > > > > StrictConnPool and LaxConnPool in order to validate that connections
> > > > > are returned to the pool where they originated, which prevents a
> > > > > relatively large swath of resources from being garbage collected in
> > > > > the event of a leak. I realize my case is a bit different due to the
> > > > > effectively unlimited pool size, where memory becomes the limiting
> > > > > factor rather than requests blocking against the pool limit, but I
> > > > > believe there are a few options that can help us to both detect and
> > > > > endure leaks.
> > > > >
> > > > > A couple potential solutions:
> > > > >
> > > > > Use weak references to track leased connections with a periodic sweep
> > > > > to detect leaks and reclaim per-route permits. This approach would be
> > > > > complex, but allows the default configuration to detect and handle
> > > > > rare response leaks without eventually blocking connection requests.
> > > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > > closed when the pool is shut down, but may survive until a full GC.
> > > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > > what guarantees are made about leaked connections.
> > > > >
> > > > > Use a counter to track leased connections instead of tracking leased
> > > > > entries directly. This loses the safety guarantees that connections
> > > > > aren't already leased, and that they're being returned to the pool
> > > > > where they originated, however it eliminates the need for the
> > > > > connection pools hash table and references to checked-out values.
> > > > > This solution would not be helpful when connections are leaked and
> > > > > the route limit is reached, so it may make more sense as a
> > > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be able
> > > > > to close leased connections when the pool is closed, they would
> > > > > remain open until they are released, when the closed pool could close
> > > > > the incoming connection.
> > > > >
> > > > > Have you encountered similar problems? If you would consider a
> > > > > contribution with one of these designs or another I have not yet
> > > > > considered, I will open a ticket and plan to implement a solution.
> > > > > Any and all feedback is greatly appreciated.
> > > > >
> > > > > Thanks,
> > > > > Carter Kozak
> > > > >
> > > >
> > > > Hi Carter
> > > >
> > > > Automatic recovery of leaked connections has been actively discussed in
> > > > HC 3.x and early HC 4.x days without ever producing any practical
> > > > outcome. It is _massively_ easier to just release connections then jump
> > > > through many hoops to try and reclaim what might or might not be some
> > > > leaked connections.
> > > >
> > > > You are very welcome to give it a shot but I would kindly ask you to
> > > > start it off as a completely new connection manager initially. Once
> > > > there is a working solution we can decide how to fold it into the
> > > > project and in what form.
> > > >
> > >
> > > This might be made easier by using Apache Commons Pool which can track
> > > borrowed but abandoned objects:
> > >
> > > The pool can also be configured to detect and remove "abandoned" objects,
> > > i.e. objects that have been checked out of the pool but neither used nor
> > > returned before the configured removeAbandonedTimeout
> > > <
> > https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout--
> > >.
> > > Abandoned object removal can be configured to happen when borrowObject is
> > > invoked and the pool is close to starvation, or it can be executed by the
> > > idle object evictor, or both. If pooled objects implement the TrackedUse
> > > <
> > https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html
> > >
> > > interface,
> > > their last use will be queried using the getLastUsed method on that
> > > interface; otherwise abandonment is determined by how long an object has
> > > been checked out from the pool.
> > >
> > > See
> > >
> > https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> > >
> > > Gary
> > >
> > > Cheers
> > > >
> > > > Oleg
> > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > > > For additional commands, e-mail: dev-help@hc.apache.org
> > > >
> > > >
> > >
> >
> > Thank you both,
> >
> > Oleg, I agree that it's much easier to close resources than it is to write
> > a pool that is capable of recovery, and we shouldn't take on undue
> > complexity to force a solution to work. That said, the surface area of code
> > that has the potential to leak (code that uses HC) is orders of magnitude
> > larger than the connection pool, and a fix could reduce JIRA load on the
> > project. I will plan to implement an initial experimental version as an
> > alternative to PoolingHttpClientConnectionManager so we can avoid impacting
> > users, and I can validate it in a high-load production environment before
> > it considered for a release.
> >
> > Gary, thanks for pointing out the abandonment feature in commons-pool! I
> > worry about detection based on a timeout because it's difficult to be
> > certain an object is abandoned. While multi-hour requests waiting for
> > response headers aren't a good idea, I've run into a few places that depend
> > on them and wouldn't want to break users.
> >
> > Carter
> >
> 

Re: Limiting the blast radius of connection leaks

Posted by Gary Gregory <ga...@gmail.com>.
What I would prefer we do is not reinvent yet another pooling feature, with all
its potential bugs and added maintenance load, when that wheel's been
invented over and over. For me personally, I would prefer that this project
adopt a FOSS pooling library and improve _it_ if that were needed, thereby
improving the whole ecosystem.

Gary

On Fri, Jul 31, 2020, 16:22 Carter Kozak <ck...@ckozak.net> wrote:

> On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> > On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org>
> wrote:
> >
> > > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > > Hello Friends,
> > > >
> > > > I recently had a service crash due to an out of memory error as the
> > > > result of a response leak. While this is caused by code that uses
> > > > closeable resources incorrectly, and that code should be fixed, I'd
> > > > like to discuss some ideas which could avoid catastrophic failure in
> > > > this scenario.
> > > >
> > > > A little background on my environment: I maintain an RPC library
> > > > which wraps an hc5 client. There's no limit set on the hc5 connection
> > > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > > variation of additive-increase/multiplicative-decrease limiters to
> > > > scale based on load. Some endpoints are capable of returning a raw
> > > > binary stream, this introduces some risk of resource leaks -- ideally
> > > > we would take an inputstream consumer to fully own the lifecycle, but
> > > > unfortunately that's not an API change we can make at this point. We
> > > > return permits before passing the resulting inputstream to custom
> > > > code to prevent leaks from blocking the entire service.
> > > >
> > > > Unfortunately leased connections are anchored directly to both
> > > > StrictConnPool and LaxConnPool in order to validate that connections
> > > > are returned to the pool where they originated, which prevents a
> > > > relatively large swath of resources from being garbage collected in
> > > > the event of a leak. I realize my case is a bit different due to the
> > > > effectively unlimited pool size, where memory becomes the limiting
> > > > factor rather than requests blocking against the pool limit, but I
> > > > believe there are a few options that can help us to both detect and
> > > > endure leaks.
> > > >
> > > > A couple potential solutions:
> > > >
> > > > Use weak references to track leased connections with a periodic sweep
> > > > to detect leaks and reclaim per-route permits. This approach would be
> > > > complex, but allows the default configuration to detect and handle
> > > > rare response leaks without eventually blocking connection requests.
> > > > Using this approach, a leaked connection wouldn't necessarily be
> > > > closed when the pool is shut down, but may survive until a full GC.
> > > > In most cases I expect clients to be long-lived, and I'm not sure
> > > > what guarantees are made about leaked connections.
> > > >
> > > > Use a counter to track leased connections instead of tracking leased
> > > > entries directly. This loses the safety guarantees that connections
> > > > aren't already leased, and that they're being returned to the pool
> > > > where they originated, however it eliminates the need for the
> > > > connection pools hash table and references to checked-out values.
> > > > This solution would not be helpful when connections are leaked and
> > > > the route limit is reached, so it may make more sense as a
> > > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be able
> > > > to close leased connections when the pool is closed, they would
> > > > remain open until they are released, when the closed pool could close
> > > > the incoming connection.
> > > >
> > > > Have you encountered similar problems? If you would consider a
> > > > contribution with one of these designs or another I have not yet
> > > > considered, I will open a ticket and plan to implement a solution.
> > > > Any and all feedback is greatly appreciated.
> > > >
> > > > Thanks,
> > > > Carter Kozak
> > > >
> > >
> > > Hi Carter
> > >
> > > Automatic recovery of leaked connections has been actively discussed in
> > > HC 3.x and early HC 4.x days without ever producing any practical
> > > outcome. It is _massively_ easier to just release connections then jump
> > > through many hoops to try and reclaim what might or might not be some
> > > leaked connections.
> > >
> > > You are very welcome to give it a shot but I would kindly ask you to
> > > start it off as a completely new connection manager initially. Once
> > > there is a working solution we can decide how to fold it into the
> > > project and in what form.
> > >
> >
> > This might be made easier by using Apache Commons Pool which can track
> > borrowed but abandoned objects:
> >
> > The pool can also be configured to detect and remove "abandoned" objects,
> > i.e. objects that have been checked out of the pool but neither used nor
> > returned before the configured removeAbandonedTimeout
> > <
> https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout--
> >.
> > Abandoned object removal can be configured to happen when borrowObject is
> > invoked and the pool is close to starvation, or it can be executed by the
> > idle object evictor, or both. If pooled objects implement the TrackedUse
> > <
> https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html
> >
> > interface,
> > their last use will be queried using the getLastUsed method on that
> > interface; otherwise abandonment is determined by how long an object has
> > been checked out from the pool.
> >
> > See
> >
> https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> >
> > Gary
> >
> > Cheers
> > >
> > > Oleg
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > > For additional commands, e-mail: dev-help@hc.apache.org
> > >
> > >
> >
>
> Thank you both,
>
> Oleg, I agree that it's much easier to close resources than it is to write
> a pool that is capable of recovery, and we shouldn't take on undue
> complexity to force a solution to work. That said, the surface area of code
> that has the potential to leak (code that uses HC) is orders of magnitude
> larger than the connection pool, and a fix could reduce JIRA load on the
> project. I will plan to implement an initial experimental version as an
> alternative to PoolingHttpClientConnectionManager so we can avoid impacting
> users, and I can validate it in a high-load production environment before
> it considered for a release.
>
> Gary, thanks for pointing out the abandonment feature in commons-pool! I
> worry about detection based on a timeout because it's difficult to be
> certain an object is abandoned. While multi-hour requests waiting for
> response headers aren't a good idea, I've run into a few places that depend
> on them and wouldn't want to break users.
>
> Carter
>

Re: Limiting the blast radius of connection leaks

Posted by Carter Kozak <ck...@ckozak.net>.
On Fri, Jul 31, 2020, at 14:31, Gary Gregory wrote:
> On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org> wrote:
> 
> > On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > > Hello Friends,
> > >
> > > I recently had a service crash due to an out of memory error as the
> > > result of a response leak. While this is caused by code that uses
> > > closeable resources incorrectly, and that code should be fixed, I'd
> > > like to discuss some ideas which could avoid catastrophic failure in
> > > this scenario.
> > >
> > > A little background on my environment: I maintain an RPC library
> > > which wraps an hc5 client. There's no limit set on the hc5 connection
> > > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > > variation of additive-increase/multiplicative-decrease limiters to
> > > scale based on load. Some endpoints are capable of returning a raw
> > > binary stream, this introduces some risk of resource leaks -- ideally
> > > we would take an inputstream consumer to fully own the lifecycle, but
> > > unfortunately that's not an API change we can make at this point. We
> > > return permits before passing the resulting inputstream to custom
> > > code to prevent leaks from blocking the entire service.
> > >
> > > Unfortunately leased connections are anchored directly to both
> > > StrictConnPool and LaxConnPool in order to validate that connections
> > > are returned to the pool where they originated, which prevents a
> > > relatively large swath of resources from being garbage collected in
> > > the event of a leak. I realize my case is a bit different due to the
> > > effectively unlimited pool size, where memory becomes the limiting
> > > factor rather than requests blocking against the pool limit, but I
> > > believe there are a few options that can help us to both detect and
> > > endure leaks.
> > >
> > > A couple potential solutions:
> > >
> > > Use weak references to track leased connections with a periodic sweep
> > > to detect leaks and reclaim per-route permits. This approach would be
> > > complex, but allows the default configuration to detect and handle
> > > rare response leaks without eventually blocking connection requests.
> > > Using this approach, a leaked connection wouldn't necessarily be
> > > closed when the pool is shut down, but may survive until a full GC.
> > > In most cases I expect clients to be long-lived, and I'm not sure
> > > what guarantees are made about leaked connections.
> > >
> > > Use a counter to track leased connections instead of tracking leased
> > > entries directly. This loses the safety guarantees that connections
> > > aren't already leased, and that they're being returned to the pool
> > > where they originated, however it eliminates the need for the
> > > connection pools hash table and references to checked-out values.
> > > This solution would not be helpful when connections are leaked and
> > > the route limit is reached, so it may make more sense as a
> > > PoolConcurrencyPolicy.UNLIMITED option. This design would not be able
> > > to close leased connections when the pool is closed, they would
> > > remain open until they are released, when the closed pool could close
> > > the incoming connection.
> > >
> > > Have you encountered similar problems? If you would consider a
> > > contribution with one of these designs or another I have not yet
> > > considered, I will open a ticket and plan to implement a solution.
> > > Any and all feedback is greatly appreciated.
> > >
> > > Thanks,
> > > Carter Kozak
> > >
> >
> > Hi Carter
> >
> > Automatic recovery of leaked connections has been actively discussed in
> > HC 3.x and early HC 4.x days without ever producing any practical
> > outcome. It is _massively_ easier to just release connections than jump
> > through many hoops to try and reclaim what might or might not be some
> > leaked connections.
> >
> > You are very welcome to give it a shot but I would kindly ask you to
> > start it off as a completely new connection manager initially. Once
> > there is a working solution we can decide how to fold it into the
> > project and in what form.
> >
> 
> This might be made easier by using Apache Commons Pool which can track
> borrowed but abandoned objects:
> 
> The pool can also be configured to detect and remove "abandoned" objects,
> i.e. objects that have been checked out of the pool but neither used nor
> returned before the configured removeAbandonedTimeout
> <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->.
> Abandoned object removal can be configured to happen when borrowObject is
> invoked and the pool is close to starvation, or it can be executed by the
> idle object evictor, or both. If pooled objects implement the TrackedUse
> <https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html>
> interface,
> their last use will be queried using the getLastUsed method on that
> interface; otherwise abandonment is determined by how long an object has
> been checked out from the pool.
> 
> See
> https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html
> 
> Gary
> 
> Cheers
> >
> > Oleg
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > For additional commands, e-mail: dev-help@hc.apache.org
> >
> >
> 

Thank you both,

Oleg, I agree that it's much easier to close resources than it is to write a pool that is capable of recovery, and we shouldn't take on undue complexity to force a solution to work. That said, the surface area of code that has the potential to leak (any code that uses HC) is orders of magnitude larger than the connection pool itself, and a fix could reduce the JIRA load on the project. I will plan to implement an initial experimental version as an alternative to PoolingHttpClientConnectionManager so we can avoid impacting existing users, and I can validate it in a high-load production environment before it is considered for a release.
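
To make that concrete, the experimental manager would track leases roughly along the lines of the sketch below. Everything here is hypothetical (none of these names exist in HttpClient), and as in the original proposal a connection that is still strongly reachable from the leaking code would not be reclaimed until it actually becomes garbage; the real version would also need to integrate with the per-route pools, shutdown, and the existing eviction thread:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.atomic.AtomicBoolean;

    /**
     * Hypothetical sketch, not HttpClient API: tracks leased connections through
     * weak references so that a connection which is leaked (never released) can be
     * detected once it becomes unreachable, and its per-route permit reclaimed.
     */
    final class LeakAwareLeaseTracker<C> {

        private final ReferenceQueue<C> collected = new ReferenceQueue<>();
        private final Map<LeaseRef<C>, Semaphore> outstanding = new ConcurrentHashMap<>();

        /** Weak handle on the leased connection; the pool holds only this, not the connection. */
        static final class LeaseRef<C> extends WeakReference<C> {
            final AtomicBoolean released = new AtomicBoolean();
            LeaseRef(C connection, ReferenceQueue<C> queue) {
                super(connection, queue);
            }
        }

        /** Record a new lease against the per-route permit semaphore. */
        LeaseRef<C> lease(C connection, Semaphore routePermits) {
            LeaseRef<C> ref = new LeaseRef<>(connection, collected);
            outstanding.put(ref, routePermits);
            return ref;
        }

        /** Normal path: the caller released the connection back to the pool. */
        void release(LeaseRef<C> ref) {
            if (ref.released.compareAndSet(false, true)) {
                Semaphore routePermits = outstanding.remove(ref);
                if (routePermits != null) {
                    routePermits.release();
                }
            }
        }

        /** Periodic sweep: reclaim permits whose connections were collected while leased. */
        int sweepLeaked() {
            int leaks = 0;
            Reference<? extends C> polled;
            while ((polled = collected.poll()) != null) {
                LeaseRef<?> ref = (LeaseRef<?>) polled;
                if (ref.released.compareAndSet(false, true)) {
                    Semaphore routePermits = outstanding.remove(ref);
                    if (routePermits != null) {
                        routePermits.release();
                        leaks++;
                    }
                }
            }
            return leaks;
        }
    }

The sweep could be driven by the same background thread that already evicts idle connections, and a non-zero return value gives a natural place to log that a leak was detected and recovered.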

Gary, thanks for pointing out the abandonment feature in commons-pool! I worry about detection based on a timeout because it's difficult to be certain an object is abandoned. While multi-hour requests waiting for response headers aren't a good idea, I've run into a few places that depend on them and wouldn't want to break users.
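
For reference, my reading of the commons-pool javadoc is that the abandonment setup would look roughly like the sketch below (the classes and the five-minute timeout are placeholders I made up, and the Duration overload assumes a recent commons-pool2 release -- older versions take the timeout in seconds). The timeout is exactly the part that worries me: whatever value is chosen will be wrong for somebody's legitimately slow request.

    import java.time.Duration;

    import org.apache.commons.pool2.BasePooledObjectFactory;
    import org.apache.commons.pool2.PooledObject;
    import org.apache.commons.pool2.impl.AbandonedConfig;
    import org.apache.commons.pool2.impl.DefaultPooledObject;
    import org.apache.commons.pool2.impl.GenericObjectPool;
    import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

    /** Placeholder standing in for a pooled, managed connection. */
    final class PooledConnection implements AutoCloseable {
        @Override
        public void close() {
            // release the underlying socket
        }
    }

    /** Placeholder factory; a real connection manager would create and validate real connections. */
    final class PooledConnectionFactory extends BasePooledObjectFactory<PooledConnection> {
        @Override
        public PooledConnection create() {
            return new PooledConnection();
        }

        @Override
        public PooledObject<PooledConnection> wrap(PooledConnection connection) {
            return new DefaultPooledObject<>(connection);
        }
    }

    final class AbandonedPoolExample {
        public static void main(String[] args) throws Exception {
            GenericObjectPoolConfig<PooledConnection> config = new GenericObjectPoolConfig<>();
            config.setMaxTotal(100);

            AbandonedConfig abandoned = new AbandonedConfig();
            abandoned.setRemoveAbandonedOnBorrow(true);      // sweep when borrowObject finds the pool near starvation
            abandoned.setRemoveAbandonedOnMaintenance(true); // sweep from the idle-object evictor as well
            abandoned.setRemoveAbandonedTimeout(Duration.ofMinutes(5)); // checked out longer than this => "abandoned"
            abandoned.setLogAbandoned(true);                 // log the borrow-site stack trace for flagged objects

            GenericObjectPool<PooledConnection> pool =
                    new GenericObjectPool<>(new PooledConnectionFactory(), config, abandoned);
            try {
                PooledConnection connection = pool.borrowObject();
                // ... hand the connection to application code; if it is never returned,
                // the pool eventually flags it as abandoned and reclaims the slot ...
                pool.returnObject(connection);
            } finally {
                pool.close();
            }
        }
    }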

Carter

Re: Limiting the blast radius of connection leaks

Posted by Gary Gregory <ga...@gmail.com>.
On Fri, Jul 31, 2020 at 12:45 PM Oleg Kalnichevski <ol...@apache.org> wrote:

> On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> > Hello Friends,
> >
> > I recently had a service crash due to an out of memory error as the
> > result of a response leak. While this is caused by code that uses
> > closeable resources incorrectly, and that code should be fixed, I'd
> > like to discuss some ideas which could avoid catastrophic failure in
> > this scenario.
> >
> > A little background on my environment: I maintain an RPC library
> > which wraps an hc5 client. There's no limit set on the hc5 connection
> > pool (Integer.MAX_VALUE), and resource use is bounded using a
> > variation of additive-increase/multiplicative-decrease limiters to
> > scale based on load. Some endpoints are capable of returning a raw
> > binary stream, this introduces some risk of resource leaks -- ideally
> > we would take an inputstream consumer to fully own the lifecycle, but
> > unfortunately that's not an API change we can make at this point. We
> > return permits before passing the resulting inputstream to custom
> > code to prevent leaks from blocking the entire service.
> >
> > Unfortunately leased connections are anchored directly to both
> > StrictConnPool and LaxConnPool in order to validate that connections
> > are returned to the pool where they originated, which prevents a
> > relatively large swath of resources from being garbage collected in
> > the event of a leak. I realize my case is a bit different due to the
> > effectively unlimited pool size, where memory becomes the limiting
> > factor rather than requests blocking against the pool limit, but I
> > believe there are a few options that can help us to both detect and
> > endure leaks.
> >
> > A couple potential solutions:
> >
> > Use weak references to track leased connections with a periodic sweep
> > to detect leaks and reclaim per-route permits. This approach would be
> > complex, but allows the default configuration to detect and handle
> > rare response leaks without eventually blocking connection requests.
> > Using this approach, a leaked connection wouldn't necessarily be
> > closed when the pool is shut down, but may survive until a full GC.
> > In most cases I expect clients to be long-lived, and I'm not sure
> > what guarantees are made about leaked connections.
> >
> > Use a counter to track leased connections instead of tracking leased
> > entries directly. This loses the safety guarantees that connections
> > aren't already leased, and that they're being returned to the pool
> > where they originated, however it eliminates the need for the
> > connection pool's hash table and references to checked-out values.
> > This solution would not be helpful when connections are leaked and
> > the route limit is reached, so it may make more sense as a
> > PoolConcurrencyPolicy.UNLIMITED option. This design would not be able
> > to close leased connections when the pool is closed, they would
> > remain open until they are released, when the closed pool could close
> > the incoming connection.
> >
> > Have you encountered similar problems? If you would consider a
> > contribution with one of these designs or another I have not yet
> > considered, I will open a ticket and plan to implement a solution.
> > Any and all feedback is greatly appreciated.
> >
> > Thanks,
> > Carter Kozak
> >
>
> Hi Carter
>
> Automatic recovery of leaked connections has been actively discussed in
> HC 3.x and early HC 4.x days without ever producing any practical
> outcome. It is _massively_ easier to just release connections than jump
> through many hoops to try and reclaim what might or might not be some
> leaked connections.
>
> You are very welcome to give it a shot but I would kindly ask you to
> start it off as a completely new connection manager initially. Once
> there is a working solution we can decide how to fold it into the
> project and in what form.
>

This might be made easier by using Apache Commons Pool which can track
borrowed but abandoned objects:

The pool can also be configured to detect and remove "abandoned" objects,
i.e. objects that have been checked out of the pool but neither used nor
returned before the configured removeAbandonedTimeout
<https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/AbandonedConfig.html#getRemoveAbandonedTimeout-->.
Abandoned object removal can be configured to happen when borrowObject is
invoked and the pool is close to starvation, or it can be executed by the
idle object evictor, or both. If pooled objects implement the TrackedUse
<https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/TrackedUse.html>
interface,
their last use will be queried using the getLastUsed method on that
interface; otherwise abandonment is determined by how long an object has
been checked out from the pool.

See
https://commons.apache.org/proper/commons-pool/apidocs/index.html?org/apache/commons/pool2/impl/GenericObjectPool.html

Gary

Cheers
>
> Oleg
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> For additional commands, e-mail: dev-help@hc.apache.org
>
>

Re: Limiting the blast radius of connection leaks

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2020-07-30 at 13:28 -0400, Carter Kozak wrote:
> Hello Friends,
> 
> I recently had a service crash due to an out of memory error as the
> result of a response leak. While this is caused by code that uses
> closeable resources incorrectly, and that code should be fixed, I'd
> like to discuss some ideas which could avoid catastrophic failure in
> this scenario.
> 
> A little background on my environment: I maintain an RPC library
> which wraps an hc5 client. There's no limit set on the hc5 connection
> pool (Integer.MAX_VALUE), and resource use is bounded using a
> variation of additive-increase/multiplicative-decrease limiters to
> scale based on load. Some endpoints are capable of returning a raw
> binary stream, this introduces some risk of resource leaks -- ideally
> we would take an inputstream consumer to fully own the lifecycle, but
> unfortunately that's not an API change we can make at this point. We
> return permits before passing the resulting inputstream to custom
> code to prevent leaks from blocking the entire service.
> 
> Unfortunately leased connections are anchored directly to both
> StrictConnPool and LaxConnPool in order to validate that connections
> are returned to the pool where they originated, which prevents a
> relatively large swath of resources from being garbage collected in
> the event of a leak. I realize my case is a bit different due to the
> effectively unlimited pool size, where memory becomes the limiting
> factor rather than requests blocking against the pool limit, but I
> believe there are a few options that can help us to both detect and
> endure leaks.
> 
> A couple potential solutions:
> 
> Use weak references to track leased connections with a periodic sweep
> to detect leaks and reclaim per-route permits. This approach would be
> complex, but allows the default configuration to detect and handle
> rare response leaks without eventually blocking connection requests.
> Using this approach, a leaked connection wouldn't necessarily be
> closed when the pool is shut down, but may survive until a full GC.
> In most cases I expect clients to be long-lived, and I'm not sure
> what guarantees are made about leaked connections.
> 
> Use a counter to track leased connections instead of tracking leased
> entries directly. This loses the safety guarantees that connections
> aren't already leased, and that they're being returned to the pool
> where they originated, however it eliminates the need for the
> connection pool's hash table and references to checked-out values.
> This solution would not be helpful when connections are leaked and
> the route limit is reached, so it may make more sense as a
> PoolConcurrencyPolicy.UNLIMITED option. This design would not be able
> to close leased connections when the pool is closed, they would
> remain open until they are released, when the closed pool could close
> the incoming connection.
> 
> Have you encountered similar problems? If you would consider a
> contribution with one of these designs or another I have not yet
> considered, I will open a ticket and plan to implement a solution.
> Any and all feedback is greatly appreciated.
> 
> Thanks,
> Carter Kozak
> 

Hi Carter

Automatic recovery of leaked connections has been actively discussed in
HC 3.x and early HC 4.x days without ever producing any practical
outcome. It is _massively_ easier to just release connections than jump
through many hoops to try and reclaim what might or might not be some
leaked connections. 

You are very welcome to give it a shot but I would kindly ask you to
start it off as a completely new connection manager initially. Once
there is a working solution we can decide how to fold it into the
project and in what form.

Cheers

Oleg 



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
For additional commands, e-mail: dev-help@hc.apache.org