You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by Mukund Madhav Thakur <mt...@cloudera.com.INVALID> on 2022/01/12 09:24:38 UTC

ElasticByteBufferPool is ever growing thus can cause memory leak.

Hello Everyone,
I was just browsing through the code while doing my Vectored IO stuff. It
seems like ElasticByteBufferPool is an ever growing pool and memory is not
getting released as there is no WeakReference being maintained in the pool.
This can cause memory leaks in the production environment.  This is widely
used in places like StripedReconstructor,
DFSStripedInputStream, BlockBlobAppendStream etc.

I would suggest we use DirectBufferPool class for direct buffer pooling as
it already is keeping WeakReference for the buffers. Although, we will have
to make this implement the ByteBufferPool interface and implement the
corresponding methods. Happy to make the API changes once finalized.

Thanks,
Mukund

Re: ElasticByteBufferPool is ever growing thus can cause memory leak.

Posted by Mukund Madhav Thakur <mt...@cloudera.com.INVALID>.
Hi Everyone,
I have committed the initial work of vectored IO in the feature branch.
https://github.com/apache/hadoop/commits/feature-vectored-io
Created an epic to track the future work
https://issues.apache.org/jira/browse/HADOOP-18103.

Please try this out and review and let me know what you think. Especially
if someone can change the parquet library to use the new vectored API and
run some spark jobs to see the improvements.
Paralllely, I am working with Rajesh to do a poc in orc and running hive
queries.


Thanks,
Mukund




On Thu, Jan 13, 2022 at 12:27 AM Steve Loughran <st...@cloudera.com.invalid>
wrote:

> DirectBufferPool does the weak refs effectively, but it only recycles
> buffers of the same size; the elastic one will return the first largest
> buffer.
>
> I think we may want to have a variant of the ElasticBufferPool which uses
> weak refs..we can adopt that and leave the (old) one alone. It is marked as
> public and stable after all
>
> I would also like a new method ByteBufferPool.release() which, when
> implementations support it, does a cleanup -essentially
> resetting the maps. interface "default" methods make it safer to add these.
>
> Handling buffet recycling really matters for the new vectored IO work which
> is very ready to go in.
> I want to create a stabilisation branch in the ASF repo with the code as
> is, then mukund can add a new PR with the buffer pooling
> improvements...that will make it easier for people to cherry pick if they
> want.
>
> On Wed, 12 Jan 2022 at 18:23, Chris Nauroth <cn...@apache.org> wrote:
>
> > Thanks for discussing this, Mukund.
> >
> > Another difference between the 2 classes is that ElasticByteBufferPool
> > supports a caller's preference for either on-heap or off-heap buffers and
> > internally calls either allocate or allocateDirect. Do you intend to
> > preserve that functionality too in a single merged class?
> >
> > Chris Nauroth
> >
> >
> > On Wed, Jan 12, 2022 at 1:25 AM Mukund Madhav Thakur
> > <mt...@cloudera.com.invalid> wrote:
> >
> > > Hello Everyone,
> > > I was just browsing through the code while doing my Vectored IO stuff.
> It
> > > seems like ElasticByteBufferPool is an ever growing pool and memory is
> > not
> > > getting released as there is no WeakReference being maintained in the
> > pool.
> > > This can cause memory leaks in the production environment.  This is
> > widely
> > > used in places like StripedReconstructor,
> > > DFSStripedInputStream, BlockBlobAppendStream etc.
> > >
> > > I would suggest we use DirectBufferPool class for direct buffer pooling
> > as
> > > it already is keeping WeakReference for the buffers. Although, we will
> > have
> > > to make this implement the ByteBufferPool interface and implement the
> > > corresponding methods. Happy to make the API changes once finalized.
> > >
> > > Thanks,
> > > Mukund
> > >
> >
>

Re: ElasticByteBufferPool is ever growing thus can cause memory leak.

Posted by Mukund Madhav Thakur <mt...@cloudera.com.INVALID>.
Hi Everyone,
I have committed the initial work of vectored IO in the feature branch.
https://github.com/apache/hadoop/commits/feature-vectored-io
Created an epic to track the future work
https://issues.apache.org/jira/browse/HADOOP-18103.

Please try this out and review and let me know what you think. Especially
if someone can change the parquet library to use the new vectored API and
run some spark jobs to see the improvements.
Paralllely, I am working with Rajesh to do a poc in orc and running hive
queries.


Thanks,
Mukund




On Thu, Jan 13, 2022 at 12:27 AM Steve Loughran <st...@cloudera.com.invalid>
wrote:

> DirectBufferPool does the weak refs effectively, but it only recycles
> buffers of the same size; the elastic one will return the first largest
> buffer.
>
> I think we may want to have a variant of the ElasticBufferPool which uses
> weak refs..we can adopt that and leave the (old) one alone. It is marked as
> public and stable after all
>
> I would also like a new method ByteBufferPool.release() which, when
> implementations support it, does a cleanup -essentially
> resetting the maps. interface "default" methods make it safer to add these.
>
> Handling buffet recycling really matters for the new vectored IO work which
> is very ready to go in.
> I want to create a stabilisation branch in the ASF repo with the code as
> is, then mukund can add a new PR with the buffer pooling
> improvements...that will make it easier for people to cherry pick if they
> want.
>
> On Wed, 12 Jan 2022 at 18:23, Chris Nauroth <cn...@apache.org> wrote:
>
> > Thanks for discussing this, Mukund.
> >
> > Another difference between the 2 classes is that ElasticByteBufferPool
> > supports a caller's preference for either on-heap or off-heap buffers and
> > internally calls either allocate or allocateDirect. Do you intend to
> > preserve that functionality too in a single merged class?
> >
> > Chris Nauroth
> >
> >
> > On Wed, Jan 12, 2022 at 1:25 AM Mukund Madhav Thakur
> > <mt...@cloudera.com.invalid> wrote:
> >
> > > Hello Everyone,
> > > I was just browsing through the code while doing my Vectored IO stuff.
> It
> > > seems like ElasticByteBufferPool is an ever growing pool and memory is
> > not
> > > getting released as there is no WeakReference being maintained in the
> > pool.
> > > This can cause memory leaks in the production environment.  This is
> > widely
> > > used in places like StripedReconstructor,
> > > DFSStripedInputStream, BlockBlobAppendStream etc.
> > >
> > > I would suggest we use DirectBufferPool class for direct buffer pooling
> > as
> > > it already is keeping WeakReference for the buffers. Although, we will
> > have
> > > to make this implement the ByteBufferPool interface and implement the
> > > corresponding methods. Happy to make the API changes once finalized.
> > >
> > > Thanks,
> > > Mukund
> > >
> >
>

Re: ElasticByteBufferPool is ever growing thus can cause memory leak.

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
DirectBufferPool does the weak refs effectively, but it only recycles
buffers of the same size; the elastic one will return the first largest
buffer.

I think we may want to have a variant of the ElasticBufferPool which uses
weak refs..we can adopt that and leave the (old) one alone. It is marked as
public and stable after all

I would also like a new method ByteBufferPool.release() which, when
implementations support it, does a cleanup -essentially
resetting the maps. interface "default" methods make it safer to add these.

Handling buffet recycling really matters for the new vectored IO work which
is very ready to go in.
I want to create a stabilisation branch in the ASF repo with the code as
is, then mukund can add a new PR with the buffer pooling
improvements...that will make it easier for people to cherry pick if they
want.

On Wed, 12 Jan 2022 at 18:23, Chris Nauroth <cn...@apache.org> wrote:

> Thanks for discussing this, Mukund.
>
> Another difference between the 2 classes is that ElasticByteBufferPool
> supports a caller's preference for either on-heap or off-heap buffers and
> internally calls either allocate or allocateDirect. Do you intend to
> preserve that functionality too in a single merged class?
>
> Chris Nauroth
>
>
> On Wed, Jan 12, 2022 at 1:25 AM Mukund Madhav Thakur
> <mt...@cloudera.com.invalid> wrote:
>
> > Hello Everyone,
> > I was just browsing through the code while doing my Vectored IO stuff. It
> > seems like ElasticByteBufferPool is an ever growing pool and memory is
> not
> > getting released as there is no WeakReference being maintained in the
> pool.
> > This can cause memory leaks in the production environment.  This is
> widely
> > used in places like StripedReconstructor,
> > DFSStripedInputStream, BlockBlobAppendStream etc.
> >
> > I would suggest we use DirectBufferPool class for direct buffer pooling
> as
> > it already is keeping WeakReference for the buffers. Although, we will
> have
> > to make this implement the ByteBufferPool interface and implement the
> > corresponding methods. Happy to make the API changes once finalized.
> >
> > Thanks,
> > Mukund
> >
>

Re: ElasticByteBufferPool is ever growing thus can cause memory leak.

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
DirectBufferPool does the weak refs effectively, but it only recycles
buffers of the same size; the elastic one will return the first largest
buffer.

I think we may want to have a variant of the ElasticBufferPool which uses
weak refs..we can adopt that and leave the (old) one alone. It is marked as
public and stable after all

I would also like a new method ByteBufferPool.release() which, when
implementations support it, does a cleanup -essentially
resetting the maps. interface "default" methods make it safer to add these.

Handling buffet recycling really matters for the new vectored IO work which
is very ready to go in.
I want to create a stabilisation branch in the ASF repo with the code as
is, then mukund can add a new PR with the buffer pooling
improvements...that will make it easier for people to cherry pick if they
want.

On Wed, 12 Jan 2022 at 18:23, Chris Nauroth <cn...@apache.org> wrote:

> Thanks for discussing this, Mukund.
>
> Another difference between the 2 classes is that ElasticByteBufferPool
> supports a caller's preference for either on-heap or off-heap buffers and
> internally calls either allocate or allocateDirect. Do you intend to
> preserve that functionality too in a single merged class?
>
> Chris Nauroth
>
>
> On Wed, Jan 12, 2022 at 1:25 AM Mukund Madhav Thakur
> <mt...@cloudera.com.invalid> wrote:
>
> > Hello Everyone,
> > I was just browsing through the code while doing my Vectored IO stuff. It
> > seems like ElasticByteBufferPool is an ever growing pool and memory is
> not
> > getting released as there is no WeakReference being maintained in the
> pool.
> > This can cause memory leaks in the production environment.  This is
> widely
> > used in places like StripedReconstructor,
> > DFSStripedInputStream, BlockBlobAppendStream etc.
> >
> > I would suggest we use DirectBufferPool class for direct buffer pooling
> as
> > it already is keeping WeakReference for the buffers. Although, we will
> have
> > to make this implement the ByteBufferPool interface and implement the
> > corresponding methods. Happy to make the API changes once finalized.
> >
> > Thanks,
> > Mukund
> >
>

Re: ElasticByteBufferPool is ever growing thus can cause memory leak.

Posted by Chris Nauroth <cn...@apache.org>.
Thanks for discussing this, Mukund.

Another difference between the 2 classes is that ElasticByteBufferPool
supports a caller's preference for either on-heap or off-heap buffers and
internally calls either allocate or allocateDirect. Do you intend to
preserve that functionality too in a single merged class?

Chris Nauroth


On Wed, Jan 12, 2022 at 1:25 AM Mukund Madhav Thakur
<mt...@cloudera.com.invalid> wrote:

> Hello Everyone,
> I was just browsing through the code while doing my Vectored IO stuff. It
> seems like ElasticByteBufferPool is an ever growing pool and memory is not
> getting released as there is no WeakReference being maintained in the pool.
> This can cause memory leaks in the production environment.  This is widely
> used in places like StripedReconstructor,
> DFSStripedInputStream, BlockBlobAppendStream etc.
>
> I would suggest we use DirectBufferPool class for direct buffer pooling as
> it already is keeping WeakReference for the buffers. Although, we will have
> to make this implement the ByteBufferPool interface and implement the
> corresponding methods. Happy to make the API changes once finalized.
>
> Thanks,
> Mukund
>

Re: ElasticByteBufferPool is ever growing thus can cause memory leak.

Posted by Chris Nauroth <cn...@apache.org>.
Thanks for discussing this, Mukund.

Another difference between the 2 classes is that ElasticByteBufferPool
supports a caller's preference for either on-heap or off-heap buffers and
internally calls either allocate or allocateDirect. Do you intend to
preserve that functionality too in a single merged class?

Chris Nauroth


On Wed, Jan 12, 2022 at 1:25 AM Mukund Madhav Thakur
<mt...@cloudera.com.invalid> wrote:

> Hello Everyone,
> I was just browsing through the code while doing my Vectored IO stuff. It
> seems like ElasticByteBufferPool is an ever growing pool and memory is not
> getting released as there is no WeakReference being maintained in the pool.
> This can cause memory leaks in the production environment.  This is widely
> used in places like StripedReconstructor,
> DFSStripedInputStream, BlockBlobAppendStream etc.
>
> I would suggest we use DirectBufferPool class for direct buffer pooling as
> it already is keeping WeakReference for the buffers. Although, we will have
> to make this implement the ByteBufferPool interface and implement the
> corresponding methods. Happy to make the API changes once finalized.
>
> Thanks,
> Mukund
>