You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2019/06/21 23:12:05 UTC

Review Request 70927: Fixed a memory "leak" of filters in the allocator.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/
-----------------------------------------------------------

Review request for mesos, Andrei Sekretenko and Meng Zhu.


Bugs: MESOS-9852
    https://issues.apache.org/jira/browse/MESOS-9852


Repository: mesos


Description
-------

Per MESOS-9852, the allocator does not keep a handle to the offer
filter timer, which means it cannot remove the timer if it's no
longer relevant (e.g. due to revive). This means that timers build
up in memory.

Also, the offer filter is allocated on the heap, and is not deleted
until the expiry of the timer (which may take forever!), even if the
offer filter is no longer relevant (e.g. due to revive). This means
that offer filters build up in memory.

The fix applied is to manage offer filters using a shared_ptr in the
maps, which means that they get deleted when erased. The expiration
functions need to now use a weak_ptr in case they run after the
offer filter has been erased (which is possible due to racing between
the expiry timer firing and the timer being discarded).

To discard the timers when no longer needed, the destructors of the
filters perform the discard.

This was tested against a test which spins in a revive + long decline
loop. Previously, the RSS continues to grow, but after this fix
it remains the same.


Diffs
-----

  src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
  src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 


Diff: https://reviews.apache.org/r/70927/diff/1/


Testing
-------

make check in addition to the leak test described in the description
and included in the subsequent patch (that's not planned to be committed)


Thanks,

Benjamin Mahler


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/
-----------------------------------------------------------

(Updated June 24, 2019, 7:54 p.m.)


Review request for mesos, Andrei Sekretenko and Meng Zhu.


Changes
-------

Moved the expiry future into the filter and exposed it.


Bugs: MESOS-9852
    https://issues.apache.org/jira/browse/MESOS-9852


Repository: mesos


Description
-------

Per MESOS-9852, the allocator does not keep a handle to the offer
filter timer, which means it cannot remove the timer if it's no
longer relevant (e.g. due to revive). This means that timers build
up in memory.

Also, the offer filter is allocated on the heap, and is not deleted
until the expiry of the timer (which may take forever!), even if the
offer filter is no longer relevant (e.g. due to revive). This means
that offer filters build up in memory.

The fix applied is to manage offer filters using a shared_ptr in the
maps, which means that they get deleted when erased. The expiration
functions need to now use a weak_ptr in case they run after the
offer filter has been erased (which is possible due to racing between
the expiry timer firing and the timer being discarded).

To discard the timers when no longer needed, the destructors of the
filters perform the discard.

This was tested against a test which spins in a revive + long decline
loop. Previously, the RSS continues to grow, but after this fix
it remains the same.


Diffs (updated)
-----

  src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
  src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 


Diff: https://reviews.apache.org/r/70927/diff/2/

Changes: https://reviews.apache.org/r/70927/diff/1-2/


Testing
-------

make check in addition to the leak test described in the description
and included in the subsequent patch (that's not planned to be committed)


Thanks,

Benjamin Mahler


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Benjamin Mahler <bm...@apache.org>.

> On June 22, 2019, 11:51 p.m., Meng Zhu wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Lines 2319-2325 (original), 2315-2321 (patched)
> > <https://reviews.apache.org/r/70927/diff/1/?file=2151677#file2151677line2363>
> >
> >     Isn't this dispatch redundant?

No, see the comment in recoverResources():

https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1330-L1332

I agree it's not clear, should probably be inlined in a lambda close to the comment rather than being way down here. But, unrelated to this particular fix.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/#review216077
-----------------------------------------------------------


On June 21, 2019, 11:12 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70927/
> -----------------------------------------------------------
> 
> (Updated June 21, 2019, 11:12 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Meng Zhu.
> 
> 
> Bugs: MESOS-9852
>     https://issues.apache.org/jira/browse/MESOS-9852
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9852, the allocator does not keep a handle to the offer
> filter timer, which means it cannot remove the timer if it's no
> longer relevant (e.g. due to revive). This means that timers build
> up in memory.
> 
> Also, the offer filter is allocated on the heap, and is not deleted
> until the expiry of the timer (which may take forever!), even if the
> offer filter is no longer relevant (e.g. due to revive). This means
> that offer filters build up in memory.
> 
> The fix applied is to manage offer filters using a shared_ptr in the
> maps, which means that they get deleted when erased. The expiration
> functions need to now use a weak_ptr in case they run after the
> offer filter has been erased (which is possible due to racing between
> the expiry timer firing and the timer being discarded).
> 
> To discard the timers when no longer needed, the destructors of the
> filters perform the discard.
> 
> This was tested against a test which spins in a revive + long decline
> loop. Previously, the RSS continues to grow, but after this fix
> it remains the same.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
>   src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 
> 
> 
> Diff: https://reviews.apache.org/r/70927/diff/1/
> 
> 
> Testing
> -------
> 
> make check in addition to the leak test described in the description
> and included in the subsequent patch (that's not planned to be committed)
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Meng Zhu <mz...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/#review216077
-----------------------------------------------------------


Fix it, then Ship it!




Bar Benjamin's comment, LGTM.


src/master/allocator/mesos/hierarchical.cpp
Lines 1145-1161 (original), 1154-1167 (patched)
<https://reviews.apache.org/r/70927/#comment303062>

    This exposes too much guts to my liking. Per Benjamin's comment, I think we can let filter take in a duration and a function object, then we can hide the future and the weak pointer. Ditto below.
    
    But since this is just for my aesthetics, I am OK with as is.



src/master/allocator/mesos/hierarchical.cpp
Lines 1371 (patched)
<https://reviews.apache.org/r/70927/#comment303063>

    let's avoid unrelated changes.



src/master/allocator/mesos/hierarchical.cpp
Lines 2319-2325 (original), 2315-2321 (patched)
<https://reviews.apache.org/r/70927/#comment303060>

    Isn't this dispatch redundant?


- Meng Zhu


On June 21, 2019, 4:12 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70927/
> -----------------------------------------------------------
> 
> (Updated June 21, 2019, 4:12 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Meng Zhu.
> 
> 
> Bugs: MESOS-9852
>     https://issues.apache.org/jira/browse/MESOS-9852
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9852, the allocator does not keep a handle to the offer
> filter timer, which means it cannot remove the timer if it's no
> longer relevant (e.g. due to revive). This means that timers build
> up in memory.
> 
> Also, the offer filter is allocated on the heap, and is not deleted
> until the expiry of the timer (which may take forever!), even if the
> offer filter is no longer relevant (e.g. due to revive). This means
> that offer filters build up in memory.
> 
> The fix applied is to manage offer filters using a shared_ptr in the
> maps, which means that they get deleted when erased. The expiration
> functions need to now use a weak_ptr in case they run after the
> offer filter has been erased (which is possible due to racing between
> the expiry timer firing and the timer being discarded).
> 
> To discard the timers when no longer needed, the destructors of the
> filters perform the discard.
> 
> This was tested against a test which spins in a revive + long decline
> loop. Previously, the RSS continues to grow, but after this fix
> it remains the same.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
>   src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 
> 
> 
> Diff: https://reviews.apache.org/r/70927/diff/1/
> 
> 
> Testing
> -------
> 
> make check in addition to the leak test described in the description
> and included in the subsequent patch (that's not planned to be committed)
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Benjamin Bannier <bb...@apache.org>.

> On June 22, 2019, 10:14 p.m., Benjamin Bannier wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Line 1158 (original), 1164 (patched)
> > <https://reviews.apache.org/r/70927/diff/1/?file=2151677#file2151677line1180>
> >
> >     While this improves on the previous approach where we potentially left behind a full filter instance with its large `resources`, we now leave behind a callback containing a `FrameworkID`, a `SlaveID` and a `weak_ptr`. This should be much better, but could still grow without bounds (albeit at a much slower rate).
> >     
> >     What do you think about moving the `expiry` functionality into the filters itself? The filters could e.g., be constructed from a `Timeout` as previously, would internally set up a pending `Future` member which either on `discard`/`abandon`/... of the `Future` (filter got removed) or `after` the timeout (filter expired) would invoke `HierarchicalAllocatorProcess::expire` (would require passing its `PID` in filter ctrs). That way we might be able to avoid tracking timeouts for removed filters altogether.
> 
> Meng Zhu wrote:
>     > we now leave behind a callback containing a FrameworkID, a SlaveID and a weak_ptr
>     
>     The callback will be cleared when we discard the future (and the corresponding promise) in the destructor, no?

You are right Meng, I failed to notice the `onReady`. Like you also wrote, most of this could be internal to the filters to remove coupling and exposed guts.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/#review216076
-----------------------------------------------------------


On June 22, 2019, 1:12 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70927/
> -----------------------------------------------------------
> 
> (Updated June 22, 2019, 1:12 a.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Meng Zhu.
> 
> 
> Bugs: MESOS-9852
>     https://issues.apache.org/jira/browse/MESOS-9852
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9852, the allocator does not keep a handle to the offer
> filter timer, which means it cannot remove the timer if it's no
> longer relevant (e.g. due to revive). This means that timers build
> up in memory.
> 
> Also, the offer filter is allocated on the heap, and is not deleted
> until the expiry of the timer (which may take forever!), even if the
> offer filter is no longer relevant (e.g. due to revive). This means
> that offer filters build up in memory.
> 
> The fix applied is to manage offer filters using a shared_ptr in the
> maps, which means that they get deleted when erased. The expiration
> functions need to now use a weak_ptr in case they run after the
> offer filter has been erased (which is possible due to racing between
> the expiry timer firing and the timer being discarded).
> 
> To discard the timers when no longer needed, the destructors of the
> filters perform the discard.
> 
> This was tested against a test which spins in a revive + long decline
> loop. Previously, the RSS continues to grow, but after this fix
> it remains the same.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
>   src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 
> 
> 
> Diff: https://reviews.apache.org/r/70927/diff/1/
> 
> 
> Testing
> -------
> 
> make check in addition to the leak test described in the description
> and included in the subsequent patch (that's not planned to be committed)
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Benjamin Mahler <bm...@apache.org>.

> On June 22, 2019, 8:14 p.m., Benjamin Bannier wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Line 1158 (original), 1164 (patched)
> > <https://reviews.apache.org/r/70927/diff/1/?file=2151677#file2151677line1180>
> >
> >     While this improves on the previous approach where we potentially left behind a full filter instance with its large `resources`, we now leave behind a callback containing a `FrameworkID`, a `SlaveID` and a `weak_ptr`. This should be much better, but could still grow without bounds (albeit at a much slower rate).
> >     
> >     What do you think about moving the `expiry` functionality into the filters itself? The filters could e.g., be constructed from a `Timeout` as previously, would internally set up a pending `Future` member which either on `discard`/`abandon`/... of the `Future` (filter got removed) or `after` the timeout (filter expired) would invoke `HierarchicalAllocatorProcess::expire` (would require passing its `PID` in filter ctrs). That way we might be able to avoid tracking timeouts for removed filters altogether.
> 
> Meng Zhu wrote:
>     > we now leave behind a callback containing a FrameworkID, a SlaveID and a weak_ptr
>     
>     The callback will be cleared when we discard the future (and the corresponding promise) in the destructor, no?
> 
> Benjamin Bannier wrote:
>     You are right Meng, I failed to notice the `onReady`. Like you also wrote, most of this could be internal to the filters to remove coupling and exposed guts.

Hm.. calling expire requires passing the weak pointer, which we don't have access to from within the filters themselves. I did move the future into the filter, and exposed an `expired()` function, which indeed seems a bit nicer


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/#review216076
-----------------------------------------------------------


On June 21, 2019, 11:12 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70927/
> -----------------------------------------------------------
> 
> (Updated June 21, 2019, 11:12 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Meng Zhu.
> 
> 
> Bugs: MESOS-9852
>     https://issues.apache.org/jira/browse/MESOS-9852
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9852, the allocator does not keep a handle to the offer
> filter timer, which means it cannot remove the timer if it's no
> longer relevant (e.g. due to revive). This means that timers build
> up in memory.
> 
> Also, the offer filter is allocated on the heap, and is not deleted
> until the expiry of the timer (which may take forever!), even if the
> offer filter is no longer relevant (e.g. due to revive). This means
> that offer filters build up in memory.
> 
> The fix applied is to manage offer filters using a shared_ptr in the
> maps, which means that they get deleted when erased. The expiration
> functions need to now use a weak_ptr in case they run after the
> offer filter has been erased (which is possible due to racing between
> the expiry timer firing and the timer being discarded).
> 
> To discard the timers when no longer needed, the destructors of the
> filters perform the discard.
> 
> This was tested against a test which spins in a revive + long decline
> loop. Previously, the RSS continues to grow, but after this fix
> it remains the same.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
>   src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 
> 
> 
> Diff: https://reviews.apache.org/r/70927/diff/1/
> 
> 
> Testing
> -------
> 
> make check in addition to the leak test described in the description
> and included in the subsequent patch (that's not planned to be committed)
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Meng Zhu <mz...@mesosphere.io>.

> On June 22, 2019, 1:14 p.m., Benjamin Bannier wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Line 1158 (original), 1164 (patched)
> > <https://reviews.apache.org/r/70927/diff/1/?file=2151677#file2151677line1180>
> >
> >     While this improves on the previous approach where we potentially left behind a full filter instance with its large `resources`, we now leave behind a callback containing a `FrameworkID`, a `SlaveID` and a `weak_ptr`. This should be much better, but could still grow without bounds (albeit at a much slower rate).
> >     
> >     What do you think about moving the `expiry` functionality into the filters itself? The filters could e.g., be constructed from a `Timeout` as previously, would internally set up a pending `Future` member which either on `discard`/`abandon`/... of the `Future` (filter got removed) or `after` the timeout (filter expired) would invoke `HierarchicalAllocatorProcess::expire` (would require passing its `PID` in filter ctrs). That way we might be able to avoid tracking timeouts for removed filters altogether.

> we now leave behind a callback containing a FrameworkID, a SlaveID and a weak_ptr

The callback will be cleared when we discard the future (and the corresponding promise) in the destructor, no?


- Meng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/#review216076
-----------------------------------------------------------


On June 21, 2019, 4:12 p.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70927/
> -----------------------------------------------------------
> 
> (Updated June 21, 2019, 4:12 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Meng Zhu.
> 
> 
> Bugs: MESOS-9852
>     https://issues.apache.org/jira/browse/MESOS-9852
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9852, the allocator does not keep a handle to the offer
> filter timer, which means it cannot remove the timer if it's no
> longer relevant (e.g. due to revive). This means that timers build
> up in memory.
> 
> Also, the offer filter is allocated on the heap, and is not deleted
> until the expiry of the timer (which may take forever!), even if the
> offer filter is no longer relevant (e.g. due to revive). This means
> that offer filters build up in memory.
> 
> The fix applied is to manage offer filters using a shared_ptr in the
> maps, which means that they get deleted when erased. The expiration
> functions need to now use a weak_ptr in case they run after the
> offer filter has been erased (which is possible due to racing between
> the expiry timer firing and the timer being discarded).
> 
> To discard the timers when no longer needed, the destructors of the
> filters perform the discard.
> 
> This was tested against a test which spins in a revive + long decline
> loop. Previously, the RSS continues to grow, but after this fix
> it remains the same.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
>   src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 
> 
> 
> Diff: https://reviews.apache.org/r/70927/diff/1/
> 
> 
> Testing
> -------
> 
> make check in addition to the leak test described in the description
> and included in the subsequent patch (that's not planned to be committed)
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>


Re: Review Request 70927: Fixed a memory "leak" of filters in the allocator.

Posted by Benjamin Bannier <bb...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70927/#review216076
-----------------------------------------------------------




src/master/allocator/mesos/hierarchical.cpp
Line 1158 (original), 1164 (patched)
<https://reviews.apache.org/r/70927/#comment303059>

    While this improves on the previous approach where we potentially left behind a full filter instance with its large `resources`, we now leave behind a callback containing a `FrameworkID`, a `SlaveID` and a `weak_ptr`. This should be much better, but could still grow without bounds (albeit at a much slower rate).
    
    What do you think about moving the `expiry` functionality into the filters itself? The filters could e.g., be constructed from a `Timeout` as previously, would internally set up a pending `Future` member which either on `discard`/`abandon`/... of the `Future` (filter got removed) or `after` the timeout (filter expired) would invoke `HierarchicalAllocatorProcess::expire` (would require passing its `PID` in filter ctrs). That way we might be able to avoid tracking timeouts for removed filters altogether.


- Benjamin Bannier


On June 22, 2019, 1:12 a.m., Benjamin Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70927/
> -----------------------------------------------------------
> 
> (Updated June 22, 2019, 1:12 a.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko and Meng Zhu.
> 
> 
> Bugs: MESOS-9852
>     https://issues.apache.org/jira/browse/MESOS-9852
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Per MESOS-9852, the allocator does not keep a handle to the offer
> filter timer, which means it cannot remove the timer if it's no
> longer relevant (e.g. due to revive). This means that timers build
> up in memory.
> 
> Also, the offer filter is allocated on the heap, and is not deleted
> until the expiry of the timer (which may take forever!), even if the
> offer filter is no longer relevant (e.g. due to revive). This means
> that offer filters build up in memory.
> 
> The fix applied is to manage offer filters using a shared_ptr in the
> maps, which means that they get deleted when erased. The expiration
> functions need to now use a weak_ptr in case they run after the
> offer filter has been erased (which is possible due to racing between
> the expiry timer firing and the timer being discarded).
> 
> To discard the timers when no longer needed, the destructors of the
> filters perform the discard.
> 
> This was tested against a test which spins in a revive + long decline
> loop. Previously, the RSS continues to grow, but after this fix
> it remains the same.
> 
> 
> Diffs
> -----
> 
>   src/master/allocator/mesos/hierarchical.hpp 25472eb7f5cb8342d2c88449c9c4145d816fb718 
>   src/master/allocator/mesos/hierarchical.cpp 7076e968589c05c770cf50ad289bc7889625b54f 
> 
> 
> Diff: https://reviews.apache.org/r/70927/diff/1/
> 
> 
> Testing
> -------
> 
> make check in addition to the leak test described in the description
> and included in the subsequent patch (that's not planned to be committed)
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>