You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Timothy Chen <tn...@apache.org> on 2014/07/03 01:37:36 UTC

Re: Review Request 22796: Add timeout to rescind unused offers

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 2, 2014, 11:37 p.m.)


Review request for mesos, Adam B and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp 5fef354 
  src/master/master.cpp 251b699 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47273
-----------------------------------------------------------


Patch looks great!

Reviews applied: [22796]

All tests passed.

- Mesos ReviewBot


On July 3, 2014, 12:42 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 3, 2014, 12:42 a.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 5fef354 
>   src/master/master.cpp 251b699 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47598
-----------------------------------------------------------


Patch looks great!

Reviews applied: [22796]

All tests passed.

- Mesos ReviewBot


On July 10, 2014, 5:47 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 10, 2014, 5:47 p.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 8641f2d 
>   src/master/master.cpp 86b147f 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47989
-----------------------------------------------------------

Ship it!


Just fix these minor issues, especially the Clock::advance(0), and then we can commit it!


src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment84249>

    Clock::resume() after?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment84247>

    End your comment with a period. (new lint rule?)



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment84244>

    s/Register/Registered/



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment84245>

    If masterFlags.offer_timeout == 0, this won't be advancing the clock at all. How about doubling the default timeout instead?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment84246>

    rescinded
    Also, end the comment with a period.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment84248>

    Only single-space. (new lint rule?)


- Adam B


On July 10, 2014, 10:47 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 10, 2014, 10:47 a.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 8641f2d 
>   src/master/master.cpp 86b147f 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Adam B <ad...@mesosphere.io>.

> On July 23, 2014, 6:42 p.m., Ben Mahler wrote:
> > Could you add me to the reviewers for this patch? I would like to take a look at this one before this gets committed.

I'll wait for BenM to take a look before committing. Fairly straightforward, but maybe you've got a preferred default timeout or parameter name or something.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48610
-----------------------------------------------------------


On July 23, 2014, 8 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 23, 2014, 8 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48610
-----------------------------------------------------------


Could you add me to the reviewers for this patch? I would like to take a look at this one before this gets committed.

- Ben Mahler


On July 17, 2014, 9:30 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 17, 2014, 9:30 p.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.

> On July 29, 2014, 10:15 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 4318-4322
> > <https://reviews.apache.org/r/22796/diff/10/?file=644109#file644109line4318>
> >
> >     Do we need to store the Timers at all? OfferIDs are unique and so this should be correct without storing / canceling Timers. Given that, should we simplify it a bit and remove the Timer storage / cancellation?

Hi Ben, I think it should be technically still correct but I wonder if we should try to cancel the timer to avoid it being still registered in the time period for the delay, and won't always call removeTimer in the normal path? 


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review49026
-----------------------------------------------------------


On July 29, 2014, 1:08 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 29, 2014, 1:08 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.

> On July 29, 2014, 10:15 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 4318-4322
> > <https://reviews.apache.org/r/22796/diff/10/?file=644109#file644109line4318>
> >
> >     Do we need to store the Timers at all? OfferIDs are unique and so this should be correct without storing / canceling Timers. Given that, should we simplify it a bit and remove the Timer storage / cancellation?
> 
> Timothy Chen wrote:
>     Hi Ben, I think it should be technically still correct but I wonder if we should try to cancel the timer to avoid it being still registered in the time period for the delay, and won't always call removeTimer in the normal path?

Ok, seems good to cancel these timers as we don't fully understand the scalability of having a ton of Timers within libprocess, and there will be a lot of offers for large clusters with high churn.

But let's leave a comment here saying that canceling the Timers is only done to avoid having too many active Timers in libprocess?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review49026
-----------------------------------------------------------


On July 29, 2014, 1:08 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 29, 2014, 1:08 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.

> On July 29, 2014, 10:15 p.m., Ben Mahler wrote:
> > src/tests/master_tests.cpp, lines 2065-2128
> > <https://reviews.apache.org/r/22796/diff/10/?file=644110#file644110line2065>
> >
> >     I'm not sure we need this test, since it's conflating a few things:
> >     
> >     (1) Frameworks do not receive any events from the master after they have un-registered, so this is not specific to rescinded offers.
> >     
> >     (2) Framework unregistration means that the framework is fully shutdown in the cluster, at which point any outstanding offers would be invalid (meaning that they are implicitly rescinded). However, we don't send these rescinded notifications to the unregistered framework because we completely stop communicating with the framework after unregistration.
> >     
> >     Does this make sense? Any reason to keep it?

I see, so this should be already been tested by other tests then? I'll remove it.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review49026
-----------------------------------------------------------


On July 29, 2014, 1:08 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 29, 2014, 1:08 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review49026
-----------------------------------------------------------


Looks great now, I had a few notes to clean up the tests a bit. Let me know what you think!

Just to confirm, you are running these test with a high number of iterations to ensure they are not flaky, right?


src/master/master.hpp
<https://reviews.apache.org/r/22796/#comment85859>

    const &



src/master/master.cpp
<https://reviews.apache.org/r/22796/#comment85862>

    Do we need to store the Timers at all? OfferIDs are unique and so this should be correct without storing / canceling Timers. Given that, should we simplify it a bit and remove the Timer storage / cancellation?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85870>

    How about s/RescindResourceOffers/OfferTimeoutNonZero/



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85864>

    Can you capture the Offers so we can later ensure the full resources are re-offered?
    
    E.g.
    
      Future<vector<Offer> > offers1;
      Future<vector<Offer> > offers2;
      EXPECT_CALL(sched, resourceOffers(&driver, _))
        .WillOnce(FutureArg<1>(&offers1));
        .WillOnce(FutureArg<1>(&offers2));
    
      ...
      
      AWAIT_READY(resourcesRecovered);
      AWAIT_READY(offers2);
    
      ASSERT_EQ(1u, offers2.get().size());
    
      // Ensure the scheduler is re-offered the resources through a new offer.
      EXPECT_EQ(Resources(offers1.get()[0].resources()),
                Resources(offers2.get()[0].resources()));



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85865>

    We'll often try to match these with the callback names so that it's clear what callback we're expecting:
    
    s/rescinded/offerRescinded/



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85868>

    I'm not sure we need this test, since it's conflating a few things:
    
    (1) Frameworks do not receive any events from the master after they have un-registered, so this is not specific to rescinded offers.
    
    (2) Framework unregistration means that the framework is fully shutdown in the cluster, at which point any outstanding offers would be invalid (meaning that they are implicitly rescinded). However, we don't send these rescinded notifications to the unregistered framework because we completely stop communicating with the framework after unregistration.
    
    Does this make sense? Any reason to keep it?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85869>

    /timeout/the rescind timeout/



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85871>

    How about s/RescindResourceOffers/OfferTimeoutZero/
    
    It seems this would let us easily compare the test case here vs. the one above based on the naming. What do you think?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85872>

    What about making sure that we get a single offer and no more than that after we advance the clock?
    
    We can do this using a .WillOnce and a .WillRepeatedly that satisfy two futures. Once the first future is ready, we advance the clock. After the clock is settled we can assert that the second future is still pending.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85873>

    Do you want a corollary test?
    
    This one:  OfferNotRescindedOnceUsed
    Corollary: OfferNotRescindedOnceDeclined


- Ben Mahler


On July 29, 2014, 1:08 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 29, 2014, 1:08 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review49018
-----------------------------------------------------------


Patch looks great!

Reviews applied: [22796]

All tests passed.

- Mesos ReviewBot


On July 29, 2014, 1:08 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 29, 2014, 1:08 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 29, 2014, 1:08 a.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp d8a4d9e 
  src/master/master.cpp 273a516 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 28, 2014, 8:34 p.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp d8a4d9e 
  src/master/master.cpp 273a516 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 28, 2014, 8:28 p.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp d8a4d9e 
  src/master/master.cpp 273a516 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.

> On July 24, 2014, 6:27 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 3451-3457
> > <https://reviews.apache.org/r/22796/diff/7/?file=634636#file634636line3451>
> >
> >     I forgot to mention the bug here in my comment!
> >     
> >     With using an offerTimeout function, you can properly get the resources back from the allocator.
> >     
> >     This current patch removes the offer but doesn't tell the allocator!
> 
> Ben Mahler wrote:
>     Ideally we could capture the allocator expectations in the test, which would have caught this issue.

Not sure I understand, I thought removeOffer call already handles rescinding offers which also gives back allocated resources to the slave and framework?
This patch simply adds a timeout to call rescind when it's not claimed?


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48670
-----------------------------------------------------------


On July 28, 2014, 8:28 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 28, 2014, 8:28 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Benjamin Mahler <be...@gmail.com>.
https://issues.apache.org/jira/browse/MESOS-1452 :)


On Thu, Jul 24, 2014 at 11:28 AM, Ben Mahler <be...@gmail.com>
wrote:

>
>
> > On July 24, 2014, 6:27 p.m., Ben Mahler wrote:
> > > src/master/master.cpp, lines 3451-3457
> > > <
> https://reviews.apache.org/r/22796/diff/7/?file=634636#file634636line3451>
> > >
> > >     I forgot to mention the bug here in my comment!
> > >
> > >     With using an offerTimeout function, you can properly get the
> resources back from the allocator.
> > >
> > >     This current patch removes the offer but doesn't tell the
> allocator!
>
> Ideally we could capture the allocator expectations in the test, which
> would have caught this issue.
>
>
> - Ben
>
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/#review48670
> -----------------------------------------------------------
>
>
> On July 24, 2014, 3 a.m., Timothy Chen wrote:
> >
> > -----------------------------------------------------------
> > This is an automatically generated e-mail. To reply, visit:
> > https://reviews.apache.org/r/22796/
> > -----------------------------------------------------------
> >
> > (Updated July 24, 2014, 3 a.m.)
> >
> >
> > Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> >
> >
> > Bugs: MESOS-186
> >     https://issues.apache.org/jira/browse/MESOS-186
> >
> >
> > Repository: mesos-git
> >
> >
> > Description
> > -------
> >
> > Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding
> timeout for each offer from master to remove the offer when it's no longer
> used.
> >
> >
> > Diffs
> > -----
> >
> >   src/master/flags.hpp 32704ce
> >   src/master/master.hpp fa46a67
> >   src/master/master.cpp fb2fd5a
> >   src/tests/master_tests.cpp 5a1cf7f
> >
> > Diff: https://reviews.apache.org/r/22796/diff/
> >
> >
> > Testing
> > -------
> >
> > Added three more unit tests from Kapil's patch: Testing offer not
> rescinded after task launched, offer not rescinded when framework/slave
> unregistered.
> > The test exposed a race condition that can lead to a segfault if two
> remove offers are called on the same offer.
> >
> > make check.
> >
> >
> > Thanks,
> >
> > Timothy Chen
> >
> >
>
>

Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.

> On July 24, 2014, 6:27 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 3451-3457
> > <https://reviews.apache.org/r/22796/diff/7/?file=634636#file634636line3451>
> >
> >     I forgot to mention the bug here in my comment!
> >     
> >     With using an offerTimeout function, you can properly get the resources back from the allocator.
> >     
> >     This current patch removes the offer but doesn't tell the allocator!

Ideally we could capture the allocator expectations in the test, which would have caught this issue.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48670
-----------------------------------------------------------


On July 24, 2014, 3 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 24, 2014, 3 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.

> On July 24, 2014, 6:27 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 3451-3457
> > <https://reviews.apache.org/r/22796/diff/7/?file=634636#file634636line3451>
> >
> >     I forgot to mention the bug here in my comment!
> >     
> >     With using an offerTimeout function, you can properly get the resources back from the allocator.
> >     
> >     This current patch removes the offer but doesn't tell the allocator!
> 
> Ben Mahler wrote:
>     Ideally we could capture the allocator expectations in the test, which would have caught this issue.
> 
> Timothy Chen wrote:
>     Not sure I understand, I thought removeOffer call already handles rescinding offers which also gives back allocated resources to the slave and framework?
>     This patch simply adds a timeout to call rescind when it's not claimed?
> 
> Ben Mahler wrote:
>     Take a look at other calls to removeOffer, this one is in the same vein as this review: https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L1857
>     
>     This trickiness was the motivation for: https://issues.apache.org/jira/browse/MESOS-1452
>     
>     Let's improve the test here! We should expect that after the timeout, the scheduler receives another offer for the same resources, that will not happen with the current diff.

Ok I see what you mean, I've added the same check in my test.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48670
-----------------------------------------------------------


On July 29, 2014, 1:08 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 29, 2014, 1:08 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.

> On July 24, 2014, 6:27 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 3451-3457
> > <https://reviews.apache.org/r/22796/diff/7/?file=634636#file634636line3451>
> >
> >     I forgot to mention the bug here in my comment!
> >     
> >     With using an offerTimeout function, you can properly get the resources back from the allocator.
> >     
> >     This current patch removes the offer but doesn't tell the allocator!
> 
> Ben Mahler wrote:
>     Ideally we could capture the allocator expectations in the test, which would have caught this issue.
> 
> Timothy Chen wrote:
>     Not sure I understand, I thought removeOffer call already handles rescinding offers which also gives back allocated resources to the slave and framework?
>     This patch simply adds a timeout to call rescind when it's not claimed?

Take a look at other calls to removeOffer, this one is in the same vein as this review: https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L1857

This trickiness was the motivation for: https://issues.apache.org/jira/browse/MESOS-1452

Let's improve the test here! We should expect that after the timeout, the scheduler receives another offer for the same resources, that will not happen with the current diff.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48670
-----------------------------------------------------------


On July 28, 2014, 8:34 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 28, 2014, 8:34 p.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp d8a4d9e 
>   src/master/master.cpp 273a516 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48670
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/22796/#comment85410>

    I forgot to mention the bug here in my comment!
    
    With using an offerTimeout function, you can properly get the resources back from the allocator.
    
    This current patch removes the offer but doesn't tell the allocator!


- Ben Mahler


On July 24, 2014, 3 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 24, 2014, 3 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.

> On July 24, 2014, 6:23 p.m., Ben Mahler wrote:
> > src/tests/master_tests.cpp, lines 2055-2115
> > <https://reviews.apache.org/r/22796/diff/7/?file=634637#file634637line2055>
> >
> >     I'm a bit puzzled by this test, I don't think we need this test since the framework cannot receive _any_ events after a framework unregistration, which seems orthogonal to the timeout you've introduced.

Makes sense, I'll remove it then.


> On July 24, 2014, 6:23 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 291-293
> > <https://reviews.apache.org/r/22796/diff/7/?file=634634#file634634line291>
> >
> >     How about:
> >     
> >     "Duration of time before an offer is rescinded from a framework. This can help fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers. To disable the timeout, set this to '0secs'".
> >     
> >     Will '0' work for this, since it's a duration?

Duration parse doesn't take zero though, has to specify a unit. Can add a special case for zero in another patch.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48665
-----------------------------------------------------------


On July 24, 2014, 3 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 24, 2014, 3 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.

> On July 24, 2014, 6:23 p.m., Ben Mahler wrote:
> > src/master/master.cpp, lines 4293-4296
> > <https://reviews.apache.org/r/22796/diff/7/?file=634636#file634636line4293>
> >
> >     It looks like this NULL case is only possible for the rescind timeout, since all synchronous calls to removeOffers previously could not hit this condition.
> >     
> >     What about instead adding a small method for the delay() to invoke:
> >     
> >     void Master::offerTimeout(const OfferID&)
> >     {
> >       Offer* offer = getOffer(offerId);
> >       if (offer != NULL) {
> >         removeOffer(offer, true);
> >       }
> >     }
> >     
> >     This way, we only introduce the NULL case where it's expected and harmless. It makes it even less critical to remove the timers correctly since it wouldn't have an impact on the logging.
> >     
> >     The other reason I would suggest this approach is that the contents of offerTimeout would be a possible candidate for being moved to a C++11 lambda function instead of making a member method.

I added a offerTimeout method, but since the offer object can be deleted I'm passing OfferID as value.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48665
-----------------------------------------------------------


On July 24, 2014, 3 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 24, 2014, 3 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48665
-----------------------------------------------------------


Thanks Adam and Tim!

Looks good, I made a few small suggestions below, mainly to try to avoid logging an INFO message every time the offer removal race occurs.

Was surprised that the tests needed to use 2x the timeout? Did you run them in repetition?


src/master/flags.hpp
<https://reviews.apache.org/r/22796/#comment85390>

    If you look above, only one space between each flag. :)



src/master/flags.hpp
<https://reviews.apache.org/r/22796/#comment85392>

    How about:
    
    "Duration of time before an offer is rescinded from a framework. This can help fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers. To disable the timeout, set this to '0secs'".
    
    Will '0' work for this, since it's a duration?



src/master/master.hpp
<https://reviews.apache.org/r/22796/#comment85393>

    const &



src/master/master.cpp
<https://reviews.apache.org/r/22796/#comment85394>

    How about a little comment here:
    
    // Rescind the offer after the timeout elapses.
    
    What about wrapping at the delay, might be less "jagged" since it's only a few characters:
    
      offerTimers[offer->id()] =
        delay(flags.offer_timeout,
              self(),
              &Self::removeOffer,
              offer->id(),
              true);



src/master/master.cpp
<https://reviews.apache.org/r/22796/#comment85395>

    const &



src/master/master.cpp
<https://reviews.apache.org/r/22796/#comment85396>

    It looks like this NULL case is only possible for the rescind timeout, since all synchronous calls to removeOffers previously could not hit this condition.
    
    What about instead adding a small method for the delay() to invoke:
    
    void Master::offerTimeout(const OfferID&)
    {
      Offer* offer = getOffer(offerId);
      if (offer != NULL) {
        removeOffer(offer, true);
      }
    }
    
    This way, we only introduce the NULL case where it's expected and harmless. It makes it even less critical to remove the timers correctly since it wouldn't have an impact on the logging.
    
    The other reason I would suggest this approach is that the contents of offerTimeout would be a possible candidate for being moved to a C++11 lambda function instead of making a member method.



src/master/master.cpp
<https://reviews.apache.org/r/22796/#comment85406>

    hashmap has .contains():
    
    if (offerTimers.contains(offer->id())) {
      ...
    }



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85398>

    Why did you need 2x the timeout? 1x should be enough, no?
    
    Could you add a little comment to this section? E.g.
    
    "
    Now advance to the offer timeout, we need to settle the clock to ensure that the offer rescind timeout would be processed if triggered.
    "



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85397>

    Might want to store the flags, pass them to the master and use them here, to ensure they have the same values:
    
    master::Flags masterFlags = MesosTest::CreateMasterFlags();
    Try<PID<Master> > master = StartMaster(masterFlags);
    ...
    Clock::advance(masterFlags.offer_timeout * 2.0);
    
    Otherwise there is an implicit assumption that StartMaster() uses CreateMasterFlags().



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85399>

    I'm a bit puzzled by this test, I don't think we need this test since the framework cannot receive _any_ events after a framework unregistration, which seems orthogonal to the timeout you've introduced.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85401>

    Could you add the same comment here that I mentioned above?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85405>

    s/task/a task/



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85404>

    Ditto above for comment and 1x timeout.


- Ben Mahler


On July 24, 2014, 3 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 24, 2014, 3 a.m.)
> 
> 
> Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 24, 2014, 3 a.m.)


Review request for mesos, Adam B, Ben Mahler, and Niklas Nielsen.


Changes
-------

Add Ben as reviewer


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp fa46a67 
  src/master/master.cpp fb2fd5a 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review48478
-----------------------------------------------------------

Ship it!


Ship it! Super-trivial changes, which I can fix before committing this.
Thanks, Tim!


src/master/flags.hpp
<https://reviews.apache.org/r/22796/#comment85116>

    Unnecessary whitespace at begin and end of string.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment85117>

    s/Rescind/Rescinded/


- Adam B


On July 17, 2014, 2:30 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 17, 2014, 2:30 p.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp fa46a67 
>   src/master/master.cpp fb2fd5a 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 17, 2014, 9:30 p.m.)


Review request for mesos, Adam B and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp fa46a67 
  src/master/master.cpp fb2fd5a 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 17, 2014, 8:08 p.m.)


Review request for mesos, Adam B and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp fa46a67 
  src/master/master.cpp fb2fd5a 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 10, 2014, 5:47 p.m.)


Review request for mesos, Adam B and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp 8641f2d 
  src/master/master.cpp 86b147f 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 10, 2014, 5:45 p.m.)


Review request for mesos, Adam B and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp 8641f2d 
  src/master/master.cpp 86b147f 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Adam B <ad...@mesosphere.io>.

> On July 9, 2014, 11:52 p.m., Timothy Chen wrote:
> > src/tests/master_tests.cpp, line 2122
> > <https://reviews.apache.org/r/22796/diff/3/?file=623252#file623252line2122>
> >
> >     it's actually not expecting rescind to send to unregistered slave. Let me fix this

You can drop this issue now.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47552
-----------------------------------------------------------


On July 10, 2014, 10:47 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 10, 2014, 10:47 a.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 8641f2d 
>   src/master/master.cpp 86b147f 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47552
-----------------------------------------------------------



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83559>

    it's actually not expecting rescind to send to unregistered slave. Let me fix this


- Timothy Chen


On July 3, 2014, 12:42 a.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 3, 2014, 12:42 a.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 5fef354 
>   src/master/master.cpp 251b699 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47489
-----------------------------------------------------------


Looks great. Just a few minor nits, and a request for an offer_timeout=0 test, then we can ship it!


src/master/flags.hpp
<https://reviews.apache.org/r/22796/#comment83343>

    Mention that offer_timeout=0 means never timeout.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83344>

    Extra blank line. Use 2 here, not 3.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83345>

    If you're not modifying the default masterFlags, you can just call StartMaster() and it will CreateMasterFlags() for you.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83346>

    indent



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83347>

    Ditto. Just use StartSlave() if you're not modifying the default flags



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83348>

    StartMaster()



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83350>

    indent



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83349>

    StartSlave()



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83352>

    How about something a little more descriptive than 'message'?



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83351>

    Times(1) is implicit. You can remove this line.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83358>

    Weird test. According to your comment below, "We expect rescind to be called from unregister slave". So this comment is inaccurate. And you're not even testing your offer_timeout here then.
    I'd suggest instead, a test when offer_timeout=0 that the offer is never rescinded.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83353>

    StartMaster()



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83354>

    indent



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83355>

    StartSlave()



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83356>

    Times(1) is implicit, unnecessary



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83357>

    indent



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83360>

    BUG: Declare the FUTURE_PROTOBUF for SlaveRegisteredMessage before you call StartSlave, else it might happen before you setup your future expectation.



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83359>

    This is the old-school verbose way of launching a task. Find an example of the LaunchTasks() test action in one of these other tests, like SlaveTest.TerminatingSlaveDoesNotReregister



src/tests/master_tests.cpp
<https://reviews.apache.org/r/22796/#comment83361>

    Times(1) is implicit, unnecessary


- Adam B


On July 2, 2014, 5:42 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 2, 2014, 5:42 p.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 5fef354 
>   src/master/master.cpp 251b699 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/
-----------------------------------------------------------

(Updated July 3, 2014, 12:42 a.m.)


Review request for mesos, Adam B and Niklas Nielsen.


Bugs: MESOS-186
    https://issues.apache.org/jira/browse/MESOS-186


Repository: mesos-git


Description
-------

Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.


Diffs (updated)
-----

  src/master/flags.hpp 32704ce 
  src/master/master.hpp 5fef354 
  src/master/master.cpp 251b699 
  src/tests/master_tests.cpp 5a1cf7f 

Diff: https://reviews.apache.org/r/22796/diff/


Testing
-------

Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.

make check.


Thanks,

Timothy Chen


Re: Review Request 22796: Add timeout to rescind unused offers

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22796/#review47265
-----------------------------------------------------------


Bad patch!

Reviews applied: [22796]

Failed command: ./support/mesos-style.py

Error:
 Checking 486 files using filter --filter=-,+build/class,+build/deprecated,+build/endif_comment,+runtime/vlog,+whitespace/blank_line,+whitespace/comma,+whitespace/ending_newline,+whitespace/forcolon,+whitespace/indent,+whitespace/line_length,+whitespace/tab,+whitespace/todo
src/master/master.cpp:3442:  Lines should be <= 80 characters long  [whitespace/line_length] [2]
Total errors found: 1


- Mesos ReviewBot


On July 2, 2014, 11:37 p.m., Timothy Chen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22796/
> -----------------------------------------------------------
> 
> (Updated July 2, 2014, 11:37 p.m.)
> 
> 
> Review request for mesos, Adam B and Niklas Nielsen.
> 
> 
> Bugs: MESOS-186
>     https://issues.apache.org/jira/browse/MESOS-186
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Based on Kapil's patch (https://reviews.apache.org/r/22066/), adding timeout for each offer from master to remove the offer when it's no longer used.
> 
> 
> Diffs
> -----
> 
>   src/master/flags.hpp 32704ce 
>   src/master/master.hpp 5fef354 
>   src/master/master.cpp 251b699 
>   src/tests/master_tests.cpp 5a1cf7f 
> 
> Diff: https://reviews.apache.org/r/22796/diff/
> 
> 
> Testing
> -------
> 
> Added three more unit tests from Kapil's patch: Testing offer not rescinded after task launched, offer not rescinded when framework/slave unregistered.
> The test exposed a race condition that can lead to a segfault if two remove offers are called on the same offer.
> 
> make check.
> 
> 
> Thanks,
> 
> Timothy Chen
> 
>