You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/08/23 03:15:59 UTC

Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Bugs: MESOS-659
    https://issues.apache.org/jira/browse/MESOS-659


Repository: mesos-git


Description
-------

See MESOS-659.


Diffs
-----

  src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
  src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 

Diff: https://reviews.apache.org/r/13757/diff/


Testing
-------

Added a test that catches this case when not fixed.


Thanks,

Ben Mahler


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Ben Mahler <be...@gmail.com>.

> On Sept. 16, 2013, 6:22 p.m., Vinod Kone wrote:
> > src/master/master.cpp, line 1899
> > <https://reviews.apache.org/r/13757/diff/1/?file=344097#file344097line1899>
> >
> >     From MESOS-488 it sounded like we are in this situation when the reregister message is a duplicate. Are there other cases? If it is a duplicate can't we just ignore them? For example, can we ignore the message in reregisterFramework() if (framework->active && from == framework->pid)?
> >     
> >     In either case, please add a comment in the code.

Added a comment in reregisterFramework() as to why we cannot attempt to ignore duplicates.
Added a comment in failoverFramework as to why it is safe to not send FrameworkError when the pid is the same.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/#review26147
-----------------------------------------------------------


On Aug. 23, 2013, 4:15 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13757/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 4:15 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-488
>     https://issues.apache.org/jira/browse/MESOS-488
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-659.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
>   src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 
> 
> Diff: https://reviews.apache.org/r/13757/diff/
> 
> 
> Testing
> -------
> 
> Added a test that catches this case when not fixed.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/#review26147
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/13757/#comment51054>

    From MESOS-488 it sounded like we are in this situation when the reregister message is a duplicate. Are there other cases? If it is a duplicate can't we just ignore them? For example, can we ignore the message in reregisterFramework() if (framework->active && from == framework->pid)?
    
    In either case, please add a comment in the code.


- Vinod Kone


On Aug. 23, 2013, 4:15 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13757/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 4:15 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-488
>     https://issues.apache.org/jira/browse/MESOS-488
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-659.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
>   src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 
> 
> Diff: https://reviews.apache.org/r/13757/diff/
> 
> 
> Testing
> -------
> 
> Added a test that catches this case when not fixed.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Ben Mahler <be...@gmail.com>.

> On Oct. 7, 2013, 7:11 p.m., Vinod Kone wrote:
> >

Also manipulated the clock so that the test does not take over a second to run (for the re-registration retry).


> On Oct. 7, 2013, 7:11 p.m., Vinod Kone wrote:
> > src/tests/fault_tolerance_tests.cpp, lines 724-725
> > <https://reviews.apache.org/r/13757/diff/2/?file=361444#file361444line724>
> >
> >     Do we need the slave in this test? Killing it would make this test much simpler!

Thanks for catching this!


> On Oct. 7, 2013, 7:11 p.m., Vinod Kone wrote:
> > src/tests/fault_tolerance_tests.cpp, lines 738-739
> > <https://reviews.apache.org/r/13757/diff/2/?file=361444#file361444line738>
> >
> >     you can kill this if there is no slave.

Thanks!


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/#review26745
-----------------------------------------------------------


On Oct. 4, 2013, 6:15 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13757/
> -----------------------------------------------------------
> 
> (Updated Oct. 4, 2013, 6:15 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-488
>     https://issues.apache.org/jira/browse/MESOS-488
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-659.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp ce8365f082a5f96ef64e33e526cb5047dff52127 
>   src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 
> 
> Diff: https://reviews.apache.org/r/13757/diff/
> 
> 
> Testing
> -------
> 
> Added a test that catches this case when not fixed.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/#review26745
-----------------------------------------------------------

Ship it!



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/13757/#comment52056>

    Do we need the slave in this test? Killing it would make this test much simpler!



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/13757/#comment52057>

    you can kill this if there is no slave.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/13757/#comment52058>

    s/,/ to the failed over scheduler;/
    
    s/ensure/ensures/



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/13757/#comment52059>

    .Times(0) on the next line.



src/tests/fault_tolerance_tests.cpp
<https://reviews.apache.org/r/13757/#comment52061>

    you can kill these too if there is no slave.


- Vinod Kone


On Oct. 4, 2013, 6:15 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13757/
> -----------------------------------------------------------
> 
> (Updated Oct. 4, 2013, 6:15 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-488
>     https://issues.apache.org/jira/browse/MESOS-488
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-659.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp ce8365f082a5f96ef64e33e526cb5047dff52127 
>   src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 
> 
> Diff: https://reviews.apache.org/r/13757/diff/
> 
> 
> Testing
> -------
> 
> Added a test that catches this case when not fixed.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/
-----------------------------------------------------------

(Updated Oct. 8, 2013, 6:11 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Cleaned up the test from Vinod's review, NNFR.


Bugs: MESOS-488
    https://issues.apache.org/jira/browse/MESOS-488


Repository: mesos-git


Description
-------

See MESOS-659.


Diffs (updated)
-----

  src/master/master.cpp ce8365f082a5f96ef64e33e526cb5047dff52127 
  src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 

Diff: https://reviews.apache.org/r/13757/diff/


Testing
-------

Added a test that catches this case when not fixed.


Thanks,

Ben Mahler


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/
-----------------------------------------------------------

(Updated Oct. 4, 2013, 6:15 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Added comments (and rebased against master).


Bugs: MESOS-488
    https://issues.apache.org/jira/browse/MESOS-488


Repository: mesos-git


Description
-------

See MESOS-659.


Diffs (updated)
-----

  src/master/master.cpp ce8365f082a5f96ef64e33e526cb5047dff52127 
  src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 

Diff: https://reviews.apache.org/r/13757/diff/


Testing
-------

Added a test that catches this case when not fixed.


Thanks,

Ben Mahler


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/
-----------------------------------------------------------

(Updated Aug. 23, 2013, 4:15 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Bugs: MESOS-488
    https://issues.apache.org/jira/browse/MESOS-488


Repository: mesos-git


Description
-------

See MESOS-659.


Diffs
-----

  src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
  src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 

Diff: https://reviews.apache.org/r/13757/diff/


Testing
-------

Added a test that catches this case when not fixed.


Thanks,

Ben Mahler


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Ben Mahler <be...@gmail.com>.

> On Aug. 23, 2013, 1:26 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 1899
> > <https://reviews.apache.org/r/13757/diff/1/?file=344097#file344097line1899>
> >
> >     A comment here would be nice.
> >     
> >     Also, why not just ignore these messages in Master::reregisterFramework()?

It's possible for the same PID to re-register with failover == true (e.g. if the scheduler crashed and restarted on the same machine). In that case, we'll still need to go through the motions of setting it to active and flushing the offers, no?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/#review25450
-----------------------------------------------------------


On Aug. 23, 2013, 4:15 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13757/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 4:15 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-488
>     https://issues.apache.org/jira/browse/MESOS-488
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-659.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
>   src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 
> 
> Diff: https://reviews.apache.org/r/13757/diff/
> 
> 
> Testing
> -------
> 
> Added a test that catches this case when not fixed.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13757: Fixed an issue where the Master unnecessarily sends a "Framework failed over" message when the scheduler driver retries an initial failover re-registration.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13757/#review25450
-----------------------------------------------------------


Can you tag MESOS-488 instead?


src/master/master.cpp
<https://reviews.apache.org/r/13757/#comment49861>

    A comment here would be nice.
    
    Also, why not just ignore these messages in Master::reregisterFramework()?


- Vinod Kone


On Aug. 23, 2013, 1:15 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13757/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 1:15 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-659
>     https://issues.apache.org/jira/browse/MESOS-659
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-659.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
>   src/tests/fault_tolerance_tests.cpp 10e52c401476eb8416361de49b8e4061bb7ac4f3 
> 
> Diff: https://reviews.apache.org/r/13757/diff/
> 
> 
> Testing
> -------
> 
> Added a test that catches this case when not fixed.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>