Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2013/09/16 20:38:20 UTC
Re: Review Request 13744: Fixed a case where Framework re-registration time
was not being updated.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review26149
-----------------------------------------------------------
Ship it!
Consider writing a test in this review. If you would like to punt, please create a ticket for the test to keep track of it.
- Vinod Kone
On Aug. 23, 2013, 7:22 p.m., Ben Mahler wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
>
> (Updated Aug. 23, 2013, 7:22 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Vinod Kone.
>
>
> Bugs: MESOS-658
> https://issues.apache.org/jira/browse/MESOS-658
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> This splits https://reviews.apache.org/r/13699/ (which already has ship-its) into two commits.
>
> There was a case during re-registration where the re-registered time was not being set.
>
> This can cause a serious issue when the following occurs:
> - Scheduler disconnects from the master; Master::exited(UPID) sets framework->active = false.
> - Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
> - Now the failoverFramework timeout is set up in the Master.
> - Scheduler disconnects again from the master; Master::exited(UPID) sets active=false once again.
> - The original failoverFramework timeout fires and compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
>
> I'll file a bug for this and add it here.
>
>
> Diffs
> -----
>
> src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d
> src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453
>
> Diff: https://reviews.apache.org/r/13744/diff/
>
>
> Testing
> -------
>
> make check. I'll look into adding a test that exposes this issue.
>
>
> Thanks,
>
> Ben Mahler
>
>
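For readers following the sequence in the description above, here is a minimal sketch of the race in plain C++. It is a toy model, not the Mesos code: ToyMaster, the integer logical clock, the explicit timeout queue, and the updateTimeAlways flag are all hypothetical names introduced for illustration. It also assumes that the failover timeout is queued when the scheduler exits and carries the re-registration time seen at that moment, and that the timeout shuts the framework down only if the framework is inactive and that time is unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, simplified stand-in for the master's per-framework state.
struct Framework {
  bool active = true;
  bool shutDown = false;
  int reregisteredTime = 0;  // Logical tick of the last recorded re-registration.
};

// Toy model of the master; NOT the real Mesos Master class.
class ToyMaster {
public:
  // updateTimeAlways == false models the behavior before the fix,
  // updateTimeAlways == true models the behavior after it.
  explicit ToyMaster(bool updateTimeAlways)
    : updateTimeAlways_(updateTimeAlways) {}

  // Models Master::exited(UPID): deactivate the framework and queue a
  // failover timeout that captures the re-registration time seen right now.
  void exited() {
    framework_.active = false;
    const int observed = framework_.reregisteredTime;
    timeouts_.push_back([this, observed]() { failoverTimeout(observed); });
  }

  // Models re-registration. Without the fix, the time is recorded only
  // when failover is true.
  void reregister(bool failover) {
    ++clock_;
    framework_.active = true;
    if (failover || updateTimeAlways_) {
      framework_.reregisteredTime = clock_;
    }
  }

  // Fires the oldest pending timeout.
  void fireNextTimeout() {
    assert(next_ < timeouts_.size());
    timeouts_[next_++]();
  }

  bool frameworkShutDown() const { return framework_.shutDown; }

private:
  // Shut down only if the framework is still inactive and has not
  // recorded a re-registration since this timeout was queued.
  void failoverTimeout(int observed) {
    if (!framework_.active && framework_.reregisteredTime == observed) {
      framework_.shutDown = true;
    }
  }

  Framework framework_;
  std::vector<std::function<void()>> timeouts_;
  std::size_t next_ = 0;
  int clock_ = 0;
  bool updateTimeAlways_;
};

// Replays the sequence from the description and reports whether the
// first (stale) timeout shut the framework down.
bool staleTimeoutShutsDownFramework(bool updateTimeAlways) {
  ToyMaster master(updateTimeAlways);
  master.exited();           // First disconnect queues timeout #1.
  master.reregister(false);  // Re-register with failover=false.
  master.exited();           // Second disconnect queues timeout #2.
  master.fireNextTimeout();  // Timeout #1 fires while the framework is inactive.
  return master.frameworkShutDown();
}
```

In this model, staleTimeoutShutsDownFramework(false) returns true (the stale timeout kills the framework), while staleTimeoutShutsDownFramework(true) returns false because the recorded time no longer matches. The sketch calls exited() directly, which is exactly the step that is hard to trigger against the real master in a test.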
Re: Review Request 13744: Fixed a case where Framework re-registration time
was not being updated.
Posted by Ben Mahler <be...@gmail.com>.
> On Sept. 16, 2013, 6:38 p.m., Vinod Kone wrote:
> > Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.
I realized I could not spoof exited events, so I was not able to create a test to trigger this. Any tips?
- Ben
Re: Review Request 13744: Fixed a case where Framework re-registration time
was not being updated.
Posted by Ben Mahler <be...@gmail.com>.
> On Sept. 16, 2013, 6:38 p.m., Vinod Kone wrote:
> > Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.
>
> Ben Mahler wrote:
> I realized I could not spoof exited events so I was not able to create a test to trigger this, any tips?
>
> Vinod Kone wrote:
> Can't you manually bring down and bring up the scheduler to test the scenario you described?
Discussed with Vinod offline: it's quite difficult to test this scenario without a way to break links between Processes in tests.
- Ben
Re: Review Request 13744: Fixed a case where Framework re-registration time
was not being updated.
Posted by Vinod Kone <vi...@gmail.com>.
> On Sept. 16, 2013, 6:38 p.m., Vinod Kone wrote:
> > Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.
>
> Ben Mahler wrote:
> I realized I could not spoof exited events so I was not able to create a test to trigger this, any tips?
Can't you manually bring down and bring up the scheduler to test the scenario you described?
- Vinod