You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/08/22 20:37:22 UTC

Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Repository: mesos-git


Description
-------

This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.

There was a case during re-registration where the re-registered time was not being set.

This can cause a serious issue when the following occurs:
 -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
 -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
 -Now the failoverFramework timeout is setup in the Master.
 -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
 -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!

I'll file a bug for this and add it here.


Diffs
-----

  src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
  src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
  src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 

Diff: https://reviews.apache.org/r/13744/diff/


Testing
-------

make check, I'll look into adding a test that exposed this issue.


Thanks,

Ben Mahler


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/
-----------------------------------------------------------

(Updated Oct. 4, 2013, 6:34 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Updated a comment.


Bugs: MESOS-658
    https://issues.apache.org/jira/browse/MESOS-658


Repository: mesos-git


Description
-------

This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.

There was a case during re-registration where the re-registered time was not being set.

This can cause a serious issue when the following occurs:
 -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
 -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
 -Now the failoverFramework timeout is setup in the Master.
 -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
 -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!

I'll file a bug for this and add it here.


Diffs (updated)
-----

  src/master/master.hpp 0aeec7fc540d44c03c1171f31a7281a4b0055925 
  src/master/master.cpp ce8365f082a5f96ef64e33e526cb5047dff52127 

Diff: https://reviews.apache.org/r/13744/diff/


Testing
-------

make check, I'll look into adding a test that exposed this issue.


Thanks,

Ben Mahler


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.

> On Sept. 16, 2013, 6:38 p.m., Vinod Kone wrote:
> > Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.

I realized I could not spoof exited events so I was not able to create a test to trigger this, any tips?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review26149
-----------------------------------------------------------


On Aug. 23, 2013, 7:22 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 7:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.

> On Sept. 16, 2013, 6:38 p.m., Vinod Kone wrote:
> > Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.
> 
> Ben Mahler wrote:
>     I realized I could not spoof exited events so I was not able to create a test to trigger this, any tips?
> 
> Vinod Kone wrote:
>     Can't you manually bring down and bring up the scheduler to test the scenario you described?

Discussed with Vinod offline, it's quite difficult to test this scenario without having a way to break links between Processes in tests.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review26149
-----------------------------------------------------------


On Aug. 23, 2013, 7:22 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 7:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Vinod Kone <vi...@gmail.com>.

> On Sept. 16, 2013, 6:38 p.m., Vinod Kone wrote:
> > Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.
> 
> Ben Mahler wrote:
>     I realized I could not spoof exited events so I was not able to create a test to trigger this, any tips?

Can't you manually bring down and bring up the scheduler to test the scenario you described?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review26149
-----------------------------------------------------------


On Aug. 23, 2013, 7:22 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 7:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review26149
-----------------------------------------------------------

Ship it!


Consider writing a test in this review. If you would like to punt please create a ticket for the test to keep track.

- Vinod Kone


On Aug. 23, 2013, 7:22 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2013, 7:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/
-----------------------------------------------------------

(Updated Aug. 23, 2013, 7:22 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

I've simplified this change by only including a fix of MESOS-658, removed the refactoring of Framework::reregisteredTime to an Option<Time>.


Bugs: MESOS-658
    https://issues.apache.org/jira/browse/MESOS-658


Repository: mesos-git


Description
-------

This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.

There was a case during re-registration where the re-registered time was not being set.

This can cause a serious issue when the following occurs:
 -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
 -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
 -Now the failoverFramework timeout is setup in the Master.
 -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
 -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!

I'll file a bug for this and add it here.


Diffs (updated)
-----

  src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
  src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 

Diff: https://reviews.apache.org/r/13744/diff/


Testing
-------

make check, I'll look into adding a test that exposed this issue.


Thanks,

Ben Mahler


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25476
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/13744/#comment49906>

    We actually have to check the framework->registeredTime here as well in case the framework went away and then registered fresh once again. In this case, the framework->reregisteredTime would potentially not be set.
    
    Leaving this as a note so I can mark it as fixed when I update this.
    
    Clearly, a UUID is a saner way to check this, do we want to do this now?


- Ben Mahler


On Aug. 22, 2013, 10:25 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 10:25 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/
-----------------------------------------------------------

(Updated Aug. 22, 2013, 10:25 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Bugs: MESOS-658
    https://issues.apache.org/jira/browse/MESOS-658


Repository: mesos-git


Description
-------

This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.

There was a case during re-registration where the re-registered time was not being set.

This can cause a serious issue when the following occurs:
 -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
 -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
 -Now the failoverFramework timeout is setup in the Master.
 -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
 -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!

I'll file a bug for this and add it here.


Diffs
-----

  src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
  src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
  src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 

Diff: https://reviews.apache.org/r/13744/diff/


Testing
-------

make check, I'll look into adding a test that exposed this issue.


Thanks,

Ben Mahler


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Brenden Matthews <br...@diddyinc.com>.

> On Aug. 22, 2013, 7:39 p.m., Brenden Matthews wrote:
> > This looks good.  I wonder if it's related to a bug I'm seeing where a framework is marked as 'terminated' even though it's not (according to the web UI)?  I keep seeing it with storm (though I have not yet debugged it).
> 
> Ben Mahler wrote:
>     Quite possibly! Did you see the storm framework get shut down on all the slaves? Do you know what the failover_timeout is inside Storm's FrameworkInfo?
> 
> Brenden Matthews wrote:
>     Unfortunately I do not.  It's not in the logs, either.
> 
> Ben Mahler wrote:
>     Is it hard-coded in the source? Or using the default (0.0)? If using 0.0, it looks like we'll immediately remove it, allowing no time to fail over.

It looks like the default is 3600 (1h).


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25425
-----------------------------------------------------------


On Aug. 22, 2013, 10:25 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 10:25 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.

> On Aug. 22, 2013, 7:39 p.m., Brenden Matthews wrote:
> > This looks good.  I wonder if it's related to a bug I'm seeing where a framework is marked as 'terminated' even though it's not (according to the web UI)?  I keep seeing it with storm (though I have not yet debugged it).

Quite possibly! Did you see the storm framework get shut down on all the slaves? Do you know what the failover_timeout is inside Storm's FrameworkInfo?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25425
-----------------------------------------------------------


On Aug. 22, 2013, 6:37 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 6:37 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Brenden Matthews <br...@diddyinc.com>.

> On Aug. 22, 2013, 7:39 p.m., Brenden Matthews wrote:
> > This looks good.  I wonder if it's related to a bug I'm seeing where a framework is marked as 'terminated' even though it's not (according to the web UI)?  I keep seeing it with storm (though I have not yet debugged it).
> 
> Ben Mahler wrote:
>     Quite possibly! Did you see the storm framework get shut down on all the slaves? Do you know what the failover_timeout is inside Storm's FrameworkInfo?

Unfortunately I do not.  It's not in the logs, either.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25425
-----------------------------------------------------------


On Aug. 22, 2013, 10:25 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 10:25 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Ben Mahler <be...@gmail.com>.

> On Aug. 22, 2013, 7:39 p.m., Brenden Matthews wrote:
> > This looks good.  I wonder if it's related to a bug I'm seeing where a framework is marked as 'terminated' even though it's not (according to the web UI)?  I keep seeing it with storm (though I have not yet debugged it).
> 
> Ben Mahler wrote:
>     Quite possibly! Did you see the storm framework get shut down on all the slaves? Do you know what the failover_timeout is inside Storm's FrameworkInfo?
> 
> Brenden Matthews wrote:
>     Unfortunately I do not.  It's not in the logs, either.

Is it hard-coded in the source? Or using the default (0.0)? If using 0.0, it looks like we'll immediately remove it, allowing no time to fail over.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25425
-----------------------------------------------------------


On Aug. 22, 2013, 10:25 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 10:25 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.

Posted by Brenden Matthews <br...@diddyinc.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25425
-----------------------------------------------------------

Ship it!


This looks good.  I wonder if it's related to a bug I'm seeing where a framework is marked as 'terminated' even though it's not (according to the web UI)?  I keep seeing it with storm (though I have not yet debugged it).

- Brenden Matthews


On Aug. 22, 2013, 6:37 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 6:37 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This is a split up of https://reviews.apache.org/r/13699/ (has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  -Scheduler disconnects from the master, Master::exited(UPID) sets framework->active = false.
>  -Scheduler re-registers with ReregisterFrameworkMessage::failover=false. Currently, the master does _not_ update the re-registration time in this case!
>  -Now the failoverFramework timeout is setup in the Master.
>  -Scheduler disconnects again from the master, Master::exited(UPID) sets active=false once again.
>  -The original failoverFramework timeout fires, compares Framework->reregisteredTime. Since it has not been updated, the master proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check, I'll look into adding a test that exposed this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>