You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2016/12/02 17:18:46 UTC

Re: Review Request 53897: Changed how master represents "recovered" frameworks.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/
-----------------------------------------------------------

(Updated Dec. 2, 2016, 5:18 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Tweak comment.


Bugs: MESOS-6419
    https://issues.apache.org/jira/browse/MESOS-6419


Repository: mesos


Description
-------

After master failover, the new master doesn't know which frameworks were
registered with the previous master (because this information is not
currently stored in the registry). In the period after the master fails
over but before the framework scheduler has re-registered, the master
learns about the frameworks in the cluster when agents re-register (an
agent reports the FrameworkInfo for all of the frameworks it is running
when it re-registers).

Such frameworks were previously represented separately from the normal
list of frameworks in the master: the master kept a collection of
`FrameworkInfo` for these "recovered" frameworks.

This commit removes this separate collection of "recovered" frameworks.
Instead, the master now treats recovered frameworks very similarly to
frameworks that are registered but currently disconnected. For example,
recovered frameworks will now have a `Framework` object which tracks the
tasks/executors running under that framework; recovered frameworks will
be reported via the normal "frameworks" key when querying HTTP
endpoints. Similarly, "teardown" operations on recovered frameworks will
now work correctly (MESOS-6419).

This means there is no longer a concept of "orphan tasks" [1]: if the
master knows about a task, the task will be running under a framework
(albeit the framework might be recovered or disconnected). A new
"recovered" key has been added to various HTTP endpoints/APIs to
determine if a framework hasn't yet re-registered after master failover.

[1] The exception here is if the cluster contains Mesos agents older
than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
they re-register.


Diffs (updated)
-----

  include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
  include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp 877ca9010d0d6efc97f3d71fbd27272a255409d0 
  src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
  src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
  src/tests/fault_tolerance_tests.cpp 1a8888de7faee56d394de30b798982dbb6e32f81 
  src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
  src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
  src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
  src/tests/teardown_tests.cpp 0babf8c99f133c3f0dada772bd5cd2601c47a080 

Diff: https://reviews.apache.org/r/53897/diff/


Testing
-------

`make check`


Thanks,

Neil Conway


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review158460
-----------------------------------------------------------


Ship it!





src/tests/fault_tolerance_tests.cpp (line 2245)
<https://reviews.apache.org/r/53897/#comment229237>

    nice test!


- Vinod Kone


On Dec. 7, 2016, 7:11 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 7, 2016, 7:11 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> Another behavioral change is that, previously, a framework
> re-registering after master failover was allowed to make arbitrary
> changes to its `FrameworkInfo`. The master now only applies updates to
> FrameworkInfo fields that can be safely changed (as was already done
> when a framework fails over to a master that hasn't failed over). The
> previous behavior qualifies as a bug.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG fe24e8e4291c76a015e1f344764ed53b653cd8cd 
>   include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
>   include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
>   src/master/master.cpp 67f32229470da4cf7953881d1c5dcb99393002de 
>   src/tests/api_tests.cpp 6cd1f83a47731ee8c4fb1f022bf585b271b3b966 
>   src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/
-----------------------------------------------------------

(Updated Dec. 7, 2016, 7:11 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Added CHANGELOG entry for change to FrameworkInfo behavior.


Bugs: MESOS-6419
    https://issues.apache.org/jira/browse/MESOS-6419


Repository: mesos


Description
-------

After master failover, the new master doesn't know which frameworks were
registered with the previous master (because this information is not
currently stored in the registry). In the period after the master fails
over but before the framework scheduler has re-registered, the master
learns about the frameworks in the cluster when agents re-register (an
agent reports the FrameworkInfo for all of the frameworks it is running
when it re-registers).

Such frameworks were previously represented separately from the normal
list of frameworks in the master: the master kept a collection of
`FrameworkInfo` for these "recovered" frameworks.

This commit removes this separate collection of "recovered" frameworks.
Instead, the master now treats recovered frameworks very similarly to
frameworks that are registered but currently disconnected. For example,
recovered frameworks will now have a `Framework` object which tracks the
tasks/executors running under that framework; recovered frameworks will
be reported via the normal "frameworks" key when querying HTTP
endpoints. Similarly, "teardown" operations on recovered frameworks will
now work correctly (MESOS-6419).

This means there is no longer a concept of "orphan tasks" [1]: if the
master knows about a task, the task will be running under a framework
(albeit the framework might be recovered or disconnected). A new
"recovered" key has been added to various HTTP endpoints/APIs to
determine if a framework hasn't yet re-registered after master failover.

Another behavioral change is that, previously, a framework
re-registering after master failover was allowed to make arbitrary
changes to its `FrameworkInfo`. The master now only applies updates to
FrameworkInfo fields that can be safely changed (as was already done
when a framework fails over to a master that hasn't failed over). The
previous behavior qualifies as a bug.

[1] The exception here is if the cluster contains Mesos agents older
than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
they re-register.


Diffs (updated)
-----

  CHANGELOG fe24e8e4291c76a015e1f344764ed53b653cd8cd 
  include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
  include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
  src/master/master.cpp 67f32229470da4cf7953881d1c5dcb99393002de 
  src/tests/api_tests.cpp 6cd1f83a47731ee8c4fb1f022bf585b271b3b966 
  src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
  src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
  src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
  src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
  src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 

Diff: https://reviews.apache.org/r/53897/diff/


Testing
-------

`make check`


Thanks,

Neil Conway


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/
-----------------------------------------------------------

(Updated Dec. 7, 2016, 5:50 a.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Rebase.


Bugs: MESOS-6419
    https://issues.apache.org/jira/browse/MESOS-6419


Repository: mesos


Description
-------

After master failover, the new master doesn't know which frameworks were
registered with the previous master (because this information is not
currently stored in the registry). In the period after the master fails
over but before the framework scheduler has re-registered, the master
learns about the frameworks in the cluster when agents re-register (an
agent reports the FrameworkInfo for all of the frameworks it is running
when it re-registers).

Such frameworks were previously represented separately from the normal
list of frameworks in the master: the master kept a collection of
`FrameworkInfo` for these "recovered" frameworks.

This commit removes this separate collection of "recovered" frameworks.
Instead, the master now treats recovered frameworks very similarly to
frameworks that are registered but currently disconnected. For example,
recovered frameworks will now have a `Framework` object which tracks the
tasks/executors running under that framework; recovered frameworks will
be reported via the normal "frameworks" key when querying HTTP
endpoints. Similarly, "teardown" operations on recovered frameworks will
now work correctly (MESOS-6419).

This means there is no longer a concept of "orphan tasks" [1]: if the
master knows about a task, the task will be running under a framework
(albeit the framework might be recovered or disconnected). A new
"recovered" key has been added to various HTTP endpoints/APIs to
determine if a framework hasn't yet re-registered after master failover.

Another behavioral change is that, previously, a framework
re-registering after master failover was allowed to make arbitrary
changes to its `FrameworkInfo`. The master now only applies updates to
FrameworkInfo fields that can be safely changed (as was already done
when a framework fails over to a master that hasn't failed over). The
previous behavior qualifies as a bug.

[1] The exception here is if the cluster contains Mesos agents older
than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
they re-register.


Diffs (updated)
-----

  include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
  include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
  src/master/master.cpp 67f32229470da4cf7953881d1c5dcb99393002de 
  src/tests/api_tests.cpp 7c70f284a9225ca23abc7049f48f3efba314b641 
  src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
  src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
  src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
  src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
  src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 

Diff: https://reviews.apache.org/r/53897/diff/


Testing
-------

`make check`


Thanks,

Neil Conway


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/
-----------------------------------------------------------

(Updated Dec. 7, 2016, 4:14 a.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Add additional CHECKs, add new unit test, tweak a few tests.


Bugs: MESOS-6419
    https://issues.apache.org/jira/browse/MESOS-6419


Repository: mesos


Description (updated)
-------

After master failover, the new master doesn't know which frameworks were
registered with the previous master (because this information is not
currently stored in the registry). In the period after the master fails
over but before the framework scheduler has re-registered, the master
learns about the frameworks in the cluster when agents re-register (an
agent reports the FrameworkInfo for all of the frameworks it is running
when it re-registers).

Such frameworks were previously represented separately from the normal
list of frameworks in the master: the master kept a collection of
`FrameworkInfo` for these "recovered" frameworks.

This commit removes this separate collection of "recovered" frameworks.
Instead, the master now treats recovered frameworks very similarly to
frameworks that are registered but currently disconnected. For example,
recovered frameworks will now have a `Framework` object which tracks the
tasks/executors running under that framework; recovered frameworks will
be reported via the normal "frameworks" key when querying HTTP
endpoints. Similarly, "teardown" operations on recovered frameworks will
now work correctly (MESOS-6419).

This means there is no longer a concept of "orphan tasks" [1]: if the
master knows about a task, the task will be running under a framework
(albeit the framework might be recovered or disconnected). A new
"recovered" key has been added to various HTTP endpoints/APIs to
determine if a framework hasn't yet re-registered after master failover.

Another behavioral change is that, previously, a framework
re-registering after master failover was allowed to make arbitrary
changes to its `FrameworkInfo`. The master now only applies updates to
FrameworkInfo fields that can be safely changed (as was already done
when a framework fails over to a master that hasn't failed over). The
previous behavior qualifies as a bug.

[1] The exception here is if the cluster contains Mesos agents older
than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
they re-register.


Diffs (updated)
-----

  include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
  include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
  src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 
  src/tests/api_tests.cpp 7c70f284a9225ca23abc7049f48f3efba314b641 
  src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
  src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
  src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
  src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
  src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 

Diff: https://reviews.apache.org/r/53897/diff/


Testing
-------

`make check`


Thanks,

Neil Conway


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/
-----------------------------------------------------------

(Updated Dec. 4, 2016, 2:06 a.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Address some review comments.


Bugs: MESOS-6419
    https://issues.apache.org/jira/browse/MESOS-6419


Repository: mesos


Description
-------

After master failover, the new master doesn't know which frameworks were
registered with the previous master (because this information is not
currently stored in the registry). In the period after the master fails
over but before the framework scheduler has re-registered, the master
learns about the frameworks in the cluster when agents re-register (an
agent reports the FrameworkInfo for all of the frameworks it is running
when it re-registers).

Such frameworks were previously represented separately from the normal
list of frameworks in the master: the master kept a collection of
`FrameworkInfo` for these "recovered" frameworks.

This commit removes this separate collection of "recovered" frameworks.
Instead, the master now treats recovered frameworks very similarly to
frameworks that are registered but currently disconnected. For example,
recovered frameworks will now have a `Framework` object which tracks the
tasks/executors running under that framework; recovered frameworks will
be reported via the normal "frameworks" key when querying HTTP
endpoints. Similarly, "teardown" operations on recovered frameworks will
now work correctly (MESOS-6419).

This means there is no longer a concept of "orphan tasks" [1]: if the
master knows about a task, the task will be running under a framework
(albeit the framework might be recovered or disconnected). A new
"recovered" key has been added to various HTTP endpoints/APIs to
determine if a framework hasn't yet re-registered after master failover.

[1] The exception here is if the cluster contains Mesos agents older
than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
they re-register.


Diffs (updated)
-----

  include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
  include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
  src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 
  src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
  src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
  src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
  src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
  src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
  src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 

Diff: https://reviews.apache.org/r/53897/diff/


Testing
-------

`make check`


Thanks,

Neil Conway


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.

> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 7132-7137
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7132>
> >
> >     hmm. it's a bit weird that activating a framework also updates it. this should be done by the callers before calling this method.
> 
> Neil Conway wrote:
>     Hmmm, why is this weird? To me it seems reasonable that when we activate a recovered framework, we're given the `FrameworkInfo` that was presented by the framework on re-registration; we use that `FrameworkInfo` to update whatever `FrameworkInfo` we previously had for the recovered framework.
>     
>     If we move this outside of `activateRecoveredFramework`, we'd need identical code in both of its call-sites -- that's not too bad, but I'm not sure why it is an improvement.
> 
> Vinod Kone wrote:
>     i meant the name "activateRecoveredFramework" doesn't seemt to suggest that it also "updates" in addition to "activate"? usually they are done by different functions (e.g., Allocator::activateFramework(), Allocator::updateFramework(), Allocator::activateSlave(), Allocator::updateSlave). when i think of activating a recovered framework, my intuition is that it goes from RECOVERED to ACTIVATED state in the master and maybe gets activated in the allocator (basically things that need to be done to get a framework activated).
>     
>     More importanly, i realized that previously a framework could update its FrameworkInfo willy nilly after a master failover, but now it cannot because `updateFrameworkInfo` doesn't allow updates to all fields. That's probably ok but maybe worth calling out?

I added a note about the behavioral change to the commit message. I also added a test to validate this behavior.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------


On Dec. 4, 2016, 2:06 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 4, 2016, 2:06 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
>   include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
>   src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.

> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 5602
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line5602>
> >
> >     shouldn't we send a ShutdownFrameworkMessage in this case?

This is implemented in a later review in this chain, https://reviews.apache.org/r/54232/


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 5578-5580
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line5578>
> >
> >     yea, i think this should be
> >     
> >     ```
> >     if (framework != nullptr && framework->connected()) {
> >     
> >     }
> >     ```

I did this in a separate review, https://reviews.apache.org/r/54380/


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 7132-7137
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7132>
> >
> >     hmm. it's a bit weird that activating a framework also updates it. this should be done by the callers before calling this method.

Hmmm, why is this weird? To me it seems reasonable that when we activate a recovered framework, we're given the `FrameworkInfo` that was presented by the framework on re-registration; we use that `FrameworkInfo` to update whatever `FrameworkInfo` we previously had for the recovered framework.

If we move this outside of `activateRecoveredFramework`, we'd need identical code in both of its call-sites -- that's not too bad, but I'm not sure why it is an improvement.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------


On Dec. 4, 2016, 2:06 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 4, 2016, 2:06 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
>   include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
>   src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Vinod Kone <vi...@gmail.com>.

> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 7124
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7124>
> >
> >     CHECK_NOTNULL(framework);
> 
> Neil Conway wrote:
>     Is there a general rule for when to add `CHECK_NOTNULL` for internal functions that take pointer parameters? `Master::failoverFramework`, `Master::_failoverFramework`, `Master::_exited(Framework*)`, `Master::drop`, and `Master::authorizeTask` could all have `CHECK_NOTNULL`s added, for example.

i think we should do CHEcK_NOTNULL() for all methods that take raw pointers. the above ones you cited seem to have missed them.


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 7132-7137
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7132>
> >
> >     hmm. it's a bit weird that activating a framework also updates it. this should be done by the callers before calling this method.
> 
> Neil Conway wrote:
>     Hmmm, why is this weird? To me it seems reasonable that when we activate a recovered framework, we're given the `FrameworkInfo` that was presented by the framework on re-registration; we use that `FrameworkInfo` to update whatever `FrameworkInfo` we previously had for the recovered framework.
>     
>     If we move this outside of `activateRecoveredFramework`, we'd need identical code in both of its call-sites -- that's not too bad, but I'm not sure why it is an improvement.

i meant the name "activateRecoveredFramework" doesn't seemt to suggest that it also "updates" in addition to "activate"? usually they are done by different functions (e.g., Allocator::activateFramework(), Allocator::updateFramework(), Allocator::activateSlave(), Allocator::updateSlave). when i think of activating a recovered framework, my intuition is that it goes from RECOVERED to ACTIVATED state in the master and maybe gets activated in the allocator (basically things that need to be done to get a framework activated).

More importanly, i realized that previously a framework could update its FrameworkInfo willy nilly after a master failover, but now it cannot because `updateFrameworkInfo` doesn't allow updates to all fields. That's probably ok but maybe worth calling out?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------


On Dec. 4, 2016, 2:06 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 4, 2016, 2:06 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
>   include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
>   src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.

> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 2478
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line2478>
> >
> >     Can you add a comment on when we are here. It is not very obvious. Maybe copy paste the comment from #2708?

I think the "it is now safe" comment isn't very helpful/accurate. I replaced this with:

```
    // The framework has previously been registered with this master;
    // it may or may not currently be connected.
```

Which matches a similar comment in the driver-based scheduler code path.


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 7124
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7124>
> >
> >     CHECK_NOTNULL(framework);

Is there a general rule for when to add `CHECK_NOTNULL` for internal functions that take pointer parameters? `Master::failoverFramework`, `Master::_failoverFramework`, `Master::_exited(Framework*)`, `Master::drop`, and `Master::authorizeTask` could all have `CHECK_NOTNULL`s added, for example.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------


On Dec. 2, 2016, 8:56 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 2, 2016, 8:56 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
>   include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp 5114e9d21929157e09a4a9fc4dbfeca260fb6b45 
>   src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 1a8888de7faee56d394de30b798982dbb6e32f81 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 0babf8c99f133c3f0dada772bd5cd2601c47a080 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------




src/master/master.hpp (line 672)
<https://reviews.apache.org/r/53897/#comment228472>

    s/newPid/pid/



src/master/master.cpp (line 1277)
<https://reviews.apache.org/r/53897/#comment228477>

    s/frameworks/framework's/



src/master/master.cpp (line 2474)
<https://reviews.apache.org/r/53897/#comment228486>

    i think this would be better as:
    
    ```
    if (framework == nullptr) {
      //
      recoverFramework(frameworkInfo);
      
      framework = getFramework(frameworkInfo.id());
    }
    
    CHECK_NOTNULL(framework);
    if (!framework->recovered())
    ```



src/master/master.cpp (line 2478)
<https://reviews.apache.org/r/53897/#comment228483>

    Can you add a comment on when we are here. It is not very obvious. Maybe copy paste the comment from #2708?



src/master/master.cpp (lines 2707 - 2714)
<https://reviews.apache.org/r/53897/#comment228487>

    see comments above.



src/master/master.cpp (line 2714)
<https://reviews.apache.org/r/53897/#comment228485>

    move this outside the if statement.



src/master/master.cpp (lines 5488 - 5490)
<https://reviews.apache.org/r/53897/#comment228490>

    yea, i think this should be
    
    ```
    if (framework != nullptr && framework->connected()) {
    
    }
    ```



src/master/master.cpp (line 5512)
<https://reviews.apache.org/r/53897/#comment228491>

    shouldn't we send a ShutdownFrameworkMessage in this case?



src/master/master.cpp (lines 5854 - 5857)
<https://reviews.apache.org/r/53897/#comment228492>

    how about
    
    ```
    string status = framework == nullptr ? "unknown" : "disconnected";
    ```



src/master/master.cpp (lines 6046 - 6049)
<https://reviews.apache.org/r/53897/#comment228493>

    see above.



src/master/master.cpp (line 7021)
<https://reviews.apache.org/r/53897/#comment228494>

    CHECK_NOTNULL(framework);



src/master/master.cpp (line 7028)
<https://reviews.apache.org/r/53897/#comment228502>

    also want to check framework->pid and framework->http are None()?



src/master/master.cpp (lines 7029 - 7034)
<https://reviews.apache.org/r/53897/#comment228498>

    hmm. it's a bit weird that activating a framework also updates it. this should be done by the callers before calling this method.



src/master/master.cpp (line 7087)
<https://reviews.apache.org/r/53897/#comment228505>

    do we have to send FrameworkReregisteredMessage for HTTP? IIRC Registered or Reregistered both end up as `Subscribed` Event.
    
    if no need to make the distinction, can we just call `_failoverFramework` here to avoid duplicating a bunch of code?



src/master/master.cpp (line 7095)
<https://reviews.apache.org/r/53897/#comment228506>

    given the above comments, do we still want to distinguish between DISCONNECTED and RECOVERED states? if we can get away with it, that would be conceptually easy to understand.


- Vinod Kone


On Dec. 2, 2016, 8:56 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 2, 2016, 8:56 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
>   include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp 5114e9d21929157e09a4a9fc4dbfeca260fb6b45 
>   src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 1a8888de7faee56d394de30b798982dbb6e32f81 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 0babf8c99f133c3f0dada772bd5cd2601c47a080 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Re: Review Request 53897: Changed how master represents "recovered" frameworks.

Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/
-----------------------------------------------------------

(Updated Dec. 2, 2016, 8:56 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Tweak comment.


Bugs: MESOS-6419
    https://issues.apache.org/jira/browse/MESOS-6419


Repository: mesos


Description
-------

After master failover, the new master doesn't know which frameworks were
registered with the previous master (because this information is not
currently stored in the registry). In the period after the master fails
over but before the framework scheduler has re-registered, the master
learns about the frameworks in the cluster when agents re-register (an
agent reports the FrameworkInfo for all of the frameworks it is running
when it re-registers).

Such frameworks were previously represented separately from the normal
list of frameworks in the master: the master kept a collection of
`FrameworkInfo` for these "recovered" frameworks.

This commit removes this separate collection of "recovered" frameworks.
Instead, the master now treats recovered frameworks very similarly to
frameworks that are registered but currently disconnected. For example,
recovered frameworks will now have a `Framework` object which tracks the
tasks/executors running under that framework; recovered frameworks will
be reported via the normal "frameworks" key when querying HTTP
endpoints. Similarly, "teardown" operations on recovered frameworks will
now work correctly (MESOS-6419).

This means there is no longer a concept of "orphan tasks" [1]: if the
master knows about a task, the task will be running under a framework
(albeit the framework might be recovered or disconnected). A new
"recovered" key has been added to various HTTP endpoints/APIs to
determine if a framework hasn't yet re-registered after master failover.

[1] The exception here is if the cluster contains Mesos agents older
than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
they re-register.


Diffs (updated)
-----

  include/mesos/master/master.proto 3553c683c17004ac1831ec90271aa8584c950e53 
  include/mesos/v1/master/master.proto 022b491b7d5c49c5aeddf4ffc97c148f55629c95 
  src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
  src/master/master.hpp 5114e9d21929157e09a4a9fc4dbfeca260fb6b45 
  src/master/master.cpp e03a2e8025943825a2902102c43dc0eb66bacb6a 
  src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
  src/tests/fault_tolerance_tests.cpp 1a8888de7faee56d394de30b798982dbb6e32f81 
  src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
  src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
  src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
  src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
  src/tests/teardown_tests.cpp 0babf8c99f133c3f0dada772bd5cd2601c47a080 

Diff: https://reviews.apache.org/r/53897/diff/


Testing
-------

`make check`


Thanks,

Neil Conway