Posted to reviews@mesos.apache.org by Andrew Schwartzmeyer <an...@schwartzmeyer.com> on 2018/02/01 19:57:29 UTC

Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/
-----------------------------------------------------------

Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.


Repository: mesos


Description
-------

The Windows OS deletes the job object created in the agent process when
the agent dies, because no other process holds a handle to it (despite
processes being assigned to the job object). While this is
counter-intuitive, it is the observed behavior. So in order for recovery
to succeed, the containerizer must also hold an otherwise unused handle
to its job object to keep it alive in the kernel, and available for
recovery to find.
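
The lifetime rule described above can be sketched with a toy model. This is not Windows API code: `JobObject`, `ObjectNamespace`, `recoverable`, and the name `mesos_job_1234` are all hypothetical, and a `std::shared_ptr` stands in for an open handle. The model assumes, as observed above, that the object is destroyed once its last handle is closed, regardless of which processes are assigned to it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Toy stand-in for a kernel object such as a job object (hypothetical).
struct JobObject
{
  std::string name;
};

// An open "handle" is modeled as one shared_ptr reference.
using Handle = std::shared_ptr<JobObject>;

// Toy object namespace. It only observes objects through weak_ptr, so it
// never keeps an object alive by itself.
class ObjectNamespace
{
public:
  Handle create(const std::string& name)
  {
    Handle handle = std::make_shared<JobObject>(JobObject{name});
    objects[name] = handle;
    return handle;
  }

  // Returns an empty handle if the object no longer exists.
  Handle open(const std::string& name) const
  {
    auto it = objects.find(name);
    return it == objects.end() ? Handle() : it->second.lock();
  }

private:
  std::map<std::string, std::weak_ptr<JobObject>> objects;
};

// Returns true if "recovery" can still find the job object after the
// agent's handle has been closed.
bool recoverable(bool containerizerHoldsHandle)
{
  ObjectNamespace kernel;

  // The agent creates the job object and holds the first handle.
  Handle agent = kernel.create("mesos_job_1234");
  assert(agent != nullptr);

  // Optionally, the containerizer opens its own, otherwise unused, handle.
  Handle containerizer;
  if (containerizerHoldsHandle) {
    containerizer = kernel.open("mesos_job_1234");
  }

  // The agent dies, which closes its handle.
  agent.reset();

  // A restarted agent tries to find the job object by name.
  return kernel.open("mesos_job_1234") != nullptr;
}
```

In this model `recoverable(false)` fails to find the object while `recoverable(true)` succeeds, which mirrors the behavior the patch relies on.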


Diffs
-----

  src/slave/containerizer/mesos/main.cpp a53ccd68bf975d919f9d1f920cf3fa74d4e43f24 


Diff: https://reviews.apache.org/r/65465/diff/1/


Testing
-------


Thanks,

Andrew Schwartzmeyer


Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.

> On Feb. 1, 2018, 2:32 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/main.cpp
> > Lines 40-50 (patched)
> > <https://reviews.apache.org/r/65465/diff/1/?file=1951378#file1951378line40>
> >
> >     Flying by. Why is this logic not in launch.cpp? It sounds to me like it's unrelated to, for example, Mount below.

Where in `launch.cpp` would you put it? The handle needs to exist for exactly as long as the process exists (or as close to that as we can get, and putting it here gets very close).


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196662
-----------------------------------------------------------


On Feb. 1, 2018, 11:57 a.m., Andrew Schwartzmeyer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65465/
> -----------------------------------------------------------
> 
> (Updated Feb. 1, 2018, 11:57 a.m.)
> 
> 
> Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.
> 
> 
> Bugs: MESOS-8519
>     https://issues.apache.org/jira/browse/MESOS-8519
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The Windows OS deletes the job object created in the agent process when
> the agent dies, because no other process holds a handle to it (despite
> processes being assigned to the job object). While this is
> counter-intuitive, it is the observed behavior. So in order for recovery
> to succeed, the containerizer must also hold an otherwise unused handle
> to its job object to keep it alive in the kernel, and available for
> recovery to find.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/main.cpp a53ccd68bf975d919f9d1f920cf3fa74d4e43f24 
> 
> 
> Diff: https://reviews.apache.org/r/65465/diff/1/
> 
> 
> Testing
> -------
> 
> ```
> [----------] Global test environment tear-down
> [==========] 874 tests from 85 test cases ran. (253311 ms total)
> [  PASSED  ] 874 tests.
> 
> I0201 12:46:58.159368  3116 slave.cpp:6921] Recovering framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.159368  3116 slave.cpp:8543] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.162847  9456 task_status_update_manager.cpp:207] Recovering task status update manager
> I0201 12:46:58.162847  9456 task_status_update_manager.cpp:215] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.166851  7344 containerizer.cpp:674] Recovering containerizer
> I0201 12:46:58.167351  7344 containerizer.cpp:731] Recovering container 69cefa53-61e0-444b-a808-e38ffb4cb18f for executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.183379 17088 provisioner.cpp:493] Provisioner recovery complete
> I0201 12:46:58.186367 16792 slave.cpp:6695] Sending reconnect request to executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:52591
> I0201 12:46:58.194370  7344 slave.cpp:4519] Received re-registration message from executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:47:00.193958 16792 slave.cpp:4737] Cleaning up un-reregistered executors
> I0201 12:47:00.193958 16792 slave.cpp:6824] Finished recovery
> I0201 12:47:00.200943  9456 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0201 12:47:00.200943  3116 slave.cpp:1146] New master detected at master@10.123.6.78:5050
> I0201 12:47:00.200943  3116 slave.cpp:1190] No credentials provided. Attempting to register without authentication
> I0201 12:47:00.200943  3116 slave.cpp:1201] Detecting new master
> I0201 12:47:00.214944 16792 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
> I0201 12:47:00.214944 13180 task_status_update_manager.cpp:188] Resuming sending task status updates
> I0201 12:47:00.215942 16792 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid" {"value":"jLIL1d\/PQnuwmFxpMf8CLQ=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4S3"},"update_oversubscribed_resources":true}
> I0201 12:47:00.219952  3116 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
> I0201 12:47:00.233942  7344 task_status_update_manager.cpp:188] Resuming sending task status updates
> ```
> 
> 
> Thanks,
> 
> Andrew Schwartzmeyer
> 
>


Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Jie Yu <yu...@gmail.com>.

> On Feb. 1, 2018, 10:32 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/main.cpp
> > Lines 40-50 (patched)
> > <https://reviews.apache.org/r/65465/diff/1/?file=1951378#file1951378line40>
> >
> >     Flying by. Why is this logic not in launch.cpp? It sounds to me like it's unrelated to, for example, Mount below.
> 
> Andrew Schwartzmeyer wrote:
>     Where in `launch.cpp` would you put it? The handle needs to exist for exactly as long as the process exists (or as close to that as we can get, and putting it here gets very close).

Well, I don't think putting it here versus in launch.cpp makes any noticeable difference in terms of "closeness" (probably a dozen instructions?).

My question is: is this logic related only to the launch of a container? If so, it should be moved to launch.cpp (i.e., `MesosContainerizerLaunch::execute()`).


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196662
-----------------------------------------------------------




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Joseph Wu <jo...@mesosphere.io>.

> On Feb. 1, 2018, 2:32 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/main.cpp
> > Lines 40-50 (patched)
> > <https://reviews.apache.org/r/65465/diff/1/?file=1951378#file1951378line40>
> >
> >     Flying by. Why is this logic not in launch.cpp? It sounds to me like it's unrelated to, for example, Mount below.
> 
> Andrew Schwartzmeyer wrote:
>     Where in `launch.cpp` would you put it? The handle needs to exist for exactly as long as the process exists (or as close to that as we can get, and putting it here gets very close).
> 
> Jie Yu wrote:
>     Well, I don't think putting it here versus in launch.cpp makes any noticeable difference in terms of "closeness" (probably a dozen instructions?).
>     
>     My question is: is this logic related only to the launch of a container? If so, it should be moved to launch.cpp (i.e., `MesosContainerizerLaunch::execute()`).

This is not exactly related to launching a Windows job object; the code is being added so that the job object can be recovered later.

Having it in `main.cpp` vs. `launch.cpp` doesn't make much difference to the lifetime of the handle (as `main.cpp` calls methods in `launch.cpp`), but it is safer to do this in `launch.cpp`, as `main.cpp` contains up to 3 subcommands (only one of which is long-lived).
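
The point about subcommands can be illustrated with a small sketch. The names `ProcessState`, `launch`, `mount`, and `dispatch` are hypothetical and do not mirror the real `main.cpp`; only two subcommands are shown for brevity. The sketch assumes a single binary that dispatches to several subcommands, of which only the long-lived launch one should take the keep-alive handle.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical record of whether this process took the keep-alive handle.
struct ProcessState
{
  bool holdsJobHandle = false;
};

// Long-lived subcommand: the only place the handle is acquired.
int launch(ProcessState& state)
{
  state.holdsJobHandle = true;
  return 0;
}

// Short-lived subcommand: has no reason to touch the job object.
int mount(ProcessState&)
{
  return 0;
}

// Stand-in for the dispatch done in a main(). If the handle were taken
// here instead, every subcommand would acquire it, not just launch.
int dispatch(const std::string& subcommand, ProcessState& state)
{
  const std::map<std::string, std::function<int(ProcessState&)>> subcommands = {
    {"launch", launch},
    {"mount", mount},
  };

  auto it = subcommands.find(subcommand);
  return it == subcommands.end() ? 1 : it->second(state);
}
```

Keeping the acquisition inside the launch path means the short-lived subcommands never hold a reference to a job object they do not need.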


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196662
-----------------------------------------------------------




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196662
-----------------------------------------------------------




src/slave/containerizer/mesos/main.cpp
Lines 40-50 (patched)
<https://reviews.apache.org/r/65465/#comment276403>

    Flying by. Why is this logic not in launch.cpp? It sounds to me like it's unrelated to, for example, Mount below.


- Jie Yu




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196652
-----------------------------------------------------------



PASS: Mesos patch 65465 was successfully built and tested.

Reviews applied: `['65397', '65398', '65399', '65400', '65401', '65402', '65403', '65404', '65405', '65406', '65407', '65408', '65409', '65465']`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65465

- Mesos Reviewbot Windows




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review196648
-----------------------------------------------------------




src/slave/containerizer/mesos/main.cpp
Lines 40-50 (patched)
<https://reviews.apache.org/r/65465/#comment276370>

    TODO: Add a comment here explaining why this is necessary, and also add a comment where we create the job object.


- Andrew Schwartzmeyer




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review197187
-----------------------------------------------------------


Ship it!





src/slave/containerizer/mesos/launch.cpp
Lines 534 (patched)
<https://reviews.apache.org/r/65465/#comment277306>

    I wonder if we should `NOTE:` that this handle will not be destructed, even though it is a SharedHandle, because it never goes out of scope (i.e. `exec` will not trigger the destruction).
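
    The scoping behavior behind this note can be sketched as follows. This is a toy model, not the Mesos `SharedHandle` type: `openToyHandle`, `closesWhileInScope`, and `closesAfterScope` are hypothetical names, and a `std::shared_ptr` with a custom deleter stands in for a reference-counted handle whose deleter closes it.

```cpp
#include <memory>

// Counts how many times the toy "close" has run.
static int closed = 0;

// Hypothetical stand-in for a shared handle: the custom deleter plays the
// role of closing the underlying handle when the last reference goes away.
std::shared_ptr<int> openToyHandle()
{
  return std::shared_ptr<int>(new int(42), [](int* handle) {
    ++closed;
    delete handle;
  });
}

// Number of closes observed while the handle is still in scope. A launcher
// that execs at this point never reaches the end of the scope, so the
// deleter never gets a chance to run.
int closesWhileInScope()
{
  closed = 0;
  std::shared_ptr<int> handle = openToyHandle();
  return closed;
}

// Number of closes observed after the handle's scope has ended.
int closesAfterScope()
{
  closed = 0;
  {
    std::shared_ptr<int> handle = openToyHandle();
  }
  return closed;
}
```

    In this sketch the deleter has not run at the point where a launcher would hand control to the task, which is the property the suggested `NOTE:` would document.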


- Joseph Wu


On Feb. 8, 2018, 11:54 a.m., Andrew Schwartzmeyer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65465/
> -----------------------------------------------------------
> 
> (Updated Feb. 8, 2018, 11:54 a.m.)
> 
> 
> Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.
> 
> 
> Bugs: MESOS-8519
>     https://issues.apache.org/jira/browse/MESOS-8519
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The Windows OS deletes the job object created in the agent process when
> the agent dies, because no other process holds a handle to it (despite
> processes being assigned to the job object). While this is
> counter-intuitive, it is the observed behavior. So in order for recovery
> to succeed, the containerizer must also hold an otherwise unused handle
> to its job object to keep it alive in the kernel, and available for
> recovery to find.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/launch.cpp 91016ed417428e3a5b21a132a96b9d7760d13aa3 
> 
> 
> Diff: https://reviews.apache.org/r/65465/diff/2/
> 
> 
> Testing
> -------
> 
> ```
> [----------] Global test environment tear-down
> [==========] 874 tests from 85 test cases ran. (253311 ms total)
> [  PASSED  ] 874 tests.
> 
> I0201 12:46:58.159368  3116 slave.cpp:6921] Recovering framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.159368  3116 slave.cpp:8543] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.162847  9456 task_status_update_manager.cpp:207] Recovering task status update manager
> I0201 12:46:58.162847  9456 task_status_update_manager.cpp:215] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.166851  7344 containerizer.cpp:674] Recovering containerizer
> I0201 12:46:58.167351  7344 containerizer.cpp:731] Recovering container 69cefa53-61e0-444b-a808-e38ffb4cb18f for executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:46:58.183379 17088 provisioner.cpp:493] Provisioner recovery complete
> I0201 12:46:58.186367 16792 slave.cpp:6695] Sending reconnect request to executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:52591
> I0201 12:46:58.194370  7344 slave.cpp:4519] Received re-registration message from executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
> I0201 12:47:00.193958 16792 slave.cpp:4737] Cleaning up un-reregistered executors
> I0201 12:47:00.193958 16792 slave.cpp:6824] Finished recovery
> I0201 12:47:00.200943  9456 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0201 12:47:00.200943  3116 slave.cpp:1146] New master detected at master@10.123.6.78:5050
> I0201 12:47:00.200943  3116 slave.cpp:1190] No credentials provided. Attempting to register without authentication
> I0201 12:47:00.200943  3116 slave.cpp:1201] Detecting new master
> I0201 12:47:00.214944 16792 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
> I0201 12:47:00.214944 13180 task_status_update_manager.cpp:188] Resuming sending task status updates
> I0201 12:47:00.215942 16792 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid" {"value":"jLIL1d\/PQnuwmFxpMf8CLQ=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4S3"},"update_oversubscribed_resources":true}
> I0201 12:47:00.219952  3116 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
> I0201 12:47:00.233942  7344 task_status_update_manager.cpp:188] Resuming sending task status updates
> ```
> 
> 
> Thanks,
> 
> Andrew Schwartzmeyer
> 
>


Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review197135
-----------------------------------------------------------



Re-ran `ctest` on Windows and `make check` on Linux, and manually tested recovery scenarios.

- Andrew Schwartzmeyer




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review197192
-----------------------------------------------------------




src/slave/containerizer/mesos/launch.cpp
Lines 534 (patched)
<https://reviews.apache.org/r/65465/#comment277313>

    Good point.


- Andrew Schwartzmeyer




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/#review197179
-----------------------------------------------------------


Ship it!




LGTM

- Jie Yu




Re: Review Request 65465: Windows: Fixed recovery of Mesos containerizer.

Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65465/
-----------------------------------------------------------

(Updated Feb. 8, 2018, 11:54 a.m.)


Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.


Bugs: MESOS-8519
    https://issues.apache.org/jira/browse/MESOS-8519


Repository: mesos


Description
-------

The Windows OS deletes the job object created in the agent process when
the agent dies, because no other process holds a handle to it (despite
processes being assigned to the job object). While this is
counter-intuitive, it is the observed behavior. So in order for recovery
to succeed, the containerizer must also hold an otherwise unused handle
to its job object to keep it alive in the kernel, and available for
recovery to find.
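
To make the handle-lifetime reasoning concrete, here is a minimal sketch of
the behavior described above. It is a toy model, not the Windows kernel and
not the code in this patch: `ObjectNamespace`, `jobSurvivesAgentDeath`, and
the job name are hypothetical names invented for illustration. The model
assumes only what the description reports, namely that a named job object
stops being findable once its last handle is closed. On Windows the real
calls would be along the lines of `OpenJobObjectW` and `CloseHandle`.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model (an assumption for illustration): a named kernel object stays
// findable only while at least one handle to it is open. Processes assigned
// to the job are deliberately not modeled, because per the description they
// do not keep the job object alive.
class ObjectNamespace {
public:
  // Opens one more handle to `name`, creating the object if needed.
  void open(const std::string& name) { ++handles_[name]; }

  // Closes one handle; the object disappears with its last handle.
  void close(const std::string& name) {
    auto it = handles_.find(name);
    if (it == handles_.end()) {
      return;
    }
    if (--it->second == 0) {
      handles_.erase(it);
    }
  }

  // Reports whether recovery could still look the object up by name.
  bool exists(const std::string& name) const {
    return handles_.count(name) > 0;
  }

private:
  std::map<std::string, int> handles_;
};

// Returns whether the job object is still findable after the agent dies.
// `containerizerHoldsHandle` models the extra, otherwise unused handle.
bool jobSurvivesAgentDeath(bool containerizerHoldsHandle) {
  ObjectNamespace ns;
  const std::string job = "MESOS_JOB_example";  // Hypothetical job name.

  ns.open(job);  // The agent creates the job object.
  assert(ns.exists(job));

  if (containerizerHoldsHandle) {
    ns.open(job);  // The containerizer opens its own handle.
  }

  ns.close(job);  // The agent dies, so its handle is closed.

  return ns.exists(job);
}
```

In this model `jobSurvivesAgentDeath(false)` is false and
`jobSurvivesAgentDeath(true)` is true, which mirrors why the containerizer
has to hold its own handle for as long as it runs.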


Diffs (updated)
-----

  src/slave/containerizer/mesos/launch.cpp 91016ed417428e3a5b21a132a96b9d7760d13aa3 


Diff: https://reviews.apache.org/r/65465/diff/2/

Changes: https://reviews.apache.org/r/65465/diff/1-2/


Testing
-------

```
[----------] Global test environment tear-down
[==========] 874 tests from 85 test cases ran. (253311 ms total)
[  PASSED  ] 874 tests.

I0201 12:46:58.159368  3116 slave.cpp:6921] Recovering framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0201 12:46:58.159368  3116 slave.cpp:8543] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0201 12:46:58.162847  9456 task_status_update_manager.cpp:207] Recovering task status update manager
I0201 12:46:58.162847  9456 task_status_update_manager.cpp:215] Recovering executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0201 12:46:58.166851  7344 containerizer.cpp:674] Recovering containerizer
I0201 12:46:58.167351  7344 containerizer.cpp:731] Recovering container 69cefa53-61e0-444b-a808-e38ffb4cb18f for executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0201 12:46:58.183379 17088 provisioner.cpp:493] Provisioner recovery complete
I0201 12:46:58.186367 16792 slave.cpp:6695] Sending reconnect request to executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:52591
I0201 12:46:58.194370  7344 slave.cpp:4519] Received re-registration message from executor 'notepad.01d79d48-0791-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0201 12:47:00.193958 16792 slave.cpp:4737] Cleaning up un-reregistered executors
I0201 12:47:00.193958 16792 slave.cpp:6824] Finished recovery
I0201 12:47:00.200943  9456 task_status_update_manager.cpp:181] Pausing sending task status updates
I0201 12:47:00.200943  3116 slave.cpp:1146] New master detected at master@10.123.6.78:5050
I0201 12:47:00.200943  3116 slave.cpp:1190] No credentials provided. Attempting to register without authentication
I0201 12:47:00.200943  3116 slave.cpp:1201] Detecting new master
I0201 12:47:00.214944 16792 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
I0201 12:47:00.214944 13180 task_status_update_manager.cpp:188] Resuming sending task status updates
I0201 12:47:00.215942 16792 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid" {"value":"jLIL1d\/PQnuwmFxpMf8CLQ=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4S3"},"update_oversubscribed_resources":true}
I0201 12:47:00.219952  3116 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
I0201 12:47:00.233942  7344 task_status_update_manager.cpp:188] Resuming sending task status updates
```


Thanks,

Andrew Schwartzmeyer