Posted to reviews@mesos.apache.org by Andrew Schwartzmeyer <an...@schwartzmeyer.com> on 2018/02/02 00:15:20 UTC

Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/
-----------------------------------------------------------

(Updated Feb. 1, 2018, 4:15 p.m.)


Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.


Changes
-------

Rebased.


Bugs: MESOS-6713
    https://issues.apache.org/jira/browse/MESOS-6713


Repository: mesos


Description
-------

Because Windows cannot delete a file (or recursively delete a folder)
while it has open handles, we have to explicitly `reset()` the agent
before removing the framework meta directory. Otherwise, the task status
update manager is destroyed too late, and its open handle for
`task.updates` causes `os::rmdir` to fail.

This is safe because the agent was previously destroyed anyway, just
later in the test, when it was reassigned.
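
The ordering can be illustrated with a small standalone sketch. It does
not use the Mesos test helpers: `FakeAgent`, `UpdateLog`, and
`removeMetaDirAfterReset` are hypothetical names, and C++17's
`std::filesystem::remove_all` stands in for `os::rmdir`. It only shows
the idea of destroying the owner of an open file before deleting the
directory that contains it.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <memory>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical stand-in for a component (such as the task status update
// manager) that keeps a file open for as long as it is alive.
class UpdateLog
{
public:
  explicit UpdateLog(const fs::path& path) : stream(path) {}

  bool isOpen() const { return stream.is_open(); }

private:
  std::ofstream stream; // Closed when the object is destroyed.
};

// Hypothetical stand-in for the agent, which owns the component above.
struct FakeAgent
{
  explicit FakeAgent(const fs::path& metaDir)
    : log(metaDir / "task.updates") {}

  UpdateLog log;
};

// Creates `metaDir`, starts a fake agent that opens a file inside it, and
// destroys the agent *before* removing the directory. Returns true if the
// directory is gone afterwards.
bool removeMetaDirAfterReset(const fs::path& metaDir)
{
  fs::create_directories(metaDir);

  auto agent = std::make_unique<FakeAgent>(metaDir);
  if (!agent->log.isOpen()) {
    return false;
  }

  // Destroy the owner first so that every handle it holds is closed.
  // On Windows, removing the directory while the handle is still open
  // would typically fail; on POSIX systems the order does not matter.
  agent.reset();
  assert(!agent); // The agent and its open stream are gone.

  std::error_code error;
  fs::remove_all(metaDir, error);

  return !error && !fs::exists(metaDir);
}
```

Calling `removeMetaDirAfterReset` with a scratch path under
`std::filesystem::temp_directory_path()` should return `true` on both
Windows and POSIX systems, because no handle is open by the time the
directory is removed.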


Diffs (updated)
-----

  src/tests/slave_recovery_tests.cpp 77aa60c953bd0769eaba05f001755e4cec9ba028 


Diff: https://reviews.apache.org/r/65409/diff/2/

Changes: https://reviews.apache.org/r/65409/diff/1-2/


Testing
-------

make check on CentOS 7, all passed
ctest on Windows, all passed including new SlaveRecoveryTests

Note that while this chain enables recovery of Docker tasks on Windows, it explicitly does not fix MESOS-8519 (recovery of job object tasks).

```
I0131 11:52:01.545505  8316 docker.cpp:898] Recovering Docker containers
I0131 11:52:01.546005   660 containerizer.cpp:674] Recovering containerizer
I0131 11:52:01.546505   660 containerizer.cpp:725] Skipping recovery of executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 because it was not launched from mesos containerizer
I0131 11:52:01.557006 11272 provisioner.cpp:493] Provisioner recovery complete
I0131 11:52:02.521003  8720 docker.cpp:1008] Recovering container 'f7978e90-32f5-458d-ad4e-3ffa25a7b190' for executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:02.530527  8316 slave.cpp:6695] Sending reconnect request to executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:63903
I0131 11:52:02.549062  8720 slave.cpp:4519] Received re-registration message from executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:04.548064 10556 slave.cpp:4737] Cleaning up un-reregistered executors
I0131 11:52:04.548064 10556 slave.cpp:6824] Finished recovery
I0131 11:52:04.566066   660 task_status_update_manager.cpp:181] Pausing sending task status updates
I0131 11:52:04.567059 14636 slave.cpp:1146] New master detected at master@10.123.6.78:5050
I0131 11:52:04.567059 14636 slave.cpp:1190] No credentials provided. Attempting to register without authentication
I0131 11:52:04.568047 14636 slave.cpp:1201] Detecting new master
I0131 11:52:04.604035  8720 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
I0131 11:52:04.605060   660 task_status_update_manager.cpp:188] Resuming sending task status updates
I0131 11:52:04.606036  8720 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid":{"value":"mzwol7M6SrGxOml4zYlA8Q=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4-S0"},"update_oversubscribed_resources":true}
I0131 11:52:04.612036  8720 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
I0131 11:52:04.636543 13468 task_status_update_manager.cpp:188] Resuming sending task status updates
```


Thanks,

Andrew Schwartzmeyer


Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/#review196958
-----------------------------------------------------------




src/tests/slave_recovery_tests.cpp
Line 3832 (original), 3832 (patched)
<https://reviews.apache.org/r/65409/#comment276948>

    Ideally, the reset should go directly below this line.  I believe the test should not rely on any of the agent's data structures still existing after termination.
    
    You can probably move the comment up here too.


- Joseph Wu




Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/#review197185
-----------------------------------------------------------


Ship it!

- Joseph Wu




Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.

Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/
-----------------------------------------------------------

(Updated Feb. 8, 2018, 11:53 a.m.)


Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.


Bugs: MESOS-6713
    https://issues.apache.org/jira/browse/MESOS-6713


Repository: mesos


Description
-------

Because Windows cannot delete a file (or recursively delete a folder)
while it has open handles, we have to explicitly `reset()` the agent
before removing the framework meta directory. Otherwise, the task status
update manager is destroyed too late, and its open handle for
`task.updates` causes `os::rmdir` to fail.

This is safe because the agent was previously destroyed anyway, just
later in the test, when it was reassigned.


Diffs (updated)
-----

  src/tests/slave_recovery_tests.cpp 77aa60c953bd0769eaba05f001755e4cec9ba028 


Diff: https://reviews.apache.org/r/65409/diff/3/

Changes: https://reviews.apache.org/r/65409/diff/2-3/


Testing
-------

make check on CentOS 7, all passed
ctest on Windows, all passed including new SlaveRecoveryTests

Note that while this chain enables recovery of Docker tasks on Windows, it explicitly does not fix MESOS-8519 (recovery of job object tasks).

```
I0131 11:52:01.545505  8316 docker.cpp:898] Recovering Docker containers
I0131 11:52:01.546005   660 containerizer.cpp:674] Recovering containerizer
I0131 11:52:01.546505   660 containerizer.cpp:725] Skipping recovery of executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 because it was not launched from mesos containerizer
I0131 11:52:01.557006 11272 provisioner.cpp:493] Provisioner recovery complete
I0131 11:52:02.521003  8720 docker.cpp:1008] Recovering container 'f7978e90-32f5-458d-ad4e-3ffa25a7b190' for executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:02.530527  8316 slave.cpp:6695] Sending reconnect request to executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:63903
I0131 11:52:02.549062  8720 slave.cpp:4519] Received re-registration message from executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:04.548064 10556 slave.cpp:4737] Cleaning up un-reregistered executors
I0131 11:52:04.548064 10556 slave.cpp:6824] Finished recovery
I0131 11:52:04.566066   660 task_status_update_manager.cpp:181] Pausing sending task status updates
I0131 11:52:04.567059 14636 slave.cpp:1146] New master detected at master@10.123.6.78:5050
I0131 11:52:04.567059 14636 slave.cpp:1190] No credentials provided. Attempting to register without authentication
I0131 11:52:04.568047 14636 slave.cpp:1201] Detecting new master
I0131 11:52:04.604035  8720 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
I0131 11:52:04.605060   660 task_status_update_manager.cpp:188] Resuming sending task status updates
I0131 11:52:04.606036  8720 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid":{"value":"mzwol7M6SrGxOml4zYlA8Q=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4-S0"},"update_oversubscribed_resources":true}
I0131 11:52:04.612036  8720 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
I0131 11:52:04.636543 13468 task_status_update_manager.cpp:188] Resuming sending task status updates
```


Thanks,

Andrew Schwartzmeyer