Posted to reviews@mesos.apache.org by Andrew Schwartzmeyer <an...@schwartzmeyer.com> on 2018/02/02 00:15:20 UTC
Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/
-----------------------------------------------------------
(Updated Feb. 1, 2018, 4:15 p.m.)
Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.
Changes
-------
Rebased.
Bugs: MESOS-6713
https://issues.apache.org/jira/browse/MESOS-6713
Repository: mesos
Description
-------
Because it is not possible to delete a file (or a folder recursively)
with open handles on Windows, we have to explicitly `reset()` the agent
before removing the framework meta directory. Otherwise, the task status
update manager will be destructed too late, and so an open handle for
`task.updates` will cause the `os::rmdir` to fail.
This is safe because we previously destructed the agent anyway, just
later in the test when it was reassigned.
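The ordering described above can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the code from the patch: `FakeAgent` and `removeMetaDirAfterReset` are hypothetical names, `std::filesystem::remove_all` (C++17) stands in for `os::rmdir`, and a plain `std::ofstream` stands in for the handle that the task status update manager keeps on `task.updates`.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <memory>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical stand-in for the agent: it keeps `task.updates` open for
// as long as it is alive, much like the task status update manager.
struct FakeAgent
{
  explicit FakeAgent(const fs::path& metaDir)
    : updates(metaDir / "task.updates", std::ios::app) {}

  std::ofstream updates; // Closed only when the object is destroyed.
};

// Creates a scratch meta directory, starts a FakeAgent that holds a file
// open inside it, destroys the agent, and only then removes the directory.
// Returns true if the directory is gone afterwards.
bool removeMetaDirAfterReset(const fs::path& metaDir)
{
  fs::create_directories(metaDir);

  auto agent = std::make_unique<FakeAgent>(metaDir);
  assert(agent->updates.is_open());
  agent->updates << "status update\n";
  agent->updates.flush();

  // Destroy the agent first so that its file handle is closed. Per the
  // description, the removal would fail on Windows if the handle were
  // still open; on POSIX systems it succeeds either way.
  agent.reset();

  std::error_code error;
  fs::remove_all(metaDir, error); // Stand-in for `os::rmdir`.

  return !error && !fs::exists(metaDir);
}
```

Calling `removeMetaDirAfterReset(fs::temp_directory_path() / "meta_dir_sketch")` should return `true`. Moving `agent.reset()` below the `remove_all` call reproduces the ordering that the description says fails on Windows.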
Diffs (updated)
-----
src/tests/slave_recovery_tests.cpp 77aa60c953bd0769eaba05f001755e4cec9ba028
Diff: https://reviews.apache.org/r/65409/diff/2/
Changes: https://reviews.apache.org/r/65409/diff/1-2/
Testing
-------
make check on CentOS 7, all passed
ctest on Windows, all passed including new SlaveRecoveryTests
Note that while this chain enables recovery of Docker tasks on Windows, it explicitly does not fix MESOS-8519 (recovery of job object tasks).
```
I0131 11:52:01.545505 8316 docker.cpp:898] Recovering Docker containers
I0131 11:52:01.546005 660 containerizer.cpp:674] Recovering containerizer
I0131 11:52:01.546505 660 containerizer.cpp:725] Skipping recovery of executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 because it was not launched from mesos containerizer
I0131 11:52:01.557006 11272 provisioner.cpp:493] Provisioner recovery complete
I0131 11:52:02.521003 8720 docker.cpp:1008] Recovering container 'f7978e90-32f5-458d-ad4e-3ffa25a7b190' for executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:02.530527 8316 slave.cpp:6695] Sending reconnect request to executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:63903
I0131 11:52:02.549062 8720 slave.cpp:4519] Received re-registration message from executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:04.548064 10556 slave.cpp:4737] Cleaning up un-reregistered executors
I0131 11:52:04.548064 10556 slave.cpp:6824] Finished recovery
I0131 11:52:04.566066 660 task_status_update_manager.cpp:181] Pausing sending task status updates
I0131 11:52:04.567059 14636 slave.cpp:1146] New master detected at master@10.123.6.78:5050
I0131 11:52:04.567059 14636 slave.cpp:1190] No credentials provided. Attempting to register without authentication
I0131 11:52:04.568047 14636 slave.cpp:1201] Detecting new master
I0131 11:52:04.604035 8720 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
I0131 11:52:04.605060 660 task_status_update_manager.cpp:188] Resuming sending task status updates
I0131 11:52:04.606036 8720 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid":{"value":"mzwol7M6SrGxOml4zYlA8Q=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4-S0"},"update_oversubscribed_resources":true}
I0131 11:52:04.612036 8720 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
I0131 11:52:04.636543 13468 task_status_update_manager.cpp:188] Resuming sending task status updates
```
Thanks,
Andrew Schwartzmeyer
Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.
Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/#review196958
-----------------------------------------------------------
src/tests/slave_recovery_tests.cpp
Line 3832 (original), 3832 (patched)
<https://reviews.apache.org/r/65409/#comment276948>
The reset should ideally go right below this line. I believe the test should not rely on any of the agent's data structures still existing after termination.
You can probably move the comment up here too.
- Joseph Wu
Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.
Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/#review197185
-----------------------------------------------------------
Ship it!
- Joseph Wu
Re: Review Request 65409: Fixed `SlaveRecoveryTest.ReconcileTasksMissingFromSlave`.
Posted by Andrew Schwartzmeyer <an...@schwartzmeyer.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65409/
-----------------------------------------------------------
(Updated Feb. 8, 2018, 11:53 a.m.)
Review request for mesos, Akash Gupta, Jie Yu, and Joseph Wu.
Bugs: MESOS-6713
https://issues.apache.org/jira/browse/MESOS-6713
Repository: mesos
Description
-------
Because it is not possible to delete a file (or a folder recursively)
with open handles on Windows, we have to explicitly `reset()` the agent
before removing the framework meta directory. Otherwise, the task status
update manager will be destructed too late, and so an open handle for
`task.updates` will cause the `os::rmdir` to fail.
This is safe because we previously destructed the agent anyway, just
later in the test when it was reassigned.
Diffs (updated)
-----
src/tests/slave_recovery_tests.cpp 77aa60c953bd0769eaba05f001755e4cec9ba028
Diff: https://reviews.apache.org/r/65409/diff/3/
Changes: https://reviews.apache.org/r/65409/diff/2-3/
Testing
-------
make check on CentOS 7, all passed
ctest on Windows, all passed including new SlaveRecoveryTests
Note that while this chain enables recovery of Docker tasks on Windows, it explicitly does not fix MESOS-8519 (recovery of job object tasks).
```
I0131 11:52:01.545505 8316 docker.cpp:898] Recovering Docker containers
I0131 11:52:01.546005 660 containerizer.cpp:674] Recovering containerizer
I0131 11:52:01.546505 660 containerizer.cpp:725] Skipping recovery of executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 because it was not launched from mesos containerizer
I0131 11:52:01.557006 11272 provisioner.cpp:493] Provisioner recovery complete
I0131 11:52:02.521003 8720 docker.cpp:1008] Recovering container 'f7978e90-32f5-458d-ad4e-3ffa25a7b190' for executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:02.530527 8316 slave.cpp:6695] Sending reconnect request to executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 at executor(1)@10.123.7.41:63903
I0131 11:52:02.549062 8720 slave.cpp:4519] Received re-registration message from executor 'iis.feae9d12-06ba-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000
I0131 11:52:04.548064 10556 slave.cpp:4737] Cleaning up un-reregistered executors
I0131 11:52:04.548064 10556 slave.cpp:6824] Finished recovery
I0131 11:52:04.566066 660 task_status_update_manager.cpp:181] Pausing sending task status updates
I0131 11:52:04.567059 14636 slave.cpp:1146] New master detected at master@10.123.6.78:5050
I0131 11:52:04.567059 14636 slave.cpp:1190] No credentials provided. Attempting to register without authentication
I0131 11:52:04.568047 14636 slave.cpp:1201] Detecting new master
I0131 11:52:04.604035 8720 slave.cpp:1471] Re-registered with master master@10.123.6.78:5050
I0131 11:52:04.605060 660 task_status_update_manager.cpp:188] Resuming sending task status updates
I0131 11:52:04.606036 8720 slave.cpp:1516] Forwarding agent update {"operations":{},"resource_version_uuid":{"value":"mzwol7M6SrGxOml4zYlA8Q=="},"slave_id":{"value":"7dc02270-a4e1-4f59-9ad7-56bad5182ea4-S0"},"update_oversubscribed_resources":true}
I0131 11:52:04.612036 8720 slave.cpp:3625] Updating info for framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000 with pid updated to scheduler-aaa62980-8b1b-4775-b8bb-c6890b41941e@10.123.6.78:45907
I0131 11:52:04.636543 13468 task_status_update_manager.cpp:188] Resuming sending task status updates
```
Thanks,
Andrew Schwartzmeyer