You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Zhitao Li <zh...@gmail.com> on 2018/06/06 00:07:49 UTC

Re: Review Request 67264: Unmounted any mount points in gc paths.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
-----------------------------------------------------------

(Updated June 5, 2018, 5:07 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
-------

Jie's comments.


Summary (updated)
-----------------

Unmounted any mount points in gc paths.


Bugs: MESOS-8830
    https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description (updated)
-------

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-----

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/5/

Changes: https://reviews.apache.org/r/67264/diff/4-5/


Testing
-------

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-0000/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li


Re: Review Request 67264: Unmounted any mount points in gc paths.

Posted by Zhitao Li <zh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
-----------------------------------------------------------

(Updated June 12, 2018, 1:22 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
-------

Jie's comments.


Bugs: MESOS-8830
    https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
-------

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-----

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/8/

Changes: https://reviews.apache.org/r/67264/diff/7-8/


Testing
-------

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-0000/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li


Re: Review Request 67264: Unmounted any mount points in gc paths.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/#review204608
-----------------------------------------------------------


Fix it, then Ship it!




LGTM overall!


src/slave/gc.cpp
Lines 213 (patched)
<https://reviews.apache.org/r/67264/#comment287189>

    `if (mountTable.isError())`



src/slave/gc.cpp
Lines 226-228 (patched)
<https://reviews.apache.org/r/67264/#comment287197>

    This double for loop will be quite expensive if we have a lot of directories to delete, and quite a lot of mount table entries.
    
    I'd suggest we some short cut like the following:
    ```
    foreach (const fs::MountInfoTable::Entry& entry,
             adaptor::reverse(mountTable->entries)) {
      // Ignore mounts whose targets are not under `workDir`.
      if (!strings::startsWith(
              path::join(entry.target, ""),
              path::join(workDir, "")) {
        continue;
      }
      
      for (auto it = infos.begin(); it != infos.end(); ) {
        ...
      }
    }
    ```



src/slave/gc.cpp
Lines 231 (patched)
<https://reviews.apache.org/r/67264/#comment287191>

    Can you add a TODO to validate that `info->path` is a realpath. I think we should just make sure slave's workdir is a realpath.



src/slave/paths.cpp
Lines 725-728 (patched)
<https://reviews.apache.org/r/67264/#comment287206>

    DO you still need this?


- Jie Yu


On June 11, 2018, 11:04 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67264/
> -----------------------------------------------------------
> 
> (Updated June 11, 2018, 11:04 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.
> 
> 
> Bugs: MESOS-8830
>     https://issues.apache.org/jira/browse/MESOS-8830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> In various corner cases, agent may not get chance to properly unmount
> persistent volumes mounted inside an executor's sandbox. When GC later
> gets to these sandbox directories, permanent data loss can happen (see
> MESOS-8830).
> 
> Currently, the only mounts in the host mount namespace under the sandbox
> directories are persistent volumes, so this diff added protection to
> unmount any dangling mount points before calling `rmdir` on the
> directory.
> 
> NOTE: this means agent will not garbage collect any path if it cannot
> read its own `mountinfo` table.
> 
> 
> Diffs
> -----
> 
>   src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
>   src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
>   src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
>   src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
>   src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
>   src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
>   src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
>   src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
>   src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
>   src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
>   src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
>   src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 
> 
> 
> Diff: https://reviews.apache.org/r/67264/diff/7/
> 
> 
> Testing
> -------
> 
> Added a unit test in following patch.
> 
> Tested with following procedures:
> 1. Start a test master and agent;
> 2. Created a persistent volume on agent through operator API;
> 3. Use `mesos-execute` to run a task;
> 4. Stop the agent;
> 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830);
> 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately.
> 
> With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line:
> ```
> W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-0000/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
> ```
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>


Re: Review Request 67264: Unmounted any mount points in gc paths.

Posted by Zhitao Li <zh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
-----------------------------------------------------------

(Updated June 11, 2018, 4:04 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
-------

Use adaptor::reverse


Bugs: MESOS-8830
    https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
-------

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-----

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/7/

Changes: https://reviews.apache.org/r/67264/diff/6-7/


Testing
-------

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-0000/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li


Re: Review Request 67264: Unmounted any mount points in gc paths.

Posted by Jason Lai <ja...@jasonlai.net>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/#review204566
-----------------------------------------------------------


Ship it!




Ship It!


src/slave/gc.cpp
Lines 225-227 (patched)
<https://reviews.apache.org/r/67264/#comment287165>

    This works, but if you wanna be consistent with the rest of how reverse iteration of the mount entries, the following could be considered:
    
    ```
      foreach (const MountTable::Entry& entry,
               adaptor::reverse(mountTable->entries)) {
    ```
    
    This is used in `src/linux/fs.cpp` and a couple of other places.


- Jason Lai


On June 11, 2018, 8:58 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67264/
> -----------------------------------------------------------
> 
> (Updated June 11, 2018, 8:58 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.
> 
> 
> Bugs: MESOS-8830
>     https://issues.apache.org/jira/browse/MESOS-8830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> In various corner cases, agent may not get chance to properly unmount
> persistent volumes mounted inside an executor's sandbox. When GC later
> gets to these sandbox directories, permanent data loss can happen (see
> MESOS-8830).
> 
> Currently, the only mounts in the host mount namespace under the sandbox
> directories are persistent volumes, so this diff added protection to
> unmount any dangling mount points before calling `rmdir` on the
> directory.
> 
> NOTE: this means agent will not garbage collect any path if it cannot
> read its own `mountinfo` table.
> 
> 
> Diffs
> -----
> 
>   src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
>   src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
>   src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
>   src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
>   src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
>   src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
>   src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
>   src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
>   src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
>   src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
>   src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
>   src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 
> 
> 
> Diff: https://reviews.apache.org/r/67264/diff/6/
> 
> 
> Testing
> -------
> 
> Added a unit test in following patch.
> 
> Tested with following procedures:
> 1. Start a test master and agent;
> 2. Created a persistent volume on agent through operator API;
> 3. Use `mesos-execute` to run a task;
> 4. Stop the agent;
> 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830);
> 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately.
> 
> With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line:
> ```
> W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-0000/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
> ```
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>


Re: Review Request 67264: Unmounted any mount points in gc paths.

Posted by Zhitao Li <zh...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
-----------------------------------------------------------

(Updated June 11, 2018, 1:58 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
-------

Fail `rmdir` future if mount table cannot be loaded for agent on Linux.


Bugs: MESOS-8830
    https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
-------

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-----

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/6/

Changes: https://reviews.apache.org/r/67264/diff/5-6/


Testing
-------

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-0000/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li