You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by GitBox <gi...@apache.org> on 2020/04/01 12:50:51 UTC

[GitHub] [mesos] cf-natali opened a new pull request #355: Handle EBUSY when destroying a cgroup.

cf-natali opened a new pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355
 
 
   It's a workaround for kernel bugs
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/cgroup.c?id=9c974c77246460fa6a92c18554c3311c8c83c160
   and
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/cgroup.c?id=c03cd7738a83b13739f00546166969342c8ff014
   
   Fixes MESOS-10107.
   
   @abudnik 
   
   > Does the workaround work reliably after changing the initial delay and retry count to the values taken from libcontainerd (10ms and 5)?
   
   Yes, however I chose 1ms and 10 for two reasons:
   - this possibly yields lower latency
   - more importantly, while doing an strace I can see that it can take sometimes up to ver 100-200ms for rmdir to succeed:
   
   ```
   [pid  1965] 13:22:36.021260 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000017>
   [pid  1965] 13:22:36.022604 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000018>
   [pid  1965] 13:22:36.024807 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000080>
   [pid  1965] 13:22:36.029116 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000466>
   [pid  1965] 13:22:36.037990 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000190>
   [pid  1965] 13:22:36.054528 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000038>
   [pid  1965] 13:22:36.086874 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = -1 EBUSY (Périphérique ou ressource occupé) <0.000029>
   [pid  3225] 13:22:36.127365 +++ killed by SIGKILL +++
   [pid  1965] 13:22:36.151151 rmdir("/sys/fs/cgroup/freezer/mesos/b99efad6-b9eb-43bd-8242-29a2b321dd07") = 0 <0.000114>
   ```
   
   And 10ms with 5 retries only 320ms (10 * 2**5), so I'd rather have a bit more margin.
   
   > Should we retry only if `::rmdir()` returns EBUSY errno error?
   
   Definitely - I wanted to do that but I'm not sure what's the best way to do it: is there a way to access `errno` from `Try<Nothing> rmdir` or can I just assume that the global `errno` is preserved and access it directly?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on issue #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on issue #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#issuecomment-612129236
 
 
   > It seems the existing method `cgroups::remove` is not used anywhere except in the unit test codes, so do we still need it? Or we want to change it to internally call the `internal::remove` introduced in this PR?
   
   Whatever you prefer :).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on issue #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on issue #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#issuecomment-612580001
 
 
   > > > It seems the existing method `cgroups::remove` is not used anywhere except in the unit test codes, so do we still need it? Or we want to change it to internally call the `internal::remove` introduced in this PR?
   > > 
   > > 
   > > Whatever you prefer :).
   > 
   > Second thought, it seems not a good idea to make `cgroups::remove` internally call `internal::remove` since they have different return types, so let just leave it as it is.
   
   Ah, for some reason I didn't see this edit, so I removed `cgroups::remove` - seems it makes sense since it's not used apart from tests, and we probably don't want to use it since it doesn't have the workaround?
   
   It's easy to add back if you want, just let me know what you prefer.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on issue #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on issue #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#issuecomment-612649988
 
 
   OK, hopefully good to merge now :).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r405018860
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   Hm, I'm still confused: in your version, you capture `retry` and `delay` in the lambdas by value, so their modifications are lost:
   
   ```
     return loop(
         // NOTE: copy `delay` by value in the first lambda function.
         [=]() mutable {
           auto timeout = process::after(delay);
           delay *= 2;
           return timeout;
         },
         // NOTE: copy `retry` by value in the second lambda function.
         [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
   [...]
           } else if ((errno == EBUSY) && (retry > 0)) {
             --retry;
             return process::Continue();
   ```
   
   I think that with this, `delay` and `retry` will *not* be updated from one iteration to the next: so basically we will keep retrying the `rmdir` every ms forever.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407208547
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,73 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay = Duration::zero();
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration::zero()) ? Milliseconds(1) : delay * 2;
+        return timeout;
 
 Review comment:
   Yeah, I agree.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406727510
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
 
 Review comment:
   One more newline here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406270107
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -1535,22 +1629,20 @@ class Destroyer : public Process<Destroyer>
 
   void remove()
   {
-    foreach (const string& cgroup, cgroups) {
-      Try<Nothing> remove = cgroups::remove(hierarchy, cgroup);
-      if (remove.isError()) {
-        // If the `cgroup` still exists in the hierarchy, treat this as
-        // an error; otherwise, treat this as a success since the `cgroup`
-        // has actually been cleaned up.
-        if (os::exists(path::join(hierarchy, cgroup))) {
-          promise.fail(
-              "Failed to remove cgroup '" + cgroup + "': " + remove.error());
-          terminate(self());
-          return;
-        }
-      }
+    Future<Nothing> remove_all = internal::remove(hierarchy, cgroups);
+    remove_all.onAny(defer(self(), &Destroyer::removed, lambda::_1));
+  }
+
+  void removed(const Future<Nothing>& remove_all)
 
 Review comment:
   `snake_case` is allowed in libprocess only.
   Please `s/remove_all/removeAll`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r408157119
 
 

 ##########
 File path: src/tests/containerizer/cgroups_isolator_tests.cpp
 ##########
 @@ -1489,7 +1489,7 @@ TEST_F(CgroupsIsolatorTest, ROOT_CGROUPS_CreateRecursively)
   // We should remove cgroups_root after the slave being started
   // because slave will create cgroups_root dir during startup
   // if it's not present.
-  ASSERT_SOME(cgroups::remove(hierarchy.get(), flags.cgroups_root));
+  ASSERT_SOME(os::rmdir(path::join(hierarchy.get(), flags.cgroups_root)));
 
 Review comment:
   Please, specify a second parameter `recursive = false` here and elsewhere:
   ```c++
   os::rmdir(path::join(hierarchy.get(), flags.cgroups_root), false)
   ```
   
   I'd recommend running tests under `root` before sending a patch:
   ```shell
   cd build
   sudo make check
   # or
   GLOG_v=2 sudo src/mesos-tests --verbose --gtest_repeat=1 --gtest_filter=*
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406246095
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,98 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
+          }
+          return process::Break();
+        }
+      }
+    );
+}
+
+// Removes a list of cgroups from a given hierachy.
+// The cgroups are removed in order, so this function can be used to remove a
+// cgroup hierarchy in a bottom-up fashion.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroups    Path of the cgroups relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const vector<string>& cgroups)
+{
+  size_t cgroup_index = 0;
+
+  return loop(
+      [=]() mutable -> Future<bool> {
+        if (cgroup_index == cgroups.size()) {
+          // No more cgroup to remove.
+          return false;
+        }
+        Future<Nothing> removed = internal::remove(
+                                        hierarchy, cgroups[cgroup_index]);
+        ++cgroup_index;
+        return removed.then(
+          [=](const Future<Nothing>& removed) -> Future<bool> {
+            if (removed.isReady()) {
+              return true;
+            }
+            if (removed.isFailed()) {
+              return Failure(removed.failure());
+            }
+            return false;
+          }
+        );
+      },
+      [](const Future<bool>& removed) mutable -> Future<ControlFlow<Nothing>> {
+        if (removed.isReady()) {
+          if (!removed.get()) {
+            // No more cgroup to remove.
+            return process::Break();
+          }
+          return process::Continue();
+        }
+        if (removed.isFailed()) {
+          return Failure(removed.failure());
+        }
+        return process::Break();
+      }
+    );
 
 Review comment:
   This can be simpified by using [future chaining](https://github.com/apache/mesos/blob/e6b15b1e968649c794d67bb97961e8f0216e2ed8/src/slave/containerizer/mesos/containerizer.cpp#L1579-L1605):
   ```c++
   Future<Nothing> remove(const string& hierarchy, const vector<string>& cgroups)
   {
     Future<Nothing> f = Nothing();
   
     foreach (const string& cgroup, cgroups) {
       f = f.then([=] {
         return internal::remove(hierarchy, cgroup);
       });
     }
   
     return f;
   }
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406268132
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -1535,22 +1629,20 @@ class Destroyer : public Process<Destroyer>
 
   void remove()
   {
-    foreach (const string& cgroup, cgroups) {
-      Try<Nothing> remove = cgroups::remove(hierarchy, cgroup);
-      if (remove.isError()) {
-        // If the `cgroup` still exists in the hierarchy, treat this as
-        // an error; otherwise, treat this as a success since the `cgroup`
-        // has actually been cleaned up.
-        if (os::exists(path::join(hierarchy, cgroup))) {
-          promise.fail(
-              "Failed to remove cgroup '" + cgroup + "': " + remove.error());
-          terminate(self());
-          return;
-        }
-      }
+    Future<Nothing> remove_all = internal::remove(hierarchy, cgroups);
 
 Review comment:
   I think we don't need to use `remove_all` variable here. Also, we can get rid of the `void remove()` method. So the only call-site of `remove()` will look like this:
   ```c++
   // instead of calling `remove()` method
   internal::remove(hierarchy, cgroups)
     .onAny(defer(self(), &Destroyer::removed, lambda::_1));
   ``` 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406744824
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
 
 Review comment:
   Can we use `Duration::zero()` instead of `Duration()`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406378922
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,98 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
+          }
+          return process::Break();
+        }
+      }
+    );
+}
+
+// Removes a list of cgroups from a given hierachy.
+// The cgroups are removed in order, so this function can be used to remove a
+// cgroup hierarchy in a bottom-up fashion.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroups    Path of the cgroups relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const vector<string>& cgroups)
+{
+  size_t cgroup_index = 0;
+
+  return loop(
+      [=]() mutable -> Future<bool> {
+        if (cgroup_index == cgroups.size()) {
+          // No more cgroup to remove.
+          return false;
+        }
+        Future<Nothing> removed = internal::remove(
+                                        hierarchy, cgroups[cgroup_index]);
+        ++cgroup_index;
+        return removed.then(
+          [=](const Future<Nothing>& removed) -> Future<bool> {
+            if (removed.isReady()) {
+              return true;
+            }
+            if (removed.isFailed()) {
+              return Failure(removed.failure());
+            }
+            return false;
+          }
+        );
+      },
+      [](const Future<bool>& removed) mutable -> Future<ControlFlow<Nothing>> {
+        if (removed.isReady()) {
+          if (!removed.get()) {
+            // No more cgroup to remove.
+            return process::Break();
+          }
+          return process::Continue();
+        }
+        if (removed.isFailed()) {
+          return Failure(removed.failure());
+        }
+        return process::Break();
+      }
+    );
 
 Review comment:
   Indeed, that's much simpler, wasn't aware of that.
   Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r404838566
 
 

 ##########
 File path: docs/contributors.yaml
 ##########
 @@ -231,6 +231,12 @@
   jira_user: drcrallen
   reviewboard_user: drcrallen
 
+- name: Charles Natali
+  emails:
+    - cf.natali@gmail.com
+  jira_user: charle
+  reviewboard_user: cf.natali
+
 
 Review comment:
   Could you please move this into a separate commit (within this PR)?
   Example: https://github.com/apache/mesos/commit/beaf0cd844f3658bfccb86049f7181036b0e6ae4

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407207279
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,73 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay = Duration::zero();
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration::zero()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          // Save the error string as os::exists may clobber errno.
+          const string error = os::strerror(errno);
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " + error);
 
 Review comment:
   Need 2 more spaces for the indent.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406378209
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -1535,22 +1629,20 @@ class Destroyer : public Process<Destroyer>
 
   void remove()
   {
-    foreach (const string& cgroup, cgroups) {
-      Try<Nothing> remove = cgroups::remove(hierarchy, cgroup);
-      if (remove.isError()) {
-        // If the `cgroup` still exists in the hierarchy, treat this as
-        // an error; otherwise, treat this as a success since the `cgroup`
-        // has actually been cleaned up.
-        if (os::exists(path::join(hierarchy, cgroup))) {
-          promise.fail(
-              "Failed to remove cgroup '" + cgroup + "': " + remove.error());
-          terminate(self());
-          return;
-        }
-      }
+    Future<Nothing> remove_all = internal::remove(hierarchy, cgroups);
+    remove_all.onAny(defer(self(), &Destroyer::removed, lambda::_1));
+  }
+
+  void removed(const Future<Nothing>& remove_all)
 
 Review comment:
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on issue #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on issue #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#issuecomment-612398466
 
 
   > > It seems the existing method `cgroups::remove` is not used anywhere except in the unit test codes, so do we still need it? Or we want to change it to internally call the `internal::remove` introduced in this PR?
   > 
   > Whatever you prefer :).
   
   I think we could just remove it and call `os::rmdir()` in the affected unit tests.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on issue #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on issue #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#issuecomment-612629131
 
 
   > > > > It seems the existing method `cgroups::remove` is not used anywhere except in the unit test codes, so do we still need it? Or we want to change it to internally call the `internal::remove` introduced in this PR?
   > > > 
   > > > 
   > > > Whatever you prefer :).
   > > 
   > > 
   > > Second thought, it seems not a good idea to make `cgroups::remove` internally call `internal::remove` since they have different return types, so let just leave it as it is.
   > 
   > Ah, for some reason I didn't see this edit, so I removed `cgroups::remove` - seems it makes sense since it's not used apart from tests, and we probably don't want to use it since it doesn't have the workaround?
   > 
   > It's easy to add back if you want, just let me know what you prefer.
   
   Yeah, I am OK to remove it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406738281
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
+          }
+          return process::Break();
 
 Review comment:
   A newline between.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] asfgit closed pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407035630
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
 
 Review comment:
   Sorry my bad, two newlines are enough.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406760777
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
 
 Review comment:
   I think what we want to return is the `errno` of `::rmdir`, but after the calls to `path::join` and `os::exists`, is it still the original `errno` of `::rmdir`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r408645634
 
 

 ##########
 File path: src/tests/containerizer/cgroups_isolator_tests.cpp
 ##########
 @@ -1489,7 +1489,7 @@ TEST_F(CgroupsIsolatorTest, ROOT_CGROUPS_CreateRecursively)
   // We should remove cgroups_root after the slave being started
   // because slave will create cgroups_root dir during startup
   // if it's not present.
-  ASSERT_SOME(cgroups::remove(hierarchy.get(), flags.cgroups_root));
+  ASSERT_SOME(os::rmdir(path::join(hierarchy.get(), flags.cgroups_root)));
 
 Review comment:
   > Please, specify a second parameter recursive = false here and elsewhere:
   
   Done.
   
   > I'd recommend running tests under root before sending a patch:
   
   Thanks - it's not mentioned in the contribution guide, I'll submit a PR to update it.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r405615701
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   Done.
   I had to use another `loop` to chain the calls to `remove` - since we want to delete directories in order, because we proceed in bottom-up fashion.
   
   I tested in the following way:
   - make check
   - wrote a reproducer which repeatedly starts tasks with a large memory limit and exceed it, so that they are called by the OOM killer
   - check that it was failing without the fix - I've been running the same test for several hours with the fix, no failure so far
   - I also checked the rmdir handling of EBUSY with strace, checking the exponential backoff
   - I also tried to reduce the maximum retries to 2 to force it to fail, and checked that failures are correctly reported

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r402256087
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   An internal helper function can be introduced to implement non-blocking retry logic:
   ```
   Future<Nothing> remove(const std::string& hierarchy, const std::vector<std::string>& groups)
   ````
   [XfsDiskIsolatorProcess::initialize](https://github.com/apache/mesos/blob/f8a3dd334934094ec44e07fa350f958d218bc78f/src/slave/containerizer/mesos/isolators/xfs/disk.cpp#L973-L982) and [IOSwitchboard::_connect](https://github.com/apache/mesos/blob/f8a3dd334934094ec44e07fa350f958d218bc78f/src/slave/containerizer/mesos/io/switchboard.cpp#L695-L713) are examples of how `process::loop()` can be used to implement non-blocking retries.
   
   The new `remove` function can be used in both places you have mentioned above where the blocking `cgroups::remove` function is called.
   
   > we should check for EBUSY
   
   Just call `::rmdir()` (no `os::` namespace)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406854135
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
 
 Review comment:
   The are already 2 newlines, not enough?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa edited a comment on issue #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa edited a comment on issue #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#issuecomment-612398466
 
 
   > > It seems the existing method `cgroups::remove` is not used anywhere except in the unit test codes, so do we still need it? Or we want to change it to internally call the `internal::remove` introduced in this PR?
   > 
   > Whatever you prefer :).
   
   Second thought, it seems not a good idea to make `cgroups::remove` internally call `internal::remove` since they have different return types, so let just leave it as it is.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r405612984
 
 

 ##########
 File path: docs/contributors.yaml
 ##########
 @@ -231,6 +231,12 @@
   jira_user: drcrallen
   reviewboard_user: drcrallen
 
+- name: Charles Natali
+  emails:
+    - cf.natali@gmail.com
+  jira_user: charle
+  reviewboard_user: cf.natali
+
 
 Review comment:
   Done!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r405042359
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   Ah, I hadn't realised that `loop` actually called the lambda multiple times, I thought it was copying it every time - makes sense, and pretty neat.
   
   I'll finish, thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r404835680
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   Added a few minor changes to your code:
   ```
   Future<Nothing> remove(const string& hierarchy, const string& cgroup)
   {
     // NOTE: please minimize using of `auto` keyword to keep the code consistent.
     const string path = path::join(hierarchy, cgroup);
   
     Duration delay = Milliseconds(1);
     int retry = 10;
   
     // TODO: add a comment describing the problem with the kernel bug and the way
     // we retry `rmdir` operation.
     return loop(
         // NOTE: copy `delay` by value in the first lambda function.
         [=]() mutable {
           auto timeout = process::after(delay);
           delay *= 2;
           return timeout;
         },
         // NOTE: copy `retry` by value in the second lambda function.
         [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
           if (::rmdir(path.c_str()) == 0) {
             return process::Break();
           } else if ((errno == EBUSY) && (retry > 0)) {
             --retry;
             return process::Continue();
           } else {
             // If the `cgroup` still exists in the hierarchy, treat this as
             // an error; otherwise, treat this as a success since the `cgroup`
             // has actually been cleaned up.
             if (os::exists(path::join(hierarchy, cgroup))) {
               return Failure(
                   "Failed to remove directory '" + path +
                   "': " + os::strerror(errno));
             }
             return process::Break();
           }
         }
       );
   }
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r405029624
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   `loop()` calls the same lambda object, so the state of this lambda is preserved between iterations. Here is an example of this approach (`maxBackoff` is captured by lambda): https://github.com/apache/mesos/blob/master/src/csi/v1_volume_manager.cpp#L512-L545

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406378603
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -1535,22 +1629,20 @@ class Destroyer : public Process<Destroyer>
 
   void remove()
   {
-    foreach (const string& cgroup, cgroups) {
-      Try<Nothing> remove = cgroups::remove(hierarchy, cgroup);
-      if (remove.isError()) {
-        // If the `cgroup` still exists in the hierarchy, treat this as
-        // an error; otherwise, treat this as a success since the `cgroup`
-        // has actually been cleaned up.
-        if (os::exists(path::join(hierarchy, cgroup))) {
-          promise.fail(
-              "Failed to remove cgroup '" + cgroup + "': " + remove.error());
-          terminate(self());
-          return;
-        }
-      }
+    Future<Nothing> remove_all = internal::remove(hierarchy, cgroups);
 
 Review comment:
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r404414704
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   @abudnik do you have a suggestion to help this move forward?
   
   I've applied my patch to our Mesos build we're using at work, because without it we keep having containers which don't get destroyed, and it leads to a lot of pain because we use gpu isolation and the agent then basically leaks the gpu...
   
   As I said earlier I really don't see how to implement the non-blocking retry in a function because we have nowhere to store the variables such as `retry`, `delay`, etc.
   Can you detail how you would implement it?
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r405018860
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   Hm, I'm still confused: in your version, you capture `retry` and `delay` in the lambdas by value, so their modifications are lost:
   
   ```
     return loop(
         // NOTE: copy `delay` by value in the first lambda function.
         [=]() mutable {
           auto timeout = process::after(delay);
           delay *= 2;
           return timeout;
         },
         // NOTE: copy `retry` by value in the second lambda function.
         [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
   [...]
           } else if ((errno == EBUSY) && (retry > 0)) {
             --retry;
             return process::Continue();
   ```
   
   I think that with this, `delay` and `retry` will *not* be updated from one iteration to the next: so basically we will keep retrying the `rmdir` every 1ms forever.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r401684158
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   Note that `os::sleep()` is a blocking call while the `remove()` function is supposed to be non-blocking. Calling a blocking function from a non-blocking context must be avoided if possible.
   
   If the error occurs for the `freezer` cgroup only, [Destroyer::remove](https://github.com/apache/mesos/blob/f8a3dd334934094ec44e07fa350f958d218bc78f/src/linux/cgroups.cpp#L1536-L1555) method can be modified to asynchronously retry "remove cgroup" operation.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
abudnik commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407230922
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -691,21 +760,6 @@ Try<Nothing> create(
 }
 
 
-Try<Nothing> remove(const string& hierarchy, const string& cgroup)
 
 Review comment:
   Let's remove the declaration from the header file too.
   
   Overall, LGTM. I'll run a CI build for this patch and merge it later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407235607
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -691,21 +760,6 @@ Try<Nothing> create(
 }
 
 
-Try<Nothing> remove(const string& hierarchy, const string& cgroup)
 
 Review comment:
   > Let's remove the declaration from the header file too
   
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r401705020
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   There is at least another code path which calls `cgroup::remove` - when the freezer subsystem isn't available: https://github.com/apache/mesos/blob/f8a3dd334934094ec44e07fa350f958d218bc78f/src/linux/cgroups.cpp#L1596
   
   Also it seems that `cgroup::mount` has a similar hack: https://github.com/apache/mesos/blob/f8a3dd334934094ec44e07fa350f958d218bc78f/src/linux/cgroups.cpp#L575
   
   It returns a future but can block retrying?
   
   So unless you feel strongly about it maybe best to keep it there, since failing to remove a cgroup is pretty much a fatal error?
   
   Also do you have a suggestion for this - I agree we should check for EBUSY, I'm not sure how to do it :).
   
   > > Should we retry only if ::rmdir() returns EBUSY errno error?
   > 
   > Definitely - I wanted to do that but I'm not sure what's the best way to do it: is there a way to access errno from Try<Nothing> rmdir or can I just assume that the global errno is preserved and access it directly?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406855125
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
 
 Review comment:
   So I actually thought about this, but I thought it would be OK as in order to reach this branch `stat` has to succeed, hence `errno` would be preserved.
   But I double-checked and apparently POSIX states that even in case of success, `errno` might be clobbered.
   So it's safer to call `strerror` before, done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406855188
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
 
 Review comment:
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407051855
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,73 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay = Duration::zero();
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration::zero()) ? Milliseconds(1) : delay * 2;
+        return timeout;
 
 Review comment:
   Why do we need this `timeout` variable? Can we just do something like the below?
   ```
           delay = (delay == Duration::zero()) ? Milliseconds(1) : delay * 2;
           return process::after(delay);
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406854300
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
+          }
+          return process::Break();
+        }
+      }
+    );
+}
+
 
 Review comment:
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r402382128
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -696,13 +696,24 @@ Try<Nothing> remove(const string& hierarchy, const string& cgroup)
   string path = path::join(hierarchy, cgroup);
 
   // Do NOT recursively remove cgroups.
-  Try<Nothing> rmdir = os::rmdir(path, false);
-  if (rmdir.isError()) {
-    return Error(
+  // TODO The retry was added as a fix for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349
+  // where calling rmdir on a seemingly empty cgroup can fail
+  // with EBUSY while some tasks are exiting.
+  auto delay = Milliseconds(1);
+  for (auto retry = 10; ;) {
+    Try<Nothing> rmdir = os::rmdir(path, false);
+    if (!rmdir.isError()) {
+      return rmdir;
+    } else if (retry > 0) {
+      os::sleep(delay);
 
 Review comment:
   > XfsDiskIsolatorProcess::initialize and IOSwitchboard::_connect are examples of how process::loop() can be used to implement non-blocking retries.
   
   > The new remove function can be used in both places you have mentioned above where the blocking cgroups::remove function is called.
   
   Sorry, I don't really see how to do this in a helper function: the problem is that we need to persist state from one call to the next - e.g. the delay and current retry count - and we don't have a scope where those variables would be live.
   
   For example if the function looked like:
   
   ```
   Future<Nothing> remove(const std::string& hierarchy, const std::string& cgroup)
   {
     const auto path = path::join(hierarchy, cgroup);
   
     auto delay = Milliseconds(1);
     auto retry = 10;
     auto result = Future<Nothing>();
   
     return loop(
         [&]() {
           return process::after(delay);
         },
         [&](const Nothing&) -> ControlFlow<Nothing> {
           if (::rmdir(path.c_str()) == 0) {
             return process::Break();
           } else if ((errno == EBUSY) && (retry > 0)) {
             delay *= 2;
             --retry;
             return process::Continue();
           } else {
             // If the `cgroup` still exists in the hierarchy, treat this as
             // an error; otherwise, treat this as a success since the `cgroup`
             // has actually been cleaned up.
             if (os::exists(path::join(hierarchy, cgroup))) {
               result = Failure(
                 "Failed to remove directory '" + path + "': " +
                 os::strerror(errno));
             }
             return process::Break();
           }
         }
       );
   
     return result;
   }
   ```
   
   The `delay`, `retry` and `result` would be reference to local variables, so invalid.
   `XfsDiskIsolatorProcess::initialize` and `IOSwitchboard::_connect` are both member functions of the processes on bahalf of which they're run, so don't have this problem.
   
   TBH I'm not really familiar with libprocess, maybe it would be much easier for you to write the patch?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406855260
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
+          }
+          return process::Break();
 
 Review comment:
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
qianzhangxa commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r406729204
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,70 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay;
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " +
+              os::strerror(errno));
+          }
+          return process::Break();
+        }
+      }
+    );
+}
+
 
 Review comment:
   One more newline here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407163471
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,73 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay = Duration::zero();
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration::zero()) ? Milliseconds(1) : delay * 2;
+        return timeout;
 
 Review comment:
   But it would mean that we always introduce a 1ms delay before trying to remove the cgroup. It's not much, but it'd be best to avoid it?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [mesos] cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #355: Handle EBUSY when destroying a cgroup.
URL: https://github.com/apache/mesos/pull/355#discussion_r407228978
 
 

 ##########
 File path: src/linux/cgroups.cpp
 ##########
 @@ -263,6 +265,73 @@ static Try<Nothing> cloneCpusetCpusMems(
   return Nothing();
 }
 
+
+// Removes a cgroup from a given hierachy.
+// @param   hierarchy  Path to hierarchy root.
+// @param   cgroup     Path of the cgroup relative to the hierarchy root.
+// @return  Some if the operation succeeds.
+//          Error if the operation fails.
+Future<Nothing> remove(const string& hierarchy, const string& cgroup)
+{
+  const string path = path::join(hierarchy, cgroup);
+
+  // We retry on EBUSY as a workaround for kernel bug
+  // https://lkml.org/lkml/2020/1/15/1349 and others which cause rmdir to fail
+  // with EBUSY even though the cgroup appears empty.
+
+  Duration delay = Duration::zero();
+  int retry = 10;
+
+  return loop(
+      [=]() mutable {
+        auto timeout = process::after(delay);
+        delay = (delay == Duration::zero()) ? Milliseconds(1) : delay * 2;
+        return timeout;
+      },
+      [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
+        if (::rmdir(path.c_str()) == 0) {
+          return process::Break();
+        } else if ((errno == EBUSY) && (retry > 0)) {
+          --retry;
+          return process::Continue();
+        } else {
+          // If the `cgroup` still exists in the hierarchy, treat this as
+          // an error; otherwise, treat this as a success since the `cgroup`
+          // has actually been cleaned up.
+          // Save the error string as os::exists may clobber errno.
+          const string error = os::strerror(errno);
+          if (os::exists(path::join(hierarchy, cgroup))) {
+            return Failure(
+              "Failed to remove directory '" + path + "': " + error);
 
 Review comment:
   Done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services