Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2018/10/22 09:56:00 UTC

[jira] [Comment Edited] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

    [ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658795#comment-16658795 ] 

Qian Zhang edited comment on MESOS-9334 at 10/22/18 9:55 AM:
-------------------------------------------------------------

I added some log messages to `libevent_poll.cpp` (see the diff below for details) and reproduced this issue a couple of times.
{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
 };
 
 
-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
 {
   Poll* poll = reinterpret_cast<Poll*>(arg);
 
+  LOG(INFO) << "==========pollCallback starts with fd " << fd
+            << " and with poll " << poll << "==========";
+
   if (poll->promise.future().hasDiscard()) {
+    LOG(INFO) << "==========pollCallback discards with fd "
+              << fd << "==========";
+
     poll->promise.discard();
   } else {
     // Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
     short events =
       ((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);
 
+    LOG(INFO) << "==========pollCallback sets promise with fd " << fd
+              << " and with events " << events << "==========";
+
     poll->promise.set(events);
   }
 
   // Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
   // which makes the event non-pending.
   delete poll;
+
+  LOG(INFO) << "==========pollCallback ends with fd " << fd << "==========";
 }
 
 
 void pollDiscard(const std::weak_ptr<event>& ev, short events)
 {
+  LOG(INFO) << "==========pollDiscard is called==========";
+
   // Discarding inside the event loop prevents `pollCallback()` from being
   // called twice if the future is discarded.
   run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)
 
   internal::Poll* poll = new internal::Poll();
 
+  LOG(INFO) << "==========libevent starts polling with fd " << fd
+            << " and with poll " << poll << "==========";
+
   Future<short> future = poll->promise.future();
 
   // Convert io::READ / io::WRITE to libevent specific values of these
{code}
Here is what I found in the agent log when this issue occurred (fd 48 is the stderr file descriptor of `NetworkCniIsolatorSetup`):
{code:java}
I1021 15:57:45.000000 2116 libevent_poll.cpp:96] ==========libevent starts polling with fd 48 and with poll 0x7f60df029eb0==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:41] ==========pollCallback starts with fd 48 and with poll 0x7f60e6e56c70==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:45] ==========pollCallback discards with fd 48==========
I1021 15:57:45.000000 2117 libevent_poll.cpp:65] ==========pollCallback ends with fd 48==========
{code}
We can see that libevent started polling fd 48 with the poll object at address 0x7f60df029eb0, but when `pollCallback` was called for fd 48, it ran with a different poll object (0x7f60e6e56c70) whose future had already been discarded (see the third log line)!
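To make the consequence concrete, here is a minimal caller-side sketch (a hypothetical illustration; `waitForStderr` is a made-up name, not the agent's actual code path). The `Future<short>` returned by `process::io::poll()` for the new registration can only complete when `pollCallback` sets the promise of that poll object; since the callback ran against the stale object instead, the future stays pending forever.
{code:java}
// Hypothetical illustration only -- not the agent's actual code path.
#include <process/future.hpp>
#include <process/io.hpp>

using process::Future;

// e.g. stderrFd == 48 in the log above.
Future<short> waitForStderr(int stderrFd)
{
  // If the readiness notification for this fd is consumed by a stale poll
  // object left over from a previous registration, the promise belonging to
  // this call is never set and the returned future never completes.
  return process::io::poll(stderrFd, process::io::READ);
}
{code}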

And when I searched for 0x7f60e6e56c70 in the agent log, I found:
{code:java}
I1021 15:57:22.000000 2115 memory.cpp:478] Started listening for OOM events for container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.000000 2115 libevent_poll.cpp:96] ==========libevent starts polling with fd 48 and with poll 0x7f60e6e56c70==========
{code}
So the poll object 0x7f60e6e56c70 had been created 23 seconds earlier for the same file descriptor (fd 48), which was being used at that time to listen for OOM events for another container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5). That container was destroyed at 15:57:44, right before the agent started waiting on the stderr of `NetworkCniIsolatorSetup` (15:57:45).
{code:java}
I1021 15:57:44.000000  2114 containerizer.cpp:2459] Destroying container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}
I reproduced this issue a couple of times, and the observations from the agent log were the same as the above each time.
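As a side note, here is a standalone toy sketch of the suspected mechanism (my own illustration with plain libevent, not Mesos or libprocess code): an `EV_READ` event is left pending for an fd that is then closed, and the fd number is reused by a new registration. Depending on the libevent backend, the stale registration's callback may fire with its old `arg`, or neither callback may fire, both of which are consistent with the behaviour observed above.
{code:java}
// Toy reproduction sketch of the suspected fd-reuse problem.
// This is my own illustration, not Mesos or libprocess code.
#include <event2/event.h>

#include <sys/time.h>
#include <unistd.h>

#include <cstdio>

static void onReadable(evutil_socket_t fd, short /*what*/, void* arg)
{
  // `arg` tells us which registration this callback was created for.
  printf("callback for '%s' fired on fd %d\n",
         static_cast<const char*>(arg), static_cast<int>(fd));
}

int main()
{
  event_base* base = event_base_new();

  // Registration 1: watch the read end of a pipe, then close that fd
  // *without* event_del()/event_free(), leaving a stale pending event
  // (analogous to the discarded OOM-listener poll above).
  int pipe1[2];
  pipe(pipe1);
  event* stale = event_new(
      base, pipe1[0], EV_READ, onReadable, const_cast<char*>("stale poll"));
  event_add(stale, nullptr);
  close(pipe1[0]);

  // Registration 2: a new pipe; its read end will typically reuse the same
  // fd number (analogous to the NetworkCniIsolatorSetup stderr fd).
  int pipe2[2];
  pipe(pipe2);
  printf("old fd %d, new fd %d\n", pipe1[0], pipe2[0]);
  event* fresh = event_new(
      base, pipe2[0], EV_READ, onReadable, const_cast<char*>("new poll"));
  event_add(fresh, nullptr);

  // Make the new fd readable and run the loop briefly; depending on the
  // backend, the stale registration's callback may fire instead of the new
  // one's, or neither may fire (i.e. the new poll hangs).
  write(pipe2[1], "x", 1);
  timeval timeout{2, 0};
  event_base_loopexit(base, &timeout);
  event_base_dispatch(base);

  event_free(stale);
  event_free(fresh);
  event_base_free(base);
  return 0;
}
{code}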



> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
>                 Key: MESOS-9334
>                 URL: https://issues.apache.org/jira/browse/MESOS-9334
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Qian Zhang
>            Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted '/proc/5244/ns/net' to '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
> In the above logs, the state of container `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` was transitioned to `ISOLATING` at 09:13:23, but it did not transition to any other state until it was destroyed due to the executor registration timeout (10 minutes). And the destroy can never complete, since it needs to wait for the container to finish isolating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)