Posted to commits@mesos.apache.org by ch...@apache.org on 2018/11/28 23:27:54 UTC

[mesos] branch 1.6.x updated (efbfc9d -> 5cec448)

This is an automated email from the ASF dual-hosted git repository.

chhsiao pushed a change to branch 1.6.x
in repository https://gitbox.apache.org/repos/asf/mesos.git.


    from efbfc9d  Added MESOS-9418 to the 1.6.2 CHANGELOG.
     new 2d7cb6b  Fixed master crash when executors send messages to recovered frameworks.
     new 5cec448  Added MESOS-9419 to the 1.6.2 CHANGELOG.

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGELOG             |  1 +
 src/master/master.cpp | 10 ++++++++++
 src/master/master.hpp | 21 ++++++++++++++++++---
 3 files changed, 29 insertions(+), 3 deletions(-)


[mesos] 02/02: Added MESOS-9419 to the 1.6.2 CHANGELOG.

Posted by ch...@apache.org.

chhsiao pushed a commit to branch 1.6.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 5cec44873c0fa6451528d8c50b40fb1afdac4079
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
AuthorDate: Wed Nov 28 10:18:52 2018 -0800

    Added MESOS-9419 to the 1.6.2 CHANGELOG.
---
 CHANGELOG | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG b/CHANGELOG
index d70e78f..8245d91 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -36,6 +36,7 @@ Release Notes - Mesos - Version 1.6.2 (WIP)
   * [MESOS-9332] - Nested container should run as the same user of its parent container by default.
   * [MESOS-9334] - Container stuck at ISOLATING state due to libevent poll never returns.
   * [MESOS-9418] - Add support for the `Discard` blkio operation type.
+  * [MESOS-9419] - Executor to framework message crashes master if framework has not re-registered.
 
 ** Improvement
   * [MESOS-9305] - Create cgoup recursively to workaround systemd deleting cgroups_root.


[mesos] 01/02: Fixed master crash when executors send messages to recovered frameworks.

Posted by ch...@apache.org.

chhsiao pushed a commit to branch 1.6.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 2d7cb6b60d6cdd3c1dbe1470f0afa044ae78c10c
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
AuthorDate: Mon Nov 26 20:12:36 2018 -0800

    Fixed master crash when executors send messages to recovered frameworks.
    
    The `Framework::send` function assumes that either `http` or `pid` is
    set, which is not true for a framework that hasn't reregistered yet
    but was recovered from a reregistered agent. As a result, the master would
    crash when a recovered executor tries to send a message to such a
    framework (see MESOS-9419). This patch fixes this crash bug.
    
    Review: https://reviews.apache.org/r/69451
---
 src/master/master.cpp | 10 ++++++++++
 src/master/master.hpp | 21 ++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)
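
The failure mode described in the commit message above can be sketched outside of Mesos as follows. This is a minimal illustration, not the actual master code: `FrameworkSketch`, `MasterSketch`, `SendResult`, and the string-typed endpoints are hypothetical stand-ins, and it assumes (as the guard in the patch implies) that a recovered framework reports itself as not connected.

```cpp
#include <optional>
#include <string>

// Hypothetical outcome type, used only so the sketch can be checked.
enum class SendResult { SentHttp, SentPid, DroppedNoEndpoint };

// Hypothetical stand-in for the master's `Framework` struct.
struct FrameworkSketch {
  std::optional<std::string> http;  // Set for HTTP schedulers.
  std::optional<std::string> pid;   // Set for pid-based schedulers.
  bool isConnected = false;         // Assumed false for recovered frameworks.

  bool connected() const { return isConnected; }

  // Mirrors the patched three-way branch: having neither endpoint set is a
  // legitimate state (recovered but not yet reregistered), so the message
  // is dropped instead of tripping a CHECK.
  SendResult send(const std::string& message) const {
    (void)message;  // The real code would forward the message here.
    if (http.has_value()) {
      return SendResult::SentHttp;
    } else if (pid.has_value()) {
      return SendResult::SentPid;
    }
    return SendResult::DroppedNoEndpoint;
  }
};

// Hypothetical stand-in for the master; only the new guard is modeled.
struct MasterSketch {
  int invalidExecutorToFrameworkMessages = 0;

  // Returns true if the message was forwarded to the framework.
  bool executorMessage(const FrameworkSketch& framework,
                       const std::string& data) {
    if (!framework.connected()) {
      // Count the message as invalid and stop before calling `send`.
      ++invalidExecutorToFrameworkMessages;
      return false;
    }
    return framework.send(data) != SendResult::DroppedNoEndpoint;
  }
};
```

In this sketch, a default-constructed `FrameworkSketch` plays the role of a recovered framework: `executorMessage` rejects it at the guard and increments the counter, and a direct call to `send` falls through to the final branch rather than aborting.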

diff --git a/src/master/master.cpp b/src/master/master.cpp
index 3f923a7..67fa5a8 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -6444,6 +6444,16 @@ void Master::executorMessage(
     return;
   }
 
+  if (!framework->connected()) {
+    LOG(WARNING) << "Not forwarding executor message for executor '"
+                 << executorId << "' of framework " << frameworkId
+                 << " on agent " << *slave
+                 << " because the framework is disconnected";
+
+    metrics->invalid_executor_to_framework_messages++;
+    return;
+  }
+
   ExecutorToFrameworkMessage message;
   *message.mutable_slave_id() =
     std::move(*executorToFrameworkMessage.mutable_slave_id());
diff --git a/src/master/master.hpp b/src/master/master.hpp
index f01b743..52c508a 100644
--- a/src/master/master.hpp
+++ b/src/master/master.hpp
@@ -2387,8 +2387,21 @@ struct Framework
   void send(const Message& message)
   {
     if (!connected()) {
-      LOG(WARNING) << "Master attempted to send message to disconnected"
+      LOG(WARNING) << "Master attempting to send message to disconnected"
                    << " framework " << *this;
+
+      // NOTE: We proceed here without returning to support the case where a
+      // "disconnected" framework is still talking to the master and the master
+      // wants to shut it down by sending a `FrameworkErrorMessage`. This can
+      // occur in a one-way network partition where the master -> framework link
+      // is broken but the framework -> master link remains intact. Note that we
+      // have no periodic heartbeats between the master and pid-based
+      // schedulers.
+      //
+      // TODO(chhsiao): Update the `FrameworkErrorMessage` call-sites that rely
+      // on the lack of a `return` here to directly call `process::send` so that
+      // this function doesn't need to deal with the special case. Then we can
+      // check that one of `http` or `pid` is set if the framework is connected.
     }
 
     if (http.isSome()) {
@@ -2396,9 +2409,11 @@ struct Framework
         LOG(WARNING) << "Unable to send event to framework " << *this << ":"
                      << " connection closed";
       }
-    } else {
-      CHECK_SOME(pid);
+    } else if (pid.isSome()) {
       master->send(pid.get(), message);
+    } else {
+      LOG(WARNING) << "Unable to send message to framework " << *this << ":"
+                   << " framework is recovered but has not reregistered";
     }
   }