Posted to commits@mesos.apache.org by ch...@apache.org on 2018/11/28 23:28:16 UTC

[mesos] branch 1.5.x updated (9c28b26 -> 4a8d3b4)

This is an automated email from the ASF dual-hosted git repository.

chhsiao pushed a change to branch 1.5.x
in repository https://gitbox.apache.org/repos/asf/mesos.git.


    from 9c28b26  Added MESOS-9317 to the 1.5.3 CHANGELOG.
     new d27d057  Fixed master crash when executors send messages to recovered frameworks.
     new 4a8d3b4  Added MESOS-9419 to the 1.5.3 CHANGELOG.

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGELOG             |  1 +
 src/master/master.cpp | 10 ++++++++++
 src/master/master.hpp | 21 ++++++++++++++++++---
 3 files changed, 29 insertions(+), 3 deletions(-)


[mesos] 01/02: Fixed master crash when executors send messages to recovered frameworks.

Posted by ch...@apache.org.

chhsiao pushed a commit to branch 1.5.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit d27d057b7769eafa3e967763a073a2841520e050
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
AuthorDate: Mon Nov 26 20:12:36 2018 -0800

    Fixed master crash when executors send messages to recovered frameworks.
    
    The `Framework::send` function assumes that either `http` or `pid` is
    set, which is not true for a framework that has not yet reregistered
    but was recovered from a reregistered agent. As a result, the master
    would crash when a recovered executor tried to send a message to such
    a framework (see MESOS-9419). This patch fixes the crash.
    
    Review: https://reviews.apache.org/r/69451
---
 src/master/master.cpp | 10 ++++++++++
 src/master/master.hpp | 21 ++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/src/master/master.cpp b/src/master/master.cpp
index 0229d1b..4626f16 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -6007,6 +6007,16 @@ void Master::executorMessage(
     return;
   }
 
+  if (!framework->connected()) {
+    LOG(WARNING) << "Not forwarding executor message for executor '"
+                 << executorId << "' of framework " << frameworkId
+                 << " on agent " << *slave
+                 << " because the framework is disconnected";
+
+    metrics->invalid_executor_to_framework_messages++;
+    return;
+  }
+
   ExecutorToFrameworkMessage message;
   message.mutable_slave_id()->MergeFrom(slaveId);
   message.mutable_framework_id()->MergeFrom(frameworkId);
diff --git a/src/master/master.hpp b/src/master/master.hpp
index c1db276..0705fcd 100644
--- a/src/master/master.hpp
+++ b/src/master/master.hpp
@@ -2334,8 +2334,21 @@ struct Framework
   void send(const Message& message)
   {
     if (!connected()) {
-      LOG(WARNING) << "Master attempted to send message to disconnected"
+      LOG(WARNING) << "Master attempting to send message to disconnected"
                    << " framework " << *this;
+
+      // NOTE: We proceed here without returning to support the case where a
+      // "disconnected" framework is still talking to the master and the master
+      // wants to shut it down by sending a `FrameworkErrorMessage`. This can
+      // occur in a one-way network partition where the master -> framework link
+      // is broken but the framework -> master link remains intact. Note that we
+      // have no periodic heartbeats between the master and pid-based
+      // schedulers.
+      //
+      // TODO(chhsiao): Update the `FrameworkErrorMessage` call-sites that rely
+      // on the lack of a `return` here to directly call `process::send` so that
+      // this function doesn't need to deal with the special case. Then we can
+      // check that one of `http` or `pid` is set if the framework is connected.
     }
 
     if (http.isSome()) {
@@ -2343,9 +2356,11 @@ struct Framework
         LOG(WARNING) << "Unable to send event to framework " << *this << ":"
                      << " connection closed";
       }
-    } else {
-      CHECK_SOME(pid);
+    } else if (pid.isSome()) {
       master->send(pid.get(), message);
+    } else {
+      LOG(WARNING) << "Unable to send message to framework " << *this << ":"
+                   << " framework is recovered but has not reregistered";
     }
   }
 

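The two changes in this commit can be read as a pair of guards: `Master::executorMessage` now drops the message early when the framework is not connected, and `Framework::send` no longer aborts when neither `http` nor `pid` is set. The sketch below models that branch order in isolation. It is not Mesos code: `FrameworkSketch`, `SendOutcome`, `dispatch`, and `shouldForwardExecutorMessage` are hypothetical names introduced only for illustration, and the real functions log and send messages rather than return a value.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the master's view of a framework.
// This is NOT the real `Framework` struct from src/master/master.hpp.
struct FrameworkSketch {
  std::optional<std::string> http;  // Set for HTTP-based schedulers.
  std::optional<std::string> pid;   // Set for pid-based schedulers.
  bool isConnected = false;         // False for a recovered framework.

  bool connected() const { return isConnected; }
};

// Hypothetical result type; the real `Framework::send` returns void.
enum class SendOutcome {
  SentViaHttp,
  SentViaPid,
  DroppedNotReregistered
};

// Mirrors the branch order of the patched `Framework::send`:
// try `http`, then `pid`, and otherwise drop the message instead of
// failing a `CHECK_SOME(pid)` as the old code did.
inline SendOutcome dispatch(const FrameworkSketch& framework) {
  if (framework.http.has_value()) {
    return SendOutcome::SentViaHttp;
  }
  if (framework.pid.has_value()) {
    return SendOutcome::SentViaPid;
  }
  return SendOutcome::DroppedNotReregistered;
}

// Mirrors the guard added to `Master::executorMessage`: executor
// messages are forwarded only to connected frameworks.
inline bool shouldForwardExecutorMessage(const FrameworkSketch& framework) {
  return framework.connected();
}
```

In this model, a default-constructed `FrameworkSketch` plays the role of a framework that was recovered from a reregistered agent but has not reregistered itself: `shouldForwardExecutorMessage` returns false for it, and `dispatch` returns `SendOutcome::DroppedNotReregistered` rather than aborting.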

[mesos] 02/02: Added MESOS-9419 to the 1.5.3 CHANGELOG.

Posted by ch...@apache.org.

chhsiao pushed a commit to branch 1.5.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 4a8d3b4a17c4743ad7827d047ffc2e6d9943778a
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
AuthorDate: Wed Nov 28 10:20:29 2018 -0800

    Added MESOS-9419 to the 1.5.3 CHANGELOG.
---
 CHANGELOG | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG b/CHANGELOG
index 8798f91..43a6782 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -6,6 +6,7 @@ Release Notes - Mesos - Version 1.5.3 (WIP)
   * [MESOS-7474] - Mesos fetcher cache doesn't retry when missed.
   * [MESOS-9317] - Some master endpoints do not handle failed authorization properly.
   * [MESOS-9332] - Nested container should run as the same user of its parent container by default.
+  * [MESOS-9419] - Executor to framework message crashes master if framework has not re-registered.
 
 
 Release Notes - Mesos - Version 1.5.2