You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mesos.apache.org by ab...@apache.org on 2019/08/26 12:59:42 UTC

[mesos] branch 1.7.x updated (e51ab55 -> 80d42b9)

This is an automated email from the ASF dual-hosted git repository.

abudnik pushed a change to branch 1.7.x
in repository https://gitbox.apache.org/repos/asf/mesos.git.


    from e51ab55  Added MESOS-9836 to the 1.7.3 CHANGELOG.
     new 2d62e8a  Added missing `return` statement in `Slave::statusUpdate`.
     new b7dcc98  Fixed out-of-order processing of terminal status updates in agent.
     new 80d42b9  Added MESOS-9887 to the 1.7.3 CHANGELOG.

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGELOG           |  1 +
 src/slave/slave.cpp | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 src/slave/slave.hpp |  6 +++++
 3 files changed, 68 insertions(+), 3 deletions(-)


[mesos] 02/03: Fixed out-of-order processing of terminal status updates in agent.

Posted by ab...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

abudnik pushed a commit to branch 1.7.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit b7dcc984476904d6d17f7bf699295dfa9ac8a66e
Author: Andrei Budnik <ab...@apache.org>
AuthorDate: Tue Aug 20 19:24:44 2019 +0200

    Fixed out-of-order processing of terminal status updates in agent.
    
    Previously, Mesos agent could send TASK_FAILED status update on
    executor termination while processing of TASK_FINISHED status update
    was in progress. Processing of task status updates involves sending
    requests to the containerizer, which might finish processing of these
    requests out-of-order, e.g. `MesosContainerizer::status`. Also,
    the agent does not overwrite status of the terminal status update once
    it's stored in the `terminatedTasks`. Hence, there was a race condition
    between two terminal status updates.
    
    Note that V1 Executors are not affected by this problem because they
    wait for an acknowledgement of the terminal status update by the agent
    before terminating.
    
    This patch introduces a new data structure `pendingStatusUpdates`,
    which holds a list of status updates that are being processed. This
    data structure allows validating the order of processing of status
    updates by the agent.
    
    Review: https://reviews.apache.org/r/71343
---
 src/slave/slave.cpp | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 src/slave/slave.hpp |  6 ++++++
 2 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index edfe3d0..f10aac2 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -5486,6 +5486,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid)
 
   metrics.valid_status_updates++;
 
+  executor->addPendingTaskStatus(status);
+
   // Before sending update, we need to retrieve the container status
   // if the task reached the executor. For tasks that are queued, we
   // do not need to send the container status and we must
@@ -5697,6 +5699,17 @@ void Slave::___statusUpdate(
   VLOG(1) << "Task status update manager successfully handled status update "
           << update;
 
+  const TaskStatus& status = update.status();
+
+  Executor* executor = nullptr;
+  Framework* framework = getFramework(update.framework_id());
+  if (framework != nullptr) {
+    executor = framework->getExecutor(status.task_id());
+    if (executor != nullptr) {
+      executor->removePendingTaskStatus(status);
+    }
+  }
+
   if (pid == UPID()) {
     return;
   }
@@ -5704,7 +5717,7 @@ void Slave::___statusUpdate(
   StatusUpdateAcknowledgementMessage message;
   message.mutable_framework_id()->MergeFrom(update.framework_id());
   message.mutable_slave_id()->MergeFrom(update.slave_id());
-  message.mutable_task_id()->MergeFrom(update.status().task_id());
+  message.mutable_task_id()->MergeFrom(status.task_id());
   message.set_uuid(update.uuid());
 
   // Task status update manager successfully handled the status update.
@@ -5716,14 +5729,12 @@ void Slave::___statusUpdate(
     send(pid.get(), message);
   } else {
     // Acknowledge the HTTP based executor.
-    Framework* framework = getFramework(update.framework_id());
     if (framework == nullptr) {
       LOG(WARNING) << "Ignoring sending acknowledgement for status update "
                    << update << " of unknown framework";
       return;
     }
 
-    Executor* executor = framework->getExecutor(update.status().task_id());
     if (executor == nullptr) {
       // Refer to the comments in 'statusUpdate()' on when this can
       // happen.
@@ -9861,6 +9872,33 @@ void Executor::recoverTask(const TaskState& state, bool recheckpointTask)
 }
 
 
+void Executor::addPendingTaskStatus(const TaskStatus& status)
+{
+  auto uuid = id::UUID::fromBytes(status.uuid()).get();
+  pendingStatusUpdates[status.task_id()][uuid] = status;
+}
+
+
+void Executor::removePendingTaskStatus(const TaskStatus& status)
+{
+  const TaskID& taskId = status.task_id();
+
+  auto uuid = id::UUID::fromBytes(status.uuid()).get();
+
+  if (!pendingStatusUpdates.contains(taskId) ||
+      !pendingStatusUpdates[taskId].contains(uuid)) {
+    LOG(WARNING) << "Unknown pending status update (uuid: " << uuid << ")";
+    return;
+  }
+
+  pendingStatusUpdates[taskId].erase(uuid);
+
+  if (pendingStatusUpdates[taskId].empty()) {
+    pendingStatusUpdates.erase(taskId);
+  }
+}
+
+
 Try<Nothing> Executor::updateTaskState(const TaskStatus& status)
 {
   bool terminal = protobuf::isTerminalState(status.state());
@@ -9884,6 +9922,24 @@ Try<Nothing> Executor::updateTaskState(const TaskStatus& status)
     task = launchedTasks.at(status.task_id());
 
     if (terminal) {
+      if (pendingStatusUpdates.contains(status.task_id())) {
+        auto statusUpdates = pendingStatusUpdates[status.task_id()].values();
+
+        auto firstTerminal = std::find_if(
+            statusUpdates.begin(),
+            statusUpdates.end(),
+            [](const TaskStatus& status) {
+              return protobuf::isTerminalState(status.state());
+            });
+
+        CHECK(firstTerminal != statusUpdates.end());
+
+        if (firstTerminal->uuid() != status.uuid()) {
+          return Error("Unexpected terminal status update after first status"
+                       " update " + stringify(firstTerminal->state()));
+        }
+      }
+
       launchedTasks.erase(taskId);
     }
   } else if (terminatedTasks.contains(taskId)) {
diff --git a/src/slave/slave.hpp b/src/slave/slave.hpp
index 8e5adc5..c81e2d5 100644
--- a/src/slave/slave.hpp
+++ b/src/slave/slave.hpp
@@ -913,6 +913,9 @@ public:
 
   void recoverTask(const state::TaskState& state, bool recheckpointTask);
 
+  void addPendingTaskStatus(const TaskStatus& status);
+  void removePendingTaskStatus(const TaskStatus& status);
+
   Try<Nothing> updateTaskState(const TaskStatus& status);
 
   // Returns true if there are any queued/launched/terminated tasks.
@@ -1054,6 +1057,9 @@ public:
   // non-terminal tasks.
   Option<mesos::slave::ContainerTermination> pendingTermination;
 
+  // Task status updates that are being processed by the agent.
+  hashmap<TaskID, LinkedHashMap<id::UUID, TaskStatus>> pendingStatusUpdates;
+
 private:
   Executor(const Executor&) = delete;
   Executor& operator=(const Executor&) = delete;


[mesos] 01/03: Added missing `return` statement in `Slave::statusUpdate`.

Posted by ab...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

abudnik pushed a commit to branch 1.7.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 2d62e8ae0ef94f78c9b32be258a08d1e6e2382df
Author: Andrei Budnik <ab...@apache.org>
AuthorDate: Fri Aug 23 14:36:18 2019 +0200

    Added missing `return` statement in `Slave::statusUpdate`.
    
    Previously, if `statusUpdate` was called for a pending task, it would
    forward the status update and then continue executing `statusUpdate`,
    which then checks if there is an executor that is aware of this task.
    Given that a pending task is not known to any executor, it would always
    handle it by forwarding status update one more time. This patch adds
    missing `return` statement, which fixes the issue.
    
    Review: https://reviews.apache.org/r/71361
---
 src/slave/slave.cpp | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 1c33579..edfe3d0 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -5418,6 +5418,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid)
 
     taskStatusUpdateManager->update(update, info.id())
       .onAny(defer(self(), &Slave::___statusUpdate, lambda::_1, update, pid));
+
+    return;
   }
 
   Executor* executor = framework->getExecutor(status.task_id());


[mesos] 03/03: Added MESOS-9887 to the 1.7.3 CHANGELOG.

Posted by ab...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

abudnik pushed a commit to branch 1.7.x
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 80d42b9a2c9223665a82bbaaf3cbc222a094e2ef
Author: Andrei Budnik <ab...@apache.org>
AuthorDate: Mon Aug 26 14:58:45 2019 +0200

    Added MESOS-9887 to the 1.7.3 CHANGELOG.
---
 CHANGELOG | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG b/CHANGELOG
index 06c88db..1178228 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -29,6 +29,7 @@ Release Notes - Mesos - Version 1.7.3 (WIP)
   * [MESOS-9856] - REVIVE call with specified role(s) clears filters for all roles of a framework.
   * [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct.
   * [MESOS-9870] - Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master.
+  * [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor.
   * [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed.
   * [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent.