Posted to commits@mesos.apache.org by me...@apache.org on 2016/07/31 18:07:17 UTC

[1/2] mesos git commit: Added agent and scheduler authentication backoff.

Repository: mesos
Updated Branches:
  refs/heads/0.28.x f5e308877 -> d219d5057


Added agent and scheduler authentication backoff.

The backoff follows the existing pattern used during agent registration:
we back off for a random time drawn from an interval of increasing
length, capped by `AUTHENTICATION_RETRY_INTERVAL_MAX`.
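
As an illustration of the scheme described above, here is a minimal standalone sketch. It is not the code from this commit: the names `backoffUpperBound` and `nextBackoff` and the `Seconds` alias are hypothetical, and it assumes `std::chrono` and `<random>` in place of the Mesos `Duration` type and `::random()` used in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical stand-in for the Mesos `Duration` type.
using Seconds = std::chrono::duration<double>;

// Upper bound of the retry interval: min(b * 2^N, cap), where `factor`
// is b and `failedAttempts` is N, the number of failures so far (>= 1).
Seconds backoffUpperBound(Seconds factor, Seconds cap, uint64_t failedAttempts)
{
  assert(failedAttempts >= 1);

  // Double step by step and stop once the cap is reached, so that a
  // large N cannot overflow. The iteration limit guards a zero factor.
  const uint64_t steps = std::min<uint64_t>(failedAttempts, 64);
  Seconds bound = factor;
  for (uint64_t i = 0; i < steps && bound < cap; ++i) {
    bound *= 2;
  }

  return std::min(bound, cap);
}

// Picks a uniformly distributed delay between 0 and the upper bound.
template <typename Generator>
Seconds nextBackoff(
    Seconds factor,
    Seconds cap,
    uint64_t failedAttempts,
    Generator& generator)
{
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  return backoffUpperBound(factor, cap, failedAttempts) * unit(generator);
}
```

With the defaults introduced by this commit (a factor of 1secs and a cap of 1mins), the upper bounds under this sketch would be 2s, 4s, 8s, 16s and 32s for the first five failures, and 60s from the sixth failure onward.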

Review: https://reviews.apache.org/r/49308/


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/8046860d
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/8046860d
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/8046860d

Branch: refs/heads/0.28.x
Commit: 8046860d75b3a8229326f270667ae251bf1a9d51
Parents: f5e3088
Author: Benjamin Bannier <be...@mesosphere.io>
Authored: Wed Jul 6 11:09:04 2016 -0700
Committer: Adam B <ad...@mesosphere.io>
Committed: Sun Jul 31 03:12:41 2016 -0700

----------------------------------------------------------------------
 docs/configuration.md              | 14 ++++++++++++++
 docs/endpoints/slave/state.json.md |  1 +
 docs/endpoints/slave/state.md      |  1 +
 src/sched/constants.cpp            |  4 ++++
 src/sched/constants.hpp            |  8 ++++++++
 src/sched/flags.hpp                | 10 ++++++++++
 src/sched/sched.cpp                | 33 ++++++++++++++++++++++++++++-----
 src/slave/constants.cpp            |  2 ++
 src/slave/constants.hpp            |  7 +++++++
 src/slave/flags.cpp                | 14 ++++++++++++--
 src/slave/flags.hpp                |  1 +
 src/slave/slave.cpp                | 29 +++++++++++++++++++++++------
 src/slave/slave.hpp                |  3 +++
 13 files changed, 114 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index 305ba2c..ad8b950 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -883,6 +883,20 @@ load an alternate authenticatee module using <code>--modules</code>. (default: c
 </tr>
 <tr>
   <td>
+    --authentication_backoff_factor=VALUE
+  </td>
+  <td>
+After a failed authentication the agent picks a random amount of time between
+<code>[0, b]</code>, where <code>b = authentication_backoff_factor</code>, to
+authenticate with a new master. Subsequent retries are exponentially backed
+off based on this interval (e.g., 1st retry uses a random value between
+<code>[0, b * 2^1]</code>, 2nd retry between <code>[0, b * 2^2]</code>, 3rd
+retry between <code>[0, b * 2^3]</code>, etc) up to a maximum of 1mins
+(default: 1secs)
+  </td>
+</tr>
+<tr>
+  <td>
     --[no]-cgroups_cpu_enable_pids_and_tids_count
   </td>
   <td>

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/docs/endpoints/slave/state.json.md
----------------------------------------------------------------------
diff --git a/docs/endpoints/slave/state.json.md b/docs/endpoints/slave/state.json.md
index 184a65f..0ab775d 100644
--- a/docs/endpoints/slave/state.json.md
+++ b/docs/endpoints/slave/state.json.md
@@ -42,6 +42,7 @@ Example (**Note**: this is not exhaustive):
     "frameworks" : [],
     "completed_frameworks" : [],
     "flags" : {
+         "authentication_backoff_factor": "1secs",
          "gc_disk_headroom" : "0.1",
          "isolation" : "posix/cpu,posix/mem",
          "containerizers" : "mesos",

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/docs/endpoints/slave/state.md
----------------------------------------------------------------------
diff --git a/docs/endpoints/slave/state.md b/docs/endpoints/slave/state.md
index 5618912..76c2c09 100644
--- a/docs/endpoints/slave/state.md
+++ b/docs/endpoints/slave/state.md
@@ -42,6 +42,7 @@ Example (**Note**: this is not exhaustive):
     "frameworks" : [],
     "completed_frameworks" : [],
     "flags" : {
+         "authentication_backoff_factor": "1secs",
          "gc_disk_headroom" : "0.1",
          "isolation" : "posix/cpu,posix/mem",
          "containerizers" : "mesos",

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/sched/constants.cpp
----------------------------------------------------------------------
diff --git a/src/sched/constants.cpp b/src/sched/constants.cpp
index c00b8ba..c4fc80f 100644
--- a/src/sched/constants.cpp
+++ b/src/sched/constants.cpp
@@ -29,6 +29,10 @@ const Duration DEFAULT_REGISTRATION_BACKOFF_FACTOR = Seconds(2);
 
 const Duration REGISTRATION_RETRY_INTERVAL_MAX = Minutes(1);
 
+const Duration AUTHENTICATION_RETRY_INTERVAL_MAX = Minutes(1);
+
+const Duration DEFAULT_AUTHENTICATION_BACKOFF_FACTOR = Seconds(1);
+
 const std::string DEFAULT_AUTHENTICATEE = "crammd5";
 
 } // namespace scheduler {

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/sched/constants.hpp
----------------------------------------------------------------------
diff --git a/src/sched/constants.hpp b/src/sched/constants.hpp
index 523c6d9..6038fd1 100644
--- a/src/sched/constants.hpp
+++ b/src/sched/constants.hpp
@@ -31,6 +31,14 @@ extern const Duration DEFAULT_REGISTRATION_BACKOFF_FACTOR;
 // registration.
 extern const Duration REGISTRATION_RETRY_INTERVAL_MAX;
 
+// The maximum interval the scheduler driver waits before retrying
+// authentication.
+extern const Duration AUTHENTICATION_RETRY_INTERVAL_MAX;
+
+// Default backoff interval used by the scheduler to wait after failed
+// authentication.
+extern const Duration DEFAULT_AUTHENTICATION_BACKOFF_FACTOR;
+
 // Name of the default, CRAM-MD5 authenticatee.
 extern const std::string DEFAULT_AUTHENTICATEE;
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/sched/flags.hpp
----------------------------------------------------------------------
diff --git a/src/sched/flags.hpp b/src/sched/flags.hpp
index 57dbab8..318dbc8 100644
--- a/src/sched/flags.hpp
+++ b/src/sched/flags.hpp
@@ -36,6 +36,15 @@ class Flags : public logging::Flags
 public:
   Flags()
   {
+    add(&Flags::authentication_backoff_factor,
+        "authentication_backoff_factor",
+        "Scheduler driver authentication retries are exponentially backed\n"
+        "off based on 'b', the authentication backoff factor (e.g., 1st retry\n"
+        "uses a random value between `[0, b * 2^1]`, 2nd retry between\n"
+        "`[0, b * 2^2]`, 3rd retry between `[0, b * 2^3]`, etc) up to a\n"
+        "maximum of " + stringify(AUTHENTICATION_RETRY_INTERVAL_MAX),
+        DEFAULT_AUTHENTICATION_BACKOFF_FACTOR);
+
     add(&Flags::registration_backoff_factor,
         "registration_backoff_factor",
         "Scheduler driver (re-)registration retries are exponentially backed\n"
@@ -101,6 +110,7 @@ public:
         DEFAULT_AUTHENTICATEE);
   }
 
+  Duration authentication_backoff_factor;
   Duration registration_backoff_factor;
   Option<Modules> modules;
   std::string authenticatee;

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/sched/sched.cpp
----------------------------------------------------------------------
diff --git a/src/sched/sched.cpp b/src/sched/sched.cpp
index e39f94f..25b0622 100644
--- a/src/sched/sched.cpp
+++ b/src/sched/sched.cpp
@@ -23,12 +23,13 @@
 
 #include <arpa/inet.h>
 
+#include <cmath>
 #include <iostream>
 #include <map>
 #include <memory>
 #include <mutex>
-#include <string>
 #include <sstream>
+#include <string>
 
 #include <mesos/mesos.hpp>
 #include <mesos/module.hpp>
@@ -217,7 +218,8 @@ public:
       authenticatee(NULL),
       authenticating(None()),
       authenticated(false),
-      reauthenticate(false)
+      reauthenticate(false),
+      failedAuthentications(0)
   {
     LOG(INFO) << "Version: " << MESOS_VERSION;
   }
@@ -328,8 +330,8 @@ protected:
 
       if (credential.isSome()) {
         // Authenticate with the master.
-        // TODO(vinod): Do a backoff for authentication similar to what
-        // we do for registration.
+        // TODO(adam-mesos): Consider adding an initial delay like we do for
+        // slave registration, to combat thundering herds on master failover.
         authenticate();
       } else {
         // Proceed with registration without authentication.
@@ -456,8 +458,24 @@ protected:
       authenticating = None();
       reauthenticate = false;
 
+      ++failedAuthentications;
+
+      // Backoff.
+      // The backoff is a random duration in the interval [0, b * 2^N)
+      // where `b = authentication_backoff_factor` and `N` the number
+      // of failed authentication attempts. It is capped by
+      // `AUTHENTICATION_RETRY_INTERVAL_MAX`.
+      Duration backoff = flags.authentication_backoff_factor *
+                         std::pow(2, failedAuthentications);
+      backoff = std::min(backoff, scheduler::AUTHENTICATION_RETRY_INTERVAL_MAX);
+
+      // Determine the delay for next attempt by picking a random
+      // duration between 0 and 'backoff'.
+      // TODO(vinod): Use random numbers from <random> header.
+      backoff *= (double) ::random() / RAND_MAX;
+
       // TODO(vinod): Add a limit on number of retries.
-      dispatch(self(), &Self::authenticate); // Retry.
+      delay(backoff, self(), &Self::authenticate);
       return;
     }
 
@@ -474,6 +492,8 @@ protected:
     authenticated = true;
     authenticating = None();
 
+    failedAuthentications = 0;
+
     doReliableRegistration(flags.registration_backoff_factor);
   }
 
@@ -1620,6 +1640,9 @@ private:
 
   // Indicates if a new authentication attempt should be enforced.
   bool reauthenticate;
+
+  // Indicates the number of failed authentication attempts.
+  uint64_t failedAuthentications;
 };
 
 } // namespace internal {

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/slave/constants.cpp
----------------------------------------------------------------------
diff --git a/src/slave/constants.cpp b/src/slave/constants.cpp
index 0f0d8e4..8c4597e 100644
--- a/src/slave/constants.cpp
+++ b/src/slave/constants.cpp
@@ -51,6 +51,8 @@ const Duration DOCKER_INSPECT_DELAY = Seconds(1);
 // TODO(tnachen): Make this a flag.
 const Duration DOCKER_VERSION_WAIT_TIMEOUT = Seconds(5);
 const std::string DEFAULT_AUTHENTICATEE = "crammd5";
+const Duration AUTHENTICATION_RETRY_INTERVAL_MAX = Minutes(1);
+const Duration DEFAULT_AUTHENTICATION_BACKOFF_FACTOR = Seconds(1);
 const std::string COMMAND_EXECUTOR_ROOTFS_CONTAINER_PATH = ".rootfs";
 
 Duration DEFAULT_MASTER_PING_TIMEOUT()

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/slave/constants.hpp
----------------------------------------------------------------------
diff --git a/src/slave/constants.hpp b/src/slave/constants.hpp
index d5ded8a..f3a6888 100644
--- a/src/slave/constants.hpp
+++ b/src/slave/constants.hpp
@@ -105,6 +105,13 @@ extern const Duration DOCKER_VERSION_WAIT_TIMEOUT;
 // Name of the default, CRAM-MD5 authenticatee.
 extern const std::string DEFAULT_AUTHENTICATEE;
 
+// The maximum interval the slave waits before retrying authentication.
+extern const Duration AUTHENTICATION_RETRY_INTERVAL_MAX;
+
+// Default backoff interval used by the slave to wait after failed
+// authentication.
+extern const Duration DEFAULT_AUTHENTICATION_BACKOFF_FACTOR;
+
 // Default maximum storage space to be used by the fetcher cache.
 const Bytes DEFAULT_FETCHER_CACHE_SIZE = Gigabytes(2);
 

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/slave/flags.cpp
----------------------------------------------------------------------
diff --git a/src/slave/flags.cpp b/src/slave/flags.cpp
index 6e3fd69..dbcb1e9 100644
--- a/src/slave/flags.cpp
+++ b/src/slave/flags.cpp
@@ -215,10 +215,20 @@ mesos::internal::slave::Flags::Flags()
       "Subsequent retries are exponentially backed off based on this\n"
       "interval (e.g., 1st retry uses a random value between `[0, b * 2^1]`,\n"
       "2nd retry between `[0, b * 2^2]`, 3rd retry between `[0, b * 2^3]`,\n"
-      "etc) up to a maximum of " +
-        stringify(REGISTER_RETRY_INTERVAL_MAX),
+      "etc) up to a maximum of " + stringify(REGISTER_RETRY_INTERVAL_MAX),
       DEFAULT_REGISTRATION_BACKOFF_FACTOR);
 
+  add(&Flags::authentication_backoff_factor,
+      "authentication_backoff_factor",
+      "After a failed authentication the agent picks a random amount of time\n"
+      "between `[0, b]`, where `b = authentication_backoff_factor`, to\n"
+      "authenticate with a new master. Subsequent retries are exponentially\n"
+      "backed off based on this interval (e.g., 1st retry uses a random\n"
+      "value between `[0, b * 2^1]`, 2nd retry between `[0, b * 2^2]`, 3rd\n"
+      "retry between `[0, b * 2^3]`, etc) up to a maximum of " +
+          stringify(AUTHENTICATION_RETRY_INTERVAL_MAX),
+      DEFAULT_AUTHENTICATION_BACKOFF_FACTOR);
+
   add(&Flags::executor_environment_variables,
       "executor_environment_variables",
       "JSON object representing the environment variables that should be\n"

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/slave/flags.hpp
----------------------------------------------------------------------
diff --git a/src/slave/flags.hpp b/src/slave/flags.hpp
index feb095d..dd722a9 100644
--- a/src/slave/flags.hpp
+++ b/src/slave/flags.hpp
@@ -68,6 +68,7 @@ public:
 #endif // __WINDOWS__
   std::string frameworks_home;  // TODO(benh): Make an Option.
   Duration registration_backoff_factor;
+  Duration authentication_backoff_factor;
   Option<JSON::Object> executor_environment_variables;
   Duration executor_registration_timeout;
   Duration executor_shutdown_grace_period;

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/slave/slave.cpp
----------------------------------------------------------------------
diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index e5934b9..2610cd4 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -19,6 +19,7 @@
 #include <stdlib.h> // For random().
 
 #include <algorithm>
+#include <cmath>
 #include <iomanip>
 #include <list>
 #include <map>
@@ -144,6 +145,7 @@ Slave::Slave(const std::string& id,
     authenticating(None()),
     authenticated(false),
     reauthenticate(false),
+    failedAuthentications(0),
     executorDirectoryMaxAllowedAge(age(0)),
     resourceEstimator(_resourceEstimator),
     qosController(_qosController) {}
@@ -807,11 +809,8 @@ void Slave::detected(const Future<Option<MasterInfo>>& _master)
 
     if (credential.isSome()) {
       // Authenticate with the master.
-      // TODO(vinod): Do a backoff for authentication similar to what
-      // we do for registration. This is a little tricky because, if
-      // we delay 'Slave::authenticate' and a new master is detected
-      // before 'authenticate' event is processed the slave tries to
-      // authenticate with the new master twice.
+      // TODO(adam-mesos): Consider adding an initial delay like we do
+      // for registration, to combat thundering herds on master failover.
       // TODO(vinod): Consider adding an "AUTHENTICATED" state to the
       // slave instead of "authenticate" variable.
       authenticate();
@@ -916,8 +915,24 @@ void Slave::_authenticate()
     authenticating = None();
     reauthenticate = false;
 
+    ++failedAuthentications;
+
+    // Backoff.
+    // The backoff is a random duration in the interval [0, b * 2^N)
+    // where `b = authentication_backoff_factor` and `N` the number
+    // of failed authentication attempts. It is capped by
+    // `AUTHENTICATION_RETRY_INTERVAL_MAX`.
+    Duration backoff =
+      flags.authentication_backoff_factor * std::pow(2, failedAuthentications);
+    backoff = std::min(backoff, AUTHENTICATION_RETRY_INTERVAL_MAX);
+
+    // Determine the delay for next attempt by picking a random
+    // duration between 0 and 'backoff'.
+    // TODO(vinod): Use random numbers from <random> header.
+    backoff *= (double) ::random() / RAND_MAX;
+
     // TODO(vinod): Add a limit on number of retries.
-    dispatch(self(), &Self::authenticate); // Retry.
+    delay(backoff, self(), &Self::authenticate); // Retry.
     return;
   }
 
@@ -932,6 +947,8 @@ void Slave::_authenticate()
   authenticated = true;
   authenticating = None();
 
+  failedAuthentications = 0;
+
   // Proceed with registration.
   doReliableRegistration(flags.registration_backoff_factor * 2);
 }

http://git-wip-us.apache.org/repos/asf/mesos/blob/8046860d/src/slave/slave.hpp
----------------------------------------------------------------------
diff --git a/src/slave/slave.hpp b/src/slave/slave.hpp
index 7520cc3..e7f5e03 100644
--- a/src/slave/slave.hpp
+++ b/src/slave/slave.hpp
@@ -570,6 +570,9 @@ private:
   // Indicates if a new authentication attempt should be enforced.
   bool reauthenticate;
 
+  // Indicates the number of failed authentication attempts.
+  uint64_t failedAuthentications;
+
   // Maximum age of executor directories. Will be recomputed
   // periodically every flags.disk_watch_interval.
   Duration executorDirectoryMaxAllowedAge;


[2/2] mesos git commit: Add MESOS-2043 to the 0.28.3 changelog and sort.

Posted by me...@apache.org.
Add MESOS-2043 to the 0.28.3 changelog and sort.


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/d219d505
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/d219d505
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/d219d505

Branch: refs/heads/0.28.x
Commit: d219d505722eefa1231d64bf8cd43ac65d783fe4
Parents: 8046860
Author: Adam B <ad...@mesosphere.io>
Authored: Sat Jul 30 23:18:27 2016 -0700
Committer: Adam B <ad...@mesosphere.io>
Committed: Sun Jul 31 03:15:25 2016 -0700

----------------------------------------------------------------------
 CHANGELOG | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/d219d505/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
index 92b482e..9de4bc8 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -4,17 +4,18 @@ Release Notes - Mesos - Version 0.28.3
 
 All Issues:
 ** Bug
-  * [MESOS-5673] - Port mapping isolator may cause segfault if it bind mount root does not exist.
+  * [MESOS-2043] - Framework auth fail with timeout error and never get authenticated
+  * [MESOS-5073] - Mesos allocator leaks role sorter and quota role sorters.
   * [MESOS-5330] - Agent should backoff before connecting to the master.
+  * [MESOS-5390] - v1 Executor Protos not included in maven jar
   * [MESOS-5543] - /dev/fd is missing in the Mesos containerizer environment.
+  * [MESOS-5576] - Masters may drop the first message they send between masters after a network partition.
+  * [MESOS-5673] - Port mapping isolator may cause segfault if it bind mount root does not exist.
   * [MESOS-5691] - SSL downgrade support will leak sockets in CLOSE_WAIT status.
-  * [MESOS-5723] - SSL-enabled libprocess will leak incoming links to forks.
-  * [MESOS-5748] - Potential segfault in `link` when linking to a remote process.
-  * [MESOS-5073] - Mesos allocator leaks role sorter and quota role sorters.
   * [MESOS-5698] - Quota sorter not updated for resource changes at agent.
+  * [MESOS-5723] - SSL-enabled libprocess will leak incoming links to forks.
   * [MESOS-5740] - Consider adding `relink` functionality to libprocess.
-  * [MESOS-5576] - Masters may drop the first message they send between masters after a network partition.
-  * [MESOS-5390] - v1 Executor Protos not included in maven jar
+  * [MESOS-5748] - Potential segfault in `link` when linking to a remote process.
 
 
 Release Notes - Mesos - Version 0.28.2