Posted to issues@mesos.apache.org by "Till Toenshoff (JIRA)" <ji...@apache.org> on 2015/02/17 21:58:11 UTC
[jira] [Updated] (MESOS-2360) Slave may send multiple, almost concurrent registration requests to the master.
[ https://issues.apache.org/jira/browse/MESOS-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Toenshoff updated MESOS-2360:
----------------------------------
Description:
Prompted by an issue Alexander ran into in https://issues.apache.org/jira/browse/MESOS-2355, I wanted to find out why the Slave was allowed to send multiple, parallel registration requests.
While reading the code, one part caught my attention:
{noformat}
// Retry registration if necessary.
Duration next = std::min(
duration * ((double) ::random() / RAND_MAX),
REGISTER_RETRY_INTERVAL_MAX);
{noformat}
[src/slave/slave.cpp, slave::doReliableRegistration, line 1040 ff]
So this allows {{next}} to be equal to or very close to 0. A zero delay causes an immediate retry, which (with a bit of bad luck) may again trigger an immediate retry, and so on. In these cases, the effective delay is determined only by the cycle time of libprocess.
Why was this implemented without a lower bound (a floor), say, 1 second?
While MESOS-2355 received a proper fix for this scenario in a test, I think the general issue still needs to be clarified.
Should we add such a floor to prevent pointless, almost concurrent registration requests?
> Slave may send multiple, almost concurrent registration requests to the master.
> -------------------------------------------------------------------------------
>
> Key: MESOS-2360
> URL: https://issues.apache.org/jira/browse/MESOS-2360
> Project: Mesos
> Issue Type: Bug
> Reporter: Till Toenshoff
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)