You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Meng Zhu <mz...@mesosphere.io> on 2018/08/14 05:33:38 UTC

Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/
-----------------------------------------------------------

(Updated Aug. 13, 2018, 10:33 p.m.)


Review request for mesos, Benjamin Mahler and Gastón Kleiman.


Summary (updated)
-----------------

Fixed a backoff overflow bug in agent authentication retry logic.


Bugs: MESOS-9147
    https://issues.apache.org/jira/browse/MESOS-9147


Repository: mesos


Description (updated)
-------

This patch fixed the backoff time calculation
overflow bug described in MESOS-9147.

The old approach times out an authentication request after
5s and then retries after some backoff time. This is not
optimal because, if the agent is going to backoff some time
before retry, we might as well wait that long for the
previous authentication request (instead of timeout early).

This patch combines the authentication timeout and
authentication retry backoff interval into a single
wait time interval. Now agent will timeout the previous
authentication request after the wait time interval and
then immediately retry.


Diffs (updated)
-----

  src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
  src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
  src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
  src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 


Diff: https://reviews.apache.org/r/68304/diff/3/

Changes: https://reviews.apache.org/r/68304/diff/2-3/


Testing
-------

make check


Thanks,

Meng Zhu


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Meng Zhu <mz...@mesosphere.io>.

> On Aug. 14, 2018, 3:30 p.m., Benjamin Mahler wrote:
> > src/slave/slave.cpp
> > Lines 1287-1290 (original), 1286-1289 (patched)
> > <https://reviews.apache.org/r/68304/diff/3/?file=2072257#file2072257line1287>
> >
> >     Shouldn't we specialize this between authentication and registration?

Added a TODO. Changing that breaks a lot of test. Dropping for this one.


- Meng


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/#review207255
-----------------------------------------------------------


On Aug. 15, 2018, 11:47 a.m., Meng Zhu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68304/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2018, 11:47 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Gastón Kleiman.
> 
> 
> Bugs: MESOS-9147
>     https://issues.apache.org/jira/browse/MESOS-9147
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixed the backoff time calculation
> overflow bug described in MESOS-9147.
> 
> The old approach times out an authentication request after
> 5s and then retries after some backoff time. This is not
> optimal because, if the agent is going to backoff some time
> before retry, we might as well wait that long for the
> previous authentication request (instead of timeout early).
> 
> This patch combines the authentication timeout and
> authentication retry backoff interval into a single
> wait time interval. Now agent will timeout the previous
> authentication request after the wait time interval and
> then immediately retry.
> 
> 
> Diffs
> -----
> 
>   src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
>   src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
>   src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
>   src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 
> 
> 
> Diff: https://reviews.apache.org/r/68304/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/#review207255
-----------------------------------------------------------



Per offline conversation, seems like we could simplify this by modeling as:

```
MIN, MIN + factor*2^N
```


src/slave/slave.cpp
Lines 1287-1290 (original), 1286-1289 (patched)
<https://reviews.apache.org/r/68304/#comment290587>

    Shouldn't we specialize this between authentication and registration?


- Benjamin Mahler


On Aug. 14, 2018, 5:33 a.m., Meng Zhu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68304/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2018, 5:33 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Gastón Kleiman.
> 
> 
> Bugs: MESOS-9147
>     https://issues.apache.org/jira/browse/MESOS-9147
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixed the backoff time calculation
> overflow bug described in MESOS-9147.
> 
> The old approach times out an authentication request after
> 5s and then retries after some backoff time. This is not
> optimal because, if the agent is going to backoff some time
> before retry, we might as well wait that long for the
> previous authentication request (instead of timeout early).
> 
> This patch combines the authentication timeout and
> authentication retry backoff interval into a single
> wait time interval. Now agent will timeout the previous
> authentication request after the wait time interval and
> then immediately retry.
> 
> 
> Diffs
> -----
> 
>   src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
>   src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
>   src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
>   src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 
> 
> 
> Diff: https://reviews.apache.org/r/68304/diff/3/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Meng Zhu <mz...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/
-----------------------------------------------------------

(Updated Aug. 16, 2018, 1:54 p.m.)


Review request for mesos, Benjamin Mahler and Gastón Kleiman.


Changes
-------

Rebased. And updated the flag description.


Bugs: MESOS-9147
    https://issues.apache.org/jira/browse/MESOS-9147


Repository: mesos


Description
-------

This patch fixed the backoff time calculation
overflow bug described in MESOS-9147.

The old approach times out an authentication request after
5s and then retries after some backoff time. This is not
optimal because, if the agent is going to backoff some time
before retry, we might as well wait that long for the
previous authentication request (instead of timeout early).

This patch combines the authentication timeout and
authentication retry backoff interval into a single
wait time interval. Now agent will timeout the previous
authentication request after the wait time interval and
then immediately retry.


Diffs (updated)
-----

  src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
  src/slave/flags.cpp f1727cd5b05a122f52775a54208e097b07603239 
  src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
  src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
  src/tests/authentication_tests.cpp 0e8a758ff5061541576317ced079fc59c728f246 


Diff: https://reviews.apache.org/r/68304/diff/7/

Changes: https://reviews.apache.org/r/68304/diff/6-7/


Testing
-------

make check


Thanks,

Meng Zhu


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Meng Zhu <mz...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/
-----------------------------------------------------------

(Updated Aug. 15, 2018, 3:49 p.m.)


Review request for mesos, Benjamin Mahler and Gastón Kleiman.


Bugs: MESOS-9147
    https://issues.apache.org/jira/browse/MESOS-9147


Repository: mesos


Description
-------

This patch fixed the backoff time calculation
overflow bug described in MESOS-9147.

The old approach times out an authentication request after
5s and then retries after some backoff time. This is not
optimal because, if the agent is going to backoff some time
before retry, we might as well wait that long for the
previous authentication request (instead of timeout early).

This patch combines the authentication timeout and
authentication retry backoff interval into a single
wait time interval. Now agent will timeout the previous
authentication request after the wait time interval and
then immediately retry.


Diffs (updated)
-----

  src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
  src/slave/flags.cpp 54d9acc8693f53294bdc2a88183cac84a8dfbfd9 
  src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
  src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
  src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 


Diff: https://reviews.apache.org/r/68304/diff/6/

Changes: https://reviews.apache.org/r/68304/diff/5-6/


Testing
-------

make check


Thanks,

Meng Zhu


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Gastón Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/#review207355
-----------------------------------------------------------


Ship it!




Ship it after fixing the issues opened by BenM.

- Gastón Kleiman


On Aug. 15, 2018, 11:47 a.m., Meng Zhu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68304/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2018, 11:47 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Gastón Kleiman.
> 
> 
> Bugs: MESOS-9147
>     https://issues.apache.org/jira/browse/MESOS-9147
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixed the backoff time calculation
> overflow bug described in MESOS-9147.
> 
> The old approach times out an authentication request after
> 5s and then retries after some backoff time. This is not
> optimal because, if the agent is going to backoff some time
> before retry, we might as well wait that long for the
> previous authentication request (instead of timeout early).
> 
> This patch combines the authentication timeout and
> authentication retry backoff interval into a single
> wait time interval. Now agent will timeout the previous
> authentication request after the wait time interval and
> then immediately retry.
> 
> 
> Diffs
> -----
> 
>   src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
>   src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
>   src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
>   src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 
> 
> 
> Diff: https://reviews.apache.org/r/68304/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Gastón Kleiman <ga...@mesosphere.io>.

> On Aug. 15, 2018, 2:29 p.m., Benjamin Mahler wrote:
> > src/slave/slave.cpp
> > Lines 1401-1408 (original), 1421-1428 (patched)
> > <https://reviews.apache.org/r/68304/diff/5/?file=2073108#file2073108line1422>
> >
> >     It seems a little easier to understand the overall approach if you show the math / instead of directly using variable names?
> >     
> >     ```
> >     // Grow the timeout range using exponential backoff:
> >     //
> >     //   [min, min + factor * 2^0]
> >     //   [min, min + factor * 2^1]
> >     //   ...
> >     //   [min, min + factor * 2^N]
> >     //   ...
> >     //   [min, max] // Stop at max.   
> >     ```
> >     
> >     Once I understand this, I can easily map it to the variables

+1

I wish I had seen this comment before, I had to do the math manually while reviewing the patch to make sure that it was ok.


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/#review207356
-----------------------------------------------------------


On Aug. 15, 2018, 11:47 a.m., Meng Zhu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68304/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2018, 11:47 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Gastón Kleiman.
> 
> 
> Bugs: MESOS-9147
>     https://issues.apache.org/jira/browse/MESOS-9147
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixed the backoff time calculation
> overflow bug described in MESOS-9147.
> 
> The old approach times out an authentication request after
> 5s and then retries after some backoff time. This is not
> optimal because, if the agent is going to backoff some time
> before retry, we might as well wait that long for the
> previous authentication request (instead of timeout early).
> 
> This patch combines the authentication timeout and
> authentication retry backoff interval into a single
> wait time interval. Now agent will timeout the previous
> authentication request after the wait time interval and
> then immediately retry.
> 
> 
> Diffs
> -----
> 
>   src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
>   src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
>   src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
>   src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 
> 
> 
> Diff: https://reviews.apache.org/r/68304/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Benjamin Mahler <bm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/#review207356
-----------------------------------------------------------


Fix it, then Ship it!





src/slave/constants.hpp
Lines 68-69 (original), 68-74 (patched)
<https://reviews.apache.org/r/68304/#comment290723>

    How about:
    
    ```
    // The minimum timeout used when authenticating against the master.
    //
    // TODO(mzhu): Make this configurable.
    constexpr Duration AUTHENTICATION_TIMEOUT_MIN = Seconds(5);
    
    // The maximum timeout used when authenticating against the master.
    //
    // TODO(mzhu): Make this configurable.
    constexpr Duration AUTHENTICATION_TIMEOUT_MAX = Minutes(1);
    ```



src/slave/slave.cpp
Lines 1288 (patched)
<https://reviews.apache.org/r/68304/#comment290722>

    "authentication" typo and newline:
    
    ```
        // Wait for a random amount of time before authentication or
        // registration.
        //
        // TODO(mzhu): Specialize this for authetication.
    ```



src/slave/slave.cpp
Lines 1401-1408 (original), 1421-1428 (patched)
<https://reviews.apache.org/r/68304/#comment290724>

    It seems a little easier to understand the overall approach if you show the math / instead of directly using variable names?
    
    ```
    // Grow the timeout range using exponential backoff:
    //
    //   [min, min + factor * 2^0]
    //   [min, min + factor * 2^1]
    //   ...
    //   [min, min + factor * 2^N]
    //   ...
    //   [min, max] // Stop at max.   
    ```
    
    Once I understand this, I can easily map it to the variables


- Benjamin Mahler


On Aug. 15, 2018, 6:47 p.m., Meng Zhu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68304/
> -----------------------------------------------------------
> 
> (Updated Aug. 15, 2018, 6:47 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler and Gastón Kleiman.
> 
> 
> Bugs: MESOS-9147
>     https://issues.apache.org/jira/browse/MESOS-9147
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch fixed the backoff time calculation
> overflow bug described in MESOS-9147.
> 
> The old approach times out an authentication request after
> 5s and then retries after some backoff time. This is not
> optimal because, if the agent is going to backoff some time
> before retry, we might as well wait that long for the
> previous authentication request (instead of timeout early).
> 
> This patch combines the authentication timeout and
> authentication retry backoff interval into a single
> wait time interval. Now agent will timeout the previous
> authentication request after the wait time interval and
> then immediately retry.
> 
> 
> Diffs
> -----
> 
>   src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
>   src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
>   src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
>   src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 
> 
> 
> Diff: https://reviews.apache.org/r/68304/diff/5/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Meng Zhu <mz...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/
-----------------------------------------------------------

(Updated Aug. 15, 2018, 11:47 a.m.)


Review request for mesos, Benjamin Mahler and Gastón Kleiman.


Changes
-------

Reverted the specialized initial backoff change for authentication/registration. As it breaks a lot of existing test. Added a TODO.


Bugs: MESOS-9147
    https://issues.apache.org/jira/browse/MESOS-9147


Repository: mesos


Description
-------

This patch fixed the backoff time calculation
overflow bug described in MESOS-9147.

The old approach times out an authentication request after
5s and then retries after some backoff time. This is not
optimal because, if the agent is going to backoff some time
before retry, we might as well wait that long for the
previous authentication request (instead of timeout early).

This patch combines the authentication timeout and
authentication retry backoff interval into a single
wait time interval. Now agent will timeout the previous
authentication request after the wait time interval and
then immediately retry.


Diffs (updated)
-----

  src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
  src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
  src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
  src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 


Diff: https://reviews.apache.org/r/68304/diff/5/

Changes: https://reviews.apache.org/r/68304/diff/4-5/


Testing
-------

make check


Thanks,

Meng Zhu


Re: Review Request 68304: Fixed a backoff overflow bug in agent authentication retry logic.

Posted by Meng Zhu <mz...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68304/
-----------------------------------------------------------

(Updated Aug. 14, 2018, 5:47 p.m.)


Review request for mesos, Benjamin Mahler and Gastón Kleiman.


Bugs: MESOS-9147
    https://issues.apache.org/jira/browse/MESOS-9147


Repository: mesos


Description
-------

This patch fixed the backoff time calculation
overflow bug described in MESOS-9147.

The old approach times out an authentication request after
5s and then retries after some backoff time. This is not
optimal because, if the agent is going to backoff some time
before retry, we might as well wait that long for the
previous authentication request (instead of timeout early).

This patch combines the authentication timeout and
authentication retry backoff interval into a single
wait time interval. Now agent will timeout the previous
authentication request after the wait time interval and
then immediately retry.


Diffs (updated)
-----

  src/slave/constants.hpp 0bd9f371ca24df66055f31bca0d57625dc7642d9 
  src/slave/slave.hpp 802d4eb9e9eba2f1175dd85e56dcc80a61e32f74 
  src/slave/slave.cpp 78e8666f402be58af5b6e20a715da4998af2615c 
  src/tests/authentication_tests.cpp c9a8f85951a50e278ae509f4efa7105755015ce9 


Diff: https://reviews.apache.org/r/68304/diff/4/

Changes: https://reviews.apache.org/r/68304/diff/3-4/


Testing
-------

make check


Thanks,

Meng Zhu