You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Xudong Ni via Review Board <no...@reviews.apache.org> on 2018/10/09 20:45:24 UTC

Re: Review Request 68706: Added master failover reregistration progress metrics.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 9, 2018, 8:45 p.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description (updated)
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs (updated)
-----

  docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
  src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
  src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
  src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 


Diff: https://reviews.apache.org/r/68706/diff/2/

Changes: https://reviews.apache.org/r/68706/diff/1-2/


Testing (updated)
-------

Tested in mmaster with 6 reregistration agents:
"master/slave_reregistrations": 6,

In the middle of reregistration process:
"master/slaves_100_percent_reregistered_secs": 0,
"master/slaves_25_percent_reregistered_secs": 2.244662016,
"master/slaves_50_percent_reregistered_secs": 3.599491072,
"master/slaves_75_percent_reregistered_secs": 9.53919616,
"master/slaves_90_percent_reregistered_secs": 0,
"master/slaves_99_percent_reregistered_secs": 0,

When all registrations finished:
"master/slaves_100_percent_reregistered_secs": 29.697210112,
"master/slaves_25_percent_reregistered_secs": 2.244662016,
"master/slaves_50_percent_reregistered_secs": 3.599491072,
"master/slaves_75_percent_reregistered_secs": 9.53919616,
"master/slaves_90_percent_reregistered_secs": 29.697210112,
"master/slaves_99_percent_reregistered_secs": 29.697210112,


Thanks,

Xudong Ni


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209386
-----------------------------------------------------------



PASS: Mesos patch 68706 was successfully built and tested.

Reviews applied: `['68706']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2442/mesos-review-68706

- Mesos Reviewbot Windows


On Oct. 9, 2018, 1:45 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2018, 1:45 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/2/
> 
> 
> Testing
> -------
> 
> Tested in mmaster with 6 reregistration agents:
> "master/slave_reregistrations": 6,
> 
> In the middle of reregistration process:
> "master/slaves_100_percent_reregistered_secs": 0,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 0,
> "master/slaves_99_percent_reregistered_secs": 0,
> 
> When all registrations finished:
> "master/slaves_100_percent_reregistered_secs": 29.697210112,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 29.697210112,
> "master/slaves_99_percent_reregistered_secs": 29.697210112,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209389
-----------------------------------------------------------



PASS: Mesos patch 68706 was successfully built and tested.

Reviews applied: `['68706']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2443/mesos-review-68706

- Mesos Reviewbot Windows


On Oct. 9, 2018, 11:24 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 9, 2018, 11:24 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/3/
> 
> 
> Testing
> -------
> 
> Tested in mmaster with 6 reregistration agents:
> "master/slave_reregistrations": 6,
> 
> In the middle of reregistration process:
> "master/slaves_100_percent_reregistered_secs": 0,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 0,
> "master/slaves_99_percent_reregistered_secs": 0,
> 
> When all registrations finished:
> "master/slaves_100_percent_reregistered_secs": 29.697210112,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 29.697210112,
> "master/slaves_99_percent_reregistered_secs": 29.697210112,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209671
-----------------------------------------------------------



Patch looks great!

Reviews applied: [68706]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 16, 2018, 4:49 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 4:49 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/5/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209646
-----------------------------------------------------------



PASS: Mesos patch 68706 was successfully built and tested.

Reviews applied: `['68706']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2475/mesos-review-68706

- Mesos Reviewbot Windows


On Oct. 16, 2018, 4:49 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 4:49 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/5/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.

> On Oct. 18, 2018, 6:14 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> >     If we use equality operator and only set the timer when such a number of reregistered agent is reached we are guaranteed to only set each timer once (but we may need to set multiple timers in one call) right? This alleviates the need to check if the timer is already set!
> >     
> >     This should also work with boundary cases like the total recovered agents count being 0 or 1 (overlapping percentiles) etc. Right?
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.

This is actaully fixed and marked as dropped somehow. We used equality operator to 0.0 to check whether the percentile was reached or not; The reason we used push gauge not timer is explained in push gauge vs timer comments section


> On Oct. 18, 2018, 6:14 p.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> >     On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that it doesn't export any value if you haven't set it.
> >     
> >     Consider the two cases:
> >     
> >     1. There are 1000 recovered agents and 0 have reregsitered, should all the timers have zero values or should they be absent?
> >     
> >     2. There are 0 recovered agents (e.g., brand new cluster), should all of the metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all 0 agents are reregistered within 0 secs.
> >     
> >     So timer handles this natrually. Also it sets the `_secs` name for you but that's a minor conveninence.
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.

Sorry about this, I did make the comments but it must be in one of draft which was not saved. 

I agree that metrics should be zero but not absent when certain percentige were not reached. I did tried both PushGauge and Timer implementation and tested in our clusters.

If we used the timer, when the value was not set, that metric is actually missing. PushGauge is set with the initial value 0.0 and we can tell whether it's set yet, the metric will always exist no matter that percentile reached or not, and it has better performance.

The brand new cluster case was tested and the results were included in the test results.


- Xudong


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 11:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> >     If we use equality operator and only set the timer when such a number of reregistered agent is reached we are guaranteed to only set each timer once (but we may need to set multiple timers in one call) right? This alleviates the need to check if the timer is already set!
> >     
> >     This should also work with boundary cases like the total recovered agents count being 0 or 1 (overlapping percentiles) etc. Right?

We should comment on the reasons for dropping issues.


> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> >     On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that it doesn't export any value if you haven't set it.
> >     
> >     Consider the two cases:
> >     
> >     1. There are 1000 recovered agents and 0 have reregsitered, should all the timers have zero values or should they be absent?
> >     
> >     2. There are 0 recovered agents (e.g., brand new cluster), should all of the metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all 0 agents are reregistered within 0 secs.
> >     
> >     So timer handles this natrually. Also it sets the `_secs` name for you but that's a minor conveninence.

We should comment on the reasons for dropping issues.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> >

Sorry for the delay. Feel free to chat if it's not clear!


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 9:49 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/5/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Jiang Yan Xu <ya...@jxu.me>.

> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> >     If we use equality operator and only set the timer when such a number of reregistered agent is reached we are guaranteed to only set each timer once (but we may need to set multiple timers in one call) right? This alleviates the need to check if the timer is already set!
> >     
> >     This should also work with boundary cases like the total recovered agents count being 0 or 1 (overlapping percentiles) etc. Right?
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.
> 
> Xudong Ni wrote:
>     This is actaully fixed and marked as dropped somehow. We used equality operator to 0.0 to check whether the percentile was reached or not; The reason we used push gauge not timer is explained in push gauge vs timer comments section

The point here is "This alleviates the need to check if the timer is already set!". It's still in the new revision. I made another comment about it.


> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> >     On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that it doesn't export any value if you haven't set it.
> >     
> >     Consider the two cases:
> >     
> >     1. There are 1000 recovered agents and 0 have reregsitered, should all the timers have zero values or should they be absent?
> >     
> >     2. There are 0 recovered agents (e.g., brand new cluster), should all of the metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all 0 agents are reregistered within 0 secs.
> >     
> >     So timer handles this natrually. Also it sets the `_secs` name for you but that's a minor conveninence.
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.
> 
> Xudong Ni wrote:
>     Sorry about this, I did make the comments but it must be in one of draft which was not saved. 
>     
>     I agree that metrics should be zero but not absent when certain percentige were not reached. I did tried both PushGauge and Timer implementation and tested in our clusters.
>     
>     If we used the timer, when the value was not set, that metric is actually missing. PushGauge is set with the initial value 0.0 and we can tell whether it's set yet, the metric will always exist no matter that percentile reached or not, and it has better performance.
>     
>     The brand new cluster case was tested and the results were included in the test results.

My point was the opposite. When a percentage is not reached, I feel that the value being absent is preferrable and I used the above two cases to demonstrate it. (There is a subtle difference between the two IMO).

It's in fact the current practice used by metrics like `state_fetch`.


- Jiang Yan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------




src/master/master.hpp
Lines 2346-2347 (patched)
<https://reviews.apache.org/r/68706/#comment294277>

    I wasn't sure if we needed a dedicated struct / method for these metrics when we first discussed this approach but looking at the the additional method, math and class member state that we need to track I feel that we do. 
    
    It's probably the best if we put such logic in the master `Metrics` class. It'll isolate the metric calculation logic from the master and you can use the struct's member state to help with tracking.
    
    Similar examples already exist here. e.g., https://github.com/apache/mesos/blob/f03f91545fec64a767f808c48a3b352dc86b1ac0/src/master/metrics.hpp#L89
    
    I propose that we add something like this
    
    ```
    struct AgentReregistrations
    {
      AgentReregistrations(int recoveredAgentCount);
      ~AgentReregistrations();
    
      // If you use `Timer` you don't even need this. See comment about Timer below.
      process::Time electedTime, 
      int recoveredAgentCount;
      int reregisteredAgentCount = 0;
      // ... Whatever else state you may need.
      
      process::metrics::PushGauge recovered_agents_25_percent_reregistered_secs;
      // ... other gauges.
      
      
      void incrementAgentReregistered();
    };
    
    // Option since you only set it after the master is elected.
    Option<AgentReregistrations> agentReregistrations;
    ```



src/master/master.cpp
Lines 1866-1868 (patched)
<https://reviews.apache.org/r/68706/#comment294282>

    But we naturally don't need float comparison, we are comparing the reregistered agents count (int) against the **minimum** number of agents that **satisfy** the percentage requirement (int). So if you take a `ceil` of the floating point multiplication result and cast to int you'll get the semantically correct number?
    
    e.g., if total = 5 and we are calculating 75% the `reregisteredAgentCount` we are looking for is 4.



src/master/master.cpp
Lines 1874-1875 (patched)
<https://reviews.apache.org/r/68706/#comment294312>

    If we use equality operator and only set the timer when such a number of reregistered agent is reached we are guaranteed to only set each timer once (but we may need to set multiple timers in one call) right? This alleviates the need to check if the timer is already set!
    
    This should also work with boundary cases like the total recovered agents count being 0 or 1 (overlapping percentiles) etc. Right?



src/master/metrics.hpp
Lines 50 (patched)
<https://reviews.apache.org/r/68706/#comment294275>

    Can we use the names agreed-upon with BenM in the slack channel which is e.g., `recovered_agents_50_percent_reregistered_secs`?



src/master/metrics.hpp
Lines 51 (patched)
<https://reviews.apache.org/r/68706/#comment294313>

    On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that it doesn't export any value if you haven't set it.
    
    Consider the two cases:
    
    1. There are 1000 recovered agents and 0 have reregsitered, should all the timers have zero values or should they be absent?
    
    2. There are 0 recovered agents (e.g., brand new cluster), should all of the metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all 0 agents are reregistered within 0 secs.
    
    So timer handles this natrually. Also it sets the `_secs` name for you but that's a minor conveninence.


- Jiang Yan Xu


On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 9:49 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/5/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Jiang Yan Xu <ya...@jxu.me>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review210145
-----------------------------------------------------------



There were changes added for a test but it's gone in the latest revision?


src/master/master.cpp
Lines 1720 (patched)
<https://reviews.apache.org/r/68706/#comment294924>

    You can construct the entire `Option<AgentReregistrations> agentReregistrations` struct here using the `expectedAgentCount` as well as the `electedTime` which is a class member (Rationale explained below).
    
    Initializing `agentReregistrations` is just one step in the list of steps taken in `Master::_recover` so you don't have to squeeze it between the two lines for allocator recover code. It's not appropriate as the block comment above clearly tries to explain the allocator reocvery. 
    
    You can just put it somewhere below by itself (blank lines above and below the statement). The `expectedAgentCount` variable is already set and you can use; an frankly you can just use `registry.slaves().slaves().size()` as well.



src/master/master.cpp
Lines 2099 (patched)
<https://reviews.apache.org/r/68706/#comment294925>

    - End sentences with `.`
    - s/master/the master/



src/master/metrics.hpp
Lines 87 (patched)
<https://reviews.apache.org/r/68706/#comment294779>

    - End sentences with `.`.
    - s/when/during/



src/master/metrics.hpp
Lines 90 (patched)
<https://reviews.apache.org/r/68706/#comment294778>

    No need for `explicit` as it's only for single argument constructors, you don't need it either with zero or two arguments (one proposal below).



src/master/metrics.hpp
Lines 90-123 (patched)
<https://reviews.apache.org/r/68706/#comment294927>

    To be consistent let's put the implementation in the cpp file master/metrics.cpp. Here and for other methods as well.



src/master/metrics.hpp
Lines 92 (patched)
<https://reviews.apache.org/r/68706/#comment294926>

    Indent two more spaces, we can see examples in master/metrics.cpp, here and elsewhere.



src/master/metrics.hpp
Lines 130 (patched)
<https://reviews.apache.org/r/68706/#comment294928>

    We generally don't use one letter variable names. `t` here seemingly refers to `Time` but it's a `Duration`. It won't hurt to keep it strongly types until we need to convert it, i.e., Use `Duration duration = Clock::now() - electedTime;` should be fine.



src/master/metrics.hpp
Lines 132 (patched)
<https://reviews.apache.org/r/68706/#comment294929>

    A space between `if` and `(`. Here and elsewhere.



src/master/metrics.hpp
Lines 133 (patched)
<https://reviews.apache.org/r/68706/#comment294931>

    Let's `#include <cmath>` for ceil, in general include all needed headers without relying on transitive dependencies is a good pratice.



src/master/metrics.hpp
Lines 138 (patched)
<https://reviews.apache.org/r/68706/#comment294932>

    As I commented in the previous revision, we don't need to check if a timer is already set. A percentage timer corresponds to one precise `reregisteredAgentCount` (and not vice versa) so we just need to check that. Does it make sense?



src/master/metrics.hpp
Lines 176-177 (patched)
<https://reviews.apache.org/r/68706/#comment294923>

    Semantically it's hard to explain why `recoveredAgentCount` is an `Option` while `reregisteredAgentCount` is not? Why not have all three fields as Options? But then one may ask why not have the entire struct by an Option?
    
    It's true that we don't have a `recoveredAgentCount` before they are recovered and `electedTime` before the master is elected but we can just use `Option<AgentReregistrations> agentReregistrations` like I suggested?
    
    The semantics of `Option<AgentReregistrations> agentReregistrations` seems natural to me: it is optional because agent reregistrations only happen after the master is recovered.


- Jiang Yan Xu


On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209817
-----------------------------------------------------------



Patch looks great!

Reviews applied: [68706]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 11:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209816
-----------------------------------------------------------



PASS: Mesos patch 68706 was successfully built and tested.

Reviews applied: `['68706']`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2495/mesos-review-68706

- Mesos Reviewbot Windows


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 11:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 19, 2018, 11:56 p.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs (updated)
-----

  src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 


Diff: https://reviews.apache.org/r/68706/diff/7/

Changes: https://reviews.apache.org/r/68706/diff/6-7/


Testing (updated)
-------

Automation:
[ RUN      ] MasterTest.MetricsInMetricsEndpoint
[       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)

Real world cases:

While the master is not elected or there is no agents recovered yet
"master/recovered_agents_100_percent_reregistered_secs": 0.0,
"master/recovered_agents_25_percent_reregistered_secs": 0.0,
"master/recovered_agents_50_percent_reregistered_secs": 0.0,
"master/recovered_agents_75_percent_reregistered_secs": 0.0,
"master/recovered_agents_90_percent_reregistered_secs": 0.0,
"master/recovered_agents_99_percent_reregistered_secs": 0.0,
"master/slave_reregistrations": 0.0,

While reregistrations is in progress: 5 out of 6 completed:
"master/recovered_agents_100_percent_reregistered_secs": 0.0,
"master/recovered_agents_25_percent_reregistered_secs": 2.0,
"master/recovered_agents_50_percent_reregistered_secs": 3.0,
"master/recovered_agents_75_percent_reregistered_secs": 6.0,
"master/recovered_agents_90_percent_reregistered_secs": 0.0,
"master/recovered_agents_99_percent_reregistered_secs": 0.0,
"master/slave_reregistrations": 5.0,


While 6 reregistrations were all completed:
"master/recovered_agents_100_percent_reregistered_secs": 22.0,
"master/recovered_agents_25_percent_reregistered_secs": 2.0,
"master/recovered_agents_50_percent_reregistered_secs": 3.0,
"master/recovered_agents_75_percent_reregistered_secs": 6.0,
"master/recovered_agents_90_percent_reregistered_secs": 22.0,
"master/recovered_agents_99_percent_reregistered_secs": 22.0,
"master/slave_reregistrations": 6.0,


Thanks,

Xudong Ni


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 19, 2018, 11:14 p.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs (updated)
-----

  src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 


Diff: https://reviews.apache.org/r/68706/diff/6/

Changes: https://reviews.apache.org/r/68706/diff/5-6/


Testing (updated)
-------

Automation:
[ RUN      ] MasterTest.MetricsInMetricsEndpoint
[       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)

Real world cases:

While the master is not elected or there is no agents to recover
"master/recovered_agents_100_percent_reregistered_secs": 0.0,
"master/recovered_agents_25_percent_reregistered_secs": 0.0,
"master/recovered_agents_50_percent_reregistered_secs": 0.0,
"master/recovered_agents_75_percent_reregistered_secs": 0.0,
"master/recovered_agents_90_percent_reregistered_secs": 0.0,
"master/recovered_agents_99_percent_reregistered_secs": 0.0,


While reregistrations is in progress: 4 out of 6 completed:
"master/recovered_agents_100_percent_reregistered_secs": 0.0,
"master/recovered_agents_25_percent_reregistered_secs": 1.0,
"master/recovered_agents_50_percent_reregistered_secs": 8.0,
"master/recovered_agents_75_percent_reregistered_secs": 16.0,
"master/recovered_agents_90_percent_reregistered_secs": 0.0,
"master/recovered_agents_99_percent_reregistered_secs": 0.0,
"master/slave_reregistrations": 4.0,


While 6 reregistrations were all completed:
"master/recovered_agents_100_percent_reregistered_secs": 39.0,
"master/recovered_agents_25_percent_reregistered_secs": 1.0,
"master/recovered_agents_50_percent_reregistered_secs": 8.0,
"master/recovered_agents_75_percent_reregistered_secs": 16.0,
"master/recovered_agents_90_percent_reregistered_secs": 31.0,
"master/recovered_agents_99_percent_reregistered_secs": 31.0,
"master/slave_reregistrations": 6.0,


Thanks,

Xudong Ni


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 16, 2018, 4:49 p.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Changes
-------

Sync master


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs (updated)
-----

  docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
  src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
  src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
  src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
  src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 


Diff: https://reviews.apache.org/r/68706/diff/5/

Changes: https://reviews.apache.org/r/68706/diff/4-5/


Testing
-------

Automation:
[ RUN      ] MasterTest.MetricsInMetricsEndpoint
[       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)

Real world cases:
While reregistrations is in progress: 3277 out of 3582 completed:
"master/slave_reregistrations": 3277.0,
"master/slaves_100_percent_reregistered_secs": 0.0,
"master/slaves_25_percent_reregistered_secs": 5.0,
"master/slaves_50_percent_reregistered_secs": 11.0,
"master/slaves_75_percent_reregistered_secs": 20.0,
"master/slaves_90_percent_reregistered_secs": 30.0,
"master/slaves_99_percent_reregistered_secs": 0.0,


While 3582 reregistrations were all completed:
"master/slave_reregistrations": 3582.0,
"master/slaves_100_percent_reregistered_secs": 54.0,
"master/slaves_25_percent_reregistered_secs": 5.0,
"master/slaves_50_percent_reregistered_secs": 11.0,
"master/slaves_75_percent_reregistered_secs": 20.0,
"master/slaves_90_percent_reregistered_secs": 30.0,
"master/slaves_99_percent_reregistered_secs": 39.0,


Thanks,

Xudong Ni


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209595
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['68706']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2469/mesos-review-68706

Relevant logs:

- [mesos-tests.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2469/mesos-review-68706/logs/mesos-tests.log):

```
I1016 06:34:30.897042  8800 executor.cpp:805] Shutting down
I1016 06:34:30.897042  8800 executor.cpp:918] Sending SIGTERM to process tree at pid 400xecutor(1)@192.10.1.5:62750
I1016 06:34:30.896052  1308 slave.cpp:909] Agent terminating
I1016 06:34:30.896052  8440 master.cpp:11082] Removing task dd62b7a3-f017-44fb-a6fc-b973be640983 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-0000 on agent 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-S0 at slave(461)@192.10.1.5:60955 (windows-02.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net)
W1016 06:34:30.896052  1308 slave.cpp:3917] Ignoring shutdown framework 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-0000 because it is terminating
I1016 06:34:30.899040  8440 master.cpp:1251] Agent 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-S0 at slave(461)@192.10.1.5:60955 (windows-02.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net) disconnected
I1016 06:34:30.899040  8440 master.cpp:3317] Disconnecting agent 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-S0 at slave(461)@192.10.1.5:60955 (windows-02.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net)
I1016 06:34:30.899040  8440 master.cpp:3336] Deactivating agent 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-S0 at slave(461)@192.10.1.5:60955 (windows-02.aa0q4n2kgcyefckmv0xukjvy4f.xx.internal.cloudapp.net)
I1016 06:34:30.900039  5924 hierarchical.cpp:359] Removed framework 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-0000
I1016 06:34:30.900039  5924 hierarchical.cpp:803] Agent 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-S0 deactivated
I1016 06:34:30.900039  8604 containerizer.cpp:2455] Destroying container 1443f5fd-7561-4184-a35d-e5eb03ace98a in RUNNING state
I1016 06:34:30.901036  8604 containerizer.cpp:3122] Transitioning the state of container 1443f5fd-7561-4184-a35d-e5eb03ace98a from RUNNING to DESTROYING
I1016 06:34:30.901036  8604 launcher.cpp:166] Asked to destroy container 1443f5fd-7561-4184-a35d-e5eb03ace98a
I1016 06:34:30.935099  8440 containerizer.cpp:2961] Container 1443f5fd-7561-4184-a35d-e5eb03ace98a has exited
I1016 06:34:30.967039 11088 master.cpp:1093] Master terminating
I1016 06:34:30.9690[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (584 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (601 ms total)

[----------] Global test environment tear-down
[==========] 1051 tests from 103 test cases ran. (488984 ms total)
[  PASSED  ] 1050 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DockerFetcherPluginTest.INTERNET_CURL_FetchBlob

 1 FAILED TEST
  YOU HAVE 232 DISABLED TESTS

55  7268 hierarchical.cpp:645] Removed agent 9f48f4c4-e51d-4f12-b338-ca5873a9eed6-S0
I1016 06:34:31.236047  6856 process.cpp:926] Stopped the socket accept loop
```

- Mesos Reviewbot Windows


On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2018, 9:49 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/4/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 16, 2018, 4:19 a.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs (updated)
-----

  docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
  src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
  src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
  src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
  src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 


Diff: https://reviews.apache.org/r/68706/diff/4/

Changes: https://reviews.apache.org/r/68706/diff/3-4/


Testing (updated)
-------

Automation:
[ RUN      ] MasterTest.MetricsInMetricsEndpoint
[       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)

Real world cases:
While reregistrations is in progress: 3277 out of 3582 completed:
"master/slave_reregistrations": 3277.0,
"master/slaves_100_percent_reregistered_secs": 0.0,
"master/slaves_25_percent_reregistered_secs": 5.0,
"master/slaves_50_percent_reregistered_secs": 11.0,
"master/slaves_75_percent_reregistered_secs": 20.0,
"master/slaves_90_percent_reregistered_secs": 30.0,
"master/slaves_99_percent_reregistered_secs": 0.0,


While 3582 reregistrations were all completed:
"master/slave_reregistrations": 3582.0,
"master/slaves_100_percent_reregistered_secs": 54.0,
"master/slaves_25_percent_reregistered_secs": 5.0,
"master/slaves_50_percent_reregistered_secs": 11.0,
"master/slaves_75_percent_reregistered_secs": 20.0,
"master/slaves_90_percent_reregistered_secs": 30.0,
"master/slaves_99_percent_reregistered_secs": 39.0,


Thanks,

Xudong Ni


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.

> On Oct. 15, 2018, 4:45 p.m., James Peach wrote:
> > I think that we need test for this as well. At minimum, we ought to update `MasterTest.MetricsInMetricsEndpoint`. Best would be a test that registers a number of agents, then restarts the master and validates the metrics.

Basic test was added; And we validated the actual metrics in the real environment


> On Oct. 15, 2018, 4:45 p.m., James Peach wrote:
> > src/master/master.cpp
> > Lines 1850 (patched)
> > <https://reviews.apache.org/r/68706/diff/3/?file=2095910#file2095910line1850>
> >
> >     I found the arithmetic here pretty confusing. How about simplifying this to:
> >     ```
> >     
> >     double percentRegistered = metrics->slave_reregistrations.value().get() / expectedAgentCount;
> >     
> >     if (slave25PercentageRegistered.value().get() == 0) {
> >       if (percentRegistered > 0.25) {
> >         slaves_25_percent_reregistered_secs = t;
> >       }
> >     }
> >     ```

This is actually fixed for another comments to use the ceil:

      if((recovered_agents_25_percent_reregistered_secs.value().get() == 0.0)
          && (reregisteredAgentCount == ceil(recoveredAgentCount.get() * 0.25)))
      {
        recovered_agents_25_percent_reregistered_secs = t;
      }


- Xudong


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209547
-----------------------------------------------------------


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 11:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by James Peach <jp...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209547
-----------------------------------------------------------



I think that we need test for this as well. At minimum, we ought to update `MasterTest.MetricsInMetricsEndpoint`. Best would be a test that registers a number of agents, then restarts the master and validates the metrics.


docs/monitoring.md
Lines 424 (patched)
<https://reviews.apache.org/r/68706/#comment294074>

    Although (for historical consistency) the metric name should use the `slave` terminolofy, everything else should use `agent`, including documentation and code comments.



docs/monitoring.md
Lines 431 (patched)
<https://reviews.apache.org/r/68706/#comment294075>

    I think that we can make this explanation clearer. This is the elapsed time for 50% of the previously registered agents to re-register with the newly epected master after a failover.



src/master/master.hpp
Lines 1176 (patched)
<https://reviews.apache.org/r/68706/#comment294073>

    Let's call this
    ```
    updateReregistrationMetrics();
    ```



src/master/master.hpp
Lines 2344 (patched)
<https://reviews.apache.org/r/68706/#comment294076>

    Since `expectedAgentCount` isn't set until `_recover()`, let's make it and `Option<int>`. Consider leaving a comment for future generations.



src/master/master.hpp
Lines 2422 (patched)
<https://reviews.apache.org/r/68706/#comment294072>

    All these helper functions can be removed once you switch to `PushGauge`.



src/master/master.cpp
Lines 1850 (patched)
<https://reviews.apache.org/r/68706/#comment294071>

    I found the arithmetic here pretty confusing. How about simplifying this to:
    ```
    
    double percentRegistered = metrics->slave_reregistrations.value().get() / expectedAgentCount;
    
    if (slave25PercentageRegistered.value().get() == 0) {
      if (percentRegistered > 0.25) {
        slaves_25_percent_reregistered_secs = t;
      }
    }
    ```



src/master/master.cpp
Lines 1891 (patched)
<https://reviews.apache.org/r/68706/#comment294069>

    2 newlines between file-scope functions.



src/master/metrics.hpp
Lines 51 (patched)
<https://reviews.apache.org/r/68706/#comment294066>

    Comments should be complete sentences:
    ```
    // Timer to track the progress of agent re-registration.
    ```
    
    Note that, all comments are standardized on "re-registration".



src/master/metrics.hpp
Lines 58 (patched)
<https://reviews.apache.org/r/68706/#comment294067>

    Let's make these all `PullGauge` objects. Initializing these to `0` (the default) would be indistinguishable from a metrics publishing perspective, but the performance would be better.



src/master/metrics.cpp
Lines 53 (patched)
<https://reviews.apache.org/r/68706/#comment294064>

    This metric isn't documented and I'm not clear on what its semantics are. Let's remove it.


- James Peach


On Oct. 10, 2018, 5:22 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 10, 2018, 5:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/3/
> 
> 
> Testing
> -------
> 
> Tested in mmaster with 6 reregistration agents:
> "master/slave_reregistrations": 6,
> 
> In the middle of reregistration process:
> "master/slaves_100_percent_reregistered_secs": 0,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 0,
> "master/slaves_99_percent_reregistered_secs": 0,
> 
> When all registrations finished:
> "master/slaves_100_percent_reregistered_secs": 29.697210112,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 29.697210112,
> "master/slaves_99_percent_reregistered_secs": 29.697210112,
> 
> With 3606 agents, the last 1% take significant time
> "master/slave_reregistrations": 3606,
> "master/slave_shutdowns_canceled": 0,
> "master/slave_shutdowns_completed": 0,
> "master/slave_shutdowns_scheduled": 0,
> "master/slave_unreachable_canceled": 0,
> "master/slave_unreachable_completed": 0,
> "master/slave_unreachable_scheduled": 0,
> "master/slaves_100_percent_reregistered_secs": 58.585202944,
> "master/slaves_25_percent_reregistered_secs": 9.966434048,
> "master/slaves_50_percent_reregistered_secs": 20.259571968,
> "master/slaves_75_percent_reregistered_secs": 30.598885888,
> "master/slaves_90_percent_reregistered_secs": 36.396082944,
> "master/slaves_99_percent_reregistered_secs": 39.811022848,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209425
-----------------------------------------------------------



Patch looks great!

Reviews applied: [68706]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Oct. 10, 2018, 5:22 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 10, 2018, 5:22 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/3/
> 
> 
> Testing
> -------
> 
> Tested in mmaster with 6 reregistration agents:
> "master/slave_reregistrations": 6,
> 
> In the middle of reregistration process:
> "master/slaves_100_percent_reregistered_secs": 0,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 0,
> "master/slaves_99_percent_reregistered_secs": 0,
> 
> When all registrations finished:
> "master/slaves_100_percent_reregistered_secs": 29.697210112,
> "master/slaves_25_percent_reregistered_secs": 2.244662016,
> "master/slaves_50_percent_reregistered_secs": 3.599491072,
> "master/slaves_75_percent_reregistered_secs": 9.53919616,
> "master/slaves_90_percent_reregistered_secs": 29.697210112,
> "master/slaves_99_percent_reregistered_secs": 29.697210112,
> 
> With 3606 agents, the last 1% take significant time
> "master/slave_reregistrations": 3606,
> "master/slave_shutdowns_canceled": 0,
> "master/slave_shutdowns_completed": 0,
> "master/slave_shutdowns_scheduled": 0,
> "master/slave_unreachable_canceled": 0,
> "master/slave_unreachable_completed": 0,
> "master/slave_unreachable_scheduled": 0,
> "master/slaves_100_percent_reregistered_secs": 58.585202944,
> "master/slaves_25_percent_reregistered_secs": 9.966434048,
> "master/slaves_50_percent_reregistered_secs": 20.259571968,
> "master/slaves_75_percent_reregistered_secs": 30.598885888,
> "master/slaves_90_percent_reregistered_secs": 36.396082944,
> "master/slaves_99_percent_reregistered_secs": 39.811022848,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 10, 2018, 5:22 p.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs
-----

  docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
  src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
  src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
  src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 


Diff: https://reviews.apache.org/r/68706/diff/3/


Testing (updated)
-------

Tested in mmaster with 6 reregistration agents:
"master/slave_reregistrations": 6,

In the middle of reregistration process:
"master/slaves_100_percent_reregistered_secs": 0,
"master/slaves_25_percent_reregistered_secs": 2.244662016,
"master/slaves_50_percent_reregistered_secs": 3.599491072,
"master/slaves_75_percent_reregistered_secs": 9.53919616,
"master/slaves_90_percent_reregistered_secs": 0,
"master/slaves_99_percent_reregistered_secs": 0,

When all registrations finished:
"master/slaves_100_percent_reregistered_secs": 29.697210112,
"master/slaves_25_percent_reregistered_secs": 2.244662016,
"master/slaves_50_percent_reregistered_secs": 3.599491072,
"master/slaves_75_percent_reregistered_secs": 9.53919616,
"master/slaves_90_percent_reregistered_secs": 29.697210112,
"master/slaves_99_percent_reregistered_secs": 29.697210112,

With 3606 agents, the last 1% take significant time
"master/slave_reregistrations": 3606,
"master/slave_shutdowns_canceled": 0,
"master/slave_shutdowns_completed": 0,
"master/slave_shutdowns_scheduled": 0,
"master/slave_unreachable_canceled": 0,
"master/slave_unreachable_completed": 0,
"master/slave_unreachable_scheduled": 0,
"master/slaves_100_percent_reregistered_secs": 58.585202944,
"master/slaves_25_percent_reregistered_secs": 9.966434048,
"master/slaves_50_percent_reregistered_secs": 20.259571968,
"master/slaves_75_percent_reregistered_secs": 30.598885888,
"master/slaves_90_percent_reregistered_secs": 36.396082944,
"master/slaves_99_percent_reregistered_secs": 39.811022848,


Thanks,

Xudong Ni


Re: Review Request 68706: Added master failover reregistration progress metrics.

Posted by Xudong Ni via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/
-----------------------------------------------------------

(Updated Oct. 9, 2018, 11:24 p.m.)


Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.


Changes
-------

fixed the style


Bugs: MESOS-9178
    https://issues.apache.org/jira/browse/MESOS-9178


Repository: mesos


Description
-------

During the master failover, the time that the master elected is
considered as the start of failover. In the progress of
reregistration, the percentile represents the time when such
percentile of agents finished registration again; The percentile of
these data as in this metrics can represent overall reregistration
progress; In case of degradation towards to the end of reregistration,
the high percentile can reflect it; In the case there are unreachable
agents in the failover, if certain percentile recovery couldn't be
reached, the intiail value of that percentile will not be updated.


Diffs (updated)
-----

  docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
  src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
  src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
  src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
  src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 


Diff: https://reviews.apache.org/r/68706/diff/3/

Changes: https://reviews.apache.org/r/68706/diff/2-3/


Testing
-------

Tested in mmaster with 6 reregistration agents:
"master/slave_reregistrations": 6,

In the middle of reregistration process:
"master/slaves_100_percent_reregistered_secs": 0,
"master/slaves_25_percent_reregistered_secs": 2.244662016,
"master/slaves_50_percent_reregistered_secs": 3.599491072,
"master/slaves_75_percent_reregistered_secs": 9.53919616,
"master/slaves_90_percent_reregistered_secs": 0,
"master/slaves_99_percent_reregistered_secs": 0,

When all registrations finished:
"master/slaves_100_percent_reregistered_secs": 29.697210112,
"master/slaves_25_percent_reregistered_secs": 2.244662016,
"master/slaves_50_percent_reregistered_secs": 3.599491072,
"master/slaves_75_percent_reregistered_secs": 9.53919616,
"master/slaves_90_percent_reregistered_secs": 29.697210112,
"master/slaves_99_percent_reregistered_secs": 29.697210112,


Thanks,

Xudong Ni