Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/01/24 10:22:42 UTC
Review Request: Resource Monitoring 7: Archive terminated executor statistics.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------
Review request for mesos, Benjamin Hindman and Vinod Kone.
Description
-------
This wires up the archival of terminated executor stats.
This addresses bug MESOS-324.
https://issues.apache.org/jira/browse/MESOS-324
Diffs
-----
src/slave/monitor.hpp PRE-CREATION
src/slave/monitor.cpp PRE-CREATION
src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018
Diff: https://reviews.apache.org/r/9095/diff/
Testing
-------
make check
Thanks,
Ben Mahler
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Charles Reiss <wo...@gmail.com>.
> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> > Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
>
> Charles Reiss wrote:
> Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).
>
> Ben Mahler wrote:
> Good point, there's definitely a bug here:
>
> -Executor 1 terminates
> -Archive stats for Executor 1
> -Executor 1 runs again on the same slave
> -We collect and export resource usage to STATS for Executor 1.
>
> Now, Executor 1 incorrectly remains archived, and while it will show up in the usage.json endpoint, it will never show up again in the statistics snapshot.json.
> The fix is to ensure that, when a new statistics sample comes in, the executor is no longer marked as archived. I'll make that fix in https://reviews.apache.org/r/9093/
>
> Were there any other issues here?
I didn't see anything else broken, though I didn't look very hard.
I would have preferred (and expected) statistics to be kept separate for separate executor attempts (e.g. keyed by the slave's UUID, which likely requires an IsolationModule API change to support), but it's not a big deal.
- Charles
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> > Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
>
> Charles Reiss wrote:
> Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).
Good point, there's definitely a bug here:
-Executor 1 terminates
-Archive stats for Executor 1
-Executor 1 runs again on the same slave
-We collect and export resource usage to STATS for Executor 1.
Now, Executor 1 incorrectly remains archived, and while it will show up in the usage.json endpoint, it will never show up again in the statistics snapshot.json.
The fix is to ensure that, when a new statistics sample comes in, the executor is no longer marked as archived. I'll make that fix in https://reviews.apache.org/r/9093/
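A minimal C++ sketch of the fix described above, for illustration only. The `StatsSketch` class, `ExecutorKey` alias, and the method names are hypothetical and are not the Mesos monitor API; the sketch only assumes that statistics are keyed by a (frameworkId, executorId) pair and that archived entries are hidden from the live snapshot.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical key: (frameworkId, executorId). Not the real Mesos types.
using ExecutorKey = std::pair<std::string, std::string>;

// Hypothetical stand-in for the statistics store discussed in this thread.
class StatsSketch {
public:
  // Record a new usage sample. If the executor was archived earlier (it
  // terminated and was relaunched with the same ID), un-archive it first
  // so that it shows up in the live snapshot again.
  void record(const ExecutorKey& key, double cpuSeconds) {
    archived_.erase(key);
    latest_[key] = cpuSeconds;
  }

  // Mark the statistics of a terminated executor as archived.
  void archive(const ExecutorKey& key) {
    if (latest_.count(key) > 0) {
      archived_.insert(key);
    }
  }

  bool isArchived(const ExecutorKey& key) const {
    return archived_.count(key) > 0;
  }

  // Keys that would appear in a snapshot of live (non-archived) executors.
  std::set<ExecutorKey> snapshot() const {
    std::set<ExecutorKey> live;
    for (const auto& entry : latest_) {
      if (archived_.count(entry.first) == 0) {
        live.insert(entry.first);
      }
    }
    return live;
  }

private:
  std::map<ExecutorKey, double> latest_;
  std::set<ExecutorKey> archived_;
};
```

Without the `archived_.erase(key)` call in `record()`, the four steps listed above would leave Executor 1 hidden from `snapshot()` even though fresh samples keep arriving.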
Were there any other issues here?
- Ben
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> > Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
>
> Charles Reiss wrote:
> Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).
>
> Ben Mahler wrote:
> Good point, there's definitely a bug here:
>
> -Executor 1 terminates
> -Archive stats for Executor 1
> -Executor 1 runs again on the same slave
> -We collect and export resource usage to STATS for Executor 1.
>
> Now, Executor 1 incorrectly remains archived, and while it will show up in the usage.json endpoint, it will never show up again in the statistics snapshot.json.
> The fix is to ensure that, when a new statistics sample comes in, the executor is no longer marked as archived. I'll make that fix in https://reviews.apache.org/r/9093/
>
> Were there any other issues here?
>
> Charles Reiss wrote:
> I didn't see anything else broken, though I didn't look very hard.
>
> I would have preferred (and expected) statistics to be kept separate for separate executor attempts (e.g. keyed by the slave's UUID, which likely requires an IsolationModule API change to support), but it's not a big deal.
Right, it would require an isolation module API change, at least with the way I've designed it.
I think two things are useful here:
(1) Statistics per executor run
(2) Statistics across executor runs
I've designed for (2) simply because it was easier given the current API, but I think for the webui (1) is indeed more useful. I'll think about this.
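To make the distinction between (1) and (2) concrete, here is a rough C++ sketch. Everything in it is an assumption for illustration: `RunKey`, `runUuid`, and `RunStatsSketch` are hypothetical names, not part of the monitor or isolation module API, and a single accumulated CPU-seconds value stands in for real resource statistics.

```cpp
#include <map>
#include <string>
#include <tuple>

// Hypothetical identifiers; the real Mesos types differ.
struct RunKey {
  std::string frameworkId;
  std::string executorId;
  std::string runUuid;  // Assumed per-launch identifier for one executor run.

  bool operator<(const RunKey& other) const {
    return std::tie(frameworkId, executorId, runUuid) <
           std::tie(other.frameworkId, other.executorId, other.runUuid);
  }
};

class RunStatsSketch {
public:
  // Accumulate usage for one specific run of an executor.
  void add(const RunKey& key, double cpuSeconds) {
    perRun_[key] += cpuSeconds;
  }

  // (1) Statistics per executor run.
  double forRun(const RunKey& key) const {
    auto it = perRun_.find(key);
    return it == perRun_.end() ? 0.0 : it->second;
  }

  // (2) Statistics across executor runs: sum over every run that shares
  // the same (frameworkId, executorId) pair.
  double acrossRuns(const std::string& frameworkId,
                    const std::string& executorId) const {
    double total = 0.0;
    for (const auto& entry : perRun_) {
      if (entry.first.frameworkId == frameworkId &&
          entry.first.executorId == executorId) {
        total += entry.second;
      }
    }
    return total;
  }

private:
  std::map<RunKey, double> perRun_;
};
```

Keying by run lets (2) be derived from (1), whereas keying only by (frameworkId, executorId) cannot recover per-run numbers afterwards.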
- Ben
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Charles Reiss <wo...@gmail.com>.
> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> > Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).
- Charles
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Charles Reiss <wo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------
src/slave/slave.cpp
<https://reviews.apache.org/r/9095/#comment33745>
Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
- Charles Reiss
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
> On Jan. 28, 2013, 10:23 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> > FWIW, with slave restart, the executor's UUID is going to be exposed to the isolation module, so a TODO here would be great.
Added a TODO inside monitor.hpp.
- Ben
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15778
-----------------------------------------------------------
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15778
-----------------------------------------------------------
Ship it!
src/slave/slave.cpp
<https://reviews.apache.org/r/9095/#comment33965>
FWIW, with slave restart, the executor's UUID is going to be exposed to the isolation module, so a TODO here would be great.
- Vinod Kone
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review16818
-----------------------------------------------------------
Ship it!
- Benjamin Hindman
On Feb. 13, 2013, 2:46 a.m., Ben Mahler wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9095/
> -----------------------------------------------------------
>
> (Updated Feb. 13, 2013, 2:46 a.m.)
>
>
> Review request for mesos, Benjamin Hindman and Vinod Kone.
>
>
> Description
> -------
>
> This wires up the archival of terminated executor stats.
>
>
> This addresses bug MESOS-324.
> https://issues.apache.org/jira/browse/MESOS-324
>
>
> Diffs
> -----
>
> src/slave/monitor.cpp PRE-CREATION
>
> Diff: https://reviews.apache.org/r/9095/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Ben Mahler
>
>
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------
(Updated Feb. 25, 2013, 7:17 p.m.)
Review request for mesos, Benjamin Hindman and Vinod Kone.
Changes
-------
Rebased off trunk.
Description
-------
This wires up the archival of terminated executor stats.
This addresses bug MESOS-324.
https://issues.apache.org/jira/browse/MESOS-324
Diffs (updated)
-----
src/slave/monitor.cpp PRE-CREATION
Diff: https://reviews.apache.org/r/9095/diff/
Testing
-------
make check
Thanks,
Ben Mahler
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------
(Updated Feb. 22, 2013, 12:32 a.m.)
Review request for mesos, Benjamin Hindman and Vinod Kone.
Changes
-------
Rebased off trunk.
Description
-------
This wires up the archival of terminated executor stats.
This addresses bug MESOS-324.
https://issues.apache.org/jira/browse/MESOS-324
Diffs (updated)
-----
src/slave/monitor.cpp PRE-CREATION
Diff: https://reviews.apache.org/r/9095/diff/
Testing
-------
make check
Thanks,
Ben Mahler
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------
(Updated Feb. 13, 2013, 2:46 a.m.)
Review request for mesos, Benjamin Hindman and Vinod Kone.
Changes
-------
Highly simplified given the addition of unwatch() in the monitor.
Description
-------
This wires up the archival of terminated executor stats.
This addresses bug MESOS-324.
https://issues.apache.org/jira/browse/MESOS-324
Diffs (updated)
-----
src/slave/monitor.cpp PRE-CREATION
Diff: https://reviews.apache.org/r/9095/diff/
Testing
-------
make check
Thanks,
Ben Mahler
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------
(Updated Jan. 30, 2013, 4:02 a.m.)
Review request for mesos, Benjamin Hindman and Vinod Kone.
Changes
-------
Updated with upstream changes.
Description
-------
This wires up the archival of terminated executor stats.
This addresses bug MESOS-324.
https://issues.apache.org/jira/browse/MESOS-324
Diffs (updated)
-----
src/slave/monitor.hpp PRE-CREATION
src/slave/monitor.cpp PRE-CREATION
src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018
Diff: https://reviews.apache.org/r/9095/diff/
Testing
-------
make check
Thanks,
Ben Mahler
Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.
Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------
(Updated Jan. 29, 2013, 1:42 a.m.)
Review request for mesos, Benjamin Hindman and Vinod Kone.
Changes
-------
Vinod's review.
Description
-------
This wires up the archival of terminated executor stats.
This addresses bug MESOS-324.
https://issues.apache.org/jira/browse/MESOS-324
Diffs (updated)
-----
src/slave/monitor.hpp PRE-CREATION
src/slave/monitor.cpp PRE-CREATION
src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018
Diff: https://reviews.apache.org/r/9095/diff/
Testing
-------
make check
Thanks,
Ben Mahler