Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/01/24 10:22:42 UTC

Review Request: Resource Monitoring 7: Archive terminated executor statistics.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Description
-------

This wires up the archival of terminated executor stats.


This addresses bug MESOS-324.
    https://issues.apache.org/jira/browse/MESOS-324


Diffs
-----

  src/slave/monitor.hpp PRE-CREATION 
  src/slave/monitor.cpp PRE-CREATION 
  src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 

Diff: https://reviews.apache.org/r/9095/diff/


Testing
-------

make check


Thanks,

Ben Mahler


Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Charles Reiss <wo...@gmail.com>.

> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> >     Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
> 
> Charles Reiss wrote:
>     Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).
> 
> Ben Mahler wrote:
>     Good point, there's definitely a bug here:
>     
>     -Executor 1 terminates
>     -Archive stats for Executor 1
>     -Executor 1 runs again on the same slave
>     -We collect and export resource usage to STATS for Executor 1.
>     
>     Now, Executor 1 incorrectly remains archived, and while it will show up in the usage.json endpoint, it will never show up again in the statistics snapshot.json.
>     The fix here is to ensure that, when a new statistic comes in, the executor is no longer archived. I'll make that fix in https://reviews.apache.org/r/9093/
>     
>     Were there any other issues here?

I didn't see anything else broken, though I didn't look very hard.

I would have preferred (and expected) statistics to be kept separately for each executor attempt (e.g. keyed by the slave's UUID, which likely requires an IsolationModule API change to support), but it's not a big deal.


- Charles


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.

> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> >     Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
> 
> Charles Reiss wrote:
>     Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).

Good point, there's definitely a bug here:

-Executor 1 terminates
-Archive stats for Executor 1
-Executor 1 runs again on the same slave
-We collect and export resource usage to STATS for Executor 1.

Now, Executor 1 incorrectly remains archived, and while it will show up in the usage.json endpoint, it will never show up again in the statistics snapshot.json.
The fix here is to ensure that, when a new statistic comes in, the executor is no longer archived. I'll make that fix in https://reviews.apache.org/r/9093/
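
A minimal sketch of that fix, for illustration only. It assumes a hypothetical StatisticsStore keyed by (frameworkId, executorId) strings; none of these names are the actual Mesos types or the code under review. Recording a new sample clears any earlier archived mark, so a relaunched executor becomes visible again:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical key: (frameworkId, executorId). Not the real Mesos ID types.
using ExecutorKey = std::pair<std::string, std::string>;

// Hypothetical stand-in for the statistics store discussed in this thread.
class StatisticsStore {
public:
  // Record a new sample. If the executor was archived earlier (it terminated
  // and was later relaunched with the same ID), un-archive it first.
  void set(const ExecutorKey& key, double value) {
    archived_.erase(key);
    latest_[key] = value;
  }

  // Mark a terminated executor's statistics as archived.
  void archive(const ExecutorKey& key) {
    if (latest_.count(key) > 0) {
      archived_.insert(key);
    }
  }

  bool isArchived(const ExecutorKey& key) const {
    return archived_.count(key) > 0;
  }

  // True if the executor would appear in a snapshot of live statistics.
  bool inSnapshot(const ExecutorKey& key) const {
    return latest_.count(key) > 0 && !isArchived(key);
  }

private:
  std::map<ExecutorKey, double> latest_;
  std::set<ExecutorKey> archived_;
};

// Replays the sequence listed above: terminate, archive, relaunch, sample.
bool relaunchedExecutorIsVisible() {
  StatisticsStore store;
  const ExecutorKey key{"framework-1", "executor-1"};
  store.set(key, 0.5);   // First run reports usage.
  store.archive(key);    // Executor terminates; stats are archived.
  assert(store.isArchived(key));
  store.set(key, 0.7);   // Same ID runs again and reports usage.
  return store.inSnapshot(key);
}
```

Without the erase in set(), the last step would leave the executor archived, which is the bug described above.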

Were there any other issues here?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.

> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> >     Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?
> 
> Charles Reiss wrote:
>     Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).
> 
> Ben Mahler wrote:
>     Good point, there's definitely a bug here:
>     
>     -Executor 1 terminates
>     -Archive stats for Executor 1
>     -Executor 1 runs again on the same slave
>     -We collect and export resource usage to STATS for Executor 1.
>     
>     Now, Executor 1 incorrectly remains archived, and while it will show up in the usage.json endpoint, it will never show up again in the statistics snapshot.json.
>     The fix here is to ensure that, when a new statistic comes in, the executor is no longer archived. I'll make that fix in https://reviews.apache.org/r/9093/
>     
>     Were there any other issues here?
> 
> Charles Reiss wrote:
>     I didn't see anything else broken, though I didn't look very hard.
>     
>     I would have preferred (and expected) statistics to be kept separately for each executor attempt (e.g. keyed by the slave's UUID, which likely requires an IsolationModule API change to support), but it's not a big deal.

Right, it would require an isolation module API change, at least with the way I've designed it.

I think two things are useful here:
  (1) Statistics per executor run
  (2) Statistics across executor runs

I've designed for (2) simply because it was easier given the current API, but I think for the webui (1) is indeed more useful. I'll think about this.
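
To make the difference between (1) and (2) concrete, here is a rough sketch. It assumes hypothetical string IDs and a per-launch run UUID that the current isolation module API does not expose; the Sample and Aggregates types are made up for this example and are not Mesos code. The only difference between the two views is whether the run UUID is part of the key:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical sample; real Mesos uses dedicated ID types, not strings.
struct Sample {
  std::string frameworkId;
  std::string executorId;
  std::string runUuid;   // Assumed to be unique per executor launch.
  double cpuSeconds;
};

// (2) Across runs: every launch with the same IDs shares one entry.
using AcrossRunsKey = std::pair<std::string, std::string>;

// (1) Per run: the run UUID separates launches that reuse the same IDs.
using PerRunKey = std::tuple<std::string, std::string, std::string>;

struct Aggregates {
  std::map<AcrossRunsKey, double> acrossRuns;
  std::map<PerRunKey, double> perRun;

  // Accumulate CPU time under both keying schemes.
  void add(const Sample& s) {
    assert(!s.runUuid.empty());  // Per-run keying needs a run identifier.
    acrossRuns[std::make_pair(s.frameworkId, s.executorId)] += s.cpuSeconds;
    perRun[std::make_tuple(s.frameworkId, s.executorId, s.runUuid)] +=
        s.cpuSeconds;
  }
};
```

With a crash-looping executor, acrossRuns keeps one growing entry while perRun gets a new entry for each launch.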


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Charles Reiss <wo...@gmail.com>.

> On Jan. 24, 2013, 5:22 p.m., Charles Reiss wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> >     Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?

Okay, sorry, I really should have looked more at what archive() did before writing that. However, I think you have a problem with the frameworkId/executorId pairs not being unique over a window when an executor gets restarted with the same ID (crash-loop scenario is the obvious case where this is likely again).


- Charles


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Charles Reiss <wo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15645
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/9095/#comment33745>

    Aren't you going to be missing the last resource sample for the executor (perhaps the only one for an, e.g., crash-looping executor)?


- Charles Reiss



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.

> On Jan. 28, 2013, 10:23 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 1084
> > <https://reviews.apache.org/r/9095/diff/1/?file=251559#file251559line1084>
> >
> >     FWIW, with slave restart, the executor's UUID is going to be exposed to the isolation module, so a TODO here would be great.

Added a TODO inside monitor.hpp.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15778
-----------------------------------------------------------



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review15778
-----------------------------------------------------------

Ship it!



src/slave/slave.cpp
<https://reviews.apache.org/r/9095/#comment33965>

    FWIW, with slave restart, the executor's UUID is going to be exposed to the isolation module, so a TODO here would be great.


- Vinod Kone



Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/#review16818
-----------------------------------------------------------

Ship it!


- Benjamin Hindman


On Feb. 13, 2013, 2:46 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9095/
> -----------------------------------------------------------
> 
> (Updated Feb. 13, 2013, 2:46 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Description
> -------
> 
> This wires up the archival of terminated executor stats.
> 
> 
> This addresses bug MESOS-324.
>     https://issues.apache.org/jira/browse/MESOS-324
> 
> 
> Diffs
> -----
> 
>   src/slave/monitor.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/9095/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------

(Updated Feb. 25, 2013, 7:17 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Rebased off trunk.


Description
-------

This wires up the archival of terminated executor stats.


This addresses bug MESOS-324.
    https://issues.apache.org/jira/browse/MESOS-324


Diffs (updated)
-----

  src/slave/monitor.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/9095/diff/


Testing
-------

make check


Thanks,

Ben Mahler


Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------

(Updated Feb. 22, 2013, 12:32 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Rebased off trunk.


Description
-------

This wires up the archival of terminated executor stats.


This addresses bug MESOS-324.
    https://issues.apache.org/jira/browse/MESOS-324


Diffs (updated)
-----

  src/slave/monitor.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/9095/diff/


Testing
-------

make check


Thanks,

Ben Mahler


Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------

(Updated Feb. 13, 2013, 2:46 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Highly simplified given the addition of unwatch() in the monitor.


Description
-------

This wires up the archival of terminated executor stats.


This addresses bug MESOS-324.
    https://issues.apache.org/jira/browse/MESOS-324


Diffs (updated)
-----

  src/slave/monitor.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/9095/diff/


Testing
-------

make check


Thanks,

Ben Mahler


Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------

(Updated Jan. 30, 2013, 4:02 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Updated with upstream changes.


Description
-------

This wires up the archival of terminated executor stats.


This addresses bug MESOS-324.
    https://issues.apache.org/jira/browse/MESOS-324


Diffs (updated)
-----

  src/slave/monitor.hpp PRE-CREATION 
  src/slave/monitor.cpp PRE-CREATION 
  src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 

Diff: https://reviews.apache.org/r/9095/diff/


Testing
-------

make check


Thanks,

Ben Mahler


Re: Review Request: Resource Monitoring 7: Archive terminated executor statistics.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9095/
-----------------------------------------------------------

(Updated Jan. 29, 2013, 1:42 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Vinod's review.


Description
-------

This wires up the archival of terminated executor stats.


This addresses bug MESOS-324.
    https://issues.apache.org/jira/browse/MESOS-324


Diffs (updated)
-----

  src/slave/monitor.hpp PRE-CREATION 
  src/slave/monitor.cpp PRE-CREATION 
  src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 

Diff: https://reviews.apache.org/r/9095/diff/


Testing
-------

make check


Thanks,

Ben Mahler