You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/03/01 01:15:08 UTC
Re: Review Request: Slave Restart (Part Quatre): Checkpointed slave,
framework, executor and task
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8535/#review17210
-----------------------------------------------------------
Ship it!
src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8535/#comment36502>
Can you make this comment a little clearer? My suggestion:
Before
// Checkpoint the forked pid, if necessary.
// We do this here instead of in the parent process, because if we do it
// in the slave, the slave might die immediately after it forks but before
// it writes the the pid to disk. This will result in an orphaned executor
// process that the slave has no idea about when doing recovery.
s/the the/the
After:
// Checkpoint the forked pid, if necessary.
// The checkpointing must be done in the forked process, because
// the slave process can die immediately after the isolation module forks but before it would have
// a chance to write the pid to disk. That would result in an orphaned
// executor process unknown to the slave when doing recovery.
src/slave/flags.hpp
<https://reviews.apache.org/r/8535/#comment36503>
Since operators will be reading this string, let's make it very clear for them:
Before:
"Whether to checkpoint relevant info about the slave and frameworks\n"
"to disk. This will enable a restarted slave to recover\n"
"status updates and reconnect with old executors, if so desired."
After:
"Enable checkpointing of slave and framework information to disk.\n"
"This enables a restarted slave to recover status updates and reconnect\n"
"with old executors, if --recover is specified."
What is if so desired? If --recover is specified?
src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/8535/#comment36505>
Ditto on my earlier review, I'd rather this send TASK_FAILED or TASK_LOST with this message. Or at least, please add a TODO that this is not implemented because we're deprecating the lxc isolation module?
Seems pretty bad that a framework can cause slaves to exit at will!
src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/8535/#comment36506>
ditto above comment
src/slave/slave.hpp
<https://reviews.apache.org/r/8535/#comment36511>
Why did you move these in this review, as opposed to the review where you introduced them?
src/slave/slave.hpp
<https://reviews.apache.org/r/8535/#comment36508>
s/This callback is called/This continuation runs
src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36512>
s/but the slave doesn't support checkpointing!/but checkpointing is not enabled on the slave!/
src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36513>
ditto "not enabled" vs "supported"
src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36509>
ditto from my review of part 3, with respect to just having statusUpdate
src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36510>
getLibprocessPIDPath sounds ambiguous, maybe this would be better named as getExecutorDriverPIDPath?
- Ben Mahler
On Feb. 23, 2013, 8:12 a.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8535/
> -----------------------------------------------------------
>
> (Updated Feb. 23, 2013, 8:12 a.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Description
> -------
>
> Checkpoints slave id, framework pid, executor pids (libprocess, execed, forked) and task info.
>
>
> Diffs
> -----
>
> include/mesos/mesos.proto 38235157d45bdccb676e5c3241c21b585a6f8801
> src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24
> src/slave/cgroups_isolation_module.hpp 669efa14ba2603764aa68ae19a44e79dbfdec192
> src/slave/cgroups_isolation_module.cpp a2eba6f96f5d8a4b1257571aa29e37c5682aab8d
> src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd
> src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3
> src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6f0a4
> src/slave/lxc_isolation_module.cpp 30cff2a49339bb07030727d30352536a0a22d58c
> src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a
> src/slave/process_based_isolation_module.hpp f1817192582e3646f8dcf17934ba7998829e8fd6
> src/slave/process_based_isolation_module.cpp 12a579cba56cd3dac384bc7919b0d5537b0e429d
> src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039
> src/slave/slave.cpp d4721c3eb51db87278d05f6fbe2eadb8a3a9b4dd
> src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445
> src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b
> src/tests/protobuf_io_tests.cpp 5bd1e676761ac35d30b9491fb397830c8a008795
> src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c44
> src/tests/status_update_manager_tests.cpp PRE-CREATION
> src/tests/utils.hpp be457117515ee727af101370b26bf9188afb8f45
>
> Diff: https://reviews.apache.org/r/8535/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>
Re: Review Request: Slave Restart (Part Quatre): Checkpointed slave,
framework, executor and task
Posted by Vinod Kone <vi...@gmail.com>.
> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/cgroups_isolation_module.cpp, line 505
> > <https://reviews.apache.org/r/8535/diff/8/?file=261376#file261376line505>
> >
> > Can you make this comment a little clearer? My suggestion:
> >
> > Before
> > // Checkpoint the forked pid, if necessary.
> > // We do this here instead of in the parent process, because if we do it
> > // in the slave, the slave might die immediately after it forks but before
> > // it writes the the pid to disk. This will result in an orphaned executor
> > // process that the slave has no idea about when doing recovery.
> >
> > s/the the/the
> >
> > After:
> > // Checkpoint the forked pid, if necessary.
> > // The checkpointing must be done in the forked process, because
> > // the slave process can die immediately after the isolation module forks but before it would have
> > // a chance to write the pid to disk. That would result in an orphaned
> > // executor process unknown to the slave when doing recovery.
> >
done. thanks.
> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/lxc_isolation_module.cpp, line 124
> > <https://reviews.apache.org/r/8535/diff/8/?file=261380#file261380line124>
> >
> > Ditto on my earlier review, I'd rather this send TASK_FAILED or TASK_LOST with this message. Or at least, please add a TODO that this is not implemented because we're deprecating the lxc isolation module?
> >
> > Seems pretty bad that a framework can cause slaves to exit at will!
Thansk for the push. Decided to make this a CHECK instead.
Also added logic in slave/main.cpp to stop users from starting a slave w/ lxc and checkpointing.
I think failing fast is the right semantics.
> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/slave.hpp, line 137
> > <https://reviews.apache.org/r/8535/diff/8/?file=261384#file261384line137>
> >
> > Why did you move these in this review, as opposed to the review where you introduced them?
moved them back.
> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/slave.hpp, line 196
> > <https://reviews.apache.org/r/8535/diff/8/?file=261384#file261384line196>
> >
> > s/This callback is called/This continuation runs
I don't like "this continuation runs".
Instead, I went with s/callback//
> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 915
> > <https://reviews.apache.org/r/8535/diff/8/?file=261385#file261385line915>
> >
> > getLibprocessPIDPath sounds ambiguous, maybe this would be better named as getExecutorDriverPIDPath?
We chose "libprocess" in the pid name, to differentiate with the os pid ('forkedpid').
- Vinod
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8535/#review17210
-----------------------------------------------------------
On Feb. 23, 2013, 8:12 a.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8535/
> -----------------------------------------------------------
>
> (Updated Feb. 23, 2013, 8:12 a.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Description
> -------
>
> Checkpoints slave id, framework pid, executor pids (libprocess, execed, forked) and task info.
>
>
> Diffs
> -----
>
> include/mesos/mesos.proto 38235157d45bdccb676e5c3241c21b585a6f8801
> src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24
> src/slave/cgroups_isolation_module.hpp 669efa14ba2603764aa68ae19a44e79dbfdec192
> src/slave/cgroups_isolation_module.cpp a2eba6f96f5d8a4b1257571aa29e37c5682aab8d
> src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd
> src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3
> src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6f0a4
> src/slave/lxc_isolation_module.cpp 30cff2a49339bb07030727d30352536a0a22d58c
> src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a
> src/slave/process_based_isolation_module.hpp f1817192582e3646f8dcf17934ba7998829e8fd6
> src/slave/process_based_isolation_module.cpp 12a579cba56cd3dac384bc7919b0d5537b0e429d
> src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039
> src/slave/slave.cpp d4721c3eb51db87278d05f6fbe2eadb8a3a9b4dd
> src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445
> src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b
> src/tests/protobuf_io_tests.cpp 5bd1e676761ac35d30b9491fb397830c8a008795
> src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c44
> src/tests/status_update_manager_tests.cpp PRE-CREATION
> src/tests/utils.hpp be457117515ee727af101370b26bf9188afb8f45
>
> Diff: https://reviews.apache.org/r/8535/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>