You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/03/01 01:15:08 UTC

Re: Review Request: Slave Restart (Part Quatre): Checkpointed slave, framework, executor and task

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8535/#review17210
-----------------------------------------------------------

Ship it!



src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8535/#comment36502>

    Can you make this comment a little clearer? My suggestion:
    
    Before
        // Checkpoint the forked pid, if necessary.
        // We do this here instead of in the parent process, because if we do it
        // in the slave, the slave might die immediately after it forks but before
        // it writes the the pid to disk. This will result in an orphaned executor
        // process that the slave has no idea about when doing recovery.
    
    s/the the/the
    
    After:
        // Checkpoint the forked pid, if necessary.
        // The checkpointing must be done in the forked process, because
        // the slave process can die immediately after the isolation module forks but before it would have
        // a chance to write the pid to disk. That would result in an orphaned
        // executor process unknown to the slave when doing recovery.
    



src/slave/flags.hpp
<https://reviews.apache.org/r/8535/#comment36503>

    Since operators will be reading this string, let's make it very clear for them:
    
    Before:
    "Whether to checkpoint relevant info about the slave and frameworks\n"
    "to disk. This will enable a restarted slave to recover\n"
    "status updates and reconnect with old executors, if so desired."
    
    After:
    "Enable checkpointing of slave and framework information to disk.\n"
    "This enables a restarted slave to recover status updates and reconnect\n"
    "with old executors, if --recover is specified."
    
    What is if so desired? If --recover is specified?



src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/8535/#comment36505>

    Ditto on my earlier review, I'd rather this send TASK_FAILED or TASK_LOST with this message. Or at least, please add a TODO that this is not implemented because we're deprecating the lxc isolation module?
    
    Seems pretty bad that a framework can cause slaves to exit at will!



src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/8535/#comment36506>

    ditto above comment



src/slave/slave.hpp
<https://reviews.apache.org/r/8535/#comment36511>

    Why did you move these in this review, as opposed to the review where you introduced them?



src/slave/slave.hpp
<https://reviews.apache.org/r/8535/#comment36508>

    s/This callback is called/This continuation runs



src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36512>

    s/but the slave doesn't support checkpointing!/but checkpointing is not enabled on the slave!/



src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36513>

    ditto "not enabled" vs "supported"



src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36509>

    ditto from my review of part 3, with respect to just having statusUpdate 



src/slave/slave.cpp
<https://reviews.apache.org/r/8535/#comment36510>

    getLibprocessPIDPath sounds ambiguous, maybe this would be better named as getExecutorDriverPIDPath?


- Ben Mahler


On Feb. 23, 2013, 8:12 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8535/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2013, 8:12 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Checkpoints slave id, framework pid, executor pids (libprocess, execed, forked) and task info.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 38235157d45bdccb676e5c3241c21b585a6f8801 
>   src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24 
>   src/slave/cgroups_isolation_module.hpp 669efa14ba2603764aa68ae19a44e79dbfdec192 
>   src/slave/cgroups_isolation_module.cpp a2eba6f96f5d8a4b1257571aa29e37c5682aab8d 
>   src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd 
>   src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3 
>   src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6f0a4 
>   src/slave/lxc_isolation_module.cpp 30cff2a49339bb07030727d30352536a0a22d58c 
>   src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a 
>   src/slave/process_based_isolation_module.hpp f1817192582e3646f8dcf17934ba7998829e8fd6 
>   src/slave/process_based_isolation_module.cpp 12a579cba56cd3dac384bc7919b0d5537b0e429d 
>   src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039 
>   src/slave/slave.cpp d4721c3eb51db87278d05f6fbe2eadb8a3a9b4dd 
>   src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445 
>   src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b 
>   src/tests/protobuf_io_tests.cpp 5bd1e676761ac35d30b9491fb397830c8a008795 
>   src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c44 
>   src/tests/status_update_manager_tests.cpp PRE-CREATION 
>   src/tests/utils.hpp be457117515ee727af101370b26bf9188afb8f45 
> 
> Diff: https://reviews.apache.org/r/8535/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request: Slave Restart (Part Quatre): Checkpointed slave, framework, executor and task

Posted by Vinod Kone <vi...@gmail.com>.

> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/cgroups_isolation_module.cpp, line 505
> > <https://reviews.apache.org/r/8535/diff/8/?file=261376#file261376line505>
> >
> >     Can you make this comment a little clearer? My suggestion:
> >     
> >     Before
> >         // Checkpoint the forked pid, if necessary.
> >         // We do this here instead of in the parent process, because if we do it
> >         // in the slave, the slave might die immediately after it forks but before
> >         // it writes the the pid to disk. This will result in an orphaned executor
> >         // process that the slave has no idea about when doing recovery.
> >     
> >     s/the the/the
> >     
> >     After:
> >         // Checkpoint the forked pid, if necessary.
> >         // The checkpointing must be done in the forked process, because
> >         // the slave process can die immediately after the isolation module forks but before it would have
> >         // a chance to write the pid to disk. That would result in an orphaned
> >         // executor process unknown to the slave when doing recovery.
> >

done. thanks.


> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/lxc_isolation_module.cpp, line 124
> > <https://reviews.apache.org/r/8535/diff/8/?file=261380#file261380line124>
> >
> >     Ditto on my earlier review, I'd rather this send TASK_FAILED or TASK_LOST with this message. Or at least, please add a TODO that this is not implemented because we're deprecating the lxc isolation module?
> >     
> >     Seems pretty bad that a framework can cause slaves to exit at will!

Thansk for the push. Decided to make this a CHECK instead.

Also added logic in slave/main.cpp to stop users from starting a slave w/ lxc and checkpointing.

I think failing fast is the right semantics.


> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/slave.hpp, line 137
> > <https://reviews.apache.org/r/8535/diff/8/?file=261384#file261384line137>
> >
> >     Why did you move these in this review, as opposed to the review where you introduced them?

moved them back.


> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/slave.hpp, line 196
> > <https://reviews.apache.org/r/8535/diff/8/?file=261384#file261384line196>
> >
> >     s/This callback is called/This continuation runs

I don't like "this continuation runs".

Instead, I went with s/callback//


> On March 1, 2013, 12:15 a.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 915
> > <https://reviews.apache.org/r/8535/diff/8/?file=261385#file261385line915>
> >
> >     getLibprocessPIDPath sounds ambiguous, maybe this would be better named as getExecutorDriverPIDPath?

We chose "libprocess" in the pid name, to differentiate with the os pid ('forkedpid').


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8535/#review17210
-----------------------------------------------------------


On Feb. 23, 2013, 8:12 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8535/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2013, 8:12 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Checkpoints slave id, framework pid, executor pids (libprocess, execed, forked) and task info.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 38235157d45bdccb676e5c3241c21b585a6f8801 
>   src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24 
>   src/slave/cgroups_isolation_module.hpp 669efa14ba2603764aa68ae19a44e79dbfdec192 
>   src/slave/cgroups_isolation_module.cpp a2eba6f96f5d8a4b1257571aa29e37c5682aab8d 
>   src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd 
>   src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3 
>   src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6f0a4 
>   src/slave/lxc_isolation_module.cpp 30cff2a49339bb07030727d30352536a0a22d58c 
>   src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a 
>   src/slave/process_based_isolation_module.hpp f1817192582e3646f8dcf17934ba7998829e8fd6 
>   src/slave/process_based_isolation_module.cpp 12a579cba56cd3dac384bc7919b0d5537b0e429d 
>   src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039 
>   src/slave/slave.cpp d4721c3eb51db87278d05f6fbe2eadb8a3a9b4dd 
>   src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445 
>   src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b 
>   src/tests/protobuf_io_tests.cpp 5bd1e676761ac35d30b9491fb397830c8a008795 
>   src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c44 
>   src/tests/status_update_manager_tests.cpp PRE-CREATION 
>   src/tests/utils.hpp be457117515ee727af101370b26bf9188afb8f45 
> 
> Diff: https://reviews.apache.org/r/8535/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>