You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/03/01 21:43:49 UTC

Re: Review Request: Slave Restart (Part 9): Recover executors and isolation module

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8762/#review17260
-----------------------------------------------------------

Ship it!


Great tests, pushes me to write better tests for my components! (usage() on the isolation modules in untested from my monitoring changes =/)

Also, I thought you were going to pass in the SlaveID through the environment?


src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36696>

    Can we maybe spell this out in the public executor interfaces? Considering how harsh this is, I like the harshness if we have this contract, as people will _definitely_ discover they are doing this.
    
    One would probably assume the driver will store messages for later delivery if sent pre-registration. That would be a nicer contract than doing this, but I see the added complexity =/



src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36697>

    ditto



src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36693>

    s/those//
    s/haven't/have not



src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36699>

    Were you going to add slave id to these?
    
    While you're at it: https://issues.apache.org/jira/browse/MESOS-334
    
    Or maybe let's not do too many things in this change..



src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36702>

    Can you split off the killed case in terms of the failed message?
    
    I see unknown as more serious than killed (since killed can happen due to known races).



src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36703>

    No need for a comment on this in the .cpp file since it's in the header, right?



src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36704>

    would be awful nice to pass that option through directly ;)



src/slave/flags.hpp
<https://reviews.apache.org/r/8762/#comment36705>

    great



src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36708>

    ditto on splitting the failure message



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36713>

    given the context here, "terminating" seems more appropriate
    
    you would set terminating to true, and then you call terminate, makes more sense IMO



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36714>

    empty()?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36716>

    Does this imply the StatusUpdateManager provides idempotent primitives in general?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36717>

    Kill "If we are here, ", I know the comment is for this spot in the code because the comment was written in this spot for a reason! ;)



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36718>

    comma after exited



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36719>

    Isn't the reaper only for the process isolation module?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36720>

    quotes on executor id but not framework id?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36722>

    s/recovering/recovery of/



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36723>

    ditto



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36725>

    s/framework/the framework entry/



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36727>

    s/Create executor/Create the executor entry/



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36728>

    kill newline?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36724>

    ditto



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36729>

    s/,//



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36730>

    s/by/for/?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36732>

    else?



src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36733>

    s/Setup the monitoring./Begin monitoring the executor./



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8762/#comment36734>

    Line wrap at 70


- Ben Mahler


On Feb. 27, 2013, 7:59 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8762/
> -----------------------------------------------------------
> 
> (Updated Feb. 27, 2013, 7:59 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Properly recovers and reconnects with executors.
> 
> Also, recovers isolation module.
> 
> This is pretty much the whole of recovery! (I'm going to send out another tiny review for properly doing incompatible upgrades)
> 
> 
> Diffs
> -----
> 
>   src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24 
>   src/launcher/launcher.hpp ead58c0f60b6bff532f241a224b547b25e292175 
>   src/launcher/launcher.cpp f2d68c5fd1201b75c5a8f96599d7fb36516687fb 
>   src/launcher/main.cpp e90df85ae433431defdd251490c74a482a58f743 
>   src/messages/messages.proto 815fcbbcb4a8643f50950a294cedf7281b2a187f 
>   src/slave/cgroups_isolation_module.hpp a04fc46b15d2741886f5847cadbdc9bed351c588 
>   src/slave/cgroups_isolation_module.cpp a779de80d13c67e507d7d2ee788fcdaa5e71daca 
>   src/slave/constants.hpp 08db75f3e5af0ae79298546386deddf00beb8de9 
>   src/slave/constants.cpp 159dd12ec1e6b5f33fd75fc3a66d8ac897fc40b7 
>   src/slave/flags.hpp 9cd436cae5772b906b3566eb1f344904d5894214 
>   src/slave/isolation_module.hpp d7cc02b8ea0cde591ee61059efd79fbacaa74769 
>   src/slave/lxc_isolation_module.hpp 6be99038fb547e43e74ef5469f2339bbe82017cf 
>   src/slave/lxc_isolation_module.cpp 1ed3c87f884089e032bba2ed89be109abe6c1c13 
>   src/slave/process_based_isolation_module.hpp 6a4e6aef9ab4b13cce9341f6f35c6b24e6f8fde7 
>   src/slave/process_based_isolation_module.cpp ff98d105af675dfc66070feaa43b42c1aa438fd8 
>   src/slave/reaper.cpp c0ee4b4c07fd792bcb39455b666808b712eb32c2 
>   src/slave/slave.hpp 7648c33230c1900eda7529045c5df9ccab105d47 
>   src/slave/slave.cpp 8c2e1bfc363491c681177676f9dfe5f229276f7d 
>   src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b 
>   src/slave/status_update_manager.cpp PRE-CREATION 
>   src/tests/slave_recovery_tests.cpp PRE-CREATION 
>   src/tests/utils.hpp 4600c2f6e0ceabd6a38f8bb762da314b795c23a0 
> 
> Diff: https://reviews.apache.org/r/8762/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request: Slave Restart (Part 9): Recover executors and isolation module

Posted by Ben Mahler <be...@gmail.com>.

> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/cgroups_isolation_module.cpp, line 620
> > <https://reviews.apache.org/r/8762/diff/7/?file=263163#file263163line620>
> >
> >     Can you split off the killed case in terms of the failed message?
> >     
> >     I see unknown as more serious than killed (since killed can happen due to known races).
> 
> Vinod Kone wrote:
>     I don't see the benefit, considering the fact that a killed executor becomes unknown as soon as it gets cleaned up. Also note, that we use the same phrasing across a bunch of functions, across isolation modules.

That's no reason to continue the bad habit! ;)

This would be useful for forensics IMO


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/process_based_isolation_module.cpp, line 329
> > <https://reviews.apache.org/r/8762/diff/7/?file=263171#file263171line329>
> >
> >     ditto on splitting the failure message
> 
> Vinod Kone wrote:
>     see above.

Why wouldn't this be useful for forensics?


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 415-416
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line415>
> >
> >     given the context here, "terminating" seems more appropriate
> >     
> >     you would set terminating to true, and then you call terminate, makes more sense IMO
> 
> Vinod Kone wrote:
>     "terminating" would be confusing, because intuitively it sounds like someone called 'terminate()'.

But you're immediately calling terminate(), so what you're saying it sounds like is what we want ;)


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1151
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1151>
> >
> >     Kill "If we are here, ", I know the comment is for this spot in the code because the comment was written in this spot for a reason! ;)
> 
> Vinod Kone wrote:
>     i like the emphasis.

Can you explain why? It's unnecessary less succinct. I feel like if we apply it here, there's nothing to stop us from applying it everywhere?


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1702
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1702>
> >
> >     quotes on executor id but not framework id?
> 
> Vinod Kone wrote:
>     as i explained in an earlier review, this has been the convention, for better or worse.

fair enough, I guess we'll start killing them when we sanitize executor ids


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1733
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1733>
> >
> >     s/Create executor/Create the executor entry/

Why did you drop? I think either correct it or kill the comment? Currently it's not a sentence.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1721
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1721>
> >
> >     s/framework/the framework entry/

Why did you drop? I think either correct it or kill the comment? Currently it's not a sentence.


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8762/#review17260
-----------------------------------------------------------


On March 4, 2013, 1:26 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8762/
> -----------------------------------------------------------
> 
> (Updated March 4, 2013, 1:26 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Properly recovers and reconnects with executors.
> 
> Also, recovers isolation module.
> 
> This is pretty much the whole of recovery! (I'm going to send out another tiny review for properly doing incompatible upgrades)
> 
> 
> Diffs
> -----
> 
>   src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24 
>   src/launcher/launcher.hpp ead58c0f60b6bff532f241a224b547b25e292175 
>   src/launcher/launcher.cpp f2d68c5fd1201b75c5a8f96599d7fb36516687fb 
>   src/launcher/main.cpp e90df85ae433431defdd251490c74a482a58f743 
>   src/messages/messages.proto 815fcbbcb4a8643f50950a294cedf7281b2a187f 
>   src/slave/cgroups_isolation_module.hpp a04fc46b15d2741886f5847cadbdc9bed351c588 
>   src/slave/cgroups_isolation_module.cpp a779de80d13c67e507d7d2ee788fcdaa5e71daca 
>   src/slave/constants.hpp dc5b9877e2608f6806f13b6ef4d133b6fddaece4 
>   src/slave/constants.cpp ecb96fd42afdb74aca394411c6ff60276f509cfd 
>   src/slave/flags.hpp 9cd436cae5772b906b3566eb1f344904d5894214 
>   src/slave/isolation_module.hpp d7cc02b8ea0cde591ee61059efd79fbacaa74769 
>   src/slave/lxc_isolation_module.hpp 6be99038fb547e43e74ef5469f2339bbe82017cf 
>   src/slave/lxc_isolation_module.cpp 1ed3c87f884089e032bba2ed89be109abe6c1c13 
>   src/slave/process_based_isolation_module.hpp 6a4e6aef9ab4b13cce9341f6f35c6b24e6f8fde7 
>   src/slave/process_based_isolation_module.cpp ff98d105af675dfc66070feaa43b42c1aa438fd8 
>   src/slave/slave.hpp 7648c33230c1900eda7529045c5df9ccab105d47 
>   src/slave/slave.cpp c13d268c5e0e107902f64e30304a18128927a571 
>   src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b 
>   src/slave/status_update_manager.hpp PRE-CREATION 
>   src/slave/status_update_manager.cpp PRE-CREATION 
>   src/tests/slave_recovery_tests.cpp PRE-CREATION 
>   src/tests/utils.hpp b648b631d8c11c26e6adfa3f5b16012044557e25 
> 
> Diff: https://reviews.apache.org/r/8762/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request: Slave Restart (Part 9): Recover executors and isolation module

Posted by Vinod Kone <vi...@gmail.com>.

> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/exec/exec.cpp, line 393
> > <https://reviews.apache.org/r/8762/diff/7/?file=263157#file263157line393>
> >
> >     Can we maybe spell this out in the public executor interfaces? Considering how harsh this is, I like the harshness if we have this contract, as people will _definitely_ discover they are doing this.
> >     
> >     One would probably assume the driver will store messages for later delivery if sent pre-registration. That would be a nicer contract than doing this, but I see the added complexity =/

Decided to pass SlaveID through environment. So, no need for enforcing this in the contract. Also, now, we actually buffer status updates.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/exec/exec.cpp, line 539
> > <https://reviews.apache.org/r/8762/diff/7/?file=263157#file263157line539>
> >
> >     Were you going to add slave id to these?
> >     
> >     While you're at it: https://issues.apache.org/jira/browse/MESOS-334
> >     
> >     Or maybe let's not do too many things in this change..

done.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 415-416
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line415>
> >
> >     given the context here, "terminating" seems more appropriate
> >     
> >     you would set terminating to true, and then you call terminate, makes more sense IMO

"terminating" would be confusing, because intuitively it sounds like someone called 'terminate()'.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1115
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1115>
> >
> >     Does this imply the StatusUpdateManager provides idempotent primitives in general?

yes.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1151
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1151>
> >
> >     Kill "If we are here, ", I know the comment is for this spot in the code because the comment was written in this spot for a reason! ;)

i like the emphasis.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1153
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1153>
> >
> >     Isn't the reaper only for the process isolation module?

nope.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1702
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1702>
> >
> >     quotes on executor id but not framework id?

as i explained in an earlier review, this has been the convention, for better or worse.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1753
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1753>
> >
> >     s/by/for/?

for tasks, by isolation module.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1797
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1797>
> >
> >     else?

else, no action needed. 


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/tests/slave_recovery_tests.cpp, line 616
> > <https://reviews.apache.org/r/8762/diff/7/?file=263177#file263177line616>
> >
> >     Line wrap at 70

?


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/cgroups_isolation_module.cpp, line 620
> > <https://reviews.apache.org/r/8762/diff/7/?file=263163#file263163line620>
> >
> >     Can you split off the killed case in terms of the failed message?
> >     
> >     I see unknown as more serious than killed (since killed can happen due to known races).

I don't see the benefit, considering the fact that a killed executor becomes unknown as soon as it gets cleaned up. Also note, that we use the same phrasing across a bunch of functions, across isolation modules.


> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/process_based_isolation_module.cpp, line 329
> > <https://reviews.apache.org/r/8762/diff/7/?file=263171#file263171line329>
> >
> >     ditto on splitting the failure message

see above.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8762/#review17260
-----------------------------------------------------------


On Feb. 27, 2013, 7:59 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8762/
> -----------------------------------------------------------
> 
> (Updated Feb. 27, 2013, 7:59 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> Properly recovers and reconnects with executors.
> 
> Also, recovers isolation module.
> 
> This is pretty much the whole of recovery! (I'm going to send out another tiny review for properly doing incompatible upgrades)
> 
> 
> Diffs
> -----
> 
>   src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24 
>   src/launcher/launcher.hpp ead58c0f60b6bff532f241a224b547b25e292175 
>   src/launcher/launcher.cpp f2d68c5fd1201b75c5a8f96599d7fb36516687fb 
>   src/launcher/main.cpp e90df85ae433431defdd251490c74a482a58f743 
>   src/messages/messages.proto 815fcbbcb4a8643f50950a294cedf7281b2a187f 
>   src/slave/cgroups_isolation_module.hpp a04fc46b15d2741886f5847cadbdc9bed351c588 
>   src/slave/cgroups_isolation_module.cpp a779de80d13c67e507d7d2ee788fcdaa5e71daca 
>   src/slave/constants.hpp 08db75f3e5af0ae79298546386deddf00beb8de9 
>   src/slave/constants.cpp 159dd12ec1e6b5f33fd75fc3a66d8ac897fc40b7 
>   src/slave/flags.hpp 9cd436cae5772b906b3566eb1f344904d5894214 
>   src/slave/isolation_module.hpp d7cc02b8ea0cde591ee61059efd79fbacaa74769 
>   src/slave/lxc_isolation_module.hpp 6be99038fb547e43e74ef5469f2339bbe82017cf 
>   src/slave/lxc_isolation_module.cpp 1ed3c87f884089e032bba2ed89be109abe6c1c13 
>   src/slave/process_based_isolation_module.hpp 6a4e6aef9ab4b13cce9341f6f35c6b24e6f8fde7 
>   src/slave/process_based_isolation_module.cpp ff98d105af675dfc66070feaa43b42c1aa438fd8 
>   src/slave/reaper.cpp c0ee4b4c07fd792bcb39455b666808b712eb32c2 
>   src/slave/slave.hpp 7648c33230c1900eda7529045c5df9ccab105d47 
>   src/slave/slave.cpp 8c2e1bfc363491c681177676f9dfe5f229276f7d 
>   src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b 
>   src/slave/status_update_manager.cpp PRE-CREATION 
>   src/tests/slave_recovery_tests.cpp PRE-CREATION 
>   src/tests/utils.hpp 4600c2f6e0ceabd6a38f8bb762da314b795c23a0 
> 
> Diff: https://reviews.apache.org/r/8762/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>