You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/03/01 21:43:49 UTC
Re: Review Request: Slave Restart (Part 9): Recover executors and isolation
module
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8762/#review17260
-----------------------------------------------------------
Ship it!
Great tests, pushes me to write better tests for my components! (usage() on the isolation modules in untested from my monitoring changes =/)
Also, I thought you were going to pass in the SlaveID through the environment?
src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36696>
Can we maybe spell this out in the public executor interfaces? Considering how harsh this is, I like the harshness if we have this contract, as people will _definitely_ discover they are doing this.
One would probably assume the driver will store messages for later delivery if sent pre-registration. That would be a nicer contract than doing this, but I see the added complexity =/
src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36697>
ditto
src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36693>
s/those//
s/haven't/have not
src/exec/exec.cpp
<https://reviews.apache.org/r/8762/#comment36699>
Were you going to add slave id to these?
While you're at it: https://issues.apache.org/jira/browse/MESOS-334
Or maybe let's not do too many things in this change..
src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36702>
Can you split off the killed case in terms of the failed message?
I see unknown as more serious than killed (since killed can happen due to known races).
src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36703>
No need for a comment on this in the .cpp file since it's in the header, right?
src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36704>
would be awful nice to pass that option through directly ;)
src/slave/flags.hpp
<https://reviews.apache.org/r/8762/#comment36705>
great
src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/8762/#comment36708>
ditto on splitting the failure message
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36713>
given the context here, "terminating" seems more appropriate
you would set terminating to true, and then you call terminate, makes more sense IMO
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36714>
empty()?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36716>
Does this imply the StatusUpdateManager provides idempotent primitives in general?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36717>
Kill "If we are here, ", I know the comment is for this spot in the code because the comment was written in this spot for a reason! ;)
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36718>
comma after exited
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36719>
Isn't the reaper only for the process isolation module?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36720>
quotes on executor id but not framework id?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36722>
s/recovering/recovery of/
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36723>
ditto
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36725>
s/framework/the framework entry/
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36727>
s/Create executor/Create the executor entry/
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36728>
kill newline?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36724>
ditto
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36729>
s/,//
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36730>
s/by/for/?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36732>
else?
src/slave/slave.cpp
<https://reviews.apache.org/r/8762/#comment36733>
s/Setup the monitoring./Begin monitoring the executor./
src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/8762/#comment36734>
Line wrap at 70
- Ben Mahler
On Feb. 27, 2013, 7:59 p.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8762/
> -----------------------------------------------------------
>
> (Updated Feb. 27, 2013, 7:59 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Description
> -------
>
> Properly recovers and reconnects with executors.
>
> Also, recovers isolation module.
>
> This is pretty much the whole of recovery! (I'm going to send out another tiny review for properly doing incompatible upgrades)
>
>
> Diffs
> -----
>
> src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24
> src/launcher/launcher.hpp ead58c0f60b6bff532f241a224b547b25e292175
> src/launcher/launcher.cpp f2d68c5fd1201b75c5a8f96599d7fb36516687fb
> src/launcher/main.cpp e90df85ae433431defdd251490c74a482a58f743
> src/messages/messages.proto 815fcbbcb4a8643f50950a294cedf7281b2a187f
> src/slave/cgroups_isolation_module.hpp a04fc46b15d2741886f5847cadbdc9bed351c588
> src/slave/cgroups_isolation_module.cpp a779de80d13c67e507d7d2ee788fcdaa5e71daca
> src/slave/constants.hpp 08db75f3e5af0ae79298546386deddf00beb8de9
> src/slave/constants.cpp 159dd12ec1e6b5f33fd75fc3a66d8ac897fc40b7
> src/slave/flags.hpp 9cd436cae5772b906b3566eb1f344904d5894214
> src/slave/isolation_module.hpp d7cc02b8ea0cde591ee61059efd79fbacaa74769
> src/slave/lxc_isolation_module.hpp 6be99038fb547e43e74ef5469f2339bbe82017cf
> src/slave/lxc_isolation_module.cpp 1ed3c87f884089e032bba2ed89be109abe6c1c13
> src/slave/process_based_isolation_module.hpp 6a4e6aef9ab4b13cce9341f6f35c6b24e6f8fde7
> src/slave/process_based_isolation_module.cpp ff98d105af675dfc66070feaa43b42c1aa438fd8
> src/slave/reaper.cpp c0ee4b4c07fd792bcb39455b666808b712eb32c2
> src/slave/slave.hpp 7648c33230c1900eda7529045c5df9ccab105d47
> src/slave/slave.cpp 8c2e1bfc363491c681177676f9dfe5f229276f7d
> src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b
> src/slave/status_update_manager.cpp PRE-CREATION
> src/tests/slave_recovery_tests.cpp PRE-CREATION
> src/tests/utils.hpp 4600c2f6e0ceabd6a38f8bb762da314b795c23a0
>
> Diff: https://reviews.apache.org/r/8762/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>
Re: Review Request: Slave Restart (Part 9): Recover executors and isolation
module
Posted by Ben Mahler <be...@gmail.com>.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/cgroups_isolation_module.cpp, line 620
> > <https://reviews.apache.org/r/8762/diff/7/?file=263163#file263163line620>
> >
> > Can you split off the killed case in terms of the failed message?
> >
> > I see unknown as more serious than killed (since killed can happen due to known races).
>
> Vinod Kone wrote:
> I don't see the benefit, considering the fact that a killed executor becomes unknown as soon as it gets cleaned up. Also note, that we use the same phrasing across a bunch of functions, across isolation modules.
That's no reason to continue the bad habit! ;)
This would be useful for forensics IMO
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/process_based_isolation_module.cpp, line 329
> > <https://reviews.apache.org/r/8762/diff/7/?file=263171#file263171line329>
> >
> > ditto on splitting the failure message
>
> Vinod Kone wrote:
> see above.
Why wouldn't this be useful for forensics?
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 415-416
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line415>
> >
> > given the context here, "terminating" seems more appropriate
> >
> > you would set terminating to true, and then you call terminate, makes more sense IMO
>
> Vinod Kone wrote:
> "terminating" would be confusing, because intuitively it sounds like someone called 'terminate()'.
But you're immediately calling terminate(), so what you're saying it sounds like is what we want ;)
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1151
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1151>
> >
> > Kill "If we are here, ", I know the comment is for this spot in the code because the comment was written in this spot for a reason! ;)
>
> Vinod Kone wrote:
> i like the emphasis.
Can you explain why? It's unnecessary less succinct. I feel like if we apply it here, there's nothing to stop us from applying it everywhere?
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1702
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1702>
> >
> > quotes on executor id but not framework id?
>
> Vinod Kone wrote:
> as i explained in an earlier review, this has been the convention, for better or worse.
fair enough, I guess we'll start killing them when we sanitize executor ids
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1733
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1733>
> >
> > s/Create executor/Create the executor entry/
Why did you drop? I think either correct it or kill the comment? Currently it's not a sentence.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1721
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1721>
> >
> > s/framework/the framework entry/
Why did you drop? I think either correct it or kill the comment? Currently it's not a sentence.
- Ben
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8762/#review17260
-----------------------------------------------------------
On March 4, 2013, 1:26 a.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8762/
> -----------------------------------------------------------
>
> (Updated March 4, 2013, 1:26 a.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Description
> -------
>
> Properly recovers and reconnects with executors.
>
> Also, recovers isolation module.
>
> This is pretty much the whole of recovery! (I'm going to send out another tiny review for properly doing incompatible upgrades)
>
>
> Diffs
> -----
>
> src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24
> src/launcher/launcher.hpp ead58c0f60b6bff532f241a224b547b25e292175
> src/launcher/launcher.cpp f2d68c5fd1201b75c5a8f96599d7fb36516687fb
> src/launcher/main.cpp e90df85ae433431defdd251490c74a482a58f743
> src/messages/messages.proto 815fcbbcb4a8643f50950a294cedf7281b2a187f
> src/slave/cgroups_isolation_module.hpp a04fc46b15d2741886f5847cadbdc9bed351c588
> src/slave/cgroups_isolation_module.cpp a779de80d13c67e507d7d2ee788fcdaa5e71daca
> src/slave/constants.hpp dc5b9877e2608f6806f13b6ef4d133b6fddaece4
> src/slave/constants.cpp ecb96fd42afdb74aca394411c6ff60276f509cfd
> src/slave/flags.hpp 9cd436cae5772b906b3566eb1f344904d5894214
> src/slave/isolation_module.hpp d7cc02b8ea0cde591ee61059efd79fbacaa74769
> src/slave/lxc_isolation_module.hpp 6be99038fb547e43e74ef5469f2339bbe82017cf
> src/slave/lxc_isolation_module.cpp 1ed3c87f884089e032bba2ed89be109abe6c1c13
> src/slave/process_based_isolation_module.hpp 6a4e6aef9ab4b13cce9341f6f35c6b24e6f8fde7
> src/slave/process_based_isolation_module.cpp ff98d105af675dfc66070feaa43b42c1aa438fd8
> src/slave/slave.hpp 7648c33230c1900eda7529045c5df9ccab105d47
> src/slave/slave.cpp c13d268c5e0e107902f64e30304a18128927a571
> src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b
> src/slave/status_update_manager.hpp PRE-CREATION
> src/slave/status_update_manager.cpp PRE-CREATION
> src/tests/slave_recovery_tests.cpp PRE-CREATION
> src/tests/utils.hpp b648b631d8c11c26e6adfa3f5b16012044557e25
>
> Diff: https://reviews.apache.org/r/8762/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>
Re: Review Request: Slave Restart (Part 9): Recover executors and isolation
module
Posted by Vinod Kone <vi...@gmail.com>.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/exec/exec.cpp, line 393
> > <https://reviews.apache.org/r/8762/diff/7/?file=263157#file263157line393>
> >
> > Can we maybe spell this out in the public executor interfaces? Considering how harsh this is, I like the harshness if we have this contract, as people will _definitely_ discover they are doing this.
> >
> > One would probably assume the driver will store messages for later delivery if sent pre-registration. That would be a nicer contract than doing this, but I see the added complexity =/
Decided to pass SlaveID through environment. So, no need for enforcing this in the contract. Also, now, we actually buffer status updates.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/exec/exec.cpp, line 539
> > <https://reviews.apache.org/r/8762/diff/7/?file=263157#file263157line539>
> >
> > Were you going to add slave id to these?
> >
> > While you're at it: https://issues.apache.org/jira/browse/MESOS-334
> >
> > Or maybe let's not do too many things in this change..
done.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 415-416
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line415>
> >
> > given the context here, "terminating" seems more appropriate
> >
> > you would set terminating to true, and then you call terminate, makes more sense IMO
"terminating" would be confusing, because intuitively it sounds like someone called 'terminate()'.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1115
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1115>
> >
> > Does this imply the StatusUpdateManager provides idempotent primitives in general?
yes.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1151
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1151>
> >
> > Kill "If we are here, ", I know the comment is for this spot in the code because the comment was written in this spot for a reason! ;)
i like the emphasis.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1153
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1153>
> >
> > Isn't the reaper only for the process isolation module?
nope.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1702
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1702>
> >
> > quotes on executor id but not framework id?
as i explained in an earlier review, this has been the convention, for better or worse.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1753
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1753>
> >
> > s/by/for/?
for tasks, by isolation module.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 1797
> > <https://reviews.apache.org/r/8762/diff/7/?file=263174#file263174line1797>
> >
> > else?
else, no action needed.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/tests/slave_recovery_tests.cpp, line 616
> > <https://reviews.apache.org/r/8762/diff/7/?file=263177#file263177line616>
> >
> > Line wrap at 70
?
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/cgroups_isolation_module.cpp, line 620
> > <https://reviews.apache.org/r/8762/diff/7/?file=263163#file263163line620>
> >
> > Can you split off the killed case in terms of the failed message?
> >
> > I see unknown as more serious than killed (since killed can happen due to known races).
I don't see the benefit, considering the fact that a killed executor becomes unknown as soon as it gets cleaned up. Also note, that we use the same phrasing across a bunch of functions, across isolation modules.
> On March 1, 2013, 8:43 p.m., Ben Mahler wrote:
> > src/slave/process_based_isolation_module.cpp, line 329
> > <https://reviews.apache.org/r/8762/diff/7/?file=263171#file263171line329>
> >
> > ditto on splitting the failure message
see above.
- Vinod
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8762/#review17260
-----------------------------------------------------------
On Feb. 27, 2013, 7:59 p.m., Vinod Kone wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8762/
> -----------------------------------------------------------
>
> (Updated Feb. 27, 2013, 7:59 p.m.)
>
>
> Review request for mesos, Benjamin Hindman and Ben Mahler.
>
>
> Description
> -------
>
> Properly recovers and reconnects with executors.
>
> Also, recovers isolation module.
>
> This is pretty much the whole of recovery! (I'm going to send out another tiny review for properly doing incompatible upgrades)
>
>
> Diffs
> -----
>
> src/exec/exec.cpp 821a94fab1f5969183ecf9e28d7b6bc10920db24
> src/launcher/launcher.hpp ead58c0f60b6bff532f241a224b547b25e292175
> src/launcher/launcher.cpp f2d68c5fd1201b75c5a8f96599d7fb36516687fb
> src/launcher/main.cpp e90df85ae433431defdd251490c74a482a58f743
> src/messages/messages.proto 815fcbbcb4a8643f50950a294cedf7281b2a187f
> src/slave/cgroups_isolation_module.hpp a04fc46b15d2741886f5847cadbdc9bed351c588
> src/slave/cgroups_isolation_module.cpp a779de80d13c67e507d7d2ee788fcdaa5e71daca
> src/slave/constants.hpp 08db75f3e5af0ae79298546386deddf00beb8de9
> src/slave/constants.cpp 159dd12ec1e6b5f33fd75fc3a66d8ac897fc40b7
> src/slave/flags.hpp 9cd436cae5772b906b3566eb1f344904d5894214
> src/slave/isolation_module.hpp d7cc02b8ea0cde591ee61059efd79fbacaa74769
> src/slave/lxc_isolation_module.hpp 6be99038fb547e43e74ef5469f2339bbe82017cf
> src/slave/lxc_isolation_module.cpp 1ed3c87f884089e032bba2ed89be109abe6c1c13
> src/slave/process_based_isolation_module.hpp 6a4e6aef9ab4b13cce9341f6f35c6b24e6f8fde7
> src/slave/process_based_isolation_module.cpp ff98d105af675dfc66070feaa43b42c1aa438fd8
> src/slave/reaper.cpp c0ee4b4c07fd792bcb39455b666808b712eb32c2
> src/slave/slave.hpp 7648c33230c1900eda7529045c5df9ccab105d47
> src/slave/slave.cpp 8c2e1bfc363491c681177676f9dfe5f229276f7d
> src/slave/state.cpp 2688567719c5b12f1d8a56fa6193c11d4077685b
> src/slave/status_update_manager.cpp PRE-CREATION
> src/tests/slave_recovery_tests.cpp PRE-CREATION
> src/tests/utils.hpp 4600c2f6e0ceabd6a38f8bb762da314b795c23a0
>
> Diff: https://reviews.apache.org/r/8762/diff/
>
>
> Testing
> -------
>
> make check
>
>
> Thanks,
>
> Vinod Kone
>
>