You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Niklas Nielsen <ni...@qni.dk> on 2014/04/10 19:02:04 UTC

Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/
-----------------------------------------------------------

Review request for mesos, Ian Downes and Vinod Kone.


Repository: mesos-git


Description
-------

This patch let executor recovery recover runs in the absence of
executor info.  This is needed as new task-info patch will introduce
an intermediate state where the executor info hasn't been check
pointed. In this interim, the slave may fail-over and should be in a
position to clean up orphan containers (as for now, the containerizer
API doesn't provide a way to reconcile the executor info and it is
therefore not possible to recover the containers in this case).


Diffs
-----

  src/slave/slave.cpp cddb241 
  src/slave/state.cpp 21d1fb7 

Diff: https://reviews.apache.org/r/20221/diff/


Testing
-------

make check and tested with task-info patch and new launch test.


Thanks,

Niklas Nielsen


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Vinod Kone <vi...@gmail.com>.

> On April 14, 2014, 5:04 p.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, lines 3116-3118
> > <https://reviews.apache.org/r/20221/diff/2/?file=554602#file554602line3116>
> >
> >     I think what this is saying is:
> >     
> >     If we have a valid run (determined in the codce above) then we're sure to have a checkpointed ExecutorInfo because the ExecutorInfo is checkpointed before we checkpoint any information about a run.
> >     
> >     But is it possible that a run is valid but for whatever reason recovering the ExecutorInfo fails? For example, because the file got corrupted, or by accidentally deleted?
> 
> Niklas Nielsen wrote:
>     If the executor info file gets corrupted or deleted, the check would fail.
>     
>     How about extending the test on entry (that ensures presence of runs and gracefully GC's and abort recovery?) with ... || state.info.isNone() ?
>     The test will be removed in the task info patch anyway as we deal with the missing executor info explicitly there.

i like the second suggestion because hard failing on an executor corruption is bad.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40269
-----------------------------------------------------------


On April 10, 2014, 8:26 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 8:26 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Niklas Nielsen <ni...@qni.dk>.

> On April 14, 2014, 10:04 a.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, lines 3116-3118
> > <https://reviews.apache.org/r/20221/diff/2/?file=554602#file554602line3116>
> >
> >     I think what this is saying is:
> >     
> >     If we have a valid run (determined in the codce above) then we're sure to have a checkpointed ExecutorInfo because the ExecutorInfo is checkpointed before we checkpoint any information about a run.
> >     
> >     But is it possible that a run is valid but for whatever reason recovering the ExecutorInfo fails? For example, because the file got corrupted, or by accidentally deleted?

If the executor info file gets corrupted or deleted, the check would fail.

How about extending the test on entry (that ensures presence of runs and gracefully GC's and abort recovery?) with ... || state.info.isNone() ?
The test will be removed in the task info patch anyway as we deal with the missing executor info explicitly there.


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40269
-----------------------------------------------------------


On April 10, 2014, 1:26 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 1:26 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Niklas Nielsen <ni...@qni.dk>.

> On April 14, 2014, 10:04 a.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, lines 3116-3118
> > <https://reviews.apache.org/r/20221/diff/2/?file=554602#file554602line3116>
> >
> >     I think what this is saying is:
> >     
> >     If we have a valid run (determined in the codce above) then we're sure to have a checkpointed ExecutorInfo because the ExecutorInfo is checkpointed before we checkpoint any information about a run.
> >     
> >     But is it possible that a run is valid but for whatever reason recovering the ExecutorInfo fails? For example, because the file got corrupted, or by accidentally deleted?
> 
> Niklas Nielsen wrote:
>     If the executor info file gets corrupted or deleted, the check would fail.
>     
>     How about extending the test on entry (that ensures presence of runs and gracefully GC's and abort recovery?) with ... || state.info.isNone() ?
>     The test will be removed in the task info patch anyway as we deal with the missing executor info explicitly there.
> 
> Vinod Kone wrote:
>     i like the second suggestion because hard failing on an executor corruption is bad.

Cool - should be changed in latest patch. Can you take a second look?


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40269
-----------------------------------------------------------


On April 15, 2014, 9:51 a.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 15, 2014, 9:51 a.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 19c5f0d 
>   src/slave/state.cpp a2af33c 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40269
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/20221/#comment73234>

    I think what this is saying is:
    
    If we have a valid run (determined in the codce above) then we're sure to have a checkpointed ExecutorInfo because the ExecutorInfo is checkpointed before we checkpoint any information about a run.
    
    But is it possible that a run is valid but for whatever reason recovering the ExecutorInfo fails? For example, because the file got corrupted, or by accidentally deleted?


- Benjamin Hindman


On April 10, 2014, 8:26 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 8:26 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40672
-----------------------------------------------------------

Ship it!


Ship It!

- Vinod Kone


On April 15, 2014, 4:51 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 15, 2014, 4:51 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 19c5f0d 
>   src/slave/state.cpp a2af33c 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Niklas Nielsen <ni...@qni.dk>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/
-----------------------------------------------------------

(Updated April 15, 2014, 9:51 a.m.)


Review request for mesos, Ian Downes and Vinod Kone.


Changes
-------

Changed hard CHECK() to extended test which gracefully garbage collects and exits.


Repository: mesos-git


Description
-------

This patch let executor recovery recover runs in the absence of
executor info.  This is needed as new task-info patch will introduce
an intermediate state where the executor info hasn't been check
pointed. In this interim, the slave may fail-over and should be in a
position to clean up orphan containers (as for now, the containerizer
API doesn't provide a way to reconcile the executor info and it is
therefore not possible to recover the containers in this case).


Diffs (updated)
-----

  src/slave/slave.cpp 19c5f0d 
  src/slave/state.cpp a2af33c 

Diff: https://reviews.apache.org/r/20221/diff/


Testing
-------

make check and tested with task-info patch and new launch test.


Thanks,

Niklas Nielsen


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Niklas Nielsen <ni...@qni.dk>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/
-----------------------------------------------------------

(Updated April 10, 2014, 1:26 p.m.)


Review request for mesos, Ian Downes and Vinod Kone.


Changes
-------

Added comment motivating hard CHECK().


Repository: mesos-git


Description
-------

This patch let executor recovery recover runs in the absence of
executor info.  This is needed as new task-info patch will introduce
an intermediate state where the executor info hasn't been check
pointed. In this interim, the slave may fail-over and should be in a
position to clean up orphan containers (as for now, the containerizer
API doesn't provide a way to reconcile the executor info and it is
therefore not possible to recover the containers in this case).


Diffs (updated)
-----

  src/slave/slave.cpp cddb241 
  src/slave/state.cpp 21d1fb7 

Diff: https://reviews.apache.org/r/20221/diff/


Testing
-------

make check and tested with task-info patch and new launch test.


Thanks,

Niklas Nielsen


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40047
-----------------------------------------------------------


Patch looks great!

Reviews applied: [20221]

All tests passed.

- Mesos ReviewBot


On April 10, 2014, 5:02 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 5:02 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40054
-----------------------------------------------------------

Ship it!


Ship It!

- Vinod Kone


On April 10, 2014, 5:02 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 5:02 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Vinod Kone <vi...@gmail.com>.

> On April 10, 2014, 6:58 p.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, line 3116
> > <https://reviews.apache.org/r/20221/diff/1/?file=554484#file554484line3116>
> >
> >     Can we add a comment that captures the invariant that 'state.info' is Some here? Perhaps Vinod can suggest something here?

+1. I needed some time to convince myself that this works.

The current invariant is that "ExecutorInfo is checkpointed before the run state."


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40060
-----------------------------------------------------------


On April 10, 2014, 5:02 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 5:02 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>


Re: Review Request 20221: Changed executor state recovery to allow run recovery in absence of executor info.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40060
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/20221/#comment72866>

    Can we add a comment that captures the invariant that 'state.info' is Some here? Perhaps Vinod can suggest something here?


- Benjamin Hindman


On April 10, 2014, 5:02 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 5:02 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch let executor recovery recover runs in the absence of
> executor info.  This is needed as new task-info patch will introduce
> an intermediate state where the executor info hasn't been check
> pointed. In this interim, the slave may fail-over and should be in a
> position to clean up orphan containers (as for now, the containerizer
> API doesn't provide a way to reconcile the executor info and it is
> therefore not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check and tested with task-info patch and new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>