You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2013/08/03 23:35:44 UTC

Review Request 13261: Clarified the guidance for users when slave recovery fails.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.


Bugs: MESOS-613
    https://issues.apache.org/jira/browse/MESOS-613


Repository: mesos-git


Description
-------

See summary.


Diffs
-----

  src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 

Diff: https://reviews.apache.org/r/13261/diff/


Testing
-------

make check


Thanks,

Vinod Kone


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/
-----------------------------------------------------------

(Updated Aug. 6, 2013, 5:54 p.m.)


Review request for mesos, Benjamin Hindman and Ben Mahler.


Changes
-------

rebased. nnfr.


Bugs: MESOS-613
    https://issues.apache.org/jira/browse/MESOS-613


Repository: mesos-git


Description
-------

See summary.


Diffs (updated)
-----

  src/slave/slave.cpp 9cd7754b647dde21267f1990edb7d4e1425beacd 

Diff: https://reviews.apache.org/r/13261/diff/


Testing
-------

make check


Thanks,

Vinod Kone


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Brenden Matthews <br...@diddyinc.com>.

> On Aug. 5, 2013, 8:57 p.m., Brenden Matthews wrote:
> > Is there not a better way to handle this process in an automated fashion?  It seems to require user intervention if the slave gets in to a bad state.
> 
> Vinod Kone wrote:
>     Thats a good point. The reason we wanted explicit intervention was to help us diagnose/fix issues with slave recovery easily. Once we deem slave recovery stable we could probably automate some of these decisions (maybe via a flag). Thoughts?
> 
> Brenden Matthews wrote:
>     This sounds reasonable.  In production there will be cases where it will fail to recover, and the slave should take a reasonable course of action to return to an operable state.
>     
>     Perhaps --recovery_failure_action={continue,abort} ? Can we not also just use --strict?  As in, if --no-strict is set, it should continue as far as possible.
> 
> Vinod Kone wrote:
>     I would like to wait on adding a flag/option for auto recovery after we get some data from testing. Let me know if its causing enough of a pain for you guys. And, hopefully you're not running it in production already?

Just running it in staging at the moment.  We'd like to have it, but there isn't an immediate concern.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24673
-----------------------------------------------------------


On Aug. 6, 2013, 5:54 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2013, 5:54 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 9cd7754b647dde21267f1990edb7d4e1425beacd 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Brenden Matthews <br...@diddyinc.com>.

> On Aug. 5, 2013, 8:57 p.m., Brenden Matthews wrote:
> > Is there not a better way to handle this process in an automated fashion?  It seems to require user intervention if the slave gets in to a bad state.
> 
> Vinod Kone wrote:
>     Thats a good point. The reason we wanted explicit intervention was to help us diagnose/fix issues with slave recovery easily. Once we deem slave recovery stable we could probably automate some of these decisions (maybe via a flag). Thoughts?

This sounds reasonable.  In production there will be cases where it will fail to recover, and the slave should take a reasonable course of action to return to an operable state.

Perhaps --recovery_failure_action={continue,abort} ? Can we not also just use --strict?  As in, if --no-strict is set, it should continue as far as possible.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24673
-----------------------------------------------------------


On Aug. 3, 2013, 9:35 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2013, 9:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Vinod Kone <vi...@gmail.com>.

> On Aug. 5, 2013, 8:57 p.m., Brenden Matthews wrote:
> > Is there not a better way to handle this process in an automated fashion?  It seems to require user intervention if the slave gets in to a bad state.

Thats a good point. The reason we wanted explicit intervention was to help us diagnose/fix issues with slave recovery easily. Once we deem slave recovery stable we could probably automate some of these decisions (maybe via a flag). Thoughts?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24673
-----------------------------------------------------------


On Aug. 3, 2013, 9:35 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2013, 9:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Vinod Kone <vi...@gmail.com>.

> On Aug. 5, 2013, 8:57 p.m., Brenden Matthews wrote:
> > Is there not a better way to handle this process in an automated fashion?  It seems to require user intervention if the slave gets in to a bad state.
> 
> Vinod Kone wrote:
>     Thats a good point. The reason we wanted explicit intervention was to help us diagnose/fix issues with slave recovery easily. Once we deem slave recovery stable we could probably automate some of these decisions (maybe via a flag). Thoughts?
> 
> Brenden Matthews wrote:
>     This sounds reasonable.  In production there will be cases where it will fail to recover, and the slave should take a reasonable course of action to return to an operable state.
>     
>     Perhaps --recovery_failure_action={continue,abort} ? Can we not also just use --strict?  As in, if --no-strict is set, it should continue as far as possible.

I would like to wait on adding a flag/option for auto recovery after we get some data from testing. Let me know if its causing enough of a pain for you guys. And, hopefully you're not running it in production already?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24673
-----------------------------------------------------------


On Aug. 3, 2013, 9:35 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2013, 9:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Brenden Matthews <br...@diddyinc.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24673
-----------------------------------------------------------


Is there not a better way to handle this process in an automated fashion?  It seems to require user intervention if the slave gets in to a bad state.

- Brenden Matthews


On Aug. 3, 2013, 9:35 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2013, 9:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Vinod Kone <vi...@gmail.com>.

> On Aug. 5, 2013, 6:56 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 398
> > <https://reviews.apache.org/r/13261/diff/1/?file=336047#file336047line398>
> >
> >     s/f/rf/?

it's a symlink.


> On Aug. 5, 2013, 6:56 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, lines 2563-2565
> > <https://reviews.apache.org/r/13261/diff/1/?file=336047#file336047line2563>
> >
> >     Even with --strict, they can just rm -rf the latest slave path, so it seems like 'Restart the slave...' is an additional option when --strict is on.
> >     
> >     s/rm/rm -rf/ ?

I would like users to try --no-strict first before they go down the path of no recovery, because --no-strict recovers as much as possible and tries to do the right thing imo.


> On Aug. 5, 2013, 6:56 p.m., Ben Mahler wrote:
> > src/slave/slave.cpp, line 396
> > <https://reviews.apache.org/r/13261/diff/1/?file=336047#file336047line396>
> >
> >     What does discarded mean in this case? Is it still a failure to recover? Perhaps it needs a separate message that the recovery process was terminated, or?

discarded happens if isolator/status update manager discards recover() for whatever reason. not sure if we want a separate message, because we still want them to take the same action.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24666
-----------------------------------------------------------


On Aug. 3, 2013, 9:35 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2013, 9:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>


Re: Review Request 13261: Clarified the guidance for users when slave recovery fails.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13261/#review24666
-----------------------------------------------------------

Ship it!



src/slave/slave.cpp
<https://reviews.apache.org/r/13261/#comment48774>

    What does discarded mean in this case? Is it still a failure to recover? Perhaps it needs a separate message that the recovery process was terminated, or?



src/slave/slave.cpp
<https://reviews.apache.org/r/13261/#comment48775>

    s/f/rf/?



src/slave/slave.cpp
<https://reviews.apache.org/r/13261/#comment48776>

    Even with --strict, they can just rm -rf the latest slave path, so it seems like 'Restart the slave...' is an additional option when --strict is on.
    
    s/rm/rm -rf/ ?


- Ben Mahler


On Aug. 3, 2013, 9:35 p.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13261/
> -----------------------------------------------------------
> 
> (Updated Aug. 3, 2013, 9:35 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Ben Mahler, and Brenden Matthews.
> 
> 
> Bugs: MESOS-613
>     https://issues.apache.org/jira/browse/MESOS-613
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 7f6e6b456890db438092f19a22e4dd816bb33d04 
> 
> Diff: https://reviews.apache.org/r/13261/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>