Posted to dev@mesos.apache.org by Ben Mahler <be...@gmail.com> on 2013/12/13 05:28:33 UTC

Review Request 16238: Updated the slave to not recover after a reboot.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16238/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Vinod Kone.


Bugs: MESOS-844
    https://issues.apache.org/jira/browse/MESOS-844


Repository: mesos-git


Description
-------

See MESOS-844. This uses the boot ID to check whether the underlying machine has rebooted. If it has, we do not want to attempt recovery, because the PIDs are stale and machine information may have changed.
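
For readers who want a concrete picture of the check described above, here is a minimal sketch. It is not the code in this diff: the helper names (readLine, shouldRecover) and the checkpoint file layout are hypothetical, and treating a missing checkpoint as "do not recover" is an assumption made only for this illustration.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads the first line of `path` into `*out`.
// Returns false if the file cannot be opened or is empty.
bool readLine(const std::string& path, std::string* out)
{
  std::ifstream in(path.c_str());
  if (!in) {
    return false;
  }
  return static_cast<bool>(std::getline(in, *out));
}

// Returns true only if a boot ID was checkpointed and it matches the boot
// ID of the running machine. A mismatch means the machine rebooted, so any
// checkpointed PIDs are stale and recovery should be skipped.
// Assumption of this sketch: a missing checkpoint also skips recovery.
bool shouldRecover(
    const std::string& checkpointPath,
    const std::string& currentBootId)
{
  std::string savedBootId;
  if (!readLine(checkpointPath, &savedBootId)) {
    return false;
  }
  return savedBootId == currentBootId;
}
```

A caller would obtain the current boot ID from the operating system, call shouldRecover() with the path of the checkpointed ID, and start from a clean state when it returns false.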


Diffs
-----

  src/slave/paths.hpp 8f80b931a896cd7e317658147e864410558d5028 
  src/slave/slave.cpp 6e6107e3551f29566ff233dad47dac8c29b4fab5 
  src/slave/state.cpp bf267b5ea23a7427c7cb05ab4e98af2e44d9cc43 
  src/tests/slave_recovery_tests.cpp 250083d380bf6d0bdba09abd239a9139a402008f 

Diff: https://reviews.apache.org/r/16238/diff/


Testing
-------

Added an integration test; ran make check.

I still need to clean up the boot time code to handle multiple slaves; the MultipleSlaves test currently fails:
[ RUN      ] SlaveRecoveryProcessIsolatorTest/0.MultipleSlaves
F1212 20:26:35.767004 273121280 state.cpp:69] CHECK_SOME(id): No boot time found
*** Check failure stack trace: ***
    @        0x10f35ea18  google::LogMessage::Flush()
    @        0x10f3600d1  google::LogMessageFatal::~LogMessageFatal()
    @        0x10e410a36  _CheckSome::~_CheckSome()
    @        0x10f0eb6a1  mesos::internal::slave::state::recover()
    @        0x10f13924f  process::AsyncExecutorProcess::execute<>()
    @        0x10f11dd4f  std::tr1::_Mem_fn<>::operator()()
    @        0x10f11ddcb  std::tr1::_Bind<>::operator()<>()
    @        0x10f11de1c  std::tr1::_Function_handler<>::_M_invoke()
    @        0x10f138fd3  process::internal::rdispatcher<>()
    @        0x10f11fe8b  std::tr1::_Bind<>::operator()<>()
    @        0x10f11ffc8  std::tr1::_Function_handler<>::_M_invoke()
Re-registered executor on tw-172-25-24-157.office.twttr.net
    @        0x10f2850d0  process::ProcessManager::resume()
    @        0x10f285ad8  process::schedule()
    @     0x7fff8bd05772  _pthread_start
    @     0x7fff8bcf21a1  thread_start


Thanks,

Ben Mahler


Re: Review Request 16238: Updated the slave to not recover after a reboot.

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16238/#review30712
-----------------------------------------------------------

Ship it!



src/tests/slave_recovery_tests.cpp
<https://reviews.apache.org/r/16238/#comment58809>

    Indentation?


- Benjamin Hindman


On Dec. 18, 2013, 2:16 a.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16238/
> -----------------------------------------------------------
> 
> (Updated Dec. 18, 2013, 2:16 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-844
>     https://issues.apache.org/jira/browse/MESOS-844
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> See MESOS-844. This uses the boot ID to check whether the underlying machine has rebooted. If it has, we do not want to attempt recovery, because the PIDs are stale and machine information may have changed.
> 
> 
> Diffs
> -----
> 
>   src/slave/paths.hpp 8f80b931a896cd7e317658147e864410558d5028 
>   src/slave/slave.cpp 6e6107e3551f29566ff233dad47dac8c29b4fab5 
>   src/slave/state.cpp bf267b5ea23a7427c7cb05ab4e98af2e44d9cc43 
>   src/tests/slave_recovery_tests.cpp 250083d380bf6d0bdba09abd239a9139a402008f 
> 
> Diff: https://reviews.apache.org/r/16238/diff/
> 
> 
> Testing
> -------
> 
> Added an integration test; ran make check.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>


Re: Review Request 16238: Updated the slave to not recover after a reboot.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16238/
-----------------------------------------------------------

(Updated Dec. 19, 2013, 11:43 p.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

Used 'ID' in comments and removed the .txt file extension.


Bugs: MESOS-844
    https://issues.apache.org/jira/browse/MESOS-844


Repository: mesos-git


Description
-------

See MESOS-844. This uses the boot ID to check whether the underlying machine has rebooted. If it has, we do not want to attempt recovery, because the PIDs are stale and machine information may have changed.


Diffs (updated)
-----

  src/slave/paths.hpp 8f80b931a896cd7e317658147e864410558d5028 
  src/slave/slave.cpp 6e6107e3551f29566ff233dad47dac8c29b4fab5 
  src/slave/state.cpp bf267b5ea23a7427c7cb05ab4e98af2e44d9cc43 
  src/tests/slave_recovery_tests.cpp 250083d380bf6d0bdba09abd239a9139a402008f 

Diff: https://reviews.apache.org/r/16238/diff/


Testing
-------

Added an integration test; ran make check.


Thanks,

Ben Mahler


Re: Review Request 16238: Updated the slave to not recover after a reboot.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16238/
-----------------------------------------------------------

(Updated Dec. 18, 2013, 2:16 a.m.)


Review request for mesos, Benjamin Hindman and Vinod Kone.


Changes
-------

This now works correctly on OS X because sysctl is used instead of utmpx.
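
As an illustration of what a sysctl-based lookup can look like, here is a hedged sketch. It is not the code in this diff: the function name currentBootId is hypothetical, and using the kern.boottime seconds as the identifier on OS X and /proc/sys/kernel/random/boot_id on Linux are assumptions of this example.

```cpp
#include <fstream>
#include <sstream>
#include <string>

#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/time.h>
#endif

// Hypothetical helper: returns a string that stays the same while the
// machine is up and changes after a reboot, or "" if it cannot be read.
std::string currentBootId()
{
#ifdef __APPLE__
  // OS X: ask the kernel for the boot time via sysctl (kern.boottime).
  int mib[2] = {CTL_KERN, KERN_BOOTTIME};
  struct timeval bootTime;
  size_t length = sizeof(bootTime);
  if (sysctl(mib, 2, &bootTime, &length, NULL, 0) != 0) {
    return "";
  }
  std::ostringstream out;
  out << bootTime.tv_sec;
  return out.str();
#else
  // Linux: the kernel exposes a random UUID generated once per boot.
  std::ifstream in("/proc/sys/kernel/random/boot_id");
  std::string id;
  if (!in || !std::getline(in, id)) {
    return "";
  }
  return id;
#endif
}
```

Either branch yields a value that can be checkpointed at startup and compared on the next start.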


Bugs: MESOS-844
    https://issues.apache.org/jira/browse/MESOS-844


Repository: mesos-git


Description
-------

See MESOS-844. This uses the boot ID to check whether the underlying machine has rebooted. If it has, we do not want to attempt recovery, because the PIDs are stale and machine information may have changed.


Diffs
-----

  src/slave/paths.hpp 8f80b931a896cd7e317658147e864410558d5028 
  src/slave/slave.cpp 6e6107e3551f29566ff233dad47dac8c29b4fab5 
  src/slave/state.cpp bf267b5ea23a7427c7cb05ab4e98af2e44d9cc43 
  src/tests/slave_recovery_tests.cpp 250083d380bf6d0bdba09abd239a9139a402008f 

Diff: https://reviews.apache.org/r/16238/diff/


Testing (updated)
-------

Added an integration test; ran make check.


Thanks,

Ben Mahler