Posted to dev@mesos.apache.org by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org> on 2012/05/01 04:38:49 UTC

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265610#comment-13265610 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------



bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > I've only gotten halfway through ... but there is a bunch here already. I'd like to break this up into at least four patches. (1) The utils stuff that was added. (2) The master changes. (3) The slave::path namespace stuff. (4) The status update manager API + implementation (but not the slave using it yet). And (5) the slave using each of these components, and the executor changes that are included.
bq.  > 
bq.  > These comments are across all of those patches, but I'll make future passes on each of those components.
bq.  
bq.  Vinod Kone wrote:
bq.      addressed the comments for the utils.hpp part. Will send a review for utils.hpp and protobuf_utils.hpp (forgot to include it in this review) shortly.

patch coming for path refactoring shortly.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, line 58
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line58>
bq.  >
bq.  >     s/follows/follows:

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, line 66
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line66>
bq.  >
bq.  >     What is the framework PID? How is that different than the Executor PID mentioned below?

We need to store the framework PID because, when an executor re-registers, we need its framework's PID to re-create a Framework() object.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.cpp, line 218
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103042#file103042line218>
bq.  >
bq.  >     Isn't the default work_dir mentioned in the addOption above /tmp/mesos? If so, "/tmp" should be "/tmp/mesos" here, and we probably don't want "/tmp/mesos/mesos". Just clean this up so people understand expectations (i.e., should work_dir be a path including the "mesos" directory, or will we create that ourselves).

I agree it's a bit confusing. The reason I did it this way is to avoid dumping the work/meta directories directly under '/tmp' (e.g., when a user specifies work_dir=/tmp).

I'm indifferent on how we want to do this. I will revert to 'workRootDir = conf.get("work_dir", "/tmp/mesos") + "/work"' for now.
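
For reference, a minimal sketch of what that expression evaluates to. The Conf typedef and the getOr helper are stand-ins I made up for illustration (a plain map instead of the real configuration class); only the "work_dir" key, the "/tmp/mesos" default and the "/work" suffix come from the line above.

```cpp
#include <map>
#include <string>

// Stand-in for the slave configuration (assumption: a plain string map).
typedef std::map<std::string, std::string> Conf;

// Hypothetical helper: returns conf[key] if set, otherwise defaultValue.
inline std::string getOr(
    const Conf& conf,
    const std::string& key,
    const std::string& defaultValue)
{
  Conf::const_iterator it = conf.find(key);
  return it != conf.end() ? it->second : defaultValue;
}

// Mirrors: conf.get("work_dir", "/tmp/mesos") + "/work"
inline std::string workRootDir(const Conf& conf)
{
  return getOr(conf, "work_dir", "/tmp/mesos") + "/work";
}
```

With no work_dir set this gives /tmp/mesos/work. With work_dir=/tmp it gives /tmp/work, which is the case I was trying to avoid with the extra "mesos" component.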


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, line 56
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line56>
bq.  >
bq.  >     I'd like to stick all of this stuff in its own file and commit this on its own (with integration in Slave as applicable and tests).

refactored out path stuff into slave/path.hpp 


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, lines 344-345
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line344>
bq.  >
bq.  >     What's the difference between "work" and "work root"? Or "meta" and "meta root"?

workRootDir denotes the root directory (conf['work_dir']/mesos/work) under which the work directories of slaves are stored. It is not exactly a slave's working directory (that would be workRootDir/slaves/slaveId).

I needed workRootDir for the path layout stuff.

The same goes for metaRootDir.
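
A rough sketch of the root vs. per-slave distinction. The function names are hypothetical and are not the ones in the patch; the workRootDir/slaves/slaveId shape is from the description above, and the meta layout is assumed to mirror it.

```cpp
#include <string>

// Hypothetical helpers; names are illustrative only.

// A single slave's working directory: workRootDir/slaves/slaveId.
inline std::string getSlaveWorkDir(
    const std::string& workRootDir,
    const std::string& slaveId)
{
  return workRootDir + "/slaves/" + slaveId;
}

// Assumption: the meta directory follows the same layout under metaRootDir.
inline std::string getSlaveMetaDir(
    const std::string& metaRootDir,
    const std::string& slaveId)
{
  return metaRootDir + "/slaves/" + slaveId;
}
```

So the root directories are shared across slave IDs, and each slave only ever works under its own slaves/slaveId subdirectory.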


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.cpp, line 2289
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103042#file103042line2289>
bq.  >
bq.  >     Pass these into writeFrameworkPID, no need to make this an instance function. More importantly, you should do this for each of these writers/readers.

moved these write/read functions into path.hpp
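
Roughly, the idea is that each writer/reader takes everything it needs as arguments instead of reaching into Slave state. The sketch below is only an illustration of that shape: the signatures and the use of a plain string for the PID are my assumptions here, not the code in path.hpp.

```cpp
#include <fstream>
#include <string>

// Hypothetical free functions; signatures are illustrative only.
// Everything they need is passed in, so no Slave instance is required.

// Writes the PID string to 'path', truncating any previous contents.
// Returns false on I/O failure.
inline bool writeFrameworkPID(const std::string& path, const std::string& pid)
{
  std::ofstream out(path.c_str(), std::ios::out | std::ios::trunc);
  if (!out.is_open()) {
    return false;
  }
  out << pid;
  out.flush();
  return out.good();
}

// Reads the first line of 'path' into 'pid'.
// Returns false if the file cannot be opened or a read error occurs.
inline bool readFrameworkPID(const std::string& path, std::string* pid)
{
  std::ifstream in(path.c_str());
  if (!in.is_open()) {
    return false;
  }
  std::getline(in, *pid);
  return !in.bad();
}
```

Written this way, the functions can be tested against a temporary file without constructing a slave.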


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/#review7232
-----------------------------------------------------------


On 2012-04-19 16:53:07, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4462/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-19 16:53:07)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Sorry for the huge CL!
bq.  
bq.  Slave restart now supports recovery!
bq.  --> Non-disruptive restart means running tasks are not lost
bq.  --> Re-connects with live executors
bq.  --> Checkpoints and reliably sends status updates
bq.  --> Ability to kill executors if the slave upgrade is incompatible with running executors
bq.  
bq.  
bq.  This addresses bug mesos-110.
bq.      https://issues.apache.org/jira/browse/mesos-110
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/Makefile.am d5edaa2 
bq.    src/common/hashset.hpp 1feb610 
bq.    src/common/utils.hpp 1d81e21 
bq.    src/exec/exec.cpp e8db407 
bq.    src/launcher/launcher.cpp a141b9a 
bq.    src/local/local.hpp 55f9eaf 
bq.    src/local/local.cpp affe432 
bq.    src/master/master.cpp 4dc9ee0 
bq.    src/messages/messages.proto 87e1548 
bq.    src/sched/sched.cpp dcadb10 
bq.    src/scripts/killtree.sh bceae9d 
bq.    src/slave/constants.hpp f0c8679 
bq.    src/slave/http.cpp 19c48a0 
bq.    src/slave/isolation_module.hpp c896908 
bq.    src/slave/lxc_isolation_module.hpp b7beefe 
bq.    src/slave/lxc_isolation_module.cpp 66a2a89 
bq.    src/slave/main.cpp 85cba25 
bq.    src/slave/process_based_isolation_module.hpp f6f9554 
bq.    src/slave/process_based_isolation_module.cpp 2b37d42 
bq.    src/slave/slave.hpp 279bc7b 
bq.    src/slave/slave.cpp 3358ec4 
bq.    src/slave/statusupdates_manager.hpp PRE-CREATION 
bq.    src/slave/statusupdates_manager.cpp PRE-CREATION 
bq.    src/tests/external_tests.cpp d1b20e4 
bq.    src/tests/fault_tolerance_tests.cpp 6772daf 
bq.    src/tests/slave_restart_tests.cpp PRE-CREATION 
bq.    src/tests/utils.hpp e81ec82 
bq.  
bq.  Diff: https://reviews.apache.org/r/4462/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  make check.
bq.  
bq.  Note that only the new test in tests/slave_restart_tests.cpp engages in recovery!
bq.  
bq.  Recovery is disabled for old tests (though they still checkpoint relevant info!)
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that allows a shutdown/restart with a new version of the code.
