You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Rob Benson (Created) (JIRA)" <ji...@apache.org> on 2011/12/21 15:07:30 UTC

[jira] [Created] (MESOS-110) Mesos deploys should not restart tasks

Mesos deploys should not restart tasks
--------------------------------------

                 Key: MESOS-110
                 URL: https://issues.apache.org/jira/browse/MESOS-110
             Project: Mesos
          Issue Type: Improvement
          Components: framework
            Reporter: Rob Benson


Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245731#comment-13245731 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/
-----------------------------------------------------------

(Updated 2012-04-03 20:58:17.982746)


Review request for mesos, Benjamin Hindman and John Sirois.


Changes
-------

merged with trunk


Summary
-------

Sorry for the huge  CL!

Slave restarts now supports recovery!
--> Non-disruptive restart means running tasks are not lost
--> Re-connects with live executors
--> Checkpoints and reliably sends status updates
--> Ability to kill executors if the slave upgrade is incompatible with running executors


This addresses bug mesos-110.
    https://issues.apache.org/jira/browse/mesos-110


Diffs (updated)
-----

  src/Makefile.am d5edaa2 
  src/common/hashset.hpp 1feb610 
  src/common/utils.hpp 1d81e21 
  src/exec/exec.cpp e8db407 
  src/launcher/launcher.cpp a141b9a 
  src/local/local.hpp 55f9eaf 
  src/local/local.cpp affe432 
  src/master/master.cpp 4dc9ee0 
  src/messages/messages.proto 87e1548 
  src/sched/sched.cpp dcadb10 
  src/slave/constants.hpp f0c8679 
  src/slave/isolation_module.hpp c896908 
  src/slave/lxc_isolation_module.hpp b7beefe 
  src/slave/lxc_isolation_module.cpp 66a2a89 
  src/slave/main.cpp 85cba25 
  src/slave/process_based_isolation_module.hpp f6f9554 
  src/slave/process_based_isolation_module.cpp 2b37d42 
  src/slave/slave.hpp 279bc7b 
  src/slave/slave.cpp 3358ec4 
  src/tests/fault_tolerance_tests.cpp 6772daf 
  src/tests/slave_restart_tests.cpp PRE-CREATION 
  src/tests/utils.hpp e81ec82 

Diff: https://reviews.apache.org/r/4462/diff


Testing
-------

make check.

Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!

Recovery is disabled for old tests (though they still checkpoint relevant info!)


Thanks,

Vinod


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263273#comment-13263273 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------



bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > I've only gotten halfway through ... but there is a bunch here already. I'd like to break this up into at least four patches. (1) The utils stuff that was added. (2) The master changes. (3) The slave::path namespace stuff. (3) The status update manager API + implementation (but not the slave using it yet). And (4) the slave using each of these components, and the executor changes that are included.
bq.  > 
bq.  > These comments are across all of those patches, but I'll make future passes on each of those components.

addressed the comments for the utils.hpp part. Will send a review for utils.hpp and protobuf_utils.hpp (forgot to include it in this review) shortly.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 211
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line211>
bq.  >
bq.  >     Add that this is at the current file position of the file descriptor.

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 218
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line218>
bq.  >
bq.  >     How is this helpful? (If this came from my code, it should be removed.)

it did come from your code. killed it.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 259
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line259>
bq.  >
bq.  >     Blah.

fixed format.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 253
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line253>
bq.  >
bq.  >     I'd prefer if this did not seek to the beginning and read the file, but rather read from the current position until the end (and have the comment say as much).

line 263 was a bug. fixed it. this line should make sense now!


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 264
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line264>
bq.  >
bq.  >     This looks like a bug ('offset' as the third argument?).

good catch! fixed.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 267
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line267>
bq.  >
bq.  >     No need for the space here though.

fixed


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 276
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line276>
bq.  >
bq.  >     Again, should be killed (only makes sense in a macro).

killed


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 290
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line290>
bq.  >
bq.  >     You should refactor the protobuf::read and protobuf::write to use these versions of read and write now as well.

didnt get u?


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 329
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line329>
bq.  >
bq.  >     man 3 dirname (and use it please).

aha..done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 532
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line532>
bq.  >
bq.  >     s/file_pattern/pattern

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 534
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line534>
bq.  >
bq.  >     s/result/results

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 536
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line536>
bq.  >
bq.  >     Kill.

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 538
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line538>
bq.  >
bq.  >     Why not return a Try instead?

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 546
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line546>
bq.  >
bq.  >     Why is this a hack?

because we are using exists() function to check isDir() semantics, based on the knowledge that the entry always exists.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 549
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line549>
bq.  >
bq.  >     s/p/result or s/p/path

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/common/utils.hpp, line 555
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103024#file103024line555>
bq.  >
bq.  >     Kill.

done


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/#review7232
-----------------------------------------------------------


On 2012-04-19 16:53:07, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4462/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-19 16:53:07)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Sorry for the huge  CL!
bq.  
bq.  Slave restarts now supports recovery!
bq.  --> Non-disruptive restart means running tasks are not lost
bq.  --> Re-connects with live executors
bq.  --> Checkpoints and reliably sends status updates
bq.  --> Ability to kill executors if the slave upgrade is incompatible with running executors
bq.  
bq.  
bq.  This addresses bug mesos-110.
bq.      https://issues.apache.org/jira/browse/mesos-110
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/Makefile.am d5edaa2 
bq.    src/common/hashset.hpp 1feb610 
bq.    src/common/utils.hpp 1d81e21 
bq.    src/exec/exec.cpp e8db407 
bq.    src/launcher/launcher.cpp a141b9a 
bq.    src/local/local.hpp 55f9eaf 
bq.    src/local/local.cpp affe432 
bq.    src/master/master.cpp 4dc9ee0 
bq.    src/messages/messages.proto 87e1548 
bq.    src/sched/sched.cpp dcadb10 
bq.    src/scripts/killtree.sh bceae9d 
bq.    src/slave/constants.hpp f0c8679 
bq.    src/slave/http.cpp 19c48a0 
bq.    src/slave/isolation_module.hpp c896908 
bq.    src/slave/lxc_isolation_module.hpp b7beefe 
bq.    src/slave/lxc_isolation_module.cpp 66a2a89 
bq.    src/slave/main.cpp 85cba25 
bq.    src/slave/process_based_isolation_module.hpp f6f9554 
bq.    src/slave/process_based_isolation_module.cpp 2b37d42 
bq.    src/slave/slave.hpp 279bc7b 
bq.    src/slave/slave.cpp 3358ec4 
bq.    src/slave/statusupdates_manager.hpp PRE-CREATION 
bq.    src/slave/statusupdates_manager.cpp PRE-CREATION 
bq.    src/tests/external_tests.cpp d1b20e4 
bq.    src/tests/fault_tolerance_tests.cpp 6772daf 
bq.    src/tests/slave_restart_tests.cpp PRE-CREATION 
bq.    src/tests/utils.hpp e81ec82 
bq.  
bq.  Diff: https://reviews.apache.org/r/4462/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  make check.
bq.  
bq.  Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!
bq.  
bq.  Recovery is disabled for old tests (though they still checkpoint relevant info!)
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236256#comment-13236256 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/
-----------------------------------------------------------

(Updated 2012-03-23 01:18:12.046729)


Review request for mesos, Benjamin Hindman and John Sirois.


Changes
-------

typo


Summary (updated)
-------

Sorry for the huge  CL!

Slave restarts now supports recovery!
--> Non-disruptive restart means running tasks are not lost
--> Re-connects with live executors
--> Checkpoints and reliably sends status updates
--> Ability to kill executors if the slave upgrade is incompatible with running executors


This addresses bug mesos-110.
    https://issues.apache.org/jira/browse/mesos-110


Diffs
-----

  src/Makefile.am 090c07a 
  src/common/hashset.hpp 1feb610 
  src/common/utils.hpp 015fef8 
  src/exec/exec.cpp 53f7e54 
  src/launcher/launcher.cpp 98a4847 
  src/local/local.hpp 55f9eaf 
  src/local/local.cpp affe432 
  src/master/master.cpp 6ae8aef 
  src/messages/messages.proto 7f9cffe 
  src/sched/sched.cpp 43d9717 
  src/slave/constants.hpp f0c8679 
  src/slave/isolation_module.hpp c896908 
  src/slave/lxc_isolation_module.hpp b7beefe 
  src/slave/lxc_isolation_module.cpp 8c25dd4 
  src/slave/main.cpp ac780c4 
  src/slave/process_based_isolation_module.hpp f6f9554 
  src/slave/process_based_isolation_module.cpp e0f3ee8 
  src/slave/slave.hpp b7bc45a 
  src/slave/slave.cpp 9332caa 
  src/tests/fault_tolerance_tests.cpp 130218d 
  src/tests/slave_restart_tests.cpp PRE-CREATION 
  src/tests/utils.hpp 8c038ce 

Diff: https://reviews.apache.org/r/4462/diff


Testing
-------

make check.

Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!

Recovery is disabled for old tests (though they still checkpoint relevant info!)


Thanks,

Vinod


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MESOS-110) Mesos deploys should not restart tasks

Posted by "Vinod Kone (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone reassigned MESOS-110:
--------------------------------

    Assignee: Vinod Kone
    
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265610#comment-13265610 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------



bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > I've only gotten halfway through ... but there is a bunch here already. I'd like to break this up into at least four patches. (1) The utils stuff that was added. (2) The master changes. (3) The slave::path namespace stuff. (3) The status update manager API + implementation (but not the slave using it yet). And (4) the slave using each of these components, and the executor changes that are included.
bq.  > 
bq.  > These comments are across all of those patches, but I'll make future passes on each of those components.
bq.  
bq.  Vinod Kone wrote:
bq.      addressed the comments for the utils.hpp part. Will send a review for utils.hpp and protobuf_utils.hpp (forgot to include it in this review) shortly.

patch coming for path refactoring shortly.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, line 58
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line58>
bq.  >
bq.  >     s/follows/follows:

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, line 66
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line66>
bq.  >
bq.  >     What is the framework PID? How is that different than the Executor PID mentioned below?

we need to store framework pid because, when an executor re-registers we need its framework's PID to re-create a new Framework() object. 


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.cpp, line 218
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103042#file103042line218>
bq.  >
bq.  >     Isn't the default workd_dir mentioned in the addOption above /tmp/mesos? If so, "/tmp" should be "/tmp/mesos" here, and we probably don't want "/tmp/mesos/mesos". Just clean this up so people understand expectations (i.e., should work_dir be a path including the "mesos" directory, or will we create that ourselves).

I agree its a bit confusing. The reason I did it this way is to avoid dumping work/meta directories directly under '/tmp' (for eg. when an user specifies work_dir=/tmp).
 
I'm indifferent on how we want to do this. I will revert back to 'workRootDir = conf.get("work_dir", "/tmp/mesos") + "/work"' for now;


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, line 56
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line56>
bq.  >
bq.  >     I'd like to stick all of this stuff in it's own file and commit this on it's own (with integration in Slave as applicable and tests).

refactored out path stuff into slave/path.hpp 


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.hpp, lines 344-345
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103041#file103041line344>
bq.  >
bq.  >     What's the difference between "work" and "work root"? Or "meta" and "meta root"?

workRootDir denotes the root directory (conf['work_dir'/mesos/work]) where work directories of a slave are stored. it's not exactly a slave's working directory (that would be workRootDir/slaves/slaveId).

i needed the workRootDir for the path layout stuff.

same goes with metaRootDir


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/slave/slave.cpp, line 2289
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103042#file103042line2289>
bq.  >
bq.  >     Pass these into writeFrameworkPID, no need to make this an instance function. More importantly, you should do this for each of these writers/readers.

moved these write/read functions into path.hpp


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/#review7232
-----------------------------------------------------------


On 2012-04-19 16:53:07, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4462/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-19 16:53:07)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Sorry for the huge  CL!
bq.  
bq.  Slave restarts now supports recovery!
bq.  --> Non-disruptive restart means running tasks are not lost
bq.  --> Re-connects with live executors
bq.  --> Checkpoints and reliably sends status updates
bq.  --> Ability to kill executors if the slave upgrade is incompatible with running executors
bq.  
bq.  
bq.  This addresses bug mesos-110.
bq.      https://issues.apache.org/jira/browse/mesos-110
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/Makefile.am d5edaa2 
bq.    src/common/hashset.hpp 1feb610 
bq.    src/common/utils.hpp 1d81e21 
bq.    src/exec/exec.cpp e8db407 
bq.    src/launcher/launcher.cpp a141b9a 
bq.    src/local/local.hpp 55f9eaf 
bq.    src/local/local.cpp affe432 
bq.    src/master/master.cpp 4dc9ee0 
bq.    src/messages/messages.proto 87e1548 
bq.    src/sched/sched.cpp dcadb10 
bq.    src/scripts/killtree.sh bceae9d 
bq.    src/slave/constants.hpp f0c8679 
bq.    src/slave/http.cpp 19c48a0 
bq.    src/slave/isolation_module.hpp c896908 
bq.    src/slave/lxc_isolation_module.hpp b7beefe 
bq.    src/slave/lxc_isolation_module.cpp 66a2a89 
bq.    src/slave/main.cpp 85cba25 
bq.    src/slave/process_based_isolation_module.hpp f6f9554 
bq.    src/slave/process_based_isolation_module.cpp 2b37d42 
bq.    src/slave/slave.hpp 279bc7b 
bq.    src/slave/slave.cpp 3358ec4 
bq.    src/slave/statusupdates_manager.hpp PRE-CREATION 
bq.    src/slave/statusupdates_manager.cpp PRE-CREATION 
bq.    src/tests/external_tests.cpp d1b20e4 
bq.    src/tests/fault_tolerance_tests.cpp 6772daf 
bq.    src/tests/slave_restart_tests.cpp PRE-CREATION 
bq.    src/tests/utils.hpp e81ec82 
bq.  
bq.  Diff: https://reviews.apache.org/r/4462/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  make check.
bq.  
bq.  Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!
bq.  
bq.  Recovery is disabled for old tests (though they still checkpoint relevant info!)
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263376#comment-13263376 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------



bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 138
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line138>
bq.  >
bq.  >     s/slave "/slave on "

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 176
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line176>
bq.  >
bq.  >     Ditto above.

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 667
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line667>
bq.  >
bq.  >     I suggest just killing this LOG line and keeping the new one you added below.

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 756
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line756>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 801
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line801>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 848
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line848>
bq.  >
bq.  >     s/slave "/slave on "
bq.  >     
bq.  >     Note also that anytime we are printing out the PID, we're getting the IP, so the hostname is not strictly necessary (there a bunch of these below).

fixed.

reg: hostname, i think having it makes debugging less painful.


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 894
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line894>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 905
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line905>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 947
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line947>
bq.  >
bq.  >     ?

reverted


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 978
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line978>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 984
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line984>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 1007
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line1007>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 1020
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line1020>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 1446
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line1446>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 1504
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line1504>
bq.  >
bq.  >     s/"("/" ("

done


bq.  On 2012-04-25 22:11:01, Benjamin Hindman wrote:
bq.  > src/master/master.cpp, line 1731
bq.  > <https://reviews.apache.org/r/4462/diff/3/?file=103029#file103029line1731>
bq.  >
bq.  >     s/"("/" ("

done


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/#review7232
-----------------------------------------------------------


On 2012-04-19 16:53:07, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4462/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-19 16:53:07)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Sorry for the huge  CL!
bq.  
bq.  Slave restarts now supports recovery!
bq.  --> Non-disruptive restart means running tasks are not lost
bq.  --> Re-connects with live executors
bq.  --> Checkpoints and reliably sends status updates
bq.  --> Ability to kill executors if the slave upgrade is incompatible with running executors
bq.  
bq.  
bq.  This addresses bug mesos-110.
bq.      https://issues.apache.org/jira/browse/mesos-110
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/Makefile.am d5edaa2 
bq.    src/common/hashset.hpp 1feb610 
bq.    src/common/utils.hpp 1d81e21 
bq.    src/exec/exec.cpp e8db407 
bq.    src/launcher/launcher.cpp a141b9a 
bq.    src/local/local.hpp 55f9eaf 
bq.    src/local/local.cpp affe432 
bq.    src/master/master.cpp 4dc9ee0 
bq.    src/messages/messages.proto 87e1548 
bq.    src/sched/sched.cpp dcadb10 
bq.    src/scripts/killtree.sh bceae9d 
bq.    src/slave/constants.hpp f0c8679 
bq.    src/slave/http.cpp 19c48a0 
bq.    src/slave/isolation_module.hpp c896908 
bq.    src/slave/lxc_isolation_module.hpp b7beefe 
bq.    src/slave/lxc_isolation_module.cpp 66a2a89 
bq.    src/slave/main.cpp 85cba25 
bq.    src/slave/process_based_isolation_module.hpp f6f9554 
bq.    src/slave/process_based_isolation_module.cpp 2b37d42 
bq.    src/slave/slave.hpp 279bc7b 
bq.    src/slave/slave.cpp 3358ec4 
bq.    src/slave/statusupdates_manager.hpp PRE-CREATION 
bq.    src/slave/statusupdates_manager.cpp PRE-CREATION 
bq.    src/tests/external_tests.cpp d1b20e4 
bq.    src/tests/fault_tolerance_tests.cpp 6772daf 
bq.    src/tests/slave_restart_tests.cpp PRE-CREATION 
bq.    src/tests/utils.hpp e81ec82 
bq.  
bq.  Diff: https://reviews.apache.org/r/4462/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  make check.
bq.  
bq.  Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!
bq.  
bq.  Recovery is disabled for old tests (though they still checkpoint relevant info!)
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236253#comment-13236253 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and John Sirois.


Summary
-------

Sorry for the huge  CL!

Slave restarts now supports recovery!
--> Non-disruptive restart means running tasks are not lost
--> Re-connects to live executors
--> Checkpoints and reliably sends status updates
--> Ability to kill executors if the slave upgrade is incompatible with running executors


This addresses bug mesos-110.
    https://issues.apache.org/jira/browse/mesos-110


Diffs
-----

  src/Makefile.am 090c07a 
  src/common/hashset.hpp 1feb610 
  src/common/utils.hpp 015fef8 
  src/exec/exec.cpp 53f7e54 
  src/launcher/launcher.cpp 98a4847 
  src/local/local.hpp 55f9eaf 
  src/local/local.cpp affe432 
  src/master/master.cpp 6ae8aef 
  src/messages/messages.proto 7f9cffe 
  src/sched/sched.cpp 43d9717 
  src/slave/constants.hpp f0c8679 
  src/slave/isolation_module.hpp c896908 
  src/slave/lxc_isolation_module.hpp b7beefe 
  src/slave/lxc_isolation_module.cpp 8c25dd4 
  src/slave/main.cpp ac780c4 
  src/slave/process_based_isolation_module.hpp f6f9554 
  src/slave/process_based_isolation_module.cpp e0f3ee8 
  src/slave/slave.hpp b7bc45a 
  src/slave/slave.cpp 9332caa 
  src/tests/fault_tolerance_tests.cpp 130218d 
  src/tests/slave_restart_tests.cpp PRE-CREATION 
  src/tests/utils.hpp 8c038ce 

Diff: https://reviews.apache.org/r/4462/diff


Testing
-------

make check.

Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!

Recovery is disabled for old tests (though they still checkpoint relevant info!)


Thanks,

Vinod


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257596#comment-13257596 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/
-----------------------------------------------------------

(Updated 2012-04-19 16:53:07.262208)


Review request for mesos, Benjamin Hindman and John Sirois.


Changes
-------

status update manager re-factor.

one of the tests is a bit flaky due to a race. but its a problem with the test, not the slave code. i will fix it after hackweek, meanwhile, if you have cycles take a look.


Summary
-------

Sorry for the huge  CL!

Slave restarts now supports recovery!
--> Non-disruptive restart means running tasks are not lost
--> Re-connects with live executors
--> Checkpoints and reliably sends status updates
--> Ability to kill executors if the slave upgrade is incompatible with running executors


This addresses bug mesos-110.
    https://issues.apache.org/jira/browse/mesos-110


Diffs (updated)
-----

  src/Makefile.am d5edaa2 
  src/common/hashset.hpp 1feb610 
  src/common/utils.hpp 1d81e21 
  src/exec/exec.cpp e8db407 
  src/launcher/launcher.cpp a141b9a 
  src/local/local.hpp 55f9eaf 
  src/local/local.cpp affe432 
  src/master/master.cpp 4dc9ee0 
  src/messages/messages.proto 87e1548 
  src/sched/sched.cpp dcadb10 
  src/scripts/killtree.sh bceae9d 
  src/slave/constants.hpp f0c8679 
  src/slave/http.cpp 19c48a0 
  src/slave/isolation_module.hpp c896908 
  src/slave/lxc_isolation_module.hpp b7beefe 
  src/slave/lxc_isolation_module.cpp 66a2a89 
  src/slave/main.cpp 85cba25 
  src/slave/process_based_isolation_module.hpp f6f9554 
  src/slave/process_based_isolation_module.cpp 2b37d42 
  src/slave/slave.hpp 279bc7b 
  src/slave/slave.cpp 3358ec4 
  src/slave/statusupdates_manager.hpp PRE-CREATION 
  src/slave/statusupdates_manager.cpp PRE-CREATION 
  src/tests/external_tests.cpp d1b20e4 
  src/tests/fault_tolerance_tests.cpp 6772daf 
  src/tests/slave_restart_tests.cpp PRE-CREATION 
  src/tests/utils.hpp e81ec82 

Diff: https://reviews.apache.org/r/4462/diff


Testing
-------

make check.

Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!

Recovery is disabled for old tests (though they still checkpoint relevant info!)


Thanks,

Vinod


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262184#comment-13262184 ] 

jiraposter@reviews.apache.org commented on MESOS-110:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4462/#review7232
-----------------------------------------------------------


I've only gotten halfway through ... but there is a bunch here already. I'd like to break this up into at least four patches. (1) The utils stuff that was added. (2) The master changes. (3) The slave::path namespace stuff. (3) The status update manager API + implementation (but not the slave using it yet). And (4) the slave using each of these components, and the executor changes that are included.

These comments are across all of those patches, but I'll make future passes on each of those components.


src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15918>

    Result is deprecated. If we can replace it with Try, please do.



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15920>

    Add that this is at the current file position of the file descriptor.



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15919>

    How is this helpful? (If this came from my code, it should be removed.)



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15926>

    I'd prefer if this did not seek to the beginning and read the file, but rather read from the current position until the end (and have the comment say as much).



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15924>

    Blah.



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15923>

    This looks like a bug ('offset' as the third argument?).



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15925>

    No need for the space here though.



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15927>

    Again, should be killed (only makes sense in a macro).



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15928>

    You should refactor the protobuf::read and protobuf::write to use these versions of read and write now as well.



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15929>

    man 3 dirname (and use it please).



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15930>

    s/file_pattern/pattern



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15932>

    s/result/results



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15936>

    Kill.



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15931>

    Why not return a Try instead?



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15935>

    Why is this a hack?



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15933>

    s/p/result or s/p/path



src/common/utils.hpp
<https://reviews.apache.org/r/4462/#comment15934>

    Kill.



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15949>

    It is not possible that the SlaveInfo could have changed? It seems like that needs to get passed along as well.



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15937>

    s/on/to
    
    Also, put this after the CHECK below.



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15943>

    Should this be a 'shutdown' instead of a CHECK?



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15939>

    s/ Expected/, expected



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15940>

    s/ Received/, received



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15941>

    s/UPID/const UPID&



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15942>

    Space.



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15944>

    This relies on a side-effect of shutdown, namely that it calls exit(0). But that means if we try and test this case in "local" mode we'll end up falling out of this if block and executing the rest of the code in this function. Please fix.



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15945>

    This needs a comment about why it's the case that a task will only ever be sent from the slave to the executor once (and thus, the check is warranted).



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15947>

    Maybe we only do this in the else branch below?



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15948>

    I'd prefer to call 'shutdown' here, even if it means revisiting/refactoring that function given my comments above.



src/exec/exec.cpp
<https://reviews.apache.org/r/4462/#comment15946>

    I'd prefer to just call this 'tasks'.



src/launcher/launcher.cpp
<https://reviews.apache.org/r/4462/#comment15950>

    Kill.



src/local/local.hpp
<https://reviews.apache.org/r/4462/#comment15951>

    s/recovery/recover



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15952>

    s/slave "/slave on "



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15953>

    Ditto above.



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15954>

    I suggest just killing this LOG line and keeping the new one you added below.



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15955>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15956>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15957>

    s/slave "/slave on "
    
    Note also that anytime we are printing out the PID, we're getting the IP, so the hostname is not strictly necessary (there a bunch of these below).



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15958>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15959>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15960>

    ?



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15961>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15962>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15963>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15964>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15965>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15966>

    s/"("/" ("



src/master/master.cpp
<https://reviews.apache.org/r/4462/#comment15967>

    s/"("/" ("



src/messages/messages.proto
<https://reviews.apache.org/r/4462/#comment15969>

    Newline.



src/messages/messages.proto
<https://reviews.apache.org/r/4462/#comment15968>

    Newline.



src/messages/messages.proto
<https://reviews.apache.org/r/4462/#comment15970>

    Seems like having the SlaveInfo would be a good thing (maybe the slave port changed? maybe other things will be added in the future that could change?).



src/scripts/killtree.sh
<https://reviews.apache.org/r/4462/#comment15971>

    How about:
    
    echo "$(basename ${0}): '${PID}' should be a number"



src/slave/constants.hpp
<https://reviews.apache.org/r/4462/#comment15972>

    This seems low ... ?



src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15973>

    Please don't use snake_case.



src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15978>

    So, this is identical to what's in process_based_isolation_module.cpp. At the time of designing this with you my thought was that we would be writing "different" data to disk. In particular, this might be isolation module "specific" data. I think the right model here is really to have the isolation module know what the meta directory is for the executor and have it's own well defined place where it can write data. In LXC's case, we'll want to probably write not just the pid but also the container name.



src/slave/lxc_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15975>

    So, if we are just given the directory for this executor we could just recover the file that we expected to write (above). If we somehow need to get this from the pid_t passed in, then we'll need to have the location for the 'cgroup' directory, from which we can look at all /path/to/cgroup/*/tasks to see if this pid is a parent of one of those tasks. This should always work.
    
    Again, the alternative will be to have this module actually write some of it's own state to the meta-directory, e.g., the name of this container. This might require calling IsolationModule::recover in the slave earlier though.



src/slave/main.cpp
<https://reviews.apache.org/r/4462/#comment15979>

    I'd prefer to type on the command line:
    
    /path/to/mesos-slave --recover
    
    rather than:
    
    /path/to/mesos-slave --recovery



src/slave/process_based_isolation_module.hpp
<https://reviews.apache.org/r/4462/#comment15981>

    Remove newline.



src/slave/process_based_isolation_module.hpp
<https://reviews.apache.org/r/4462/#comment15982>

    Why'd you move this?



src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15983>

    Not used (and it's deprecated).



src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15984>

    I'm not a fan of this factored out. It's not that much code, and I'd prefer to see exactly what's happening here.



src/slave/process_based_isolation_module.cpp
<https://reviews.apache.org/r/4462/#comment15985>

    Doesn't this happen automagically because the reaper tells us about all child processes exiting/terminating?



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15988>

    I'd like to stick all of this stuff in it's own file and commit this on it's own (with integration in Slave as applicable and tests).



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15986>

    s/follows/follows:



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15987>

    What is the framework PID? How is that different than the Executor PID mentioned below?



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15989>

    Why not just get reconnect from configuration rather than passing it in here and having another default?



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15990>

    No snake_case please.



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15991>

    Kill.



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15992>

    What's the difference between "work" and "work root"? Or "meta" and "meta root"?



src/slave/slave.hpp
<https://reviews.apache.org/r/4462/#comment15993>

    Makes me think it shouldn't be an instance variable or recovery needs to be factored out into it's own process ... I think we briefly talked about this though.



src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15994>

    I'm afraid that even this might be too complicated. We might want a single option 'recover', that takes a string instead of a bool, which specifies "how" to recover. For example, --recover=reconnect might mean recover the state and executors, where as --recover=upgrade might mean recover the state but then kill all executors (and create/send appropriate status updates). That way, when an operator goes to restart mesos, they are forced to specify how recovery is done, rather than specifying one option (e.g, --recover) and forgetting the other one and making a mistake. (Note, "reconnect" and "upgrade" might not be the best names. Also, doing it this way might make the name 'recovery' instead of 'recover' work ... e.g., --recovery=reconnect could be read as "the recovery strategy is reconnect to the executors".)



src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15995>

    Isn't the default workd_dir mentioned in the addOption above /tmp/mesos? If so, "/tmp" should be "/tmp/mesos" here, and we probably don't want "/tmp/mesos/mesos". Just clean this up so people understand expectations (i.e., should work_dir be a path including the "mesos" directory, or will we create that ourselves).



src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15996>

    All the more reason to merge the options.



src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15997>

    s/Skipping/Delaying
    
    Also, let's make this LOG(WARNING).



src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment15999>

    You should just return a value (let the compiler optimize the move), no need to allocate on heap. In this tiny bit of code you even forget to free the object! Also, createTask does not appear to be in this review. Let's stick it in it's own review that gets committed ASAP.



src/slave/slave.cpp
<https://reviews.apache.org/r/4462/#comment16001>

    Pass these into writeFrameworkPID, no need to make this an instance function. More importantly, you should do this for each of these writers/readers.


- Benjamin


On 2012-04-19 16:53:07, Vinod Kone wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4462/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-19 16:53:07)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and John Sirois.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Sorry for the huge  CL!
bq.  
bq.  Slave restarts now supports recovery!
bq.  --> Non-disruptive restart means running tasks are not lost
bq.  --> Re-connects with live executors
bq.  --> Checkpoints and reliably sends status updates
bq.  --> Ability to kill executors if the slave upgrade is incompatible with running executors
bq.  
bq.  
bq.  This addresses bug mesos-110.
bq.      https://issues.apache.org/jira/browse/mesos-110
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/Makefile.am d5edaa2 
bq.    src/common/hashset.hpp 1feb610 
bq.    src/common/utils.hpp 1d81e21 
bq.    src/exec/exec.cpp e8db407 
bq.    src/launcher/launcher.cpp a141b9a 
bq.    src/local/local.hpp 55f9eaf 
bq.    src/local/local.cpp affe432 
bq.    src/master/master.cpp 4dc9ee0 
bq.    src/messages/messages.proto 87e1548 
bq.    src/sched/sched.cpp dcadb10 
bq.    src/scripts/killtree.sh bceae9d 
bq.    src/slave/constants.hpp f0c8679 
bq.    src/slave/http.cpp 19c48a0 
bq.    src/slave/isolation_module.hpp c896908 
bq.    src/slave/lxc_isolation_module.hpp b7beefe 
bq.    src/slave/lxc_isolation_module.cpp 66a2a89 
bq.    src/slave/main.cpp 85cba25 
bq.    src/slave/process_based_isolation_module.hpp f6f9554 
bq.    src/slave/process_based_isolation_module.cpp 2b37d42 
bq.    src/slave/slave.hpp 279bc7b 
bq.    src/slave/slave.cpp 3358ec4 
bq.    src/slave/statusupdates_manager.hpp PRE-CREATION 
bq.    src/slave/statusupdates_manager.cpp PRE-CREATION 
bq.    src/tests/external_tests.cpp d1b20e4 
bq.    src/tests/fault_tolerance_tests.cpp 6772daf 
bq.    src/tests/slave_restart_tests.cpp PRE-CREATION 
bq.    src/tests/utils.hpp e81ec82 
bq.  
bq.  Diff: https://reviews.apache.org/r/4462/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  make check.
bq.  
bq.  Note that only the new test in tests/slave_restart_tests.cpp  engages in recovery!
bq.  
bq.  Recovery is disabled for old tests (though they still checkpoint relevant info!)
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Vinod
bq.  
bq.


                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "Vinod Kone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486565#comment-13486565 ] 

Vinod Kone commented on MESOS-110:
----------------------------------

Stu,

Just wanted to let you know this is definitely on our Q4 roadmap, and there has been considerable progress made so far. We will keep you guys updated as and when we get close to deploy this feature.
                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MESOS-110) Mesos deploys should not restart tasks

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486343#comment-13486343 ] 

Stu Hood commented on MESOS-110:
--------------------------------

This ticket is critical for services that would not normally be preempted except for hardware failures... tasks restarting due to slave restarts adds a shared failure domain to the entire system that is impossible for tasks to avoid, regardless of how well they are replicated.
                
> Mesos deploys should not restart tasks
> --------------------------------------
>
>                 Key: MESOS-110
>                 URL: https://issues.apache.org/jira/browse/MESOS-110
>             Project: Mesos
>          Issue Type: Improvement
>          Components: framework
>            Reporter: Rob Benson
>            Assignee: Vinod Kone
>
> Running a long-lived service on Mesos has a significant drawback right now in that Mesos build deploys restart your tasks. This could lead to nontrivial outages for services that have a high warm-up time.  Basically everything would need a graceful restart mechanism that basically allows a shutdown/restart with a new version of the code. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira