You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Harvey Feng <h....@berkeley.edu> on 2012/04/18 06:27:26 UTC

Review Request: Updates and additions to the MPI framework

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Charles Reiss.


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-04-20 17:58:45, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 159
> > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line159>
> >
> >     Account for MPICH2PATH here.

done.


> On 2012-04-20 17:58:45, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 194
> > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line194>
> >
> >     Try to keep us below 80 chars (or at least below 100); split into multiple lines.

done.


> On 2012-04-20 17:58:45, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 85
> > <https://reviews.apache.org/r/4768/diff/2/?file=103456#file103456line85>
> >
> >     Assuming mpd tries to do something graceful on SIGTERM, try SIGTERM, wait a bit, then try SIGKILL (and below).

Gave it a 5 second interval


> On 2012-04-20 17:58:45, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 32
> > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line32>
> >
> >     MPI_TASK should be an array (and below).

ok. The executable should be the first argument in the array, renamed MPI_PROGRAM.


> On 2012-04-20 17:58:45, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 210
> > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line210>
> >
> >     Account for MPICH2PATH here.

done. made it a global variable in both files too.


> On 2012-04-20 17:58:45, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 215
> > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line215>
> >
> >     close() this; does this (and existing code calling mpdtrace) leave mpdtrace as a zombie process?

mpdtrace's were left as zombie processes...it uses communicate() now to get the the stdout.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7086
-----------------------------------------------------------


On 2012-04-21 05:08:47, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-21 05:08:47)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Charles Reiss <wo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7086
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15686>

    MPI_TASK should be an array (and below).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15691>

    Account for MPICH2PATH here.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15689>

    Try to keep us below 80 chars (or at least below 100); split into multiple lines.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15692>

    Account for MPICH2PATH here.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15687>

    close() this; does this (and existing code calling mpdtrace) leave mpdtrace as a zombie process?



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15690>

    Assuming mpd tries to do something graceful on SIGTERM, try SIGTERM, wait a bit, then try SIGKILL (and below).


- Charles


On 2012-04-20 08:17:57, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-20 08:17:57)
> 
> 
> Review request for mesos, Benjamin Hindman and Charles Reiss.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/README.txt, line 19
> > <https://reviews.apache.org/r/4768/diff/5/?file=105961#file105961line19>
> >
> >     I know it's obvious, but you might want to remind users that you'll need to install mpich2 on every machine in your cluster?.

Done.


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/README.txt, line 23
> > <https://reviews.apache.org/r/4768/diff/5/?file=105961#file105961line23>
> >
> >     Kill whitespace.

Done.


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/README.txt, line 25
> > <https://reviews.apache.org/r/4768/diff/5/?file=105961#file105961line25>
> >
> >     Kill whitespace.

Done.


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 26
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line26>
> >
> >     s/mpd slots/mpd(s)

Done


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 71
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line71>
> >
> >     If you move this check into the 'for offer in offers:' on line 60, then you'll only be doing the check and decline in one place (not also on lines 107 and 108).

Done


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 118
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line118>
> >
> >     Again, I'm not sure how ifhn_slave is going to be used. Can you elaborate?

I left this in pending Jessica's response...it's removed now.


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 121
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line121>
> >
> >     I love the long options! Thank you!


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 209
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line209>
> >
> >     +1 to Jessica's comment.

This simplifies the trailing '/' check/fix to just os.path.join(options.path, ""). 


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 221
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line221>
> >
> >     +1 to Jessica's comment.

Unchanged after using the above.


> On 2012-05-04 01:41:20, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 230
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line230>
> >
> >     mpdtraceerr is not used, kill it please.

Done.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7541
-----------------------------------------------------------


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7541
-----------------------------------------------------------


Awesome refactor! Super excited to get this committed (and I think Jessica will be too!). Just a few more minor points (please address Jessica's comments too). Thank you!


frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment16703>

    I know it's obvious, but you might want to remind users that you'll need to install mpich2 on every machine in your cluster?.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment16701>

    Kill whitespace.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment16702>

    Kill whitespace.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16704>

    s/mpd slots/mpd(s)



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16707>

    If you move this check into the 'for offer in offers:' on line 60, then you'll only be doing the check and decline in one place (not also on lines 107 and 108).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16705>

    Again, I'm not sure how ifhn_slave is going to be used. Can you elaborate?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16706>

    I love the long options! Thank you!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16710>

    +1 to Jessica's comment.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16708>

    +1 to Jessica's comment.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16709>

    mpdtraceerr is not used, kill it please.


- Benjamin


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.

> On 2012-05-08 03:39:29, Benjamin Hindman wrote:
> > I'll get this checked in provided Jessica gives it a "Ship It". Thanks the the good work here, I intend to make it a demonstration of how to write frameworks on Mesos!

Scratch that. I voted to ship it and then remembered an issue that I don't think has been addressed yet. I posted this on the jira, but I haven't seen any changes for it: 

I'm running into the setuptools issue addressed in the test python framework: https://issues.apache.org/jira/browse/MESOS-130. The locations of the eggs added to PYTHONPATH in nmpiexec [now mpiexec-mesos?] need to be updated so that the Mesos/protobuf libraries (and setuptools) don't have to be installed on every node. 

There also seems to be an issue with Python detecting the Mesos module from the egg in src/python/dist--I couldn't import mesos until I unzipped the egg, no matter what directory I was in or how I modified the PYTHONPATH. [Update: I believe it's related to the fact that the mesos egg uses C/C++ extensions. I think it needs to use a setuptools module to list the package contents.]


- Jessica


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7666
-----------------------------------------------------------


On 2012-05-08 01:29:06, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-08 01:29:06)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7666
-----------------------------------------------------------

Ship it!


I'll get this checked in provided Jessica gives it a "Ship It". Thanks the the good work here, I intend to make it a demonstration of how to write frameworks on Mesos!

- Benjamin


On 2012-05-08 01:29:06, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-08 01:29:06)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.

> On 2012-05-25 18:12:45, Jessica wrote:
> > frameworks/mpi/mpiexec-mesos.py, line 61
> > <https://reviews.apache.org/r/4768/diff/8/?file=109962#file109962line61>
> >
> >     I've been puzzling over why the return is an issue with this revision since it wasn't with earlier revisions, and I believe it's due to the fact that the return is within the for loop. Before, this return was outside of the loop, so we'd always complete the loop. Once the loop completed, we'd check if we had enough mpds, and if so, we'd launch. With this revision, we may never get a chance to complete the loop and thus never check if we have enough resources. I think a break would solve the problem, provided it's acceptable not to respond to all of the offers. Otherwise, we need to make sure to decline all offers.
> 
> Harvey Feng wrote:
>     You're right, I missed this :(. A continue would make sure we decline all the offers if enough tasks are launched.

Yes; however, after further investigation, I've discovered that completing the function results in threading.Thread(target=mpiexec).start() getting called multiple times. So I guess it either needs to go back to how it was before (with the return before the loop) or there needs to be some kind of flag that indicates whether the thread has already been launched. (I used the flag approach, and it worked fine, but maybe you have a better idea.)


- Jessica


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8116
-----------------------------------------------------------


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-25 18:12:45, Jessica wrote:
> > frameworks/mpi/mpiexec-mesos.py, line 61
> > <https://reviews.apache.org/r/4768/diff/8/?file=109962#file109962line61>
> >
> >     I've been puzzling over why the return is an issue with this revision since it wasn't with earlier revisions, and I believe it's due to the fact that the return is within the for loop. Before, this return was outside of the loop, so we'd always complete the loop. Once the loop completed, we'd check if we had enough mpds, and if so, we'd launch. With this revision, we may never get a chance to complete the loop and thus never check if we have enough resources. I think a break would solve the problem, provided it's acceptable not to respond to all of the offers. Otherwise, we need to make sure to decline all offers.
> 
> Harvey Feng wrote:
>     You're right, I missed this :(. A continue would make sure we decline all the offers if enough tasks are launched.
> 
> Jessica wrote:
>     Yes; however, after further investigation, I've discovered that completing the function results in threading.Thread(target=mpiexec).start() getting called multiple times. So I guess it either needs to go back to how it was before (with the return before the loop) or there needs to be some kind of flag that indicates whether the thread has already been launched. (I used the flag approach, and it worked fine, but maybe you have a better idea.)

Fixed by adding a flag.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8116
-----------------------------------------------------------


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-25 18:12:45, Jessica wrote:
> > frameworks/mpi/mpiexec-mesos.py, line 61
> > <https://reviews.apache.org/r/4768/diff/8/?file=109962#file109962line61>
> >
> >     I've been puzzling over why the return is an issue with this revision since it wasn't with earlier revisions, and I believe it's due to the fact that the return is within the for loop. Before, this return was outside of the loop, so we'd always complete the loop. Once the loop completed, we'd check if we had enough mpds, and if so, we'd launch. With this revision, we may never get a chance to complete the loop and thus never check if we have enough resources. I think a break would solve the problem, provided it's acceptable not to respond to all of the offers. Otherwise, we need to make sure to decline all offers.

You're right, I missed this :(. A continue would make sure we decline all the offers if enough tasks are launched.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8116
-----------------------------------------------------------


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8116
-----------------------------------------------------------



frameworks/mpi/mpiexec-mesos.py
<https://reviews.apache.org/r/4768/#comment17634>

    I've been puzzling over why the return is an issue with this revision since it wasn't with earlier revisions, and I believe it's due to the fact that the return is within the for loop. Before, this return was outside of the loop, so we'd always complete the loop. Once the loop completed, we'd check if we had enough mpds, and if so, we'd launch. With this revision, we may never get a chance to complete the loop and thus never check if we have enough resources. I think a break would solve the problem, provided it's acceptable not to respond to all of the offers. Otherwise, we need to make sure to decline all offers.


- Jessica


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Benjamin Hindman <be...@berkeley.edu>.

> On June 1, 2012, 11:25 a.m., Jessica wrote:
> > Looks great--everything appears to work now. Thanks so much for your work!

I've just committed this, thanks Harvey!


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8282
-----------------------------------------------------------


On June 1, 2012, 10:21 a.m., Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated June 1, 2012, 10:21 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Description
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey Feng
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8282
-----------------------------------------------------------

Ship it!


Looks great--everything appears to work now. Thanks so much for your work!

- Jessica


On 2012-06-01 10:21:09, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-06-01 10:21:09)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-06-01 10:21:09.375357)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

L59 should fix the start mpiexec problem


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/mpiexec-mesos PRE-CREATION 
  frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-06-01 10:04:04.996783)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

-Fixed starting of mpiexec and declining of offers once started
-Added command line arg support for MPI_PROGRAM


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/mpiexec-mesos PRE-CREATION 
  frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8115
-----------------------------------------------------------



frameworks/mpi/mpiexec-mesos.py
<https://reviews.apache.org/r/4768/#comment17633>

    I think maybe this needs to be a break rather than a return. Putting a return here means that we might never get to line 104, and the process will just sit there forever declining offers.


- Jessica


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-31 15:42:48, Jessica wrote:
> > frameworks/mpi/mpiexec-mesos.py, line 25
> > <https://reviews.apache.org/r/4768/diff/8/?file=109962#file109962line25>
> >
> >     This is incorrect. Replace with
> >     
> >     call([MPICH2PATH + 'mpiexec', '-l', '-n', str(TOTAL_MPDS)] + MPI_PROGRAM)
> >     
> >     (note that MPI_PROGRAM is the entire program list appended to the first list)
> >     
> >     This allows MPI_PROGRAM to contain command line arguments.

Done


> On 2012-05-31 15:42:48, Jessica wrote:
> > frameworks/mpi/mpiexec-mesos.py, line 141
> > <https://reviews.apache.org/r/4768/diff/8/?file=109962#file109962line141>
> >
> >     To allow mpi_program to contain command line arguments, add the line 
> >     
> >     parser.disable_interspersed_args()
> >     
> >     (http://docs.python.org/library/optparse.html#querying-and-manipulating-your-option-parser)

Done


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8256
-----------------------------------------------------------


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review8256
-----------------------------------------------------------



frameworks/mpi/mpiexec-mesos.py
<https://reviews.apache.org/r/4768/#comment17880>

    This is incorrect. Replace with
    
    call([MPICH2PATH + 'mpiexec', '-l', '-n', str(TOTAL_MPDS)] + MPI_PROGRAM)
    
    (note that MPI_PROGRAM is the entire program list appended to the first list)
    
    This allows MPI_PROGRAM to contain command line arguments.



frameworks/mpi/mpiexec-mesos.py
<https://reviews.apache.org/r/4768/#comment17881>

    To allow mpi_program to contain command line arguments, add the line 
    
    parser.disable_interspersed_args()
    
    (http://docs.python.org/library/optparse.html#querying-and-manipulating-your-option-parser)


- Jessica


On 2012-05-23 23:44:52, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-23 23:44:52)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/mpiexec-mesos PRE-CREATION 
>   frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
>   frameworks/mpi/README.txt cdb4553 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-23 23:44:52.955726)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

Added path to distribute egg in mpiexec-mesos and wrapped a line in the Readme. Also changed nmpiexec filenames to mpiexec-mesos.


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/mpiexec-mesos PRE-CREATION 
  frameworks/mpi/mpiexec-mesos.py PRE-CREATION 
  frameworks/mpi/README.txt cdb4553 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-21 18:16:49.284988)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

Added paths (with defaults) to mesos and protobuf eggs ($MESOS_EGG and $PROTOBUF_EGG) in mpiexec-mesos script (still named nmpiexec at the moment).


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-08 12:04:40, Jessica wrote:
> > frameworks/mpi/nmpiexec.py, line 187
> > <https://reviews.apache.org/r/4768/diff/6/?file=107646#file107646line187>
> >
> >     Super minor issue, so I'm still voting to ship it, but convention and consistency say that the = in default = "" should not be surrounded by whitespace.

Fixed.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7677
-----------------------------------------------------------


On 2012-05-21 18:16:49, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-21 18:16:49)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7677
-----------------------------------------------------------

Ship it!


As far as I can tell, everything looks good. I won't have access to my test environment until Thursday, but you know you'll hear from me if there are any bugs. :) Thanks for your work on this!


frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16907>

    Super minor issue, so I'm still voting to ship it, but convention and consistency say that the = in default = "" should not be surrounded by whitespace.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16905>

    Clever technique. I definitely wouldn't have thought to join here. (I also verified that you won't have issues if the user doesn't define a path. os.path.join('','') == '', so it's perfect.)


- Jessica


On 2012-05-08 01:29:06, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-08 01:29:06)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-08 01:29:06.075735)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

-Updated some of the logic from the previous diff.
-Better usage of os.path.join()
-References to "nmpiexec*" have been changed to "mpiexec-mesos*", but the filenames still need to be changed...


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-07 19:19:17, Jessica wrote:
> > frameworks/mpi/nmpiexec.py, line 118
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line118>
> >
> >     I agree with Benjamin--ifhn_slave probably needs to go. I've created a new issue (MESOS-189) for configuring IP addresses that I think is relevant.
> >     
> >     Also... technically, you should use "is" and "is not" instead of "==" and "!=" to compare against None. This is the pythonic (and minutely faster) way.

Ok, switched from "==" to "is"


> On 2012-05-07 19:19:17, Jessica wrote:
> > frameworks/mpi/nmpiexec.py, line 186
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line186>
> >
> >     I'd recommend passing default='' to parser.add_option so that you don't have to check if path is None later in the code. (This will also simplify calls to os.path.join since you won't have to check the value of path first.)

Yup, this leaves just one "os.path.join()" call.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7651
-----------------------------------------------------------


On 2012-05-08 01:29:06, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-08 01:29:06)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7651
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16862>

    I agree with Benjamin--ifhn_slave probably needs to go. I've created a new issue (MESOS-189) for configuring IP addresses that I think is relevant.
    
    Also... technically, you should use "is" and "is not" instead of "==" and "!=" to compare against None. This is the pythonic (and minutely faster) way.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16860>

    I'd recommend passing default='' to parser.add_option so that you don't have to check if path is None later in the code. (This will also simplify calls to os.path.join since you won't have to check the value of path first.)



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16861>

    setting path default to '' eliminates this check.


- Jessica


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-05-02 14:58:27, Jessica wrote:
> > frameworks/mpi/nmpiexec.py, line 209
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line209>
> >
> >     os.path.join will handle this check and "do the right thing" wherever you use it, whether the user specifies the ending slash or not.

I used it to add a trailing '/' to the specified path, if necessary. Thanks!


> On 2012-05-02 14:58:27, Jessica wrote:
> > frameworks/mpi/nmpiexec.py, line 221
> > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line221>
> >
> >     mpd_cmd = os.path.join(MPICH2PATH, "mpd")

Unchanged after adding the above


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7469
-----------------------------------------------------------


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7469
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16516>

    os.path.join will handle this check and "do the right thing" wherever you use it, whether the user specifies the ending slash or not.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16517>

    mpd_cmd = os.path.join(MPICH2PATH, "mpd")


- Jessica


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-02 13:29:50.636167)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

I'm not sure how to get 'git diff' results using 'mpiexec-mesos*' filenames, so I left the two files with 'nmpiexec' names unchanged. All text/code uses mpiexec-mesos though...

This patch also deals with many style issues from the last one, and eliminates the executor code (startmpd).


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-02 13:00:12.162674)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

-Fixed lots of style issues...
-Converted to a no-executor framework


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 270
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line270>
> >
> >     Did someone actually ask for this feature?

No, but I thought it wouldn't hurt...


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 253
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line253>
> >
> >     I'd like to make this whole thing simpler. In particular, I don't see any need for an executor here (i.e., the startmpd.py script). We actually have a 1-1 mapping from tasks to mpd's, so let's just have our TaskInfo's have a CommandInfo which just launches the mpd. Something like this for that CommandInfo's value:
> >     
> >     ...command.value = "${MPICH2PATH}mpd -n --host=${MPD_MASTER_IP} --port=${MPD_MASTER_PORT}"
> >     
> >     (Note your code, as does what I have above, assumes that MPICH2PATH includes the trailing '/'. Also, you might need to change the string based on if you're passing --ifhn. Finally, note I used the long options --host and --port. This is for readability for other people that might not know mpd very well. Likewise, if there is a long option for -n it would be great to use that instead.)
> >     
> >     Of course, this will require setting the command's environment variables appropriately. But, this should let us completely eliminate the startmpd.py script!!!!! Yeah! Less things to maintain!

Ok. Switched to no-executor. I directly specified the variables during string/command construction, so there isn't a need to set environment variables. Got rid of startmpd.py/startmpd.h =)


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 199
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line199>
> >
> >     s/slot/mpd
> >     
> >     Same as above, is there something we can/should set when we launch the mpd?

There doesn't seem to be anything to set for memory when launching the mpd...


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 28
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line28>
> >
> >     I know this wasn't the style of the codebase you've inherited, but I'd like spaces around all operators please. For example, this line should read:
> >     
> >     print "Got " + str(TOTAL_TASKS) + " mpd slots, running mpiexec"
> >     
> >     It looks like you've already done this with some of the code you've added (which is awesome!), but please clean up all the code. Thanks!

Done. I converted most string operations to use %, roughly following the Google style guide for Python.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/README.txt, line 76
> > <https://reviews.apache.org/r/4768/diff/3/?file=103692#file103692line76>
> >
> >     Does nmpiexec work out of the box without changes (not included in this review).
> >     
> >     Also, let's change the names from "nmpiexec*" to "mpiexec-mesos*"! ;)
> >     
> >     Finally, just pass host:port, no need for the 'master@' prefix.

Yup, nmpiexec works. 


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 196
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line196>
> >
> >     s/slot/mpd
> >     
> >     Also, do we want to set the '--ncpus' option on the actual mpd that we launch on a Mesos slave?

Added --npus to slave mpd calls.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, lines 91-92
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line91>
> >
> >     Use driver.declineOffer please.

Done for this and others below. Issue#188 (small bug fix for Python/C++ binding) would have to be committed for driver.declineOffer to work though.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


On 2012-05-02 13:00:12, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:00:12)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.

> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 209
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line209>
> >
> >     I'm not really sure how this can be used: the user running this script will not know what machines they might run on, so they can't possibly know which IP addresses they want to use on those machines. Maybe Jessica J. had something else in mind here?
> >     
> >     It definitely makes sense to keep --ifhn for the master.

Hmmm... Looks like my comment here disappeared somehow. Anyway, I agree that the --ifhn-slave option doesn't make sense since there's no way you can specify an IP address for each slave. I guess what I had in mind was a more general Mesos configuration option rather than specific to the MPI framework. 

>From a selfish standpoint, I'm not terribly concerned since the master was the option I was concerned about. However, I've been thinking that, assuming you're using the deploy scripts to start your cluster, it may be worth considering modifying the format of the slaves configuration file (which currently lists only hostnames) and allowing the user to also specify an IP address for each host. Then perhaps the MPI framework could grab the IP address from the Mesos configuration. This would be useful for deploying Mesos as well since some users (such as myself) may have their Mesos config files in an NTFS directory. (This setup means I can't start the entire cluster at one go if I need to give any of my nodes a specific IP address since all nodes will try to use the same ip option in mesos.conf.) Just a thought... I'll open a general Mesos "Improvement" ticket if there's any chance of it happening.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 223
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line223>
> >
> >     It looks like you assume that path ends in a '/'. You should probably check this here.

Why not use os.path.join?


- Jessica


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 278
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line278>
> >
> >     Kill extra space after 'args[0]'.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 258
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line258>
> >
> >     Why the indentation? And add a period at the end of the sentence please.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 257
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line257>
> >
> >     s/executor/executor.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 225
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line225>
> >
> >     s/mesos/Mesos

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 219
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line219>
> >
> >     What about s/TOTAL_TASKS/TOTAL_MPDS

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 193
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line193>
> >
> >     s-slots/mpd:s-mpd's

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 177
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line177>
> >
> >     No need for the intermediate 'count'.

Done - removed 'count'


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 176
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line176>
> >
> >     Since mpderr is unused, how about instead:
> >     
> >     mpdtraceout, _ = mpdtraceproc.communicate()
> >     
> >     or
> >     
> >     mpdtraceout = mpdtraceproc.communicate()[0]

Went with mpdtraceout = mpdtraceproc.communicate()[0].


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 146
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line146>
> >
> >     Please use driver.declineOffer.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 145
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line145>
> >
> >     s/Rejecting slot/Declining offer

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 142
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line142>
> >
> >     How about:
> >     
> >     print "Launching mpd " + tid + " on host " + offer.hostname

Changed to: print "Replying to offer: launching mpd %d on host %s" % (tid, offer.hostname)


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 114
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line114>
> >
> >     s/slot/offer

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 109
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line109>
> >
> >     s/Rejecting slot/Declining offer
> >     
> >     Also, why not do driver.declineOffer right here?

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 102
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line102>
> >
> >     s/r/resource

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 100
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line100>
> >
> >     Kill this line (or alternatively add the offer.id.value up on line 87).

Done, merged with line 87.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 96
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line96>
> >
> >     s/slot/resources

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 89
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line89>
> >
> >     s/Rejecting/Declining

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 69
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line69>
> >
> >     No longer use, kill please.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 66
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line66>
> >
> >     No longer used, kill please.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 59
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line59>
> >
> >     How about s/tasksLaunched/mpdsLaunched

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 55
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line55>
> >
> >     It would be great to give this a real name, e.g., MPIScheduler.

Done.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 22
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line22>
> >
> >     No need to take driver as an argument anymore.

Deleted parameter.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 18
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line18>
> >
> >     Optional path.

Changed.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/nmpiexec.py, line 17
> > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line17>
> >
> >     This default isn't the same as what gets printed out from --help. Probably makes sense to kill these here and just put the value down in the add_option call (like you do for --num and TOTAL_TASKS).

Done - default value set at add_option.


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/README.txt, line 62
> > <https://reviews.apache.org/r/4768/diff/3/?file=103692#file103692line62>
> >
> >     I'd prefer if we just had everyone do 'make', since that should build the Python dependencies (including protobuf).

Ok. I probably didn't configure properly when installing mine...


> On 2012-04-24 21:45:19, Benjamin Hindman wrote:
> > frameworks/mpi/README.txt, line 23
> > <https://reviews.apache.org/r/4768/diff/3/?file=103692#file103692line23>
> >
> >     Please kill all whitespace in this review.

Done.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


On 2012-05-02 13:29:50, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-05-02 13:29:50)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec 517bdbc 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
>   frameworks/mpi/startmpd.sh 44faa05 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


There are a lot of comments here, but hopefully it will help make this MPI on Mesos very maintainable in the future! Thanks so much Harvey!


frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15854>

    Please kill all whitespace in this review.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15885>

    I'd prefer if we just had everyone do 'make', since that should build the Python dependencies (including protobuf).



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15864>

    Does nmpiexec work out of the box without changes (not included in this review).
    
    Also, let's change the names from "nmpiexec*" to "mpiexec-mesos*"! ;)
    
    Finally, just pass host:port, no need for the 'master@' prefix.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15883>

    This default isn't the same as what gets printed out from --help. Probably makes sense to kill these here and just put the value down in the add_option call (like you do for --num and TOTAL_TASKS).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15884>

    Optional path.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15855>

    No need to take driver as an argument anymore.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15856>

    I know this wasn't the style of the codebase you've inherited, but I'd like spaces around all operators please. For example, this line should read:
    
    print "Got " + str(TOTAL_TASKS) + " mpd slots, running mpiexec"
    
    It looks like you've already done this with some of the code you've added (which is awesome!), but please clean up all the code. Thanks!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15863>

    It would be great to give this a real name, e.g., MPIScheduler.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15880>

    How about s/tasksLaunched/mpdsLaunched



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15839>

    No longer used, kill please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15840>

    No longer use, kill please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15843>

    s/Rejecting/Declining



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15841>

    Use driver.declineOffer please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15842>

    s/slot/resources



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15887>

    Kill this line (or alternatively add the offer.id.value up on line 87).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15888>

    s/r/resource



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15844>

    s/Rejecting slot/Declining offer
    
    Also, why not do driver.declineOffer right here?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15879>

    s/slot/offer



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15889>

    How about:
    
    print "Launching mpd " + tid + " on host " + offer.hostname



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15877>

    s/Rejecting slot/Declining offer



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15878>

    Please use driver.declineOffer.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15852>

    Since mpderr is unused, how about instead:
    
    mpdtraceout, _ = mpdtraceproc.communicate()
    
    or
    
    mpdtraceout = mpdtraceproc.communicate()[0]



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15853>

    No need for the intermediate 'count'.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15868>

    s-slots/mpd:s-mpd's



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15869>

    s/slot/mpd
    
    Also, do we want to set the '--ncpus' option on the actual mpd that we launch on a Mesos slave?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15870>

    s/slot/mpd
    
    Same as above, is there something we can/should set when we launch the mpd?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15875>

    I'm not really sure how this can be used: the user running this script will not know what machines they might run on, so they can't possibly know which IP addresses they want to use on those machines. Maybe Jessica J. had something else in mind here?
    
    It definitely makes sense to keep --ifhn for the master.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15882>

    What about s/TOTAL_TASKS/TOTAL_MPDS



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15862>

    It looks like you assume that path ends in a '/'. You should probably check this here.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15850>

    s/mesos/Mesos



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15861>

    I'd like to make this whole thing simpler. In particular, I don't see any need for an executor here (i.e., the startmpd.py script). We actually have a 1-1 mapping from tasks to mpd's, so let's just have our TaskInfo's have a CommandInfo which just launches the mpd. Something like this for that CommandInfo's value:
    
    ...command.value = "${MPICH2PATH}mpd -n --host=${MPD_MASTER_IP} --port=${MPD_MASTER_PORT}"
    
    (Note your code, as does what I have above, assumes that MPICH2PATH includes the trailing '/'. Also, you might need to change the string based on if you're passing --ifhn. Finally, note I used the long options --host and --port. This is for readability for other people that might not know mpd very well. Likewise, if there is a long option for -n it would be great to use that instead.)
    
    Of course, this will require setting the command's environment variables appropriately. But, this should let us completely eliminate the startmpd.py script!!!!! Yeah! Less things to maintain!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15858>

    s/executor/executor.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15859>

    Why the indentation? And add a period at the end of the sentence please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15867>

    Did someone actually ask for this feature?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15857>

    Kill extra space after 'args[0]'.


- Benjamin


On 2012-04-21 05:08:47, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-21 05:08:47)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-04-21 05:08:47.980951)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

Changes:
-MPICH2PATH is global var now
--ifhn-master and --ifhn-slave options added
-fixed mpdtrace zombie proccess problem


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-04-20 16:48:05, Jessica wrote:
> > frameworks/mpi/startmpd.py, line 58
> > <https://reviews.apache.org/r/4768/diff/2/?file=103456#file103456line58>
> >
> >     Passing --ifhn here means that the slave nodes will try to listen on the master node's IP address. If multi-interface functionality is made available for slave nodes, it needs to be specified in a way separate from the way it's specified for the master node. Otherwise, MPI will fail to run on slave nodes.

Ah, sorry about that. I separated it into --ifhn-slave and --ifhn-master


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7080
-----------------------------------------------------------


On 2012-04-21 05:08:47, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-21 05:08:47)
> 
> 
> Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7080
-----------------------------------------------------------



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15683>

    Passing --ifhn here means that the slave nodes will try to listen on the master node's IP address. If multi-interface functionality is made available for slave nodes, it needs to be specified in a way separate from the way it's specified for the master node. Otherwise, MPI will fail to run on slave nodes.


- Jessica


On 2012-04-20 08:17:57, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-20 08:17:57)
> 
> 
> Review request for mesos, Benjamin Hindman and Charles Reiss.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-04-20 08:17:57.362659)


Review request for mesos, Benjamin Hindman and Charles Reiss.


Changes
-------

Added optional --name, --path for directory of mpich2 binaries, and --ifhn tags. 


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


Re: Review Request: Updates and additions to the MPI framework

Posted by Harvey Feng <h....@berkeley.edu>.

> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/README.txt, line 11
> > <https://reviews.apache.org/r/4768/diff/1/?file=102473#file102473line11>
> >
> >     mpd was deprecated? What's the current alternative?

I think the new versions use the Hydra process manager, so 'mpiexec' would be the only command needed to launch an MPI program.  


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 22
> > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line22>
> >
> >     Remove or comment this debugging.

done.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 83
> > <https://reviews.apache.org/r/4768/diff/1/?file=102475#file102475line83>
> >
> >     Use os.kill instead (and above).

done.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 56
> > <https://reviews.apache.org/r/4768/diff/1/?file=102475#file102475line56>
> >
> >     Can we use MPD's exit status to determine when to send TASK_FAILED or TASK_KILLED?

ok, fixed that.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 15
> > <https://reviews.apache.org/r/4768/diff/1/?file=102475#file102475line15>
> >
> >     I think we can get rid of this entirely; it's clearly wrong in the case where multiple MPIs are running, and we should be tracking stray processes so we eventually kill them if MPD doesn't do something funny. (And if it does, we should figure out how to disable that.)

ok - shutdown() should remove any stray processes left over.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 210
> > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line210>
> >
> >     Let's try a name that doesn't contain test or Python and will give a hint when multiple instances are running, like something using MPI_TASK.

changed to 'MPI: ' + MPI_TASK, and added a --name option


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 95
> > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line95>
> >
> >     Remove trailing whitespace.

done


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 31
> > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line31>
> >
> >     Can we avoid using the shell here (and having MPI_TASK be interpreted by the shell twice)?

ok


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/README.txt, line 37
> > <https://reviews.apache.org/r/4768/diff/1/?file=102473#file102473line37>
> >
> >     We should probably support taking the path to these binaries an option passed automatically to the executor (e.g. through an environment variable option) to avoid PATH issues.

ok. Passes the directory to mpi binaries using the executor's CommandInfo


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review6999
-----------------------------------------------------------


On 2012-04-20 08:17:57, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-20 08:17:57)
> 
> 
> Review request for mesos, Benjamin Hindman and Charles Reiss.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Charles Reiss <wo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review6999
-----------------------------------------------------------



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15565>

    mpd was deprecated? What's the current alternative?



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15566>

    We should probably support taking the path to these binaries an option passed automatically to the executor (e.g. through an environment variable option) to avoid PATH issues.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15555>

    Remove or comment this debugging.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15563>

    Can we avoid using the shell here (and having MPI_TASK be interpreted by the shell twice)?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15561>

    Remove trailing whitespace.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15557>

    Let's try a name that doesn't contain test or Python and will give a hint when multiple instances are running, like something using MPI_TASK.



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15562>

    I think we can get rid of this entirely; it's clearly wrong in the case where multiple MPIs are running, and we should be tracking stray processes so we eventually kill them if MPD doesn't do something funny. (And if it does, we should figure out how to disable that.)



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15559>

    Can we use MPD's exit status to determine when to send TASK_FAILED or TASK_KILLED?



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15558>

    Use os.kill instead (and above).


- Charles


On 2012-04-18 04:27:25, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-18 04:27:25)
> 
> 
> Review request for mesos, Benjamin Hindman and Charles Reiss.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>


Re: Review Request: Updates and additions to the MPI framework

Posted by Jessica <je...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7043
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15627>

    allow user to pass "--ifhn=[ip-address]" to mpd for multi-homed systems


- Jessica


On 2012-04-18 04:27:25, Harvey Feng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
> 
> (Updated 2012-04-18 04:27:25)
> 
> 
> Review request for mesos, Benjamin Hindman and Charles Reiss.
> 
> 
> Summary
> -------
> 
> Some updates to point out:
> 
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
>   -> killtask() stops the mpd associated with the given tid. 
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
> 
> 
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
> 
> 
> Diffs
> -----
> 
>   frameworks/mpi/README.txt cdb4553 
>   frameworks/mpi/nmpiexec.py a5db9c0 
>   frameworks/mpi/startmpd.py 8eeba5e 
> 
> Diff: https://reviews.apache.org/r/4768/diff
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Harvey
> 
>