Posted to dev@mesos.apache.org by "Jessica J (Created) (JIRA)" <ji...@apache.org> on 2012/04/12 18:11:20 UTC

[jira] [Created] (MESOS-183) Included MPI Framework Fails to Start

Included MPI Framework Fails to Start
-------------------------------------

                 Key: MESOS-183
                 URL: https://issues.apache.org/jira/browse/MESOS-183
             Project: Mesos
          Issue Type: Bug
          Components: documentation, framework
         Environment: Scientific Linux Cluster
            Reporter: Jessica J


There are really two facets to this issue. The first is that no good documentation exists for setting up and using the included MPI framework. The second, and more important, issue is that the framework will not run. The second issue is possibly related to the first, in that I may not be setting it up properly.

To test the MPI framework, I determined by trial and error that I needed to run 'python setup.py build' and 'python setup.py install' in the MESOS-HOME/src/python directory. Now when I try to run 'nmpiexec -h', I get the AttributeError below:

Traceback (most recent call last):
  File "./nmpiexec.py", line 2, in <module>
    import mesos
  File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos.py", line 22, in <module>
    import _mesos
  File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos_pb2.py", line 1286, in <module>
    DESCRIPTOR.message_types_by_name['FrameworkID'] = _FRAMEWORKID
AttributeError: 'FileDescriptor' object has no attribute 'message_types_by_name'

I've examined setup.py and determined that the version of protobuf it includes (2.4.1) does, indeed, contain a FileDescriptor class in descriptor.py that sets self.message_types_by_name, so I'm not sure what the issue is. Is this a bug? Or is there a step I'm missing? Do I need to also build/install protobuf?
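For anyone hitting this, a quick way to see which copy of a module Python actually loads (a stale egg earlier on sys.path can shadow a newer protobuf install) is to check the module's __file__. This is a minimal diagnostic sketch; json here is just a stand-in for mesos_pb2 or google.protobuf.descriptor:

```python
import importlib

def module_origin(name):
    """Return the file a module was loaded from; useful for spotting a
    stale egg earlier on sys.path shadowing a newer install."""
    mod = importlib.import_module(name)
    return getattr(mod, "__file__", "<builtin or frozen>")

# Stand-in check with a stdlib module; for this bug one would inspect
# module_origin("google.protobuf.descriptor") and module_origin("mesos_pb2").
print(module_origin("json"))
```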

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261056#comment-13261056 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


There are a lot of comments here, but hopefully it will help make this MPI on Mesos very maintainable in the future! Thanks so much Harvey!


frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15854>

    Please kill all whitespace in this review.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15885>

    I'd prefer if we just had everyone do 'make', since that should build the Python dependencies (including protobuf).



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15864>

    Does nmpiexec work out of the box without changes (not included in this review)?
    
    Also, let's change the names from "nmpiexec*" to "mpiexec-mesos*"! ;)
    
    Finally, just pass host:port, no need for the 'master@' prefix.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15883>

    This default isn't the same as what gets printed out from --help. Probably makes sense to kill these here and just put the value down in the add_option call (like you do for --num and TOTAL_TASKS).
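For illustration, the add_option pattern being suggested; the --num flag is from the script, but the default value and help text here are illustrative:

```python
from optparse import OptionParser

# Putting the default in add_option keeps --help accurate: %default in the
# help string is expanded from the same place the value is defined, so the
# printed default can never drift out of sync with the real one.
parser = OptionParser(usage="usage: %prog [options] master")
parser.add_option("-n", "--num", dest="num", type="int", default=1,
                  help="number of mpds to launch [default: %default]")
options, args = parser.parse_args([])
print(options.num)
```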



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15884>

    Optional path.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15855>

    No need to take driver as an argument anymore.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15856>

    I know this wasn't the style of the codebase you've inherited, but I'd like spaces around all operators please. For example, this line should read:
    
    print "Got " + str(TOTAL_TASKS) + " mpd slots, running mpiexec"
    
    It looks like you've already done this with some of the code you've added (which is awesome!), but please clean up all the code. Thanks!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15863>

    It would be great to give this a real name, e.g., MPIScheduler.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15880>

    How about s/tasksLaunched/mpdsLaunched



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15839>

    No longer used, kill please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15840>

    No longer used, kill please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15843>

    s/Rejecting/Declining



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15841>

    Use driver.declineOffer please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15842>

    s/slot/resources



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15887>

    Kill this line (or alternatively add the offer.id.value up on line 87).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15888>

    s/r/resource



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15844>

    s/Rejecting slot/Declining offer
    
    Also, why not do driver.declineOffer right here?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15879>

    s/slot/offer



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15889>

    How about:
    
    print "Launching mpd " + tid + " on host " + offer.hostname



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15877>

    s/Rejecting slot/Declining offer



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15878>

    Please use driver.declineOffer.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15852>

    Since mpderr is unused, how about instead:
    
    mpdtraceout, _ = mpdtraceproc.communicate()
    
    or
    
    mpdtraceout = mpdtraceproc.communicate()[0]
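A self-contained sketch of that idiom, with a trivial child process standing in for mpdtrace:

```python
import subprocess
import sys

# Stand-in for the mpdtrace call: only stdout matters, so unpack stderr
# into '_' rather than binding an unused mpderr variable.
mpdtraceproc = subprocess.Popen(
    [sys.executable, "-c", "print('host1'); print('host2')"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
mpdtraceout, _ = mpdtraceproc.communicate()
print(len(mpdtraceout.decode().splitlines()))
```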



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15853>

    No need for the intermediate 'count'.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15868>

    s:slots/mpd:mpd's: (substituting "slots/mpd" with "mpd's"; ':' as the delimiter since the pattern contains '/')



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15869>

    s/slot/mpd
    
    Also, do we want to set the '--ncpus' option on the actual mpd that we launch on a Mesos slave?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15870>

    s/slot/mpd
    
    Same as above, is there something we can/should set when we launch the mpd?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15875>

    I'm not really sure how this can be used: the user running this script will not know what machines they might run on, so they can't possibly know which IP addresses they want to use on those machines. Maybe Jessica J. had something else in mind here?
    
    It definitely makes sense to keep --ifhn for the master.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15882>

    What about s/TOTAL_TASKS/TOTAL_MPDS



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15862>

    It looks like you assume that path ends in a '/'. You should probably check this here.
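One way to avoid the assumption altogether is os.path.join, which only adds the separator when it's missing; the paths here are illustrative:

```python
import os.path

# os.path.join inserts the separator only when it is missing, so callers
# may pass the directory with or without a trailing '/'.
print(os.path.join("/opt/mpich2/bin", "mpd"))   # /opt/mpich2/bin/mpd
print(os.path.join("/opt/mpich2/bin/", "mpd"))  # /opt/mpich2/bin/mpd
```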



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15850>

    s/mesos/Mesos



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15861>

    I'd like to make this whole thing simpler. In particular, I don't see any need for an executor here (i.e., the startmpd.py script). We actually have a 1-1 mapping from tasks to mpd's, so let's just have our TaskInfo's have a CommandInfo which just launches the mpd. Something like this for that CommandInfo's value:
    
    ...command.value = "${MPICH2PATH}mpd -n --host=${MPD_MASTER_IP} --port=${MPD_MASTER_PORT}"
    
    (Note that your code, like the command above, assumes MPICH2PATH includes the trailing '/'. Also, you might need to change the string depending on whether you're passing --ifhn. Finally, note I used the long options --host and --port; this is for readability for other people who might not know mpd very well. Likewise, if there is a long option for -n, it would be great to use that instead.)
    
    Of course, this will require setting the command's environment variables appropriately. But this should let us completely eliminate the startmpd.py script!!!!! Yeah! Fewer things to maintain!
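A hedged sketch of building that CommandInfo value string in Python; the function name is invented for illustration, and the flag spellings follow the comment above rather than verified mpd documentation:

```python
import os.path

def mpd_command_value(mpich2path, master_ip, master_port, ifhn=None):
    """Sketch of the proposed CommandInfo value. mpich2path is joined
    safely (no trailing-'/' assumption) and --ifhn is appended only
    when given."""
    cmd = "%s -n --host=%s --port=%s" % (
        os.path.join(mpich2path, "mpd"), master_ip, master_port)
    if ifhn is not None:
        cmd += " --ifhn=%s" % ifhn
    return cmd

print(mpd_command_value("/usr/local/bin", "10.0.0.1", 6000))
```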



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15858>

    s/executor/executor.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15859>

    Why the indentation? And add a period at the end of the sentence please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15867>

    Did someone actually ask for this feature?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15857>

    Kill extra space after 'args[0]'.


- Benjamin


On 2012-04-21 05:08:47, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-21 05:08:47)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                
> Included MPI Framework Fails to Start
> -------------------------------------
>
>                 Key: MESOS-183
>                 URL: https://issues.apache.org/jira/browse/MESOS-183
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, framework
>         Environment: Scientific Linux Cluster
>            Reporter: Jessica J
>            Assignee: Harvey Feng 
>            Priority: Blocker
>              Labels: documentation, mpi, setup
>
> There are really two facets to this issue. The first is that no good documentation exists for setting up and using the included MPI framework. The second, and more important issue, is that the framework will not run. The second issue is possibly related to the first in that I may not be setting it up properly. 
> To test the MPI framework, by trial and error I determined I needed to run python setup.py build and python setup.py install in the MESOS-HOME/src/python directory. Now when I try to run nmpiexec -h, I get an AttributeError, below: 
> Traceback (most recent call last):
>   File "./nmpiexec.py", line 2, in <module>
>     import mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos.py", line 22, in <module>
>     import _mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos_pb2.py", line 1286, in <module>
>     DESCRIPTOR.message_types_by_name['FrameworkID'] = _FRAMEWORKID
> AttributeError: 'FileDescriptor' object has no attribute 'message_types_by_name'
> I've examined setup.py and determined that the version of protobuf it includes (2.4.1) does, indeed, contain a FileDescriptor class in descriptor.py that sets self.message_types_by_name, so I'm not sure what the issue is. Is this a bug? Or is there a step I'm missing? Do I need to also build/install protobuf?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258079#comment-13258079 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/README.txt, line 11
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102473#file102473line11>
bq.  >
bq.  >     mpd was deprecated? What's the current alternative?

I think the new versions use the Hydra process manager, so 'mpiexec' would be the only command needed to launch an MPI program.  


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 22
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line22>
bq.  >
bq.  >     Remove or comment this debugging.

done.


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/startmpd.py, line 83
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102475#file102475line83>
bq.  >
bq.  >     Use os.kill instead (and above).

done.
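A minimal sketch of the os.kill replacement, spawning a throwaway child here where the real code would signal the mpd's pid:

```python
import os
import signal
import subprocess
import sys

# Spawn a stand-in long-running child, then terminate it with os.kill
# instead of shelling out to 'kill <pid>'.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
os.kill(child.pid, signal.SIGTERM)
child.wait()
print(child.returncode)
```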


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/startmpd.py, line 56
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102475#file102475line56>
bq.  >
bq.  >     Can we use MPD's exit status to determine when to send TASK_FAILED or TASK_KILLED?

ok, fixed that.
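A sketch of the mapping that check implies; plain strings stand in for the real Mesos TASK_* constants, and signal deaths show up as negative returncodes in subprocess:

```python
def task_state_for(returncode):
    """Map an mpd exit status to a task state name. subprocess reports
    'killed by signal N' as returncode -N, so a negative value means the
    process was killed rather than failing on its own."""
    if returncode == 0:
        return "TASK_FINISHED"
    if returncode < 0:
        return "TASK_KILLED"
    return "TASK_FAILED"

print(task_state_for(0), task_state_for(-15), task_state_for(1))
```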


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/startmpd.py, line 15
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102475#file102475line15>
bq.  >
bq.  >     I think we can get rid of this entirely; it's clearly wrong in the case where multiple MPIs are running, and we should be tracking stray processes so we eventually kill them if MPD doesn't do something funny. (And if it does, we should figure out how to disable that.)

ok - shutdown() should remove any stray processes left over.


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 210
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line210>
bq.  >
bq.  >     Let's try a name that doesn't contain test or Python and will give a hint when multiple instances are running, like something using MPI_TASK.

changed to 'MPI: ' + MPI_TASK, and added a --name option


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 95
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line95>
bq.  >
bq.  >     Remove trailing whitespace.

done


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 31
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102474#file102474line31>
bq.  >
bq.  >     Can we avoid using the shell here (and having MPI_TASK be interpreted by the shell twice)?

ok
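A minimal sketch of the no-shell approach, with a Python child standing in for the real command; since an argument list goes straight to exec, the task string is never interpreted by a shell at all:

```python
import subprocess
import sys

# With an argv list there is no shell, so MPI_TASK reaches the child
# exactly once and exactly as written, metacharacters and all.
MPI_TASK = "echo 'hello; world' && date"  # illustrative payload
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", MPI_TASK],
    stdout=subprocess.PIPE)
out, _ = proc.communicate()
print(out.decode().strip() == MPI_TASK)
```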


bq.  On 2012-04-18 05:41:37, Charles Reiss wrote:
bq.  > frameworks/mpi/README.txt, line 37
bq.  > <https://reviews.apache.org/r/4768/diff/1/?file=102473#file102473line37>
bq.  >
bq.  >     We should probably support taking the path to these binaries an option passed automatically to the executor (e.g. through an environment variable option) to avoid PATH issues.

ok. Passed the directory containing the MPI binaries via the executor's CommandInfo.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review6999
-----------------------------------------------------------


On 2012-04-20 08:17:57, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-20 08:17:57)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and Charles Reiss.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Matei Zaharia (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252627#comment-13252627 ] 

Matei Zaharia commented on MESOS-183:
-------------------------------------

While we're at it, we should also consider renaming the script. The "n" in nmpiexec is from back when Mesos used to be called Nexus!
                

[jira] [Closed] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Benjamin Hindman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Hindman closed MESOS-183.
----------------------------------

    

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258255#comment-13258255 ] 

Jessica J commented on MESOS-183:
---------------------------------

PYTHONPATH is out of date in startmpd.sh, as well.
                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269915#comment-13269915 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7651
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16862>

    I agree with Benjamin: ifhn_slave probably needs to go. I've created a new issue (MESOS-189) for configuring IP addresses that I think is relevant.
    
    Also... technically, you should use "is" and "is not" instead of "==" and "!=" when comparing against None. This is the Pythonic (and marginally faster) way.
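A minimal illustration of why identity comparison matters here:

```python
class AlwaysEqual(object):
    """Claims equality with everything, including None."""
    def __eq__(self, other):
        return True

w = AlwaysEqual()
print(w == None)   # True: '==' calls __eq__, which a class can override
print(w is None)   # False: 'is' checks identity with the None singleton
```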



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16860>

    I'd recommend passing default='' to parser.add_option so that you don't have to check if path is None later in the code. (This will also simplify calls to os.path.join since you won't have to check the value of path first.)



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16861>

    setting path default to '' eliminates this check.


- Jessica


On 2012-05-02 13:29:50, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-02 13:29:50)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec 517bdbc 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.    frameworks/mpi/startmpd.sh 44faa05 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                
> Included MPI Framework Fails to Start
> -------------------------------------
>
>                 Key: MESOS-183
>                 URL: https://issues.apache.org/jira/browse/MESOS-183
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, framework
>         Environment: Scientific Linux Cluster
>            Reporter: Jessica J
>            Assignee: Harvey Feng 
>            Priority: Blocker
>              Labels: documentation, mpi, setup
>
> There are really two facets to this issue. The first is that no good documentation exists for setting up and using the included MPI framework. The second, and more important issue, is that the framework will not run. The second issue is possibly related to the first in that I may not be setting it up properly. 
> To test the MPI framework, by trial and error I determined I needed to run python setup.py build and python setup.py install in the MESOS-HOME/src/python directory. Now when I try to run nmpiexec -h, I get an AttributeError, below: 
> Traceback (most recent call last):
>   File "./nmpiexec.py", line 2, in <module>
>     import mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos.py", line 22, in <module>
>     import _mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos_pb2.py", line 1286, in <module>
>     DESCRIPTOR.message_types_by_name['FrameworkID'] = _FRAMEWORKID
> AttributeError: 'FileDescriptor' object has no attribute 'message_types_by_name'
> I've examined setup.py and determined that the version of protobuf it includes (2.4.1) does, indeed, contain a FileDescriptor class in descriptor.py that sets self.message_types_by_name, so I'm not sure what the issue is. Is this a bug? Or is there a step I'm missing? Do I need to also build/install protobuf?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270390#comment-13270390 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7677
-----------------------------------------------------------

Ship it!


As far as I can tell, everything looks good. I won't have access to my test environment until Thursday, but you know you'll hear from me if there are any bugs. :) Thanks for your work on this!


frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16907>

    Super minor issue, so I'm still voting to ship it, but convention and consistency say that the = in default = "" should not be surrounded by whitespace.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16905>

    Clever technique. I definitely wouldn't have thought to join here. (I also verified that you won't have issues if the user doesn't define a path. os.path.join('','') == '', so it's perfect.)
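For reference, a quick check of the join behavior discussed here (the paths are illustrative, POSIX behavior shown):

```python
import os.path

# os.path.join inserts a separator only when one is needed, so it works
# whether or not the configured path ends in a slash.
assert os.path.join("/opt/mpich2/bin", "mpd") == "/opt/mpich2/bin/mpd"
assert os.path.join("/opt/mpich2/bin/", "mpd") == "/opt/mpich2/bin/mpd"

# With no path configured, joining empty strings yields the empty string,
# and joining "" with a name yields the bare name (a normal PATH lookup).
assert os.path.join("", "") == ""
assert os.path.join("", "mpd") == "mpd"
```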


- Jessica


On 2012-05-08 01:29:06, Harvey Feng wrote:

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268030#comment-13268030 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7541
-----------------------------------------------------------


Awesome refactor! Super excited to get this committed (and I think Jessica will be too!). Just a few more minor points (please address Jessica's comments too). Thank you!


frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment16703>

    I know it's obvious, but you might want to remind users that they'll need to install mpich2 on every machine in the cluster.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment16701>

    Kill whitespace.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment16702>

    Kill whitespace.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16704>

    s/mpd slots/mpd(s)



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16707>

    If you move this check into the 'for offer in offers:' on line 60, then you'll only be doing the check and decline in one place (not also on lines 107 and 108).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16705>

    Again, I'm not sure how ifhn_slave is going to be used. Can you elaborate?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16706>

    I love the long options! Thank you!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16710>

    +1 to Jessica's comment.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16708>

    +1 to Jessica's comment.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16709>

    mpdtraceerr is not used, kill it please.


- Benjamin


On 2012-05-02 13:29:50, Harvey Feng wrote:

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266615#comment-13266615 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7469
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16516>

    os.path.join will handle this check and "do the right thing" wherever you use it, whether the user specifies the ending slash or not.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment16517>

    mpd_cmd = os.path.join(MPICH2PATH, "mpd")


- Jessica


On 2012-05-02 13:29:50, Harvey Feng wrote:

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253547#comment-13253547 ] 

Jessica J commented on MESOS-183:
---------------------------------

Looking more closely at the code and the message, I'm really not sure that MergeFrom is the answer. In the example framework, it gets called from resourceOffers, a callback a framework can't receive until it has registered with the master. But the error concerns a "RegisterFrameworkMessage," which implies the framework isn't even registering.
                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256219#comment-13256219 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review6999
-----------------------------------------------------------



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15565>

    mpd was deprecated? What's the current alternative?



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15566>

    We should probably support taking the path to these binaries as an option passed automatically to the executor (e.g., through an environment variable) to avoid PATH issues.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15555>

    Remove or comment out this debugging output.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15563>

    Can we avoid using the shell here (and having MPI_TASK be interpreted by the shell twice)?
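For example, splitting MPI_TASK once with shlex and passing an argument list avoids the double shell interpretation (the task string here is hypothetical):

```python
import shlex
import subprocess

# Hypothetical task string, as a user might type it on the command line.
mpi_task = "echo hello 'quoted arg'"

# With shell=True the string would be re-parsed by /bin/sh, so any quoting
# inside MPI_TASK gets interpreted twice. shlex.split applies shell-like
# quoting rules exactly once, and the resulting list bypasses the shell.
args = shlex.split(mpi_task)          # ['echo', 'hello', 'quoted arg']
out = subprocess.check_output(args)   # runs echo directly, no shell
```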



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15561>

    Remove trailing whitespace.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15557>

    Let's try a name that doesn't contain test or Python and will give a hint when multiple instances are running, like something using MPI_TASK.



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15562>

    I think we can get rid of this entirely; it's clearly wrong in the case where multiple MPIs are running, and we should be tracking stray processes so we eventually kill them if MPD doesn't do something funny. (And if it does, we should figure out how to disable that.)



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15559>

    Can we use MPD's exit status to determine when to send TASK_FAILED or TASK_KILLED?
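A sketch of what such a mapping could look like. The state names come from the discussion above; the exact mapping is an assumption for illustration, not the committed framework code:

```python
import signal
import subprocess

def task_state_for(proc):
    """Map a finished subprocess's exit status to a task state name.

    Assumed mapping: clean exit -> TASK_FINISHED; killed by a signal
    (Popen reports this as a negative returncode) -> TASK_KILLED;
    any other nonzero exit -> TASK_FAILED.
    """
    code = proc.returncode
    if code == 0:
        return "TASK_FINISHED"
    if code < 0:  # terminated by a signal, e.g. our own SIGTERM/SIGKILL
        return "TASK_KILLED"
    return "TASK_FAILED"  # mpd exited nonzero on its own

# Example: a clean exit maps to TASK_FINISHED.
ok = subprocess.Popen(["true"])
ok.wait()
```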



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15558>

    Use os.kill instead (and above).
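For reference, replacing a shelled-out `kill <pid>` with os.kill looks like this (using `sleep` as a stand-in for an mpd process):

```python
import os
import signal
import subprocess

# Start a long-running stand-in for an mpd process.
proc = subprocess.Popen(["sleep", "60"])

# os.kill sends the signal directly, with no shell and no string formatting
# of the pid; signal.SIGTERM asks the process to terminate.
os.kill(proc.pid, signal.SIGTERM)
proc.wait()  # reap the child; returncode is -SIGTERM for a signaled exit
```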


- Charles


On 2012-04-18 04:27:25, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-18 04:27:25)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and Charles Reiss.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253485#comment-13253485 ] 

Jessica J commented on MESOS-183:
---------------------------------

Here's what I've tried, in case it helps Harvey (assuming he will be the one to fix the framework). Following the example of the python test_framework, I created a new framework using mesos_pb2.FrameworkInfo() and passed that as the second argument to the MesosSchedulerDriver constructor. Running nmpiexec with this change resulted in an assertion failure:

python: ./common/try.hpp:77: T Try<T>::get() const [with T = mesos::internal::MasterDetector*]: Assertion `state == SOME' failed.
Aborted

I eventually determined that this failure was due to the fact that the MPI framework does not accept the master URL in the same format as the rest of the project. (This should be changed for consistency, i.e., mesos://master@[ipaddress]:[port] rather than [ipaddress]:[port].)

Using the correct URL allows the framework to find the master, but then this error shows up on the master:

W0413 11:36:30.357491 30017 protobuf.hpp:255] Initialization errors: framework.executor
libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "mesos.internal.RegisterFrameworkMessage" because it is missing required fields: framework.executor

However, attempting to assign anything to framework.executor (before passing it to the MesosSchedulerDriver constructor) results in an AttributeError:

AttributeError: 'FrameworkInfo' object has no attribute 'executor'
                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266780#comment-13266780 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 278
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line278>
bq.  >
bq.  >     Kill extra space after 'args[0]'.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 258
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line258>
bq.  >
bq.  >     Why the indentation? And add a period at the end of the sentence please.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 257
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line257>
bq.  >
bq.  >     s/executor/executor.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 225
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line225>
bq.  >
bq.  >     s/mesos/Mesos

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 219
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line219>
bq.  >
bq.  >     What about s/TOTAL_TASKS/TOTAL_MPDS

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 193
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line193>
bq.  >
bq.  >     s-slots/mpd:s-mpd's

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 177
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line177>
bq.  >
bq.  >     No need for the intermediate 'count'.

Done - removed 'count'.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 176
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line176>
bq.  >
bq.  >     Since mpderr is unused, how about instead:
bq.  >     
bq.  >     mpdtraceout, _ = mpdtraceproc.communicate()
bq.  >     
bq.  >     or
bq.  >     
bq.  >     mpdtraceout = mpdtraceproc.communicate()[0]

Went with mpdtraceout = mpdtraceproc.communicate()[0].
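
A minimal sketch of the pattern adopted above: capture only stdout from a subprocess without binding an unused stderr variable. Here 'echo' stands in for the real 'mpdtrace' invocation in nmpiexec.py.

```python
import subprocess

# communicate() returns a (stdout, stderr) tuple; indexing [0] keeps
# only stdout, so no unused 'mpderr' variable is created.
mpdtraceproc = subprocess.Popen(["echo", "host1_12345"], stdout=subprocess.PIPE)
mpdtraceout = mpdtraceproc.communicate()[0]
print(mpdtraceout.decode().strip())
```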


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 146
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line146>
bq.  >
bq.  >     Please use driver.declineOffer.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 145
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line145>
bq.  >
bq.  >     s/Rejecting slot/Declining offer

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 142
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line142>
bq.  >
bq.  >     How about:
bq.  >     
bq.  >     print "Launching mpd " + tid + " on host " + offer.hostname

Changed to: print "Replying to offer: launching mpd %d on host %s" % (tid, offer.hostname)


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 114
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line114>
bq.  >
bq.  >     s/slot/offer

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 109
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line109>
bq.  >
bq.  >     s/Rejecting slot/Declining offer
bq.  >     
bq.  >     Also, why not do driver.declineOffer right here?

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 102
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line102>
bq.  >
bq.  >     s/r/resource

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 100
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line100>
bq.  >
bq.  >     Kill this line (or alternatively add the offer.id.value up on line 87).

Done, merged with line 87.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 96
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line96>
bq.  >
bq.  >     s/slot/resources

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 89
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line89>
bq.  >
bq.  >     s/Rejecting/Declining

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 69
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line69>
bq.  >
bq.  >     No longer used, kill please.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 66
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line66>
bq.  >
bq.  >     No longer used, kill please.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 59
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line59>
bq.  >
bq.  >     How about s/tasksLaunched/mpdsLaunched

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 55
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line55>
bq.  >
bq.  >     It would be great to give this a real name, e.g., MPIScheduler.

Done.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 22
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line22>
bq.  >
bq.  >     No need to take driver as an argument anymore.

Deleted parameter.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 18
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line18>
bq.  >
bq.  >     Optional path.

Changed.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 17
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line17>
bq.  >
bq.  >     This default isn't the same as what gets printed out from --help. Probably makes sense to kill these here and just put the value down in the add_option call (like you do for --num and TOTAL_TASKS).

Done - default value set at add_option.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/README.txt, line 62
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103692#file103692line62>
bq.  >
bq.  >     I'd prefer if we just had everyone do 'make', since that should build the Python dependencies (including protobuf).

Ok. I probably didn't configure properly when installing mine...


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/README.txt, line 23
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103692#file103692line23>
bq.  >
bq.  >     Please kill all whitespace in this review.

Done.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


On 2012-05-02 13:29:50, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-02 13:29:50)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec 517bdbc 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.    frameworks/mpi/startmpd.sh 44faa05 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258367#comment-13258367 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7080
-----------------------------------------------------------



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15683>

    Passing --ifhn here means that the slave nodes will try to listen on the master node's IP address. If multi-interface functionality is made available for slave nodes, it needs to be specified in a way separate from the way it's specified for the master node. Otherwise, MPI will fail to run on slave nodes.
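
A hedged illustration of the point above: a slave-side mpd should listen on its own interface hostname, not on one forwarded from the master via --ifhn. build_mpd_cmd is a hypothetical helper for illustration, not actual startmpd.py code.

```python
import socket

# If the master's --ifhn value were forwarded verbatim, every slave would
# try to listen on the master's address. Deriving the hostname locally
# gives each slave its own listen address instead.
def build_mpd_cmd():
    slave_ifhn = socket.gethostname()  # the slave's own hostname
    return "mpd --ifhn=" + slave_ifhn

print(build_mpd_cmd())
```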


- Jessica


On 2012-04-20 08:17:57, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-20 08:17:57)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and Charles Reiss.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270122#comment-13270122 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/README.txt, line 19
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105961#file105961line19>
bq.  >
bq.  >     I know it's obvious, but you might want to remind users that they'll need to install mpich2 on every machine in the cluster.

Done.


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/README.txt, line 23
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105961#file105961line23>
bq.  >
bq.  >     Kill whitespace.

Done.


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/README.txt, line 25
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105961#file105961line25>
bq.  >
bq.  >     Kill whitespace.

Done.


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 26
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line26>
bq.  >
bq.  >     s/mpd slots/mpd(s)

Done


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 71
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line71>
bq.  >
bq.  >     If you move this check into the 'for offer in offers:' on line 60, then you'll only be doing the check and decline in one place (not also on lines 107 and 108).

Done


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 118
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line118>
bq.  >
bq.  >     Again, I'm not sure how ifhn_slave is going to be used. Can you elaborate?

I left this in pending Jessica's response...it's removed now.


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 121
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line121>
bq.  >
bq.  >     I love the long options! Thank you!


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 209
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line209>
bq.  >
bq.  >     +1 to Jessica's comment.

This simplifies the trailing '/' check/fix to just os.path.join(options.path, ""). 
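
The trailing-'/' normalization described above can be seen directly: joining with an empty final component appends the separator only when it is missing. The install prefix shown is illustrative, and POSIX separators are assumed.

```python
import os.path

# Both spellings normalize to the same slash-terminated path.
path_without = os.path.join("/opt/mpich2-1.2", "")
path_with = os.path.join("/opt/mpich2-1.2/", "")
print(path_without)
print(path_with)
```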


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 221
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line221>
bq.  >
bq.  >     +1 to Jessica's comment.

Unchanged after using the above.


bq.  On 2012-05-04 01:41:20, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 230
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line230>
bq.  >
bq.  >     mpdtraceerr is not used, kill it please.

Done.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7541
-----------------------------------------------------------




                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270123#comment-13270123 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-08 01:29:06.075735)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

-Updated some of the logic from the previous diff.
-Better usage of os.path.join()
-References to "nmpiexec*" have been changed to "mpiexec-mesos*", but the filenames still need to be changed...


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to set up and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
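
A hedged sketch (not the actual patch) of the driver.stop() change noted above for nmpiexec.py: the scheduler stops the driver only after every launched mpd task has reached TASK_FINISHED. The constant value and class name are illustrative stand-ins for the Mesos Python bindings.

```python
# Stand-in for mesos_pb2.TASK_FINISHED; the real constant comes from the
# generated protobuf module.
TASK_FINISHED = 2

class MPIScheduler(object):
    def __init__(self, total_mpds):
        self.totalMpds = total_mpds
        self.mpdsFinished = 0

    def statusUpdate(self, driver, update):
        # Called once per task status change; count finished mpds and
        # stop the driver only when the whole mpd ring has exited.
        if update.state == TASK_FINISHED:
            self.mpdsFinished += 1
            if self.mpdsFinished == self.totalMpds:
                driver.stop()
```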


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266571#comment-13266571 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 209
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line209>
bq.  >
bq.  >     I'm not really sure how this can be used: the user running this script will not know what machines they might run on, so they can't possibly know which IP addresses they want to use on those machines. Maybe Jessica J. had something else in mind here?
bq.  >     
bq.  >     It definitely makes sense to keep --ifhn for the master.

Hmmm... Looks like my comment here disappeared somehow. Anyway, I agree that the --ifhn-slave option doesn't make sense since there's no way you can specify an IP address for each slave. I guess what I had in mind was a more general Mesos configuration option rather than specific to the MPI framework. 

bq. From a selfish standpoint, I'm not terribly concerned, since the master was the option I cared about. However, I've been thinking that, assuming you're using the deploy scripts to start your cluster, it may be worth considering modifying the format of the slaves configuration file (which currently lists only hostnames) and allowing the user to also specify an IP address for each host. Then perhaps the MPI framework could grab the IP address from the Mesos configuration. This would be useful for deploying Mesos as well, since some users (such as myself) may have their Mesos config files in an NFS directory. (This setup means I can't start the entire cluster in one go if I need to give any of my nodes a specific IP address, since all nodes will try to use the same ip option in mesos.conf.) Just a thought... I'll open a general Mesos "Improvement" ticket if there's any chance of it happening.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 223
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line223>
bq.  >
bq.  >     It looks like you assume that path ends in a '/'. You should probably check this here.

Why not use os.path.join?
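
A brief sketch of the suggestion above: os.path.join inserts separators as needed, so the caller need not check whether 'path' already ends in '/'. The MPICH2 install prefix shown is illustrative, and POSIX separators are assumed.

```python
import os.path

# With or without a trailing slash on the prefix, the joined result
# is identical.
cmd1 = os.path.join("/opt/mpich2", "bin", "mpd")
cmd2 = os.path.join("/opt/mpich2/", "bin", "mpd")
print(cmd1)
print(cmd2)
```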


- Jessica


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------




                
> Included MPI Framework Fails to Start
> -------------------------------------
>
>                 Key: MESOS-183
>                 URL: https://issues.apache.org/jira/browse/MESOS-183
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, framework
>         Environment: Scientific Linux Cluster
>            Reporter: Jessica J
>            Assignee: Harvey Feng 
>            Priority: Blocker
>              Labels: documentation, mpi, setup
>
> There are really two facets to this issue. The first is that no good documentation exists for setting up and using the included MPI framework. The second, and more important issue, is that the framework will not run. The second issue is possibly related to the first in that I may not be setting it up properly. 
> To test the MPI framework, by trial and error I determined I needed to run python setup.py build and python setup.py install in the MESOS-HOME/src/python directory. Now when I try to run nmpiexec -h, I get an AttributeError, below: 
> Traceback (most recent call last):
>   File "./nmpiexec.py", line 2, in <module>
>     import mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos.py", line 22, in <module>
>     import _mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos_pb2.py", line 1286, in <module>
>     DESCRIPTOR.message_types_by_name['FrameworkID'] = _FRAMEWORKID
> AttributeError: 'FileDescriptor' object has no attribute 'message_types_by_name'
> I've examined setup.py and determined that the version of protobuf it includes (2.4.1) does, indeed, contain a FileDescriptor class in descriptor.py that sets self.message_types_by_name, so I'm not sure what the issue is. Is this a bug? Or is there a step I'm missing? Do I need to also build/install protobuf?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258426#comment-13258426 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7086
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15686>

    MPI_TASK should be an array (and below).
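One reason to prefer an array here: passing argv as a list to subprocess avoids shell re-parsing and quoting problems. A small illustration (echo stands in for the real mpiexec invocation):

```python
import subprocess

# With a list, 'hello world' stays a single argument; no shell is involved.
out = subprocess.check_output(['echo', 'hello world'])
result = out.decode().strip()
```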



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15691>

    Account for MPICH2PATH here.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15689>

    Try to keep us below 80 chars (or at least below 100); split into multiple lines.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15692>

    Account for MPICH2PATH here.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15687>

    close() this; does this (and existing code calling mpdtrace) leave mpdtrace as a zombie process?
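On the zombie question: a child whose exit status is never collected does linger as a zombie. communicate() both drains stdout and wait()s on the process, so it reaps the child; a sketch (echo stands in for mpdtrace here):

```python
import subprocess

p = subprocess.Popen(['echo', 'host1_12345'], stdout=subprocess.PIPE)
out, _ = p.communicate()   # drains the pipe, closes it, and reaps the child
trace = out.decode().strip()
```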



frameworks/mpi/startmpd.py
<https://reviews.apache.org/r/4768/#comment15690>

    Assuming mpd tries to do something graceful on SIGTERM, try SIGTERM, wait a bit, then try SIGKILL (and below).
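A sketch of that escalation using the subprocess API (a sleeping child stands in for an mpd process; the helper name and grace period are made up):

```python
import subprocess
import time

def stop_child(proc, grace=2.0):
    """SIGTERM first; escalate to SIGKILL if the child ignores it."""
    proc.terminate()                      # polite: SIGTERM
    deadline = time.time() + grace
    while proc.poll() is None and time.time() < deadline:
        time.sleep(0.05)
    if proc.poll() is None:
        proc.kill()                       # forceful: SIGKILL
    return proc.wait()                    # reap the child, avoid a zombie

child = subprocess.Popen(['sleep', '60'])
rc = stop_child(child)                    # negative: killed by a signal
```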


- Charles


On 2012-04-20 08:17:57, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-20 08:17:57)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and Charles Reiss.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252597#comment-13252597 ] 

Jessica J commented on MESOS-183:
---------------------------------

The issue appears to be that an older version of protobuf was already installed on the machine (perhaps Mesos/MPI should use a local copy of protobuf rather than the system-wide version?); however, the fact remains that the setup documentation needs improvement.
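When two copies of a library are installed, the one earlier on sys.path silently shadows the other. A quick way to see which file an import actually resolves to (shown with a stdlib module; substituting 'google.protobuf' would reveal whether the system-wide or the bundled protobuf wins):

```python
import sys

def locate(module_name):
    """Return the file Python actually loads for the given module."""
    __import__(module_name)
    return sys.modules[module_name].__file__

# e.g. locate('google.protobuf') on an affected machine would point at the
# stale system-wide copy; 'json' is used here only as a stand-in.
loaded_from = locate('json')
```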
                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258773#comment-13258773 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-04-20 16:48:05, Jessica wrote:
bq.  > frameworks/mpi/startmpd.py, line 58
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103456#file103456line58>
bq.  >
bq.  >     Passing --ifhn here means that the slave nodes will try to listen on the master node's IP address. If multi-interface functionality is made available for slave nodes, it needs to be specified in a way separate from the way it's specified for the master node. Otherwise, MPI will fail to run on slave nodes.

Ah, sorry about that. I separated it into --ifhn-slave and --ifhn-master.
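A sketch of what that split might look like with optparse (the option names follow this thread; the exact interface in the patch may differ):

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option('--ifhn-master', dest='ifhn_master',
                  help='interface hostname/IP the master mpd listens on')
parser.add_option('--ifhn-slave', dest='ifhn_slave',
                  help='interface hostname/IP the slave mpds listen on')

opts, _ = parser.parse_args(['--ifhn-master', '10.0.0.1'])

# Only the master's mpd invocation picks up the master interface; the
# slaves get their own flag instead of inheriting the master's address.
master_cmd = 'mpd'
if opts.ifhn_master:
    master_cmd += ' --ifhn=' + opts.ifhn_master
```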


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7080
-----------------------------------------------------------


On 2012-04-21 05:08:47, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-21 05:08:47)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258078#comment-13258078 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-04-20 08:17:57.362659)


Review request for mesos, Benjamin Hindman and Charles Reiss.


Changes
-------

Added optional --name, --path (directory of MPICH2 binaries), and --ifhn flags. 


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops once all tasks have finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpds on the slave, but shutdown() doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states now update correctly; they correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to set up and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270124#comment-13270124 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-05-02 14:58:27, Jessica wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 209
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line209>
bq.  >
bq.  >     os.path.join will handle this check and "do the right thing" wherever you use it, whether the user specifies the ending slash or not.

I used it to add a trailing '/' to the specified path, if necessary. Thanks!


bq.  On 2012-05-02 14:58:27, Jessica wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 221
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line221>
bq.  >
bq.  >     mpd_cmd = os.path.join(MPICH2PATH, "mpd")

Unchanged after adding the above.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7469
-----------------------------------------------------------


On 2012-05-02 13:29:50, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-02 13:29:50)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec 517bdbc 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.    frameworks/mpi/startmpd.sh 44faa05 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Resolved] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Benjamin Hindman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Hindman resolved MESOS-183.
------------------------------------

    Resolution: Fixed
    

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257674#comment-13257674 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7043
-----------------------------------------------------------



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15627>

    allow user to pass "--ifhn=[ip-address]" to mpd for multi-homed systems


- Jessica


On 2012-04-18 04:27:25, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-18 04:27:25)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman and Charles Reiss.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258180#comment-13258180 ] 

Jessica J commented on MESOS-183:
---------------------------------

I'm running into the setuptools issue addressed in the test python framework: https://issues.apache.org/jira/browse/MESOS-130. The locations of the eggs added to PYTHONPATH in nmpiexec need to be updated so that the Mesos/protobuf libraries (and setuptools) don't have to be installed on every node. The documentation should also probably be updated accordingly since bundling the eggs should mean that python setup.py install is not necessary. 

There also seems to be an issue with Python detecting the Mesos module from the egg in src/python/dist--I couldn't import mesos until I unzipped the egg, no matter what directory I was in or how I modified the PYTHONPATH. Not sure what's wrong here since, in my experience, Python should be able to import from a zip/egg file...
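A minimal sketch of the egg-bundling fix described above, assuming hypothetical egg locations under MESOS_HOME (the real filenames under src/python/dist vary per build): prepending the bundled eggs to sys.path makes Python prefer them over any older system-wide protobuf/Mesos install, so nothing has to be installed on every node.

```python
import os
import sys

# Hypothetical paths; the actual egg names in MESOS_HOME/src/python/dist
# depend on the build and Python version.
MESOS_HOME = os.environ.get("MESOS_HOME", "/opt/mesos")
DIST = os.path.join(MESOS_HOME, "src", "python", "dist")
EGGS = [
    os.path.join(DIST, "protobuf-2.4.1-py2.6.egg"),
    os.path.join(DIST, "mesos-0.9.0-py2.6-linux-x86_64.egg"),
]

# Prepend (not append) so the bundled copies shadow any older
# system-wide installs on the node.
for egg in reversed(EGGS):
    sys.path.insert(0, egg)
```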
                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266526#comment-13266526 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-02 13:00:12.162674)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

-Fixed lots of style issues...
-Converted to a no-executor framework


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to set up and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
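The task-state mapping described above can be sketched roughly as follows (the string constants stand in for the real mesos_pb2 enums, and the function name is illustrative, not taken from startmpd.py):

```python
def task_state(proc):
    """Map an mpd child process's status to a task state.

    `proc` is anything exposing subprocess.Popen's poll() interface.
    """
    rc = proc.poll()
    if rc is None:
        return "TASK_RUNNING"   # mpd still alive
    if rc == 0:
        return "TASK_FINISHED"  # mpd exited cleanly
    return "TASK_FAILED"        # mpd died or was killed
```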


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258769#comment-13258769 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-04-21 05:08:47.980951)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

-MPICH2PATH is a global var now
--ifhn-master and --ifhn-slave options added
-fixed mpdtrace zombie process problem
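The zombie fix presumably comes down to reaping the mpdtrace children: a Popen whose exit status is never collected lingers as a zombie. A minimal sketch, using echo as a stand-in for the mpdtrace invocation:

```python
import subprocess

# Stand-in for an mpdtrace call; communicate() both reads the output
# and waits for the child, which reaps the process.
proc = subprocess.Popen(["echo", "console_host_1234"],
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()  # no zombie left behind
```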




This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256195#comment-13256195 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Charles Reiss.




This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270125#comment-13270125 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-05-07 19:19:17, Jessica wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 118
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line118>
bq.  >
bq.  >     I agree with Benjamin--ifhn_slave probably needs to go. I've created a new issue (MESOS-189) for configuring IP addresses that I think is relevant.
bq.  >     
bq.  >     Also... technically, you should use "is" and "is not" instead of "==" and "!=" to compare against None. This is the pythonic (and minutely faster) way.

Ok, switched from "==" to "is".
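The distinction matters because `==` dispatches to `__eq__`, which a class can override, while `is` always compares identity. A contrived illustration:

```python
x = None
assert x is None             # identity check: the preferred None test

class AlwaysEqual(object):
    def __eq__(self, other):  # contrived override
        return True

w = AlwaysEqual()
assert (w == None) is True    # __eq__ lies: w "equals" None
assert w is not None          # the identity check is not fooled
```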


bq.  On 2012-05-07 19:19:17, Jessica wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 186
bq.  > <https://reviews.apache.org/r/4768/diff/5/?file=105963#file105963line186>
bq.  >
bq.  >     I'd recommend passing default='' to parser.add_option so that you don't have to check if path is None later in the code. (This will also simplify calls to os.path.join since you won't have to check the value of path first.)

Yup, this leaves just one "os.path.join()" call.
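The suggestion above, sketched with a hypothetical --path option (stdlib optparse, which nmpiexec used at the time): with default='' the option value is never None, so os.path.join can be called unconditionally.

```python
import os.path
from optparse import OptionParser

parser = OptionParser()
# Option name and help text are illustrative, not from nmpiexec.py.
parser.add_option("-p", "--path", dest="path", default="",
                  help="prefix of the MPICH2 install")

options, _args = parser.parse_args([])    # no -p given
mpd = os.path.join(options.path, "mpd")   # no None check needed
```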


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7651
-----------------------------------------------------------




                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270162#comment-13270162 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7666
-----------------------------------------------------------

Ship it!


I'll get this checked in provided Jessica gives it a "Ship It". Thanks for the good work here; I intend to make it a demonstration of how to write frameworks on Mesos!

- Benjamin




                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Matei Zaharia (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253516#comment-13253516 ] 

Matei Zaharia commented on MESOS-183:
-------------------------------------

Yeah, this is because the MPI code seems to be against an older version of the Mesos API. That "executor" field is a Google protocol buffer, so you need to set it in a separate way, using executor.MergeFrom(your_executor_info) instead of executor = your_executor_info. If you're curious to see a working Python-based Mesos framework to fix MPI based on it, take a look in src/examples/python. Otherwise someone from our side will hopefully do this. We really should've caught this earlier but I guess we don't have any automated tests for MPI.
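To make the pattern concrete without a Mesos install, here is a toy sketch: the protobuf Python API raises AttributeError on direct assignment to a message-typed (composite) field, so the helper copies via MergeFrom instead. The stub classes below only mimic that interface; real framework code would use the mesos_pb2 messages.

```python
def set_executor(task, executor_info):
    # protobuf rejects `task.executor = executor_info` for composite
    # (message-typed) fields; MergeFrom/CopyFrom is the supported path.
    task.executor.MergeFrom(executor_info)

# --- toy stand-ins so the sketch runs without protobuf installed ---
class StubMessage(object):
    def __init__(self):
        self.merged_from = None
    def MergeFrom(self, other):      # mimics the protobuf method name
        self.merged_from = other

class StubTask(object):
    def __init__(self):
        self.executor = StubMessage()

task = StubTask()
info = StubMessage()
set_executor(task, info)
```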
                

[jira] [Updated] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jessica J updated MESOS-183:
----------------------------

    Priority: Blocker  (was: Major)

I'm changing the priority to blocker because the MPI + Hadoop functionality is the reason I'm using Mesos. I've attempted to track down what needs to change to make this run properly and have been unsuccessful, so until this is fixed, Mesos is useless to me.
                

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253535#comment-13253535 ] 

Jessica J commented on MESOS-183:
---------------------------------

Matei, thanks for the response. I see the MergeFrom call in src/examples/python/test_framework.py, but the python test framework has the same issue--the master reports it can't parse the message. (Actually, the error I posted above was from attempting to run test_framework.py; they both result in the same error.) 
                

        

[jira] [Assigned] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Charles Reiss (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charles Reiss reassigned MESOS-183:
-----------------------------------

    Assignee: Harvey Feng 
    

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258772#comment-13258772 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-04-20 17:58:45, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 159
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line159>
bq.  >
bq.  >     Account for MPICH2PATH here.

done.


bq.  On 2012-04-20 17:58:45, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 194
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line194>
bq.  >
bq.  >     Try to keep us below 80 chars (or at least below 100); split into multiple lines.

done.


bq.  On 2012-04-20 17:58:45, Charles Reiss wrote:
bq.  > frameworks/mpi/startmpd.py, line 85
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103456#file103456line85>
bq.  >
bq.  >     Assuming mpd tries to do something graceful on SIGTERM, try SIGTERM, wait a bit, then try SIGKILL (and below).

Gave it a 5-second interval.
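The escalation Charles describes can be sketched as a small helper (a sketch only, not the actual startmpd.py code; the function name and the `sleep` child are illustrative):

```python
import signal
import subprocess
import time

def stop_gracefully(proc, grace_period=5.0, poll_interval=0.1):
    # Ask the child (e.g. an mpd) to exit cleanly with SIGTERM.
    proc.terminate()
    deadline = time.time() + grace_period
    while time.time() < deadline:
        if proc.poll() is not None:
            return proc.returncode  # exited within the grace period
        time.sleep(poll_interval)
    # Still alive after the grace period: escalate to SIGKILL.
    proc.kill()
    proc.wait()
    return proc.returncode

# 'sleep' exits promptly on SIGTERM, so SIGKILL is never reached here.
child = subprocess.Popen(["sleep", "60"])
rc = stop_gracefully(child, grace_period=5.0)
```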


bq.  On 2012-04-20 17:58:45, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 32
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line32>
bq.  >
bq.  >     MPI_TASK should be an array (and below).

ok. The executable should be the first argument in the array, renamed MPI_PROGRAM.


bq.  On 2012-04-20 17:58:45, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 210
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line210>
bq.  >
bq.  >     Account for MPICH2PATH here.

done. made it a global variable in both files too.


bq.  On 2012-04-20 17:58:45, Charles Reiss wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 215
bq.  > <https://reviews.apache.org/r/4768/diff/2/?file=103455#file103455line215>
bq.  >
bq.  >     close() this; does this (and existing code calling mpdtrace) leave mpdtrace as a zombie process?

The mpdtrace processes were indeed being left as zombies; the code now uses communicate() to collect the stdout.
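The zombie came from spawning mpdtrace and reading its output without ever wait()ing on it; communicate() does both in one step. A minimal sketch (`echo` stands in for the real mpdtrace invocation, and the output string is made up):

```python
import subprocess

# communicate() reads all of stdout AND reaps the child, so no zombie
# is left behind (unlike reading proc.stdout and never calling wait()).
proc = subprocess.Popen(["echo", "host1_40165"], stdout=subprocess.PIPE)
out, _ = proc.communicate()
trace = out.decode().strip()
```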


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7086
-----------------------------------------------------------


On 2012-04-21 05:08:47, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-21 05:08:47)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266559#comment-13266559 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/
-----------------------------------------------------------

(Updated 2012-05-02 13:29:50.636167)


Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.


Changes
-------

I'm not sure how to get 'git diff' results using 'mpiexec-mesos*' filenames, so I left the two files with 'nmpiexec' names unchanged. All text/code uses mpiexec-mesos though...

This patch also deals with many style issues from the last one, and eliminates the executor code (startmpd).


Summary
-------

Some updates to point out:

-nmpiexec.py
  -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
-startmpd.py
  -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
  -> killtask() stops the mpd associated with the given tid. 
  -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
-Readme
  -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux


This addresses bug MESOS-183.
    https://issues.apache.org/jira/browse/MESOS-183


Diffs (updated)
-----

  frameworks/mpi/README.txt cdb4553 
  frameworks/mpi/nmpiexec 517bdbc 
  frameworks/mpi/nmpiexec.py a5db9c0 
  frameworks/mpi/startmpd.py 8eeba5e 
  frameworks/mpi/startmpd.sh 44faa05 

Diff: https://reviews.apache.org/r/4768/diff


Testing
-------


Thanks,

Harvey


                

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270396#comment-13270396 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-05-08 03:39:29, Benjamin Hindman wrote:
bq.  > I'll get this checked in provided Jessica gives it a "Ship It". Thanks for the good work here, I intend to make it a demonstration of how to write frameworks on Mesos!

Scratch that. I voted to ship it and then remembered an issue that I don't think has been addressed yet. I posted this on the jira, but I haven't seen any changes for it: 

I'm running into the setuptools issue addressed in the test python framework: https://issues.apache.org/jira/browse/MESOS-130. The locations of the eggs added to PYTHONPATH in nmpiexec [now mpiexec-mesos?] need to be updated so that the Mesos/protobuf libraries (and setuptools) don't have to be installed on every node. 

There also seems to be an issue with Python detecting the Mesos module from the egg in src/python/dist--I couldn't import mesos until I unzipped the egg, no matter what directory I was in or how I modified the PYTHONPATH. [Update: I believe it's related to the fact that the mesos egg uses C/C++ extensions. I think it needs to use a setuptools module to list the package contents.]
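One way a launcher script could avoid a per-node install is to prepend the distribution eggs to sys.path before importing mesos. The sketch below is hedged: the MESOS_HOME layout and egg filenames are assumptions (they depend on the build platform), and, as noted above, a zipped egg containing C extensions may still need to be unzipped before it can be imported.

```python
import os
import sys

# Hypothetical locations; real egg names depend on the build platform.
MESOS_HOME = os.environ.get("MESOS_HOME", "/opt/mesos")
eggs = [
    os.path.join(MESOS_HOME, "src", "python", "dist",
                 "mesos-0.9.0-py2.6-linux-x86_64.egg"),
    os.path.join(MESOS_HOME, "third_party", "protobuf-2.4.1", "python",
                 "dist", "protobuf-2.4.1-py2.6.egg"),
]
for egg in eggs:
    if egg not in sys.path:
        sys.path.insert(0, egg)  # must happen before 'import mesos'
```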


- Jessica


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7666
-----------------------------------------------------------


On 2012-05-08 01:29:06, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-08 01:29:06)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec 517bdbc 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.    frameworks/mpi/startmpd.sh 44faa05 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "Jessica J (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252642#comment-13252642 ] 

Jessica J commented on MESOS-183:
---------------------------------

Here's a bona fide bug--it looks like the MesosSchedulerDriver signature changed but the calling code did not. Is there an easy way I can patch this on my own machine?

[jessica@golgatha mpi]$ ./nmpiexec -n 10 mesos://master@192.168.41.1:5050 hostname
Connecting to mesos master mesos://master@192.168.41.1:5050
MPD_PID is golgatha.[...].net_40165
Traceback (most recent call last):
  File "./nmpiexec.py", line 171, in <module>
    mesos.MesosSchedulerDriver(sched, args[0]).run()
TypeError: function takes exactly 3 arguments (2 given)
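For reference, the mismatch in this traceback can be reproduced with a plain Python stub (this is not the real mesos binding; the three-argument shape and the "MPI" framework name are assumptions inferred from the error message):

```python
class FakeSchedulerDriver(object):
    # Stub constructor with the newer three-argument shape.
    def __init__(self, sched, framework_name, master):
        self.framework_name = framework_name
        self.master = master

sched = object()
master = "192.168.41.1:5050"

caught = False
try:
    FakeSchedulerDriver(sched, master)  # old two-argument call site
except TypeError:
    caught = True  # "takes exactly ... arguments", as in the traceback

driver = FakeSchedulerDriver(sched, "MPI", master)  # updated call
```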

                

        

[jira] [Commented] (MESOS-183) Included MPI Framework Fails to Start

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266525#comment-13266525 ] 

jiraposter@reviews.apache.org commented on MESOS-183:
-----------------------------------------------------



bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 270
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line270>
bq.  >
bq.  >     Did someone actually ask for this feature?

No, but I thought it wouldn't hurt...


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 253
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line253>
bq.  >
bq.  >     I'd like to make this whole thing simpler. In particular, I don't see any need for an executor here (i.e., the startmpd.py script). We actually have a 1-1 mapping from tasks to mpd's, so let's just have our TaskInfo's have a CommandInfo which just launches the mpd. Something like this for that CommandInfo's value:
bq.  >     
bq.  >     ...command.value = "${MPICH2PATH}mpd -n --host=${MPD_MASTER_IP} --port=${MPD_MASTER_PORT}"
bq.  >     
bq.  >     (Note your code, as does what I have above, assumes that MPICH2PATH includes the trailing '/'. Also, you might need to change the string based on if you're passing --ifhn. Finally, note I used the long options --host and --port. This is for readability for other people that might not know mpd very well. Likewise, if there is a long option for -n it would be great to use that instead.)
bq.  >     
bq.  >     Of course, this will require setting the command's environment variables appropriately. But, this should let us completely eliminate the startmpd.py script!!!!! Yeah! Less things to maintain!

Ok. Switched to no-executor. I directly specified the variables during string/command construction, so there isn't a need to set environment variables. Got rid of startmpd.py/startmpd.sh =)
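Building the CommandInfo value directly could look something like this (a sketch following Benjamin's suggested form above; the path, host, and port values are made-up assumptions):

```python
# Build the mpd launch command from Python variables instead of relying
# on environment-variable expansion in the shell.
mpich2path = "/opt/mpich2/bin/"   # assumed to include the trailing '/'
mpd_master_host = "192.168.41.1"
mpd_master_port = 40165

command_value = "%smpd -n --host=%s --port=%d" % (
    mpich2path, mpd_master_host, mpd_master_port)
```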


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 199
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line199>
bq.  >
bq.  >     s/slot/mpd
bq.  >     
bq.  >     Same as above, is there something we can/should set when we launch the mpd?

There doesn't seem to be anything to set for memory when launching the mpd...


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 28
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line28>
bq.  >
bq.  >     I know this wasn't the style of the codebase you've inherited, but I'd like spaces around all operators please. For example, this line should read:
bq.  >     
bq.  >     print "Got " + str(TOTAL_TASKS) + " mpd slots, running mpiexec"
bq.  >     
bq.  >     It looks like you've already done this with some of the code you've added (which is awesome!), but please clean up all the code. Thanks!

Done. I converted most string operations to use %, roughly following the Google style guide for Python.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/README.txt, line 76
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103692#file103692line76>
bq.  >
bq.  >     Does nmpiexec work out of the box without changes (not included in this review).
bq.  >     
bq.  >     Also, let's change the names from "nmpiexec*" to "mpiexec-mesos*"! ;)
bq.  >     
bq.  >     Finally, just pass host:port, no need for the 'master@' prefix.

Yup, nmpiexec works. 


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, line 196
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line196>
bq.  >
bq.  >     s/slot/mpd
bq.  >     
bq.  >     Also, do we want to set the '--ncpus' option on the actual mpd that we launch on a Mesos slave?

Added --ncpus to slave mpd calls.


bq.  On 2012-04-24 21:45:19, Benjamin Hindman wrote:
bq.  > frameworks/mpi/nmpiexec.py, lines 91-92
bq.  > <https://reviews.apache.org/r/4768/diff/3/?file=103693#file103693line91>
bq.  >
bq.  >     Use driver.declineOffer please.

Done for this and the others below. Issue #188 (a small bug fix for the Python/C++ binding) would have to be committed for driver.declineOffer to work, though.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


On 2012-05-02 13:00:12, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-05-02 13:00:12)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have been finished, which occurs when the executor's launched mpd processes have all exited. 
bq.  -startmpd.py
bq.    -> Kept cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpds on the slave, but shutdown() doesn't. 
bq.    -> killTask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to set up and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
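The first bullet above describes stopping the driver from statusUpdate() once every launched mpd task reports TASK_FINISHED. A minimal sketch of that logic, with a hypothetical stub driver and a placeholder for the mesos_pb2.TASK_FINISHED constant (the real code is in frameworks/mpi/nmpiexec.py):

```python
# Sketch only: TASK_FINISHED's value and StubDriver are illustrative;
# the real scheduler receives a TaskStatus protobuf in statusUpdate().

TASK_FINISHED = 2  # placeholder for mesos_pb2.TASK_FINISHED

class StubDriver:
    """Hypothetical stand-in for MesosSchedulerDriver."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

class MPIScheduler:
    def __init__(self, total_mpds):
        self.total_mpds = total_mpds
        self.finished = 0

    def statusUpdate(self, driver, state):
        if state == TASK_FINISHED:
            self.finished += 1
            # All mpds have exited (e.g. after mpdallexit tears down
            # the ring), so the framework can stop its driver.
            if self.finished == self.total_mpds:
                driver.stop()

sched = MPIScheduler(total_mpds=2)
drv = StubDriver()
sched.statusUpdate(drv, TASK_FINISHED)
print(drv.stopped)  # False: one mpd still running
sched.statusUpdate(drv, TASK_FINISHED)
print(drv.stopped)  # True: all tasks finished, driver stopped
```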
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.    frameworks/mpi/startmpd.sh 44faa05 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


> Included MPI Framework Fails to Start
> -------------------------------------
>
>                 Key: MESOS-183
>                 URL: https://issues.apache.org/jira/browse/MESOS-183
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, framework
>         Environment: Scientific Linux Cluster
>            Reporter: Jessica J
>            Assignee: Harvey Feng 
>            Priority: Blocker
>              Labels: documentation, mpi, setup
>
> There are really two facets to this issue. The first is that no good documentation exists for setting up and using the included MPI framework. The second, and more important issue, is that the framework will not run. The second issue is possibly related to the first in that I may not be setting it up properly. 
>
> To test the MPI framework, by trial and error I determined I needed to run python setup.py build and python setup.py install in the MESOS-HOME/src/python directory. Now when I try to run nmpiexec -h, I get an AttributeError, below: 
>
> Traceback (most recent call last):
>   File "./nmpiexec.py", line 2, in <module>
>     import mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos.py", line 22, in <module>
>     import _mesos
>   File "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos_pb2.py", line 1286, in <module>
>     DESCRIPTOR.message_types_by_name['FrameworkID'] = _FRAMEWORKID
> AttributeError: 'FileDescriptor' object has no attribute 'message_types_by_name'
>
> I've examined setup.py and determined that the version of protobuf it includes (2.4.1) does, indeed, contain a FileDescriptor class in descriptor.py that sets self.message_types_by_name, so I'm not sure what the issue is. Is this a bug? Or is there a step I'm missing? Do I need to also build/install protobuf?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira