Posted to dev@mesos.apache.org by Stratos Dimopoulos <st...@gmail.com> on 2014/10/31 22:32:11 UTC

MPI on Mesos

Hi all,

I have already sent this to the user mailing list but didn't get a reply.
So I thought about trying here. If you have any hints please let me know.

I am having a couple of issues trying to run MPI over Mesos. I am using
Mesos 0.20.0 on Ubuntu 12.04 with MPICH2.

- I was able to run a helloworld MPI program apparently successfully, but
the task still shows up as LOST in the web UI. Here is the output from the
MPI run:

We've launched all our MPDs; waiting for them to come up
Got 1 mpd(s), running mpiexec
Running mpiexec


 *** Hello world from processor euca-10-2-235-206, rank 0 out of 1
processors ***

mpiexec completed, calling mpdallexit euca-10-2-248-74_57995
Task 0 in state 5
A task finished unexpectedly, calling mpdexit on euca-10-2-248-74_57995
mpdroot: perror msg: No such file or directory
mpdroot: cannot connect to local mpd at: /tmp/mpd2.console_root
    probable cause:  no mpd daemon on this machine
    possible cause:  unix socket /tmp/mpd2.console_root has been removed
mpdexit (__init__ 1208): forked process failed; status=255
I1028 22:15:04.774554  4859 sched.cpp:747] Stopping framework
'20141028-203440-1257767434-5050-3638-0006'
2014-10-28 22:15:04,795:4819(0x7fd7b1422700):ZOO_INFO@zookeeper_close@2505:
Closing zookeeper sessionId=0x14959388d4e0020


And also in the *executor stdout* I get:
sh -c 'mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237'
Command exited with status 127 (i.e. command not found)

and on *stderr*:
sh: 1 mpd: not found

I am assuming these messages show up in the executor's log files because,
once mpiexec completes, the task is finished and the mpd ring is no longer
running - so it complains about not finding the mpd command, even though
mpd itself normally works fine.
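
For what it's worth, decoding the numeric state from the framework output
above ("Task 0 in state 5") is easy with the Python bindings; assuming the
TaskState enum in the pip-installed mesos.interface matches the 0.20-era
mesos.proto on the cluster, state 5 should be TASK_LOST, which would line up
with the LOST task in the web UI:

from mesos.interface import mesos_pb2

# Assumption: the TaskState enum here matches the cluster's mesos.proto.
print(mesos_pb2.TASK_LOST)          # -> 5 with the 0.20-era proto
print(mesos_pb2.TaskState.Name(5))  # -> 'TASK_LOST'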

- Another thing I would like to ask about is the procedure for running MPI
on Mesos. So far, with Spark and Hadoop on Mesos, I was used to sharing the
executor on HDFS, so there was no need to distribute the code to the slaves.
With MPI I had to distribute the helloworld executable to the slaves myself,
because having it on HDFS didn't work. Moreover, I was expecting the mpd
ring to be started by Mesos (in the same way that the Hadoop JobTracker is
started by Mesos in the Hadoop-on-Mesos implementations). As it is, I have
to run mpdboot first before I can run MPI on Mesos. Is this the procedure I
should be following, or am I missing something?
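
For concreteness, this is roughly the kind of per-slave task launch I had in
mind, with the Mesos fetcher staging the binary out of HDFS via
CommandInfo.uris the way the Spark and Hadoop frameworks do. It is only a
sketch; the HDFS path, host and port are placeholders:

from mesos.interface import mesos_pb2

# Sketch only: slave_id and resources would come from the offer inside
# resourceOffers(), and the real framework builds its tasks differently.
task = mesos_pb2.TaskInfo()
task.name = "mpd.0"
task.task_id.value = "0"

# Let the Mesos fetcher stage the binary into the sandbox instead of
# copying it to every slave by hand (hdfs:// URIs need a Hadoop client
# available on the slaves).
uri = task.command.uris.add()
uri.value = "hdfs://namenode:9000/tmp/helloworld"   # placeholder path
uri.executable = True

# Same mpd invocation that shows up in the executor stdout above.
task.command.value = "mpd --noconsole --ncpus=1 --host=euca-10-2-248-74 --port=39237"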

- Finally, in order to get MPI working I had to install mesos.interface
with pip and manually copy the native directory from python/dist-packages
(native doesn't exist in the pip repo). Only afterwards did I realize that
the mpiexec-mesos.in file already does all of that - I can update the README
to make this a little clearer if you want, since I am guessing someone else
might get confused by this too.
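
In case it is useful for the README, a minimal sanity check that both halves
of the Python bindings resolve after the pip install plus the manual copy is
just:

# mesos.interface comes from pip; mesos.native has to come from the Mesos
# build tree or the system packages, since it is not on PyPI.
import mesos.interface
import mesos.native
from mesos.interface import mesos_pb2

print(mesos_pb2.TaskState.Name(mesos_pb2.TASK_FINISHED))  # smoke test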

thanks,
Stratos