Posted to dev@mesos.apache.org by Richard Xia <rx...@eecs.berkeley.edu> on 2012/04/16 12:48:50 UTC
Having problems with the EC2 Python scripts
Hi,
I'm trying to go through the guide here
(https://github.com/mesos/mesos/wiki/EC2-Scripts) and I'm running into a
couple problems. I'm running the latest version of the trunk (r1310658)
on Mac OS X 10.6 with the default Python (2.6).
The first problem that I run into is with the launch script. The default
wait time of 60 seconds doesn't seem to be enough; I would consistently
run into the error of the ssh connection being refused. When I set the
wait time to 120 seconds (just to be safe, I'm sure a smaller value
would work as well), it worked and would run to completion. I was just
using the default settings suggested by the guide (1 slave, m1.large
instance) and it took me a while to realize that the script just wasn't
waiting long enough for the instances to start up. Is this the expected
behavior? If it is, I think the guide needs to be updated to mention
that the default wait time may not be long enough.
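One way to avoid guessing at a fixed wait time is to poll until the instance accepts SSH connections. The following is a minimal sketch of that idea, not what the mesos-ec2 script itself does; the function name wait_for_port and its parameters are made up for illustration.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once a connection is accepted, or False if `timeout`
    seconds pass without a successful connection.
    """
    deadline = time.time() + timeout
    while True:
        try:
            # A refused or timed-out connection raises socket.error.
            sock = socket.create_connection((host, port), timeout=interval)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            if time.time() >= deadline:
                return False
            time.sleep(interval)
```

A caller would run this against the master's hostname before attempting ssh. An open port only shows that sshd is listening, so a short extra pause may still be needed.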
The second problem I am having is with any of the scripts that target an
existing cluster. For example, if I try running ./mesos-ec2 stop
<cluster-name>, I get the error message "ERROR: Could not find any
existing cluster". When debugging the script, I found that
get_existing_cluster() wasn't working properly. On line 309, when it
sets the variable group_names, it calls g.id where g is a security
group. The following lines seem to check whether the security group name
matches "<cluster-name>-master", "-slaves", or "-zoo". However, when
running a debugger, I find that the security group's id is actually in
the form "sg-6561c10d", not "<cluster-name>-slaves". Instead, it seems
to me that line 309 should be group_names = [g.name for g in
res.groups]. When I make this change myself, it seems to work.
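To illustrate the mismatch described above, here is a minimal sketch that uses a stand-in class rather than boto itself. FakeGroup, match_roles, the cluster name "test", and the second group id are hypothetical; only the attribute names id and name and the sg-6561c10d value come from the report.

```python
class FakeGroup(object):
    """Stand-in for a security group object; not the boto class."""

    def __init__(self, id, name):
        self.id = id
        self.name = name


def match_roles(groups, cluster_name, attr):
    """Return the cluster roles whose '<cluster-name>-<role>' string
    appears among the given attribute ('id' or 'name') of the groups."""
    values = [getattr(g, attr) for g in groups]
    return [role for role in ("master", "slaves", "zoo")
            if "%s-%s" % (cluster_name, role) in values]


groups = [
    FakeGroup("sg-6561c10d", "test-slaves"),
    FakeGroup("sg-00000000", "test-master"),  # made-up id
]

print(match_roles(groups, "test", "id"))    # matching on ids finds no roles
print(match_roles(groups, "test", "name"))  # matching on names finds both roles
```

When id holds an "sg-..." value, only the name comparison can find the cluster, which is consistent with the "Could not find any existing cluster" error.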
Thanks,
Richard Xia
Re: Having problems with the EC2 Python scripts
Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
Ugh, that's pretty evil. I guess we'll have to look into this but I don't know off the top of my head.
Matei
On Apr 17, 2012, at 9:23 PM, Richard Xia wrote:
> Hi Matei,
>
> You're right, I do have boto installed and the Mesos scripts are picking that version up instead of the packaged ones. Apparently the Python convention is to search for modules in PYTHONPATH *after* searching in site-packages, so even though the Mesos-packaged boto is included in PYTHONPATH, my boto installation is loaded first. I'm not that experienced with Python load paths, so is there an easy way to fix this without uninstalling boto?
>
> Thanks,
> Richard
Re: Having problems with the EC2 Python scripts
Posted by Richard Xia <rx...@eecs.berkeley.edu>.
Hi Matei,
You're right, I do have boto installed and the Mesos scripts are picking
that version up instead of the packaged ones. Apparently the Python
convention is to search for modules in PYTHONPATH *after* searching in
site-packages, so even though the Mesos-packaged boto is included in
PYTHONPATH, my boto installation is loaded first. I'm not that
experienced with Python load paths, so is there an easy way to fix this
without uninstalling boto?
Thanks,
Richard
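A common workaround for this kind of conflict is to put the bundled copy at the front of sys.path before the first import of boto. This is a minimal sketch, assuming the script knows where its bundled copy lives; prefer_bundled and the directory name are hypothetical, and this is not necessarily how the Mesos scripts handle it.

```python
import sys


def prefer_bundled(path):
    """Put path at the front of sys.path so modules there are found first."""
    if path in sys.path:
        sys.path.remove(path)
    sys.path.insert(0, path)


# Hypothetical location of the bundled library; adjust to the real one.
prefer_bundled("third_party/boto")

# This must run before the first "import boto". Afterwards,
# printing boto.__file__ shows which copy was actually loaded.
```

Printing sys.path itself is also a quick way to see the actual search order on a given machine.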
On 4/17/12 8:17 AM, Matei Zaharia wrote:
> Hi Richard,
>
> Do you have boto (the EC2 library for Python) installed on your machine through easy_install by any chance? It sounds like your Python is finding a different version of it than the one we ship with Mesos, because I run these scripts very often and I certainly never get the group.name vs group.id thing.
>
> For the initial timeout, I agree that we should make it longer. You can also use launch --resume to resume installation on a cluster where launch failed for this reason by the way.
>
> Matei
Re: Having problems with the EC2 Python scripts
Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
Hi Richard,
Do you have boto (the EC2 library for Python) installed on your machine through easy_install by any chance? It sounds like your Python is finding a different version of it than the one we ship with Mesos, because I run these scripts very often and I certainly never get the group.name vs group.id thing.
For the initial timeout, I agree that we should make it longer. You can also use launch --resume to resume installation on a cluster where launch failed for this reason by the way.
Matei
On Apr 16, 2012, at 11:48 AM, Richard Xia wrote:
> Hi,
>
> I'm trying to go through the guide here (https://github.com/mesos/mesos/wiki/EC2-Scripts) and I'm running into a couple problems. I'm running the latest version of the trunk (r1310658) on Mac OS X 10.6 with the default Python (2.6).
>
> The first problem that I run into is with the launch script. The default wait time of 60 seconds doesn't seem to be enough; I would consistently run into the error of the ssh connection being refused. When I set the wait time to 120 seconds (just to be safe, I'm sure a smaller value would work as well), it worked and would run to completion. I was just using the default settings suggested by the guide (1 slave, m1.large instance) and it took me a while to realize that the script just wasn't waiting long enough for the instances to start up. Is this the expected behavior? If it is, I think the guide needs to be updated to mention that the default wait time may not be long enough.
>
> The second problem I am having is with any of the scripts that target an existing cluster. For example, if I try running ./mesos-ec2 stop <cluster-name>, I get the error message "ERROR: Could not find any existing cluster". When debugging the script, I found that get_existing_cluster() wasn't working properly. On line 309, when it sets the variable group_names, it calls g.id where g is a security group. The following lines seem to check whether the security group name matches "<cluster-name>-master", "-slaves", or "-zoo". However, when running a debugger, I find that the security group's id is actually in the form "sg-6561c10d", not "<cluster-name>-slaves". Instead, it seems to me that line 309 should be group_names = [g.name for g in res.groups]. When I make this change myself, it seems to work.
>
> Thanks,
> Richard Xia