Posted to dev@mesos.apache.org by Scott Smith <sc...@gmail.com> on 2012/04/20 04:17:32 UTC

Problem setting up cluster

I'm trying to set up a cluster on EC2, but not using the canned
scripts/image.  I built the latest svn on Ubuntu 11.10 amd64, and copied
the build to a second node.  Both are c1.medium instances (not that it
should matter).  No other software is running (no HDFS, no Hadoop, etc.).

The problem is that the slave repeatedly (about once per second)
connects, advertises its resources, gets added, and then disconnects.  No
reason is given for the disconnect.  There are no messages on the slave,
only 5 or 6 messages on the master.

I'm not sure what the next diagnostic step should be; I was hoping someone
else had run into the same problem and could point out what I did wrong.
Any advice?

Thanks!

Re: Problem setting up cluster

Posted by Scott Smith <sc...@gmail.com>.
LIBPROCESS_IP didn't work for me, but I'm working off an older version
of Mesos (the one listed in these instructions:
https://github.com/mesos/spark/wiki/Running-spark-on-mesos), and I see
there was a recent fix in Mesos regarding LIBPROCESS_IP
(https://reviews.apache.org/r/4355/).

However, I found the culprit:  /etc/hosts had an entry mapping the local
hostname to a loopback address:

# Added by cloud-init
127.0.1.1	ip-10-252-94-24.us-west-2.compute.internal ip-10-252-94-24

I removed it, and now everything works.  Thanks for your help!
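For anyone who wants to check for the same condition before editing
/etc/hosts, here is a minimal sketch in Python. It scans hosts-file text
for entries that map a name other than localhost to a loopback address,
which is the pattern cloud-init added above. The function names are
hypothetical, and the rule for what counts as an "expected" loopback name
is an assumption of this sketch, not something Mesos defines.

```python
import ipaddress


def is_loopback(address):
    """Return True for loopback addresses such as 127.0.0.1 or 127.0.1.1."""
    return ipaddress.ip_address(address).is_loopback


def suspicious_hosts_entries(hosts_text):
    """Return (address, names) pairs from hosts-file text where a loopback
    address is mapped to names that are not localhost-style aliases."""
    suspicious = []
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            loopback = is_loopback(address)
        except ValueError:
            continue  # first field is not an IP address; skip the line
        # Assumption: names containing "localhost" or "loopback" are expected.
        extra = [n for n in names
                 if "localhost" not in n and "loopback" not in n]
        if loopback and extra:
            suspicious.append((address, extra))
    return suspicious
```

On the slave, passing the contents of /etc/hosts to
suspicious_hosts_entries() should list the 127.0.1.1 line shown above.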

-- 
        Scott

Re: Problem setting up cluster

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
Good point there. Maybe libprocess (our communication layer) is using the wrong address. I remember seeing that on Ubuntu: if you call gethostbyname with the local hostname, you get back 127.0.1.1 instead of the external IP. Try setting the LIBPROCESS_IP environment variable on the slave to the "right" IP before you run mesos-slave.

Matei
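The suggestion above could be scripted roughly as follows. This is a
sketch, not part of Mesos: slave_environment() is a hypothetical helper
that builds an environment with LIBPROCESS_IP set and refuses loopback
addresses, since those are not reachable from the master. Note that
elsewhere in this thread LIBPROCESS_IP is reported not to work on an older
Mesos build, so whether it takes effect may depend on the version.

```python
import ipaddress
import os


def slave_environment(ip, base_env=None):
    """Return a copy of base_env (default: os.environ) with LIBPROCESS_IP
    set to ip.

    Raises ValueError if ip is not a valid IP address or is a loopback
    address.
    """
    address = ipaddress.ip_address(ip)  # raises ValueError on bad input
    if address.is_loopback:
        raise ValueError(f"{ip} is a loopback address")
    env = dict(os.environ if base_env is None else base_env)
    env["LIBPROCESS_IP"] = str(address)
    return env
```

The returned dictionary can be passed as the env argument of
subprocess.run() together with whatever mesos-slave command line you
already use.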

Re: Problem setting up cluster

Posted by Scott Smith <sc...@gmail.com>.
Well the logs say this:

I0420 04:40:30.870983  8193 master.cpp:814] Attempting to register slave 201204200437-0-162 at slave@127.0.1.1:51851
I0420 04:40:30.871330  8193 master.cpp:1057] Master now considering a slave at ip-10-252-94-24.us-west-2.compute.internal:51851 as active
I0420 04:40:30.871415  8193 master.cpp:1588] Adding slave 201204200437-0-162 at ip-10-252-94-24.us-west-2.compute.internal with cpus=1; mem=1024
I0420 04:40:30.871599  8193 simple_allocator.cpp:71] Added slave 201204200437-0-162 with cpus=1; mem=1024
I0420 04:40:30.871680  8193 master.cpp:1143] Slave 201204200437-0-162 disconnected
I0420 04:40:30.871819  8193 simple_allocator.cpp:83] Removed slave 201204200437-0-162

tcpdump says this:

POST /master/mesos.internal.RegisterSlaveMessage HTTP/1.0
User-Agent: libprocess/slave@127.0.1.1:51851
Connection: Keep-Alive
Transfer-Encoding: chunked

87

..
*ip-10-252-94-24.us-west-2.compute.internal.*ip-10-252-94-24.us-west-2.compute.internal..
.cpus...		.......?..
.mem...		.......@ .?
0



So it looks like it's reporting both a valid hostname and a loopback
address.  Which will the master use?
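To spot this in logs or captures automatically, here is a small sketch
that splits an identifier of the form seen above (slave@127.0.1.1:51851)
and flags a loopback address. The function names are hypothetical. The
sketch assumes the master connects back to the address embedded in that
identifier, which the log suggests but this thread does not confirm.

```python
import ipaddress


def parse_libprocess_id(identifier):
    """Split 'name@host:port' into a (name, host, port) tuple."""
    name, _, endpoint = identifier.partition("@")
    host, _, port = endpoint.rpartition(":")
    if not name or not host or not port.isdigit():
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return name, host, int(port)


def advertises_loopback(identifier):
    """Return True if the host part is a loopback IP literal."""
    _, host, _ = parse_libprocess_id(identifier)
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # host is a name rather than an IP literal
```

With the identifier from the log above, advertises_loopback() reports
True, while the hostname the master prints a line later is not an IP
literal and so is not flagged.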

btw I have both machines in the same security group, and opened all
tcp inbound for the group to the group.


-- 
        Scott

Re: Problem setting up cluster

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
What hostname and port does the slave report for itself (i.e. when the master sees it connect, what message does it print)? It could be that the master cannot connect back to that address. Maybe you need to open up communication among machines in your EC2 security groups.

Matei
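One way to test the "master cannot connect back" theory is a plain TCP
connection attempt from the master machine to the host and port the slave
reports. The sketch below uses only the Python standard library;
can_connect() is a hypothetical helper, and a successful connection only
shows the port is reachable, not that Mesos registration will succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened
    within timeout seconds, False on any connection error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

Run it on the master with the address and port from the "Attempting to
register slave" log line. A False result points at security groups or at
the slave advertising an address the master cannot reach.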

Re: Problem setting up cluster

Posted by Scott Smith <sc...@gmail.com>.
Direct IP/port.  No ZooKeeper.

Re: Problem setting up cluster

Posted by John Sirois <js...@twitter.com>.
How are your slaves connecting to the master?  Via ZooKeeper or via a known
hostname/IP?


-- 
John Sirois
303-512-3301