Posted to user@mesos.apache.org by Ahmet Emre Aladağ <em...@fikrimuhal.com> on 2015/10/15 17:49:30 UTC

Transport endpoint is not connected

Hi all,

I'm trying to build a mesos cluster with mesosphere 0.25.

When I run 3 mesos-master nodes with QUORUM=2, one is elected as the leader.
About 1 minute later the leader logs the error messages below, then restarts.
Upon restart, the masters hold another election. They keep electing one
another in a loop, consistently failing, restarting and re-electing. If I set
QUORUM=1, the leader becomes stable, but slaves can't connect to the masters.
What could be the reason for this connection problem?
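
For context on the QUORUM value: the Mesos documentation asks for a quorum that is a strict majority of the masters. A minimal sketch of that arithmetic follows; the function name `majority_quorum` is made up for illustration and is not part of Mesos.

```python
def majority_quorum(num_masters: int) -> int:
    """Return the smallest quorum that is a strict majority of num_masters."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1


# Print the majority quorum for a few common cluster sizes.
for n in (1, 3, 5):
    print(n, "masters -> quorum", majority_quorum(n))
```

For 3 masters this gives 2, which matches the QUORUM=2 setting above; QUORUM=1 is below a majority for a 3-master cluster.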

The Marathon console thinks node 1 is the leader, although the Mesos panel
shows node 3 as the leader.

I also tried running slaves on the same nodes as the masters, but they hit
the same error and the slaves are not recognized by the masters.


Thanks,

MASTER ERRORS:

E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]

E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]


SLAVE ERRORS:

E1015 15:17:53.232672 25191 socket.hpp:174] Shutdown failed on fd=10: Transport endpoint is not connected [107]

E1015 15:18:01.424705 25191 socket.hpp:174] Shutdown failed on fd=11: Transport endpoint is not connected [107]

E1015 15:19:09.392596 25191 socket.hpp:174] Shutdown failed on fd=12: Transport endpoint is not connected [107]

W1015 15:19:09.392750 25185 slave.cpp:3187] Master disconnected! Waiting for a new master to be elected

E1015 15:21:21.104575 25191 socket.hpp:174] Shutdown failed on fd=10: Transport endpoint is not connected [107]

E1015 15:23:31.664559 25191 socket.hpp:174] Shutdown failed on fd=10: Transport endpoint is not connected [107]
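
The `[107]` in these log lines is the OS error number. On Linux, 107 is `ENOTCONN`, the error `shutdown()` returns when the socket has no connected peer. The sketch below is independent of Mesos and only reproduces the same error with a plain TCP socket; the helper name `shutdown_unconnected` is illustrative.

```python
import errno
import os
import socket

# Show the numeric value and message the OS associates with ENOTCONN.
print(errno.ENOTCONN, os.strerror(errno.ENOTCONN))


def shutdown_unconnected() -> int:
    """Call shutdown() on a TCP socket that was never connected.

    Returns the errno of the resulting OSError, or 0 if no error was raised.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()
    return 0


print(shutdown_unconnected() == errno.ENOTCONN)
```

Seen this way, the message says that the connection was already gone by the time the process tried to shut it down; it does not say why the connection failed.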

Re: Transport endpoint is not connected

Posted by Ahmet Emre Aladağ <em...@fikrimuhal.com>.
Thank you. The masters and slaves are on the same 3 nodes, and they are all
in the same private network. I set advertise_ip to the local IP and hostname
to the public hostname, and now the slaves can connect.
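
A minimal sketch of that split between the two settings, assuming they correspond to the `--advertise_ip` and `--hostname` flags of the Mesos master. The helper `master_flags`, the IP and the hostname are placeholders for illustration, not values from this cluster.

```python
from typing import List


def master_flags(advertise_ip: str, hostname: str) -> List[str]:
    """Build the two address-related flags as command-line arguments.

    advertise_ip: address other cluster members should use (private network).
    hostname:     name shown in the web UI and used for redirects.
    """
    return [f"--advertise_ip={advertise_ip}", f"--hostname={hostname}"]


# Placeholder values; substitute the node's private IP and public hostname.
print(" ".join(master_flags("10.0.0.11", "node1.example.com")))
```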

ZooKeeper now broadcasts the local IPs, and redirection on the Mesos panel
(when the node is not the leader) happens with the public hostnames.

By the way, QUORUM=2 works now and the re-election loop has stopped. I guess
the firewall combined with local/public IP broadcasting was the problem.

Thank you for the suggestions.
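
For anyone hitting the same symptom, a quick way to check whether a firewall blocks an advertised address is to attempt a plain TCP connection to it. This is only a sketch using the Python standard library; `can_connect` is a made-up helper, and the demo runs against a throwaway local listener rather than a real Mesos node.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a temporary local listener so the example is self-contained.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

print("listener up:", can_connect("127.0.0.1", port))
listener.close()
print("listener closed:", can_connect("127.0.0.1", port))
```

On a real cluster one would call `can_connect` from each node with the address the other nodes advertise and the port in use (5050 is the default Mesos master port).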



On Fri, Oct 16, 2015 at 1:49 PM, tommy xiao <xi...@gmail.com> wrote:

> currently mesos should be in same network, don't expose public ip.
>
> 2015-10-16 5:59 GMT+08:00 Ahmet Emre Aladağ <em...@fikrimuhal.com>:
>
>> I figured out the reason.
>>
>> I had configured mesos with internal IPs. I see that Zookeeper broadcasts
>> external network IP to the masters/slaves. There was a firewall issue with
>> the public IP. So that's the problem.
>>
>> Is it the correct way for Zookeeper to broadcast the public IPs? That's
>> understandable for cases where we extend the cluster out of the network.
>>

Re: Transport endpoint is not connected

Posted by tommy xiao <xi...@gmail.com>.
Currently, Mesos nodes should be in the same network; don't expose the public IP.

2015-10-16 5:59 GMT+08:00 Ahmet Emre Aladağ <em...@fikrimuhal.com>:

> I figured out the reason.
>
> I had configured mesos with internal IPs. I see that Zookeeper broadcasts
> external network IP to the masters/slaves. There was a firewall issue with
> the public IP. So that's the problem.
>
> Is it the correct way for Zookeeper to broadcast the public IPs? That's
> understandable for cases where we extend the cluster out of the network.
>


-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com

Re: Transport endpoint is not connected

Posted by haosdent <ha...@gmail.com>.
Do you mean your ZooKeeper is not in the same private network as the masters
and slaves? You could try setting LIBPROCESS_ADVERTISE_IP
and LIBPROCESS_ADVERTISE_PORT (more details at
https://github.com/apache/mesos/blob/master/docs/configuration.md#libprocess-options),
but I have no experience with these environments, so it may not work for
your scenario.
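
A minimal sketch of preparing those two environment variables before starting a Mesos process. Only the variable names come from the documentation linked above; the helper `libprocess_env` and the sample IP and port are assumptions for illustration.

```python
import os
from typing import Dict, Optional


def libprocess_env(
    advertise_ip: str,
    advertise_port: Optional[int] = None,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the libprocess advertise settings.

    base defaults to the current process environment; the port is optional.
    """
    env = dict(os.environ if base is None else base)
    env["LIBPROCESS_ADVERTISE_IP"] = advertise_ip
    if advertise_port is not None:
        env["LIBPROCESS_ADVERTISE_PORT"] = str(advertise_port)
    return env


# Placeholder values; an empty base keeps the printed output short.
print(libprocess_env("10.0.0.11", 5050, base={}))
```

The resulting mapping could be passed as `env=` to `subprocess.Popen` when launching the master or slave.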

On Fri, Oct 16, 2015 at 5:59 AM, Ahmet Emre Aladağ <em...@fikrimuhal.com>
wrote:

> I figured out the reason.
>
> I had configured mesos with internal IPs. I see that Zookeeper broadcasts
> external network IP to the masters/slaves. There was a firewall issue with
> the public IP. So that's the problem.
>
> Is it the correct way for Zookeeper to broadcast the public IPs? That's
> understandable for cases where we extend the cluster out of the network.
>


-- 
Best Regards,
Haosdent Huang

Re: Transport endpoint is not connected

Posted by Ahmet Emre Aladağ <em...@fikrimuhal.com>.
I figured out the reason.

I had configured Mesos with internal IPs, but I see that ZooKeeper broadcasts
the external network IP to the masters/slaves. There was a firewall issue
with the public IP, so that's the problem.

Is it correct for ZooKeeper to broadcast the public IPs? That would be
understandable for cases where we extend the cluster beyond the private network.
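
One quick way to tell which kind of address a node is announcing is to classify it with Python's standard `ipaddress` module. This is only a sketch; the helper `is_private` and the sample addresses are placeholders, not values from this cluster.

```python
import ipaddress


def is_private(address: str) -> bool:
    """Return True if the address belongs to a private (non-routable) range."""
    return ipaddress.ip_address(address).is_private


# Placeholder addresses: two private-range examples and one public one.
for addr in ("10.0.0.11", "192.168.1.20", "8.8.8.8"):
    print(addr, "private" if is_private(addr) else "public")
```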
