Posted to user@mesos.apache.org by Nan Xiao <xi...@gmail.com> on 2015/12/28 11:35:31 UTC

The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Hi all,

Greetings from me!

I am trying to follow this tutorial
(https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md)
to deploy "k8s on Mesos" on local machines: k8s is built from the latest
master branch, and Mesos is version 0.26.

After starting the Mesos master (IP: 15.242.100.56), the Mesos
slave (IP: 15.242.100.16), and k8s (IP: 15.242.100.60), I can see the
following logs from the Mesos master:

......
I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
(pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
(pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
ports(*):[31000-32000], allocated: )
I1227 22:53:06.740757  8053 http.cpp:334] HTTP GET for
/master/state.json from 15.242.100.60:56219 with
User-Agent='Go-http-client/1.1'
I1227 22:53:07.736419  8065 http.cpp:334] HTTP GET for
/master/state.json from 15.242.100.60:56241 with
User-Agent='Go-http-client/1.1'
I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
/master/state.json from 15.242.100.60:56252 with
User-Agent='Go-http-client/1.1'
I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
/master/state.json from 15.242.100.60:56272 with
User-Agent='Go-http-client/1.1'
I1227 22:53:08.815811  8060 master.cpp:2176] Received SUBSCRIBE call
for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
I1227 22:53:08.816182  8060 master.cpp:2247] Subscribing framework
Kubernetes with checkpointing enabled and capabilities [  ]
I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
I1227 22:53:08.817464  8050 master.cpp:1122] Framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
scheduler(1)@15.242.100.60:59488 disconnected
E1227 22:53:08.817497  8073 process.cpp:1911] Failed to shutdown
socket with fd 17: Transport endpoint is not connected
I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
scheduler(1)@15.242.100.60:59488
I1227 22:53:08.817595  8050 master.cpp:2496] Deactivating framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
scheduler(1)@15.242.100.60:59488
I1227 22:53:08.817797  8050 master.cpp:1146] Giving framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
W1227 22:53:08.818389  8062 master.cpp:4840] Master returning
resources offered to framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
terminated or is inactive
I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
(total: cpus(*):32; mem(*):127878; disk(*):4336;
ports(*):[31000-32000], allocated: ) on slave
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
......

I can't figure out why the Mesos master complains "Failed to shutdown
socket with fd 17: Transport endpoint is not connected".
Could someone give me some clues about this issue?

Thanks very much in advance!

Best Regards
Nan Xiao

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Nan Xiao <xi...@gmail.com>.
Hi Avinash,

Sorry for being unclear!

The root cause is not in k8s itself, but in the CentOS host that k8s is
running on: its iptables rules were blocking the traffic. After executing
"iptables -F", it works!
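
For reference, a less drastic fix than flushing every rule is to inspect the
chain and open only what is needed. The commands below are just a sketch, with
the master address and scheduler port taken from the logs earlier in this
thread (the scheduler's port changes on every restart):

# show the current rules with packet counters to see what is dropping traffic
iptables -L -n -v

# allow the Mesos master to connect back to the scheduler's libprocess port
iptables -I INPUT -p tcp -s 15.242.100.56 --dport 59488 -j ACCEPT

# or, as above, flush all filter rules (leaves the host wide open)
iptables -F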

Best Regards
Nan Xiao


On Wed, Dec 30, 2015 at 11:41 PM, Avinash Sridharan
<av...@mesosphere.io> wrote:
> Thanks for the update Nan. k8s enabling firewall rules that would block
> traffic to the master seems a bit odd. Looks like a bug to me, in the head
> of the branch. If you are able to reproduce it consistently, could you file
> an issue against kubernetes mesos.
>
> regards,
> Avinash
>
> On Tue, Dec 29, 2015 at 11:04 PM, Nan Xiao <xi...@gmail.com> wrote:
>>
>> Hi Avinash,
>>
>> Thanks very much for your reply!
>>
>> The root cause has been found: the k8s server has enabled the iptables
>> which blocks connection from
>> Mesos master; after disable it, it works!
>>
>> Best Regards
>> Nan Xiao
>>
>>
>> On Wed, Dec 30, 2015 at 3:22 AM, Avinash Sridharan
>> <av...@mesosphere.io> wrote:
>> > lsof command will show only actively opened file descriptors. So if you
>> > ran
>> > the command after seeing the error logs in the master, most probably the
>> > master had already closed this fd. Just throwing a few other things to
>> > look
>> > at, that might give some more insights.
>> >
>> > * Run the "netstat -na" and netstat -nt" commands on the master and the
>> > kubernetes master node to make sure that the master is listening to the
>> > right port, and the k8s scheduler is trying to connect to the right
>> > port.
>> > From the logs it does look like the master is receiving the registration
>> > request, so there shouldn't be a network configuration issue here.
>> > * Make sure there are no firewall rules getting turned on in your
>> > cluster
>> > since it looks like the k8s scheduler is not able to connect to the
>> > master
>> > (though it was able to register the first time).
>> >
>> > On Tue, Dec 29, 2015 at 1:37 AM, Nan Xiao <xi...@gmail.com>
>> > wrote:
>> >>
>> >> BTW, using "lsof" command finds there are only 16 file descriptors. I
>> >> don't know why Mesos
>> >> master try to close "fd 17".
>> >> Best Regards
>> >> Nan Xiao
>> >>
>> >>
>> >> On Tue, Dec 29, 2015 at 11:32 AM, Nan Xiao <xi...@gmail.com>
>> >> wrote:
>> >> > Hi Klaus,
>> >> >
>> >> > Firstly, thanks very much for your answer!
>> >> >
>> >> > The km processes are all live:
>> >> > root     129474 128024  2 22:26 pts/0    00:00:00 km apiserver
>> >> > --address=15.242.100.60 --etcd-servers=http://15.242.100.60:4001
>> >> > --service-cluster-ip-range=10.10.10.0/24 --port=8888
>> >> > --cloud-provider=mesos --cloud-config=mesos-cloud.conf
>> >> > --secure-port=0
>> >> > --v=1
>> >> > root     129509 128024  2 22:26 pts/0    00:00:00 km
>> >> > controller-manager --master=15.242.100.60:8888 --cloud-provider=mesos
>> >> > --cloud-config=./mesos-cloud.conf --v=1
>> >> > root     129538 128024  0 22:26 pts/0    00:00:00 km scheduler
>> >> > --address=15.242.100.60 --mesos-master=15.242.100.56:5050
>> >> > --etcd-servers=http://15.242.100.60:4001 --mesos-user=root
>> >> > --api-servers=15.242.100.60:8888 --cluster-dns=10.10.10.10
>> >> > --cluster-domain=cluster.local --v=2
>> >> >
>> >> > All the logs are also seem OK, except the logs from scheduler.log:
>> >> > ......
>> >> > I1228 22:26:37.883092  129538 messenger.go:381] Receiving message
>> >> > mesos.internal.InternalMasterChangeDetected from
>> >> > scheduler(1)@15.242.100.60:33077
>> >> > I1228 22:26:37.883225  129538 scheduler.go:374] New master
>> >> > master@15.242.100.56:5050 detected
>> >> > I1228 22:26:37.883268  129538 scheduler.go:435] No credentials were
>> >> > provided. Attempting to register scheduler without authentication.
>> >> > I1228 22:26:37.883356  129538 scheduler.go:928] Registering with
>> >> > master: master@15.242.100.56:5050
>> >> > I1228 22:26:37.883460  129538 messenger.go:187] Sending message
>> >> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
>> >> > I1228 22:26:37.883504  129538 scheduler.go:881] will retry
>> >> > registration in 1.209320575s if necessary
>> >> > I1228 22:26:37.883758  129538 http_transporter.go:193] Sending
>> >> > message
>> >> > to master@15.242.100.56:5050 via http
>> >> > I1228 22:26:37.883873  129538 http_transporter.go:587] libproc target
>> >> > URL
>> >> >
>> >> > http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
>> >> > I1228 22:26:39.093560  129538 scheduler.go:928] Registering with
>> >> > master: master@15.242.100.56:5050
>> >> > I1228 22:26:39.093659  129538 messenger.go:187] Sending message
>> >> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
>> >> > I1228 22:26:39.093702  129538 scheduler.go:881] will retry
>> >> > registration in 3.762036352s if necessary
>> >> > I1228 22:26:39.093765  129538 http_transporter.go:193] Sending
>> >> > message
>> >> > to master@15.242.100.56:5050 via http
>> >> > I1228 22:26:39.093847  129538 http_transporter.go:587] libproc target
>> >> > URL
>> >> >
>> >> > http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
>> >> > ......
>> >> >
>> >> > From the log, the Mesos master rejected the k8s's registeration, and
>> >> > k8s retry constantly.
>> >> >
>> >> > Have you met this issue before? Thanks very much in advance!
>> >> > Best Regards
>> >> > Nan Xiao
>> >> >
>> >> >
>> >> > On Mon, Dec 28, 2015 at 7:26 PM, Klaus Ma <kl...@gmail.com>
>> >> > wrote:
>> >> >> It seems Kubernetes is down; would you help to check kubernetes's
>> >> >> status
>> >> >> (km)?
>> >> >>
>> >> >> ----
>> >> >> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> >> >> Platform Symphony/DCOS Development & Support, STG, IBM GCG
>> >> >> +86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
>> >> >>
>> >> >> On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi all,
>> >> >>>
>> >> >>> Greetings from me!
>> >> >>>
>> >> >>> I am trying to follow this tutorial
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> (https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md)
>> >> >>> to deploy "k8s on Mesos" on local machines: The k8s is the newest
>> >> >>> master branch, and Mesos is the 0.26 edition.
>> >> >>>
>> >> >>> After running Mesos master(IP:15.242.100.56), Mesos
>> >> >>> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see
>> >> >>> the
>> >> >>> following logs from Mesos master:
>> >> >>>
>> >> >>> ......
>> >> >>> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of
>> >> >>> slave
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at
>> >> >>> slave(1)@15.242.100.16:5051
>> >> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed
>> >> >>> resources
>> >> >>> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
>> >> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
>> >> >>> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
>> >> >>> ports(*):[31000-32000], allocated: )
>> >> >>> I1227 22:53:06.740757 8053 http.cpp:334] HTTP GET for
>> >> >>> /master/state.json from 15.242.100.60:56219 with
>> >> >>> User-Agent='Go-http-client/1.1'
>> >> >>> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
>> >> >>> /master/state.json from 15.242.100.60:56241 with
>> >> >>> User-Agent='Go-http-client/1.1'
>> >> >>> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
>> >> >>> /master/state.json from 15.242.100.60:56252 with
>> >> >>> User-Agent='Go-http-client/1.1'
>> >> >>> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
>> >> >>> /master/state.json from 15.242.100.60:56272 with
>> >> >>> User-Agent='Go-http-client/1.1'
>> >> >>> I1227 22:53:08.815811 8060 master.cpp:2176] Received SUBSCRIBE call
>> >> >>> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
>> >> >>> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
>> >> >>> Kubernetes with checkpointing enabled and capabilities [  ]
>> >> >>> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> >> >>> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >> >>> scheduler(1)@15.242.100.60:59488 disconnected
>> >> >>> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
>> >> >>> socket with fd 17: Transport endpoint is not connected
>> >> >>> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting
>> >> >>> framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >> >>> scheduler(1)@15.242.100.60:59488
>> >> >>> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >> >>> scheduler(1)@15.242.100.60:59488
>> >> >>> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >> >>> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
>> >> >>> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
>> >> >>> resources offered to framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
>> >> >>> terminated or is inactive
>> >> >>> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
>> >> >>> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> >> >>> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
>> >> >>> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
>> >> >>> (total: cpus(*):32; mem(*):127878; disk(*):4336;
>> >> >>> ports(*):[31000-32000], allocated: ) on slave
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
>> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> >> >>> ......
>> >> >>>
>> >> >>> I can't figure out why Mesos master complains "Failed to shutdown
>> >> >>> socket with fd 17: Transport endpoint is not connected".
>> >> >>> Could someone give some clues on this issue?
>> >> >>>
>> >> >>> Thanks very much in advance!
>> >> >>>
>> >> >>> Best Regards
>> >> >>> Nan Xiao
>> >> >>
>> >> >>
>> >
>> >
>> >
>> >
>> > --
>> > Avinash Sridharan, Mesosphere
>> > +1 (323) 702 5245
>
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Avinash Sridharan <av...@mesosphere.io>.
Thanks for the update, Nan. k8s enabling firewall rules that block
traffic to the master seems a bit odd; it looks like a bug at the head
of the branch to me. If you can reproduce it consistently, could you
file an issue against kubernetes-mesos?

regards,
Avinash

On Tue, Dec 29, 2015 at 11:04 PM, Nan Xiao <xi...@gmail.com> wrote:

> Hi Avinash,
>
> Thanks very much for your reply!
>
> The root cause has been found: the k8s server has enabled the iptables
> which blocks connection from
> Mesos master; after disable it, it works!
>
> Best Regards
> Nan Xiao
>
>
> On Wed, Dec 30, 2015 at 3:22 AM, Avinash Sridharan
> <av...@mesosphere.io> wrote:
> > lsof command will show only actively opened file descriptors. So if you
> ran
> > the command after seeing the error logs in the master, most probably the
> > master had already closed this fd. Just throwing a few other things to
> look
> > at, that might give some more insights.
> >
> > * Run the "netstat -na" and netstat -nt" commands on the master and the
> > kubernetes master node to make sure that the master is listening to the
> > right port, and the k8s scheduler is trying to connect to the right port.
> > From the logs it does look like the master is receiving the registration
> > request, so there shouldn't be a network configuration issue here.
> > * Make sure there are no firewall rules getting turned on in your cluster
> > since it looks like the k8s scheduler is not able to connect to the
> master
> > (though it was able to register the first time).
> >
> > On Tue, Dec 29, 2015 at 1:37 AM, Nan Xiao <xi...@gmail.com>
> wrote:
> >>
> >> BTW, using "lsof" command finds there are only 16 file descriptors. I
> >> don't know why Mesos
> >> master try to close "fd 17".
> >> Best Regards
> >> Nan Xiao
> >>
> >>
> >> On Tue, Dec 29, 2015 at 11:32 AM, Nan Xiao <xi...@gmail.com>
> >> wrote:
> >> > Hi Klaus,
> >> >
> >> > Firstly, thanks very much for your answer!
> >> >
> >> > The km processes are all live:
> >> > root     129474 128024  2 22:26 pts/0    00:00:00 km apiserver
> >> > --address=15.242.100.60 --etcd-servers=http://15.242.100.60:4001
> >> > --service-cluster-ip-range=10.10.10.0/24 --port=8888
> >> > --cloud-provider=mesos --cloud-config=mesos-cloud.conf --secure-port=0
> >> > --v=1
> >> > root     129509 128024  2 22:26 pts/0    00:00:00 km
> >> > controller-manager --master=15.242.100.60:8888 --cloud-provider=mesos
> >> > --cloud-config=./mesos-cloud.conf --v=1
> >> > root     129538 128024  0 22:26 pts/0    00:00:00 km scheduler
> >> > --address=15.242.100.60 --mesos-master=15.242.100.56:5050
> >> > --etcd-servers=http://15.242.100.60:4001 --mesos-user=root
> >> > --api-servers=15.242.100.60:8888 --cluster-dns=10.10.10.10
> >> > --cluster-domain=cluster.local --v=2
> >> >
> >> > All the logs are also seem OK, except the logs from scheduler.log:
> >> > ......
> >> > I1228 22:26:37.883092  129538 messenger.go:381] Receiving message
> >> > mesos.internal.InternalMasterChangeDetected from
> >> > scheduler(1)@15.242.100.60:33077
> >> > I1228 22:26:37.883225  129538 scheduler.go:374] New master
> >> > master@15.242.100.56:5050 detected
> >> > I1228 22:26:37.883268  129538 scheduler.go:435] No credentials were
> >> > provided. Attempting to register scheduler without authentication.
> >> > I1228 22:26:37.883356  129538 scheduler.go:928] Registering with
> >> > master: master@15.242.100.56:5050
> >> > I1228 22:26:37.883460  129538 messenger.go:187] Sending message
> >> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
> >> > I1228 22:26:37.883504  129538 scheduler.go:881] will retry
> >> > registration in 1.209320575s if necessary
> >> > I1228 22:26:37.883758  129538 http_transporter.go:193] Sending message
> >> > to master@15.242.100.56:5050 via http
> >> > I1228 22:26:37.883873  129538 http_transporter.go:587] libproc target
> >> > URL
> >> >
> http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
> >> > I1228 22:26:39.093560  129538 scheduler.go:928] Registering with
> >> > master: master@15.242.100.56:5050
> >> > I1228 22:26:39.093659  129538 messenger.go:187] Sending message
> >> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
> >> > I1228 22:26:39.093702  129538 scheduler.go:881] will retry
> >> > registration in 3.762036352s if necessary
> >> > I1228 22:26:39.093765  129538 http_transporter.go:193] Sending message
> >> > to master@15.242.100.56:5050 via http
> >> > I1228 22:26:39.093847  129538 http_transporter.go:587] libproc target
> >> > URL
> >> >
> http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
> >> > ......
> >> >
> >> > From the log, the Mesos master rejected the k8s's registeration, and
> >> > k8s retry constantly.
> >> >
> >> > Have you met this issue before? Thanks very much in advance!
> >> > Best Regards
> >> > Nan Xiao
> >> >
> >> >
> >> > On Mon, Dec 28, 2015 at 7:26 PM, Klaus Ma <kl...@gmail.com>
> >> > wrote:
> >> >> It seems Kubernetes is down; would you help to check kubernetes's
> >> >> status
> >> >> (km)?
> >> >>
> >> >> ----
> >> >> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> >> >> Platform Symphony/DCOS Development & Support, STG, IBM GCG
> >> >> +86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
> >> >>
> >> >> On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi all,
> >> >>>
> >> >>> Greetings from me!
> >> >>>
> >> >>> I am trying to follow this tutorial
> >> >>>
> >> >>>
> >> >>> (
> https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md
> )
> >> >>> to deploy "k8s on Mesos" on local machines: The k8s is the newest
> >> >>> master branch, and Mesos is the 0.26 edition.
> >> >>>
> >> >>> After running Mesos master(IP:15.242.100.56), Mesos
> >> >>> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see
> the
> >> >>> following logs from Mesos master:
> >> >>>
> >> >>> ......
> >> >>> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of
> slave
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@
> 15.242.100.16:5051
> >> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed
> resources
> >> >>> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
> >> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
> >> >>> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
> >> >>> ports(*):[31000-32000], allocated: )
> >> >>> I1227 22:53:06.740757 8053 http.cpp:334] HTTP GET for
> >> >>> /master/state.json from 15.242.100.60:56219 with
> >> >>> User-Agent='Go-http-client/1.1'
> >> >>> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
> >> >>> /master/state.json from 15.242.100.60:56241 with
> >> >>> User-Agent='Go-http-client/1.1'
> >> >>> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
> >> >>> /master/state.json from 15.242.100.60:56252 with
> >> >>> User-Agent='Go-http-client/1.1'
> >> >>> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
> >> >>> /master/state.json from 15.242.100.60:56272 with
> >> >>> User-Agent='Go-http-client/1.1'
> >> >>> I1227 22:53:08.815811 8060 master.cpp:2176] Received SUBSCRIBE call
> >> >>> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
> >> >>> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
> >> >>> Kubernetes with checkpointing enabled and capabilities [  ]
> >> >>> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> >> >>> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >> >>> scheduler(1)@15.242.100.60:59488 disconnected
> >> >>> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
> >> >>> socket with fd 17: Transport endpoint is not connected
> >> >>> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >> >>> scheduler(1)@15.242.100.60:59488
> >> >>> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >> >>> scheduler(1)@15.242.100.60:59488
> >> >>> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >> >>> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
> >> >>> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
> >> >>> resources offered to framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
> >> >>> terminated or is inactive
> >> >>> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
> >> >>> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> >> >>> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
> >> >>> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
> >> >>> (total: cpus(*):32; mem(*):127878; disk(*):4336;
> >> >>> ports(*):[31000-32000], allocated: ) on slave
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
> >> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> >> >>> ......
> >> >>>
> >> >>> I can't figure out why Mesos master complains "Failed to shutdown
> >> >>> socket with fd 17: Transport endpoint is not connected".
> >> >>> Could someone give some clues on this issue?
> >> >>>
> >> >>> Thanks very much in advance!
> >> >>>
> >> >>> Best Regards
> >> >>> Nan Xiao
> >> >>
> >> >>
> >
> >
> >
> >
> > --
> > Avinash Sridharan, Mesosphere
> > +1 (323) 702 5245
>



-- 
Avinash Sridharan, Mesosphere
+1 (323) 702 5245

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Nan Xiao <xi...@gmail.com>.
Hi Avinash,

Thanks very much for your reply!

The root cause has been found: the k8s server had iptables enabled,
which blocked connections from the Mesos master; after disabling it,
it works!
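
On CentOS 7 the rules are usually managed by firewalld rather than hand-written
iptables entries, so a sketch of "disabling it" on a test cluster would be:

# stop the firewall for the current boot
systemctl stop firewalld

# keep it from starting again after a reboot
systemctl disable firewalld

# confirm that no filter rules are left
iptables -L -n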

Best Regards
Nan Xiao


On Wed, Dec 30, 2015 at 3:22 AM, Avinash Sridharan
<av...@mesosphere.io> wrote:
> lsof command will show only actively opened file descriptors. So if you ran
> the command after seeing the error logs in the master, most probably the
> master had already closed this fd. Just throwing a few other things to look
> at, that might give some more insights.
>
> * Run the "netstat -na" and netstat -nt" commands on the master and the
> kubernetes master node to make sure that the master is listening to the
> right port, and the k8s scheduler is trying to connect to the right port.
> From the logs it does look like the master is receiving the registration
> request, so there shouldn't be a network configuration issue here.
> * Make sure there are no firewall rules getting turned on in your cluster
> since it looks like the k8s scheduler is not able to connect to the master
> (though it was able to register the first time).
>
> On Tue, Dec 29, 2015 at 1:37 AM, Nan Xiao <xi...@gmail.com> wrote:
>>
>> BTW, using "lsof" command finds there are only 16 file descriptors. I
>> don't know why Mesos
>> master try to close "fd 17".
>> Best Regards
>> Nan Xiao
>>
>>
>> On Tue, Dec 29, 2015 at 11:32 AM, Nan Xiao <xi...@gmail.com>
>> wrote:
>> > Hi Klaus,
>> >
>> > Firstly, thanks very much for your answer!
>> >
>> > The km processes are all live:
>> > root     129474 128024  2 22:26 pts/0    00:00:00 km apiserver
>> > --address=15.242.100.60 --etcd-servers=http://15.242.100.60:4001
>> > --service-cluster-ip-range=10.10.10.0/24 --port=8888
>> > --cloud-provider=mesos --cloud-config=mesos-cloud.conf --secure-port=0
>> > --v=1
>> > root     129509 128024  2 22:26 pts/0    00:00:00 km
>> > controller-manager --master=15.242.100.60:8888 --cloud-provider=mesos
>> > --cloud-config=./mesos-cloud.conf --v=1
>> > root     129538 128024  0 22:26 pts/0    00:00:00 km scheduler
>> > --address=15.242.100.60 --mesos-master=15.242.100.56:5050
>> > --etcd-servers=http://15.242.100.60:4001 --mesos-user=root
>> > --api-servers=15.242.100.60:8888 --cluster-dns=10.10.10.10
>> > --cluster-domain=cluster.local --v=2
>> >
>> > All the logs are also seem OK, except the logs from scheduler.log:
>> > ......
>> > I1228 22:26:37.883092  129538 messenger.go:381] Receiving message
>> > mesos.internal.InternalMasterChangeDetected from
>> > scheduler(1)@15.242.100.60:33077
>> > I1228 22:26:37.883225  129538 scheduler.go:374] New master
>> > master@15.242.100.56:5050 detected
>> > I1228 22:26:37.883268  129538 scheduler.go:435] No credentials were
>> > provided. Attempting to register scheduler without authentication.
>> > I1228 22:26:37.883356  129538 scheduler.go:928] Registering with
>> > master: master@15.242.100.56:5050
>> > I1228 22:26:37.883460  129538 messenger.go:187] Sending message
>> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
>> > I1228 22:26:37.883504  129538 scheduler.go:881] will retry
>> > registration in 1.209320575s if necessary
>> > I1228 22:26:37.883758  129538 http_transporter.go:193] Sending message
>> > to master@15.242.100.56:5050 via http
>> > I1228 22:26:37.883873  129538 http_transporter.go:587] libproc target
>> > URL
>> > http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
>> > I1228 22:26:39.093560  129538 scheduler.go:928] Registering with
>> > master: master@15.242.100.56:5050
>> > I1228 22:26:39.093659  129538 messenger.go:187] Sending message
>> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
>> > I1228 22:26:39.093702  129538 scheduler.go:881] will retry
>> > registration in 3.762036352s if necessary
>> > I1228 22:26:39.093765  129538 http_transporter.go:193] Sending message
>> > to master@15.242.100.56:5050 via http
>> > I1228 22:26:39.093847  129538 http_transporter.go:587] libproc target
>> > URL
>> > http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
>> > ......
>> >
>> > From the log, the Mesos master rejected the k8s's registeration, and
>> > k8s retry constantly.
>> >
>> > Have you met this issue before? Thanks very much in advance!
>> > Best Regards
>> > Nan Xiao
>> >
>> >
>> > On Mon, Dec 28, 2015 at 7:26 PM, Klaus Ma <kl...@gmail.com>
>> > wrote:
>> >> It seems Kubernetes is down; would you help to check kubernetes's
>> >> status
>> >> (km)?
>> >>
>> >> ----
>> >> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> >> Platform Symphony/DCOS Development & Support, STG, IBM GCG
>> >> +86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
>> >>
>> >> On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> Greetings from me!
>> >>>
>> >>> I am trying to follow this tutorial
>> >>>
>> >>>
>> >>> (https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md)
>> >>> to deploy "k8s on Mesos" on local machines: The k8s is the newest
>> >>> master branch, and Mesos is the 0.26 edition.
>> >>>
>> >>> After running Mesos master(IP:15.242.100.56), Mesos
>> >>> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see the
>> >>> following logs from Mesos master:
>> >>>
>> >>> ......
>> >>> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
>> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
>> >>> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
>> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
>> >>> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
>> >>> ports(*):[31000-32000], allocated: )
>> >>> I1227 22:53:06.740757 8053 http.cpp:334] HTTP GET for
>> >>> /master/state.json from 15.242.100.60:56219 with
>> >>> User-Agent='Go-http-client/1.1'
>> >>> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
>> >>> /master/state.json from 15.242.100.60:56241 with
>> >>> User-Agent='Go-http-client/1.1'
>> >>> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
>> >>> /master/state.json from 15.242.100.60:56252 with
>> >>> User-Agent='Go-http-client/1.1'
>> >>> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
>> >>> /master/state.json from 15.242.100.60:56272 with
>> >>> User-Agent='Go-http-client/1.1'
>> >>> I1227 22:53:08.815811 8060 master.cpp:2176] Received SUBSCRIBE call
>> >>> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
>> >>> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
>> >>> Kubernetes with checkpointing enabled and capabilities [  ]
>> >>> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> >>> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >>> scheduler(1)@15.242.100.60:59488 disconnected
>> >>> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
>> >>> socket with fd 17: Transport endpoint is not connected
>> >>> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >>> scheduler(1)@15.242.100.60:59488
>> >>> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >>> scheduler(1)@15.242.100.60:59488
>> >>> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> >>> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
>> >>> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
>> >>> resources offered to framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
>> >>> terminated or is inactive
>> >>> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
>> >>> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> >>> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
>> >>> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
>> >>> (total: cpus(*):32; mem(*):127878; disk(*):4336;
>> >>> ports(*):[31000-32000], allocated: ) on slave
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
>> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> >>> ......
>> >>>
>> >>> I can't figure out why Mesos master complains "Failed to shutdown
>> >>> socket with fd 17: Transport endpoint is not connected".
>> >>> Could someone give some clues on this issue?
>> >>>
>> >>> Thanks very much in advance!
>> >>>
>> >>> Best Regards
>> >>> Nan Xiao
>> >>
>> >>
>
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Avinash Sridharan <av...@mesosphere.io>.
The lsof command only shows file descriptors that are currently open. So if
you ran the command after seeing the error in the master logs, the master had
most probably already closed this fd. Here are a few other things to look at
that might give some more insight.

* Run the "netstat -na" and "netstat -nt" commands on the master and the
kubernetes master node (see the example below) to make sure that the master
is listening on the right port, and that the k8s scheduler is trying to
connect to the right port. From the logs it does look like the master is
receiving the registration request, so there shouldn't be a network
configuration issue here.
* Make sure no firewall rules are getting turned on in your cluster, since it
looks like the k8s scheduler is not able to connect to the master
(though it was able to register the first time).
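
As a concrete example of those checks (addresses and ports are the ones from
this thread, so substitute your own):

# on the Mesos master: confirm it is listening on 5050
netstat -ntl | grep 5050

# on the k8s node: confirm the scheduler has, or is trying to open, a
# connection to the master
netstat -nt | grep 15.242.100.56:5050

# everything, including listening and non-TCP sockets
netstat -na | grep 5050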

On Tue, Dec 29, 2015 at 1:37 AM, Nan Xiao <xi...@gmail.com> wrote:

> BTW, using "lsof" command finds there are only 16 file descriptors. I
> don't know why Mesos
> master try to close "fd 17".
> Best Regards
> Nan Xiao
>
>
> On Tue, Dec 29, 2015 at 11:32 AM, Nan Xiao <xi...@gmail.com>
> wrote:
> > Hi Klaus,
> >
> > Firstly, thanks very much for your answer!
> >
> > The km processes are all live:
> > root     129474 128024  2 22:26 pts/0    00:00:00 km apiserver
> > --address=15.242.100.60 --etcd-servers=http://15.242.100.60:4001
> > --service-cluster-ip-range=10.10.10.0/24 --port=8888
> > --cloud-provider=mesos --cloud-config=mesos-cloud.conf --secure-port=0
> > --v=1
> > root     129509 128024  2 22:26 pts/0    00:00:00 km
> > controller-manager --master=15.242.100.60:8888 --cloud-provider=mesos
> > --cloud-config=./mesos-cloud.conf --v=1
> > root     129538 128024  0 22:26 pts/0    00:00:00 km scheduler
> > --address=15.242.100.60 --mesos-master=15.242.100.56:5050
> > --etcd-servers=http://15.242.100.60:4001 --mesos-user=root
> > --api-servers=15.242.100.60:8888 --cluster-dns=10.10.10.10
> > --cluster-domain=cluster.local --v=2
> >
> > All the logs are also seem OK, except the logs from scheduler.log:
> > ......
> > I1228 22:26:37.883092  129538 messenger.go:381] Receiving message
> > mesos.internal.InternalMasterChangeDetected from
> > scheduler(1)@15.242.100.60:33077
> > I1228 22:26:37.883225  129538 scheduler.go:374] New master
> > master@15.242.100.56:5050 detected
> > I1228 22:26:37.883268  129538 scheduler.go:435] No credentials were
> > provided. Attempting to register scheduler without authentication.
> > I1228 22:26:37.883356  129538 scheduler.go:928] Registering with
> > master: master@15.242.100.56:5050
> > I1228 22:26:37.883460  129538 messenger.go:187] Sending message
> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
> > I1228 22:26:37.883504  129538 scheduler.go:881] will retry
> > registration in 1.209320575s if necessary
> > I1228 22:26:37.883758  129538 http_transporter.go:193] Sending message
> > to master@15.242.100.56:5050 via http
> > I1228 22:26:37.883873  129538 http_transporter.go:587] libproc target
> > URL
> http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
> > I1228 22:26:39.093560  129538 scheduler.go:928] Registering with
> > master: master@15.242.100.56:5050
> > I1228 22:26:39.093659  129538 messenger.go:187] Sending message
> > mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
> > I1228 22:26:39.093702  129538 scheduler.go:881] will retry
> > registration in 3.762036352s if necessary
> > I1228 22:26:39.093765  129538 http_transporter.go:193] Sending message
> > to master@15.242.100.56:5050 via http
> > I1228 22:26:39.093847  129538 http_transporter.go:587] libproc target
> > URL
> http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
> > ......
> >
> > From the log, the Mesos master rejected the k8s's registeration, and
> > k8s retry constantly.
> >
> > Have you met this issue before? Thanks very much in advance!
> > Best Regards
> > Nan Xiao
> >
> >
> > On Mon, Dec 28, 2015 at 7:26 PM, Klaus Ma <kl...@gmail.com>
> wrote:
> >> It seems Kubernetes is down; would you help to check kubernetes's status
> >> (km)?
> >>
> >> ----
> >> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> >> Platform Symphony/DCOS Development & Support, STG, IBM GCG
> >> +86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
> >>
> >> On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com>
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Greetings from me!
> >>>
> >>> I am trying to follow this tutorial
> >>>
> >>> (
> https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md
> )
> >>> to deploy "k8s on Mesos" on local machines: The k8s is the newest
> >>> master branch, and Mesos is the 0.26 edition.
> >>>
> >>> After running Mesos master(IP:15.242.100.56), Mesos
> >>> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see the
> >>> following logs from Mesos master:
> >>>
> >>> ......
> >>> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
> >>> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
> >>> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
> >>> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
> >>> ports(*):[31000-32000], allocated: )
> >>> I1227 22:53:06.740757 8053 http.cpp:334] HTTP GET for
> >>> /master/state.json from 15.242.100.60:56219 with
> >>> User-Agent='Go-http-client/1.1'
> >>> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
> >>> /master/state.json from 15.242.100.60:56241 with
> >>> User-Agent='Go-http-client/1.1'
> >>> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
> >>> /master/state.json from 15.242.100.60:56252 with
> >>> User-Agent='Go-http-client/1.1'
> >>> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
> >>> /master/state.json from 15.242.100.60:56272 with
> >>> User-Agent='Go-http-client/1.1'
> >>> I1227 22:53:08.815811 8060 master.cpp:2176] Received SUBSCRIBE call
> >>> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
> >>> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
> >>> Kubernetes with checkpointing enabled and capabilities [  ]
> >>> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> >>> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >>> scheduler(1)@15.242.100.60:59488 disconnected
> >>> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
> >>> socket with fd 17: Transport endpoint is not connected
> >>> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >>> scheduler(1)@15.242.100.60:59488
> >>> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >>> scheduler(1)@15.242.100.60:59488
> >>> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> >>> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
> >>> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
> >>> resources offered to framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
> >>> terminated or is inactive
> >>> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
> >>> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> >>> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
> >>> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
> >>> (total: cpus(*):32; mem(*):127878; disk(*):4336;
> >>> ports(*):[31000-32000], allocated: ) on slave
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
> >>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> >>> ......
> >>>
> >>> I can't figure out why Mesos master complains "Failed to shutdown
> >>> socket with fd 17: Transport endpoint is not connected".
> >>> Could someone give some clues on this issue?
> >>>
> >>> Thanks very much in advance!
> >>>
> >>> Best Regards
> >>> Nan Xiao
> >>
> >>
>



-- 
Avinash Sridharan, Mesosphere
+1 (323) 702 5245

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Nan Xiao <xi...@gmail.com>.
BTW, running the "lsof" command shows there are only 16 file descriptors,
so I don't know why the Mesos master tries to close "fd 17".
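
For what it's worth, one way to watch the master's descriptors around the time
the error fires (assuming the binary is named mesos-master and lsof is run as
root):

# list the master's open descriptors; -nP keeps addresses and ports numeric
lsof -nP -p "$(pidof mesos-master)"

# or poll once a second to catch short-lived sockets such as fd 17
watch -n 1 "lsof -nP -p $(pidof mesos-master)"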
Best Regards
Nan Xiao


On Tue, Dec 29, 2015 at 11:32 AM, Nan Xiao <xi...@gmail.com> wrote:
> Hi Klaus,
>
> Firstly, thanks very much for your answer!
>
> The km processes are all live:
> root     129474 128024  2 22:26 pts/0    00:00:00 km apiserver
> --address=15.242.100.60 --etcd-servers=http://15.242.100.60:4001
> --service-cluster-ip-range=10.10.10.0/24 --port=8888
> --cloud-provider=mesos --cloud-config=mesos-cloud.conf --secure-port=0
> --v=1
> root     129509 128024  2 22:26 pts/0    00:00:00 km
> controller-manager --master=15.242.100.60:8888 --cloud-provider=mesos
> --cloud-config=./mesos-cloud.conf --v=1
> root     129538 128024  0 22:26 pts/0    00:00:00 km scheduler
> --address=15.242.100.60 --mesos-master=15.242.100.56:5050
> --etcd-servers=http://15.242.100.60:4001 --mesos-user=root
> --api-servers=15.242.100.60:8888 --cluster-dns=10.10.10.10
> --cluster-domain=cluster.local --v=2
>
> All the logs are also seem OK, except the logs from scheduler.log:
> ......
> I1228 22:26:37.883092  129538 messenger.go:381] Receiving message
> mesos.internal.InternalMasterChangeDetected from
> scheduler(1)@15.242.100.60:33077
> I1228 22:26:37.883225  129538 scheduler.go:374] New master
> master@15.242.100.56:5050 detected
> I1228 22:26:37.883268  129538 scheduler.go:435] No credentials were
> provided. Attempting to register scheduler without authentication.
> I1228 22:26:37.883356  129538 scheduler.go:928] Registering with
> master: master@15.242.100.56:5050
> I1228 22:26:37.883460  129538 messenger.go:187] Sending message
> mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
> I1228 22:26:37.883504  129538 scheduler.go:881] will retry
> registration in 1.209320575s if necessary
> I1228 22:26:37.883758  129538 http_transporter.go:193] Sending message
> to master@15.242.100.56:5050 via http
> I1228 22:26:37.883873  129538 http_transporter.go:587] libproc target
> URL http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
> I1228 22:26:39.093560  129538 scheduler.go:928] Registering with
> master: master@15.242.100.56:5050
> I1228 22:26:39.093659  129538 messenger.go:187] Sending message
> mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
> I1228 22:26:39.093702  129538 scheduler.go:881] will retry
> registration in 3.762036352s if necessary
> I1228 22:26:39.093765  129538 http_transporter.go:193] Sending message
> to master@15.242.100.56:5050 via http
> I1228 22:26:39.093847  129538 http_transporter.go:587] libproc target
> URL http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
> ......
>
> From the log, the Mesos master rejected the k8s's registeration, and
> k8s retry constantly.
>
> Have you met this issue before? Thanks very much in advance!
> Best Regards
> Nan Xiao
>
>
> On Mon, Dec 28, 2015 at 7:26 PM, Klaus Ma <kl...@gmail.com> wrote:
>> It seems Kubernetes is down; would you help to check kubernetes's status
>> (km)?
>>
>> ----
>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> Platform Symphony/DCOS Development & Support, STG, IBM GCG
>> +86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
>>
>> On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> Greetings from me!
>>>
>>> I am trying to follow this tutorial
>>>
>>> (https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md)
>>> to deploy "k8s on Mesos" on local machines: The k8s is the newest
>>> master branch, and Mesos is the 0.26 edition.
>>>
>>> After running Mesos master(IP:15.242.100.56), Mesos
>>> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see the
>>> following logs from Mesos master:
>>>
>>> ......
>>> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
>>> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
>>> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
>>> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
>>> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
>>> ports(*):[31000-32000], allocated: )
>>> I1227 22:53:06.740757 8053 http.cpp:334] HTTP GET for
>>> /master/state.json from 15.242.100.60:56219 with
>>> User-Agent='Go-http-client/1.1'
>>> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
>>> /master/state.json from 15.242.100.60:56241 with
>>> User-Agent='Go-http-client/1.1'
>>> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
>>> /master/state.json from 15.242.100.60:56252 with
>>> User-Agent='Go-http-client/1.1'
>>> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
>>> /master/state.json from 15.242.100.60:56272 with
>>> User-Agent='Go-http-client/1.1'
>>> I1227 22:53:08.815811 8060 master.cpp:2176] Received SUBSCRIBE call
>>> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
>>> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
>>> Kubernetes with checkpointing enabled and capabilities [  ]
>>> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>>> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>>> scheduler(1)@15.242.100.60:59488 disconnected
>>> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
>>> socket with fd 17: Transport endpoint is not connected
>>> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>>> scheduler(1)@15.242.100.60:59488
>>> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>>> scheduler(1)@15.242.100.60:59488
>>> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>>> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
>>> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
>>> resources offered to framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
>>> terminated or is inactive
>>> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
>>> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>>> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
>>> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
>>> (total: cpus(*):32; mem(*):127878; disk(*):4336;
>>> ports(*):[31000-32000], allocated: ) on slave
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
>>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>>> ......
>>>
>>> I can't figure out why Mesos master complains "Failed to shutdown
>>> socket with fd 17: Transport endpoint is not connected".
>>> Could someone give some clues on this issue?
>>>
>>> Thanks very much in advance!
>>>
>>> Best Regards
>>> Nan Xiao
>>
>>

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Nan Xiao <xi...@gmail.com>.
Hi Klaus,

Firstly, thanks very much for your answer!

The km processes are all live:
root     129474 128024  2 22:26 pts/0    00:00:00 km apiserver
--address=15.242.100.60 --etcd-servers=http://15.242.100.60:4001
--service-cluster-ip-range=10.10.10.0/24 --port=8888
--cloud-provider=mesos --cloud-config=mesos-cloud.conf --secure-port=0
--v=1
root     129509 128024  2 22:26 pts/0    00:00:00 km
controller-manager --master=15.242.100.60:8888 --cloud-provider=mesos
--cloud-config=./mesos-cloud.conf --v=1
root     129538 128024  0 22:26 pts/0    00:00:00 km scheduler
--address=15.242.100.60 --mesos-master=15.242.100.56:5050
--etcd-servers=http://15.242.100.60:4001 --mesos-user=root
--api-servers=15.242.100.60:8888 --cluster-dns=10.10.10.10
--cluster-domain=cluster.local --v=2

All the logs also seem OK, except for scheduler.log:
......
I1228 22:26:37.883092  129538 messenger.go:381] Receiving message
mesos.internal.InternalMasterChangeDetected from
scheduler(1)@15.242.100.60:33077
I1228 22:26:37.883225  129538 scheduler.go:374] New master
master@15.242.100.56:5050 detected
I1228 22:26:37.883268  129538 scheduler.go:435] No credentials were
provided. Attempting to register scheduler without authentication.
I1228 22:26:37.883356  129538 scheduler.go:928] Registering with
master: master@15.242.100.56:5050
I1228 22:26:37.883460  129538 messenger.go:187] Sending message
mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
I1228 22:26:37.883504  129538 scheduler.go:881] will retry
registration in 1.209320575s if necessary
I1228 22:26:37.883758  129538 http_transporter.go:193] Sending message
to master@15.242.100.56:5050 via http
I1228 22:26:37.883873  129538 http_transporter.go:587] libproc target
URL http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
I1228 22:26:39.093560  129538 scheduler.go:928] Registering with
master: master@15.242.100.56:5050
I1228 22:26:39.093659  129538 messenger.go:187] Sending message
mesos.internal.RegisterFrameworkMessage to master@15.242.100.56:5050
I1228 22:26:39.093702  129538 scheduler.go:881] will retry
registration in 3.762036352s if necessary
I1228 22:26:39.093765  129538 http_transporter.go:193] Sending message
to master@15.242.100.56:5050 via http
I1228 22:26:39.093847  129538 http_transporter.go:587] libproc target
URL http://15.242.100.56:5050/master/mesos.internal.RegisterFrameworkMessage
......

From the log, the Mesos master rejected the k8s registration, and
k8s retries constantly.
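
One quick sanity check that does not depend on the scheduler code at all is to
probe both directions with curl, since the master has to open a connection
back to the scheduler's libprocess port to deliver the registered message (the
ports below are taken from the logs above; the scheduler's port changes on
every restart):

# from the k8s node: can we reach the master's HTTP endpoint?
curl -sv http://15.242.100.56:5050/master/state.json -o /dev/null

# from the Mesos master: can we reach the scheduler's libprocess port?
curl -sv http://15.242.100.60:59488/ -o /dev/null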

Have you met this issue before? Thanks very much in advance!
Best Regards
Nan Xiao


On Mon, Dec 28, 2015 at 7:26 PM, Klaus Ma <kl...@gmail.com> wrote:
> It seems Kubernetes is down; would you help to check kubernetes's status
> (km)?
>
> ----
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform Symphony/DCOS Development & Support, STG, IBM GCG
> +86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
>
> On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com> wrote:
>>
>> Hi all,
>>
>> Greetings from me!
>>
>> I am trying to follow this tutorial
>>
>> (https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md)
>> to deploy "k8s on Mesos" on local machines: The k8s is the newest
>> master branch, and Mesos is the 0.26 edition.
>>
>> After running Mesos master(IP:15.242.100.56), Mesos
>> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see the
>> following logs from Mesos master:
>>
>> ......
>> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
>> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
>> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
>> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
>> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
>> ports(*):[31000-32000], allocated: )
>> I1227 22:53:06.740757  8053 http.cpp:334] HTTP GET for
>> /master/state.json from 15.242.100.60:56219 with
>> User-Agent='Go-http-client/1.1'
>> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
>> /master/state.json from 15.242.100.60:56241 with
>> User-Agent='Go-http-client/1.1'
>> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
>> /master/state.json from 15.242.100.60:56252 with
>> User-Agent='Go-http-client/1.1'
>> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
>> /master/state.json from 15.242.100.60:56272 with
>> User-Agent='Go-http-client/1.1'
>> I1227 22:53:08.815811  8060 master.cpp:2176] Received SUBSCRIBE call
>> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
>> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
>> Kubernetes with checkpointing enabled and capabilities [  ]
>> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> scheduler(1)@15.242.100.60:59488 disconnected
>> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
>> socket with fd 17: Transport endpoint is not connected
>> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> scheduler(1)@15.242.100.60:59488
>> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> scheduler(1)@15.242.100.60:59488
>> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
>> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
>> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
>> resources offered to framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
>> terminated or is inactive
>> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
>> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
>> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
>> (total: cpus(*):32; mem(*):127878; disk(*):4336;
>> ports(*):[31000-32000], allocated: ) on slave
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
>> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
>> ......
>>
>> I can't figure out why Mesos master complains "Failed to shutdown
>> socket with fd 17: Transport endpoint is not connected".
>> Could someone give some clues on this issue?
>>
>> Thanks very much in advance!
>>
>> Best Regards
>> Nan Xiao
>
>

Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

Posted by Klaus Ma <kl...@gmail.com>.
It seems Kubernetes is down; could you check the status of the Kubernetes
(km) processes?
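
A minimal way to check that from the k8s host (the apiserver port 8888 is the
one used later in this thread; adjust if yours differs):

# are the km components still running?
ps -ef | grep '[k]m '

# is the apiserver answering?
curl -s http://15.242.100.60:8888/healthz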

----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me

On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xi...@gmail.com> wrote:

> Hi all,
>
> Greetings from me!
>
> I am trying to follow this tutorial
> (
> https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md
> )
> to deploy "k8s on Mesos" on local machines: The k8s is the newest
> master branch, and Mesos is the 0.26 edition.
>
> After running Mesos master(IP:15.242.100.56), Mesos
> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see the
> following logs from Mesos master:
>
> ......
> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
> ports(*):[31000-32000], allocated: )
> I1227 22:53:06.740757  8053 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56219 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56241 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56252 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56272 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:08.815811  8060 master.cpp:2176] Received SUBSCRIBE call
> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
> Kubernetes with checkpointing enabled and capabilities [  ]
> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> scheduler(1)@15.242.100.60:59488 disconnected
> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
> socket with fd 17: Transport endpoint is not connected
> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> scheduler(1)@15.242.100.60:59488
> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> scheduler(1)@15.242.100.60:59488
> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 (Kubernetes) at
> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
> resources offered to framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000 because the framework has
> terminated or is inactive
> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
> (total: cpus(*):32; mem(*):127878; disk(*):4336;
> ports(*):[31000-32000], allocated: ) on slave
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-0000
> ......
>
> I can't figure out why Mesos master complains "Failed to shutdown
> socket with fd 17: Transport endpoint is not connected".
> Could someone give some clues on this issue?
>
> Thanks very much in advance!
>
> Best Regards
> Nan Xiao
>

Re: mesos-master v0.26 crashes for quorum 0

Posted by Adam Bordelon <ad...@mesosphere.io>.
You should never specify a quorum of 0.
For 1 master, you specify quorum of 1.
For 3 masters, quorum is 2.
For 5 masters, quorum is 3.
For 7 masters, quorum is 4.
The quorum dictates how many masters (log replicas) have to agree on a fact
to win a vote. If you have a quorum of 0, then no masters vote, so nobody
wins. On a related note, you should always have an odd number of masters,
so that the vote is never tied.
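
Put differently, for N masters the quorum is floor(N/2) + 1. A quick sanity
check with plain shell arithmetic (nothing Mesos-specific):

# quorum = floor(N/2) + 1
N=5
echo $(( N / 2 + 1 ))   # prints 3, matching the table above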

I will admit that the master shouldn't crash with --quorum=0; it should
just exit with an error that quorum must be >=1. Want to file a JIRA?
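
In the meantime, re-running the command from your mail with only the quorum
changed should avoid the crash (same flags, just --quorum=1), as you already
observed:

./bin/mesos-master.sh --ip=10.10.10.118 --work_dir=/var/lib/mesos \
  --zk=zk://10.10.10.118:2181/mesos --quorum=1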

On Tue, Dec 29, 2015 at 3:43 PM, Mehrotra, Ashish <As...@emc.com>
wrote:

> Hi All,
>
> I am running CentOS 7.1, ZooKeeper 3.4.7, and Mesos 0.26.0.
> After starting ZooKeeper, I tried to start mesos-master with quorum 0
> (everything running on the same machine, not in local mode but as a
> distributed setup), and the master crashed.
> This happened immediately after a fresh install.
> When I changed to quorum=1, the mesos-master ran fine and the slave
> could connect.
> On restarting the mesos-master afterwards, there was no issue. *It was
> seen only the very first time.*
> The error stack trace is hard to make sense of.
>
> Has anyone seen this issue before?
> The error log was:
>
> [root@abc123 build]# ./bin/mesos-master.sh --ip=10.10.10.118
> --work_dir=/var/lib/mesos --zk=zk://10.10.10.118:2181/mesos *--quorum=0*
> I1229 13:41:24.925851  3345 main.cpp:232] Build: 2015-12-29 12:29:36 by
> root
> I1229 13:41:24.925983  3345 main.cpp:234] Version: 0.26.0
> I1229 13:41:24.929131  3345 main.cpp:255] Using 'HierarchicalDRF' allocator
> I1229 13:41:24.953929  3345 leveldb.cpp:176] Opened db in 24.529078ms
> I1229 13:41:24.955523  3345 leveldb.cpp:183] Compacted db in 1.525191ms
> I1229 13:41:24.955688  3345 leveldb.cpp:198] Created db iterator in
> 107413ns
> I1229 13:41:24.955724  3345 leveldb.cpp:204] Seeked to beginning of db in
> 4553ns
> I1229 13:41:24.955737  3345 leveldb.cpp:273] Iterated through 0 keys in
> the db in 224ns
> I1229 13:41:24.956120  3345 replica.cpp:780] Replica recovered with log
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1229 13:41:24.961802  3345 main.cpp:464] Starting Mesos master
> I1229 13:41:24.965438  3345 master.cpp:367] Master
> a38658f7-89c1-4b1f-84f9-5796234b2104 (localhost) started on
> 10.10.10.118:5050
> I1229 13:41:24.965459  3345 master.cpp:369] Flags at startup:
> --allocation_interval="1secs" --allocator="HierarchicalDRF"
> --authenticate="false" --authenticate_slaves="false"
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf"
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true"
> --ip="10.10.10.118" --log_auto_initialize="true" --logbufsecs="0"
> --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050"
> --quiet="false" --quorum="0" --recovery_slave_removal_limit="100%"
> --registry="replicated_log" --registry_fetch_timeout="1mins"
> --registry_store_timeout="5secs" --registry_strict="false"
> --root_submissions="true" --slave_ping_timeout="15secs"
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false"
> --webui_dir="/home/admin/mesos/build/../src/webui"
> --work_dir="/var/lib/mesos" --zk="zk://10.10.10.118:2181/mesos"
> --zk_session_timeout="10secs"
> I1229 13:41:24.965761  3345 master.cpp:416] Master allowing
> unauthenticated frameworks to register
> I1229 13:41:24.965772  3345 master.cpp:421] Master allowing
> unauthenticated slaves to register
> I1229 13:41:24.965837  3345 master.cpp:458] Using default 'crammd5'
> authenticator
> W1229 13:41:24.965867  3345 authenticator.cpp:513] No credentials
> provided, authentication requests will be refused
> I1229 13:41:24.965881  3345 authenticator.cpp:520] Initializing server SASL
> I1229 13:41:24.966788  3364 log.cpp:238] Attempting to join replica to
> ZooKeeper group
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@712: Client
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@716: Client
> environment:host.name=abc.def.com
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@723: Client
> environment:os.name=Linux
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@724: Client
> environment:os.arch=3.10.0-229.el7.x86_64
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@725: Client
> environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@733: Client
> environment:user.name=root
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@741: Client
> environment:user.home=/root
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@753: Client
> environment:user.dir=/home/admin/mesos/build
> 2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000
> watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null>
> context=0x7f20fc0038c0 flags=0
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@712: Client
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@716: Client
> environment:host.name=abc.def.com
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@723: Client
> environment:os.name=Linux
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@724: Client
> environment:os.arch=3.10.0-229.el7.x86_64
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@725: Client
> environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@733: Client
> environment:user.name=root
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@741: Client
> environment:user.home=/root
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@753: Client
> environment:user.dir=/home/admin/mesos/build
> 2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000
> watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null>
> context=0x7f20fc003a70 flags=0
> I1229 13:41:24.971629  3360 recover.cpp:449] Starting replica recovery
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@712: Client
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@716: Client
> environment:host.name=abc.def.com
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@723: Client
> environment:os.name=Linux
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@724: Client
> environment:os.arch=3.10.0-229.el7.x86_64
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@725: Client
> environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@733: Client
> environment:user.name=root
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@741: Client
> environment:user.home=/root
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@753: Client
> environment:user.dir=/home/admin/mesos/build
> 2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000
> watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null>
> context=0x7f20fc0078b0 flags=0
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@712: Client
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@716: Client
> environment:host.name=abc.def.com
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@723: Client
> environment:os.name=Linux
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@724: Client
> environment:os.arch=3.10.0-229.el7.x86_64
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@725: Client
> environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@733: Client
> environment:user.name=root
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@741: Client
> environment:user.home=/root
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@753: Client
> environment:user.dir=/home/admin/mesos/build
> 2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000
> watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null>
> context=0x7f20fc007f00 flags=0
> I1229 13:41:24.973780  3362 recover.cpp:475] Replica is in EMPTY status
> I1229 13:41:24.979076  3362 replica.cpp:676] Replica in EMPTY status
> received a broadcasted recover request from (4)@10.10.10.118:5050
> I1229 13:41:24.979863  3362 recover.cpp:195] Received a recover response
> from a replica in EMPTY status
> F1229 13:41:24.980000  3362 recover.cpp:219]
> CHECK_SOME(lowestBeginPosition): is NONE
> *** Check failure stack trace: ***
>     @     0x7f211143a6a2  google::LogMessage::Fail()
>     @     0x7f211143a601  google::LogMessage::SendToLog()
> 2015-12-29 13:41:24,995:3345(0x7f20f37fe700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.10.10.118:2181]
> 2015-12-29 13:41:25,004:3345(0x7f21008d9700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.10.10.118:2181]
> 2015-12-29 13:41:25,004:3345(0x7f21018db700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.10.10.118:2181]
> 2015-12-29 13:41:25,004:3345(0x7f20f27fc700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.10.10.118:2181]
>     @     0x7f211143a012  google::LogMessage::Flush()
>     @     0x7f211143cd46  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f211056d44c  _CheckFatal::~_CheckFatal()
>     @     0x7f211125a243
> mesos::internal::log::RecoverProtocolProcess::received()
>     @     0x7f2111265ae6
> _ZZN7process8dispatchI6OptionIN5mesos8internal3log15RecoverResponseEENS4_22RecoverProtocolProcessERKNS_6FutureIS5_EES9_EENS8_IT_EERKNS_3PIDIT0_EEMSF_FSD_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESO_
>     @     0x7f211127c5d5
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI6OptionIN5mesos8internal3log15RecoverResponseEENS8_22RecoverProtocolProcessERKNS0_6FutureIS9_EESD_EENSC_IT_EERKNS0_3PIDIT0_EEMSJ_FSH_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f21113c0d7d  std::function<>::operator()()
>     @     0x7f21113a8b95  process::ProcessBase::visit()
>     @     0x7f21113ac960  process::DispatchEvent::visit()
>     @           0x471dd8  process::ProcessBase::serve()
> I1229 13:41:25.136451  3366 contender.cpp:149] Joining the ZK group
>     @     0x7f21113a4f81  process::ProcessManager::resume()
>     @     0x7f21113a21b2
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
>     @     0x7f21113ac18c
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     @     0x7f21113ac13c
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
>     @     0x7f21113ac0ce
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
>     @     0x7f21113ac025
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
>     @     0x7f21113abfbe
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
>     @     0x7f210ce3f220  (unknown)
>     @     0x7f210d099dc5  start_thread
>     @     0x7f210c5a721d  __clone
> *Aborted (core dumped)*
>
>

mesos-master v0.26 crashes for quorum 0

Posted by "Mehrotra, Ashish" <As...@emc.com>.
Hi All,

I am running CentOS 7.1, ZooKeeper 3.4.7, and Mesos 0.26.0. After starting ZooKeeper, I tried to start mesos-master with quorum 0 (everything running on the same machine, not in local mode but as a distributed setup), and the master crashed.
This happened immediately after a fresh install.
When I changed to quorum=1, the mesos-master ran fine and the slave could connect.
On restarting the mesos-master afterwards, there was no issue; it was seen only the very first time.
The error stack trace is hard to make sense of.

Has anyone seen this issue before?
The error log was:

[root@abc123 build]# ./bin/mesos-master.sh --ip=10.10.10.118 --work_dir=/var/lib/mesos --zk=zk://10.10.10.118:2181/mesos --quorum=0
I1229 13:41:24.925851  3345 main.cpp:232] Build: 2015-12-29 12:29:36 by root
I1229 13:41:24.925983  3345 main.cpp:234] Version: 0.26.0
I1229 13:41:24.929131  3345 main.cpp:255] Using 'HierarchicalDRF' allocator
I1229 13:41:24.953929  3345 leveldb.cpp:176] Opened db in 24.529078ms
I1229 13:41:24.955523  3345 leveldb.cpp:183] Compacted db in 1.525191ms
I1229 13:41:24.955688  3345 leveldb.cpp:198] Created db iterator in 107413ns
I1229 13:41:24.955724  3345 leveldb.cpp:204] Seeked to beginning of db in 4553ns
I1229 13:41:24.955737  3345 leveldb.cpp:273] Iterated through 0 keys in the db in 224ns
I1229 13:41:24.956120  3345 replica.cpp:780] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1229 13:41:24.961802  3345 main.cpp:464] Starting Mesos master
I1229 13:41:24.965438  3345 master.cpp:367] Master a38658f7-89c1-4b1f-84f9-5796234b2104 (localhost) started on 10.10.10.118:5050
I1229 13:41:24.965459  3345 master.cpp:369] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --ip="10.10.10.118" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="0" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/home/admin/mesos/build/../src/webui" --work_dir="/var/lib/mesos" --zk="zk://10.10.10.118:2181/mesos" --zk_session_timeout="10secs"
I1229 13:41:24.965761  3345 master.cpp:416] Master allowing unauthenticated frameworks to register
I1229 13:41:24.965772  3345 master.cpp:421] Master allowing unauthenticated slaves to register
I1229 13:41:24.965837  3345 master.cpp:458] Using default 'crammd5' authenticator
W1229 13:41:24.965867  3345 authenticator.cpp:513] No credentials provided, authentication requests will be refused
I1229 13:41:24.965881  3345 authenticator.cpp:520] Initializing server SASL
I1229 13:41:24.966788  3364 log.cpp:238] Attempting to join replica to ZooKeeper group
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@716: Client environment:host.name=abc.def.com
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/admin/mesos/build
2015-12-29 13:41:24,970:3345(0x7f21042f3700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000 watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null> context=0x7f20fc0038c0 flags=0
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@716: Client environment:host.name=abc.def.com
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/admin/mesos/build
2015-12-29 13:41:24,970:3345(0x7f2103af2700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000 watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null> context=0x7f20fc003a70 flags=0
I1229 13:41:24.971629  3360 recover.cpp:449] Starting replica recovery
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@716: Client environment:host.name=abc.def.com
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/admin/mesos/build
2015-12-29 13:41:24,972:3345(0x7f21062f7700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000 watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null> context=0x7f20fc0078b0 flags=0
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@716: Client environment:host.name=abc.def.com
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/admin/mesos/build
2015-12-29 13:41:24,972:3345(0x7f2105af6700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.10.10.118:2181 sessionTimeout=10000 watcher=0x7f2110e3f16c sessionId=0 sessionPasswd=<null> context=0x7f20fc007f00 flags=0
I1229 13:41:24.973780  3362 recover.cpp:475] Replica is in EMPTY status
I1229 13:41:24.979076  3362 replica.cpp:676] Replica in EMPTY status received a broadcasted recover request from (4)@10.10.10.118:5050
I1229 13:41:24.979863  3362 recover.cpp:195] Received a recover response from a replica in EMPTY status
F1229 13:41:24.980000  3362 recover.cpp:219] CHECK_SOME(lowestBeginPosition): is NONE
*** Check failure stack trace: ***
    @     0x7f211143a6a2  google::LogMessage::Fail()
    @     0x7f211143a601  google::LogMessage::SendToLog()
2015-12-29 13:41:24,995:3345(0x7f20f37fe700):ZOO_INFO@check_events@1703: initiated connection to server [10.10.10.118:2181]
2015-12-29 13:41:25,004:3345(0x7f21008d9700):ZOO_INFO@check_events@1703: initiated connection to server [10.10.10.118:2181]
2015-12-29 13:41:25,004:3345(0x7f21018db700):ZOO_INFO@check_events@1703: initiated connection to server [10.10.10.118:2181]
2015-12-29 13:41:25,004:3345(0x7f20f27fc700):ZOO_INFO@check_events@1703: initiated connection to server [10.10.10.118:2181]
    @     0x7f211143a012  google::LogMessage::Flush()
    @     0x7f211143cd46  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f211056d44c  _CheckFatal::~_CheckFatal()
    @     0x7f211125a243  mesos::internal::log::RecoverProtocolProcess::received()
    @     0x7f2111265ae6  _ZZN7process8dispatchI6OptionIN5mesos8internal3log15RecoverResponseEENS4_22RecoverProtocolProcessERKNS_6FutureIS5_EES9_EENS8_IT_EERKNS_3PIDIT0_EEMSF_FSD_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESO_
    @     0x7f211127c5d5  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI6OptionIN5mesos8internal3log15RecoverResponseEENS8_22RecoverProtocolProcessERKNS0_6FutureIS9_EESD_EENSC_IT_EERKNS0_3PIDIT0_EEMSJ_FSH_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
    @     0x7f21113c0d7d  std::function<>::operator()()
    @     0x7f21113a8b95  process::ProcessBase::visit()
    @     0x7f21113ac960  process::DispatchEvent::visit()
    @           0x471dd8  process::ProcessBase::serve()
I1229 13:41:25.136451  3366 contender.cpp:149] Joining the ZK group
    @     0x7f21113a4f81  process::ProcessManager::resume()
    @     0x7f21113a21b2  _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @     0x7f21113ac18c  _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f21113ac13c  _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @     0x7f21113ac0ce  _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7f21113ac025  _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
    @     0x7f21113abfbe  _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
    @     0x7f210ce3f220  (unknown)
    @     0x7f210d099dc5  start_thread
    @     0x7f210c5a721d  __clone
Aborted (core dumped)