Posted to user@mesos.apache.org by Martin Stiborský <ma...@gmail.com> on 2015/04/27 10:58:44 UTC

zookeeper quorum failing because of high network load

Hello guys,
we are running a mesos stack on CoreOS, with three zookeeper nodes.

We can start Docker containers with Marathon and all, that's fine, but some
of the containers generate high network load while communicating between
nodes/containers, and I think that's the reason why ZooKeeper is failing.
From the logs, I can see this error:

Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper
server...
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,705
[myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream
exception
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: EndOfStreamException:
Unable to read additional data from client sessionid 0x14cf73508730003,
likely client has closed socket
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
java.lang.Thread.run(Thread.java:745)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,707
[myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connec
tion for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003

And then all ZK nodes go down… Mesos fails as well and that's it. The
cluster does eventually recover, but the running tasks are gone, not finished.
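A quick way to check the state of each quorum member is the "srvr"
four-letter word over the client port; the hostnames zk1..zk3 below are only
placeholders for the three ZooKeeper nodes:

  for h in zk1 zk2 zk3; do
    echo -n "$h: "
    echo srvr | nc "$h" 2181 | grep Mode || echo "no response"
  done

Each healthy member should report Mode: leader or Mode: follower.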

I have to say I don't have proper monitoring in place yet (I'm working on it
right now), so I can't rely on real data to prove this assumption, but it's
my guess.
So if you can confirm that this makes sense, or share your experiences with
me, that would be pretty valuable for me right now.

Thanks a lot!

Re: zookeeper quorum failing because of high network load

Posted by Martin Stiborský <ma...@gmail.com>.
I finally tracked down the real problem, and it's not related to Mesos at all.
It was fleet on CoreOS stopping all containers on a node, because the node
was considered unresponsive from the CoreOS/etcd/fleet cluster's point of
view.
The high CPU/network load caused the problem, and fleet decided to stop the
services on the node in order to run them on another node.
In retrospect it of course sounds like a pretty clear thing, and it's true
that I should have looked at the fleet log first, my bad.
The solution is a slight tuning of etcd and fleet parameters, like they
did here, for example:
https://github.com/deis/deis/pull/1689/files
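For illustration, a sketch of that kind of tuning in a CoreOS cloud-config
(the option names are the etcd 0.4 / fleet settings of that era; the values
below are only examples, not necessarily what the linked PR uses):

  #cloud-config
  coreos:
    etcd:
      peer-heartbeat-interval: 500   # ms between peer heartbeats, raised from the small default
      peer-election-timeout: 2500    # ms before a peer starts a new leader election
    fleet:
      agent_ttl: 120s                # how long a node may miss heartbeats before fleet reschedules its units

The idea is simply to give heavily loaded nodes more time to answer before
etcd/fleet declare them dead and start moving services around.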

Thanks a lot guys for your effort, it helped!

Re: zookeeper quorum failing because of high network load

Posted by Ondrej Smola <on...@gmail.com>.
Hi Martin,

Do all 3 ZooKeepers go down with the same error logs/cause? There should be
some info, as a single node failure should not cause ZK to fail (quorum is
maintained), and the remaining nodes should at least show some info from the
failure detector.
The original log you posted is from after stopping ZooKeeper. I saw these
logs very frequently when I ran Apache Storm in local/devel mode and
terminated it from the IDE - I think they are due to forcibly stopping ZK
(from the timestamps there is a 30 second timeout) - but I never saw them in
production/non-local mode. The problem should be described in the log lines
before "systemd[1]: Stopping Zookeper server...". Could you please post the
preceding lines?
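For example, something along these lines should show the surrounding
context (the unit name is an assumption - use whatever unit actually wraps
your ZooKeeper container):

  journalctl -u zookeeper.service --since "2015-04-27 05:00:00" --until "2015-04-27 05:07:00"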

Is it specific only to the deployed application (db + app image) - are other
applications running OK?

Re: zookeeper quorum failing because of high network load

Posted by Martin Stiborský <ma...@gmail.com>.
Hi guys,
these machines are relatively beefy - Dell PowerEdge R710 with 2x quad-core
Xeon and 144 GB RAM; CoreOS is deployed on bare metal.
- ZK is running on the same 3 nodes as the Mesos cluster
- our application is not using ZK
- nothing else is running on the stack, only 1 Mesos master, 3 Mesos slaves
  and Marathon, all of this on top of CoreOS booted over the network via iPXE
- ZK log is not on a dedicated disk; I can put it on an NFS share (see the
  zoo.cfg sketch below)
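A minimal zoo.cfg sketch of the dedicated-disk layout (paths are
hypothetical); note that the ZooKeeper admin guide linked later in this
thread recommends a dedicated local device for the transaction log, so an
NFS share would likely add write latency rather than remove it:

  # zoo.cfg
  dataDir=/var/lib/zookeeper   # snapshots
  dataLogDir=/zk-txlog         # transaction log on its own local disk, not NFS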

The pattern is always the same. We start the first container on the first
node (it's a database), then we run the second container with our
application on the second cluster node; the application loads data from the
database container on the first node, and after about 6 minutes the stack
goes down.

If we run both containers on the same node, it's fine. That's why I tend to
blame the network, but I can't find the problem.
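A minimal sketch for narrowing that down while reproducing the failure -
watch the raw network counters on each node and, at the same time, the etcd
and fleet logs for leader elections or nodes being marked unresponsive (unit
names assume the stock CoreOS units):

  # per-interface byte/packet counters, sampled every 5 s
  watch -n 5 'cat /proc/net/dev'
  # follow etcd and fleet together
  journalctl -u etcd -u fleet -f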

Re: zookeeper quorum failing because of high network load

Posted by Charles Baker <cn...@gmail.com>.
Hi Martin. Are these VMs or bare metal? Is ZK running on the same 3 nodes
as the Mesos cluster? Does your application also use ZooKeeper to manage
its own state? Are there any other services running on the machines, and do
Mesos and ZK have enough resources? And, as Tomas asked, is your ZK log
on a dedicated disk?
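A rough way to answer the resource question on each node, assuming Docker
1.5+ (for docker stats) and a stock systemd install:

  docker stats $(docker ps -q)   # live CPU / memory usage per running container
  systemd-cgtop                  # resource usage per systemd unit / cgroup
  free -m; uptime                # overall memory and load average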

Re: zookeeper quorum failing because of high network load

Posted by Martin Stiborský <ma...@gmail.com>.
Hi,
there are 3 ZooKeeper nodes.
We've started our containers, and this time I was watching the ZooKeepers
and their condition with the "stat" command.
It seems that ZooKeeper latency is not the issue; there were only about 8
connections, with a max latency of 134 ms.
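For reference, the "stat" output (and, on ZooKeeper 3.4+, the more
machine-readable "mntr" output) can be pulled over the client port:

  echo stat | nc 127.0.0.1 2181   # connections, min/avg/max latency, mode
  echo mntr | nc 127.0.0.1 2181   # key/value metrics (ZooKeeper 3.4+)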

I'm still not sure what the real cause is here… from the mesos-master log I
see normal behaviour and then suddenly:
Apr 27 18:02:37 systemd[1]: mesos-master@1.service: main process exited,
code=exited, status=137/n/a
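Exit status 137 is 128 + 9, i.e. the mesos-master process was killed with
SIGKILL - typically either the kernel OOM killer or a forced stop. A quick
check on the affected node, for example:

  journalctl -k | grep -i -e oom -e 'killed process'   # did the OOM killer fire?
  journalctl -u mesos-master@1.service --since 18:00   # what happened around the exit?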

If we run all our containers on one mesos-slave node, it works, but when
they are distributed to three nodes, it fails.

Re: zookeeper quorum failing because of high network load

Posted by Tomas Barton <ba...@gmail.com>.
Hi Martin,

How many ZooKeepers do you have? Is your transaction log on a dedicated
disk? Approximately how many clients are connecting?
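An easy way to get a rough client count per node is the "cons" four-letter
word, which prints one line per open client connection:

  echo cons | nc 127.0.0.1 2181 | wc -l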

Have a look at
http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices

Tomas
