Posted to user@zookeeper.apache.org by Akmal Abbasov <ak...@icloud.com> on 2015/11/17 14:10:59 UTC

Transaction timeouts

Hi, I’m seeing a lot of `Closing connection to peer due to transaction timeout` messages in the zk logs, on all zk servers.
Is this transaction timeout configured through syncLimit in the zk config file?
Also, does the zk server need to be restarted in order to update this config?
Thank you.

Regards, Akmal

Re: Transaction timeouts

Posted by Akmal Abbasov <ak...@icloud.com>.

> On 17 Nov 2015, at 21:34, Raúl Gutiérrez Segalés <rg...@itevenworks.net> wrote:
> 
> On 17 November 2015 at 12:13, Akmal Abbasov <ak...@icloud.com>
> wrote:
> 
>> Hi Raul,
>> Thank you for your response.
>> I am running zookeeper with -Xms512m -Xmx1g options; is this enough?
>> 
> 
> It depends on your workload. How many writes/reads per sec are you
> expecting/seeing? Are you seeing long
> GC pauses? If so, you'll need more mem or bigger tick times, otherwise
> you'll miss the deadlines for the
> pings (both among learners and to clients…)
> 
Where can I find this information, in particular the read/write rates?
This is the output of the stat command:
Server 1
Latency min/avg/max: 0/66/5212
Received: 8722
Sent: 8694
Connections: 19
Outstanding: 0
Zxid: 0xa9600002ef2
Mode: follower
Node count: 479

Server 2 
Latency min/avg/max: 0/70/5252
Received: 8228
Sent: 8203
Connections: 16
Outstanding: 0
Zxid: 0xa9600002e12
Mode: leader
Node count: 479

Server 3
Latency min/avg/max: 0/0/1
Received: 140
Sent: 139
Connections: 2
Outstanding: 0
Zxid: 0xa9600002bf8
Mode: follower
Node count: 479
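
To compare the servers side by side, the stat output above can be parsed into key/value pairs. This is only a minimal Python sketch: the `parse_stat` and `latency_ms` helpers and the `SERVER_1`/`SERVER_3` names are hypothetical, and the sample text is copied from the output above. Note that Received and Sent are cumulative packet counters, so they hint at relative load between servers but are not a reads/writes-per-second figure.

```python
def parse_stat(text):
    """Parse the 'Key: value' lines of stat output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not have a 'Key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def latency_ms(fields):
    """Split the 'min/avg/max' latency field into three integers."""
    lo, avg, hi = fields["Latency min/avg/max"].split("/")
    return int(lo), int(avg), int(hi)


# Sample text copied from the stat output quoted in this message.
SERVER_1 = """Latency min/avg/max: 0/66/5212
Received: 8722
Sent: 8694
Connections: 19
Outstanding: 0
Zxid: 0xa9600002ef2
Mode: follower
Node count: 479"""

SERVER_3 = """Latency min/avg/max: 0/0/1
Received: 140
Sent: 139
Connections: 2
Outstanding: 0
Zxid: 0xa9600002bf8
Mode: follower
Node count: 479"""

for name, raw in (("Server 1", SERVER_1), ("Server 3", SERVER_3)):
    stats = parse_stat(raw)
    print(name, stats["Mode"],
          "received:", int(stats["Received"]),
          "connections:", int(stats["Connections"]),
          "max latency:", latency_ms(stats)[2])
```

Run against all three servers, this makes the imbalance visible at a glance: Server 3 has 2 connections and 140 received packets, against 19 connections and 8722 packets on Server 1.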

All the servers have the same configs.
Is -Xms512m -Xmx1g enough to handle my workload?
Moreover, I see that the load is not evenly distributed. Is this something that should be tuned manually,
or is there something like the hbase/hdfs balancer that will take care of it?

> 
>> Regarding the network, all of the zk server nodes are hosted in the
>> cloud, in the same dc.
>> But according to the zk troubleshooting guide, the timeout should be
>> increased for cloud environments.
>> 
> 
> Yup, latency can be unpredictable in the cloud…
> 
> 
>> One more thing is that, I’m seeing a lot of
>> fsync-ing the write ahead log in SyncThread:1 took 2962ms which will
>> adversely effect operation latency. See the ZooKeeper troubleshooting guide
>> messages in the logs.
>> 
> 
> That definitely looks bad and will block everything else. What type of disc
> are you writing your logs and snapshots to? Are they
> separate volumes?
I’m using separate disks for logs and data, but they’re HDDs, not SSDs.
So my assumption 

I’ve tried to understand what is actually happening; here is a summary of the logs:
08:22:08,201	Transaction timeout
08:22:08,596 - 08:22:25,441	ZookeeperServer not running
08:22:24,927	New election
Everything starts with the ‘Transaction timeout’ on the leader, which causes ‘Exception when following the leader’ on the learners.
Then all zookeeper processes shut down, a new election takes place, and the zookeeper processes start up again.
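
To put numbers on the timeline above, the gaps between the logged timestamps can be computed directly. This is a small Python sketch, assuming all three timestamps fall on the same day and use the `HH:MM:SS,mmm` format shown in the summary; the `seconds_between` helper is a hypothetical name.

```python
from datetime import datetime

# Timestamp format used in the log summary, e.g. "08:22:08,201".
FMT = "%H:%M:%S,%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two same-day log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Values copied from the summary above.
timeout_at = "08:22:08,201"
not_running_from = "08:22:08,596"
not_running_until = "08:22:25,441"
election_at = "08:22:24,927"

print("timeout -> not running: %.3f s" % seconds_between(timeout_at, not_running_from))
print("not-running window:     %.3f s" % seconds_between(not_running_from, not_running_until))
print("timeout -> new election: %.3f s" % seconds_between(timeout_at, election_at))
```

By this arithmetic the servers report "not running" for roughly 16.8 seconds, starting about 0.4 seconds after the transaction timeout.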

And one more thing: what’s the best way to update the configs without downtime?
Thank you.

Regards, Akmal

Re: Transaction timeouts

Posted by Raúl Gutiérrez Segalés <rg...@itevenworks.net>.

On 17 November 2015 at 12:13, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi Raul,
> Thank you for your response.
> I am running zookeeper with -Xms512m -Xmx1g options; is this enough?
>

It depends on your workload. How many writes/reads per sec are you
expecting/seeing? Are you seeing long
GC pauses? If so, you'll need more mem or bigger tick times, otherwise
you'll miss the deadlines for the
pings (both among learners and to clients...).

> Regarding the network, all of the zk server nodes are hosted in the
> cloud, in the same dc.
> But according to the zk troubleshooting guide, the timeout should be
> increased for cloud environments.
>

Yup, latency can be unpredictable in the cloud...

> One more thing is that, I’m seeing a lot of
> fsync-ing the write ahead log in SyncThread:1 took 2962ms which will
> adversely effect operation latency. See the ZooKeeper troubleshooting guide
> messages in the logs.
>

That definitely looks bad and will block everything else. What type of disc
are you writing your logs and snapshots to? Are they
separate volumes?

-rgs

Re: Transaction timeouts

Posted by Akmal Abbasov <ak...@icloud.com>.

Hi Raul,
Thank you for your response.
I am running zookeeper with -Xms512m -Xmx1g options; is this enough?
Regarding the network, all of the zk server nodes are hosted in the cloud, in the same dc.
But according to the zk troubleshooting guide, the timeout should be increased for cloud environments.
One more thing: I’m seeing a lot of
`fsync-ing the write ahead log in SyncThread:1 took 2962ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide`
messages in the logs.
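
To see how often and how badly the fsync warning quoted above occurs, the durations can be pulled out of the log lines with a regular expression. This is a rough Python sketch: the pattern is modelled only on the single message quoted here, and the `slow_fsyncs` helper and its `threshold_ms` parameter are hypothetical names chosen for illustration.

```python
import re

# Pattern modelled on the warning quoted in this message; it captures the
# thread name (e.g. "SyncThread:1") and the fsync duration in milliseconds.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in (\S+) took (\d+)ms")


def slow_fsyncs(lines, threshold_ms=1000):
    """Return (thread, duration_ms) for each fsync warning at or above the threshold."""
    hits = []
    for line in lines:
        match = FSYNC_RE.search(line)
        if match and int(match.group(2)) >= threshold_ms:
            hits.append((match.group(1), int(match.group(2))))
    return hits


# Sample line copied from the message quoted above.
sample = [
    "fsync-ing the write ahead log in SyncThread:1 took 2962ms which will "
    "adversely effect operation latency. See the ZooKeeper troubleshooting guide",
]

print(slow_fsyncs(sample))
```

In practice `lines` would be an open log file rather than a list; the path to that file depends on the local logging setup and is not given in this thread.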
Thank you.

Regards, Akmal

> On 17 Nov 2015, at 19:10, Raúl Gutiérrez Segalés <rg...@itevenworks.net> wrote:
> 
> Hi,
> 
> On 17 November 2015 at 05:10, Akmal Abbasov <ak...@icloud.com>
> wrote:
> 
>> Hi, I’m seeing a lot of `Closing connection to peer due to transaction
>> timeout` messages in zk logs, in all zk servers.
>> Is this transaction timeout configured through syncLimit in the zk config file?
>> 
> 
> That message comes from LearnerHandler#ping() [0], and the frequency of
> pings from the leader to learners
> is twice a tick [1]. So if your tickTime is 2000ms (the default), you are
> pinging the learners every second. You could
> adjust the tickTime and see if it gets better. But I suspect something else
> (GC-ing? noisy network?) is going on, given that it
> shouldn't be that hard for the leader and learners to keep up with 1 ping
> every sec.
> 
> You can check ZAB messages (e.g.: pings, acks, commits, proposals, etc.)
> between the leader and learners using zktraffic's
> zk-dump [2].
> 
> 
>> Also does zk server need to be restarted in order to update this config?
>> 
> 
> yes.
> 
> 
> -rgs
> 
> [0]
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L923
> [1]
> https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java#L549
> [2] https://github.com/twitter/zktraffic

Re: Transaction timeouts

Posted by Raúl Gutiérrez Segalés <rg...@itevenworks.net>.

Hi,

On 17 November 2015 at 05:10, Akmal Abbasov <ak...@icloud.com>
wrote:

> Hi, I’m seeing a lot of `Closing connection to peer due to transaction
> timeout` messages in zk logs, in all zk servers.
> Is this transaction timeout configured through syncLimit in the zk config file?
>

That message comes from LearnerHandler#ping() [0], and the frequency of
pings from the leader to learners
is twice a tick [1]. So if your tickTime is 2000ms (the default), you are
pinging the learners every second. You could
adjust the tickTime and see if it gets better. But I suspect something else
(GC-ing? noisy network?) is going on, given that it
shouldn't be that hard for the leader and learners to keep up with 1 ping
every sec.
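
The arithmetic behind "twice a tick" can be sketched as follows. This is a minimal Python illustration, not ZooKeeper code: the function names are hypothetical, and the syncLimit value of 5 is just an example value, not one taken from this thread. The second function uses the documented meaning of syncLimit (a number of ticks allowed for followers to sync); whether that is exactly the window enforced by the "transaction timeout" check is an assumption here.

```python
def ping_interval_ms(tick_time_ms):
    """Leader pings learners twice per tick, so the interval is half a tick."""
    return tick_time_ms / 2


def sync_window_ms(tick_time_ms, sync_limit):
    """syncLimit is expressed in ticks; convert it to milliseconds."""
    return tick_time_ms * sync_limit


tick_time_ms = 2000  # the default tickTime mentioned above
sync_limit = 5       # example value only (assumption, not from this thread)

print("ping interval (ms):", ping_interval_ms(tick_time_ms))
print("sync window (ms):", sync_window_ms(tick_time_ms, sync_limit))
```

With a 2000ms tick this gives one ping every 1000ms, which matches the "1 ping every sec" figure above; raising tickTime stretches both values proportionally.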

You can check ZAB messages (e.g.: pings, acks, commits, proposals, etc.)
between the leader and learners using zktraffic's
zk-dump [2].

> Also does zk server need to be restarted in order to update this config?
>

yes.

-rgs

[0]
https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L923
[1]
https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java#L549
[2] https://github.com/twitter/zktraffic