Posted to user@zookeeper.apache.org by Manosiz Bhattacharyya <ma...@gmail.com> on 2012/01/18 22:26:44 UTC

Timeouts and ping handling

Hello,

 We are using ZooKeeper 3.3.4 with client session timeouts of 5 seconds,
and we see frequent timeouts. We have a cluster of 50 nodes (3 of which are
ZK nodes) and each node has 5 client connections (a total of 250 connections
to the ensemble). While investigating the ZooKeeper connections, we found
that sometimes pings sent from the ZooKeeper client do not return from
the server within 5 seconds, and the client connection gets disconnected.
Digging deeper, it seems that pings are enqueued the same way as other
requests in the three-stage request processing pipeline (prep, sync,
finalize) in the ZK server. So if there are a lot of write operations from
other active sessions in front of a ping from an inactive session in the
queues, the inactive session could time out.

My question is whether the server can answer the ping request immediately,
as the purpose of the ping seems to be to act as a heartbeat from relatively
inactive sessions. If we kept a separate ping queue in the prep phase that
forwards pings straight to the finalize phase, requests ahead of the ping
that require I/O in the sync phase would not cause client timeouts. I assume
pings do not need to be ordered in the database. I did take a cursory look
at the code and thought it could be done. Would really appreciate an opinion
on this.

As an aside, I should mention that increasing the session timeout to 20
seconds did improve the situation significantly. However, as we are using
ZooKeeper to monitor the health of our components, increasing the timeout
means that we only learn of a component's death 20 seconds later. This is
something we would definitely like to avoid, and we would like to get back
to the 5-second timeout.
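
(For reference, the timeout a client actually gets is negotiated with the
server and clamped to the ensemble's minimum/maximum session timeouts -
by default 2x and 20x tickTime - so a 5-second request only takes effect
if the server configuration allows it. A minimal zoo.cfg sketch, with
placeholder values:

tickTime=2000
# defaults are 2*tickTime and 20*tickTime if not set explicitly
minSessionTimeout=4000
maxSessionTimeout=40000
)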

Regards,
Manosiz.

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
On Wed, Jan 18, 2012 at 3:21 PM, Camille Fournier <cf...@renttherunway.com> wrote:
> Duh, I knew there was something I was forgetting. You can't process the
> session timeout faster than the server can process the full pipeline, so
> making pings come back faster just means you will have a false sense of
> liveness for your services.

There's also this - we only send HBs when the client is not active.
HBs check that the server is alive but at the same time we're also
letting the server know that we're alive.

However, when the client is active (sending read/write ops) we don't
need a HB. Any read/write operation serves as the HB. Say we send a
read operation to the server; we won't send another HB to the server
until the read operation result comes back (and then 1/3 of the timeout
after that). In this case you can't take advantage of the hack that's
been discussed. The read operation needs to complete; if it takes too
long (as in this case) the session will time out as usual. Now, if you
have clients that are largely inactive this may not matter too much,
but depending on the use case you might get caught by this.

Patrick

Re: Timeouts and ping handling

Posted by Camille Fournier <cf...@renttherunway.com>.
Duh, I knew there was something I was forgetting. You can't process the
session timeout faster than the server can process the full pipeline, so
making pings come back faster just means you will have a false sense of
liveness for your services.

The question about why the leaders and followers handle read-only requests
differently still stands, though.

C

On Wed, Jan 18, 2012 at 5:45 PM, Patrick Hunt <ph...@apache.org> wrote:

> On Wed, Jan 18, 2012 at 2:03 PM, Camille Fournier <ca...@apache.org>
> wrote:
> > I think it can be done. Looking through the code, it seems like it should
> > be safe modulo some stats that are set in the FinalRequestProcessor that
> > may be less useful.
> >
>
> Turning around HBs at the head end of the server is a bad idea. If the
> server can't support the timeout you requested then you are setting
> yourself up for trouble if you try to fake it. (think through some of
> the failure cases...)
>
> This is not something you want to do. Rather first look at some of the
> more obvious issues such as GC, then disk (I've seen ppl go to
> ramdisks in some cases), then OS/net tuning etc....
>
> Patrick
>

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
On Wed, Jan 18, 2012 at 2:03 PM, Camille Fournier <ca...@apache.org> wrote:
> I think it can be done. Looking through the code, it seems like it should
> be safe modulo some stats that are set in the FinalRequestProcessor that
> may be less useful.
>

Turning around HBs at the head end of the server is a bad idea. If the
server can't support the timeout you requested then you are setting
yourself up for trouble if you try to fake it. (think through some of
the failure cases...)

This is not something you want to do. Rather first look at some of the
more obvious issues such as GC, then disk (I've seen ppl go to
ramdisks in some cases), then OS/net tuning etc....
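
For example, GC logging can be turned on with the standard HotSpot flags,
in the same JVMFLAGS style used elsewhere in this thread (the log path
below is just a placeholder):

JVMFLAGS="$JVMFLAGS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
JVMFLAGS="$JVMFLAGS -Xloggc:/var/log/zookeeper/gc.log"

Any pause of a second or two in that log (with a 5 second session timeout)
would be enough to explain the disconnects.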

Patrick

Re: Timeouts and ping handling

Posted by Camille Fournier <ca...@apache.org>.
I think it can be done. Looking through the code, it seems like it should
be safe modulo some stats that are set in the FinalRequestProcessor that
may be less useful.

A question for the other zookeeper devs out there: is there a reason that
we handle read-only operations in the first processor differently on the
leader than on the followers? The leader (calling PrepRequestProcessor
first) will do a session check for any of the read-only requests:

    zks.sessionTracker.checkSession(request.sessionId, request.getOwner());

but the FollowerRequestProcessor will just push these requests to its
second processor, and never check the session. What's the purpose of the
session check on the leader but not the followers?

C

On Wed, Jan 18, 2012 at 4:26 PM, Manosiz Bhattacharyya
<ma...@gmail.com>wrote:

> Hello,
>
>  We are using Zookeeper-3.3.4 with client session timeouts of 5 seconds,
> and we see frequent timeouts. We have a cluster of 50 nodes (3 of which are
> ZK nodes) and each node has 5 client connections (a total of 250 connection
> to the Ensemble). While investigating the zookeeper connections, we found
> that sometimes pings sent from the zookeeper client does not return from
> the server within 5 seconds, and the client connection gets disconnected.
> Digging deeper it seems that pings are enqueued the same way as other
> requests in the three stage request processing pipeline (prep, sync,
> finalize) in zkserver. So if there are a lot of write operations from other
> active sessions in front of a ping from an inactive session in the queues,
> the inactive session could timeout.
>
> My question is whether we can return the ping request from the client
> immediately from the server, as the purpose of the ping request seems to be
> to treat it as an heartbeat from relatively inactive sessions. If we keep a
> separate ping queue in the Prep phase which forwards it straight to the
> finalize phase, possible requests before the ping which required I/O inside
> the sync phase would not cause the client timeouts. I hope pings do not
> generate any order in the database. I did take a cursory look at the code
> and thought that could be done. Would really appreciate an opinion
> regarding this.
>
> As an aside I should mention that increasing the session timeout to 20
> seconds did improved the problem significantly. However as we are using
> Zookeeper to monitor health of our components, increasing the timeout means
> that we only get to know a component's death 20 seconds later. This is
> something we would definitely try to avoid, and would like to go to the 5
> second timeout.
>
> Regards,
> Manosiz.
>

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
Thanks,
Manosiz.

On Thu, Jan 19, 2012 at 11:31 AM, Patrick Hunt <ph...@apache.org> wrote:

> See "preAllocSize"
>
> http://zookeeper.apache.org/doc/r3.4.2/zookeeperAdmin.html#sc_advancedConfiguration
>
> On Thu, Jan 19, 2012 at 10:49 AM, Manosiz Bhattacharyya
> <ma...@gmail.com> wrote:
> > Thanks a lot for this info. A pointer in the code to where you do this
> > preallocation or a flag to disable this would be very beneficial.
> >
> > On Thu, Jan 19, 2012 at 10:18 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> ZK does pretty much entirely sequential I/O.
> >>
> >> One thing that it does which might be very, very bad for SSD is that it
> >> pre-allocates disk extents in the log by writing a bunch of zeros.
>  This is
> >> to avoid directory updates as the log is written, but it doubles the
> load
> >> on the SSD.
> >>
> >> On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
> >> <ma...@gmail.com>wrote:
> >>
> >> > I do not think that there is a problem with the queue size. I guess
> the
> >> > problem is more with latency when the Fusion I/O goes in for a GC. We
> are
> >> > enabling stats on the Zookeeper and the fusion I/O to be more precise.
> >> Does
> >> > Zookeeper typically do only sequential I/O, or does it do some random
> >> too.
> >> > We could then move the logs to a disk.
> >> >
> >> > Thanks,
> >> > Manosiz.
> >> >
> >> > On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com>
> >> > wrote:
> >> >
> >> > > If you aren't pushing much data through ZK, there is almost no way
> that
> >> > the
> >> > > request queue can fill up without the log or snapshot disks being
> slow.
> >> > >  See what happens if you put the log into a real disk or (heaven
> help
> >> us)
> >> > > onto a tmpfs partition.
> >> > >
> >> > > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
> >> > > <ma...@gmail.com>wrote:
> >> > >
> >> > > > I will do as you mention.
> >> > > >
> >> > > > We are using the async API's throughout. Also we do not write too
> >> much
> >> > > data
> >> > > > into Zookeeper. We just use it for leadership elections and health
> >> > > > monitoring, which is why we see the timeouts typically on idle
> >> > zookeeper
> >> > > > connections.
> >> > > >
> >> > > > The reason why we want the sessions to be alive is because of the
> >> > > > leadership election algorithm that we use from the zookeeper
> recipe.
> >> So
> >> > > if
> >> > > > a connection is broken for the leader node, the ephemeral node
> that
> >> > > > guaranteed its leadership is lost, and reconnecting will create a
> new
> >> > > node
> >> > > > which does not guarantee leadership. We then have to re-elect a
> new
> >> > > leader
> >> > > > - which requires significant work. The bigger the timeout, bigger
> is
> >> > the
> >> > > > time the cluster stays without a master for a particular service,
> as
> >> > the
> >> > > > old master cannot keep on working once it has known its session is
> >> gone
> >> > > and
> >> > > > with it, its ephemeral node. As we are trying to have highly
> >> available
> >> > > > service (not internet scale, but at the scale of a storage system
> >> with
> >> > ms
> >> > > > latencies typically), we thought about reducing the timeout, but
> >> > keeping
> >> > > > the session open. Also note the node that typically is the master
> >> does
> >> > > not
> >> > > > write too often into zookeeper.
> >> > > >
> >> > > > Thanks,
> >> > > > Manosiz.
> >> > > >
> >> > > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org>
> >> > wrote:
> >> > > >
> >> > > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> >> > > > > <ma...@gmail.com> wrote:
> >> > > > > > Thanks Patrick for your answer,
> >> > > > >
> >> > > > > No problem.
> >> > > > >
> >> > > > > > Actually we are in a virtualized environment, we have a FIO
> disk
> >> > for
> >> > > > > > transactional logs. It does have some latency sometimes during
> >> FIO
> >> > > > > garbage
> >> > > > > > collection. We know this could be the potential issue, but was
> >> > trying
> >> > > > to
> >> > > > > > workaround that.
> >> > > > >
> >> > > > > Ah, I see. I saw something very similar to this recently with
> SSDs
> >> > > > > used for the datadir. The fdatasync latency was sometimes > 10
> >> > > > > seconds. I suspect it happened as a result of disk GC activity.
> >> > > > >
> >> > > > > I was able to identify the problem by running something like
> this:
> >> > > > >
> >> > > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o
> trace.txt
> >> > > > >
> >> > > > > and then graphing the results (log scale). You should try
> running
> >> > this
> >> > > > > against your servers to confirm that it is indeed the problem.
> >> > > > >
> >> > > > > > We were trying to qualify the requests into two types - either
> >> HB's
> >> > > or
> >> > > > > > normal requests. Isn't it better to reject normal requests if
> the
> >> > > queue
> >> > > > > > size is full to say a certain threshold, but keep the session
> >> > alive.
> >> > > > That
> >> > > > > > way the flow control can be achieved with the users session
> >> > retrying
> >> > > > the
> >> > > > > > operation, but the session health would be maintained.
> >> > > > >
> >> > > > > What good is a session (connection) that's not usable? You're
> >> better
> >> > > > > off disconnecting and re-establishing with a server that can
> >> process
> >> > > > > your requests in a timely fashion.
> >> > > > >
> >> > > > > ZK looks at availability from a service perspective, not from an
> >> > > > > individual session/connection perspective. The whole more
> important
> >> > > > > than the parts. There already is very sophisticated flow control
> >> > going
> >> > > > > on - e.g. the sessions shut down and stop reading requests when
> the
> >> > > > > number of outstanding requests on a server exceeds some
> threshold.
> >> > > > > Once the server catches up it starts reading again. Again -
> >> checkout
> >> > > > > your "stat" results for insight into this. (ie "outstanding
> >> > requests")
> >> > > > >
> >> > > > > Patrick
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
See "preAllocSize"
http://zookeeper.apache.org/doc/r3.4.2/zookeeperAdmin.html#sc_advancedConfiguration
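
preAllocSize is exposed as a Java system property (the value is in
kilobytes, default 64 MB), so it can be lowered on the server JVM in the
same JVMFLAGS style used elsewhere in this thread - 8192 below is just an
example value:

JVMFLAGS="$JVMFLAGS -Dzookeeper.preAllocSize=8192"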

On Thu, Jan 19, 2012 at 10:49 AM, Manosiz Bhattacharyya
<ma...@gmail.com> wrote:
> Thanks a lot for this info. A pointer in the code to where you do this
> preallocation or a flag to disable this would be very beneficial.
>
> On Thu, Jan 19, 2012 at 10:18 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> ZK does pretty much entirely sequential I/O.
>>
>> One thing that it does which might be very, very bad for SSD is that it
>> pre-allocates disk extents in the log by writing a bunch of zeros.  This is
>> to avoid directory updates as the log is written, but it doubles the load
>> on the SSD.
>>
>> On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
>> <ma...@gmail.com>wrote:
>>
>> > I do not think that there is a problem with the queue size. I guess the
>> > problem is more with latency when the Fusion I/O goes in for a GC. We are
>> > enabling stats on the Zookeeper and the fusion I/O to be more precise.
>> Does
>> > Zookeeper typically do only sequential I/O, or does it do some random
>> too.
>> > We could then move the logs to a disk.
>> >
>> > Thanks,
>> > Manosiz.
>> >
>> > On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > If you aren't pushing much data through ZK, there is almost no way that
>> > the
>> > > request queue can fill up without the log or snapshot disks being slow.
>> > >  See what happens if you put the log into a real disk or (heaven help
>> us)
>> > > onto a tmpfs partition.
>> > >
>> > > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
>> > > <ma...@gmail.com>wrote:
>> > >
>> > > > I will do as you mention.
>> > > >
>> > > > We are using the async API's throughout. Also we do not write too
>> much
>> > > data
>> > > > into Zookeeper. We just use it for leadership elections and health
>> > > > monitoring, which is why we see the timeouts typically on idle
>> > zookeeper
>> > > > connections.
>> > > >
>> > > > The reason why we want the sessions to be alive is because of the
>> > > > leadership election algorithm that we use from the zookeeper recipe.
>> So
>> > > if
>> > > > a connection is broken for the leader node, the ephemeral node that
>> > > > guaranteed its leadership is lost, and reconnecting will create a new
>> > > node
>> > > > which does not guarantee leadership. We then have to re-elect a new
>> > > leader
>> > > > - which requires significant work. The bigger the timeout, bigger is
>> > the
>> > > > time the cluster stays without a master for a particular service, as
>> > the
>> > > > old master cannot keep on working once it has known its session is
>> gone
>> > > and
>> > > > with it, its ephemeral node. As we are trying to have highly
>> available
>> > > > service (not internet scale, but at the scale of a storage system
>> with
>> > ms
>> > > > latencies typically), we thought about reducing the timeout, but
>> > keeping
>> > > > the session open. Also note the node that typically is the master
>> does
>> > > not
>> > > > write too often into zookeeper.
>> > > >
>> > > > Thanks,
>> > > > Manosiz.
>> > > >
>> > > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org>
>> > wrote:
>> > > >
>> > > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
>> > > > > <ma...@gmail.com> wrote:
>> > > > > > Thanks Patrick for your answer,
>> > > > >
>> > > > > No problem.
>> > > > >
>> > > > > > Actually we are in a virtualized environment, we have a FIO disk
>> > for
>> > > > > > transactional logs. It does have some latency sometimes during
>> FIO
>> > > > > garbage
>> > > > > > collection. We know this could be the potential issue, but was
>> > trying
>> > > > to
>> > > > > > workaround that.
>> > > > >
>> > > > > Ah, I see. I saw something very similar to this recently with SSDs
>> > > > > used for the datadir. The fdatasync latency was sometimes > 10
>> > > > > seconds. I suspect it happened as a result of disk GC activity.
>> > > > >
>> > > > > I was able to identify the problem by running something like this:
>> > > > >
>> > > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
>> > > > >
>> > > > > and then graphing the results (log scale). You should try running
>> > this
>> > > > > against your servers to confirm that it is indeed the problem.
>> > > > >
>> > > > > > We were trying to qualify the requests into two types - either
>> HB's
>> > > or
>> > > > > > normal requests. Isn't it better to reject normal requests if the
>> > > queue
>> > > > > > size is full to say a certain threshold, but keep the session
>> > alive.
>> > > > That
>> > > > > > way the flow control can be achieved with the users session
>> > retrying
>> > > > the
>> > > > > > operation, but the session health would be maintained.
>> > > > >
>> > > > > What good is a session (connection) that's not usable? You're
>> better
>> > > > > off disconnecting and re-establishing with a server that can
>> process
>> > > > > your requests in a timely fashion.
>> > > > >
>> > > > > ZK looks at availability from a service perspective, not from an
>> > > > > individual session/connection perspective. The whole more important
>> > > > > than the parts. There already is very sophisticated flow control
>> > going
>> > > > > on - e.g. the sessions shut down and stop reading requests when the
>> > > > > number of outstanding requests on a server exceeds some threshold.
>> > > > > Once the server catches up it starts reading again. Again -
>> checkout
>> > > > > your "stat" results for insight into this. (ie "outstanding
>> > requests")
>> > > > >
>> > > > > Patrick
>> > > > >
>> > > >
>> > >
>> >
>>

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
Thanks a lot for this info. A pointer in the code to where you do this
preallocation or a flag to disable this would be very beneficial.

On Thu, Jan 19, 2012 at 10:18 AM, Ted Dunning <te...@gmail.com> wrote:

> ZK does pretty much entirely sequential I/O.
>
> One thing that it does which might be very, very bad for SSD is that it
> pre-allocates disk extents in the log by writing a bunch of zeros.  This is
> to avoid directory updates as the log is written, but it doubles the load
> on the SSD.
>
> On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
> <ma...@gmail.com>wrote:
>
> > I do not think that there is a problem with the queue size. I guess the
> > problem is more with latency when the Fusion I/O goes in for a GC. We are
> > enabling stats on the Zookeeper and the fusion I/O to be more precise.
> Does
> > Zookeeper typically do only sequential I/O, or does it do some random
> too.
> > We could then move the logs to a disk.
> >
> > Thanks,
> > Manosiz.
> >
> > On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > If you aren't pushing much data through ZK, there is almost no way that
> > the
> > > request queue can fill up without the log or snapshot disks being slow.
> > >  See what happens if you put the log into a real disk or (heaven help
> us)
> > > onto a tmpfs partition.
> > >
> > > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
> > > <ma...@gmail.com>wrote:
> > >
> > > > I will do as you mention.
> > > >
> > > > We are using the async API's throughout. Also we do not write too
> much
> > > data
> > > > into Zookeeper. We just use it for leadership elections and health
> > > > monitoring, which is why we see the timeouts typically on idle
> > zookeeper
> > > > connections.
> > > >
> > > > The reason why we want the sessions to be alive is because of the
> > > > leadership election algorithm that we use from the zookeeper recipe.
> So
> > > if
> > > > a connection is broken for the leader node, the ephemeral node that
> > > > guaranteed its leadership is lost, and reconnecting will create a new
> > > node
> > > > which does not guarantee leadership. We then have to re-elect a new
> > > leader
> > > > - which requires significant work. The bigger the timeout, bigger is
> > the
> > > > time the cluster stays without a master for a particular service, as
> > the
> > > > old master cannot keep on working once it has known its session is
> gone
> > > and
> > > > with it, its ephemeral node. As we are trying to have highly
> available
> > > > service (not internet scale, but at the scale of a storage system
> with
> > ms
> > > > latencies typically), we thought about reducing the timeout, but
> > keeping
> > > > the session open. Also note the node that typically is the master
> does
> > > not
> > > > write too often into zookeeper.
> > > >
> > > > Thanks,
> > > > Manosiz.
> > > >
> > > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org>
> > wrote:
> > > >
> > > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> > > > > <ma...@gmail.com> wrote:
> > > > > > Thanks Patrick for your answer,
> > > > >
> > > > > No problem.
> > > > >
> > > > > > Actually we are in a virtualized environment, we have a FIO disk
> > for
> > > > > > transactional logs. It does have some latency sometimes during
> FIO
> > > > > garbage
> > > > > > collection. We know this could be the potential issue, but was
> > trying
> > > > to
> > > > > > workaround that.
> > > > >
> > > > > Ah, I see. I saw something very similar to this recently with SSDs
> > > > > used for the datadir. The fdatasync latency was sometimes > 10
> > > > > seconds. I suspect it happened as a result of disk GC activity.
> > > > >
> > > > > I was able to identify the problem by running something like this:
> > > > >
> > > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> > > > >
> > > > > and then graphing the results (log scale). You should try running
> > this
> > > > > against your servers to confirm that it is indeed the problem.
> > > > >
> > > > > > We were trying to qualify the requests into two types - either
> HB's
> > > or
> > > > > > normal requests. Isn't it better to reject normal requests if the
> > > queue
> > > > > > size is full to say a certain threshold, but keep the session
> > alive.
> > > > That
> > > > > > way the flow control can be achieved with the users session
> > retrying
> > > > the
> > > > > > operation, but the session health would be maintained.
> > > > >
> > > > > What good is a session (connection) that's not usable? You're
> better
> > > > > off disconnecting and re-establishing with a server that can
> process
> > > > > your requests in a timely fashion.
> > > > >
> > > > > ZK looks at availability from a service perspective, not from an
> > > > > individual session/connection perspective. The whole more important
> > > > > than the parts. There already is very sophisticated flow control
> > going
> > > > > on - e.g. the sessions shut down and stop reading requests when the
> > > > > number of outstanding requests on a server exceeds some threshold.
> > > > > Once the server catches up it starts reading again. Again -
> checkout
> > > > > your "stat" results for insight into this. (ie "outstanding
> > requests")
> > > > >
> > > > > Patrick
> > > > >
> > > >
> > >
> >
>

Re: Timeouts and ping handling

Posted by Ted Dunning <te...@gmail.com>.
ZK does pretty much entirely sequential I/O.

One thing that it does which might be very, very bad for SSD is that it
pre-allocates disk extents in the log by writing a bunch of zeros.  This is
to avoid directory updates as the log is written, but it doubles the load
on the SSD.

On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
<ma...@gmail.com>wrote:

> I do not think that there is a problem with the queue size. I guess the
> problem is more with latency when the Fusion I/O goes in for a GC. We are
> enabling stats on the Zookeeper and the fusion I/O to be more precise. Does
> Zookeeper typically do only sequential I/O, or does it do some random too.
> We could then move the logs to a disk.
>
> Thanks,
> Manosiz.
>
> On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > If you aren't pushing much data through ZK, there is almost no way that
> the
> > request queue can fill up without the log or snapshot disks being slow.
> >  See what happens if you put the log into a real disk or (heaven help us)
> > onto a tmpfs partition.
> >
> > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
> > <ma...@gmail.com>wrote:
> >
> > > I will do as you mention.
> > >
> > > We are using the async API's throughout. Also we do not write too much
> > data
> > > into Zookeeper. We just use it for leadership elections and health
> > > monitoring, which is why we see the timeouts typically on idle
> zookeeper
> > > connections.
> > >
> > > The reason why we want the sessions to be alive is because of the
> > > leadership election algorithm that we use from the zookeeper recipe. So
> > if
> > > a connection is broken for the leader node, the ephemeral node that
> > > guaranteed its leadership is lost, and reconnecting will create a new
> > node
> > > which does not guarantee leadership. We then have to re-elect a new
> > leader
> > > - which requires significant work. The bigger the timeout, bigger is
> the
> > > time the cluster stays without a master for a particular service, as
> the
> > > old master cannot keep on working once it has known its session is gone
> > and
> > > with it, its ephemeral node. As we are trying to have highly available
> > > service (not internet scale, but at the scale of a storage system with
> ms
> > > latencies typically), we thought about reducing the timeout, but
> keeping
> > > the session open. Also note the node that typically is the master does
> > not
> > > write too often into zookeeper.
> > >
> > > Thanks,
> > > Manosiz.
> > >
> > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org>
> wrote:
> > >
> > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> > > > <ma...@gmail.com> wrote:
> > > > > Thanks Patrick for your answer,
> > > >
> > > > No problem.
> > > >
> > > > > Actually we are in a virtualized environment, we have a FIO disk
> for
> > > > > transactional logs. It does have some latency sometimes during FIO
> > > > garbage
> > > > > collection. We know this could be the potential issue, but was
> trying
> > > to
> > > > > workaround that.
> > > >
> > > > Ah, I see. I saw something very similar to this recently with SSDs
> > > > used for the datadir. The fdatasync latency was sometimes > 10
> > > > seconds. I suspect it happened as a result of disk GC activity.
> > > >
> > > > I was able to identify the problem by running something like this:
> > > >
> > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> > > >
> > > > and then graphing the results (log scale). You should try running
> this
> > > > against your servers to confirm that it is indeed the problem.
> > > >
> > > > > We were trying to qualify the requests into two types - either HB's
> > or
> > > > > normal requests. Isn't it better to reject normal requests if the
> > queue
> > > > > size is full to say a certain threshold, but keep the session
> alive.
> > > That
> > > > > way the flow control can be achieved with the users session
> retrying
> > > the
> > > > > operation, but the session health would be maintained.
> > > >
> > > > What good is a session (connection) that's not usable? You're better
> > > > off disconnecting and re-establishing with a server that can process
> > > > your requests in a timely fashion.
> > > >
> > > > ZK looks at availability from a service perspective, not from an
> > > > individual session/connection perspective. The whole more important
> > > > than the parts. There already is very sophisticated flow control
> going
> > > > on - e.g. the sessions shut down and stop reading requests when the
> > > > number of outstanding requests on a server exceeds some threshold.
> > > > Once the server catches up it starts reading again. Again - checkout
> > > > your "stat" results for insight into this. (ie "outstanding
> requests")
> > > >
> > > > Patrick
> > > >
> > >
> >
>

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
We are using the ZooKeeper C client version 3.3.4, the same as the server.
We use libpthread-2.10.1.so, and no special time slicing in user code. Will
let you know what we find.

Thanks,
Manosiz.

On Thu, Jan 19, 2012 at 10:09 AM, Patrick Hunt <ph...@apache.org> wrote:

> On Thu, Jan 19, 2012 at 9:31 AM, Manosiz Bhattacharyya
> <ma...@gmail.com> wrote:
> > I do not think that there is a problem with the queue size. I guess the
> > problem is more with latency when the Fusion I/O goes in for a GC. We are
> > enabling stats on the Zookeeper and the fusion I/O to be more precise.
> Does
> > Zookeeper typically do only sequential I/O, or does it do some random
> too.
> > We could then move the logs to a disk.
>
> I was going to say what Ted said - it's odd to see such long pauses
> given you don't have GC issues and you are barely using the system.
> Your suspicion on disk may be correct.
>
> The server really just does sequential IO - it's writing the WAL for
> any changes and periodically taking the snapshot.
>
> Note that this could be an issue in ZK itself. The c client talking to
> the service using async operations with such low round trip
> expectations is not something we typically see or in particular test.
> It will be interesting to see the results of your further
> investigations.
>
> Btw, you are using c client - which version? the pthreads version or
> the version where you manage timeslicing yourself?
>
> Patrick
>

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
On Thu, Jan 19, 2012 at 9:31 AM, Manosiz Bhattacharyya
<ma...@gmail.com> wrote:
> I do not think that there is a problem with the queue size. I guess the
> problem is more with latency when the Fusion I/O goes in for a GC. We are
> enabling stats on the Zookeeper and the fusion I/O to be more precise. Does
> Zookeeper typically do only sequential I/O, or does it do some random too.
> We could then move the logs to a disk.

I was going to say what Ted said - it's odd to see such long pauses
given you don't have GC issues and you are barely using the system.
Your suspicion on disk may be correct.

The server really just does sequential IO - it's writing the WAL for
any changes and periodically taking the snapshot.

Note that this could be an issue in ZK itself. The c client talking to
the service using async operations with such low round trip
expectations is not something we typically see or in particular test.
It will be interesting to see the results of your further
investigations.

Btw, you are using c client - which version? the pthreads version or
the version where you manage timeslicing yourself?
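
A quick way to check which flavor an application is linked against is to
see whether it pulls in the multithreaded (libzookeeper_mt) or
single-threaded (libzookeeper_st) library - the binary path below is just
a placeholder:

ldd /path/to/your/app | grep zookeeper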

Patrick

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
I do not think that there is a problem with the queue size. I guess the
problem is more with latency when the Fusion I/O goes in for a GC. We are
enabling stats on ZooKeeper and the Fusion I/O to be more precise. Does
ZooKeeper typically do only sequential I/O, or does it do some random I/O
too? We could then move the logs to a disk.

Thanks,
Manosiz.

On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:

> If you aren't pushing much data through ZK, there is almost no way that the
> request queue can fill up without the log or snapshot disks being slow.
>  See what happens if you put the log into a real disk or (heaven help us)
> onto a tmpfs partition.
>
> On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
> <ma...@gmail.com>wrote:
>
> > I will do as you mention.
> >
> > We are using the async API's throughout. Also we do not write too much
> data
> > into Zookeeper. We just use it for leadership elections and health
> > monitoring, which is why we see the timeouts typically on idle zookeeper
> > connections.
> >
> > The reason why we want the sessions to be alive is because of the
> > leadership election algorithm that we use from the zookeeper recipe. So
> if
> > a connection is broken for the leader node, the ephemeral node that
> > guaranteed its leadership is lost, and reconnecting will create a new
> node
> > which does not guarantee leadership. We then have to re-elect a new
> leader
> > - which requires significant work. The bigger the timeout, bigger is the
> > time the cluster stays without a master for a particular service, as the
> > old master cannot keep on working once it has known its session is gone
> and
> > with it, its ephemeral node. As we are trying to have highly available
> > service (not internet scale, but at the scale of a storage system with ms
> > latencies typically), we thought about reducing the timeout, but keeping
> > the session open. Also note the node that typically is the master does
> not
> > write too often into zookeeper.
> >
> > Thanks,
> > Manosiz.
> >
> > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org> wrote:
> >
> > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> > > <ma...@gmail.com> wrote:
> > > > Thanks Patrick for your answer,
> > >
> > > No problem.
> > >
> > > > Actually we are in a virtualized environment, we have a FIO disk for
> > > > transactional logs. It does have some latency sometimes during FIO
> > > garbage
> > > > collection. We know this could be the potential issue, but was trying
> > to
> > > > workaround that.
> > >
> > > Ah, I see. I saw something very similar to this recently with SSDs
> > > used for the datadir. The fdatasync latency was sometimes > 10
> > > seconds. I suspect it happened as a result of disk GC activity.
> > >
> > > I was able to identify the problem by running something like this:
> > >
> > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> > >
> > > and then graphing the results (log scale). You should try running this
> > > against your servers to confirm that it is indeed the problem.
> > >
> > > > We were trying to qualify the requests into two types - either HB's
> or
> > > > normal requests. Isn't it better to reject normal requests if the
> queue
> > > > size is full to say a certain threshold, but keep the session alive.
> > That
> > > > way the flow control can be achieved with the users session retrying
> > the
> > > > operation, but the session health would be maintained.
> > >
> > > What good is a session (connection) that's not usable? You're better
> > > off disconnecting and re-establishing with a server that can process
> > > your requests in a timely fashion.
> > >
> > > ZK looks at availability from a service perspective, not from an
> > > individual session/connection perspective. The whole more important
> > > than the parts. There already is very sophisticated flow control going
> > > on - e.g. the sessions shut down and stop reading requests when the
> > > number of outstanding requests on a server exceeds some threshold.
> > > Once the server catches up it starts reading again. Again - checkout
> > > your "stat" results for insight into this. (ie "outstanding requests")
> > >
> > > Patrick
> > >
> >
>

Re: Timeouts and ping handling

Posted by Ted Dunning <te...@gmail.com>.
If you aren't pushing much data through ZK, there is almost no way that the
request queue can fill up without the log or snapshot disks being slow.
 See what happens if you put the log into a real disk or (heaven help us)
onto a tmpfs partition.
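
For a quick experiment along those lines, the transaction log directory
can be pointed at a tmpfs mount for a test run (paths and size below are
placeholders, and tmpfs gives no durability, so this is for testing only):

mkdir -p /mnt/zk-txnlog-test
mount -t tmpfs -o size=512m tmpfs /mnt/zk-txnlog-test
# then in zoo.cfg: dataLogDir=/mnt/zk-txnlog-test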

On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
<ma...@gmail.com>wrote:

> I will do as you mention.
>
> We are using the async API's throughout. Also we do not write too much data
> into Zookeeper. We just use it for leadership elections and health
> monitoring, which is why we see the timeouts typically on idle zookeeper
> connections.
>
> The reason why we want the sessions to be alive is because of the
> leadership election algorithm that we use from the zookeeper recipe. So if
> a connection is broken for the leader node, the ephemeral node that
> guaranteed its leadership is lost, and reconnecting will create a new node
> which does not guarantee leadership. We then have to re-elect a new leader
> - which requires significant work. The bigger the timeout, bigger is the
> time the cluster stays without a master for a particular service, as the
> old master cannot keep on working once it has known its session is gone and
> with it, its ephemeral node. As we are trying to have highly available
> service (not internet scale, but at the scale of a storage system with ms
> latencies typically), we thought about reducing the timeout, but keeping
> the session open. Also note the node that typically is the master does not
> write too often into zookeeper.
>
> Thanks,
> Manosiz.
>
> On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org> wrote:
>
> > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> > <ma...@gmail.com> wrote:
> > > Thanks Patrick for your answer,
> >
> > No problem.
> >
> > > Actually we are in a virtualized environment, we have a FIO disk for
> > > transactional logs. It does have some latency sometimes during FIO
> > garbage
> > > collection. We know this could be the potential issue, but was trying
> to
> > > workaround that.
> >
> > Ah, I see. I saw something very similar to this recently with SSDs
> > used for the datadir. The fdatasync latency was sometimes > 10
> > seconds. I suspect it happened as a result of disk GC activity.
> >
> > I was able to identify the problem by running something like this:
> >
> > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> >
> > and then graphing the results (log scale). You should try running this
> > against your servers to confirm that it is indeed the problem.
> >
> > > We were trying to qualify the requests into two types - either HB's or
> > > normal requests. Isn't it better to reject normal requests if the queue
> > > size is full to say a certain threshold, but keep the session alive.
> That
> > > way the flow control can be achieved with the users session retrying
> the
> > > operation, but the session health would be maintained.
> >
> > What good is a session (connection) that's not usable? You're better
> > off disconnecting and re-establishing with a server that can process
> > your requests in a timely fashion.
> >
> > ZK looks at availability from a service perspective, not from an
> > individual session/connection perspective. The whole more important
> > than the parts. There already is very sophisticated flow control going
> > on - e.g. the sessions shut down and stop reading requests when the
> > number of outstanding requests on a server exceeds some threshold.
> > Once the server catches up it starts reading again. Again - checkout
> > your "stat" results for insight into this. (ie "outstanding requests")
> >
> > Patrick
> >
>

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
I will do as you mention.

We are using the async APIs throughout. Also we do not write too much data
into ZooKeeper. We just use it for leadership elections and health
monitoring, which is why we see the timeouts typically on idle zookeeper
connections.

The reason we want the sessions to stay alive is the leadership election
algorithm that we use from the ZooKeeper recipes. If the connection is
broken for the leader node, the ephemeral node that guaranteed its
leadership is lost, and reconnecting will create a new node which does not
guarantee leadership. We then have to re-elect a new leader - which
requires significant work. The bigger the timeout, the longer the cluster
stays without a master for a particular service, since the old master
cannot keep working once it knows its session, and with it its ephemeral
node, is gone. As we are trying to build a highly available service (not
internet scale, but at the scale of a storage system with ms latencies
typically), we thought about reducing the timeout but keeping the session
open. Also note that the node that is typically the master does not write
too often into ZooKeeper.
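
The session/ephemeral coupling behind this is easy to see with the stock
CLI (the znode path and data below are just examples): in one zkCli.sh
session run

create -e /demo-leader "node1"

then close that session; a "stat /demo-leader" from a second session shows
the node is gone once the first session ends.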

Thanks,
Manosiz.

On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org> wrote:

> On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> <ma...@gmail.com> wrote:
> > Thanks Patrick for your answer,
>
> No problem.
>
> > Actually we are in a virtualized environment, we have a FIO disk for
> > transactional logs. It does have some latency sometimes during FIO
> garbage
> > collection. We know this could be the potential issue, but was trying to
> > workaround that.
>
> Ah, I see. I saw something very similar to this recently with SSDs
> used for the datadir. The fdatasync latency was sometimes > 10
> seconds. I suspect it happened as a result of disk GC activity.
>
> I was able to identify the problem by running something like this:
>
> sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
>
> and then graphing the results (log scale). You should try running this
> against your servers to confirm that it is indeed the problem.
>
> > We were trying to qualify the requests into two types - either HB's or
> > normal requests. Isn't it better to reject normal requests if the queue
> > size is full to say a certain threshold, but keep the session alive. That
> > way the flow control can be achieved with the users session retrying the
> > operation, but the session health would be maintained.
>
> What good is a session (connection) that's not usable? You're better
> off disconnecting and re-establishing with a server that can process
> your requests in a timely fashion.
>
> ZK looks at availability from a service perspective, not from an
> individual session/connection perspective. The whole more important
> than the parts. There already is very sophisticated flow control going
> on - e.g. the sessions shut down and stop reading requests when the
> number of outstanding requests on a server exceeds some threshold.
> Once the server catches up it starts reading again. Again - checkout
> your "stat" results for insight into this. (ie "outstanding requests")
>
> Patrick
>

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
<ma...@gmail.com> wrote:
> Thanks Patrick for your answer,

No problem.

> Actually we are in a virtualized environment, we have a FIO disk for
> transactional logs. It does have some latency sometimes during FIO garbage
> collection. We know this could be the potential issue, but was trying to
> workaround that.

Ah, I see. I saw something very similar to this recently with SSDs
used for the datadir. The fdatasync latency was sometimes > 10
seconds. I suspect it happened as a result of disk GC activity.

I was able to identify the problem by running something like this:

sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt

and then graphing the results (log scale). You should try running this
against your servers to confirm that it is indeed the problem.
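
For the graphing step, the per-call durations that "strace -T" appends in
angle brackets can be pulled out with something like this (file names are
just examples):

grep fdatasync trace.txt | sed -e 's/.*<//' -e 's/>.*//' > fdatasync_secs.txt

Values of a second or more in that file point straight at the log device.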

> We were trying to qualify the requests into two types - either HB's or
> normal requests. Isn't it better to reject normal requests if the queue
> size is full to say a certain threshold, but keep the session alive. That
> way the flow control can be achieved with the users session retrying the
> operation, but the session health would be maintained.

What good is a session (connection) that's not usable? You're better
off disconnecting and re-establishing with a server that can process
your requests in a timely fashion.

ZK looks at availability from a service perspective, not from an
individual session/connection perspective. The whole is more important
than the parts. There already is very sophisticated flow control going
on - e.g. the sessions shut down and stop reading requests when the
number of outstanding requests on a server exceeds some threshold.
Once the server catches up it starts reading again. Again - check out
your "stat" results for insight into this. (ie "outstanding requests")

Patrick

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
Yes.

On Wed, Jan 18, 2012 at 5:15 PM, Ted Dunning <te...@gmail.com> wrote:

> Does FIO stand for Fusion I/O?
>
> On Thu, Jan 19, 2012 at 12:47 AM, Manosiz Bhattacharyya
> <ma...@gmail.com>wrote:
>
> > ... we have a FIO disk
>

Re: Timeouts and ping handling

Posted by Ted Dunning <te...@gmail.com>.
Does FIO stand for Fusion I/O?

On Thu, Jan 19, 2012 at 12:47 AM, Manosiz Bhattacharyya
<ma...@gmail.com>wrote:

> ... we have a FIO disk

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
I was not suggesting that we should not detect the situation of a stuck
server. A watchdog of some sort keeping track of queue changes could also
suffice. Thanks for your input. I guess we will try to make it work by
increasing the timeout.

-- Manosiz.

On Wed, Jan 18, 2012 at 4:54 PM, Ted Dunning <te...@gmail.com> wrote:

> That really depends on whether you think that a stuck server is a problem.
>  The primary indication of that is a full queue and you are suggesting that
> we not detect this situation.  It isn't a matter of keeping the session
> alive ... it is a matter of whether or not we can guarantee that things are
> working.  By all appearances, they aren't and ZK is all about guarantees.
>
>
>
> On Thu, Jan 19, 2012 at 12:47 AM, Manosiz Bhattacharyya
> <ma...@gmail.com>wrote:
>
> > We were trying to qualify the requests into two types - either HB's or
> > normal requests. Isn't it better to reject normal requests if the queue
> > size is full to say a certain threshold, but keep the session alive. That
> > way the flow control can be achieved with the users session retrying the
> > operation, but the session health would be maintained.
> >
> >
>

Re: Timeouts and ping handling

Posted by Ted Dunning <te...@gmail.com>.
That really depends on whether you think that a stuck server is a problem.
 The primary indication of that is a full queue and you are suggesting that
we not detect this situation.  It isn't a matter of keeping the session
alive ... it is a matter of whether or not we can guarantee that things are
working.  By all appearances, they aren't and ZK is all about guarantees.



On Thu, Jan 19, 2012 at 12:47 AM, Manosiz Bhattacharyya
<ma...@gmail.com>wrote:

> We were trying to qualify the requests into two types - either HB's or
> normal requests. Isn't it better to reject normal requests if the queue
> size is full to say a certain threshold, but keep the session alive. That
> way the flow control can be achieved with the users session retrying the
> operation, but the session health would be maintained.
>
>

Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
Thanks Patrick for your answer,

Actually we are in a virtualized environment, and we have a FIO disk for
the transactional logs. It does have some latency sometimes during FIO
garbage collection. We know this could be the potential issue, but we were
trying to work around that.

We were trying to classify the requests into two types - either HBs or
normal requests. Isn't it better to reject normal requests once the queue
fills up past a certain threshold, but keep the session alive? That way
flow control is achieved by the user's session retrying the operation,
while the session health is maintained.

Regards,
Manosiz.

On Wed, Jan 18, 2012 at 2:53 PM, Patrick Hunt <ph...@apache.org> wrote:

> Next up is disk. (I'm assuming you're not running in a virtualized
> environment, correct?) You have a dedicated log device for the
> transactional logs? Check your disk latency and make sure that's not
> holding up the writes.
>
> What does "stat" show you wrt latency in general and at the time you
> see the issue on the client?
>
> You've looked through the troubleshooting guide?
> http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
>
> Patrick
>

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
Next up is disk. (I'm assuming you're not running in a virtualized
environment, correct?) Do you have a dedicated log device for the
transaction logs? Check your disk latency and make sure it's not
holding up the writes.
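
If you want a crude sanity check of the log device itself, something
along these lines (illustrative only -- the path is an assumption,
point it at your dataLogDir) will show whether fsyncs occasionally
stall:

// Crude, illustrative fsync latency probe -- not a benchmark, just
// enough to spot a device that stalls for tens or hundreds of ms.
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class FsyncCheck {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0]
                : "/path/to/datalog/fsync-test.bin";   // assumption: a file on your dataLogDir device
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        try {
            FileChannel ch = raf.getChannel();
            byte[] block = new byte[4096];
            for (int i = 0; i < 200; i++) {
                raf.write(block);
                long start = System.nanoTime();
                ch.force(false);   // roughly what the txn log does per synced batch
                long ms = (System.nanoTime() - start) / 1000000;
                if (ms > 50) {
                    System.out.println("slow fsync: " + ms + " ms");
                }
                Thread.sleep(10);
            }
        } finally {
            raf.close();
        }
    }
}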

What does "stat" show you wrt latency in general and at the time you
see the issue on the client?

You've looked through the troubleshooting guide?
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

Patrick


Re: Timeouts and ping handling

Posted by Manosiz Bhattacharyya <ma...@gmail.com>.
Thanks a lot for your response. We are running the C client, as all our
components are C++ applications. We are tracing GC on the server side, but
did not see much activity there. We did tune GC; our GC flags include the
following:

JVMFLAGS="$JVMFLAGS -XX:+UseParNewGC"
JVMFLAGS="$JVMFLAGS -XX:+UseConcMarkSweepGC"
JVMFLAGS="$JVMFLAGS -XX:+CMSParallelRemarkEnabled"
JVMFLAGS="$JVMFLAGS -XX:SurvivorRatio=8"
JVMFLAGS="$JVMFLAGS -XX:MaxTenuringThreshold=1"
JVMFLAGS="$JVMFLAGS -XX:CMSInitiatingOccupancyFraction=75"
JVMFLAGS="$JVMFLAGS -XX:+UseCMSInitiatingOccupancyOnly"
JVMFLAGS="$JVMFLAGS -XX:ParallelCMSThreads=1"

The JMX console shows that the old gen is not filling up at all - the new
gen is where pretty much all the activity is, and the verbose:gc output
only shows pause times of about 10-20 ms.


Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
Forgot to mention: use "stat" and some of the other four letter words to
get an idea of what your request latency looks like across servers. In
particular you can see the "max latency" value and correlate that with
what you're seeing on the clients and with GC (etc.) activity.
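
If it's easier to script than to telnet in, something along these
lines (a rough sketch; 2181 is just the default client port, adjust
for your setup) will dump the full stat output, including the
"Latency min/avg/max" line:

// Rough sketch: send the "stat" four letter word to one server and
// print the reply. Host/port defaults are assumptions.
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class ZkStat {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        Socket s = new Socket(host, port);
        try {
            OutputStream out = s.getOutputStream();
            out.write("stat".getBytes("US-ASCII"));
            out.flush();
            InputStream in = s.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        } finally {
            s.close();
        }
    }
}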

Patrick


Re: Timeouts and ping handling

Posted by Ted Dunning <te...@gmail.com>.
Monitor GC on *both* the ZK server and the client.  Either side can easily
cause a 1-2 second delay if misconfigured.

On Wed, Jan 18, 2012 at 10:34 PM, Patrick Hunt <ph...@apache.org> wrote:

> I suspect that you are being effected by GC pauses. Have you tuned the
> GC at all or just the defaults? Monitor the GC in the VM during
> operation and see if this is effecting you. At the very least you need
> to turn on parallel/CMS/incremental GC.
>
>

Re: Timeouts and ping handling

Posted by Patrick Hunt <ph...@apache.org>.
5 seconds is fairly low. HBs are sent by the client every 1/3 of the
timeout, with the expectation that it will get a response within another
1/3 of the timeout. If not, the client session will time out.

As a result, any blip of 1.5 sec or more between the client and server
could cause this to happen: network latency, OS latency, ZK server
latency, client latency, etc.
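
To put rough numbers on that for the 5 second timeout in question
(illustrative arithmetic only, not the exact client constants):

int sessionTimeoutMs = 5000;
int pingIntervalMs   = sessionTimeoutMs / 3;  // ~1.7s between heartbeats from an idle client
int responseBudgetMs = sessionTimeoutMs / 3;  // ~1.7s to hear a reply before the client drops the connection
// so a stall of roughly a third of the timeout anywhere in the path
// -- network, OS, server pipeline, GC -- eats the whole budget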

I suspect that you are being affected by GC pauses. Have you tuned the
GC at all, or are you just running the defaults? Monitor the GC in the VM
during operation and see if this is affecting you. At the very least you
need to turn on parallel/CMS/incremental GC.
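
If you want to watch the collectors programmatically (the same
counters the JMX console exposes), the platform MXBeans will do it;
for the ZK server you would read the same
java.lang:type=GarbageCollector beans over a remote JMX connection
rather than in-process:

// Prints cumulative collection counts/times for each collector in
// this JVM; illustrative only.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPeek {
    public static void main(String[] args) throws Exception {
        while (true) {
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + " count=" + gc.getCollectionCount()
                        + " timeMs=" + gc.getCollectionTime());
            }
            Thread.sleep(1000);
        }
    }
}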

Patrick


Re: Timeouts and ping handling

Posted by Camille Fournier <ca...@apache.org>.
I think it can be done. Looking through the code, it seems like it should
be safe modulo some stats that are set in the FinalRequestProcessor that
may be less useful.
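
Roughly, the shape of it inside PrepRequestProcessor might be
something like this (untested sketch: "finalProcessor" is a
hypothetical direct handle on the FinalRequestProcessor, which the
current processor chain does not expose, and the error/shutdown
handling is only gestured at):

// Untested sketch of short-circuiting pings past the sync stage;
// loosely mirrors the existing run()/pRequest() structure.
@Override
public void run() {
    try {
        while (true) {
            Request request = submittedRequests.take();
            if (request.type == OpCode.ping) {
                try {
                    // validate the session, as pRequest() does for
                    // reads, then hand the ping straight to the
                    // final stage
                    zks.sessionTracker.checkSession(request.sessionId,
                            request.getOwner());
                    finalProcessor.processRequest(request);   // hypothetical reference
                } catch (Exception e) {
                    // would need the same expired/moved-session
                    // handling that pRequest() applies today
                }
                continue;
            }
            pRequest(request);
        }
    } catch (Exception e) {
        // mirror the existing shutdown/interrupt handling
    }
}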

A question for the other zookeeper devs out there: is there a reason
that we handle read-only operations in the first processor differently
on the leader than on the followers? The leader (calling
PrepRequestProcessor first) will do a session check for any of the
read-only requests:

    zks.sessionTracker.checkSession(request.sessionId, request.getOwner());

but the FollowerRequestProcessor will just push these requests to its
second processor, and never check the session. What's the purpose of the
session check on the leader but not the followers?

C
