Posted to user@zookeeper.apache.org by Chang Song <tr...@me.com> on 2011/04/13 15:35:29 UTC

Serious problem processing heartbeat on login stampede

Hello, folks.

We have run into a very serious issue with ZooKeeper.
Here's a brief scenario.

We have some ZooKeeper clients with a session timeout of 15 sec (and thus a 5 sec ping
interval); let's call these clients group A.
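
(For reference, a minimal sketch of such a client -- hostnames are hypothetical; the Java client heartbeats on its own at roughly a third of the session timeout, which is where the 5 sec figure comes from:)

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class GroupAClient {
        public static void main(String[] args) throws Exception {
            // 15000 ms session timeout; the client then pings on its own,
            // at roughly sessionTimeout/3, i.e. about every 5 seconds here.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            System.out.println("event: " + event);
                        }
                    });
            // ... normal application work against zk ...
            zk.close();
        }
    }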

Now 1000 new clients (let's call these group B) start up at the same time and try to
connect to a three-node ZK ensemble, creating a ZK createSession stampede.

Now almost all clients in group A are unable to exchange pings within the session expiry time (15 sec),
so the clients in group A drop out of the cluster.
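
(A rough sketch of how the group B stampede can be simulated against a test ensemble -- hostnames and timings are hypothetical:)

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class LoginStampede {
        public static void main(String[] args) throws Exception {
            final CountDownLatch gate = new CountDownLatch(1);
            for (int i = 0; i < 1000; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            gate.await(); // hold every thread at the gate
                            // all threads create their sessions at once
                            ZooKeeper zk = new ZooKeeper(
                                    "zk1:2181,zk2:2181,zk3:2181", 15000,
                                    new Watcher() {
                                        public void process(WatchedEvent e) { }
                                    });
                            Thread.sleep(60000); // keep the session alive a while
                            zk.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }).start();
            }
            gate.countDown(); // fire the createSession stampede
        }
    }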

We have looked into this issue a bit and found that session queue processing is mostly synchronous.
Latency between a ping request and its response ranges from 10 ms up to 14 seconds during this login stampede.

Since the session timeout is a serious matter for our cluster, pings should be handled in a pseudo-realtime fashion.

I don't know exactly how the ping timeout policy works in the clients and the server, but clients failing
to receive ping responses because of a ZooKeeper login stampede seems wrong to me.

Shouldn't we have a separate ping/heartbeat queue and thread?
Or even multiple ping queues/threads to keep heartbeats realtime?

This is a very serious issue with ZooKeeper for our mission-critical system. Could anyone
look into this?

I will try to file a bug.

Thank you.

Chang



Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
2011/4/14 Chang Song <tr...@me.com>

> You need to understand that most apps can tolerate delays in connect/close,
> but we cannot tolerate ping delays, since we are using the ZK heartbeat
> timeout as our sole failure detection.
>

What about using multiple ZK clusters for this, then?

But it really sounds like your ZK machines are misconfigured somehow.
Session start/stop isn't any more expensive than znode updates, and a small
ZK cluster can handle tens of thousands of those per second if set up
correctly.

Have you tested a cluster where the machines are set up correctly with
separate snapshot and log disks?
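
(For reference, the split is a zoo.cfg setting -- paths here are hypothetical:)

    # snapshots on one device
    dataDir=/disk1/zookeeper/data
    # transaction log on its own dedicated device
    dataLogDir=/disk2/zookeeper/txnlog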

Are your ZK machines doing any other tasks?


> We use 15 seconds (5 sec for each ensemble) for the session timeout, so an
> important server will drop out of the cluster even if it is not
> malfunctioning; in some cases it wreaks havoc on certain
> services.
>

Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Sure, I will.

thank you.

Chang


On Apr 15, 2011, at 7:16 AM, Benjamin Reed wrote:

> when you file the jira can you also note the logging level you are using?
> 
> thanx
> ben


Re: Serious problem processing heartbeat on login stampede

Posted by Benjamin Reed <br...@apache.org>.
when you file the jira can you also note the logging level you are using?

thanx
ben


Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Ted.
Please be patient.
I didn't say I won't post the data.

I am not doing the test myself. My team does.
I saw the iostat results when they ran the test.

I cannot cut-and-paste what I don't have.
I cannot force them to come in on weekends to do the testing.

And let me add: there is no magic in iostat.
The numbers are hard data that come from the device driver, calculated
via the utilization law ;)
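
(For reference, the utilization law mentioned here: U = X * S -- utilization equals throughput, in requests/sec, times mean service time per request -- which is consistent with how iostat's %util is computed from the device driver's counters.)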

Thanks.

Chang



On Apr 17, 2011, at 5:43 AM, Ted Dunning wrote:

> That isn't the issue.
> 
> The issue is that there is something here that is a mystery.  You aren't
> seeing the answer.  If you could, you would have seen it already and
> wouldn't have a question to ask.  If you want somebody else to see the
> answer, you need to show them the raw data and not just tell them your
> interpretation of the situation.  If you don't want to show the raw data
> because it is embarrassing to admit that you might not be seeing something,
> then you should just not ask the question.
> 
> It doesn't matter how much you know or how well you can interpret things
> normally.  You are adding your interpretation to the raw data and there is
> something in this interpretation or your description of the situation that
> is faulty.  The only way to get past that is to allow others to interpret
> the raw data.
> 
> This happens to everybody at some point.  New eyes see new things and
> sometimes that is just what is needed.
> 
> 2011/4/15 Chang Song <tr...@me.com>
> 
>> But please note that I used to be a kernel filesystem engineer, and I know
>> how to read iostat ;)
>> 


Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
That isn't the issue.

The issue is that there is something here that is a mystery.  You aren't
seeing the answer.  If you could, you would have seen it already and
wouldn't have a question to ask.  If you want somebody else to see the
answer, you need to show them the raw data and not just tell them your
interpretation of the situation.  If you don't want to show the raw data
because it is embarrassing to admit that you might not be seeing something,
then you should just not ask the question.

It doesn't matter how much you know or how well you can interpret things
normally.  You are adding your interpretation to the raw data and there is
something in this interpretation or your description of the situation that
is faulty.  The only way to get past that is to allow others to interpret
the raw data.

This happens to everybody at some point.  New eyes see new things and
sometimes that is just what is needed.

2011/4/15 Chang Song <tr...@me.com>

> But please note that I used to be a kernel filesystem engineer, and I know
> how to read iostat ;)
>

Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Clients in group B (1000 clients) didn't create any ephemeral nodes.
Only clients in group A (the ping-delayed ones) do, and each client creates only one ephemeral node.

I know it is difficult to see what's going on when you cannot reproduce it yourself.
We are working to create a simple reproducer for this.



On Apr 17, 2011, at 5:46 AM, Ted Dunning wrote:

> How many ephemeral files have to be deleted when a session closes or
> expires?
> 
> 2011/4/15 Chang Song <tr...@me.com>
> 
>> It is not login; it is the session expiring and closing process.
>> 


Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
How many ephemeral files have to be deleted when a session closes or
expires?

2011/4/15 Chang Song <tr...@me.com>

> It is not login; it is the session expiring and closing process.
>

Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
On Apr 16, 2011, at 2:21 PM, Ted Dunning wrote:

> You know, I think it would help if you would answer some of the questions that people have posed.
> 
> You say that it takes 1000 clients over 8 seconds to register.  That is about 100 transactions per second.

Ted, sorry.
The real reproduction scenario isn't what I mentioned initially.
It is not login; it is the session expiring and closing process.

I know; we have tested ZK many times at rates well above this in our environment and saw no problem.
So sorry about the confusion.



> That is two orders of magnitude slower than others have observed ZK to be.  This is a really big difference.
> 
> So there is a big discrepancy here.  I am not saying you didn't observe what you say, but I do think that there is something that you haven't mentioned because you haven't noticed it yet.  If you go through the questions people have asked and answer them, there is a good chance you will notice something that is causing your problems.  There is likely to be a problem in the way that you have set up your machines.
> 
> One pending question is whether you have separate log and snapshot disks.  Do you?

I have already answered this: we have no separate disks.
We have one filesystem mount point on RAID1 disks.


> Another is whether you have other processes running on the disk.  Are there?

Our ZK ensemble servers are dedicated to the ZK ensemble only.



> Another is a request that you post some of the output of iostat with 5 second sampling rate.  Can you post that output?

I will. It will be on Monday though.
But please note that I used to be a kernel filesystem engineer, and I know how to read iostat ;)



> There are other questions that you will find in the email history.
> 
> Remember, people answering your questions here are doing so because they are nice and because they like to build a sense of community.  But to get a lot from them, you need to work with them.



Please let me know if there are more questions to be answered.
I will try to update the JIRA with the answers from these emails.

Thank you.






Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
You know, I think it would help if you would answer some of the questions
that people have posed.

You say that it takes 1000 clients over 8 seconds to register.  That is
about 100 transactions per second.

That is two orders of magnitude slower than others have observed ZK to be.
This is a really big difference.

So there is a big discrepancy here.  I am not saying you didn't observe what
you say, but I do think that there is something that you haven't mentioned
because you haven't noticed it yet.  If you go through the questions people
have asked and answer them, there is a good chance you will notice something
that is causing your problems.  There is likely to be a problem in the way
that you have set up your machines.

One pending question is whether you have separate log and snapshot disks.
Do you?

Another is whether you have other processes running on the disk.  Are there?

Another is a request that you post some of the output of iostat with 5
second sampling rate.  Can you post that output?

There are other questions that you will find in the email history.

Remember, people answering your questions here are doing so because they are
nice and because they like to build a sense of community.  But to get a lot
from them, you need to work with them.

2011/4/15 Chang Song <tr...@me.com>

>
> I have filed a JIRA bug
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1049
>
>
> We have measured I/O wait again, but found no I/O activity due to ZK.
> Just the regular page cache sync daemon at work: 0-3%.
>
> I will have my team attach the ZK stat results.
>
> Thanks a lot.
> Let's move this discussion to JIRA
>

Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
I have filed a JIRA bug

https://issues.apache.org/jira/browse/ZOOKEEPER-1049


We have measured I/O wait again, but found no I/O activity due to ZK.
Just the regular page cache sync daemon at work: 0-3%.

I will have my team attach the ZK stat results.

Thanks a lot.
Let's move this discussion to JIRA


On Apr 15, 2011, at 7:34 AM, Ted Dunning wrote:

> You said that, but there was some skepticism from others about this.
> 
> You need to try the monitoring that was suggested.  5 minute averages are
> not useful.
> 
> What does the stat four letter command return?  (
> http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkCommands )
> 
> 2011/4/14 Chang Song <tr...@me.com>
> 
> >> 2. We have a boot disk and a usr disk.
> >>   But as I said, disk I/O is not the issue causing the 8 second delay.
>> 


Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
You said that, but there was some skepticism from others about this.

You need to try the monitoring that was suggested.  5 minute averages are
not useful.

What does the stat four letter command return?  (
http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkCommands )
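
(For reference, the four-letter words are sent over a plain TCP connection to the client port; a minimal sketch in Java, hostname hypothetical:)

    import java.io.InputStream;
    import java.net.Socket;

    public class FourLetterStat {
        public static void main(String[] args) throws Exception {
            // connect to the ZK client port, send 'stat', print the reply;
            // the server closes the connection once it has responded
            Socket s = new Socket("zk1", 2181);
            s.getOutputStream().write("stat".getBytes());
            s.getOutputStream().flush();
            InputStream in = s.getInputStream();
            int b;
            while ((b = in.read()) != -1) {
                System.out.print((char) b);
            }
            s.close();
        }
    }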

2011/4/14 Chang Song <tr...@me.com>

> 2. We have a boot disk and a usr disk.
>    But as I said, disk I/O is not the issue causing the 8 second delay.
>

Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Yes, Ben.

If you read my emails carefully, I already said it is not the heartbeats;
it is session establishment/closing that gets stampeded.
Since all the requests' responses get delayed, heartbeats are delayed
as well.


You need to understand that most apps can tolerate delays in connect/close,
but we cannot tolerate ping delays, since we are using the ZK heartbeat
timeout as our sole failure detection.
We use 15 seconds (5 sec for each ensemble) for the session timeout, so an
important server will drop out of the cluster even if it is not
malfunctioning; in some cases it wreaks havoc on certain
services.


1. 3.3.3 (latest)

2. We have a boot disk and a usr disk.
    But as I said, disk I/O is not the issue causing the 8 second delay.

My team will file a JIRA today; we'll have to discuss it on JIRA ;)

Thank you.

Chang




On Apr 15, 2011, at 2:59 AM, Benjamin Reed wrote:

> chang,
> 
> if the problem is on client startup, then it isn't the heartbeats
> being stampeded, it is session establishment. the heartbeats are very
> lightweight, so i can't imagine them causing any issues.
> 
> the two key issues we need to know are: 1) the version of the server
> you are running, and 2) if you are using a dedicated device for the
> transaction log.
> 
> ben


Re: Serious problem processing heartbeat on login stampede

Posted by Benjamin Reed <br...@apache.org>.
chang,

if the problem is on client startup, then it isn't the heartbeats
being stampeded, it is session establishment. the heartbeats are very
lightweight, so i can't imagine them causing any issues.

the two key issues we need to know are: 1) the version of the server
you are running, and 2) if you are using a dedicated device for the
transaction log.

ben


Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.


On Apr 15, 2011, at 1:04 AM, Patrick Hunt wrote:

> 2011/4/14 Chang Song <tr...@me.com>:
>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>>> issue is happening, what's the %util of the disk? what's the iowait
>>> look like?
>>> 
>> 
>> Again, no I/O at all.   0%
>> 
> 
> This is simply not possible.
> 
> Sessions are persistent. Each time a session is created, and each time
> it is closed, a transaction is written by the zk server to the data
> directory. Additionally log4j based logs are also being streamed to
> the disk. Each of these activities will cause disk IO that will show
> up on iostat.
> 

Pat, I didn't say there wasn't any I/O, just that it was at 0% utilization,
meaning no significant I/O. It is possible that our monitoring agent
misses some updates, since we sample at 5 minute intervals.
I will try to log in to the server and watch.

But since this is session related, are you using fsync() to flush the
log buffer out to disk? Then I should immediately see I/O activity
go through the roof.
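
(For reference: yes -- the server's SyncRequestProcessor syncs the transaction log to disk before acknowledging a write, and a session create/close is a write. In 3.3.x this can be disabled for diagnosis only, at the cost of durability, via a server-side system property:)

    # server JVM argument; unsafe outside of testing
    -Dzookeeper.forceSync=no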



>> Patrick, they are not continuously logging in/out.
>> Maybe a couple of times a week, and before they push a new feature.
>> When this happens, clients in group A drop out of the cluster, which causes
>> problems for other unrelated services.
>> 
> 
> Ok, good to know.
> 
>> 
>> It is not about the use case, because the ZK clients simply tried to connect to
>> the ZK ensemble. No use case applies. Just many clients logging in at the
>> same time, or expiring at the same time, or closing sessions at the same time.
>> 
> 
> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
> you report) that didn't have this issue. While bugs might be lurking,
> I've also worked with many teams deploying clusters (probably close to
> 100 by now), some of which had problems; the suggestions I'm making to
> you are based on that experience.
> 

Sure. I understand.


>> Heartbeats should be handled in an isolated queue and a
>> dedicated thread.  I don't think we need strict ordering
>> of heartbeats, do we?
> 
> ZK is purposely architected this way; it is not a mistake/bug. It is a
> fallacy for a highly available service to respond quickly to a
> heartbeat when it cannot service regular requests in a timely fashion.
> This is one of the main reasons why heartbeats are handled in this
> way.
> 

Hmm.
If that's the case, we REALLY need to fix this problem the hard way.
Thanks.

Chang



> Patrick
> 


Re: Serious problem processing heartbeat on login stampede

Posted by Patrick Hunt <ph...@apache.org>.
2011/4/14 Chang Song <tr...@me.com>:
>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>> issue is happening, what's the %util of the disk? what's the iowait
>> look like?
>>
>
> Again, no I/O at all.   0%
>

This is simply not possible.

Sessions are persistent. Each time a session is created, and each time
it is closed, a transaction is written by the zk server to the data
directory. Additionally log4j based logs are also being streamed to
the disk. Each of these activities will cause disk IO that will show
up on iostat.

> Patrick, they are not continuously logging in/out.
> Maybe a couple of times a week, and before they push a new feature.
> When this happens, clients in group A drop out of the cluster, which causes
> problems for other unrelated services.
>

Ok, good to know.

>
> It is not about the use case, because the ZK clients simply tried to connect to
> the ZK ensemble. No use case applies. Just many clients logging in at the
> same time, or expiring at the same time, or closing sessions at the same time.
>

As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
you report) that didn't have this issue. While bugs might be lurking,
I've also worked with many teams deploying clusters (probably close to
100 by now), some of which had problems; the suggestions I'm making to
you are based on that experience.

> Heartbeats should be handled in an isolated queue and a
> dedicated thread.  I don't think we need strict ordering
> of heartbeats, do we?

ZK is purposely architected this way; it is not a mistake/bug. It is a
fallacy for a highly available service to respond quickly to a
heartbeat when it cannot service regular requests in a timely fashion.
This is one of the main reasons why heartbeats are handled in this
way.

Patrick


Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.


On Apr 14, 2011, at 10:30 AM, Patrick Hunt wrote:

> 2011/4/13 Chang Song <tr...@me.com>:
>> 
>> Patrick.
>> Thank you for the reply.
>> 
>> We are very aware of all the things you mentioned below.
>> None of those.
>> 
>> Not GC (we monitor every possible resource in the JVM and system).
>> No I/O. No swapping.
>> No VM guest OS. No logging.
>> 
> 
> Hm. ok, a few more ideas then:
> 
> 1) what is the connectivity like btw the servers?
> 
> What is the ping time btw them?
> 
> Is the system perhaps loading down the network during this test,
> causing network latency to increase? Are all the nic cards (server and
> client) configured correctly? I've seen a number of cases where
> clients and/or server had incorrectly configured nics (ethtool
> reported 10 Mb/sec half duplex for what should be 1 gig eth)


Nope. We are experts at these ;)
No issues in these areas.



> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
> issue is happening, what's the %util of the disk? what's the iowait
> look like?
> 

Again, no I/O at all.   0% 



> 3) create a JIRA and upload your 3 server configuration files. Include
> the log4j.properties file you are using and any other details you
> think might be useful. If you can upload a log file from when you see
> this issue that would be useful. Upload any log file if you can't get
> it from the time when you see the issue.
> 

I will have my team file a JIRA.



>> 
>> Oh, one thing I should mention is that it is not 1000 clients,
>> 1000 login/logout per second. All operations like closeSession,
>> ping takes more than 8 seconds (peak).
>> 
> 
> Are you continuously logging in and the logging out, 1000 times per
> second? That's not a good use case for ZK sessions in general. Perhaps
> if you describe your use case in more detail it would help.
> 

Patrick, they do not log in and out continuously.
Maybe a couple of times a week, and before they push a new feature.
When this happens, clients in group A drop out of the cluster, which causes
problems for other, unrelated services.


It is not about the use case, because the ZK clients simply try to connect to
the ZK ensemble; no particular use case applies. Many clients just log in at the
same time, expire at the same time, or close their sessions at the same time.


I am talking about the importance of the realtime-ness of heartbeats,
especially when session timeouts are as short as 15 sec.

Heartbeats should be handled in an isolated queue with a
dedicated thread. I don't think we need to keep strict ordering
of heartbeats, do we?
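
Just to illustrate the idea, here is a minimal sketch (made-up names,
not the actual server classes): pings get their own queue, drained by a
dedicated thread that replies immediately instead of waiting behind
session creates and closes.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Stand-in types; the real server classes look different.
    interface Connection { void sendPingResponse(); }

    class PingRequest {
        final Connection conn;
        PingRequest(Connection conn) { this.conn = conn; }
    }

    // Dedicated heartbeat path: pings never enter the commit pipeline,
    // so a createSession stampede cannot delay them.
    class HeartbeatProcessor implements Runnable {
        private final BlockingQueue<PingRequest> pings =
                new LinkedBlockingQueue<PingRequest>();

        // Called from the IO layer once a packet is identified as a ping.
        void submitPing(PingRequest ping) { pings.offer(ping); }

        public void run() {
            try {
                while (true) {
                    pings.take().conn.sendPingResponse();  // reply right away
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // exit on shutdown
            }
        }
    }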

Thank you for your help, Patrick.





Re: Serious problem processing hearbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Patrick and Ted.
Unless the ZooKeeper client adds this feature, it is not easy for us to implement.

We only provide the platform for many services within our org.
Their batch servers will fire off whatever clients they want;
we have no control over that.

But an 8-second latency during a stampede is definitely a problem, and
it needs to be addressed in the server, not by a client back-off policy.

What happens when we have double the traffic?
More than 20 seconds of latency?

I think we need to rethink how heartbeat traffic is handled among all
the other request/response traffic.

Thank you.






Re: Serious problem processing hearbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
This is a more powerful idea than it looks like at first glance.

The reason is that there is often a highly non-linear and adverse impact to
response time due to higher load.  I have never been able to properly
account for this using queuing models in a system that is not swapping, but
it is definitely real.

If your rebooting processes simply wait a random 0 to 5 seconds before
reconnecting, your problems are likely to be much reduced.
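
Something like this on the client side would do it (a sketch; the
ZooKeeper constructor is the real client API, everything else here is
illustrative):

    import java.util.Random;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class JitteredConnect {
        // Sleep a random 0-5 s before connecting so a batch of rebooting
        // clients does not hit the ensemble at the same instant.
        public static ZooKeeper connect(String hostPort, int sessionTimeoutMs)
                throws Exception {
            Thread.sleep(new Random().nextInt(5000));
            return new ZooKeeper(hostPort, sessionTimeoutMs, new Watcher() {
                public void process(WatchedEvent event) {
                    // react to connection state changes here
                }
            });
        }
    }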

2011/4/13 Patrick Hunt <ph...@apache.org>

> 2) can you hold off some of the clients from the stampede? Perhaps add
> a random holdoff to each of the clients before connecting,
> additionally a similar random holdoff from closing the session. this
> seems like a straightforward change from your client side (easy to
> implement/try) but hard to tell given we don't have much insight into
> what your use case is.
>

Re: Serious problem processing hearbeat on login stampede

Posted by Chang Song <tr...@me.com>.
I only use one filesystem for all logs.


On Apr 15, 2011, at 1:00 AM, Mahadev Konar wrote:

> Chang/Pat and others,
>  I didnt see this in the discussions above, but are you guys using a
> single disk or 2 disks for ZK? One for snapshot and one for txn
> logging?
> 
> thanks
> mahadev


Re: Serious problem processing hearbeat on login stampede

Posted by Mahadev Konar <ma...@apache.org>.
Chang/Pat and others,
  I didn't see this in the discussions above, but are you guys using a
single disk or two disks for ZK? One for snapshots and one for the txn
log?
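
I.e., something like this in zoo.cfg (the directory paths here are just
examples):

    # keep snapshots and the transaction log on separate devices so
    # nothing else competes with the txn log's sequential writes
    dataDir=/disk1/zookeeper/data
    dataLogDir=/disk2/zookeeper/txnlog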

thanks
mahadev




-- 
thanks
mahadev
@mahadevkonar

Re: Serious problem processing hearbeat on login stampede

Posted by Chang Song <tr...@me.com>.
On Apr 14, 2011, at 1:53 PM, Patrick Hunt wrote:

> two additional thoughts come to mind:
> 
> 1) try running the ensemble with a single zk server, does this help at
> all? (it might provide a short term workaround, it also might provide
> some insight into what's causing the issue)


We are going to try this to see if we can identify the culprit.

Thanks.





Re: Serious problem processing hearbeat on login stampede

Posted by Patrick Hunt <ph...@apache.org>.
two additional thoughts come to mind:

1) try running the ensemble with a single zk server, does this help at
all? (it might provide a short term workaround, it also might provide
some insight into what's causing the issue)

2) can you hold off some of the clients from the stampede? Perhaps add
a random holdoff to each of the clients before connecting,
additionally a similar random holdoff from closing the session. this
seems like a straightforward change from your client side (easy to
implement/try) but hard to tell given we don't have much insight into
what your use case is.


Anyone else in the community have any ideas?


Patrick


Re: Serious problem processing hearbeat on login stampede

Posted by Patrick Hunt <ph...@apache.org>.
2011/4/13 Chang Song <tr...@me.com>:
>
> Patrick.
> Thank you for the reply.
>
> We are very aware of all the things you mentioned below.
> None of those.
>
> Not GC (we monitor every possible resource in JVM and system)
> No IO. No Swapping.
> No VM guest OS. No logging.
>

Hm. ok, a few more ideas then:

1) what is the connectivity like btw the servers?

What is the ping time btw them?

Is the system perhaps loading down the network during this test,
causing network latency to increase? Are all the nic cards (server and
client) configured correctly? I've seen a number of cases where
clients and/or server had incorrectly configured nics (ethtool
reported 10 MB/sec half duplex for what should be 1gigeth)

2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
issue is happening, what's the %util of the disk? what's the iowait
look like?

3) create a JIRA and upload your 3 server configuration files. Include
the log4j.properties file you are using and any other details you
think might be useful. If you can upload a log file from when you see
this issue that would be useful. Upload any log file if you can't get
it from the time when you see the issue.

>
> Oh, one thing I should mention is that it is not 1000 clients,
> 1000 login/logout per second. All operations like closeSession,
> ping takes more than 8 seconds (peak).
>

Are you continuously logging in and then logging out, 1000 times per
second? That's not a good use case for ZK sessions in general. Perhaps
if you describe your use case in more detail it would help.

Patrick


Re: Serious problem processing hearbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Patrick.
Thank you for the reply.

We are very aware of all the things you mentioned below.
None of those.

Not GC (we monitor every possible resource in JVM and system)
No IO. No Swapping.
No VM guest OS. No logging.


Oh, one thing I should mention: it is not just 1000 clients, it is
1000 logins/logouts per second. All operations, like closeSession and
ping, take more than 8 seconds (peak).

It's about CommitProcessor thread queueing (in the leader).
QueuedRequests goes up to 800, and so do committedRequests and
PendingRequestElapsedTime; PendingRequestElapsedTime
goes up to 8.8 seconds during this flood.

To reproduce this scenario exactly, the easiest way is to

- suspend all client JVMs with a debugger
- cause all client JVMs to OOME and produce heap dumps

in group B. All clients in group A will then fail to receive a
ping response within 5 seconds.

We need to fix this as soon as possible.
What we do as a workaround is to raise sessionTimeout to 40 sec.
At least the clients in group A survive, but this increases
our cluster failover time significantly.

Thank you, Patrick.


ps. We actually tried pushing ping requests to FinalRequestProcessor as
soon as a packet identifies itself as a ping. No dice.
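
For reference, the bypass we tried looks roughly like this, with
simplified stand-in types (the real request processor classes differ):

    // Stand-ins for the real pipeline types.
    interface RequestProcessor { void processRequest(Request request); }

    class Request {
        final int type;
        Request(int type) { this.type = type; }
    }

    class PingAwareProcessor {
        static final int PING = 11;  // OpCode.ping
        private final RequestProcessor commitProcessor;  // normal write path
        private final RequestProcessor finalProcessor;   // replies to clients

        PingAwareProcessor(RequestProcessor commit, RequestProcessor fin) {
            this.commitProcessor = commit;
            this.finalProcessor = fin;
        }

        void processRequest(Request request) {
            if (request.type == PING) {
                finalProcessor.processRequest(request);   // skip the commit queue
            } else {
                commitProcessor.processRequest(request);  // everything else queues
            }
        }
    }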





Re: Serious problem processing hearbeat on login stampede

Posted by Patrick Hunt <ph...@apache.org>.
Hi Chang, it sounds like you may have an issue with your cluster
environment/setup, or perhaps a resource (GC/mem) issue. Have you
looked through the troubleshooting guide?
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting

In particular 1000 clients connecting should be fine, I've personally
seen clusters of 7-10 thousand clients. Keep in mind that each session
establishment is essentially a write (so the quorum is involved) and
what we typically see there is that the cluster configuration has
issues. 14 seconds for a ping response is huge and indicates one of
the following may be an underlying cause:

1) are you running in a virtualized environment?
2) are you co-locating other services on the same host(s) that make up
the ZK serving cluster?
3) have you followed the admin guide's "things to avoid"?
http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
In particular ensuring that you are not swapping or going into gc
pause (both on the server and the client)
a) try turning on GC logging and ensure that you are not going into GC
pause, see the troubleshooting guide, this is the most common cause of
high latency for the clients
b) ensure that you are not swapping
c) ensure that other processes are not causing log writing
(transactional logging) to be slow.
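
For example, GC logging can be turned on with JVM flags along these
lines (the log path is just an example):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/zookeeper/gc.log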

Patrick


Re: Serious problem processing hearbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Actually, "netspider", Chisu Ryu, in my team fixed it.

Thanks, Chisu.

Chang



On Jul 6, 2011, at 3:04 AM, Patrick Hunt wrote:

> Vishal brought up an issue at the ZK post-summit meetup that might
> also be (partially?) resolved by this patch.
> 
> Thanks again Chang Song!


Re: Serious problem processing hearbeat on login stampede

Posted by Patrick Hunt <ph...@apache.org>.
Vishal brought up an issue at the ZK post-summit meetup that might
also be (partially?) resolved by this patch.

Thanks again Chang Song!

Patrick


Re: Serious problem processing hearbeat on login stampede

Posted by Chang Song <tr...@me.com>.
No problem.
Glad to contribute.

Thanks a lot.


On Jul 2, 2011, at 1:03 AM, Ted Dunning wrote:

> Thanks for the feedback Jared!
> 
> (and thanks to Chang as well!)


Re: Serious problem processing hearbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
Thanks for the feedback Jared!

(and thanks to Chang as well!)

On Fri, Jul 1, 2011 at 8:06 AM, Jared Cantwell <ja...@gmail.com> wrote:

> As a note, I believe we just used this patch to solve a major issue ...
>
> Thanks Chang!
> ~Jared
>
> On Tue, Apr 19, 2011 at 10:59 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Where is this set?
> >
> > Why does this cause this problem?
> >
> > 2011/4/19 Chang Song <tr...@me.com>
> >
> > >
> > > Problem solved.
> > > it was socket linger option set to 2 sec timeout.
> > >
> > > We have verified that the original problem goes away when we turn off
> > > linger option.
> > > No longer a mystery ;)
> > >
> > >
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1049
> > >
> > >
> > > Chang
> > >
> > >
> > > On Apr 19, 2011, at 3:16 AM, Mahadev Konar wrote:
> > >
> > > > Camille, Ted,
> > > > Can we continue the discussion on
> > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1049?
> > > >
> > > > We should track all the suggestions/issues on the jira.
> > > >
> > > > thanks
> > > > mahadev
> > > >
> > > > On Mon, Apr 18, 2011 at 9:03 AM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >> Interesting.  It does seem to suggest that the session expiration is
> > > >> expensive.
> > > >>
> > > >> There is a concurrent table in guava that provides very good
> > > multi-threaded
> > > >> performance.  I think that is achieved by using a number of locks
> and
> > > then
> > > >> distributing threads across the locks according to the hash slot
> being
> > > used.
> > > >>  But I would have expected any in memory operation to complete very
> > > quickly.
> > > >>
> > > >> Is it possible that the locks on the session table are held longer
> > than
> > > they
> > > >> should be?
> > > >>
> > > >> 2011/4/18 Fournier, Camille F. [Tech] <Ca...@gs.com>
> > > >>
> > > >>> Is it possible this is related to this report back in February?
> > > >>>
> > > >>>
> > >
> >
> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
> > > >>>
> > > >>> I theorized that the issue might be due to synchronization on the
> > > session
> > > >>> table, but never got enough information to finish the
> investigation.
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > thanks
> > > > mahadev
> > > > @mahadevkonar
> > >
> > >
> >
>

Re: Serious problem processing heartbeat on login stampede

Posted by Jared Cantwell <ja...@gmail.com>.
As a note, I believe we just used this patch to solve a major issue we were
seeing.  We were having problems when power to a node was pulled, which left
hung TCP sessions on the servers.  With many connections, each close
operation was taking 2 seconds and held up the server long enough that it
began incorrectly closing other sessions.  By disabling linger, these
hanging sessions were closed immediately and the problem went away.
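
To put rough numbers on it: at 2 seconds per lingering close, a serial
cleanup of even ten hung connections stalls that thread for 20 seconds,
well past a typical session timeout. A minimal sketch of the stall
(hypothetical names, not ZooKeeper's actual shutdown path):

import java.net.Socket;
import java.util.List;

// Sketch: with SO_LINGER(true, 2) set, close() on a socket whose peer
// has vanished can block for up to 2 seconds. Closing N such sockets
// one after another stalls this thread for up to 2*N seconds.
class SerialCloseStall {
    static void closeAll(List<Socket> hungConnections) {
        for (Socket s : hungConnections) {
            long t0 = System.nanoTime();
            try {
                s.close();  // may linger up to 2 s per socket
            } catch (Exception e) {
                // socket already broken; nothing to do
            }
            System.out.println("close took "
                    + (System.nanoTime() - t0) / 1_000_000 + " ms");
        }
    }
}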

Thanks Chang!
~Jared

On Tue, Apr 19, 2011 at 10:59 AM, Ted Dunning <te...@gmail.com> wrote:

> Where is this set?
>
> Why does this cause this problem?
>
> 2011/4/19 Chang Song <tr...@me.com>
>
> >
> > Problem solved.
> > It was the socket linger option, set to a 2-second timeout.
> >
> > We have verified that the original problem goes away when we turn off
> > the linger option.
> > No longer a mystery ;)
> >
> >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1049
> >
> >
> > Chang
> >
> >
> > On Apr 19, 2011, at 3:16 AM, Mahadev Konar wrote:
> >
> > > Camille, Ted,
> > > Can we continue the discussion on
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1049?
> > >
> > > We should track all the suggestions/issues on the jira.
> > >
> > > thanks
> > > mahadev
> > >
> > > On Mon, Apr 18, 2011 at 9:03 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >> Interesting.  It does seem to suggest that the session expiration is
> > >> expensive.
> > >>
> > >> There is a concurrent table in guava that provides very good
> > multi-threaded
> > >> performance.  I think that is achieved by using a number of locks and
> > then
> > >> distributing threads across the locks according to the hash slot being
> > used.
> > >>  But I would have expected any in memory operation to complete very
> > quickly.
> > >>
> > >> Is it possible that the locks on the session table are held longer
> than
> > they
> > >> should be?
> > >>
> > >> 2011/4/18 Fournier, Camille F. [Tech] <Ca...@gs.com>
> > >>
> > >>> Is it possible this is related to this report back in February?
> > >>>
> > >>>
> >
> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
> > >>>
> > >>> I theorized that the issue might be due to synchronization on the
> > session
> > >>> table, but never got enough information to finish the investigation.
> > >>>
> > >>
> > >
> > >
> > >
> > > --
> > > thanks
> > > mahadev
> > > @mahadevkonar
> >
> >
>

Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
Where is this set?

Why does this cause this problem?

2011/4/19 Chang Song <tr...@me.com>

>
> Problem solved.
> It was the socket linger option, set to a 2-second timeout.
>
> We have verified that the original problem goes away when we turn off
> the linger option.
> No longer a mystery ;)
>
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1049
>
>
> Chang
>
>
> On Apr 19, 2011, at 3:16 AM, Mahadev Konar wrote:
>
> > Camille, Ted,
> > Can we continue the discussion on
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1049?
> >
> > We should track all the suggestions/issues on the jira.
> >
> > thanks
> > mahadev
> >
> > On Mon, Apr 18, 2011 at 9:03 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >> Interesting.  It does seem to suggest that the session expiration is
> >> expensive.
> >>
> >> There is a concurrent table in guava that provides very good
> multi-threaded
> >> performance.  I think that is achieved by using a number of locks and
> then
> >> distributing threads across the locks according to the hash slot being
> used.
> >>  But I would have expected any in memory operation to complete very
> quickly.
> >>
> >> Is it possible that the locks on the session table are held longer than
> they
> >> should be?
> >>
> >> 2011/4/18 Fournier, Camille F. [Tech] <Ca...@gs.com>
> >>
> >>> Is it possible this is related to this report back in February?
> >>>
> >>>
> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
> >>>
> >>> I theorized that the issue might be due to synchronization on the
> session
> >>> table, but never got enough information to finish the investigation.
> >>>
> >>
> >
> >
> >
> > --
> > thanks
> > mahadev
> > @mahadevkonar
>
>

Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Problem solved.
It was the socket linger option, set to a 2-second timeout.

We have verified that the original problem goes away when we turn off the linger option.
No longer a mystery ;)
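
For reference, the Java-level option in question (a sketch of the setting
itself, not the actual ZOOKEEPER-1049 patch):

import java.net.Socket;

// With linger enabled and a 2-second timeout, close() may block for up
// to 2 seconds while unsent data is flushed; with linger disabled (the
// Java default), close() returns immediately.
class LingerSetting {
    static void configure(Socket s, boolean lingerOn) throws Exception {
        if (lingerOn) {
            s.setSoLinger(true, 2);   // close() may block up to 2 s
        } else {
            s.setSoLinger(false, 0);  // close() returns at once
        }
    }
}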


https://issues.apache.org/jira/browse/ZOOKEEPER-1049


Chang


On Apr 19, 2011, at 3:16 AM, Mahadev Konar wrote:

> Camille, Ted,
> Can we continue the discussion on
> https://issues.apache.org/jira/browse/ZOOKEEPER-1049?
> 
> We should track all the suggestions/issues on the jira.
> 
> thanks
> mahadev
> 
> On Mon, Apr 18, 2011 at 9:03 AM, Ted Dunning <te...@gmail.com> wrote:
>> Interesting.  It does seem to suggest that the session expiration is
>> expensive.
>> 
>> There is a concurrent table in guava that provides very good multi-threaded
>> performance.  I think that is achieved by using a number of locks and then
>> distributing threads across the locks according to the hash slot being used.
>>  But I would have expected any in memory operation to complete very quickly.
>> 
>> Is it possible that the locks on the session table are held longer than they
>> should be?
>> 
>> 2011/4/18 Fournier, Camille F. [Tech] <Ca...@gs.com>
>> 
>>> Is it possible this is related to this report back in February?
>>> 
>>> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
>>> 
>>> I theorized that the issue might be due to synchronization on the session
>>> table, but never got enough information to finish the investigation.
>>> 
>> 
> 
> 
> 
> -- 
> thanks
> mahadev
> @mahadevkonar


Re: Serious problem processing heartbeat on login stampede

Posted by Mahadev Konar <ma...@apache.org>.
Camille, Ted,
 Can we continue the discussion on
https://issues.apache.org/jira/browse/ZOOKEEPER-1049?

We should track all the suggestions/issues on the jira.

thanks
mahadev

On Mon, Apr 18, 2011 at 9:03 AM, Ted Dunning <te...@gmail.com> wrote:
> Interesting.  It does seem to suggest that the session expiration is
> expensive.
>
> There is a concurrent table in guava that provides very good multi-threaded
> performance.  I think that is achieved by using a number of locks and then
> distributing threads across the locks according to the hash slot being used.
>  But I would have expected any in memory operation to complete very quickly.
>
> Is it possible that the locks on the session table are held longer than they
> should be?
>
> 2011/4/18 Fournier, Camille F. [Tech] <Ca...@gs.com>
>
>> Is it possible this is related to this report back in February?
>>
>> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
>>
>> I theorized that the issue might be due to synchronization on the session
>> table, but never got enough information to finish the investigation.
>>
>



-- 
thanks
mahadev
@mahadevkonar

Re: Serious problem processing heartbeat on login stampede

Posted by Ted Dunning <te...@gmail.com>.
Interesting.  It does seem to suggest that the session expiration is
expensive.

There is a concurrent table in guava that provides very good multi-threaded
performance.  I think that is achieved by using a number of locks and then
distributing threads across the locks according to the hash slot being used.
But I would have expected any in-memory operation to complete very quickly.

Is it possible that the locks on the session table are held longer than they
should be?
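
To illustrate the striping idea, here is a minimal sketch (hypothetical
names, not ZooKeeper's actual SessionTracker): partition the session table
into shards, each guarded by its own lock and chosen by hash slot, so
touches of different sessions rarely contend:

import java.util.HashMap;
import java.util.Map;

// Each shard has its own map and lock; a createSession burst only
// blocks heartbeat touches that land in the same shard.
class StripedSessionTable {
    private static final int STRIPES = 16; // power of two
    @SuppressWarnings("unchecked")
    private final Map<Long, Long>[] shards = new Map[STRIPES];

    StripedSessionTable() {
        for (int i = 0; i < STRIPES; i++) shards[i] = new HashMap<>();
    }

    private Map<Long, Long> shardFor(long sessionId) {
        return shards[(int) (sessionId & (STRIPES - 1))];
    }

    // Record a heartbeat: extend the session's expiry deadline.
    void touch(long sessionId, long timeoutMs) {
        Map<Long, Long> shard = shardFor(sessionId);
        synchronized (shard) {
            shard.put(sessionId, System.currentTimeMillis() + timeoutMs);
        }
    }

    // Drop a closed or expired session.
    void remove(long sessionId) {
        Map<Long, Long> shard = shardFor(sessionId);
        synchronized (shard) {
            shard.remove(sessionId);
        }
    }
}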

2011/4/18 Fournier, Camille F. [Tech] <Ca...@gs.com>

> Is it possible this is related to this report back in February?
>
> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E
>
> I theorized that the issue might be due to synchronization on the session
> table, but never got enough information to finish the investigation.
>

RE: Serious problem processing heartbeat on login stampede

Posted by "Fournier, Camille F. [Tech]" <Ca...@gs.com>.
Is it possible this is related to this report back in February?
http://mail-archives.apache.org/mod_mbox/zookeeper-user/201102.mbox/%3C6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com%3E

I theorized that the issue might be due to synchronization on the session table, but never got enough information to finish the investigation. 
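
As a sketch of that hypothesis (hypothetical names, not the real session
tracker code): if every heartbeat touch and every createSession/closeSession
contend on one monitor, a login stampede serializes heartbeat handling
behind it:

import java.util.HashMap;
import java.util.Map;

// Model of a globally synchronized session table: all operations take
// the same monitor, so a burst of creates makes even cheap heartbeat
// touches wait in line.
class GlobalLockSessionTable {
    private final Map<Long, Long> expiry = new HashMap<>(); // sessionId -> deadline

    synchronized void create(long sessionId, long timeoutMs) {
        expiry.put(sessionId, System.currentTimeMillis() + timeoutMs);
    }

    synchronized void touch(long sessionId, long timeoutMs) {
        // blocked whenever a create/close holds the monitor
        expiry.put(sessionId, System.currentTimeMillis() + timeoutMs);
    }

    synchronized void close(long sessionId) {
        expiry.remove(sessionId);
    }
}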

C

-----Original Message-----
From: Chang Song [mailto:tru64ufs@me.com] 
Sent: Saturday, April 16, 2011 8:31 AM
To: user@zookeeper.apache.org
Cc: zookeeper-user@hadoop.apache.org
Subject: Re: Serious problem processing heartbeat on login stampede


Lakshman.


That's exactly the same symptom (queueing in CommitRequestProcessor).
We didn't bypass pings, but we did push ping requests from the front of the queue
directly to FinalRequestProcessor(); it didn't alleviate the problem.

We will post a more detailed analysis in the ZK JIRA issue soon.

Thank you.

Chang

P.S. We are also working on a simple reproducer so that a committer can
      reproduce and fix this.


On Apr 16, 2011, at 8:36 PM, Lakshman wrote:

> Hi Everyone,
> 
> We also faced a similar [session timeout] issue, but in a different scenario.
> Here is some analysis I did some time back; the same has been posted on the
> zookeeper-user forum.
> There is no under-provisioning on the server side.
> 
> The issue was resolved after letting ping requests bypass the queue. This may
> not be a good idea, but we gave it a try.
> 
> The earlier mail I posted on the forum:
> *********************************
> Subject: Frequent SessionTimeoutException[Client] -
> CancelledKeyException[Server]
> 
> We are using ZooKeeper 3.3.1, and we frequently hit
> CancelledKeyException after application startup.
> Average response time is less than 50 milliseconds, but the last request
> sent gets no response for 20 seconds, so it times out.
> 
> On analysis, we found a possible problem with CommitRequestProcessor.
> 
> The following sequence of steps occurs:
> 
> 1. The client sends some request [exists, setData, etc.].
> 2. The server receives the packet completely; it is submitted for
>    processing [nextPending].
> 3. The client sends some ping requests after that.
> 4. The server receives the ping requests as well, and they are also
>    queued up into queuedRequests.
> 5. The client times out because it gets no response from the server:
>    CommitRequestProcessor is waiting for a committedRequest for the
>    current nextPending operation, and the pings sit behind it.
> 
> As per my understanding, ping requests from clients need not be queued up
> and can be processed immediately.
> *********************************
> 
> --
> Thanks
> Laxman
> -----Original Message-----
> From: Chang Song [mailto:tru64ufs@me.com] 
> Sent: Wednesday, April 13, 2011 7:05 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Serious problem processing heartbeat on login stampede
> 
> Hello, folks.
> 
> We have run into a very serious issue with ZooKeeper.
> Here's a brief scenario.
> 
> We have some ZooKeeper clients with a session timeout of 15 sec (thus a 5 sec
> ping); let's call these clients group A.
> 
> Now 1000 new clients (let's call these group B) start up at the same time,
> trying to connect to a three-node ZK ensemble and creating a ZK createSession
> stampede.
> 
> Now almost all clients in group A are unable to exchange pings within the
> session expiry time (15 sec).
> Thus clients in group A drop out of the cluster.
> 
> We have looked into this issue a bit and found that session queue processing
> is mostly synchronous.
> Latency between ping request and response ranges from 10 ms up to 14 seconds
> during this login stampede.
> 
> Since session timeout is a serious matter for our cluster, pings should be
> handled in a pseudo-realtime fashion.
> 
> I don't know exactly how the ping timeout policy works in the clients and
> server, but clients failing to receive ping responses because of ZooKeeper
> login sessions makes no sense to me.
> 
> Shouldn't we have a separate ping/heartbeat queue and thread?
> Or even multiple ping queues/threads to keep realtime heartbeats?
> 
> This is a very serious issue with ZooKeeper for our mission-critical system.
> Could anyone look into this?
> 
> I will try to file a bug.
> 
> Thank you.
> 
> Chang
> 
> 


Re: Serious problem processing heartbeat on login stampede

Posted by Chang Song <tr...@me.com>.
Lakshman.


That's exactly the same symptom (queueing in CommitRequestProcessor).
We didn't bypass pings, but we did push ping requests from the front of the queue
directly to FinalRequestProcessor(); it didn't alleviate the problem.

We will post a more detailed analysis in the ZK JIRA issue soon.

Thank you.

Chang

P.S. We are also working on a simple reproducer so that a committer can
      reproduce and fix this.


On Apr 16, 2011, at 8:36 PM, Lakshman wrote:

> Hi Everyone,
> 
> We also faced a similar [session timeout] issue, but in a different scenario.
> Here is some analysis I did some time back; the same has been posted on the
> zookeeper-user forum.
> There is no under-provisioning on the server side.
> 
> The issue was resolved after letting ping requests bypass the queue. This may
> not be a good idea, but we gave it a try.
> 
> The earlier mail I posted on the forum:
> *********************************
> Subject: Frequent SessionTimeoutException[Client] -
> CancelledKeyException[Server]
> 
> We are using ZooKeeper 3.3.1, and we frequently hit
> CancelledKeyException after application startup.
> Average response time is less than 50 milliseconds, but the last request
> sent gets no response for 20 seconds, so it times out.
> 
> On analysis, we found a possible problem with CommitRequestProcessor.
> 
> The following sequence of steps occurs:
> 
> 1. The client sends some request [exists, setData, etc.].
> 2. The server receives the packet completely; it is submitted for
>    processing [nextPending].
> 3. The client sends some ping requests after that.
> 4. The server receives the ping requests as well, and they are also
>    queued up into queuedRequests.
> 5. The client times out because it gets no response from the server:
>    CommitRequestProcessor is waiting for a committedRequest for the
>    current nextPending operation, and the pings sit behind it.
> 
> As per my understanding, ping requests from clients need not be queued up
> and can be processed immediately.
> *********************************
> 
> --
> Thanks
> Laxman
> -----Original Message-----
> From: Chang Song [mailto:tru64ufs@me.com] 
> Sent: Wednesday, April 13, 2011 7:05 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Serious problem processing heartbeat on login stampede
> 
> Hello, folks.
> 
> We have run into a very serious issue with ZooKeeper.
> Here's a brief scenario.
> 
> We have some ZooKeeper clients with a session timeout of 15 sec (thus a 5 sec
> ping); let's call these clients group A.
> 
> Now 1000 new clients (let's call these group B) start up at the same time,
> trying to connect to a three-node ZK ensemble and creating a ZK createSession
> stampede.
> 
> Now almost all clients in group A are unable to exchange pings within the
> session expiry time (15 sec).
> Thus clients in group A drop out of the cluster.
> 
> We have looked into this issue a bit and found that session queue processing
> is mostly synchronous.
> Latency between ping request and response ranges from 10 ms up to 14 seconds
> during this login stampede.
> 
> Since session timeout is a serious matter for our cluster, pings should be
> handled in a pseudo-realtime fashion.
> 
> I don't know exactly how the ping timeout policy works in the clients and
> server, but clients failing to receive ping responses because of ZooKeeper
> login sessions makes no sense to me.
> 
> Shouldn't we have a separate ping/heartbeat queue and thread?
> Or even multiple ping queues/threads to keep realtime heartbeats?
> 
> This is a very serious issue with ZooKeeper for our mission-critical system.
> Could anyone look into this?
> 
> I will try to file a bug.
> 
> Thank you.
> 
> Chang
> 
> 


RE: Serious problem processing heartbeat on login stampede

Posted by Lakshman <la...@huawei.com>.
Hi Everyone,

We also faced a similar [session timeout] issue, but in a different scenario.
Here is some analysis I did some time back; the same has been posted on the
zookeeper-user forum.
There is no under-provisioning on the server side.

The issue was resolved after letting ping requests bypass the queue. This may
not be a good idea, but we gave it a try.

The earlier mail I posted on the forum:
*********************************
Subject: Frequent SessionTimeoutException[Client] -
CancelledKeyException[Server]

We are using ZooKeeper 3.3.1, and we frequently hit
CancelledKeyException after application startup.
Average response time is less than 50 milliseconds, but the last request
sent gets no response for 20 seconds, so it times out.

On analysis, we found a possible problem with CommitRequestProcessor.

The following sequence of steps occurs:

1. The client sends some request [exists, setData, etc.].
2. The server receives the packet completely; it is submitted for
   processing [nextPending].
3. The client sends some ping requests after that.
4. The server receives the ping requests as well, and they are also
   queued up into queuedRequests.
5. The client times out because it gets no response from the server:
   CommitRequestProcessor is waiting for a committedRequest for the
   current nextPending operation, and the pings sit behind it.

As per my understanding, ping requests from clients need not be queued up
and can be processed immediately.
*********************************
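
A schematic model of that workaround (simplified types, not ZooKeeper's
actual CommitRequestProcessor classes): pings are answered at once instead
of joining the ordered queue behind a pending commit:

import java.util.concurrent.LinkedBlockingQueue;

// Ping requests skip the ordered commit queue and are answered
// immediately; everything else keeps its ordering guarantees by going
// through queuedRequests as usual.
class PingBypass {
    enum Op { PING, READ, WRITE }

    static class Request {
        final Op op;
        Request(Op op) { this.op = op; }
    }

    final LinkedBlockingQueue<Request> queuedRequests = new LinkedBlockingQueue<>();

    void submit(Request r) {
        if (r.op == Op.PING) {
            reply(r);                // heartbeat: no ordering needed
        } else {
            queuedRequests.add(r);   // ordered path; may wait on a commit
        }
    }

    void reply(Request r) {
        // placeholder: the real server would send the response here
    }
}

Whether the bypass is safe is exactly the caveat above, but pings change no
state, which is why it seems plausible.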

--
Thanks
Laxman
-----Original Message-----
From: Chang Song [mailto:tru64ufs@me.com] 
Sent: Wednesday, April 13, 2011 7:05 PM
To: zookeeper-user@hadoop.apache.org
Subject: Serious problem processing heartbeat on login stampede

Hello, folks.

We have run into a very serious issue with ZooKeeper.
Here's a brief scenario.

We have some ZooKeeper clients with a session timeout of 15 sec (thus a 5 sec
ping); let's call these clients group A.

Now 1000 new clients (let's call these group B) start up at the same time,
trying to connect to a three-node ZK ensemble and creating a ZK createSession
stampede.

Now almost all clients in group A are unable to exchange pings within the
session expiry time (15 sec).
Thus clients in group A drop out of the cluster.

We have looked into this issue a bit and found that session queue processing
is mostly synchronous.
Latency between ping request and response ranges from 10 ms up to 14 seconds
during this login stampede.

Since session timeout is a serious matter for our cluster, pings should be
handled in a pseudo-realtime fashion.

I don't know exactly how the ping timeout policy works in the clients and
server, but clients failing to receive ping responses because of ZooKeeper
login sessions makes no sense to me.

Shouldn't we have a separate ping/heartbeat queue and thread?
Or even multiple ping queues/threads to keep realtime heartbeats?

This is a very serious issue with ZooKeeper for our mission-critical system.
Could anyone look into this?

I will try to file a bug.

Thank you.

Chang