You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by "Dale Johnson (JIRA)" <ji...@apache.org> on 2010/01/13 02:53:54 UTC

[jira] Created: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

"exceeded deadline by N ms" floods logs
---------------------------------------

                 Key: ZOOKEEPER-642
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
             Project: Zookeeper
          Issue Type: Bug
          Components: c client
    Affects Versions: 3.2.1
         Environment: virtualized linux - ec2 - ubuntu
            Reporter: Dale Johnson
             Fix For: 3.1.2


More important zookeeper warnings are drown out by the following several times per minute:

2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms

Perhaps this is an issue with the way virtualized systems manage gettimeofday results?

Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.

Is there an obvious configuration change that I can make to fix this?

config file below:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/mnt/zookeeper
# the port at which the clients will connect
clientPort=2181

server.1=hbase.1:2888:3888
server.2=hbase.2:2888:3888
server.3=hbase.3:2888:3888
server.4=hbase.4:2888:3888
server.5=hbase.5:2888:3888


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837493#action_12837493 ] 

Patrick Hunt commented on ZOOKEEPER-642:
----------------------------------------

> We have bumped up tickTime to 20000 per Patrick's suggestion in another thread.

Hm, I'm not sure which thread you are referring to but tickTime has no impact on this particular issue.

>  "Exceeded deadline by 769ms" 
> ... The message as it is has a fairly low diagnostic value.

I don't think that's the case here. Nearly 800ms is a pretty significant issue, esp as the vm is unloaded as you mentioned.

Can you try using cli_mt instead of cli_st and see if you notice any difference?

This message  "Exceeded deadline by 769ms" is saying that the ZK _client_ asked the operating system, via the "select(..., timeout) call, to either select something or wakeup after the specified timeout. From the select man page: 

"timeout  is an upper bound on the amount of time elapsed before select() returns."

We are printing this message because select woke up 769ms later than what we asked, which is pretty significant. Notice this has nothing to do with the server, server settings/config or server performance, it's all local to the client/select call.  This tells me that something is up on your client, prolly due to use of vmware. Perhaps Swapping at the host (non-virt) level? Try the cli_mt, you might also try running w/o VMware and see what happens, it will give you a baseline.



> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.4.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-642:
-----------------------------------

    Fix Version/s:     (was: 3.3.0)
                   3.4.0

> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.4.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804802#action_12804802 ] 

Patrick Hunt commented on ZOOKEEPER-642:
----------------------------------------

@dale basically we are asking to be woken up in a certain amount of time (in ms), and we are woken up later than we asked for. This is 
to be expected on heavily loaded system, we are logging in order to provide some insight (say you see a bunch of these, then
some session timeout, it might be useful). I agree though, in general not particularly useful, bit then in some cases it might be.

The selection of 10ms is a mistake though (esp as some systems don't have very good resolution on timing). It should be larger, perhaps 
scaled relative to the timeout. We could output as debug, but then it would essentially never be avail when you wanted it. We could
latch (only output the first N logs of this type) but then it would get lost over time. Perhaps we could do something where we output
a max of once every minute (as warn, debug the rest of the time). Might be a good tradeoff (increase the check from 10ms at the same
time).

> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.3.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Dale Johnson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800301#action_12800301 ] 

Dale Johnson commented on ZOOKEEPER-642:
----------------------------------------

It is a small instance.

When the machine becomes heavily loaded, the times go up quite a bit on
these messages, some into the thousands of milliseconds.  But the
smaller values appear on a pretty regular basis.

Does the warning mean that it's not responded within a single tick?

The message as it is has a fairly low diagnostic value.  I guess if they
happened with enough frequency and that was a signal to get more
zookeeper servers, or that the client hosts were too overloaded, it
could be more helpful.

> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.3.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Lei Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837506#action_12837506 ] 

Lei Zhang commented on ZOOKEEPER-642:
-------------------------------------



Maybe I had misinterpreted what you meant by "timeout" - I was referring to
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/200908.mbox/%253C4A8D7B4B.5020701@apache.org%253E
.

"typically we suggest timeouts in the 20-30 second range"




Same. Still seeing same message every 10 seconds.


> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.4.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837543#action_12837543 ] 

Patrick Hunt commented on ZOOKEEPER-642:
----------------------------------------

Hi Lei,

tickTime is used by the servers, not really impacts the clients

the clients have a "timeout" value they use when connecting to the servers, this establishes the session timeout.

What I was suggesting on the link you listed was changing the timeout, not the tickTime. 2 sec is pretty low for timeout.

However in your case none of this matters. The "deadline exceeded" message you are seeing is purely an OS issue.

1) we select(..., deadline)
2) when we wake up from select we check how close we are to the deadline, if this is exceeded by > 10ms then we log a warning.

In your case the deadline is being exceeded by a significant amount, this is a warning because it could impact the ability
of the client to maintain connectivity with the server.


> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.4.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Lei Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837457#action_12837457 ] 

Lei Zhang commented on ZOOKEEPER-642:
-------------------------------------

I have taken over Dale's responsibility of zookeeper. We have bumped up tickTime to 20000 per Patrick's suggestion in another thread. Now we see these "Exceeded deadline by 769ms" message every 10 seconds - I'm testing using 'cli_st localhost:<port>', on a VMware Linux machine that is mostly idle . I echo Dale's comment:

The message as it is has a fairly low diagnostic value.

Since this message is at WARN level, I feel we need to do something. But what:
  o bump up priority of zookeeper daemon
  o check bug in client library
  o check bug in zookeeper server

Somehow this doesn't smell like a real "Exceeded timeline" issue to me.

> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.4.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-642) "exceeded deadline by N ms" floods logs

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-642:
-----------------------------------

    Fix Version/s:     (was: 3.1.2)
                   3.3.0

unfortunately it's not configurable & I'm not sure why 10 was chosen (nothing in the
src to indicate. 

What type of ec2 instance is this, small? Is this at all correlated with the
client/clienthost having increased workload?

I think we should make some sort of change here, not sure what. Perhaps this 
should be a % of timeout rather than fixed. Or perhaps made configurable. Alternately
we might make this debug (or switch to debug if > some number output in a period of
time, etc...)

> "exceeded deadline by N ms" floods logs
> ---------------------------------------
>
>                 Key: ZOOKEEPER-642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-642
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.2.1
>         Environment: virtualized linux - ec2 - ubuntu
>            Reporter: Dale Johnson
>             Fix For: 3.3.0
>
>
> More important zookeeper warnings are drown out by the following several times per minute:
> 2010-01-12 17:39:57,227:22317(0x4147eb90):ZOO_WARN@zookeeper_interest@1335: Exceeded deadline by 13ms
> Perhaps this is an issue with the way virtualized systems manage gettimeofday results?
> Maybe the current 10ms threshold could be pushed up a bit.  I notice that 95% of the messages are below 50ms.
> Is there an obvious configuration change that I can make to fix this?
> config file below:
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/mnt/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> server.1=hbase.1:2888:3888
> server.2=hbase.2:2888:3888
> server.3=hbase.3:2888:3888
> server.4=hbase.4:2888:3888
> server.5=hbase.5:2888:3888

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.