Posted to user@zookeeper.apache.org by Neha Narkhede <ne...@gmail.com> on 2011/11/02 18:16:29 UTC

Zookeeper session losing some watchers

Hi,

We've been seeing a problem with our ZooKeeper servers lately, where
all of a sudden a session loses some of the watchers registered on
some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
cluster in one DC establishing sessions (with a 6-second timeout) with
a ZK cluster (of 4 machines) in another DC and registering watchers on
some ZooKeeper paths. Every couple of weeks, we observe a problem with
the Kafka servers, where, on investigating further, we find that a
session has lost some of its key watches, but not all.

The last time this happened, we ran the wchc command on the ZK servers
and saw the problem. Unfortunately, we had lost the relevant
information from the ZK logs by the time we were ready to debug it
further. Since this causes the Kafka servers to stop making progress,
we want to set up some kind of alert for when this happens. This will
help us collect more information to give you. In particular, we were
thinking about running wchp periodically (maybe once a minute),
grepping for the ZK paths, and counting the number of watches that
should be registered for correct operation. But I observed that the
watcher info is not replicated across all ZK servers, so we would have
to query every ZK server in order to get the full list.
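
To make this concrete, here's a rough sketch of the kind of check we
had in mind. The hostnames and the path below are placeholders, not
our real config:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Sketch: ask one ZK server for its watch table via the "wchp"
    // four letter word and count the sessions watching a given path.
    // Watch info is per-server, so the counts have to be summed across
    // every member of the ensemble.
    public class WatchCountCheck {

        static int countWatchers(String host, int port, String path) throws IOException {
            try (Socket sock = new Socket(host, port);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(sock.getInputStream()))) {
                sock.getOutputStream().write("wchp".getBytes());
                sock.getOutputStream().flush();
                // wchp prints each watched path on its own line,
                // followed by one indented session id (0x...) per watcher
                int count = 0;
                boolean underPath = false;
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("\t") || line.startsWith(" ")) {
                        if (underPath) count++;
                    } else {
                        underPath = line.trim().equals(path);
                    }
                }
                return count;
            }
        }

        public static void main(String[] args) throws IOException {
            String[] servers = {"zk1", "zk2", "zk3", "zk4"}; // placeholder hosts
            String path = "/brokers/ids";                    // placeholder path
            int total = 0;
            for (String host : servers) {
                total += countWatchers(host, 2181, path);
            }
            System.out.println("watchers on " + path + ": " + total);
            // alert if total drops below the count expected for correct
            // operation
        }
    }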

I'm not sure running wchp periodically on all ZK servers is the best
option for this alert. Can you think of what the problem could be
here, and how we can set up this alert for now?

Thanks
Neha

Re: Zookeeper session losing some watchers

Posted by Patrick Hunt <ph...@apache.org>.
On Wed, Nov 2, 2011 at 10:16 AM, Neha Narkhede <ne...@gmail.com> wrote:
> We've been seeing a problem with our ZooKeeper servers lately, where
> all of a sudden a session loses some of the watchers registered on
> some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
> cluster in one DC establishing sessions (with a 6-second timeout) with
> a ZK cluster (of 4 machines) in another DC and registering watchers on
> some ZooKeeper paths. Every couple of weeks, we observe a problem with
> the Kafka servers, where, on investigating further, we find that a
> session has lost some of its key watches, but not all.
>

Any pattern to this? Data watches vs Child watches? A particular part
of the data hierarchy?

Are the sessions migrating at all? A 6-second timeout between
datacenters doesn't allow for much in the way of network flakiness;
any idea how frequently the sessions are moving?

Is the ZK server ensemble (cluster) itself stable? Is leadership
changing at all, or are servers entering/leaving the quorum?
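
FWIW, a cheap way to keep an eye on that: poll each server's "stat"
four letter word and watch the Mode line (leader/follower); a change
between polls means an election happened. A minimal sketch, with
placeholder hostnames:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;

    // Print each ensemble member's Mode line from the "stat" four
    // letter word; a change between runs indicates a leader election.
    public class ModeCheck {
        public static void main(String[] args) throws Exception {
            for (String host : new String[] {"zk1", "zk2", "zk3", "zk4"}) {
                try (Socket s = new Socket(host, 2181);
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()))) {
                    s.getOutputStream().write("stat".getBytes());
                    s.getOutputStream().flush();
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (line.startsWith("Mode:")) {
                            System.out.println(host + " " + line);
                        }
                    }
                }
            }
        }
    }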

> The last time this happened, we ran the wchc command on the ZK servers
> and saw the problem. Unfortunately, we had lost the relevant
> information from the ZK logs by the time we were ready to debug it
> further. Since this causes the Kafka servers to stop making progress,
> we want to set up some kind of alert for when this happens. This will
> help us collect more information to give you. In particular, we were
> thinking about running wchp periodically (maybe once a minute),
> grepping for the ZK paths, and counting the number of watches that
> should be registered for correct operation. But I observed that the
> watcher info is not replicated across all ZK servers, so we would have
> to query every ZK server in order to get the full list.

That's correct.

> I'm not sure running wchp periodically on all ZK servers is the best
> option for this alert. Can you think of what the problem could be
> here, and how we can set up this alert for now?

Jamie brings up a possible problem in his reply (the chroot bug); perhaps it's that?

I can't think of anything other than wchp. That sounds like a
reasonable way to track it to me. Once you do see a hit, collect as
much data as possible and try to correlate it against the server logs.

Patrick

Re: Zookeeper session losing some watchers

Posted by Jamie Rothfeder <ja...@gmail.com>.
Hi Jun,

It depends. You might just reregister the watch on another node
(specifically, the original node minus the chroot). This case is
really easy to test, even on a single, locally running instance. Just
create a watch, then print out the watches using wchc or wchp. Restart
the ZooKeeper server. After the client automatically reconnects, rerun
the four letter word to observe what happened to the watch.
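
Something along these lines (just a sketch; it assumes a local server
on port 2181 and that the /kafka chroot node already exists):

    import org.apache.zookeeper.ZooKeeper;

    // Sketch of the test above. The "/kafka" suffix on the connect
    // string is the chroot. After starting this, run wchp against the
    // server, restart the server, wait for the client to reconnect,
    // and run wchp again. With ZOOKEEPER-961 the watch can come back
    // on the un-chrooted path (/brokers instead of /kafka/brokers).
    public class ChrootWatchTest {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181/kafka", 6000,
                    event -> System.out.println("event: " + event));
            zk.exists("/brokers", true);   // registers a data watch
            Thread.sleep(Long.MAX_VALUE);  // keep the session alive
        }
    }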

-Jamie

On Nov 7, 2011, at 7:27 PM, Jun Rao <ju...@gmail.com> wrote:

> Jamie,
> 
> We do use chroot. However, the chroot bug would cause all watchers to
> be lost, not just some, right?
> 
> Thanks,
> 
> Jun
> 
> On Wed, Nov 2, 2011 at 7:34 PM, Jamie Rothfeder <ja...@gmail.com> wrote:
> 
>> Hi Neha,
>> 
>> I encountered a similar problem with ZooKeeper losing watches and found
>> that it was related to this bug:
>> 
>> https://issues.apache.org/jira/browse/ZOOKEEPER-961
>> 
>> Are you using a chroot?
>> 
>> Thanks,
>> Jamie

Re: Zookeeper session losing some watchers

Posted by Jun Rao <ju...@gmail.com>.
Jamie,

We do use chroot. However, the chroot bug would cause all watchers to
be lost, not just some, right?

Thanks,

Jun

On Wed, Nov 2, 2011 at 7:34 PM, Jamie Rothfeder <ja...@gmail.com> wrote:

> Hi Neha,
>
> I encountered a similar problem with ZooKeeper losing watches and found
> that it was related to this bug:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-961
>
> Are you using a chroot?
>
> Thanks,
> Jamie

Re: Zookeeper session losing some watchers

Posted by Jamie Rothfeder <ja...@gmail.com>.
Hi Neha,

I encountered a similar problem with ZooKeeper losing watches and found
that it was related to this bug:

https://issues.apache.org/jira/browse/ZOOKEEPER-961

Are you using a chroot?

Thanks,
Jamie
