Posted to user@zookeeper.apache.org by Kuba Lekstan <ku...@gmail.com> on 2014/11/13 17:25:19 UTC

cluster/ephemeral nodes inconsistency

Hello,

Some details:
We have a 5-node cluster, which we use for configuration distribution and
monitoring active instances of our applications. Each application creates
its own ephemeral node, so we know which apps are alive, how many of them there
are, and what they are doing.

The problem happened on 4th November, the first time around 4AM, the
second time around 12PM.
The first time it was the middle of the night when I got woken up; the support
guys told me that something was wrong with config distribution.

First I checked the apps for errors but didn't find anything interesting,
then I looked at what was in ZooKeeper (using node-zk-browser).
I noticed that there were 3 ephemeral nodes which had been created on 1st Nov
(while the oldest application was started on 3rd Nov). I could read their
data but was not able to delete them - I was getting a NONODE exception.

I thought wtf - why can't I delete these nodes? Something very bad must have
happened to ZK.

So I sshed to the leader and tried to read these nodes using the CLI, but I
was not able to - the leader was telling me that such nodes don't exist.
After this I started to ssh to the rest of the nodes in the cluster and tried
to read these nodes there. Finally I found the server which did let me read the
data of these nodes.
Because of the inconsistency I decided to restart it. The restart did help,
everything went back to normal. The ephemeral nodes disappeared.
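
The per-server check described above can be sketched in Python. This is only a
minimal illustration: it assumes you have already collected, for each server,
the set of znode paths that server returns when it is queried directly (for
example by pointing a client at one server at a time). The server names, the
paths, and the helper `find_inconsistent_znodes` are all hypothetical and not
part of ZooKeeper.

```python
from typing import Dict, Set


def find_inconsistent_znodes(views: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per znode path, the servers that do NOT see that path.

    `views` maps a server name to the set of znode paths that server
    reported when it was queried directly. Paths seen by every server
    are omitted from the result.
    """
    all_paths: Set[str] = set()
    for paths in views.values():
        all_paths |= paths

    missing: Dict[str, Set[str]] = {}
    for path in sorted(all_paths):
        absent = {server for server, paths in views.items() if path not in paths}
        if absent:
            missing[path] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical data: zk3 still serves a znode the other servers no longer have.
    views = {
        "zk1": {"/apps/a"},
        "zk2": {"/apps/a"},
        "zk3": {"/apps/a", "/apps/stale"},
    }
    for path, servers in find_inconsistent_znodes(views).items():
        print(path, "is missing on", sorted(servers))
```

A path that shows up in the result is visible on at least one server and absent
on the listed ones, which is the kind of disagreement reported in this thread.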

A similar situation happened at 12PM, but this time I had a lot more time to
look at what was wrong. The second time the problem was about 3 ephemeral nodes
which were created on 1st Nov (again?). This time I dug a bit deeper and
looked into the logs and the 4 letter commands - but could not find anything
interesting, except that all these 3 nodes were created under different
session IDs but ZK had no hosts connected under these session IDs.
The solution was similar to the one from 4AM, but this time I deleted all
files in the ZK data directory.
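
The check described here (ephemeral nodes whose owning session is no longer
connected) could be sketched roughly as follows. It assumes you have gathered
each znode's `ephemeralOwner` (the session ID shown in the znode's stat, zero
for non-ephemeral znodes) and the set of session IDs the server currently
considers live, e.g. from the four letter commands. The helper name and the
sample values are hypothetical.

```python
from typing import Dict, List, Set


def find_orphaned_ephemerals(owners: Dict[str, int], live_sessions: Set[int]) -> List[str]:
    """Return paths of ephemeral znodes whose owner session is not live.

    `owners` maps znode path -> ephemeralOwner session ID. An owner of 0
    means the znode is not ephemeral, so it is skipped.
    """
    return sorted(
        path
        for path, session_id in owners.items()
        if session_id != 0 and session_id not in live_sessions
    )


if __name__ == "__main__":
    # Hypothetical data: session 0x2 is gone but its ephemeral znode remains.
    owners = {"/apps/a": 0x1, "/apps/b": 0x2, "/config": 0}
    print(find_orphaned_ephemerals(owners, live_sessions={0x1}))
```

Any path returned by this helper is an ephemeral znode that should already have
been removed when its session ended.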

Oddly enough, the problem happened twice on the same ZK node; the final
solution was to clear the ZK data directory. After clearing the directory the
problem didn't happen again.

I tried to look for solutions/similar problems. I found posts where
people were complaining about ephemeral nodes not being removed after
the client session gets closed, but I was not able to find posts about ZK not
being consistent.

What do you think about this? Can we do something to fix this?

Sorry for my english, I was doing my best. :)

Thanks, Kuba.

Re: cluster/ephemeral nodes inconsistency

Posted by kishore g <g....@gmail.com>.
Can you provide more info about the ZooKeeper deployment? Are you running
any other applications alongside the ZooKeeper servers on the same nodes? I
remember seeing this issue when a ZooKeeper server suffers from GC (GC
pauses longer than the session timeout).

Looking at the timestamps of the operations at the individual ZK servers will
also help in triaging this issue. Can you attach/paste the changes
that happened to this znode from the transaction logs? (You can use
ZkLogFormatter or the ZKGrep
<https://issues.apache.org/jira/browse/HELIX-356> tool we wrote in Helix.)
We might be able to understand the sequence of operations.
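
As a rough illustration of pulling one znode's history out of a transaction
log, here is a minimal Python sketch. It assumes the log has already been
dumped to text with a formatter such as LogFormatter, that each transaction
ends up on a single line, and that the znode path appears in that line in plain
text. The function name and the sample lines are hypothetical; the sample lines
are simplified stand-ins, not real formatter output.

```python
from typing import Iterable, List


def transactions_for_znode(lines: Iterable[str], znode_path: str) -> List[str]:
    """Keep only the formatted transaction lines that mention `znode_path`.

    This is a plain substring match, so longer paths that start with
    `znode_path` (e.g. children of the znode) are matched as well.
    """
    return [line.rstrip("\n") for line in lines if znode_path in line]


if __name__ == "__main__":
    # Made-up, simplified lines standing in for a formatted transaction log.
    sample = [
        "session 0x1 create /apps/a\n",
        "session 0x2 setData /apps/b\n",
        "session 0x1 setData /apps/a\n",
    ]
    for line in transactions_for_znode(sample, "/apps/a"):
        print(line)
```

Running the same filter over the formatted logs of each server and comparing
the results is one way to see where the sequence of operations diverges.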



On Wed, Jan 14, 2015 at 1:30 PM, Flavio Junqueira <
fpjunqueira@yahoo.com.invalid> wrote:

> Also, what was the last operation that changed the messed-up znode, and
> when was that operation executed?
>
> -Flavio

Re: cluster/ephemeral nodes inconsistency

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.
Also, what was the last operation that changed the messed-up znode, and when was that operation executed?

-Flavio

> On 14 Jan 2015, at 12:40, Flavio Junqueira <fp...@yahoo.com> wrote:
> 
> But you do observe the session being closed, yes? And the ephemeral can be listed with getChildren but you can't get it with getData, is it right?
> 
> -Flavio


Re: cluster/ephemeral nodes inconsistency

Posted by Kuba Lekstan <ku...@gmail.com>.
As far as I understand, this issue:
https://issues.apache.org/jira/browse/ZOOKEEPER-1777 is about some ZK nodes
not seeing some of the existing ephemeral znodes. I have the opposite problem:
some ZK nodes are seeing ephemeral nodes that no longer exist.

2015-01-14 12:39 GMT+01:00 Kuba Lekstan <ku...@gmail.com>:

> German, today it happened on our secondary cluster, which consists of 3
> nodes; the leader didn't see the node but the two followers did.
>
> Flavio, I browsed the logs but was unable to find anything interesting;
> only setData operations were issued.
>
> The problematic znode was last modified on 13 Jan 2015 at 17:xx; we noticed
> the issue on 14 Jan 2015 at 11:xx.

Re: cluster/ephemeral nodes inconsistency

Posted by Kuba Lekstan <ku...@gmail.com>.
German, today it happened on our secondary cluster, which consists of 3
nodes; the leader didn't see the node but the two followers did.

Flavio, I browsed the logs but was unable to find anything interesting;
only setData operations were issued.

The problematic znode was last modified on 13 Jan 2015 at 17:xx; we noticed
the issue on 14 Jan 2015 at 11:xx.

2015-01-14 10:52 GMT+01:00 Flavio Junqueira <fp...@yahoo.com.invalid>:

> Hi there,
> I suggest a couple of things here:
> - Use LogFormatter to look into the transaction logs to check the
>   operations that are actually coming across.
> - It would be nice to be able to reproduce it outside your app, ideally as
>   a JUnit test, so that we can start working on it.
> I vaguely remember coming across such a problem, but I'll need to dig into
> it. Does anyone on this list recall a similar problem?
> -Flavio

Re: cluster/ephemeral nodes inconsistency

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.
Hi there,
I suggest a couple of things here:
- Use LogFormatter to look into the transaction logs to check the operations that are actually coming across.
- It would be nice to be able to reproduce it outside your app, ideally as a JUnit test, so that we can start working on it.
I vaguely remember coming across such a problem, but I'll need to dig into it. Does anyone on this list recall a similar problem?
-Flavio

On Wednesday, January 14, 2015 9:14 AM, Kuba Lekstan <ku...@gmail.com> wrote:

> German, do you have any idea what might be causing this? Today the same issue
> happened.


Re: cluster/ephemeral nodes inconsistency

Posted by German Blanco <ge...@gmail.com>.
Could you please check whether it could be this issue?
https://issues.apache.org/jira/browse/ZOOKEEPER-1777
Did you check whether the node with the extra ephemeral nodes was the leader or
not?


On Wed, Jan 14, 2015 at 10:14 AM, Kuba Lekstan <ku...@gmail.com> wrote:

> German, do you have any idea what might be causing this? The same issue
> happened again today.

Re: cluster/ephemeral nodes inconsistency

Posted by Kuba Lekstan <ku...@gmail.com>.
German, do you have any idea what might be causing this? The same issue
happened again today.

2014-11-21 5:42 GMT+01:00 Yogesh Patil <pa...@gmail.com>:

> Hi Zookeepers,
> I have been experiencing a similar problem since yesterday. I have a very
> similar setup, with ephemeral znodes in place for a keep-alive kind of
> function. I also see that even after the ZK session goes down, the
> ephemeral znodes still live.
>
> I am using ZK 3.5.0.
>
> Is there any solution or fix for this type of issue?
>
>
> --
> Sincerely,
>
> *Yogesh Patil*

Re: cluster/ephemeral nodes inconsistency

Posted by Yogesh Patil <pa...@gmail.com>.
Hi Zookeepers,
I have been experiencing a similar problem since yesterday. I have a very
similar setup, with ephemeral znodes in place for a keep-alive kind of
function. I also see that even after the ZK session goes down, the
ephemeral znodes still live.

I am using ZK 3.5.0.

Is there any solution or fix for this type of issue?


-- 
Sincerely,

*Yogesh Patil*



On Thu, Nov 13, 2014 at 2:10 PM, Kuba Lekstan <ku...@gmail.com> wrote:

> Sorry, forgot to mention. Version: 3.4.6.
>
> Thanks.

Re: cluster/ephemeral nodes inconsistency

Posted by Kuba Lekstan <ku...@gmail.com>.
Sorry, forgot to mention. Version: 3.4.6.

Thanks.

2014-11-13 18:11 GMT+01:00 German Blanco <ge...@gmail.com>:

> Hello,
>
> Which version of ZooKeeper are you using?

Re: cluster/ephemeral nodes inconsistency

Posted by German Blanco <ge...@gmail.com>.
Hello,

Which version of ZooKeeper are you using?
