Posted to user@zookeeper.apache.org by rajsura <ra...@gmail.com> on 2020/05/02 07:06:02 UTC

ZooKeeper config caching issues?

Hello,

We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6.
We encountered no issues as such.

This is what the ZooKeeper config looked like:



After the upgrade, we had to migrate server.22 on the same node, but with a
FOO.bar.com domain name, due to Kerberos referral issues. We also used a
different server identifier, 23, when we migrated. This is what the new
config looked like:



We restarted all the nodes in the ensemble with the above updated config.
And the migrated node joined the quorum successfully and was serving all
clients directly connected to it, without any issues.

Recently, when a leader election happened,
server.23=node5.foo.bar.com (the migrated node) was chosen as Leader (as it
has the highest ID). But then ZooKeeper was unable to serve any clients, and
all the servers were somehow still trying to establish a channel to 22 (old
DNS name: node5.bar.com) and were throwing the error below in a loop:



Fetching the config from the live ZooKeeper znode also doesn't show "22" as
a member of the ensemble. It's not clear how "22" is still coming into the
picture.



We suspected some weird caching issue and restarted ZooKeeper across all the
nodes, but that didn't help. Whenever node5 becomes the Leader, ID:22 pops
up again. We even rebooted node5, and that hasn't helped either.

We also looked at the '/zookeeper/config' content in the snapshot files and
did not find any reference to ID:22.

Any help would be greatly appreciated.

Thanks,
Rajkiran



--
Sent from: http://zookeeper-user.578899.n2.nabble.com/

Re: ZooKeeper config caching issues?

Posted by rajsura <ra...@gmail.com>.
Hi Enrico,

As you suggested, I tried removing a node and adding a new node, in the hope
that the config would be updated somehow. But it did not help. The new node
joined the cluster, synced the data, and was serving clients the correct
data. But when it assumed the leader role, the entire cluster would just go
unresponsive. Once the leader role moved elsewhere, the cluster would respond
again. This kinda looks like a bug to me. Could you please check?

Finally, I set `reconfigEnabled=true` across the cluster and then used the
`reconfig` functionality to remove the old/stale node and add back the new
node. I had to restart the nodes a couple of times for this to take effect.
This time, when the new node assumed the leader role, the cluster was
responsive and no issues were observed.
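
For readers following along, here is a minimal sketch of what that remove/add
sequence could look like as zkCli.sh `reconfig` commands. It only builds the
command strings; it does not talk to ZooKeeper. The helper names
(`server_spec`, `reconfig_commands`) are hypothetical, and the ports, role, and
client address for server.23 are assumptions copied from the other entries in
the dynamic.next dump elsewhere in this thread, not from the actual config.
The exact commands that were run are not shown in the thread.

```python
def server_spec(sid, host, quorum_port=2888, election_port=3888,
                role="participant", client_addr="0.0.0.0:2181"):
    """Format one server entry in the 3.5-style dynamic config syntax.

    The default ports/role/client address are assumptions taken from the
    other entries in the dynamic.next dump, not from the real config.
    """
    return f"server.{sid}={host}:{quorum_port}:{election_port}:{role};{client_addr}"


def reconfig_commands(leaving_ids, joining_specs):
    """Build zkCli.sh 'reconfig' commands: removals first, then additions."""
    cmds = [f"reconfig -remove {sid}" for sid in leaving_ids]
    cmds += [f"reconfig -add {spec}" for spec in joining_specs]
    return cmds


# Remove the stale id 22, then add id 23 under the new DNS name.
commands = reconfig_commands([22], [server_spec(23, "node5.foo.bar.com")])
for cmd in commands:
    print(cmd)
```

These commands would be typed into a zkCli.sh session connected to the
ensemble, and they only work once `reconfigEnabled=true` is set on the
servers. Depending on the setup, the session may also need sufficient
privileges on the config znode.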

So, this definitely looks like a bug in some corner of the code that is
hard-coded/told to look only at dynamicConfig?

Thanks again for your help, much appreciated.

Regards,
Rajkiran




Re: ZooKeeper config caching issues?

Posted by rajsura <ra...@gmail.com>.
Thanks a lot, Enrico, for your reply.

Yes, we have done a lot of migrations so far and never faced any issues. It
is only this time, when we had to change just the DNS name, that we are
seeing this issue.

Thanks again for both the tips, we will try adding a new node and then
discard this node.

Regards,
Rajkiran




Re: ZooKeeper config caching issues?

Posted by Enrico Olivelli <eo...@gmail.com>.
Rajkiran
I am not sure that changing the addresses this way is supported. Apart from
that...
Maybe you can try to enable reconfig and use it to fix the problem.

Otherwise, another way would be to add new nodes with the new addresses and
then remove the old nodes.


Enrico



On Sat, 2 May 2020, 10:42 rajsura <ra...@gmail.com> wrote:

> Latest observation: we noticed that ZooKeeper was complaining about the
> dynamic.next file, even though we HAVE NOT ENABLED
> dynamic reconfiguration.
>
>
> 2020-05-02 01:43:05,870 [myid:21] - ERROR
> [QuorumPeer[myid=21](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1637]
> - Error writing next dynamic config file to disk:
>
>
> The zookeeper user did not have permissions on that config directory, so
> we fixed that and restarted ZooKeeper. It then dumped the dynamic.next
> file below, which contains the OLD migrated node as a member :O
>
>
> $ sudo cat /opt/zookeeper/conf/zoo.cfg.dynamic.next
> server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181
> server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181
> server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181
> server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181
> server.22=node5.bar.com:2888:3888:participant;0.0.0.0:2181
>
>
> So, this looks like a bug. Where is it still fetching this from? How do
> we fix it?
> Any lead/help is very much appreciated.
>
> FTR: We haven't enabled dynamic reconfig.
>
> Thanks in advance,
> Rajkiran

Re: ZooKeeper config caching issues?

Posted by rajsura <ra...@gmail.com>.
Latest observation: we noticed that ZooKeeper was complaining about the
dynamic.next file, even though we HAVE NOT ENABLED dynamic reconfiguration.


2020-05-02 01:43:05,870 [myid:21] - ERROR
[QuorumPeer[myid=21](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1637]
- Error writing next dynamic config file to disk:


The zookeeper user did not have permissions on that config directory, so we
fixed that and restarted ZooKeeper. It then dumped the dynamic.next file
below, which contains the OLD migrated node as a member :O


$ sudo cat /opt/zookeeper/conf/zoo.cfg.dynamic.next
server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.22=node5.bar.com:2888:3888:participant;0.0.0.0:2181


So, this looks like a bug. Where is it still fetching this from? How do
we fix it?
Any lead/help is very much appreciated.

FTR: We haven't enabled dynamic reconfig.
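
For anyone wanting to run the same comparison, here is a minimal sketch that
parses `server.N=...` lines like the ones in the dump above and reports ids
that should not be there. The function names (`parse_members`,
`diff_members`) are hypothetical. The expected membership is an assumption
reconstructed from this thread: ids 17, 19, 20, 21 as dumped, plus 23 under
node5.foo.bar.com. The regular expression is simplified and does not handle
IPv6 literals.

```python
import re

# server.<id>=<host>:<quorum port>:<election port>[:<role>][;<client address>]
SERVER_LINE = re.compile(
    r"^server\.(\d+)=([^:;\s]+):(\d+):(\d+)(?::(\w+))?(?:;(\S+))?$"
)


def parse_members(text):
    """Return {server id: host} for every server.N line in the text."""
    members = {}
    for line in text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            members[int(match.group(1))] = match.group(2)
    return members


def diff_members(expected, actual):
    """Return (unexpected ids, missing ids), each as a sorted list."""
    unexpected = sorted(set(actual) - set(expected))
    missing = sorted(set(expected) - set(actual))
    return unexpected, missing


# Contents of zoo.cfg.dynamic.next as dumped above.
DYNAMIC_NEXT = """\
server.17=node1.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.19=node2.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.20=node3.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.21=node4.foo.bar.com:2888:3888:participant;0.0.0.0:2181
server.22=node5.bar.com:2888:3888:participant;0.0.0.0:2181
"""

# Assumed intended membership after the migration (not taken from a real config).
EXPECTED = {
    17: "node1.foo.bar.com",
    19: "node2.foo.bar.com",
    20: "node3.foo.bar.com",
    21: "node4.foo.bar.com",
    23: "node5.foo.bar.com",
}

unexpected, missing = diff_members(EXPECTED, parse_members(DYNAMIC_NEXT))
print("unexpected ids:", unexpected)  # prints: unexpected ids: [22]
print("missing ids:", missing)        # prints: missing ids: [23]
```

The same two functions could be pointed at the text of zoo.cfg or at the
content read from the '/zookeeper/config' znode, to see which source still
carries the stale id.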

Thanks in advance, 
Rajkiran


