Posted to user@zookeeper.apache.org by Ricky Burnett <rb...@splunk.com.INVALID> on 2021/08/26 23:15:33 UTC

Problems expanding zookeeper cluster

Hey,
I recently expanded my ZooKeeper cluster from 3 nodes to 5 nodes and have
run into some issues.  The cluster is deployed on Kubernetes and is used
for Flink HA.

My first symptom is that Flink is unable to perform leader election
correctly.  It successfully connects to ZooKeeper and retrieves the
leaderlatch and leader information, but it never becomes the new leader.
The problem is that the existing leader information is wrong.  I am also
unable to delete the latch and the znode using zkCli: the CLI replies
"Node does not exist: " even though I can query the node and see data.

We are also seeing data integrity errors in the logs, and we aren't sure
whether they are related:
"Message: Digests are not matching. Value is Zxid. Last value:16617228477804"

How can I go about clearing the erroneous state?  Why did this occur?  And
how can I prevent this from happening in the future?

Also what additional information is needed to help debug this issue?
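
In case it is useful for comparing the servers, here is a minimal sketch of
how per-server state could be collected.  It assumes the "srvr" four-letter
command is permitted by 4lw.commands.whitelist and that the client port is
2181; the helper names (four_letter_word, parse_srvr, split_zxid) are made up
for illustration.  It also splits a zxid into its epoch (high 32 bits) and
counter (low 32 bits), assuming the value in the log message above is decimal.

```python
import socket
import sys


def four_letter_word(host, port=2181, cmd="srvr", timeout=5.0):
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Turn 'Key: value' lines from the reply into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF


if __name__ == "__main__":
    # Hosts are passed on the command line; nothing is queried without them.
    for host in sys.argv[1:]:
        try:
            fields = parse_srvr(four_letter_word(host))
        except OSError as exc:
            print(f"{host}: could not query ({exc})")
            continue
        mode = fields.get("Mode", "?")
        raw_zxid = fields.get("Zxid")  # reported in hex, e.g. 0x100000002
        if raw_zxid is None:
            print(f"{host}: mode={mode} zxid=?")
            continue
        epoch, counter = split_zxid(int(raw_zxid, 16))
        print(f"{host}: mode={mode} zxid={raw_zxid} "
              f"epoch={epoch} counter={counter}")
```

It would be run with one argument per server, for example the pod or service
hostnames of the five nodes.  Servers that report different epochs, or more
or fewer than one leader, would be the first thing I would look at.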