Posted to user@zookeeper.apache.org by Prabhjot Bharaj <pr...@gmail.com> on 2015/11/23 11:23:06 UTC

Zookeeper JMX monitoring - important parameters

Hello Folks,

I would like to know which zookeeper parameters are important to monitor
on a zookeeper server via its JMX port. I've set up my 5-node zookeeper
ensemble following the steps on this page:
https://zookeeper.apache.org/doc/r3.4.6/zookeeperJMX.html#ch_console

After connecting to the JVM via jconsole, I can see the stats. But I would
like to know which stats/values we can send to our reporting system so that
we can be alerted if some vital parameter shows an unexpected value.
--------------------------------------------
Here is the homework I've done on it:-

1. QuorumSize (under ReplicatedServer_id<#myid value>) - Must always be
equal to the number of nodes in zookeeper.conf.

a. Example MBean - org.apache.ZooKeeperService:name0=ReplicatedServer_id7

b. Alert - It should never be lower than (floor(n/2) + 1), where n is the
total number of machines participating in the ensemble. If it goes lower,
the cluster’s health is bad.

c. Procedure - bounce the servers which are not participating in the quorum
and see if that changes this attribute.
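
To make the threshold in item 1 concrete, here is a minimal sketch of the floor(n/2) + 1 check. The function names are hypothetical and the reported value is assumed to have already been read from the QuorumSize attribute by whatever JMX collector is in use; nothing here talks to ZooKeeper itself.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a majority: floor(n/2) + 1."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def quorum_alert(reported_quorum_size: int, ensemble_size: int) -> bool:
    """Return True when the reported value is below the majority threshold.

    reported_quorum_size is assumed to be the QuorumSize value already
    collected over JMX (hypothetical input, not fetched here).
    """
    return reported_quorum_size < majority(ensemble_size)
```

For the 5-node ensemble described above, majority(5) is 3, so the alert would fire once the reported value drops to 2 or lower.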

2. NodeCount (under InMemoryDataTree) - should be equal on all the machines
in a cluster. This helps us check the consistency of the data (znodes)
across the zookeeper cluster.

a. Example MBean -
org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader,name3=InMemoryDataTree

b. Alert - if any server in the ensemble reports a NodeCount that differs
from the value on the other servers, fire an alert. NodeCount is the number
of znodes in the data tree, so it is compared across servers and not against
the ensemble size.

c. Procedure - There is no generalised solution for this. This will need
investigation.
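
A rough sketch of the cross-server comparison in item 2 follows. The host names, the function name and the tolerance parameter are all assumptions made for illustration; the tolerance is there only because a server that is slightly behind may briefly report a different count, and whether to allow for that is a judgment call.

```python
from typing import Dict


def node_count_mismatch(node_counts: Dict[str, int], tolerance: int = 0) -> bool:
    """Return True when NodeCount differs across servers by more than tolerance.

    node_counts maps a server name to the NodeCount value already collected
    from that server's InMemoryDataTree MBean (hypothetical input).
    """
    if not node_counts:
        return False
    values = list(node_counts.values())
    return max(values) - min(values) > tolerance
```

In practice it may be worth alerting only when the mismatch persists over several consecutive samples.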

3. Memory Management -
a. GarbageCollection - Important parameters for monitoring garbage
collection on the zookeeper server nodes. Any value in this section that is
significantly higher than that of the other nodes in the ensemble can point
to a problem in the cluster.
i. ConcurrentMarkSweep time, to be monitored across all nodes
Example MBean - java.lang:type=GarbageCollector,name=ConcurrentMarkSweep
ii. ParNew time, to be monitored across all nodes
Example MBean - java.lang:type=GarbageCollector,name=ParNew
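
One possible reading of "significantly higher than that of other nodes" is sketched below: a server is flagged when its GC time exceeds a multiple of the median of the other servers. The factor of 2.0, the function name and the host names are assumptions, not recommendations. The inputs are assumed to be GC times in milliseconds over the same sampling interval on every server; cumulative totals since JVM start would need to be turned into deltas first.

```python
from statistics import median
from typing import Dict, List


def gc_outliers(gc_time_ms: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return servers whose GC time exceeds factor * median of the others.

    gc_time_ms maps a server name to GC time spent during the same interval
    (hypothetical input, already collected over JMX).
    """
    flagged = []
    for host, value in gc_time_ms.items():
        others = [v for h, v in gc_time_ms.items() if h != host]
        if others and value > factor * median(others):
            flagged.append(host)
    return sorted(flagged)
```

The same check can be run separately for the ConcurrentMarkSweep and ParNew collectors.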

4. Leader count - this must be 1 at all times. Out of all the
replica.<#myid values> under ReplicatedServer_id<#myid value> on all
machines, there should be only 1 leader.

a. Example MBean -
org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader

b. Alert - name<x>=Leader should appear on only 1 of the nodes reporting
data in the cluster. Set up an alert on this. If the alert fires, it means
zookeeper went through a split brain, which is high-risk.

c. Procedure - check that the network is healthy amongst the machines. If
there is n/w slowness amongst nodes in a rack, or across racks (in case
zookeeper nodes are placed across racks), it must be taken care of. Until it
is solved, find a machine which has good n/w connectivity, push a config
that adds this new machine to the cluster, and remove the existing machine
from the cluster.
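
The leader count in item 4 could be derived from the MBean names each server exposes, roughly as sketched here. The function names and host names are hypothetical, and the input is assumed to be a list of object-name strings per server that a JMX collector has already gathered. The "Follower" name used in the usage note is also an assumption for illustration.

```python
from typing import Dict, List


def is_leader(object_names: List[str]) -> bool:
    """Return True if any MBean name has a name<x> key with value 'Leader'."""
    for name in object_names:
        _, _, props = name.partition(":")  # drop the domain before ':'
        for prop in props.split(","):
            key, _, value = prop.partition("=")
            if key.strip().startswith("name") and value.strip() == "Leader":
                return True
    return False


def leader_count(beans_by_host: Dict[str, List[str]]) -> int:
    """Count how many servers currently expose a Leader MBean."""
    return sum(1 for names in beans_by_host.values() if is_leader(names))
```

An alert would then fire whenever leader_count(...) != 1: a count above 1 matches the split-brain case described above, while a count of 0 means no server is reporting itself as leader at that moment. For example, one server exposing "...name1=replica.7,name2=Leader" and another exposing "...name1=replica.3,name2=Follower" gives a count of 1.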



--------------------------------------------


I would like to know if the above parameters are sufficient for monitoring
the cluster, or whether I missed something. I would appreciate your help in
pointing me in the right direction. Please feel free to point out any
changes needed in the above write-up.


Thanks,

Prabhjot

Re: Zookeeper JMX monitoring - important parameters

Posted by Prabhjot Bharaj <pr...@gmail.com>.
Hello Folks,

Could you please share your experiences on this?

Thanks,
Prabhjot