Posted to dev@ambari.apache.org by "Dmitry Lysnichenko (JIRA)" <ji...@apache.org> on 2014/10/15 18:45:33 UTC

[jira] [Created] (AMBARI-7791) HBase Master CPU utilization alert is not suppressed at MM

Dmitry Lysnichenko created AMBARI-7791:
------------------------------------------

             Summary: HBase Master CPU utilization alert is not suppressed at MM
                 Key: AMBARI-7791
                 URL: https://issues.apache.org/jira/browse/AMBARI-7791
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 1.7.0
            Reporter: Dmitry Lysnichenko
            Assignee: Dmitry Lysnichenko
             Fix For: 1.7.0


Looks like we have a design flaw in how some alerts are suppressed. It causes a rare bug that probably also affects 1.6.1.

h2. The short story
When we put HBase Master (or the entire HBase service) into Maintenance Mode (MM) and then stop HBase Master, the alert "HBase Master CPU utilization" pops up and is not suppressed. The issue reproduces only when HBase Master is located on a different host than the Nagios server.

h2. How suppressing alerts works 
When we put a service, host, or host component into MM, the server builds a complete map of the host components that are in MM and posts it to an agent. The agent writes this info to the file /var/nagios/ignore.dat in the following form:
{code}
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
vm-2.vm HBASE HBASE_REGIONSERVER
vm-0.vm HBASE HBASE_REGIONSERVER
vm-1.vm HBASE HBASE_REGIONSERVER
vm-3.vm YARN NODEMANAGER
vm-3.vm HBASE HBASE_REGIONSERVER
{code}
All alerts in Nagios are wrapped in a shell script (check_wrapper.sh). When any alert is generated, this wrapper checks whether the hostname, service name and component name for this alert are present in /var/nagios/ignore.dat. If they are, the alert is suppressed.
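
The lookup described above can be sketched roughly as follows. This is only an illustration in Python, not the actual check_wrapper.sh (which is a shell script); the function names parse_ignore_entries and is_suppressed are hypothetical, and the sample data is a subset of the listing above.

```python
def parse_ignore_entries(text):
    """Parse ignore.dat-style text into a set of (host, service, component) tuples."""
    entries = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines; a valid entry has exactly three fields.
        if len(fields) == 3:
            entries.add(tuple(fields))
    return entries


def is_suppressed(entries, host, service, component):
    """An alert is suppressed only if its exact triple is listed."""
    return (host, service, component) in entries


# Sample content in the same form as /var/nagios/ignore.dat.
IGNORE_DAT = """\
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
"""

entries = parse_ignore_entries(IGNORE_DAT)
print(is_suppressed(entries, "vm-0.vm", "HBASE", "HBASE_MASTER"))  # listed
print(is_suppressed(entries, "vm-1.vm", "HBASE", "HBASE_MASTER"))  # not listed
```

Note that the match is on the full triple, so the hostname reported by the alert matters as much as the service and component names.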

h2. What exactly is broken
In https://issues.apache.org/jira/browse/AMBARI-6358 we had a requirement to have only one 'HBase Master CPU utilization' check even in HA mode. So this check is bound to the Nagios host (so that it is executed only once even if the hbase master hostgroup has more than one host, as is done for the "* Percent Count" alerts). As a result, the HBase Master alert origin data does not match any entry in /var/nagios/ignore.dat. That's why the alert is not suppressed.
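
To make the mismatch concrete, here is a small Python illustration. It assumes the wrapper compares the full (host, service, component) triple, as described in the previous section; the hostname nagios-host.vm is hypothetical and simply stands for a Nagios server that runs on a different host than HBase Master.

```python
# Entries as they would appear in /var/nagios/ignore.dat once
# HBase Master on vm-0.vm has been put into MM.
ignore_entries = {
    ("vm-0.vm", "HBASE", "HBASE_MASTER"),
}

# The CPU utilization check is bound to the Nagios host, so the alert
# reports that host as its origin. "nagios-host.vm" is a made-up name.
alert_origin = ("nagios-host.vm", "HBASE", "HBASE_MASTER")

# What the entry would look like if the check were bound to the
# HBase Master host instead.
expected_origin = ("vm-0.vm", "HBASE", "HBASE_MASTER")

print(alert_origin in ignore_entries)     # no matching entry: alert fires
print(expected_origin in ignore_entries)  # matching entry: alert suppressed
```

When the Nagios server and HBase Master share a host, the two triples coincide, which is consistent with the issue reproducing only on separate hosts.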


