Posted to dev@ambari.apache.org by "Dmitry Lysnichenko (JIRA)" <ji...@apache.org> on 2014/10/15 18:46:34 UTC

[jira] [Updated] (AMBARI-7791) HBase Master CPU utilization alert is not suppressed at MM

     [ https://issues.apache.org/jira/browse/AMBARI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-7791:
---------------------------------------
    Attachment: AMBARI-7791_branch-1.7.0.patch

> HBase Master CPU utilization alert is not suppressed at MM
> ----------------------------------------------------------
>
>                 Key: AMBARI-7791
>                 URL: https://issues.apache.org/jira/browse/AMBARI-7791
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 1.7.0
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 1.7.0
>
>         Attachments: AMBARI-7791_branch-1.7.0.patch
>
>
> Looks like we have a design flaw that affects suppressing some alerts. It causes a rare bug that probably affects 1.6.1.
> h2. The short story
> When we put HBase Master (or the entire HBase service) into maintenance mode (MM) and then stop HBase Master, the alert "HBase Master CPU utilization" pops up and is not suppressed. The issue reproduces only when HBase Master is located on a different host than the Nagios server.
> h2. How suppressing alerts works 
> When we put a service, host, or host component into MM, the server builds a complete map of the host components that are in MM and posts it to an agent. The agent writes this information to the file /var/nagios/ignore.dat in the following form:
> {code}
> vm-3.vm GANGLIA GANGLIA_MONITOR
> vm-0.vm HBASE HBASE_MASTER
> vm-3.vm HDFS DATANODE
> vm-2.vm HBASE HBASE_REGIONSERVER
> vm-0.vm HBASE HBASE_REGIONSERVER
> vm-1.vm HBASE HBASE_REGIONSERVER
> vm-3.vm YARN NODEMANAGER
> vm-3.vm HBASE HBASE_REGIONSERVER
> {code}
> All Nagios alerts are wrapped in a shell script (check_wrapper.sh). When an alert is generated, this wrapper checks whether the hostname, service name, and component name for the alert are present in /var/nagios/ignore.dat. If they are, the alert is suppressed.
> h2. What exactly is broken
> In https://issues.apache.org/jira/browse/AMBARI-6358 we had a requirement to have only one 'HBase Master CPU utilization' check even in HA mode. So this check is bound to the Nagios host (so that it is executed only once even if the hbase master hostgroup has more than one host, as is done for the "* Percent Count" alerts). As a result, the HBase Master alert origin data does not match any entry in /var/nagios/ignore.dat, which is why the alert is not suppressed.
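
The lookup and the mismatch described above can be illustrated with a minimal sketch. This is not the code of check_wrapper.sh (which is a shell script and is not shown in this message); it is a Python model written under assumptions: the function names parse_ignore_entries and is_suppressed are hypothetical, the match is assumed to be an exact comparison of the (host, service, component) triple, and the Nagios server host name nagios.vm is invented for the example.

```python
from typing import Iterable, Set, Tuple

# (host, service, component), mirroring one line of /var/nagios/ignore.dat
Entry = Tuple[str, str, str]


def parse_ignore_entries(lines: Iterable[str]) -> Set[Entry]:
    """Parse whitespace-separated 'host service component' lines (hypothetical helper)."""
    entries: Set[Entry] = set()
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            entries.add((parts[0], parts[1], parts[2]))
    return entries


def is_suppressed(entries: Set[Entry], host: str, service: str, component: str) -> bool:
    """Assumed rule: suppress only when the exact triple is listed."""
    return (host, service, component) in entries


# Sample content taken from the issue description.
IGNORE_DAT = """\
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
vm-2.vm HBASE HBASE_REGIONSERVER
vm-0.vm HBASE HBASE_REGIONSERVER
vm-1.vm HBASE HBASE_REGIONSERVER
vm-3.vm YARN NODEMANAGER
vm-3.vm HBASE HBASE_REGIONSERVER
"""

entries = parse_ignore_entries(IGNORE_DAT.splitlines())

# An alert whose origin is the HBase Master host matches an entry.
master_bound = is_suppressed(entries, "vm-0.vm", "HBASE", "HBASE_MASTER")

# The CPU utilization check is bound to the Nagios host instead.
# "nagios.vm" is a made-up host name; it has no HBASE_MASTER entry.
nagios_bound = is_suppressed(entries, "nagios.vm", "HBASE", "HBASE_MASTER")

print(master_bound, nagios_bound)
```

In this model the alert bound to vm-0.vm is suppressed, while the same alert bound to the Nagios host is not, because no line in the file carries that host name together with HBASE HBASE_MASTER. When HBase Master and the Nagios server share a host the two triples coincide, which is consistent with the issue reproducing only when they are on different hosts.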


