Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2014/05/13 19:54:15 UTC

[jira] [Commented] (AMBARI-5681) Add Nagios alert if HDFS last checkpoint time exceeds threshold

    [ https://issues.apache.org/jira/browse/AMBARI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996694#comment-13996694 ] 

Andrew Onischuk commented on AMBARI-5681:
-----------------------------------------

Committed to branch-1.6.0

> Add Nagios alert if HDFS last checkpoint time exceeds threshold
> ---------------------------------------------------------------
>
>                 Key: AMBARI-5681
>                 URL: https://issues.apache.org/jira/browse/AMBARI-5681
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>             Fix For: 1.6.0
>
>
> Description: If the Secondary NameNode (SNN) fails to merge edit files for any
> reason, Nagios doesn't alert on it.
> PROBLEM: If the SNN fails to merge edit files for a long time, for whatever
> reason, the failure goes undetected. This can cause the edit files to become
> very large and slow down NameNode performance. In some cases, it can lead to
> corruption of the NameNode edit files.
> BUSINESS IMPACT: If Nagios doesn't alert on SNN failures, this will
> eventually cause long downtime for customers and a possibility of data
> loss.
> STEPS TO REPRODUCE:
>   * SNN fails to merge edit files for any reason
>   * NameNode edit files grow in size
>   * Edit files become corrupted
> ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
> EXPECTED BEHAVIOR: Nagios should fire a critical alarm
> SUPPORT ANALYSIS: N/A
> Note:
> We need to get this fixed and alert our customers to add the Nagios alarm
> ASAP.
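
For illustration only, the expected behavior in the description could be sketched as a threshold check on the age of the last checkpoint. This is not the committed patch: the function name, the way the timestamp is obtained, and the 6-hour and 24-hour thresholds are all assumptions made for the example. The exit codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Hypothetical sketch of a checkpoint-age check; not the code committed for AMBARI-5681.

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

MS_PER_HOUR = 3600 * 1000.0


def checkpoint_status(last_checkpoint_ms, now_ms, warn_hours=6.0, crit_hours=24.0):
    """Return (exit_code, message) for the age of the last HDFS checkpoint.

    last_checkpoint_ms and now_ms are epoch timestamps in milliseconds.
    warn_hours and crit_hours are illustrative defaults, not values from the issue.
    """
    age_hours = (now_ms - last_checkpoint_ms) / MS_PER_HOUR
    if age_hours >= crit_hours:
        return CRITICAL, "CRITICAL: last checkpoint was %.1f hours ago" % age_hours
    if age_hours >= warn_hours:
        return WARNING, "WARNING: last checkpoint was %.1f hours ago" % age_hours
    return OK, "OK: last checkpoint was %.1f hours ago" % age_hours
```

A wrapper script would print the message and exit with the returned code so that Nagios can pick up the state; how the last checkpoint time is read from the NameNode is left out here on purpose.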



--
This message was sent by Atlassian JIRA
(v6.2#6252)