Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2014/05/13 19:54:15 UTC
[jira] [Commented] (AMBARI-5681) Add Nagios alert if HDFS last
checkpoint time exceeds threshold
[ https://issues.apache.org/jira/browse/AMBARI-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996694#comment-13996694 ]
Andrew Onischuk commented on AMBARI-5681:
-----------------------------------------
Committed to branch-1.6.0
> Add Nagios alert if HDFS last checkpoint time exceeds threshold
> ---------------------------------------------------------------
>
> Key: AMBARI-5681
> URL: https://issues.apache.org/jira/browse/AMBARI-5681
> Project: Ambari
> Issue Type: Bug
> Reporter: Andrew Onischuk
> Assignee: Andrew Onischuk
> Fix For: 1.6.0
>
>
> Description: If the Secondary NameNode (SNN) fails to merge edit files for any
> reason, Nagios doesn't alert on it.
> PROBLEM: If the SNN fails to merge edit files for a long time, for any reason,
> the failure goes undetected. The edit files can then grow very large and slow
> down NameNode performance. In some cases, this can lead to corruption of the
> NameNode edit files.
> BUSINESS IMPACT: If Nagios doesn't alert on SNN failures, this will
> eventually cause long downtime for all customers and a possibility of data
> loss.
> STEPS TO REPRODUCE:
> * SNN fails to merge edit files for any reason
> * NameNode edit files grow in size
> * Edit files become corrupted
> ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
> EXPECTED BEHAVIOR: Nagios should fire a critical alarm
> SUPPORT ANALYSIS: N/A
> Note:
> We need to get this fixed and alert our customers to add the Nagios alarm
> ASAP.
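
For illustration only, the kind of check the issue asks for could be sketched as below. This is not the code committed to branch-1.6.0. The function name, the default thresholds (6 and 12 hours), and the assumption that the caller already has the last checkpoint time as epoch milliseconds are all hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_checkpoint_age(last_checkpoint_ms, now_ms,
                         warn_s=6 * 3600, crit_s=12 * 3600):
    """Classify the age of the last HDFS checkpoint.

    last_checkpoint_ms: time of the last checkpoint, epoch milliseconds
        (assumed to be fetched by the caller; how is out of scope here).
    now_ms: current time, epoch milliseconds.
    warn_s, crit_s: hypothetical thresholds in seconds.
    Returns (exit_code, message) in Nagios plugin style.
    """
    if last_checkpoint_ms is None or last_checkpoint_ms <= 0:
        return UNKNOWN, "UNKNOWN: last checkpoint time unavailable"

    age_s = (now_ms - last_checkpoint_ms) / 1000.0
    if age_s < 0:
        # A checkpoint "in the future" usually means clock skew.
        return UNKNOWN, "UNKNOWN: last checkpoint time is in the future"

    if age_s >= crit_s:
        return CRITICAL, "CRITICAL: last checkpoint was %.0fs ago" % age_s
    if age_s >= warn_s:
        return WARNING, "WARNING: last checkpoint was %.0fs ago" % age_s
    return OK, "OK: last checkpoint was %.0fs ago" % age_s
```

A wrapper script would print the message and exit with the returned code, so that Nagios fires a critical alarm once the SNN has gone too long without merging edit files.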
--
This message was sent by Atlassian JIRA
(v6.2#6252)