Posted to dev@ambari.apache.org by Andrew Onischuk <ao...@hortonworks.com> on 2014/05/06 16:44:18 UTC

Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/
-----------------------------------------------------------

Review request for Ambari and Myroslav Papirkovskyy.


Bugs: AMBARI-5681
    https://issues.apache.org/jira/browse/AMBARI-5681


Repository: ambari


Description
-------

Description: If the Secondary NameNode (SNN) fails to merge edit files for any
reason, Nagios doesn't alert on it.

PROBLEM: If the SNN fails to merge edit files for a long time, for any reason,
the failure goes undetected. This can cause the edit files to become very large
and slow down NameNode performance. In some cases, it can also lead to
corruption of the NameNode edit files.  
BUSINESS IMPACT: If Nagios doesn't alert on loss of SNN functionality, this
will eventually cause long downtime for all customers and a possibility of data
loss.

STEPS TO REPRODUCE:

  * SNN fails to merge edit files for any reason
  * NameNode edit files grow in size
  * NameNode edit files become corrupted

ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm  
EXPECTED BEHAVIOR: Nagios should fire a critical alarm

SUPPORT ANALYSIS: N/A

Note:

We need to get this fixed and alert our customers to add the Nagios alarm
ASAP.
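
For illustration only, the threshold logic such an alert needs could look
roughly like the sketch below. It is not the contents of
check_checkpoint_time.py; the function name, parameter names, and the choice of
separate warning and critical thresholds are assumptions made for this example.
How the last checkpoint timestamp is obtained from the NameNode is deliberately
left out. The only fixed convention used is the standard Nagios plugin exit
codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_checkpoint_age(last_checkpoint_ms, now_ms, warn_secs, crit_secs):
    """Return (exit_code, message) for a checkpoint-age check.

    Hypothetical helper: all names and threshold semantics here are
    illustrative assumptions, not taken from the patch under review.
    Timestamps are milliseconds since the epoch; thresholds are seconds.
    """
    # A missing or non-positive timestamp means the age cannot be judged.
    if last_checkpoint_ms is None or last_checkpoint_ms <= 0:
        return UNKNOWN, "UNKNOWN: last checkpoint time is not available"

    age_secs = (now_ms - last_checkpoint_ms) // 1000
    if age_secs < 0:
        # Clock skew or bad data: do not report OK on a nonsensical age.
        return UNKNOWN, "UNKNOWN: last checkpoint time is in the future"
    if age_secs >= crit_secs:
        return CRITICAL, "CRITICAL: last checkpoint was %d s ago" % age_secs
    if age_secs >= warn_secs:
        return WARNING, "WARNING: last checkpoint was %d s ago" % age_secs
    return OK, "OK: last checkpoint was %d s ago" % age_secs
```

A wrapper script would print the message and exit with the returned code so
that Nagios can pick up the state. Keeping the comparison in a pure function
makes it easy to unit test without a running NameNode.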


Diffs
-----

  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a 
  ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443 

Diff: https://reviews.apache.org/r/21113/diff/


Testing
-------


Thanks,

Andrew Onischuk


Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold

Posted by Myroslav Papirkovskyy <mp...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42835
-----------------------------------------------------------

Ship it!


- Myroslav Papirkovskyy




Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold

Posted by Andrew Onischuk <ao...@hortonworks.com>.

(Updated May 13, 2014, 2:08 p.m.)



Diffs (updated)
-----

  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a 
  ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443 

Diff: https://reviews.apache.org/r/21113/diff/


Testing
-------


Thanks,

Andrew Onischuk


Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold

Posted by Andrew Onischuk <ao...@hortonworks.com>.

(Updated May 13, 2014, 2:06 p.m.)



Diffs (updated)
-----

  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9 
  ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a 
  ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443 

Diff: https://reviews.apache.org/r/21113/diff/


Testing
-------


Thanks,

Andrew Onischuk


Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold

Posted by Andrew Onischuk <ao...@hortonworks.com>.

> On May 6, 2014, 2:53 p.m., Michael Harp wrote:
> > What's the expected behavior when NameNode HA is enabled?

The behaviour is the same, since checkpoints are also performed in HA mode. It was tested with HA enabled and works fine.


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42289
-----------------------------------------------------------




Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds threshold

Posted by Michael Harp <mi...@teradata.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42289
-----------------------------------------------------------


What's the expected behavior when NameNode HA is enabled?

- Michael Harp

