Posted to dev@ambari.apache.org by Andrew Onischuk <ao...@hortonworks.com> on 2014/05/06 16:44:18 UTC
Review Request 21113: Add Nagios alert if HDFS last checkpoint time exceeds
threshold
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/
-----------------------------------------------------------
Review request for Ambari and Myroslav Papirkovskyy.
Bugs: AMBARI-5681
https://issues.apache.org/jira/browse/AMBARI-5681
Repository: ambari
Description
-------
Description: If the secondary NameNode (SNN) fails to merge edit files for any
reason, Nagios doesn't alert on it.
PROBLEM: If the SNN fails to merge edit files for a long time, for whatever
reason, the failure goes undetected. This can cause the edit files to become
very large and slow down NameNode performance. In some cases, it can lead to
corruption of the NameNode edit files.
BUSINESS IMPACT: If Nagios doesn't alert on SNN functionality, this will
eventually cause long downtime for all customers and a possibility of data
loss.
STEPS TO REPRODUCE:
* SNN fails to merge edit files for any reason
* NameNode edit files grow in size
* Edit files become corrupted
ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
EXPECTED BEHAVIOR: Nagios should fire a critical alarm
SUPPORT ANALYSIS: N/A
Note:
We need to get this fixed and alert our customers to add the Nagios alarm
ASAP.
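For illustration only, the threshold logic behind such an alert could be
sketched as below. This is not the contents of check_checkpoint_time.py; the
function name, its parameters, and the threshold values are hypothetical, and
the sketch assumes the last checkpoint time is already available as an epoch
timestamp in milliseconds. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL,
3 UNKNOWN) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_checkpoint_age(last_checkpoint_ms, now_ms, warn_seconds, crit_seconds):
    """Map the age of the last HDFS checkpoint to a Nagios status.

    Hypothetical helper: last_checkpoint_ms and now_ms are epoch timestamps
    in milliseconds; warn_seconds and crit_seconds are assumed thresholds.
    Returns a (exit_code, message) tuple.
    """
    # Without a usable timestamp the check cannot decide anything.
    if last_checkpoint_ms is None or last_checkpoint_ms <= 0:
        return UNKNOWN, "UNKNOWN: last checkpoint time is not available"

    age_seconds = (now_ms - last_checkpoint_ms) // 1000
    if age_seconds >= crit_seconds:
        label, code = "CRITICAL", CRITICAL
    elif age_seconds >= warn_seconds:
        label, code = "WARNING", WARNING
    else:
        label, code = "OK", OK
    return code, "%s: last checkpoint was %d seconds ago" % (label, age_seconds)
```

A plugin built on this would print the message and exit with the returned
code, so that a checkpoint older than the critical threshold raises the
critical alarm described under EXPECTED BEHAVIOR.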
Diffs
-----
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a
ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443
Diff: https://reviews.apache.org/r/21113/diff/
Testing
-------
Thanks,
Andrew Onischuk
Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time
exceeds threshold
Posted by Myroslav Papirkovskyy <mp...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42835
-----------------------------------------------------------
Ship it!
- Myroslav Papirkovskyy
Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time
exceeds threshold
Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/
-----------------------------------------------------------
(Updated May 13, 2014, 2:08 p.m.)
Review request for Ambari and Myroslav Papirkovskyy.
Bugs: AMBARI-5681
https://issues.apache.org/jira/browse/AMBARI-5681
Repository: ambari
Description
-------
Description: If the secondary NameNode (SNN) fails to merge edit files for any
reason, Nagios doesn't alert on it.
PROBLEM: If the SNN fails to merge edit files for a long time, for whatever
reason, the failure goes undetected. This can cause the edit files to become
very large and slow down NameNode performance. In some cases, it can lead to
corruption of the NameNode edit files.
BUSINESS IMPACT: If Nagios doesn't alert on SNN functionality, this will
eventually cause long downtime for all customers and a possibility of data
loss.
STEPS TO REPRODUCE:
* SNN fails to merge edit files for any reason
* NameNode edit files grow in size
* Edit files become corrupted
ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
EXPECTED BEHAVIOR: Nagios should fire a critical alarm
SUPPORT ANALYSIS: N/A
Note:
We need to get this fixed and alert our customers to add the Nagios alarm
ASAP.
Diffs (updated)
-----
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a
ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443
Diff: https://reviews.apache.org/r/21113/diff/
Testing
-------
Thanks,
Andrew Onischuk
Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time
exceeds threshold
Posted by Andrew Onischuk <ao...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/
-----------------------------------------------------------
(Updated May 13, 2014, 2:06 p.m.)
Review request for Ambari and Myroslav Papirkovskyy.
Bugs: AMBARI-5681
https://issues.apache.org/jira/browse/AMBARI-5681
Repository: ambari
Description
-------
Description: If the secondary NameNode (SNN) fails to merge edit files for any
reason, Nagios doesn't alert on it.
PROBLEM: If the SNN fails to merge edit files for a long time, for whatever
reason, the failure goes undetected. This can cause the edit files to become
very large and slow down NameNode performance. In some cases, it can lead to
corruption of the NameNode edit files.
BUSINESS IMPACT: If Nagios doesn't alert on SNN functionality, this will
eventually cause long downtime for all customers and a possibility of data
loss.
STEPS TO REPRODUCE:
* SNN fails to merge edit files for any reason
* NameNode edit files grow in size
* Edit files become corrupted
ACTUAL BEHAVIOR: Nagios doesn't fire a critical alarm
EXPECTED BEHAVIOR: Nagios should fire a critical alarm
SUPPORT ANALYSIS: N/A
Note:
We need to get this fixed and alert our customers to add the Nagios alarm
ASAP.
Diffs (updated)
-----
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/files/check_checkpoint_time.py PRE-CREATION
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/nagios_server_config.py 4089b2e
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/scripts/params.py 2e41c23
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-commands.cfg.j2 ff03bf9
ambari-server/src/main/resources/stacks/HDP/2.0.6/services/NAGIOS/package/templates/hadoop-services.cfg.j2 e7fda1a
ambari-server/src/test/python/stacks/2.0.6/NAGIOS/test_nagios_server.py 145b443
Diff: https://reviews.apache.org/r/21113/diff/
Testing
-------
Thanks,
Andrew Onischuk
Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time
exceeds threshold
Posted by Andrew Onischuk <ao...@hortonworks.com>.
> On May 6, 2014, 2:53 p.m., Michael Harp wrote:
> > What's the expected behavior when NameNode HA is enabled?
The behavior is the same, since checkpoints are also performed in HA mode. It was tested with HA enabled and works fine.
- Andrew
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42289
-----------------------------------------------------------
Re: Review Request 21113: Add Nagios alert if HDFS last checkpoint time
exceeds threshold
Posted by Michael Harp <mi...@teradata.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21113/#review42289
-----------------------------------------------------------
What's the expected behavior when NameNode HA is enabled?
- Michael Harp