Posted to common-issues@hadoop.apache.org by "Tsuyoshi Ozawa (JIRA)" <ji...@apache.org> on 2015/05/04 10:44:06 UTC
[jira] [Updated] (HADOOP-11328) ZKFailoverController does not log
Exception and causes latent problems during failover
[ https://issues.apache.org/jira/browse/HADOOP-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsuyoshi Ozawa updated HADOOP-11328:
------------------------------------
Summary: ZKFailoverController does not log Exception and causes latent problems during failover (was: ZKFailoverController.java does not log Exception and causes latent problems during failover)
> ZKFailoverController does not log Exception and causes latent problems during failover
> --------------------------------------------------------------------------------------
>
> Key: HADOOP-11328
> URL: https://issues.apache.org/jira/browse/HADOOP-11328
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.5.1
> Reporter: Tianyin Xu
> Attachments: ZKFailoverController.log.exception.1.patch
>
>
> In _ZKFailoverController.java_, the _Exception_ caught by the _run()_ method is never logged. This causes latent problems that manifest only during failover.
> h5. The problem we encountered
> An _Exception_ is thrown from the _doRun()_ method during _initHM()_ (caused by a configuration error). To reproduce, set
> "_ha.health-monitor.connect-retry-interval.ms_" to any nonsensical value.
> {code:title=ZKFailoverController.java|borderStyle=solid}
> private int doRun(String[] args)
>   ...
>   initRPC();
>   initHM();
>   startRPC();
>   ...
> }
> {code}
> The Exception is caught in the _run()_ method, as follows:
> {code:title=ZKFailoverController.java|borderStyle=solid}
> public int run(final String[] args) throws Exception {
>   ...
>   try {
>     ...
>       @Override
>       public Integer run() {
>         try {
>           return doRun(args);
>         } catch (Exception t) {
>           throw new RuntimeException(t);
>         } finally {
>           if (elector != null) {
>             elector.terminateConnection();
>           }
>         }
>       }
>     });
>   } catch (RuntimeException rte) {
>     throw (Exception)rte.getCause();
>   }
> }
> {code}
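
The wrap-then-unwrap shape quoted above can be tried in isolation with the minimal sketch below. It is not the Hadoop code: the class name WrapUnwrapSketch, the stand-in doRun() body, and the use of java.util.function.Supplier in place of the elided callback are all assumptions made for illustration. It only shows that run() itself reports nothing, so the failure becomes visible only if the caller chooses to report it.

```java
import java.util.function.Supplier;

public class WrapUnwrapSketch {
    // Hypothetical stand-in for doRun(): fails the way a bad setting might.
    static int doRun(String[] args) throws Exception {
        throw new IllegalArgumentException("nonsensical retry interval");
    }

    // Same shape as the quoted run(): wrap the exception to get it out of
    // the callback, then unwrap it again one level up.
    static int run(final String[] args) throws Exception {
        try {
            Supplier<Integer> action = new Supplier<Integer>() {
                @Override
                public Integer get() {
                    try {
                        return doRun(args);
                    } catch (Exception t) {
                        throw new RuntimeException(t);
                    }
                }
            };
            return action.get();
        } catch (RuntimeException rte) {
            // Nothing is logged here: the cause is rethrown silently.
            throw (Exception) rte.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            run(args);
        } catch (Exception e) {
            // Only a caller that reports the exception makes it visible.
            System.out.println("caller saw: " + e);
        }
    }
}
```
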
> Unfortunately, the Exception (causing the shutdown of the process) is *not logged at all*. This causes a latent error that manifests only during failover (because ZKFC is dead). The tricky thing here is that everything looks perfectly fine: the _jps_ command shows a running DFSZKFailoverController process and the two NameNodes (active and standby) work fine.
> h5. Patch
> We strongly suggest adding an error log for the caught exception, such as:
> --- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
> +++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)
> {code:title=@@ -178,6 +178,7 @@|borderStyle=solid}
>          }
>        });
>      } catch (RuntimeException rte) {
> +      LOG.fatal("The failover controller encounters runtime error: " + rte);
>        throw (Exception)rte.getCause();
>      }
>    }
> {code}
> Thanks!
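
As a rough illustration of the idea behind the patch above (log at the catch site, then rethrow), here is a self-contained sketch. It is not the attached patch: java.util.logging stands in for the project's LOG object, the class name LogBeforeRethrowSketch and the doRun() body are hypothetical, and two details are additions of this sketch rather than part of the proposal: passing the cause to the logger so the stack trace is recorded, and guarding the cast in case the cause is not an Exception.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogBeforeRethrowSketch {
    // Assumption: java.util.logging stands in for the project's own logger.
    static final Logger LOG = Logger.getLogger("LogBeforeRethrowSketch");

    // Hypothetical stand-in for doRun().
    static int doRun(String[] args) throws Exception {
        throw new IllegalArgumentException("nonsensical retry interval");
    }

    static int run(final String[] args) throws Exception {
        try {
            Supplier<Integer> action = () -> {
                try {
                    return doRun(args);
                } catch (Exception t) {
                    throw new RuntimeException(t);
                }
            };
            return action.get();
        } catch (RuntimeException rte) {
            Throwable cause = rte.getCause();
            // Record the failure, with its stack trace, before rethrowing.
            LOG.log(Level.SEVERE,
                "The failover controller encountered a runtime error", cause);
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            // Addition in this sketch: no usable cause, so rethrow as-is.
            throw rte;
        }
    }

    public static void main(String[] args) {
        try {
            run(args);
        } catch (Exception e) {
            System.exit(1); // the failure has already been logged in run()
        }
    }
}
```

With this shape, the log contains the reason for the shutdown even when nothing upstream reports the rethrown exception.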
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)