
Posted to common-issues@hadoop.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2015/01/13 01:03:36 UTC

[jira] [Updated] (HADOOP-10668) TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails

     [ https://issues.apache.org/jira/browse/HADOOP-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HADOOP-10668:
-----------------------------
    Attachment: HADOOP-10668.patch

The problem appears to be in how the test checks whether a node has reached the expected state. {{ZKFailoverController}} keeps its own {{serviceState}}, and an HA service such as {{DummyHAService}} keeps its own state separately. {{MiniZKFCCluster}}'s {{waitForHAState}} uses the {{DummyHAService}} state to decide that a transition has completed. When fencing is involved, however, the to-be-elected active calls the old active's {{transitionToStandby}} method directly, so {{DummyHAService}}'s state can be set to standby before {{ZKFailoverController}}'s state is updated.

The patch does not change the fact that {{ZKFailoverController}}'s state is updated only when it receives a notification from the ZK callback. So even with the fix, the following error may still appear in the log. That is OK: {{ZKFailoverController}}'s state will eventually change to standby.
 
{noformat}
2015-01-12 15:08:16,497 ERROR ha.ZKFailoverController (ZKFailoverController.java:verifyChangedServiceState(828)) - Local service DummyHAService #1 has changed the serviceState to standby. Expected was active. Quitting election marking fencing necessary.
{noformat}
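
To illustrate the race described above, here is a minimal, self-contained sketch. It is not the attached patch and does not use the Hadoop classes: {{HaStateWaitSketch}}, {{State}} and {{waitForBoth}} are hypothetical names, and the two {{AtomicReference}} fields merely stand in for the service's state and the controller's view of it. It shows why a check against the service state alone can pass too early, and one possible way a test helper could wait until both views agree.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class HaStateWaitSketch {
    // Hypothetical stand-in for the HA service states discussed above.
    enum State { ACTIVE, STANDBY }

    /**
     * Polls until both the service and the controller report the expected
     * state, or until the timeout elapses. Returns false on timeout or if
     * the waiting thread is interrupted.
     */
    static boolean waitForBoth(Supplier<State> serviceState,
                               Supplier<State> controllerState,
                               State expected,
                               long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (serviceState.get() == expected && controllerState.get() == expected) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<State> service = new AtomicReference<>(State.ACTIVE);
        AtomicReference<State> controller = new AtomicReference<>(State.ACTIVE);

        // Fencing: the peer flips the old active's service state directly.
        service.set(State.STANDBY);

        // A check against the service alone already passes at this point,
        // although the controller has not caught up yet.
        System.out.println("service only: " + (service.get() == State.STANDBY));
        System.out.println("both (short wait): "
                + waitForBoth(service::get, controller::get, State.STANDBY, 50));

        // Simulated callback: the controller's state is updated later.
        Thread callback = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            controller.set(State.STANDBY);
        });
        callback.start();

        System.out.println("both (after callback): "
                + waitForBoth(service::get, controller::get, State.STANDBY, 2000));
        callback.join();
    }
}
```

Polling both views keeps the sketch independent of how the controller learns about the change. In the real code that update arrives through the ZK callback, so the lag between the two states is expected rather than a bug in itself.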

> TestZKFailoverControllerStress#testExpireBackAndForth occasionally fails
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-10668
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10668
>             Project: Hadoop Common
>          Issue Type: Test
>          Components: test
>    Affects Versions: 3.0.0
>            Reporter: Ted Yu
>              Labels: test
>         Attachments: HADOOP-10668.patch
>
>
> From https://builds.apache.org/job/PreCommit-HADOOP-Build/4018//testReport/org.apache.hadoop.ha/TestZKFailoverControllerStress/testExpireBackAndForth/ :
> {code}
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 	at org.apache.zookeeper.server.DataTree.getData(DataTree.java:648)
> 	at org.apache.zookeeper.server.ZKDatabase.getData(ZKDatabase.java:371)
> 	at org.apache.hadoop.ha.MiniZKFCCluster.expireActiveLockHolder(MiniZKFCCluster.java:199)
> 	at org.apache.hadoop.ha.MiniZKFCCluster.expireAndVerifyFailover(MiniZKFCCluster.java:234)
> 	at org.apache.hadoop.ha.TestZKFailoverControllerStress.testExpireBackAndForth(TestZKFailoverControllerStress.java:84)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)