Posted to common-issues@hadoop.apache.org by "Todd Lipcon (Updated) (JIRA)" <ji...@apache.org> on 2012/03/30 07:44:42 UTC

[jira] [Updated] (HADOOP-8217) Edge case split-brain race in ZK-based auto-failover

     [ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-8217:
--------------------------------

    Attachment: hadoop-8217-testcase.txt

Here's a test case which produces the issue as described. This builds on top of the test infrastructure introduced in HADOOP-8228. I also introduced a simple fault injector class to make it possible to deterministically introduce this issue (this is similar to the fault injection technique we use in HDFS for checkpointing).

This patch also copies GenericTestUtils.DelayAnswer from HDFS into Common. A follow-up patch on the HDFS side can later remove the copy in that project.
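
As a rough illustration of the fault-injection idea, the sketch below uses a latch-based hook to freeze one thread at a chosen point until the test releases it. All names here (DelayHook, injectionPoint, waitForCall, proceed) are hypothetical; they are not taken from the attached patch or from GenericTestUtils.DelayAnswer, and the real classes may work differently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical hook: blocks the caller at an injection point until released. */
class DelayHook {
  private final CountDownLatch reached = new CountDownLatch(1);
  private final CountDownLatch release = new CountDownLatch(1);

  /** Called by the code under test at the point where the fault is injected. */
  void injectionPoint() throws InterruptedException {
    reached.countDown();  // tell the test that we have arrived
    release.await();      // freeze here until the test lets us continue
  }

  /** Called by the test: wait until the code under test is frozen. */
  boolean waitForCall(long timeoutMs) throws InterruptedException {
    return reached.await(timeoutMs, TimeUnit.MILLISECONDS);
  }

  /** Called by the test: let the frozen thread continue. */
  void proceed() {
    release.countDown();
  }
}

class DelayHookDemo {
  /** Runs one frozen worker and returns the order in which events happened. */
  static String run() {
    DelayHook hook = new DelayHook();
    StringBuffer log = new StringBuffer();
    Thread worker = new Thread(() -> {
      try {
        hook.injectionPoint();          // the simulated "machine freeze"
        log.append("worker-resumed;");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    try {
      worker.start();
      if (!hook.waitForCall(5000)) {
        throw new IllegalStateException("worker never reached the hook");
      }
      log.append("test-acted;");        // the test acts while the worker is frozen
      hook.proceed();
      worker.join();
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return log.toString();
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Because the worker cannot pass the hook until proceed() is called, the test's action always lands inside the window where the worker is frozen. This is what makes an otherwise timing-dependent race reproducible.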
                
> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8217-testcase.txt
>
>
> As discussed in HADOOP-8206, the current design for automatic failover has the following race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() to NN1 when its machine freezes (e.g. a GC pause plus swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from the pause and calls transitionToActive() on NN1, so both NN1 and NN2 are now active (split-brain)
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous.
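
The interleaving described above can be replayed in a small single-threaded toy model. This is only a sketch: the Node class, the state enum and the replay() method are hypothetical stand-ins, not Hadoop or ZooKeeper code. It assumes that ZKFC1 acts on a decision it made before the freeze and does not recheck the lock afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class SplitBrainRaceSketch {
  enum State { STANDBY, ACTIVE }

  /** Toy stand-in for a NameNode: it only tracks its HA state. */
  static class Node {
    final String name;
    State state = State.STANDBY;
    Node(String name) { this.name = name; }
  }

  /** Replays the steps in order and reports the lock holder and the active nodes. */
  static String replay() {
    Node nn1 = new Node("NN1");
    Node nn2 = new Node("NN2");

    // Step 1: ZKFC1 gets the active lock and decides to activate NN1.
    String lockHolder = "ZKFC1";
    boolean zkfc1DecidedToActivate = true;

    // Step 2: ZKFC1 freezes here, before transitionToActive() is sent.

    // Step 3: ZKFC1's session expires, so ZKFC2 gets the lock.
    lockHolder = "ZKFC2";

    // Step 4: ZKFC2 moves NN1 to standby and NN2 to active.
    nn1.state = State.STANDBY;
    nn2.state = State.ACTIVE;

    // Step 5: ZKFC1 wakes up and acts on its stale decision
    // without checking whether it still holds the lock.
    if (zkfc1DecidedToActivate) {
      nn1.state = State.ACTIVE;
    }

    List<String> active = new ArrayList<>();
    for (Node n : new Node[] { nn1, nn2 }) {
      if (n.state == State.ACTIVE) {
        active.add(n.name);
      }
    }
    return "lock=" + lockHolder + " active=" + active;
  }

  public static void main(String[] args) {
    System.out.println(replay());  // prints "lock=ZKFC2 active=[NN1, NN2]"
  }
}
```

The model ends with ZKFC2 holding the lock while both nodes report ACTIVE, which is the split-brain outcome the description warns about.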
