Posted to dev@zookeeper.apache.org by "Sunanda Bera (JIRA)" <ji...@apache.org> on 2009/02/13 23:20:59 UTC

[jira] Created: (ZOOKEEPER-313) Problem with successive leader failures when no client is connected

Problem with successive leader failures when no client is connected 
--------------------------------------------------------------------

                 Key: ZOOKEEPER-313
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-313
             Project: Zookeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.0.1, 3.0.0
         Environment: all
            Reporter: Sunanda Bera


Steps to reproduce:

Create a 3-node cluster. Run some transactions and then stop all clients. Make sure no other clients connect for the duration of the test.

Let L1 be the current leader. Bring down L1. Let L2 be the newly elected leader and N3 the third node. Note that this will increase the txn id for N3's snapshot without any transaction being logged. Now bring L1 back up; the same will happen for L1. Now bring down L2.

Both N3 and L1 now have snapshots with a transaction id greater than the last logged transaction. Whoever is elected leader will try to restore its state from the filesystem and fail.

One easy workaround is to change FileTxnSnapLog so that it does not save a snapshot if zxid > last logged zxid. The correct solution is possibly to log a transaction for leader election as well.
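
To make the workaround concrete, here is a minimal sketch of the guard in Java. It is a standalone model, not ZooKeeper code: the class SnapshotGuardSketch, its fields, and its methods are hypothetical names invented for illustration and are not part of FileTxnSnapLog, and the zxid values 42 and 43 are arbitrary. It only assumes the condition described in the report, namely that restore fails when the snapshot's zxid is greater than the last logged zxid.

```java
// Hypothetical model of the workaround; the names below are illustrative and
// are not ZooKeeper's FileTxnSnapLog API.
final class SnapshotGuardSketch {
    final long lastLoggedZxid;   // zxid of the last transaction written to the log
    long snapshotZxid;           // zxid recorded in the most recent snapshot
    final boolean guardEnabled;  // whether the workaround is switched on

    SnapshotGuardSketch(long lastLoggedZxid, boolean guardEnabled) {
        this.lastLoggedZxid = lastLoggedZxid;
        this.snapshotZxid = lastLoggedZxid; // start from a consistent state
        this.guardEnabled = guardEnabled;
    }

    /** Tries to save a snapshot tagged with zxid; returns true if it was written. */
    boolean saveSnapshot(long zxid) {
        if (guardEnabled && zxid > lastLoggedZxid) {
            return false; // workaround: skip snapshots that are ahead of the log
        }
        snapshotZxid = zxid;
        return true;
    }

    /** Condition from the report: restore fails when the snapshot is ahead of the log. */
    boolean canRestore() {
        return snapshotZxid <= lastLoggedZxid;
    }

    public static void main(String[] args) {
        // Without the guard, a leader change bumps the snapshot id but logs nothing.
        SnapshotGuardSketch unguarded = new SnapshotGuardSketch(42L, false);
        unguarded.saveSnapshot(43L);
        System.out.println("unguarded can restore: " + unguarded.canRestore());

        // With the guard, the snapshot that is ahead of the log is never written.
        SnapshotGuardSketch guarded = new SnapshotGuardSketch(42L, true);
        guarded.saveSnapshot(43L);
        System.out.println("guarded can restore: " + guarded.canRestore());
    }
}
```

The guard only hides the symptom: the node keeps its older, consistent snapshot, but the leader change still leaves no trace in the log.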

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-313) Problem with successive leader failures when no client is connected

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-313:
-----------------------------------

    Fix Version/s: 3.1.1


[jira] Commented: (ZOOKEEPER-313) Problem with successive leader failures when no client is connected

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673432#action_12673432 ] 

Mahadev konar commented on ZOOKEEPER-313:
-----------------------------------------

Sunanda,
 can you try out our latest release, 3.1.0? We had a bug in 3.0 and 3.0.1, ZOOKEEPER-251, which has been resolved in 3.1.

 I tried the above scenario on 3.1 but could not reproduce it. I can reproduce it in 3.0.0 and 3.0.1.
Also, as Ben mentioned, even in 3.1 we do not log the new leader transaction, which we should (the current behavior is not really incorrect, but logging it would follow our design spec).



[jira] Commented: (ZOOKEEPER-313) Problem with successive leader failures when no client is connected

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673409#action_12673409 ] 

Benjamin Reed commented on ZOOKEEPER-313:
-----------------------------------------

Excellent find! Thanks for the test case too! You are correct: the new leader transaction is not being logged but should be. (Actually, according to our design spec, it must be.)
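
As a rough illustration of what logging the new leader transaction would change, the sketch below models the transaction log as a plain list of zxids. Everything in it is hypothetical: NewLeaderLogSketch, onNewLeader, and the other names are invented for this example and do not correspond to ZooKeeper classes, and the zxid values are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not ZooKeeper code: the transaction log is a list of zxids.
final class NewLeaderLogSketch {
    private final List<Long> log = new ArrayList<>();
    private long snapshotZxid;

    /** Appends a transaction's zxid to the in-memory log. */
    void logTransaction(long zxid) {
        log.add(zxid);
    }

    /** Returns the zxid of the last logged transaction, or 0 if the log is empty. */
    long lastLoggedZxid() {
        return log.isEmpty() ? 0L : log.get(log.size() - 1);
    }

    /** Leader change: record the new-leader zxid in the log, then snapshot at it. */
    void onNewLeader(long newLeaderZxid) {
        logTransaction(newLeaderZxid);
        snapshotZxid = newLeaderZxid;
    }

    /** True when the snapshot refers to a zxid the log has never seen. */
    boolean snapshotAheadOfLog() {
        return snapshotZxid > lastLoggedZxid();
    }

    public static void main(String[] args) {
        NewLeaderLogSketch node = new NewLeaderLogSketch();
        node.logTransaction(42L); // last client transaction
        node.onNewLeader(43L);    // leader change with no clients connected
        System.out.println("snapshot ahead of log: " + node.snapshotAheadOfLog());
    }
}
```

Because the leader change itself is appended to the log, the snapshot and the log agree on the latest zxid even when no client is connected.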


[jira] Resolved: (ZOOKEEPER-313) Problem with successive leader failures when no client is connected

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar resolved ZOOKEEPER-313.
-------------------------------------

    Resolution: Duplicate
      Assignee: Mahadev konar

Sunanda,

I am marking this as a duplicate of ZOOKEEPER-251. I have opened ZOOKEEPER-335 for logging the new leader election txn.
Please feel free to reopen if that is not the case.

