You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by "Vishal Kathuria (JIRA)" <ji...@apache.org> on 2011/08/15 19:37:27 UTC

[jira] [Created] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
-------------------------------------------------------------------------------------------------------

Key: ZOOKEEPER-1154
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.3.3
Reporter: Vishal Kathuria
Priority: Critical
Fix For: 3.4.0

If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.

I was able to reproduce this problem by
1. connect debugger to B and C and suspend them, so they don't write anything
2. Issue an update to the leader A.
3. After a few seconds, crash all servers (A,B,C)
4. Start B and C, let the leader election take place
5. Start A.
6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.

Below is a more detailed analysis of what is happening in the code.

Initial Condition
1. Lets say there are three nodes in the ensemble A,B,C with A being the leader
2. The current epoch is 7.
3. For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
4. The zxid is 73
5. All the nodes have seen the change 73 and have persistently logged it.

Step 1
Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.

Step 3
B,C restart, A is still down
B,C form the quorum
B is the new leader. Lets say B minCommitLog is 71 and maxCommitLog is 73
epoch is now 8, zxid is 80
Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81

Step 4
A starts up. It applies the change in request with zxid 74 to its in-memory data tree
A contacts B to registerAsFollower and provides 74 as its ZxId
Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.

Problem:
The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088127#comment-13088127 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

// Committer: This comment is for your review. Please remove this comment before committing.
            // This should not be needed. In case of a SNAP, the snapshot should already have this value.
            // If we set it here, it could end up
            // setting the value back since some changes might get committed on the leader
            // between when the leader reads this value from data tree and when the snap
            // is serialized on the leader. In case of DIFF, the COMMIT messages should set it right.
            // In case of TRUNC, it should correctly get set as part
            // of the truncate call above.
//            zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());


This is incorrect for SNAP. The way we deserialize snapshots from a SNAP request does not set the lastProcessedZXid from the leader, the snapshot does not contain this information at all. It must be sent from the leader to be set correctly.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-1154:
-------------------------------------

    Priority: Blocker  (was: Critical)

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal Kathuria updated ZOOKEEPER-1154:
---------------------------------------

    Attachment: ZOOKEEPER-1154.patch

Submitting an updated patch. 

This patch includes the fix for ZOOKEEPER-1156 as well, since it is necessary for getting the tests to pass.

All tests are passing on this new patch.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-1154:
------------------------------------

    Fix Version/s: 3.3.4

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal Kathuria updated ZOOKEEPER-1154:
---------------------------------------

    Attachment: ZOOKEEPER-1154.patch

Thanks for reviewing the patch and your feedback Camille. I have made the changes you suggested to the test. Please let me know in case you have any further feedback.



> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088075#comment-13088075 ] 

Hadoop QA commented on ZOOKEEPER-1154:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12491016/ZOOKEEPER-1154.patch
  against trunk revision 1159432.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/469//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/469//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/469//console

This message is automatically generated.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085272#comment-13085272 ] 

Vishal Kathuria commented on ZOOKEEPER-1154:
--------------------------------------------

Thanks for taking a look at this Flavio.

I have been thinking of a solution and sharing my thoughts here.
The leader should keep track of the last committed zxid for n epochs previous to the current epoch.
If a follower tries to join and the leader has the last committed zxid of the follower's persisted epoch, the leader first asks the follower to truncate its log to the last committed zxid before starting to send a diff.

If the (current epoch - follower's persisted epoch > n), then the follower cannot get a diff and will have to get a snap instead. 

Looking forward to your comments.
thanks
Vishal

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Critical
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087857#comment-13087857 ] 

Vishal Kathuria commented on ZOOKEEPER-1154:
--------------------------------------------

Hi Flavio,

Looking forward to reading your paper. I looked online and found the slide deck but couldn't find a link to the pdf. Could you please forward the link to me?

I agree us not testing all the scenarios is a serious problem - it can let regressions slip through and them not getting caught for several years (as it happened with ZOOKEEPER-1156, which has been there in the code for 2 years).

I fully understand your concern around simplicity. When I submit the patch, please let me know what you think. I found a way to do this without maintaining additional state/variables. The idea is that if the follower has a zxid that the leader doesn't have a matching committed proposal for, the leader asks the follower to truncate to the zxid for which the leader does have a committed proposal and starts sending diffs from that.

My new test is passing with the fix. Interestingly I am seeing JVM crashes in the hammer tests, so I need to investigate that before I can submit the patch.

Thanks!


> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088124#comment-13088124 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

I think you have more problems than the failing hammer tests. Look at the logs in FollowerResyncConcurrencyTest, I see a huge number of errors like:

2011-08-19 23:38:42,392 [myid:] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11221:DataTree@909] - Failed: Create transaction for /mybar0000004091
     [exec]     [junit] org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
     [exec]     [junit] 	at org.apache.zookeeper.server.DataTree.createNode(DataTree.java:483)
     [exec]     [junit] 	at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:799)
     [exec]     [junit] 	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:197)
     [exec]     [junit] 	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
     [exec]     [junit] 	at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
     [exec]     [junit] 	at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:413)
     [exec]     [junit] 	at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:538)
     [exec]     [junit] 	at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:257)
     [exec]     [junit] 	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:72)
     [exec]     [junit] 	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:716)



> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086409#comment-13086409 ] 

Flavio Junqueira commented on ZOOKEEPER-1154:
---------------------------------------------

-0, given all the bugs we have been able to find in the protocol implementation recently, I prefer an approach that is perhaps less efficient but easier to read and interpret. Instead of keeping one or more variables and performing more checks, we should simply send a snapshot if the leader has committed anything other than NEWLEADER in the new epoch.   

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085239#comment-13085239 ] 

Vishal Kathuria commented on ZOOKEEPER-1154:
--------------------------------------------

I connected the debugger to B and C and paused all threads to make sure B and C don't get a chance to log the request before they crash. I waited a few secs to give A enough time to log the request though.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Critical
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089634#comment-13089634 ] 

Hadoop QA commented on ZOOKEEPER-1154:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12491350/ZOOKEEPER-1154.patch
  against trunk revision 1159929.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/472//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/472//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/472//console

This message is automatically generated.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086592#comment-13086592 ] 

Vishal Kathuria commented on ZOOKEEPER-1154:
--------------------------------------------

Flavio, 
Although I agree that code can be made easier to read and interpret, I don't think the bug was there because of poor code readability.

Here is how I hit upon this bug - I started with an academic exercise to compare Paxos and ZooKeeper protocol. My premise was that, if there are any differences, I should 

a. either be able to attribute it to use of TCP and choosing of server with highest zxid as the leader
b. or I should be able to find a bug. This is how I identified this bug.

So my strategy to gaining confidence that the ZK protocol has no bugs, is to review the code (and convince a couple of other folks as well to do the same) and make sure it is equivalent to Paxos modulo above mentioned optimizations :)

We could make the protocol simpler but it would certainly be less efficient. Snapshots, when you have several GB of data are quite expensive. If we don't even fix it for the case of n=1, it would increase the probability of the snapshot transfers.

I have the fix for the general case (n>1) coded and the test written (as Ben said, the test took longer than the fix itself :))

While testing the fix, it looks like I hit another bug in truncate code path. The passed in zxid should be in the truncated log but it isn't. I should be able to send a patch out soon after I have figured that one out.

Thanks!




> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096467#comment-13096467 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

If anyone wants to look this over before I check it in, the src tree changes are untouched but I had to modify QuorumPeerTestBase to get the test to run properly. Took some but not all of the changes to this class from trunk.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Camille Fournier updated ZOOKEEPER-1154:
----------------------------------------

    Attachment: ZOOKEEPER-1154.patch

a patch on 3.3.3

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt reassigned ZOOKEEPER-1154:
---------------------------------------

    Assignee: Vishal Kathuria

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086603#comment-13086603 ] 

Flavio Junqueira commented on ZOOKEEPER-1154:
---------------------------------------------

Hi Vishal, We've had discussions about Zab not being equivalent to Paxos in the mailing list of ZooKeeper, and we have a paper at DSN 2011 discussing how they differ.

About the code, I didn't mean to imply that the bug was there because of poor readability. Frankly, I have no idea why that bug is there. I thought we were truncating in a certain way, and you've shown that we were not doing what I thought we were doing. Thanks for the observation!

Now one real problem I see is that we were not testing if we were truncating in the scenario is described, which we should, and so I think that tests are indeed an important contribution here. And, I still don't feel comfortable fixing in the way you're proposing, for simplicity.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal Kathuria updated ZOOKEEPER-1154:
---------------------------------------

    Attachment: ZOOKEEPER-1154.patch

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096632#comment-13096632 ] 

Hudson commented on ZOOKEEPER-1154:
-----------------------------------

Integrated in ZooKeeper-trunk #1293 (See [https://builds.apache.org/job/ZooKeeper-trunk/1293/])
    ZOOKEEPER-1154, ZOOKEEPER-1156: 
Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
Log truncation truncating log too much - can cause data loss

Vishal Kathuria via camille

camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1164758
Files : 
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java


> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096469#comment-13096469 ] 

Hadoop QA commented on ZOOKEEPER-1154:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12492846/ZOOKEEPER-1154.patch
  against trunk revision 1164758.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/493//console

This message is automatically generated.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086060#comment-13086060 ] 

Mahadev konar commented on ZOOKEEPER-1154:
------------------------------------------

Good catch Vishal. I agree with Ben. n = 1 should be good enough. It'd be the most common case. 

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085886#comment-13085886 ] 

Vishal Kathuria commented on ZOOKEEPER-1154:
--------------------------------------------

Thanks Ben,

Yes, n=1 should be easy to implement but I was worried it would lead to more frequent sending of snaps instead of diffs. This could be a problem for ZK installations with large data (some of which are several GB). 

Looking at the code to see the best way to implement this for the general case - I should have an update soon.

thanks!

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085568#comment-13085568 ] 

Benjamin Reed commented on ZOOKEEPER-1154:
------------------------------------------

yes, you are correct vishal. i think n=1 is easy to implement and should catch the common cases. the fix should be pretty easy: we just hook into the committedLog processing where the minCommittedLog and maxCommittedLog are maintained (it's just one place in ZKDatabase.java) to track the last zxid for each epoch in the committedLog (or just the prevEpoch if n=1). then we use that in the FollowerHandler to decide when to truncate.

the harder part might be writing the test cases.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096456#comment-13096456 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

Committed to trunk r1164758

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096158#comment-13096158 ] 

Hadoop QA commented on ZOOKEEPER-1154:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12492776/ZOOKEEPER-1154.patch
  against trunk revision 1164632.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/488//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/488//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/488//console

This message is automatically generated.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088125#comment-13088125 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

I suspect btw that the random crashes are caused by these lines you added into Learner:
 try {
          sock.close();
        } catch (Exception e) {
          // Ignore any exceptions
        }
        sock = null;

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094524#comment-13094524 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

Doing a code review, one general comment: Please don't use assertTrue with equality checks and without a message. If it is an equality check you want to do, use assertEquals, and leave a message telling me what the assert is checking. I would appreciate it if you would upload the patch with this fixed in the tests. Other than that it looks good, I'll apply with that minor fix. Also if we want this to go into the 3.3.3 branch please upload a patch against that branch..

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096459#comment-13096459 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

I want to put this in 3.3.4, and I need to modify the tests to work with the old test stuff. Will close when I fix that and get it checked in.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085216#comment-13085216 ] 

Flavio Junqueira commented on ZOOKEEPER-1154:
---------------------------------------------

Vishal, I haven't gone through the code to confirm it, but there is one point I don't understand. In your recipe to reproduce the problem, waiting a few seconds should be more than sufficient to make sure that all servers have received the update, no? 

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Critical
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1154) Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096342#comment-13096342 ] 

Camille Fournier commented on ZOOKEEPER-1154:
---------------------------------------------

Looks great Vishal, thanks! Will commit this soon.

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Assignee: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0
>
>         Attachments: ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch, ZOOKEEPER-1154.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (lets call it A) isn't present during leader election, a participant with a lower zxid (say B) might be chosen as a leader. When A comes up, it will replay the log with that higher zxid. The change that was in that higher zxid will only be visible to the clients connecting to the participant A, but not to other participants.
> I was able to reproduce this problem by
> 1. connect debugger to B and C and suspend them, so they don't write anything
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C)
> 4. Start B and C, let the leader election take place
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on B,C, hence the inconsistency.
> Below is a more detailed analysis of what is happening in the code.
> Initial Condition
> 1.	Lets say there are three nodes in the ensemble A,B,C with A being the leader
> 2.	The current epoch is 7. 
> 3.	For simplicity of the example, lets say zxid is a two digit number, with epoch being the first digit.
> 4.	The zxid is 73
> 5.	All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there is a crash of the entire ensemble and B,C never write the change 74 to their log.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Lets say  B minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. B will send to A the proposal 81.
> Problem:
> The problem with the above sequence is that A's data tree has the update from request 74, which is not correct. Before getting the proposals 81, A should have received a trunc to 73. I don't see that in the code. If the maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira