You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by powell molleti <po...@yahoo.com.INVALID> on 2016/01/01 05:19:59 UTC

FLE update proposal question

Hi,
I  want to better understand the use of code here: http://bit.ly/1kxjk5GWhy should FLE reset the Vote to what is on the disk/initial values when totalOrderPredicate() fails in the case of received ElectionEpoch being greater than current vote's ElectionEpoch. 
Going back to initial values(and clearing the recv set) does not seem to make it incorrect but seems to slow down FLE if I am not mistaken. For example if B has the best totalOrderPredicate() and A learns from it and if C has higher election epoch but older values then A is forced to reset what it learned from B till C and B catch up to each other?. Rather than let A and B wait for C to upgrade its values after A and B borrow its ElectionEpoch?. 
Any help is appreciated.
thanksPowell.

Re: FLE update proposal question

Posted by powell molleti <po...@yahoo.com.INVALID>.

Hi Flavio,
Here is my attempt:
Lets assume a large cluster and look at three nodes.
ep = electionEpoch, p = peerEpoch, z = zxidstep 1:A [ ep:3, p:1, z: 1 ] [ LOOKING ] { failed to follow multiple times, hence resets Vote to stored values and bumps up epoch }B [ ep:1, p:2, z: 9 ] [ FOLLOWING ]C [ ep:1, p:2, z: 9 ] [ LEADING ]
step 2:
B goes LOOKING state but cannot reach A or C
A [ ep:3, p:1, z: 1 ] [ LOOKING ] B [ ep:2, p:2, z: 99 ] [ LOOKING ] { starts vote with last committed transactions }C [ ep:1, p:2, z: 99 ] [ LEADING ]
step 3:
C goes into LOOKING
A [ ep:3, p:1, z: 1 ] [ LOOKING ] B [ ep:2, p:2, z: 99 ] [ LOOKING ]C [ ep:2, p:2, z: 999 ] [ LOOKING ]
step 4:
Only B and C can reach each other so they converge.

A [ ep:3, p:1, z: 1 ] [ LOOKING ] B [ ep:2, p:2, z: 999 ] [ LOOKING ]C [ ep:2, p:2, z: 999 ] [ LOOKING ]
step 5:
B hears from A now and A still cannot see C
bit.ly/1kxjk5G: A.ep > B.logicalClock and totalOrderPredicate(A, B) is false hence B resets to values stored on disk i.e lost values it learned from C but copies the logicalClock

A [ ep:3, p:1, z: 1 ] [ LOOKING ] B [ ep:3, p:2, z: 99 ] [ LOOKING ] { moved back zxid }C [ ep:2, p:2, z: 999 ] [ LOOKING ]
Now A and B converge
step 6:A [ ep:3, p:2, z: 99 ] [ LOOKING ] B [ ep:3, p:2, z: 99 ] [ LOOKING ]C [ ep:2, p:2, z: 999 ] [ LOOKING ]
My initial question is - what is goal of the resetting the proposal to on disk values when Rx proposal electionEpoch is greater than current logical clock and totalOrderPredicate(Rx, this) is false in step 5 above. This caused B to unlearn and re-learn.
Also if you can shed some light regarding the role of ElectionEpoch that would be great. Is this due to the fact a Vote received by LeaderElection could be stale and forcing system to converge on an ElectionEpoch helps with liveliness ? i.e electing a leader with reasonably latest votes?. But why doesn't it consider Vote from LEADER/FOLLOWER for learning. Why is it necessary to learn only from LOOKING peers and not from LEADER/FOLLOWER.
Here is another case to illustrate this problem:
A[K], B[K], C[F], D[L], E[F] { K = looking, F = following, L = leading }
In a partitioned system here where A and B can see C and D but not E, C and D can see all. In this case A and B will never go to following state and follow D since both of them will never learn from out of election peers (exception is when an out of election peer has the same election epoch as current logical clock). 
Here the system is working without the participation of A and B. 
Any help is appreciated.
thanksPowell.

    On Tuesday, January 5, 2016 8:55 AM, Flavio Junqueira <fp...@apache.org> wrote:

 Hi Powell,

I don't understand why you want to reset the values of the server vote when the totalOrderPredicate check fails. The values you're referring to are epoch and zxid?

In the example you give, it looks like you're saying that the vote of B wins over the vote of C and the one of C wins over the one of A, so the order is B > C > A, but A shouldn't take C's vote because it already took B's and B's vote win. If that's the case, then this already happens. I'm probably missing the point here, so perhaps you could provide an example with more detail, like with epoch numbers and such to illustrate the point.

-Flavio

> On 01 Jan 2016, at 04:19, Powell Molleti <po...@yahoo.com.INVALID> wrote:
> 
> Hi,
> I  want to better understand the use of code here: http://bit.ly/1kxjk5GWhy should FLE reset the Vote to what is on the disk/initial values when totalOrderPredicate() fails in the case of received ElectionEpoch being greater than current vote's ElectionEpoch. 
> Going back to initial values(and clearing the recv set) does not seem to make it incorrect but seems to slow down FLE if I am not mistaken. For example if B has the best totalOrderPredicate() and A learns from it and if C has higher election epoch but older values then A is forced to reset what it learned from B till C and B catch up to each other?. Rather than let A and B wait for C to upgrade its values after A and B borrow its ElectionEpoch?. 
> Any help is appreciated.
> thanksPowell.

Re: FLE update proposal question

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.

Hi Powell,


Thanks for the presentation yesterday, it was great to hear about your progress with quorum connection manager, fle, etc. Check my comments inline, please:


> On 07 Jan 2016, at 11:34, powell molleti <po...@yahoo.com.INVALID> wrote:
> 
> 
> 
> Hi Flavio,
> 
> Here is my attempt(disabled rich text):
> 
> Lets assume a large cluster and look at three nodes.
> 
> ep = electionEpoch, p = peerEpoch, z = zxid
> step 1:
> A [ ep:3, p:1, z: 1 ] [ LOOKING ] { failed to follow multiple times, hence resets Vote to stored values and bumps up epoch }
> B [ ep:1, p:2, z: 9 ] [ FOLLOWING ]
> C [ ep:1, p:2, z: 9 ] [ LEADING ]
> 
> step 2:
> 
> B goes LOOKING state but cannot reach A or C
> 
> A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
> B [ ep:2, p:2, z: 99 ] [ LOOKING ] { starts vote with last committed transactions }
> C [ ep:1, p:2, z: 99 ] [ LEADING ]
> 
> step 3:
> 
> C goes into LOOKING
> 
> A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
> B [ ep:2, p:2, z: 99 ] [ LOOKING ]
> C [ ep:2, p:2, z: 999 ] [ LOOKING ]
> 

Here you are assuming that C proposed a bunch of transactions, but it was the only server to accept them, yes? There is no quorum for proposals 100-999, but it sounds like it is intentional.

> step 4:
> 
> Only B and C can reach each other so they converge.
> 
> 
> A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
> B [ ep:2, p:2, z: 999 ] [ LOOKING ]
> C [ ep:2, p:2, z: 999 ] [ LOOKING ]
> 
> step 5:
> 
> B hears from A now and A still cannot see C
> 
> bit.ly/1kxjk5G: A.ep > B.logicalClock and totalOrderPredicate(A, B) is false hence B resets to values stored on disk i.e lost values it learned from C but copies the logicalClock
> 
> 
> A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
> B [ ep:3, p:2, z: 99 ] [ LOOKING ] { moved back zxid }
> C [ ep:2, p:2, z: 999 ] [ LOOKING ]
> 
> Now A and B converge
> 
> step 6:
> A [ ep:3, p:2, z: 99 ] [ LOOKING ] 
> B [ ep:3, p:2, z: 99 ] [ LOOKING ]
> C [ ep:2, p:2, z: 999 ] [ LOOKING ]
> 
> My initial question is - what is goal of the resetting the proposal to on disk values when Rx proposal electionEpoch is greater than current logical clock and totalOrderPredicate(Rx, this) is false in step 5 above. This caused B to unlearn and re-learn.
> 

My recollection is that originally we were trying to prevent a server from participating in old election rounds or joining old quorums. For example, you could have the situation in which a server drops a leader (the leader doesn't know) and this server forms a quorum with some other servers, commit some txns, but later rejoins the previous leader and the previous leader will tell the server to truncate. Also, we were trying to avoid that a server that had a vote from an old election round brought it into a new round. 

We have made some significant changes since then. We use the server epoch in the vote and perform some checks after election to make sure that the leader is right. It's possible that the election epoch is no longer useful, which might simplify the current code, but I need to think some more about it to be sure.    

> Also if you can shed some light regarding the role of ElectionEpoch that would be great. Is this due to the fact a Vote received by LeaderElection could be stale and forcing system to converge on an ElectionEpoch helps with liveliness ? i.e electing a leader with reasonably latest votes?. But why doesn't it consider Vote from LEADER/FOLLOWER for learning. Why is it necessary to learn only from LOOKING peers and not from LEADER/FOLLOWER.
> 
> Here is another case to illustrate this problem:
> 
> A[K], B[K], C[F], D[L], E[F] { K = looking, F = following, L = leading }
> 
> In a partitioned system here where A and B can see C and D but not E, C and D can see all. 
> In this case A and B will never go to following state and follow D since both of them will never learn from out of election peers (exception is when an out of election peer has the same election epoch as current logical clock). 

We can try to optimize for partitions, but our rationale when we did all this was that it wasn't worth targeting permanent or just long partitions. It is certainly possible to tune it to be more partition-tolerant, but I find it arguable whether it is worth doing it.

> 
> Here the system is working without the participation of A and B. 
> 
> Any help is appreciated.
> 
> thanks
> Powell.


-Flavio

> 
> On Tuesday, January 5, 2016 8:55 AM, Flavio Junqueira <fp...@apache.org> wrote:
> Hi Powell,
> 
> I don't understand why you want to reset the values of the server vote when the totalOrderPredicate check fails. The values you're referring to are epoch and zxid?
> 
> In the example you give, it looks like you're saying that the vote of B wins over the vote of C and the one of C wins over the one of A, so the order is B > C > A, but A shouldn't take C's vote because it already took B's and B's vote win. If that's the case, then this already happens. I'm probably missing the point here, so perhaps you could provide an example with more detail, like with epoch numbers and such to illustrate the point.
> 
> -Flavio
> 
> 
> 
>> On 01 Jan 2016, at 04:19, Powell Molleti <po...@yahoo.com.INVALID> wrote:
>> 
>> Hi,
>> I  want to better understand the use of code here: http://bit.ly/1kxjk5GWhy should FLE reset the Vote to what is on the disk/initial values when totalOrderPredicate() fails in the case of received ElectionEpoch being greater than current vote's ElectionEpoch. 
>> Going back to initial values(and clearing the recv set) does not seem to make it incorrect but seems to slow down FLE if I am not mistaken. For example if B has the best totalOrderPredicate() and A learns from it and if C has higher election epoch but older values then A is forced to reset what it learned from B till C and B catch up to each other?. Rather than let A and B wait for C to upgrade its values after A and B borrow its ElectionEpoch?. 
>> Any help is appreciated.
>> thanksPowell.

Re: FLE update proposal question

Posted by powell molleti <po...@yahoo.com.INVALID>.

Hi Flavio,

Here is my attempt(disabled rich text):

Lets assume a large cluster and look at three nodes.

ep = electionEpoch, p = peerEpoch, z = zxid
step 1:
A [ ep:3, p:1, z: 1 ] [ LOOKING ] { failed to follow multiple times, hence resets Vote to stored values and bumps up epoch }
B [ ep:1, p:2, z: 9 ] [ FOLLOWING ]
C [ ep:1, p:2, z: 9 ] [ LEADING ]

step 2:

B goes LOOKING state but cannot reach A or C

A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
B [ ep:2, p:2, z: 99 ] [ LOOKING ] { starts vote with last committed transactions }
C [ ep:1, p:2, z: 99 ] [ LEADING ]

step 3:

C goes into LOOKING

A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
B [ ep:2, p:2, z: 99 ] [ LOOKING ]
C [ ep:2, p:2, z: 999 ] [ LOOKING ]

step 4:

Only B and C can reach each other so they converge.

A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
B [ ep:2, p:2, z: 999 ] [ LOOKING ]
C [ ep:2, p:2, z: 999 ] [ LOOKING ]

step 5:

B hears from A now and A still cannot see C

bit.ly/1kxjk5G: A.ep > B.logicalClock and totalOrderPredicate(A, B) is false hence B resets to values stored on disk i.e lost values it learned from C but copies the logicalClock

A [ ep:3, p:1, z: 1 ] [ LOOKING ] 
B [ ep:3, p:2, z: 99 ] [ LOOKING ] { moved back zxid }
C [ ep:2, p:2, z: 999 ] [ LOOKING ]

Now A and B converge

step 6:
A [ ep:3, p:2, z: 99 ] [ LOOKING ] 
B [ ep:3, p:2, z: 99 ] [ LOOKING ]
C [ ep:2, p:2, z: 999 ] [ LOOKING ]

My initial question is - what is goal of the resetting the proposal to on disk values when Rx proposal electionEpoch is greater than current logical clock and totalOrderPredicate(Rx, this) is false in step 5 above. This caused B to unlearn and re-learn.

Also if you can shed some light regarding the role of ElectionEpoch that would be great. Is this due to the fact a Vote received by LeaderElection could be stale and forcing system to converge on an ElectionEpoch helps with liveliness ? i.e electing a leader with reasonably latest votes?. But why doesn't it consider Vote from LEADER/FOLLOWER for learning. Why is it necessary to learn only from LOOKING peers and not from LEADER/FOLLOWER.

Here is another case to illustrate this problem:

A[K], B[K], C[F], D[L], E[F] { K = looking, F = following, L = leading }

In a partitioned system here where A and B can see C and D but not E, C and D can see all. 
In this case A and B will never go to following state and follow D since both of them will never learn from out of election peers (exception is when an out of election peer has the same election epoch as current logical clock). 

Here the system is working without the participation of A and B. 

Any help is appreciated.

thanks
Powell.

On Tuesday, January 5, 2016 8:55 AM, Flavio Junqueira <fp...@apache.org> wrote:
Hi Powell,

I don't understand why you want to reset the values of the server vote when the totalOrderPredicate check fails. The values you're referring to are epoch and zxid?

In the example you give, it looks like you're saying that the vote of B wins over the vote of C and the one of C wins over the one of A, so the order is B > C > A, but A shouldn't take C's vote because it already took B's and B's vote win. If that's the case, then this already happens. I'm probably missing the point here, so perhaps you could provide an example with more detail, like with epoch numbers and such to illustrate the point.

-Flavio

> On 01 Jan 2016, at 04:19, Powell Molleti <po...@yahoo.com.INVALID> wrote:
> 
> Hi,
> I  want to better understand the use of code here: http://bit.ly/1kxjk5GWhy should FLE reset the Vote to what is on the disk/initial values when totalOrderPredicate() fails in the case of received ElectionEpoch being greater than current vote's ElectionEpoch. 
> Going back to initial values(and clearing the recv set) does not seem to make it incorrect but seems to slow down FLE if I am not mistaken. For example if B has the best totalOrderPredicate() and A learns from it and if C has higher election epoch but older values then A is forced to reset what it learned from B till C and B catch up to each other?. Rather than let A and B wait for C to upgrade its values after A and B borrow its ElectionEpoch?. 
> Any help is appreciated.
> thanksPowell.

Re: FLE update proposal question

Posted by Flavio Junqueira <fp...@apache.org>.

Hi Powell,

I don't understand why you want to reset the values of the server vote when the totalOrderPredicate check fails. The values you're referring to are epoch and zxid?

In the example you give, it looks like you're saying that the vote of B wins over the vote of C and the one of C wins over the one of A, so the order is B > C > A, but A shouldn't take C's vote because it already took B's and B's vote win. If that's the case, then this already happens. I'm probably missing the point here, so perhaps you could provide an example with more detail, like with epoch numbers and such to illustrate the point.

-Flavio

> On 01 Jan 2016, at 04:19, Powell Molleti <po...@yahoo.com.INVALID> wrote:
> 
> Hi,
> I  want to better understand the use of code here: http://bit.ly/1kxjk5GWhy should FLE reset the Vote to what is on the disk/initial values when totalOrderPredicate() fails in the case of received ElectionEpoch being greater than current vote's ElectionEpoch. 
> Going back to initial values(and clearing the recv set) does not seem to make it incorrect but seems to slow down FLE if I am not mistaken. For example if B has the best totalOrderPredicate() and A learns from it and if C has higher election epoch but older values then A is forced to reset what it learned from B till C and B catch up to each other?. Rather than let A and B wait for C to upgrade its values after A and B borrow its ElectionEpoch?. 
> Any help is appreciated.
> thanksPowell.