Posted to dev@zookeeper.apache.org by Tom Lee <ap...@tomlee.co> on 2018/04/09 20:13:17 UTC

QuorumCnxManager "challenge" protocol details?

Hello,

Relatively new to the ZK code base, please be gentle. :) This is bordering
on a question for users@, but I'm asking here because I'm more than happy
to try and dig into the code if it's not too far beyond my reach -- hope
that's okay.

I'm trying to dig into / work around ZOOKEEPER-2938:

https://issues.apache.org/jira/browse/ZOOKEEPER-2938

Unfortunately, the proposed work-around (simply restarting the leader)
isn't particularly great for us because of some limitations in our
automation -- so I'm trying to see if we can find some alternatives and/or
fix the issue properly.

Looking at
https://github.com/apache/zookeeper/blob/75411ab34a3d53c43c2d508b12314a9788aa417d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L391
-- afaict what's happening is the "unhappy"/prospective member of the
quorum is attempting to connect to other, established members, sends a
challenge request (which seems to just be a simple payload consisting of
its ID and the local election host + port), then promptly closes the
connection because its own ID is less than that of the recipient(s) --
seemingly without waiting for a response.
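
To make the description above concrete, here is a minimal sketch of the
behavior as described, not the actual QuorumCnxManager code. The class name,
method names, and byte layout (a long ID, then a length-prefixed "host:port"
string) are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and payload layout are assumptions,
// not taken from the ZooKeeper source.
class ChallengeSketch {

    // Builds the "challenge" payload: the sender's ID followed by its
    // election address as a length-prefixed "host:port" string.
    static byte[] buildChallenge(long myId, String electionAddr) {
        byte[] addr = electionAddr.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + Integer.BYTES + addr.length)
                .putLong(myId)
                .putInt(addr.length)
                .put(addr)
                .array();
    }

    // After sending the challenge, the initiating peer drops the
    // connection when its own ID is less than the recipient's ID.
    static boolean closesAfterChallenge(long myId, long remoteId) {
        return myId < remoteId;
    }
}
```

With IDs 1 and 3, for example, closesAfterChallenge(1, 3) is true and
closesAfterChallenge(3, 1) is false, so under this rule only the connection
initiated by the higher-ID peer is kept open by its initiator.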

The mechanics are all easy enough to understand, but I feel like I'm
lacking some context RE: what's *supposed* to happen here. When this code
is all working as expected, what *should* happen with respect to these
challenges? What is this code trying to achieve by forcefully disconnecting
from peers with an ID greater than the local peer?

I also don't fully understand why restarting the leader would fix things,
but that's probably just something I need to dive into to get to the bottom
of this.

Appreciate any guidance.

Cheers,
Tom