You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@zookeeper.apache.org by GitBox <gi...@apache.org> on 2020/02/12 13:52:49 UTC

[GitHub] [zookeeper] symat opened a new pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

symat opened a new pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251
 
 
   The multi-address feature introduced in ZOOKEEPER-3188 required
   changes in the Quorum protocol as we had to send all addresses in
   the connection initiation message to enable the receiving side to
   choose a reachable address in case of network failure.
   
   The new code can handle both the old and the new protocol versions to
   avoid 'invalid protocol' error e.g. during rolling restarts. However,
   the new protocol version still can not be used during rolling upgrade
   if the old servers are not supporting this protocol. In this case the
   old and the new servers would form two distinct partitions until all
   the servers get upgraded. To support rolling upgrades too, we want to
   disable the MultiAddress feature by default and use the old protocol.
   
   If the user would like enable the MultiAddress feature on a 3.6.0
   cluster, she/he can do it either by 1) starting the cluster from
   scratch (without rolling upgrade), or 2) performing a rolling upgrade
   without the MultiAddress feature enabled then doing a rolling restart
   with a new configuration where the MultiAddress feature is enabled.
   During the rolling restart there will be no partitions, as all the
   servers in the cluster now will run ZooKeeper version 3.6.0 which
   understands now both protocols.
   
   The changes in this patch:
   - introducing new config property: multiAddress.enabled, disabled
   by default
   - updating QuorumCnxManager to be able to use both protocol versions
   and to use the old one if MultiAddress is disabled
   - failing with ConfigException if the user provides multiple
   addresses in the config while having MultiAddress disabled
   - updating the existing MultiAddress related tests to enable the
   feature first
   - add some new tests
   - update the documentation
   
   Testing:
   - I ran all the unit tests
   - Using https://github.com/symat/zk-rolling-upgrade-test
     - I tested rolling upgrade from 3.5.6
     - I tested rolling restart to enable the MultiAddress feature
   - Using https://github.com/symat/zookeeper-docker-test
     - I tested the MultiAddress feature by disabling some virtual
       interfaces and waiting for the cluster to recover

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-585249349
 
 
   > with this patch the first problem found my @phunt in 3.6.0rc0 wouldn't be discovered.
   
   this is true... still (for the overall quality) I think it is good that both issues (ICMP and the expected behavior in rolling restarts) were raised. I am just sad that we haven't deal with these during the development of ZOOKEEPER-3188.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378474147
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -416,10 +424,20 @@ private boolean startConnection(Socket sock, Long sid) throws IOException {
 
             // Sending id and challenge
 
-            // represents protocol version (in other words - message type)
-            dout.writeLong(PROTOCOL_VERSION);
+            // First sending the protocol version (in other words - message type).
+            // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
+            // feature is enabled. During rolling upgrade, we must make sure that all the servers can
+            // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
+            long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_3_6_0_MULTI_ADDRESS : PROTOCOL_VERSION_3_4_7;
+            dout.writeLong(protocolVersion);
 
 Review comment:
   Thanks, I really like this idea with the 'feature flags', I will change accordingly. 
   
   The only constraint is that the version number must be negative. I think I saw in the git history, that some years ago, when the protocol version was not introduced yet, this was the way how someone solved the introduction of versioning. The old code sent some positive sid or message size, and the new code sent a negative version number. This way enforcing the old code to break and throw away the message, while the new code was able to verify if it got the message in the new format based on the negative number. But I need to double-check this tomorrow, maybe I remember wrong.
   
   And also there can be maybe 1-2 negative numbers already used as protocol versions, so I will check which bits are still available.
   
   Although I think it is enough to support the rolling upgrade from the first stable 3.4 release, which can simplify things, and we don't need to go back in time in the code too much.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378466027
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -113,9 +115,12 @@
     private AtomicLong observerCounter = new AtomicLong(-1);
 
     /*
-     * Protocol identifier used among peers
+     * Protocol identifier used among peers (must be a negative number for backward compatibility reasons)
      */
-    public static final long PROTOCOL_VERSION = -65535L;
+    // the following protocol version was sent in every connection initiation message since ZOOKEEPER-2186 released in 3.4.7
+    public static final long PROTOCOL_VERSION_3_4_7 = -65536L;
 
 Review comment:
   The reason why I named this way was that I am not sure if we won't need to add other previous versions later (e.g. to support rolling restart from pre-3.4.7 versions). But giving a second thought, I like your proposal better. These are just constants, we can rename them later in any ways. I will change this in the next commit.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-586354608
 
 
   @anmolnar @nrkalmar are you okay with this final form?
   If so please merge to branch-3.6 
   
   I will cut a new RC during the weekend
   I will take care about 3.6.0 branch, I will port more patches 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] anmolnar commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
anmolnar commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378288766
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -639,7 +657,7 @@ public void toSend(Long sid, ByteBuffer b) {
     synchronized boolean connectOne(long sid, MultipleAddresses electionAddr) {
         if (senderWorkerMap.get(sid) != null) {
             LOG.debug("There is a connection already for server {}", sid);
-            if (electionAddr.size() > 1 && self.isMultiAddressReachabilityCheckEnabled()) {
+            if (electionAddr.size() > 1 && self.isMultiAddressEnabled() && self.isMultiAddressReachabilityCheckEnabled()) {
 
 Review comment:
   Why do you need the "enabled" check here?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-585704893
 
 
   @ztzg thanks for checking!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] ztzg commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
ztzg commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-585371302
 
 
   LGTM!  (I reviewed the code, but did not do any testing.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] anmolnar commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
anmolnar commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378289899
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -639,7 +657,7 @@ public void toSend(Long sid, ByteBuffer b) {
     synchronized boolean connectOne(long sid, MultipleAddresses electionAddr) {
         if (senderWorkerMap.get(sid) != null) {
             LOG.debug("There is a connection already for server {}", sid);
-            if (electionAddr.size() > 1 && self.isMultiAddressReachabilityCheckEnabled()) {
+            if (electionAddr.size() > 1 && self.isMultiAddressEnabled() && self.isMultiAddressReachabilityCheckEnabled()) {
 
 Review comment:
   Does not do any harm though... move it to the beginning of `if` please.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378287448
 
 

 ##########
 File path: zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
 ##########
 @@ -1584,6 +1596,16 @@ and [SASL authentication for ZooKeeper](https://cwiki.apache.org/confluence/disp
     (e.g. the zk/myhost@EXAMPLE.COM client principal will be authenticated in ZooKeeper as zk/myhost)
     Default: false
 
+* *multiAddress.enable* :
+    (Java system property: **zookeeper.multiAddress.enable**)
 
 Review comment:
   true, thanks. :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli closed pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli closed pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378790887
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -416,10 +424,20 @@ private boolean startConnection(Socket sock, Long sid) throws IOException {
 
             // Sending id and challenge
 
-            // represents protocol version (in other words - message type)
-            dout.writeLong(PROTOCOL_VERSION);
+            // First sending the protocol version (in other words - message type).
+            // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
+            // feature is enabled. During rolling upgrade, we must make sure that all the servers can
+            // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
+            long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_3_6_0_MULTI_ADDRESS : PROTOCOL_VERSION_3_4_7;
+            dout.writeLong(protocolVersion);
 
 Review comment:
   OK, I did some deeper digging in the git history, this is what I found:
   
   The different version numbers / behaviors we had before:
   - protocol version originally introduced by ZOOKEEPER-107 in 3.5.0
   - before 3.5.0, always a positive SID was sent in the beginning of the connection initiation message and there were no address information there. Even 3.4.14 still only sends a positive SID.
   - since ZOOKEEPER-107 (in 3.5.0+): version -65536L (0xffff0000) was sent, and on the receiver side:
      - if any positive (incl. 0) Long number received, then accept the number as SID and get the address from the own config
      - if -65536L was received, then parsing the message with the new format and use the address provided there
      - for any other negative numbers, we fail
    - in ZOOKEEPER-1633 (3.4.7), the receiver logic was making more clever to expect the SID as the second Long in the message, if the first Long was negative. So from 3.4.7 all the 3.4 servers can at least not die due to the new message formats, and rolling upgrades become possible without multiple partitions. However, all 3.4.X still sends only the positive SID and will not use any address information from the initial message (just reads the SID).
   
   This also means, that we can choose any negative number as new protocol version for the MultiAddress feature, and then the rolling upgrade from 3.4.7-3.4.14 to 3.6.0 could work theoretically even with MultiAddress enabled, as long as we always send the SID right after the protocol version.
   
   But the value of the protocol version is actually checked in 3.5.0, so we definitely need to disable MultiAddress to be able to get the rolling upgrade to work from 3.5.
   
   Long story short: we only have a single protocol version, -65536 (0xffff0000) as of now, plus the case when positive numbers are sent. The -65535 (0xffff0001) seems to me a good protocol version candidate for the multiaddress compatible protocol, since it is also only a single bit difference between the two. So if later we will introduce the 'feature flags' we can already say that the last bit represented the multiaddress feature. 
   
   What do you think? (maybe I misunderstood your proposal here)
   
   But I just don't know how this whole thing should look in the later releases. Also this whole protocol version is now only about the initial message. We don't know if there is any other incompatibility between the other messages.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r379509522
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -416,10 +424,20 @@ private boolean startConnection(Socket sock, Long sid) throws IOException {
 
             // Sending id and challenge
 
-            // represents protocol version (in other words - message type)
-            dout.writeLong(PROTOCOL_VERSION);
+            // First sending the protocol version (in other words - message type).
+            // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
+            // feature is enabled. During rolling upgrade, we must make sure that all the servers can
+            // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
+            long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_3_6_0_MULTI_ADDRESS : PROTOCOL_VERSION_3_4_7;
+            dout.writeLong(protocolVersion);
 
 Review comment:
   @eolivelli what do you think?
   
   should we add the 'feature flags' implementation here in 3.6.0? Or is it enough if we choose a new optional version number that differs only in a single bit to the current version number, so that we are keeping this possibility of 'feature flags' open for the future? (currently we have only a single 'feature' anyway) 
   
   I am not sure if it worths to push for a more generic solution, given that we don't really know yet how to design this in the future.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] nkalmar commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
nkalmar commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378285237
 
 

 ##########
 File path: zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
 ##########
 @@ -1584,6 +1596,16 @@ and [SASL authentication for ZooKeeper](https://cwiki.apache.org/confluence/disp
     (e.g. the zk/myhost@EXAMPLE.COM client principal will be authenticated in ZooKeeper as zk/myhost)
     Default: false
 
+* *multiAddress.enable* :
+    (Java system property: **zookeeper.multiAddress.enable**)
 
 Review comment:
   Same here, missing the d (internet has ruined me, no pun intended)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] nkalmar edited a comment on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
nkalmar edited a comment on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-586615506
 
 
   retest maven please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378790887
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -416,10 +424,20 @@ private boolean startConnection(Socket sock, Long sid) throws IOException {
 
             // Sending id and challenge
 
-            // represents protocol version (in other words - message type)
-            dout.writeLong(PROTOCOL_VERSION);
+            // First sending the protocol version (in other words - message type).
+            // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
+            // feature is enabled. During rolling upgrade, we must make sure that all the servers can
+            // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
+            long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_3_6_0_MULTI_ADDRESS : PROTOCOL_VERSION_3_4_7;
+            dout.writeLong(protocolVersion);
 
 Review comment:
   OK, I did some deeper digging in the git history, this is what I found:
   
   The different version numbers / behaviors we had before:
   - protocol version originally introduced by ZOOKEEPER-107 in 3.5.0
   - before 3.5.0, always a positive SID was sent in the beginning of the connection initiation message and there were no address information there. Even 3.4.14 still only sends a positive SID.
   - since ZOOKEEPER-107 (in 3.5.0+): version -65536L (0xffff0000) was sent, and on the receiver side:
      - if any positive (incl. 0) Long number received, then accept the number as SID and get the address from the own config
      - if -65536L was received, then parsing the message with the new format and use the address provided there
      - for any other negative numbers, we fail
    - in ZOOKEEPER-1633 (3.4.7), the receiver logic was making more clever to expect the SID as the second Long in the message, if the first Long was negative. So from 3.4.7 all the 3.4 servers can at least not die due to the new message formats, and rolling upgrades become possible without multiple partitions. However, all 3.4.X still sends only the positive SID and will not use any address information from the initial message (just reads the SID).
   
   This also means, that we can choose any negative number as new protocol version for the MultiAddress feature, and then the rolling upgrade from 3.4.7-3.4.14 to 3.6.0 could work theoretically even with MultiAddress enabled, as long as we always send the SID right after the protocol version.
   
   But the value of the protocol version is actually checked in 3.5.0, so we definitely need to disable MultiAddress to be able to get the rolling upgrade to work from 3.5.
   
   Long story short: we only have a single protocol version, -65536 (0xffff0000) as of now, plus the case when positive numbers are sent. The -65535 (0xffff0001) seems to me a good protocol version candidate for the multiaddress compatible protocol, since it is also only a single bit difference between the two. What do you think?
   
   But I just don't know how this whole thing should look in the later releases. Also this whole protocol version is now only about the initial message. We don't know if there is any other incompatibility between the other messages.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378285330
 
 

 ##########
 File path: zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/QuorumPeerMainMultiAddressTest.java
 ##########
 @@ -230,6 +233,29 @@ public void shouldReconfigIncrementally_IPv6() throws Exception {
     checkIfZooKeeperQuorumWorks(newQuorumConfig);
   }
 
+  @Test(expected = KeeperException.class)
 
 Review comment:
   thanks, it's nicer that way indeed, I will fix it

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378268889
 
 

 ##########
 File path: zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
 ##########
 @@ -1584,6 +1596,16 @@ and [SASL authentication for ZooKeeper](https://cwiki.apache.org/confluence/disp
     (e.g. the zk/myhost@EXAMPLE.COM client principal will be authenticated in ZooKeeper as zk/myhost)
     Default: false
 
+* *multiAddress.enable* :
 
 Review comment:
   multiAddress.enabled

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378790887
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -416,10 +424,20 @@ private boolean startConnection(Socket sock, Long sid) throws IOException {
 
             // Sending id and challenge
 
-            // represents protocol version (in other words - message type)
-            dout.writeLong(PROTOCOL_VERSION);
+            // First sending the protocol version (in other words - message type).
+            // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
+            // feature is enabled. During rolling upgrade, we must make sure that all the servers can
+            // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
+            long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_3_6_0_MULTI_ADDRESS : PROTOCOL_VERSION_3_4_7;
+            dout.writeLong(protocolVersion);
 
 Review comment:
   OK, I did some deeper digging in the git history, this is what I found:
   
   The different version numbers / behaviors we had before:
   - protocol version originally introduced by ZOOKEEPER-107 in 3.5.0
   - before 3.5.0, always a positive SID was sent in the beginning of the connection initiation message and there were no address information there. Even 3.4.14 still only sends a positive SID.
   - since ZOOKEEPER-107 (in 3.5.0+): version -65536L (0xffff0000) was sent, and on the receiver side:
      - if any positive (incl. 0) Long number received, then accept the number as SID and get the address from the own config
      - if -65536L was received, then parsing the message with the new format and use the address provided there
    - in ZOOKEEPER-1633 (3.4.7), the receiver logic was making more clever to expect the SID as the second Long in the message, if the first Long was negative. So from 3.4.7 all the 3.4 servers can at least not die due to the new message formats, and rolling upgrades become possible without multiple partitions. However, all 3.4.X still sends only the positive SID and will not use any address information from the initial message (just reads the SID).
   
   This also means, that we can choose any negative number as new protocol version for the MultiAddress feature, and then the rolling upgrade from 3.4.7-3.4.14 to 3.6.0 could work theoretically even with MultiAddress enabled, as long as we always send the SID right after the protocol version.
   
   But the value of the protocol version is actually checked in 3.5.0, so we definitely need to disable MultiAddress to be able to get the rolling upgrade to work from 3.5.
   
   Long story short: we only have a single protocol version, -65536 (0xffff0000) as of now, plus the case when positive numbers are sent. The -65535 (0xffff0001) seems to me a good protocol version candidate for the multiaddress compatible protocol, since it is also only a single bit difference between the two. What do you think?
   
   But I just don't know how this whole thing should look in the later releases. Also this whole protocol version is now only about the initial message. We don't know if there is any other incompatibility between the other messages.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378303201
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -639,7 +657,7 @@ public void toSend(Long sid, ByteBuffer b) {
     synchronized boolean connectOne(long sid, MultipleAddresses electionAddr) {
         if (senderWorkerMap.get(sid) != null) {
             LOG.debug("There is a connection already for server {}", sid);
-            if (electionAddr.size() > 1 && self.isMultiAddressReachabilityCheckEnabled()) {
+            if (electionAddr.size() > 1 && self.isMultiAddressEnabled() && self.isMultiAddressReachabilityCheckEnabled()) {
 
 Review comment:
   you are right, actually if multi-address is disabled, then we can't have more than one address anyway. Still, I think it is good to have the extra condition there (if for nothing else, maybe just to make the code easier to understand)
   I will move it to the beginning.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378438237
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -416,10 +424,20 @@ private boolean startConnection(Socket sock, Long sid) throws IOException {
 
             // Sending id and challenge
 
-            // represents protocol version (in other words - message type)
-            dout.writeLong(PROTOCOL_VERSION);
+            // First sending the protocol version (in other words - message type).
+            // For backward compatibility reasons we stick to the old protocol version, unless the MultiAddress
+            // feature is enabled. During rolling upgrade, we must make sure that all the servers can
+            // understand the protocol version we use to avoid multiple partitions. see ZOOKEEPER-3720
+            long protocolVersion = self.isMultiAddressEnabled() ? PROTOCOL_VERSION_3_6_0_MULTI_ADDRESS : PROTOCOL_VERSION_3_4_7;
+            dout.writeLong(protocolVersion);
 
 Review comment:
   In case of new protocol (v2) I would like to spend here a 64bit long of 'flags' (filled with zeros in 3.6.0) in order to introduce a more futureproof 'feature detection' that is better than using protocol version numbers.
   
   We can then add 'multiaddress' as an enabled feature.
   
   We should do more reasoning and design steps but this way of coding the initial handshake is quite common and it will open up the way for future improvements.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378271226
 
 

 ##########
 File path: zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/QuorumPeerMainMultiAddressTest.java
 ##########
 @@ -230,6 +233,29 @@ public void shouldReconfigIncrementally_IPv6() throws Exception {
     checkIfZooKeeperQuorumWorks(newQuorumConfig);
   }
 
+  @Test(expected = KeeperException.class)
 
 Review comment:
   this way of testing for an expected exception is too broad, you don't know where KeeperException has been thrown.
   I suggest to push the assertion around the  ReconfigTest.reconfig call

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli edited a comment on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli edited a comment on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-586354608
 
 
   @anmolnar @nkalmar are you okay with this final form?
   If so please merge to branch-3.6 
   
   I will cut a new RC during the weekend
   I will take care about 3.6.0 branch, I will port more patches 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] nkalmar commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
nkalmar commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-586615627
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
symat commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378284246
 
 

 ##########
 File path: zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
 ##########
 @@ -1584,6 +1596,16 @@ and [SASL authentication for ZooKeeper](https://cwiki.apache.org/confluence/disp
     (e.g. the zk/myhost@EXAMPLE.COM client principal will be authenticated in ZooKeeper as zk/myhost)
     Default: false
 
+* *multiAddress.enable* :
 
 Review comment:
   nice catch, thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#discussion_r378436446
 
 

 ##########
 File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
 ##########
 @@ -113,9 +115,12 @@
     private AtomicLong observerCounter = new AtomicLong(-1);
 
     /*
-     * Protocol identifier used among peers
+     * Protocol identifier used among peers (must be a negative number for backward compatibility reasons)
      */
-    public static final long PROTOCOL_VERSION = -65535L;
+    // the following protocol version was sent in every connection initiation message since ZOOKEEPER-2186 released in 3.4.7
+    public static final long PROTOCOL_VERSION_3_4_7 = -65536L;
 
 Review comment:
   I would call this constant PROTOCOL_VERSION_V1
   and the other one PROTOCOL_VERSION_V2

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [zookeeper] nkalmar commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)

Posted by GitBox <gi...@apache.org>.
nkalmar commented on issue #1251: ZOOKEEPER-3720: Fix rolling upgrade failure (invalid protocol version)
URL: https://github.com/apache/zookeeper/pull/1251#issuecomment-586615506
 
 
   retest maven

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services