You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@zookeeper.apache.org by GitBox <gi...@apache.org> on 2021/10/07 10:27:44 UTC

[GitHub] [zookeeper] andrekramer1 opened a new pull request #1770: Netty server does not respond to ruok while initializing cluster

andrekramer1 opened a new pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770


   Apache Pulsar (and others I suspect) use Zookeeper in a Kubernetes statefull set with liveness and ready probes polling the "ruok" Zookeeper command. With the Netty server configured, on later versions of Zookeeper, the first replica would start but never become ready so the statefull set could not scale up from 1 to the desired replica count. This is due to the first replica never replying to "ruok" - it just closes the connection.
   
   Apache Pulsar issue https://github.com/apache/pulsar/issues/11070 reported this failure and this change set was created to get the server to respond to "ruok" while initializing. With these changes the set scales up to the desired 3 replicas.
   
   The issue does not occur with the NIO server context (which is the default) but I've not compared the two to work out exact differences - just modified the Netty one to respond in more cases. There's a tricky issue of disallowing exceedingly large requests as well (in code below) as well as the general question of is it ok to proceed past these checks that were closing the connection. In a multi-threaded server checking a variable isRunning() could be a race in any case so hopefully it's still robust with these changes but they probably need to be taken over by an expert and just used as a starting point for a fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1017383493


   @andrekramer1 @anmolnar @lvfangmin I created an alternative, simpler, patch
   #1798 that does not change the behaviour to prevent floods


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] lhotari commented on a change in pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#discussion_r778804659



##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java
##########
@@ -521,11 +525,16 @@ private void receiveMessage(ByteBuf message) {
                             throw new IOException("Len error " + len);
                         }
                         ZooKeeperServer zks = this.zkServer;
-                        if (zks == null || !zks.isRunning()) {
-                            throw new IOException("ZK down");
+
+                        if (zks != null && !zks.isRunning()) {

Review comment:
       Is the condition correct? I'd assume that the condition would be `zks.isRunning()`.
   ```suggestion
                           if (zks != null && zks.isRunning()) {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] andrekramer1 edited a comment on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
andrekramer1 edited a comment on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1017395134


   @eolivelli I think that would not allow it progress to report it's up if SSL is used. Don't know if that is important but was my reason for removing the "flood prevention". Have you tested it solves the 3 node cluster initialization issue for Pulsar?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] andrekramer1 commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
andrekramer1 commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1021263029


   There was a change to fix the NPE but it may not have fixed this issue: https://github.com/apache/zookeeper/pull/1798 Probably needs testing on latest release and meanwhile switching to NIOServerCNXNs should be a work around.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli closed pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli closed pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-937865952


   One problem is the the 4 letters words API is deprecated in favour of the HTTP Admin API.
   I am not saying that we should not fix this, but that we are going to fix something that we want to drop one day.
   
   My understanding is that we want to answer something only to "ruok" and not to other commands in this case.
   
   I remember this other PR that is in the same direction
   https://github.com/apache/zookeeper/pull/1520 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] lhotari commented on a change in pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#discussion_r778806253



##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java
##########
@@ -521,11 +525,16 @@ private void receiveMessage(ByteBuf message) {
                             throw new IOException("Len error " + len);
                         }
                         ZooKeeperServer zks = this.zkServer;
-                        if (zks == null || !zks.isRunning()) {
-                            throw new IOException("ZK down");
+
+                        if (zks != null && !zks.isRunning()) {
+                            // checkRequestSize will throw IOException if request is rejected
+                            zks.checkRequestSizeWhenReceivingMessage(len);
+                        } else {
+                            LOG.debug("Skip configurable check for max receive length");
+                            if (len > 64 * 1024) {
+                                throw new IOException("Received message size too large for uninitialized server");
+                            }

Review comment:
       This check looks unnecessary. 
   There's already https://github.com/apache/zookeeper/blob/c926b2f4d58ad08c33ff88ff6256347dab77ef60/zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java#L524-L526




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on a change in pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#discussion_r785712241



##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxnFactory.java
##########
@@ -225,15 +225,6 @@ public void channelActive(ChannelHandlerContext ctx) throws Exception {
             NettyServerCnxn cnxn = new NettyServerCnxn(channel, zkServer, NettyServerCnxnFactory.this);
             ctx.channel().attr(CONNECTION_ATTRIBUTE).set(cnxn);
 
-            // Check the zkServer assigned to the cnxn is still running,
-            // close it before starting the heavy TLS handshake
-            if (!cnxn.isZKServerRunning()) {
-                LOG.warn("Zookeeper server is not running, close the connection before starting the TLS handshake");
-                ServerMetrics.getMetrics().CNXN_CLOSED_WITHOUT_ZK_SERVER_RUNNING.add(1);

Review comment:
       I am re-reading the patch again.
   It looks like we are no more updating this metric.
   and also we are dropping this case.
   
   My understanding is that this check is here to prevent a flood of useless (but heavyweight) TLS handshakes after restarting the ZK node.
   
   I am not sure this is a good move to remove this. 
   This fix may work on a small cluster (with very few ZK clients I mean)
   
   @lvfangmin If I read correctly (from git blame) this improvement was part of ZOOKEEPER-3682 and the set of patches ported from Facebook ZooKeeper fork.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] andrekramer1 commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
andrekramer1 commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1017395134


   @eolivelli I think that would not allow it progress to report it's up if SSL is used. Don't know if that is important but was my reason for removing the "flood prevention". Have you tested it solves the 3 node cluster initialization issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] andrekramer1 commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
andrekramer1 commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-944364560


   @eolivelli Hi, I changed the log level to cut down on noise. On the other PR, I also have some checks for zkServer == null to avoid NPEs I saw but I think not for the NPE that the other PR reported. What are the next steps for this - especially with regards to wider potential impact on Zookkeeper?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1021414827


   I am following up.
   My goal is to have this fixed and to cut a release


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1021987814


   @andrekramer1 
   this is my fix https://github.com/apache/zookeeper/pull/1799
   
   I verified it works.
   Please follow up there for new comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-937865952


   One problem is the the 4 letters words API is deprecated in favour of the HTTP Admin API.
   I am not saying that we should not fix this, but that we are going to fix something that we want to drop one day.
   
   My understanding is that we want to answer something only to "ruok" and not to other commands in this case.
   
   I remember this other PR that is in the same direction
   https://github.com/apache/zookeeper/pull/1520 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] eolivelli commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1017401510


   @andrekramer1 
   my understanding is that the problem is that due to the NPE the connection hangs.
   
   if we close the connection when the server is not ready that it should be fine (the probe will retry)
   
   Because to the question "are you okay?" we cannot answer "I am okay" if the server is not ready to process requests.
   
   so:
   - for TLS: no server running -> close the connection
   - for non TLS: no server running -> close the connection (do not throw NPE and let the connection hang)
   
   > Have you tested it solves the 3 node cluster initialisation issue for Pulsar?
   no I haven't
   
   in NIOServerCnxn if the server is not running we still throw an error and close the connection
   
   https://github.com/apache/zookeeper/blob/94d0c4d8558e1b201665bb9dffd33dacbc7ca945/zookeeper-server/src/main/java/org/apache/zookeeper/server/NIOServerCnxn.java#L423
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] andrekramer1 commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
andrekramer1 commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1021263029


   There was a change to fix the NPE but it may not have fixed this issue: https://github.com/apache/zookeeper/pull/1798 Probably needs testing on latest release and meanwhile switching to NIOServerCNXNs should be a work around.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] lhotari commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-979508229


   The Zookeeper issue seems to be https://issues.apache.org/jira/browse/ZOOKEEPER-3988


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] andrekramer1 commented on a change in pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
andrekramer1 commented on a change in pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#discussion_r778864150



##########
File path: zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java
##########
@@ -521,11 +525,16 @@ private void receiveMessage(ByteBuf message) {
                             throw new IOException("Len error " + len);
                         }
                         ZooKeeperServer zks = this.zkServer;
-                        if (zks == null || !zks.isRunning()) {
-                            throw new IOException("ZK down");
+
+                        if (zks != null && !zks.isRunning()) {

Review comment:
       Thanks, I was not sure if this was needed and changed the condition.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zookeeper] lhotari commented on pull request #1770: Netty server does not respond to ruok while initializing cluster

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #1770:
URL: https://github.com/apache/zookeeper/pull/1770#issuecomment-1005824309


   @anmolnar @maoling  please review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org