You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/02/01 16:47:02 UTC

[GitHub] [pulsar] frederic-kneier opened a new pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

frederic-kneier opened a new pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088


   Netcat returns before zookeeper is able to reply leading to a failed check even if the reply would arrive shortly thereafter.
   
   Fixes #11070
   
   ### Motivation
   
   Readiness and liveness probes in Kubernetes fail in Kubernetes in some cases because the check does not wait for a response.
   
   ### Modifications
   
   The check script now waits for 1 seconds for a response.
   
   ### Verifying this change
   
   Since this problem is caused by a race condition, testing is a bit complicated.
   
   ### Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API: no
     - The schema: no
     - The default values of configurations: no
     - The wire protocol: no
     - The rest endpoints: no
     - The admin cli options: no
     - Anything that affects deployment: don't know
   
   ### Documentation
   
   Check the box below or label this PR directly (if you have committer privilege).
   
   Need to update docs? 
   
   - [ x ] `no-need-doc` 
     
     The actual intention of the script does not change.
     
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027221858


   > Since the cause of the problem is a race condition 
   
   @frederic-kneier could you share a reference to the race condition that you are referring? It would be interesting to learn more about that. Thanks! Great work on this issue!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027636152


   I made a similar change to the Apache Pulsar Helm chart: https://github.com/apache/pulsar-helm-chart/pull/223 . Instead of relying on the `pulsar-zookeeper-ruok.sh` script in the container image, I replaced it with `bash -c 'echo ruok | nc -q 1 localhost 2181 | grep imok'`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] frederic-kneier edited a comment on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
frederic-kneier edited a comment on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027309429


   @lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It has to be "echo ruok | nc -q 1 localhost 2181"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027613244


   > @lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It has to be "echo ruok | nc -q 1 localhost 2181"
   
   @frederic-kneier ok, good to hear about that. I'm trying to understand the reason why it fixes the problem.
   
   > If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply.
   
   In this case, this explanation doesn't seem to hold. The apache/pulsar:2.8.2 image contains netcat-openbsd and for it, `-q -1` is the default and would prevent closing the socket after stdin EOF. 
   
   Here's some parts of the man page for netcat-openbsd nc
   ```
       -N      shutdown(2) the network socket after EOF on the input.  Some servers require this to finish their work.
   
        -q seconds
                after EOF on stdin, wait the specified number of seconds and then quit. If seconds is negative, wait forever (default).  Specifying a non-negative seconds implies -N.
   ```
   
   It says "Some servers require this to finish their work". I wonder why this is the case for Zookeeper. It seems to happen only when Zookeeper is configured using `org.apache.zookeeper.server.NettyServerCnxnFactory`.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027237833


   I'm running this experiment in the Apache Pulsar Helm Chart: https://github.com/apache/pulsar-helm-chart/pull/190/commits/98dd0297f8782f2aa39dc1447ef6d0c4efbd6a5e . Let's see if the tests finally pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] frederic-kneier commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
frederic-kneier commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027309429


   @lhotari calling "echo ruok | nc -q -1 localhost 2181" does not solve the problem. It hat to be "echo ruok | nc -q 1 localhost 2181"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] frederic-kneier commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
frederic-kneier commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027201801


   If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply. This leads to an empty output which then leads to a failed health check. This behavior seems to be different for different version of nc (OpenBSD, Linux). Since the cause of the problem is a race condition the "-q 1" will wait one second before the program terminates and the server is able to send the reply. This behavior is reproducable on certain Kubernetes clusters with small nodes and seems to be fixed with this change.
   
   for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok
   for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027673568


   the `-q 1` didn't solve the issue in Pulsar Helm Chart CI tests when TLS is enabled for Zookeeper. 
   I made a change to send the "ruok" to the TLS port when TLS is enabled. The command used is `bash -c 'echo ruok | openssl s_client -quiet -crlf -connect localhost:2281 -cert /pulsar/certs/zookeeper/tls.crt -key /pulsar/certs/zookeeper/tls.key | grep imok'`. [A similar approach is used in the Bitnami Zookeeper Helm chart for Zookeeper probes](https://github.com/bitnami/charts/blob/5b659dea765218500a85c33a40361db0bec040bf/bitnami/zookeeper/templates/statefulset.yaml#L355). This doesn't resolve the issue and TLS tests still fail occasionally. This means that the problem is elsewhere. Perhaps the ZK fix https://github.com/apache/zookeeper/pull/1800 resolves it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027628937


   I found some explanation in https://stackoverflow.com/questions/4160347/close-vs-shutdown-socket/23483487 .
   netcat will send a FIN and close the connection cleanly when using `-q 1`. I guess that the default for netcat-openbsd is that it will wait for the other end to close the connection unless `-q 1` is specified.
   It feels like a bug in Zookeeper, but I'm fine if this mitigates it. I'm just wonder what other consequences there could be.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027231048


   I did some googling and there's `-q -1` in this answer: https://unix.stackexchange.com/a/274603/ . 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui merged pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
codelipenghui merged pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027220070


   > If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply. This leads to an empty output which then leads to a failed health check. This behavior seems to be different for different version of nc (OpenBSD, Linux). Since the cause of the problem is a race condition the "-q 1" will wait one second before the program terminates and the server is able to send the reply. This behavior is reproducable on certain Kubernetes clusters with small nodes and seems to be fixed with this change.
   > 
   > for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok
   
   Thanks for the explanation @frederic-kneier . 
   Btw. I've been struggling with the Zookeeper probes and this has been causing some instability in https://github.com/apache/pulsar-helm-chart . Some attempts to improve the situation: 
   https://github.com/apache/pulsar-helm-chart/pull/220
   https://github.com/apache/pulsar-helm-chart/pull/214
   https://github.com/apache/pulsar-helm-chart/pull/202
   
   I just wonder if the value for `-w` should be more than 1? 
   I checked the Bitnami Zookeeper Helm chart and there the probe timeout value is passed to `-w` parameter.
   https://github.com/bitnami/charts/blob/b86be50209134b8a2967ce5335c647c4b9ca1759/bitnami/zookeeper/templates/statefulset.yaml#L350-L356


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #14088: [Issue 11070][Zookeeper] Fix netcat returning early for probe

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #14088:
URL: https://github.com/apache/pulsar/pull/14088#issuecomment-1027220070


   > If you pipe the command to nc the input stream is closed instantly which leads to nc terminating in certain conditions even before the server is able to send the reply. This leads to an empty output which then leads to a failed health check. This behavior seems to be different for different version of nc (OpenBSD, Linux). Since the cause of the problem is a race condition the "-q 1" will wait one second before the program terminates and the server is able to send the reply. This behavior is reproducable on certain Kubernetes clusters with small nodes and seems to be fixed with this change.
   > 
   > for run in {1..10}; do echo ruok | nc localhost 2181; done => imokimokimokimokimok for run in {1..10}; do echo ruok | nc -q 1 localhost 2181; done => imokimokimokimokimokimokimokimokimokimok
   
   Thanks for the explanation @frederic-kneier . 
   Btw. I've been struggling with the Zookeeper probes and this has been causing some instability in https://github.com/apache/pulsar-helm-chart . Some attempts to improve the situation: 
   https://github.com/apache/pulsar-helm-chart/pull/220
   https://github.com/apache/pulsar-helm-chart/pull/214
   https://github.com/apache/pulsar-helm-chart/pull/202
   
   I just wonder if the value for `-q` should be more than 1? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org