Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/03/23 18:54:22 UTC

[GitHub] [pulsar] GBM-tamerm opened a new issue #14826: Broker freeze for communications in v 2.7.4

GBM-tamerm opened a new issue #14826:
URL: https://github.com/apache/pulsar/issues/14826


   **Describe the bug**
   We saw strange behaviour: the broker stopped accepting connections, and clients started receiving various exceptions such as "Connection Already Closed" and "Topic not available".
    
    The broker Java process itself stays up and running, but HTTP requests to the broker, such as curl against the broker metrics endpoint, stop returning anything.
    
    It only works again after we restart the broker.
     
    So it looks like a connection pool issue or a connection leak in which connections are kept alive, especially since we can see log entries such as the one below:
    
    [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x15a65d4f, L:/10.244.63.22:36068 - xxxxx:6651]] Connected to server Killed
   
   **To Reproduce**
   I was able to reproduce the issue with a simple Java program that keeps looping, opening a new
    socket to the broker port on every iteration without closing it. After a while, the clients start to get "Connection Already Closed" and other exceptions.
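
   The reproduction program itself is not attached to the issue. A minimal sketch of what such a loop might look like is shown below; the class name `SocketLeakRepro`, the default host, the default port 6650 and the connection count are all assumptions for illustration, not the reporter's actual code.

   ```java
   import java.io.IOException;
   import java.net.Socket;
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical reproduction sketch: opens TCP connections and never closes them.
   class SocketLeakRepro {

       /** Opens `count` connections to host:port and returns them still open. */
       static List<Socket> openWithoutClosing(String host, int port, int count) throws IOException {
           List<Socket> leaked = new ArrayList<>();
           for (int i = 0; i < count; i++) {
               // Holding a reference keeps the socket from being reclaimed and closed.
               leaked.add(new Socket(host, port));
           }
           return leaked;
       }

       public static void main(String[] args) throws Exception {
           String host = args.length > 0 ? args[0] : "localhost";          // assumed default
           int port = args.length > 1 ? Integer.parseInt(args[1]) : 6650;  // assumed default
           int count = args.length > 2 ? Integer.parseInt(args[2]) : 1000; // assumed default
           List<Socket> leaked = openWithoutClosing(host, port, count);
           System.out.println("Holding " + leaked.size() + " open sockets to " + host + ":" + port);
           Thread.sleep(Long.MAX_VALUE); // keep the process, and therefore the sockets, alive
       }
   }
   ```

   Run it only against a test broker, since it is intended to exhaust connections on the target.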
   
   **Expected behavior**
   The broker should not freeze; it should generate a meaningful exception and disconnect bad clients.
   
   **Screenshots**
   If applicable, add screenshots to help explain your problem.
   
   **Desktop (please complete the following information):**
    - OS:  Linux VM
    - 6 brokers
    - 6 bookies
    - 5 ZK
   
   **Additional context**
   No Kubernetes deployment; clients access the brokers directly via DNS, without a Pulsar proxy.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077127850


   Adding a link to the dev mailing list thread: https://lists.apache.org/thread/tsg8q6xc75605jrs66rvj2f3dhndo5k4
   
   





[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4

Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791


   Hi Lari,
   
   I applied your suggestions, but they did not resolve the issue; the broker still freezes after a while.
   I think the most likely cause is a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details:
   too many TCP connections are in CLOSE_WAIT state on a Pulsar broker, causing disconnection exceptions and "Connection Already Closed" exceptions in Pulsar clients.
   
   The majority of the CLOSE_WAIT connections are between the broker and the bookies.
   
   I also see a lot of occurrences of the exception below:
   Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
   java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
           at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
   Caused by: java.util.NoSuchElementException: No value present
   





[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077182257


   @GBM-tamerm Please provide additional information about the environment:
   - What Linux distribution and version? 
   - What Java version?
   - Are you running Pulsar as a systemd service or in some other way?
     - What are the open files limits for the service?
   
   Setting the open files limits depends on the OS and the way the service is run. 
   Each TCP/IP connection consumes a file handle and that's why it's necessary to tune the open files limit.
   
   Example of setting open files limits for a systemd service: https://github.com/openmessaging/benchmark/blob/89dce6d61c4444fa993ce36098e50ed5e124cb4a/driver-pulsar/deploy/ssd/templates/pulsar.service#L11
   Uses `LimitNOFILE=300000` in the `[Service]` section. This is used in all components (broker, bookkeeper, zookeeper).
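
   To check how close a JVM is to its open files limit, the JDK exposes descriptor counts through `com.sun.management.UnixOperatingSystemMXBean`. The sketch below is only an illustration: the class name `FdUsage` is made up, and it reports the numbers for the JVM that runs this code, not for an already running broker process.

   ```java
   import java.lang.management.ManagementFactory;
   import java.lang.management.OperatingSystemMXBean;
   import com.sun.management.UnixOperatingSystemMXBean;

   // Hypothetical helper: reports open vs. maximum file descriptors of the current JVM.
   class FdUsage {

       /** Returns {open, max} descriptor counts, or null on platforms without the Unix bean. */
       static long[] openAndMax() {
           OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
           if (os instanceof UnixOperatingSystemMXBean) {
               UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
               return new long[] {
                   unix.getOpenFileDescriptorCount(),
                   unix.getMaxFileDescriptorCount()
               };
           }
           return null;
       }

       public static void main(String[] args) {
           long[] usage = openAndMax();
           if (usage == null) {
               System.out.println("File descriptor counts are not available on this platform");
           } else {
               System.out.println("open=" + usage[0] + " max=" + usage[1]);
           }
       }
   }
   ```

   For a process that is already running on Linux, the effective limit can also be read from `/proc/<pid>/limits`.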
   





[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077221700


   You can find out if there are a lot of connections in a half-closed state with `netstat -tapn | grep CLOSE_WAIT` or `ss|grep CLOSE-WAIT`.
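
   To see which peers the half-closed connections belong to, the output can be grouped by remote host. The following is a rough sketch, assuming the column layout printed by `ss -tan` (state, Recv-Q, Send-Q, local address, peer address); the class name `CloseWaitByPeer` is made up for illustration.

   ```java
   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;
   import java.util.stream.Collectors;

   // Hypothetical helper: counts CLOSE-WAIT sockets per peer host from `ss -tan` output.
   class CloseWaitByPeer {

       static Map<String, Integer> countByPeer(List<String> lines) {
           Map<String, Integer> counts = new TreeMap<>();
           for (String line : lines) {
               String[] cols = line.trim().split("\\s+");
               // Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
               if (cols.length < 5 || !cols[0].equals("CLOSE-WAIT")) {
                   continue; // skips the header line and sockets in other states
               }
               String peer = cols[4];
               int colon = peer.lastIndexOf(':');
               String host = colon > 0 ? peer.substring(0, colon) : peer;
               counts.merge(host, 1, Integer::sum);
           }
           return counts;
       }

       public static void main(String[] args) {
           List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                   .lines().collect(Collectors.toList());
           countByPeer(lines).forEach((host, n) -> System.out.println(n + "\t" + host));
       }
   }
   ```

   After compiling with `javac`, it could be used as `ss -tan | java CloseWaitByPeer`.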





[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4

Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791


   Hi Lari,
   
   I applied your suggestions, but they did not resolve the issue; the broker still freezes after a while.
   I think there is a potential cause for that, most likely a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details:
   too many TCP connections are in CLOSE_WAIT state on a Pulsar broker, causing disconnection exceptions and "Connection Already Closed" exceptions in Pulsar clients.
   
   The majority of the CLOSE_WAIT connections are between the broker and the bookies.
   
   I also see a lot of occurrences of the exception below:
   Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
   java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
           at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
   Caused by: java.util.NoSuchElementException: No value present
   





[GitHub] [pulsar] GBM-tamerm commented on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4

Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791


   I think I see the potential cause for that. I updated the issue description to reflect more details:
   too many TCP connections are in CLOSE_WAIT state on a Pulsar broker, causing disconnection exceptions and "Connection Already Closed" exceptions in Pulsar clients.
   
   The majority of the CLOSE_WAIT connections are between the broker and the bookies.
   
   I also see a lot of occurrences of the exception below:
   Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
   java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
           at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
   Caused by: java.util.NoSuchElementException: No value present
   





[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077266920


   One detail is that Pulsar doesn't use TCP/IP keepalive at the moment. I have created a PR, #14841, to start a discussion about whether it would be useful.
   Pulsar does have an application level keepalive solution in the Pulsar binary protocol.
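
   For context, TCP/IP keepalive is a per-socket option (`SO_KEEPALIVE`) that an application has to enable explicitly. The sketch below shows this with plain `java.net.Socket`; it is not how Pulsar or the PR configures its connections, and the class and method names are hypothetical.

   ```java
   import java.io.IOException;
   import java.net.InetSocketAddress;
   import java.net.Socket;

   // Hypothetical helper: opens a client socket with SO_KEEPALIVE enabled.
   class KeepAliveSocket {

       static Socket connectWithKeepAlive(String host, int port, int connectTimeoutMs) throws IOException {
           Socket socket = new Socket();
           try {
               socket.setKeepAlive(true); // enables SO_KEEPALIVE on this socket
               socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
               return socket;
           } catch (IOException e) {
               socket.close(); // do not leak the descriptor when the connect fails
               throw e;
           }
       }
   }
   ```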





[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4

Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077218236


   After adjusting the open files limits for the services (and making sure that the settings are effective), I'd also recommend adjusting the TCP/IP keepalive settings. By default, a TCP/IP connection might linger in a half-closed state for up to 2 hours. 
   
   Again, how to modify the Linux kernel sysctl settings depends on the OS.
   For Ubuntu / systemd, settings can be placed in the `/etc/sysctl.d` directory. systemd has a service called systemd-sysctl which applies the settings when the service is restarted (`systemctl restart systemd-sysctl`).
   
   One example of tuning TCP/IP keepalive settings, reducing the idle time before the first keepalive probe from 2 hours to 20 minutes:
   ```
   # Tune TCP/IP keepalive settings
   # http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
   net.ipv4.tcp_keepalive_time = 1200
   net.ipv4.tcp_keepalive_intvl = 60
   net.ipv4.tcp_keepalive_probes = 20
   ```
   
   Script to configure
   ```shell
   echo -e "# Tune TCP/IP keepalive settings\n# http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html\nnet.ipv4.tcp_keepalive_time = 1200\nnet.ipv4.tcp_keepalive_intvl = 60\nnet.ipv4.tcp_keepalive_probes = 20" | sudo tee /etc/sysctl.d/99-keepalive.conf
   sudo systemctl restart systemd-sysctl
   ```
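
   As a worked example of what these three values mean together: a dead peer is detected after the idle time plus `tcp_keepalive_probes` unanswered probes sent `tcp_keepalive_intvl` seconds apart. The sketch below does that arithmetic for the values above and for the usual Linux defaults (7200, 75, 9); the class name is made up, and the settings only apply to sockets that have `SO_KEEPALIVE` enabled.

   ```java
   // Hypothetical helper: worst-case time until keepalive declares a silent peer dead.
   class KeepaliveTimeout {

       /** idle time before the first probe + (probe interval * number of probes), in seconds */
       static long worstCaseSeconds(long keepaliveTime, long keepaliveIntvl, long keepaliveProbes) {
           return keepaliveTime + keepaliveIntvl * keepaliveProbes;
       }

       public static void main(String[] args) {
           // Tuned values from the sysctl example: 1200 + 60 * 20 = 2400 s (40 minutes)
           System.out.println("tuned:   " + worstCaseSeconds(1200, 60, 20) + " s");
           // Usual Linux defaults: 7200 + 75 * 9 = 7875 s (about 2 hours 11 minutes)
           System.out.println("default: " + worstCaseSeconds(7200, 75, 9) + " s");
       }
   }
   ```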
   





[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4

Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791


   Hi Lari,
   
   I applied your suggestions, but they did not resolve the issue; the broker still freezes after a while.
   I think the most likely cause is a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details:
   too many TCP connections are in CLOSE_WAIT state on a Pulsar broker, causing disconnection exceptions and "Connection Already Closed" exceptions in Pulsar clients.
   
   Since the majority of the CLOSE_WAIT connections are between the broker and the bookies, this indicates that a deadlock or race is happening.
   
   I also see a lot of occurrences of the exception below:
   Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
   java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
           at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
           at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
   Caused by: java.util.NoSuchElementException: No value present
   





[GitHub] [pulsar] GBM-tamerm commented on issue #14826: Broker freeze for communications in v 2.7.4

Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077458606


   Thanks Lari,
   I added the additional details you asked for.
   We are not setting LimitNOFILE in our systemd service; we leave it at the default.
   systemctl show -p DefaultLimitNOFILE   >> DefaultLimitNOFILE=4096
   
   I will try to run the netstat command when the issue happens again.
   Our PRD cluster runs an old version, 2.3.2, and does not seem to have the issue we see in 2.7.
   Some clients also report seeing the error below:
   org.apache.pulsar.client.api.PulsarClientException$LookupException: Reached max number of redirections
   Is that something we can tune in broker.conf?





[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4

Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791


   Hi Lari,
   
   I applied your suggestions, but they did not resolve the issue; the broker still freezes after a while.
   I think the most likely cause is a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details:
   too many TCP connections are in CLOSE_WAIT state on a Pulsar broker, causing disconnection exceptions and "Connection Already Closed" exceptions in Pulsar clients.
   
   Since the majority of the CLOSE_WAIT connections are between the broker and the bookies, this indicates that a deadlock or race is happening.
   
   I attached the jstack output that shows the deadlock:
   [jstack_12.txt](https://github.com/apache/pulsar/files/8403100/jstack_12.txt)
   
   https://jstack.review?https://gist.github.com/GBM-tamerm/a29b793db94702ea58da449927938cad
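
   The same kind of lock-cycle check that a thread dump reports can also be made from inside a JVM with `ThreadMXBean.findDeadlockedThreads()`. The sketch below is only an illustration with a made-up class name. It finds cycles on monitors and ownable synchronizers, so threads that are merely waiting on an incomplete future would not be reported.

   ```java
   import java.lang.management.ManagementFactory;
   import java.lang.management.ThreadInfo;
   import java.lang.management.ThreadMXBean;
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical helper: lists the names of threads that are part of a lock cycle.
   class DeadlockCheck {

       static List<String> deadlockedThreadNames() {
           ThreadMXBean threads = ManagementFactory.getThreadMXBean();
           long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is detected
           List<String> names = new ArrayList<>();
           if (ids == null) {
               return names;
           }
           for (ThreadInfo info : threads.getThreadInfo(ids)) {
               if (info != null) { // a thread may have exited in the meantime
                   names.add(info.getThreadName());
               }
           }
           return names;
       }

       public static void main(String[] args) {
           List<String> names = deadlockedThreadNames();
           System.out.println(names.isEmpty() ? "No deadlock detected" : "Deadlocked: " + names);
       }
   }
   ```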
   
   

