Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/03/23 18:54:22 UTC
[GitHub] [pulsar] GBM-tamerm opened a new issue #14826: Broker freeze for communications in v 2.7.4
GBM-tamerm opened a new issue #14826:
URL: https://github.com/apache/pulsar/issues/14826
**Describe the bug**
We saw strange behaviour: the broker stopped accepting connections, and clients started receiving various exceptions such as Connection Already Closed and Topic not available.
The broker Java process itself is up and running, but requests to its HTTP ports, such as curl against the broker metrics endpoint, stop returning anything.
It only works again after we restart the broker.
So it looks like a connection pool issue or leak, since connections are kept alive, especially as we can see log entries such as the one below:
[pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x15a65d4f, L:/10.244.63.22:36068 - xxxxx:6651]] Connected to server Killed
**To Reproduce**
I was able to reproduce the issue with a simple Java program that keeps looping, opening a new
socket to the broker port on every iteration without closing it. After a while, clients start to get Connection Already Closed exceptions and others.
**Expected behavior**
The broker should not freeze; it should generate a meaningful exception while disconnecting bad clients.
**Desktop (please complete the following information):**
- OS: Linux VM
- 6 brokers
- 6 bookies
- 5 ZK
**Additional context**
No Kubernetes (k8s) deployment; clients access the brokers directly via DNS, without the Pulsar proxy.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4
Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077127850
Adding a link to the dev mailing list thread: https://lists.apache.org/thread/tsg8q6xc75605jrs66rvj2f3dhndo5k4
[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4
Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791
Hi Lari,
I followed your suggestions, but they did not resolve the issue; the broker still freezes after a while.
I think the potential cause is most likely a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details.
Too many TCP connections are in CLOSE_WAIT status on a Pulsar broker, causing Disconnection exceptions and Connection Already Closed exceptions in Pulsar clients.
The majority of the CLOSE_WAIT connections are between the broker and the bookies.
I can also see a lot of occurrences of the exception below:
Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
Caused by: java.util.NoSuchElementException: No value present
[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4
Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077182257
@GBM-tamerm Please provide additional information about the environment:
- What Linux distribution and version?
- What Java version?
- Are you running Pulsar as a systemd service or in some other way?
- What are the open files limits for the service?
Setting the open files limits depends on the OS and the way the service is run.
Each TCP/IP connection consumes a file handle and that's why it's necessary to tune the open files limit.
Example of setting open files limits for a systemd service: https://github.com/openmessaging/benchmark/blob/89dce6d61c4444fa993ce36098e50ed5e124cb4a/driver-pulsar/deploy/ssd/templates/pulsar.service#L11
It uses `LimitNOFILE=300000` in the `[Service]` section. This is used in all components (broker, bookkeeper, zookeeper).
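To check which open files limit is actually in effect for a running process, one option is to read its `/proc/<pid>/limits` file. The following is a minimal sketch, assuming a Linux host; `nofile_limits` is a hypothetical helper name used for illustration and `BROKER_PID` is a placeholder, neither is part of Pulsar or systemd.

```shell
# Print the soft and hard "Max open files" limits found in a
# /proc/<pid>/limits file (path passed as the first argument).
# nofile_limits is a hypothetical helper name, not an existing tool.
nofile_limits() {
  awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "$1"
}

# Usage on a live system (BROKER_PID is a placeholder for the broker's pid):
#   nofile_limits "/proc/$BROKER_PID/limits"
```

If the soft value printed for the broker process is still the default rather than the value set with `LimitNOFILE`, the setting has probably not taken effect for that service.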
[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4
Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077221700
You can find out if there are a lot of connections in a half-closed state with `netstat -tapn | grep CLOSE_WAIT` or `ss|grep CLOSE-WAIT`.
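To see which remote hosts the half-closed connections belong to, the `ss` output can be grouped by peer address. This is only a sketch: `close_wait_by_peer` is a hypothetical helper name, and it assumes the default column layout of `ss -tan` (state in the first column, peer address and port in the fifth).

```shell
# Read `ss -tan` output on stdin and print "<count> <peer address>" for
# connections in CLOSE-WAIT state, highest count first.
# close_wait_by_peer is a hypothetical helper name, not an existing tool.
close_wait_by_peer() {
  awk '$1 == "CLOSE-WAIT" {
         peer = $5
         sub(/:[0-9]+$/, "", peer)   # drop the trailing :port
         n[peer]++
       }
       END { for (p in n) print n[p], p }' | sort -rn
}

# Usage on a live system:
#   ss -tan | close_wait_by_peer
```

A peer address that dominates the list shows which side of the cluster (clients, bookies, ZooKeeper) the lingering connections point to.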
[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4
Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791
Hi Lari,
I followed your suggestions, but they did not resolve the issue; the broker still freezes after a while.
I think there is a potential cause for that, most likely a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details.
Too many TCP connections are in CLOSE_WAIT status on a Pulsar broker, causing Disconnection exceptions and Connection Already Closed exceptions in Pulsar clients.
The majority of the CLOSE_WAIT connections are between the broker and the bookies.
I can also see a lot of occurrences of the exception below:
Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
Caused by: java.util.NoSuchElementException: No value present
[GitHub] [pulsar] GBM-tamerm commented on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4
Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791
I think I see the potential cause for that; I updated the issue description to reflect more details.
Too many TCP connections are in CLOSE_WAIT status on a Pulsar broker, causing Disconnection exceptions and Connection Already Closed exceptions in Pulsar clients.
The majority of the CLOSE_WAIT connections are between the broker and the bookies.
I can also see a lot of occurrences of the exception below:
Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
Caused by: java.util.NoSuchElementException: No value present
[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4
Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077266920
One detail is that Pulsar doesn't use TCP/IP keepalive at the moment. I have created a PR to start a discussion on whether it would be useful. That is in #14841.
Pulsar does have an application level keepalive solution in the Pulsar binary protocol.
[GitHub] [pulsar] lhotari commented on issue #14826: Broker freeze for communications in v 2.7.4
Posted by GitBox <gi...@apache.org>.
lhotari commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077218236
After adjusting the open files limits for the services (and making sure that the settings are effective), I'd also recommend adjusting the TCP/IP keepalive settings. By default, a TCP/IP connection might linger in half closed state for up to 2 hours.
Again, how to modify the Linux kernel sysctl settings depends on the OS.
For Ubuntu / systemd, settings can be placed in the `/etc/sysctl.d` directory. systemd has a service called systemd-sysctl which applies the settings when the service is restarted (`systemctl restart systemd-sysctl`).
One example of tuning TCP/IP keepalive settings, reducing the idle time before the first keepalive probe from 2 hours to 20 minutes:
```
# Tune TCP/IP keepalive settings
# http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```
Script to configure:
```shell
echo -e "# Tune TCP/IP keepalive settings\n# http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html\nnet.ipv4.tcp_keepalive_time = 1200\nnet.ipv4.tcp_keepalive_intvl = 60\nnet.ipv4.tcp_keepalive_probes = 20" | sudo tee /etc/sysctl.d/99-keepalive.conf
sudo systemctl restart systemd-sysctl
```
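As a rough worked example of what these values mean: for a socket that has keepalive enabled, the kernel waits `tcp_keepalive_time` seconds of idleness before the first probe, then sends up to `tcp_keepalive_probes` probes spaced `tcp_keepalive_intvl` seconds apart before dropping the connection. The sketch below just does that arithmetic; `keepalive_detect_seconds` is a hypothetical helper name, and the second call uses the usual Linux defaults (7200 / 75 / 9) as an assumption about the unmodified system.

```shell
# Worst-case seconds before the kernel gives up on an idle, unresponsive peer:
# idle time before the first probe + (probe interval * number of probes).
# keepalive_detect_seconds is a hypothetical helper name used for illustration.
keepalive_detect_seconds() {
  echo $(( $1 + $2 * $3 ))
}

keepalive_detect_seconds 1200 60 20   # settings from the example above: prints 2400
keepalive_detect_seconds 7200 75 9    # usual Linux defaults (assumed): prints 7875
```

With the tuned values, the first probe goes out after 20 minutes and an unresponsive peer is dropped after about 40 minutes in total, compared with a little over 2 hours with the defaults. These sysctl settings only affect sockets on which the application has enabled keepalive.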
[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4
Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791
Hi Lari,
I followed your suggestions, but they did not resolve the issue; the broker still freezes after a while.
I think the potential cause is most likely a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details.
Too many TCP connections are in CLOSE_WAIT status on a Pulsar broker, causing Disconnection exceptions and Connection Already Closed exceptions in Pulsar clients.
As the majority of the CLOSE_WAIT connections are between the broker and the bookies, this indicates that a deadlock or race is happening.
I can also see a lot of occurrences of the exception below:
Failed to initialize managed ledger: org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found
java.util.concurrent.CompletionException: java.util.NoSuchElementException: No value present
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_322]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_322]
Caused by: java.util.NoSuchElementException: No value present
[GitHub] [pulsar] GBM-tamerm commented on issue #14826: Broker freeze for communications in v 2.7.4
Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1077458606
Thanks Lari,
I added the additional details you asked for.
We are not setting LimitNOFILE in our systemd service; we leave it at the default.
`systemctl show -p DefaultLimitNOFILE` returns `DefaultLimitNOFILE=4096`.
I will try to run the netstat command when the issue happens again.
Our PRD cluster is running an old version, 2.3.2, and does not seem to have the issue we see in 2.7.
Some clients also report seeing the error below:
org.apache.pulsar.client.api.PulsarClientException$LookupException: Reached max number of redirections
Is that something we can tune in broker.conf?
[GitHub] [pulsar] GBM-tamerm edited a comment on issue #14826: Too many TCP Connections are in CLOSE_WAIT status freeze broker v 2.7.4
Posted by GitBox <gi...@apache.org>.
GBM-tamerm edited a comment on issue #14826:
URL: https://github.com/apache/pulsar/issues/14826#issuecomment-1078832791
Hi Lari,
I followed your suggestions, but they did not resolve the issue; the broker still freezes after a while.
I think the potential cause is most likely a deadlock or race between the broker and bookie client connections. I updated the issue description to reflect more details.
Too many TCP connections are in CLOSE_WAIT status on a Pulsar broker, causing Disconnection exceptions and Connection Already Closed exceptions in Pulsar clients.
As the majority of the CLOSE_WAIT connections are between the broker and the bookies, this indicates that a deadlock or race is happening.
I attached the jstack output that shows the deadlock:
[jstack_12.txt](https://github.com/apache/pulsar/files/8403100/jstack_12.txt)
https://jstack.review?https://gist.github.com/GBM-tamerm/a29b793db94702ea58da449927938cad