Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/02/09 04:52:41 UTC

[GitHub] [pulsar] zackliu opened a new issue #14183: Topic lookup timeout and can't recover after broker crash

zackliu opened a new issue #14183:
URL: https://github.com/apache/pulsar/issues/14183


   **Describe the bug**
   I deployed Pulsar on Kubernetes using Helm. After one broker pod crashed and came back, the whole Pulsar cluster stopped working. I used pulsar-perf to publish messages, and the client log is shown below. I can confirm that 10.0.129.19 is the Pulsar proxy IP exposed by the Kubernetes service and that it is reachable from the client. The broker's log also indicates that a client connection was established.
   
   > 2022-02-07T09:40:03,110+0000 [pulsar-client-io-2-3] WARN  org.apache.pulsar.client.impl.PulsarClientImpl - [chenyltopic2] Failed to get partitioned topic metadata: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
   > 2022-02-07T09:40:03,111+0000 [pulsar-client-io-2-3] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0xdb0f2346, L:/10.240.2.68:52414 - R:10.0.129.19/10.0.129.19:6650] Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
   > 2022-02-07T09:40:03,111+0000 [pulsar-perf-producer-exec-1-1] ERROR org.apache.pulsar.testclient.PerformanceProducer - Got error
   > java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
   >         at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_312]
   >         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_312]
   >         at org.apache.pulsar.testclient.PerformanceProducer.runProducer(PerformanceProducer.java:595) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
   >         at org.apache.pulsar.testclient.PerformanceProducer.lambda$main$1(PerformanceProducer.java:425) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
   >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_312]
   >         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
   >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
   >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
   >         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
   >         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
   > Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
   >         at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1204) ~[org.apache.pulsar-pulsar-client-original-2.9.1.jar:2.9.1]
   >         at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1]
   >         at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
   >         at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
   >         at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
   >         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
   >         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[io.netty-netty-transport-classes-epoll-4.1.72.Final.jar:4.1.72.Final]
   >         at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
   >         at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
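
The reachability claim above can be checked independently of the Pulsar client. The following is a minimal sketch, not part of the original report: it only tests whether a plain TCP connection to the proxy address from the `remote` field of the log can be opened. The class name, method name, and the 3000 ms timeout are illustrative assumptions. A successful connect shows only that the port is open; it does not exercise the lookup request that is timing out.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ProxyReachability {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connection refused, unreachable host, and connect timeout.
            return false;
        }
    }

    public static void main(String[] args) {
        // Address and port are taken from the 'remote' field in the log above;
        // the 3000 ms timeout is an arbitrary choice for illustration.
        System.out.println("TCP connect ok: " + canConnect("10.0.129.19", 6650, 3000));
    }
}
```

If this prints `true` while lookups still time out, the symptom matches the report: the connection is accepted, but the lookup request never receives a response.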
   
   **To Reproduce**
   The issue happened many times.
   1. Use pulsar-perf to publish some messages. At first this worked well.
   2. One broker then crashed with an OOM because I was sending messages too quickly.
   3. I saw many errors, so I terminated the perf process and started another one a few seconds later. The issue then appeared: I couldn't publish any messages because the topic lookup timed out. Using a different topic name (one never used before) didn't help.
   
   **Expected behavior**
   Everything should recover after a broker crash.
   
   **Desktop (please complete the following information):**
    - OS: Ubuntu 18.04
    - Pulsar: 2.9.1
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #14183: Topic lookup timeout and can't recover after broker crash

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #14183:
URL: https://github.com/apache/pulsar/issues/14183#issuecomment-1034542768


   > And then one broker OOM crashed as I'm sending message too quickly.
   
   Did the broker process restart after the OOM? With the Pulsar Helm chart, it should terminate and get started again if `-XX:+ExitOnOutOfMemoryError` is in your JVM args for the broker. That flag is in the default `PULSAR_GC` options in the Apache Pulsar Helm Chart, https://github.com/apache/pulsar-helm-chart/blob/9613ee029290a23e512d5f247bef69faa6bf796a/charts/pulsar/values.yaml#L751 .
   Are you using the default JVM args that include `-XX:+ExitOnOutOfMemoryError`?
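
One way to answer that question is to take the option string the broker was actually started with and check it for the flag. The sketch below is an illustration only, not something from the thread: the class and method names are hypothetical, and the sample option string in `main` is made up rather than copied from the Helm chart. It assumes the usual HotSpot behavior that a later `-XX:-ExitOnOutOfMemoryError` overrides an earlier `-XX:+ExitOnOutOfMemoryError`, and that the flag is off when it is absent.

```java
public class ExitOnOomCheck {
    private static final String FLAG = "ExitOnOutOfMemoryError";

    /**
     * Returns true if the whitespace-separated JVM option string leaves
     * ExitOnOutOfMemoryError enabled. The last occurrence of the flag wins.
     */
    public static boolean exitOnOomEnabled(String jvmOptions) {
        boolean enabled = false; // assumed JVM default when the flag is absent
        for (String token : jvmOptions.trim().split("\\s+")) {
            if (token.equals("-XX:+" + FLAG)) {
                enabled = true;
            } else if (token.equals("-XX:-" + FLAG)) {
                enabled = false;
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Hypothetical option string; replace it with the broker's real JVM
        // options (for example, the value of PULSAR_GC in your deployment).
        String options = args.length > 0
                ? String.join(" ", args)
                : "-XX:+UseG1GC -XX:+ExitOnOutOfMemoryError";
        System.out.println("ExitOnOutOfMemoryError enabled: " + exitOnOomEnabled(options));
    }
}
```

If the check reports the flag as disabled, the broker JVM may keep running in a degraded state after an OOM instead of exiting and being restarted by Kubernetes.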
   
   > Pulsar: 2.9.1
   
   Can you reproduce this on Pulsar 2.8.2?
   



[GitHub] [pulsar] github-actions[bot] commented on issue #14183: Topic lookup timeout and can't recover after broker crash

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on issue #14183:
URL: https://github.com/apache/pulsar/issues/14183#issuecomment-1066003481


   The issue had no activity for 30 days; marking it with the Stale label.

