You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/03/17 21:16:57 UTC

[GitHub] [pulsar] utahkay opened a new issue #14736: [2.10] After zookeeper reconnect, brokers get caught in a loop revalidating locks, slowing message delivery and consuming heap until OOM

utahkay opened a new issue #14736:
URL: https://github.com/apache/pulsar/issues/14736


   **Describe the bug**
   Pulsar 2.10 introduces the zookeeperSessionExpiredPolicy=reconnect option. If the zookeeper session is disconnected, the brokers will continue running with as much functionality as possible, and reconnect to zookeeper when it is available again.
   
   The problem happens when the brokers reconnect to zookeeper. They seem to be having locks invalidated continuously, so the broker revalidates the lock, then it's invalidated, then the broker again revalidates, then it's invalidated, and so on. The CPU and memory usage gradually increase over the next 10-15 minutes, slowing down message delivery. Eventually, the brokers run out of heap space and die due to OOM. At which point Kubernetes recreates them, and they work fine at that point, without experiencing the lock invalidation problem.
   
   Cycling the brokers at any time prior to them dying also fixes the problem.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. I'm running Pulsar 2.10.0.0-rc12 (StreamNative) in StreamNative Cloud Hosted environment.
   2. I have a couple clients putting light load on the Pulsar cluster, publishing and consuming.
   3. I kill the zookeeper cluster entirely.
   4. I wait for 5 minutes.
   5. I bring the zookeeper cluster back up.
   
   **Expected behavior**
   1. Expect the brokers to reconnect to zookeeper - This succeeds.
   2. Expect the brokers to work normally - Instead, they spend much of their time revalidating locks that are immediately invalidated.
   
   **Screenshots**
   A portion of the logs:
   ```
   "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,642+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,642+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,651+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.960","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:27.967","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.967","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.968","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:27.968","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated"
   "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,293+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,293+0000 [main-EventThread] INFO  org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff"
   
   ....
   
   "Mar 17, 2022 @ 14:36:31.082","test-2-10-broker-0","Terminating due to java.lang.OutOfMemoryError: Java heap space"
   ```
   
   ![CPU and memory consumption](https://user-images.githubusercontent.com/41346/158894058-9c597727-03f3-4c6d-9720-c71964ead353.png)
   
   ![Message latency gradually increases up to 5 seconds, until brokers restart](https://user-images.githubusercontent.com/41346/158895650-d856e981-d6a9-4193-ad04-17816552ac9e.png)
   
   
   **Desktop (please complete the following information):**
    - StreamNative Cloud Hosted (Ubuntu)
   
   **Additional context**
   I notice in the logs, after a few minutes, there start to be some zookeeper and MetadataStore disconnects / reconnects. 
   
   More logs:
   [test-2-10-broker-0.log.csv](https://github.com/apache/pulsar/files/8292566/test-2-10-broker-0.log.csv)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] utahkay commented on issue #14736: [2.10] After zookeeper reconnect, brokers get caught in a loop revalidating locks, slowing message delivery and consuming heap until OOM

Posted by GitBox <gi...@apache.org>.
utahkay commented on issue #14736:
URL: https://github.com/apache/pulsar/issues/14736#issuecomment-1072574346


   @codelipenghui That's fair, but the brokers are only receiving something like one message a second, so there should be very little CPU required. The CPU usage grows steadily and appears unrelated to my light workload, which stays constant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #14736: [2.10] After zookeeper reconnect, brokers get caught in a loop revalidating locks, slowing message delivery and consuming heap until OOM

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #14736:
URL: https://github.com/apache/pulsar/issues/14736#issuecomment-1072070592


   @utahkay Looks like the broker pod doesn't have more CPU resources, I think you have configured CPU limitation? So that the GC thread is not able to get the CPU cycles. You can try to remove the CPU limitation or config more CPU resources for the pod.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #14736: [2.10] After zookeeper reconnect, brokers get caught in a loop revalidating locks, slowing message delivery and consuming heap until OOM

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #14736:
URL: https://github.com/apache/pulsar/issues/14736#issuecomment-1072070592


   @utahkay Looks like the broker pod doesn't have more CPU resources, I think you have configured CPU limitation? So that the GC thread is not able to get the CPU cycles. You can try to remove the CPU limitation or config more CPU resources for the pod.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #14736: [2.10] After zookeeper reconnect, brokers get caught in a loop revalidating locks, slowing message delivery and consuming heap until OOM

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #14736:
URL: https://github.com/apache/pulsar/issues/14736#issuecomment-1072941092


   @utahkay I see. Looks like the CPU is used by the metadata validation, maybe. I think we can capture the thread CPU usage by `top -Hp` to see which thread consumed the CPU. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org