Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/06/10 14:42:49 UTC

[GitHub] [pulsar] frankjkelly opened a new issue #10891: Pulsar 2.6.1: After ZK node crash due to "No space left on device" - unable to recover all topics

frankjkelly opened a new issue #10891:
URL: https://github.com/apache/pulsar/issues/10891


   **Describe the bug**
   One of our three ZooKeeper nodes ran out of disk space when taking a snapshot.
   ```
   java.io.IOException: No space left on device
   at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:312) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
   ```
   
   However, once that ZooKeeper pod restarted, we started to see errors reporting that finding the owner of a topic had timed out:
   ```
   ERROR org.apache.pulsar.broker.web.PulsarWebResource - Finding owner for topic persistent://cogito-dialog/wav/54e2c1ef-94f8-4a86-ba80-a07973243166 timed out
   ```
   and
   ```
   ERROR org.apache.pulsar.broker.web.PulsarWebResource - Finding owner for topic persistent://cogito-dialog/event/compute timed out
   ```
   Restarting the brokers did not help. In the end we `nuked` the cluster and started over, but we are wondering whether we missed a step, or whether there is some way to remediate this if and when it happens in production.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. We had a 32 GB disk attached to ZooKeeper
   2. We created over 300k topics in Pulsar
   
   **Expected behavior**
   If one ZK node dies or has a problem, the rest of the ZooKeeper quorum should continue to operate, and Pulsar should maintain the state of topics and brokers.
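   For context on that expectation: a ZooKeeper ensemble of n voting servers stays available only while a strict majority, floor(n/2) + 1, can still communicate, so a three-node ensemble like ours should tolerate losing exactly one node. The following is only an illustrative Java sketch of that arithmetic; the class and method names are made up for this example and are not part of ZooKeeper or Pulsar.

   ```java
   public class QuorumMath {
       // Smallest number of voting servers that forms a strict majority (quorum).
       static int quorumSize(int ensembleSize) {
           return ensembleSize / 2 + 1;
       }

       // How many voting servers can fail while a quorum is still reachable.
       static int toleratedFailures(int ensembleSize) {
           return ensembleSize - quorumSize(ensembleSize);
       }

       public static void main(String[] args) {
           for (int n : new int[] {3, 4, 5}) {
               System.out.printf("ensemble=%d quorum=%d tolerated failures=%d%n",
                       n, quorumSize(n), toleratedFailures(n));
           }
       }
   }
   ```

   For n = 3 this gives a quorum of 2 and one tolerated failure; note that a fourth node does not raise the tolerance, which is why odd ensemble sizes are usual.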
   
   **Additional context**
   - Using Pulsar 2.6.1
   - On Kubernetes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on issue #10891: Pulsar 2.6.1: After ZK node crash due to "No space left on device" - unable to recover all topics

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on issue #10891:
URL: https://github.com/apache/pulsar/issues/10891#issuecomment-1058889551


   The issue had no activity for 30 days; marking it with the Stale label.





[GitHub] [pulsar] frankjkelly closed issue #10891: Pulsar 2.6.1: After ZK node crash due to "No space left on device" - unable to recover all topics

Posted by GitBox <gi...@apache.org>.
frankjkelly closed issue #10891:
URL: https://github.com/apache/pulsar/issues/10891


   





[GitHub] [pulsar] frankjkelly commented on issue #10891: Pulsar 2.6.1: After ZK node crash due to "No space left on device" - unable to recover all topics

Posted by GitBox <gi...@apache.org>.
frankjkelly commented on issue #10891:
URL: https://github.com/apache/pulsar/issues/10891#issuecomment-1059154117


   We have not seen this since we:
   (a) moved to 2.7.2,
   (b) considerably increased the ZK disk size, and
   (c) added ZK disk monitoring.
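   As an illustration of item (c) only, and not the monitoring actually used here, a minimal Java sketch could check how much of the volume holding the ZooKeeper data directory is still usable via `java.nio.file.FileStore`. The class name, the default path, and the 20% threshold are all hypothetical assumptions; in practice this would usually be a metrics/alerting rule rather than a standalone program.

   ```java
   import java.io.IOException;
   import java.nio.file.FileStore;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.Paths;

   public class ZkDiskCheck {
       // Hypothetical threshold: flag the volume when less than 20% is usable.
       static final double MIN_FREE_FRACTION = 0.20;

       // Fraction of the volume containing `dir` that is still usable.
       static double freeFraction(Path dir) throws IOException {
           FileStore store = Files.getFileStore(dir);
           long total = store.getTotalSpace();
           if (total == 0) {
               return 0.0; // size unknown: treat as full to be safe
           }
           return (double) store.getUsableSpace() / total;
       }

       static boolean isLow(double freeFraction) {
           return freeFraction < MIN_FREE_FRACTION;
       }

       public static void main(String[] args) throws IOException {
           // Hypothetical default; pass the real ZooKeeper dataDir as the first argument.
           Path dataDir = Paths.get(args.length > 0 ? args[0] : ".");
           double free = freeFraction(dataDir);
           System.out.printf("%s: %.1f%% usable%n", dataDir, free * 100);
           if (isLow(free)) {
               System.err.println("WARNING: ZooKeeper data volume is low on space");
               System.exit(1); // non-zero exit so a cron job or probe can alert
           }
       }
   }
   ```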

