You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/06/02 09:23:34 UTC

[GitHub] [pulsar] hmoragrega opened a new issue #7140: Zombie producers shown as connected after a broker instance crashed

hmoragrega opened a new issue #7140:
URL: https://github.com/apache/pulsar/issues/7140


   **Describe the bug**
   A broker crash due to "out of memory" left zombie producers connected
   
   Our setup:
   - k8s cluster
   - 3 pulsar brokers
   - 3 pulsar proxies
   - X Services publishing to the same topic using official CGo library
   
   **Steps to reproduce the behavior:**
   Force a OOM crash in a broker with connected producers, the producers connected to the broker are still listed as connected and have to be manually deleted
   
   In our case producers tried to reconnect automatically to the cluster using the same producer name but received this error `Failed to create producer: ProducerBusy`. 
   
   Only a change in the producer name fixed the issue
   
   The list of producers for the topic show old instances with prefix `74966b7f75` yet only 5 instances producing messages are running in our k8s cluster
   ```
   "publishers" : [ {
       "producerName" : "service-x-events-968974b58-7wwcm",
       "connectedSince" : "2020-06-02T07:09:08.453Z",
       "address" : "/10.1.30.2:57634"
     }, {
       "producerName" : "service-x-events-74966b7f75-x9cfq",
       "connectedSince" : "2020-06-01T20:32:21.863Z",
       "address" : "/10.1.39.51:49418"
     }, {
       "producerName" : "service-x-events-968974b58-8v679",
       "connectedSince" : "2020-06-02T04:56:23.638Z",
       "address" : "/10.1.16.2:59924"
     }, {
       "producerName" : "service-x-events-968974b58-99qpk",
       "connectedSince" : "2020-06-02T04:56:25.752Z",
       "address" : "/10.1.16.2:59920"
     }, {
       "producerName" : "service-x-events-74966b7f75-c6ljj",
       "connectedSince" : "2020-06-01T20:32:21.863Z",
       "address" : "/10.1.39.51:59820"
     }, {
       "producerName" : "service-x-events-74966b7f75-rvtjt",
       "connectedSince" : "2020-06-01T20:32:21.863Z",
       "address" : "/10.1.16.2:56450"
     }, {
       "producerName" : "service-x-events-968974b58-86p8p",
       "connectedSince" : "2020-06-02T04:00:07.853Z",
       "address" : "/10.1.39.51:52732"
     }, {
       "producerName" : "service-x-events-968974b58-9wp9m",
       "connectedSince" : "2020-06-02T07:05:30.878Z",
       "address" : "/10.1.30.2:57488"
     } ],
   ```
   
   ```
   service-x-968974b58-99qpk	 Running	0	Jun 2, 2020, 3:57:07 AM	
   service-x-968974b58-86p8p	 Running	0	Jun 2, 2020, 3:57:07 AM	
   service-x-968974b58-8v679	 Running	0	Jun 2, 2020, 3:57:07 AM	
   service-x-968974b58-9wp9m	 Running	0	Jun 2, 2020, 7:00:26 AM	
   service-x-968974b58-7wwcm	 Running	0	Jun 2, 2020, 7:00:57 AM
   ```
   
   **Expected behavior**
   The old producers are removed after the broker crash
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on issue #7140: Zombie producers shown as connected after a broker instance crashed

Posted by GitBox <gi...@apache.org>.
sijie commented on issue #7140:
URL: https://github.com/apache/pulsar/issues/7140#issuecomment-637902988


   @hmoragrega 
   
   A couple of questions:
   
   - which version of pulsar broker are you using?
   - how did you cause broker crash? one thing I don't understand in the description is if the broker was crashed, the JVM process should be gone. Are you using a different approach to force broker crash?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] larslorenzen commented on issue #7140: Zombie producers shown as connected after a broker instance crashed

Posted by GitBox <gi...@apache.org>.
larslorenzen commented on issue #7140:
URL: https://github.com/apache/pulsar/issues/7140#issuecomment-648651707


   Also happened to us after a broker restart: https://github.com/apache/pulsar/issues/7059#issuecomment-648041771. Producer could not reconnect.
   
   Pulsar broker version: 2.5.2
   Pulsar producer version: 2.5.2


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] hmoragrega commented on issue #7140: Zombie producers shown as connected after a broker instance crashed

Posted by GitBox <gi...@apache.org>.
hmoragrega commented on issue #7140:
URL: https://github.com/apache/pulsar/issues/7140#issuecomment-638009352


   thanks for checking out:
   * We are running a docker image: `apachepulsar/pulsar-all:2.5.1`
   * The broker process was killed by k8s:
   ```
   2020-06-01 20:31:26.000 UTC
   Memory cgroup out of memory: Kill process 281696 (java) score 1862 or sacrifice child Killed process 281696 (java) total-vm:6202804kB, anon-rss:2484912kB, file-rss:24084kB, shmem-rss:0kB
   
   2020-06-01 20:31:26.152 UTC
   Killed process 281696 (java) total-vm:6202804kB, anon-rss:2484912kB, file-rss:24084kB, shmem-rss:0kB}] Conditions:[{Type:KernelDeadlock Status:False Transition:2020-05-10 23:28:18.17330381 +0000 UTC m=+5.104380376 Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status:False Transition:2020-05-10 23:28:18.173304129 +0000 UTC m=+5.104380498 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}]}
   
   ```
   I've checked the pulsar logs around the event but I haven't found any outstanding message, just the application being restarted 
   ```
   I 2020-06-01T20:31:19.772967095Z 20:31:19.772 [pulsar-client-io-42-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck][reader-98d2f9bedb] Subscribed to topic on 10.1.2.182/10.1.2.182:6650 -- consumer: 3380
   I 2020-06-01T20:31:19.785581683Z 20:31:19.785 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck}][pulsar-20-3385] Closing producer on cnx /10.1.2.182:33968
   I 2020-06-01T20:31:19.785694992Z 20:31:19.785 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.ServerCnx - [PersistentTopic{topic=persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck}][pulsar-20-3385] Closed producer on cnx /10.1.2.182:33968
   I 2020-06-01T20:31:19.785802934Z 20:31:19.785 [pulsar-client-io-42-1] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck] [pulsar-20-3385] Closed Producer
   I 2020-06-01T20:31:19.785898196Z 20:31:19.785 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.2.182:33968] Closing consumer: 3380
   I 2020-06-01T20:31:19.786763101Z 20:31:19.786 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.AbstractDispatcherSingleActiveConsumer - Removing consumer Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck, name=reader-98d2f9bedb}, consumerId=3380, consumerName=9574f, address=/10.1.2.182:33968}
   I 2020-06-01T20:31:19.787468692Z 20:31:19.786 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck][reader-98d2f9bedb] Successfully closed subscription [NonDurableCursorImpl{ledger=pulsar/pulsar/10.1.2.182:8080/persistent/healthcheck, ackPos=38739:397, readPos=38739:398}]
   I 2020-06-01T20:31:19.787644898Z 20:31:19.787 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck][reader-98d2f9bedb] Successfully closed dispatcher for reader
   I 2020-06-01T20:31:19.787662374Z 20:31:19.787 [pulsar-external-listener-43-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [01/Jun/2020:20:31:19 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "curl/7.64.0" 26
   I 2020-06-01T20:31:19.787773157Z 20:31:19.787 [pulsar-io-22-3] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.1.2.182:33968] Closed consumer Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck, name=reader-98d2f9bedb}, consumerId=3380, consumerName=9574f, address=/10.1.2.182:33968}
   I 2020-06-01T20:31:19.787804660Z 20:31:19.787 [pulsar-client-io-42-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.2.182:8080/healthcheck] [reader-98d2f9bedb] Closed consumer
   I 2020-06-01T20:31:23.906091867Z 20:31:23.905 [pulsar-ordered-OrderedExecutor-3-0-EventThread] INFO  org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x300369d33e0001b local:/10.1.2.182:33466 remoteserver:pulsar-zookeeper-2.pulsar-zookeeper/10.1.8.80:2181 lastZxid:25770466537 xid:107048 sent:107048 recv:124047 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.1.41.80:8080
   I 2020-06-01T20:31:24.719776356Z 20:31:24.719 [pulsar-load-manager-3-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to ZooKeeper because maximum change 11.40127182006836% exceeded threshold 10%; time since last report written is 5.0 seconds
   I 2020-06-01T20:31:24.725426343Z 20:31:24.725 [pulsar-ordered-OrderedExecutor-3-0-EventThread] INFO  org.apache.pulsar.zookeeper.ZooKeeperDataCache - [State:CONNECTED Timeout:30000 sessionid:0x300369d33e0001b local:/10.1.2.182:33466 remoteserver:pulsar-zookeeper-2.pulsar-zookeeper/10.1.8.80:2181 lastZxid:25770466539 xid:107051 sent:107051 recv:124051 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeDataChanged path:/loadbalance/brokers/10.1.2.182:8080
   
   // Restart
   
   I 2020-06-01T20:31:27.095166566Z [conf/pulsar_env.sh] Applying config PULSAR_EXTRA_OPTS = -Dpulsar.log.root.level=info
   I 2020-06-01T20:31:27.095312356Z [conf/pulsar_env.sh] Applying config PULSAR_GC = "-XX:+UseG1GC -XX:MaxGCPauseMillis=10"
   I 2020-06-01T20:31:27.095324965Z [conf/pulsar_env.sh] Applying config PULSAR_MEM = "-Xms2g -Xmx2g -XX:MaxDirectMemorySize=2g -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem"
   I 2020-06-01T20:31:27.095333038Z [conf/broker.conf] Applying config advertisedAddress = 10.1.2.182
   ...
   I 2020-06-01T20:31:27.095428042Z [conf/broker.conf] Applying config zookeeperServers = pulsar-zookeeper-0.pulsar-zookeeper,pulsar-zookeeper-1.pulsar-zookeeper,pulsar-zookeeper-2.pulsar-zookeeper
   I 2020-06-01T20:31:27.118097760Z [conf/pulsar_env.sh] Applying config PULSAR_EXTRA_OPTS = -Dpulsar.log.root.level=info
   I 2020-06-01T20:31:27.118143135Z [conf/pulsar_env.sh] Applying config PULSAR_GC = "-XX:+UseG1GC -XX:MaxGCPauseMillis=10"
   I 2020-06-01T20:31:27.118149618Z [conf/pulsar_env.sh] Applying config PULSAR_MEM = "-Xms2g -Xmx2g -XX:MaxDirectMemorySize=2g -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem"
   I 2020-06-01T20:31:27.118155877Z [conf/pulsar_env.sh] Applying config PULSAR_EXTRA_OPTS = -Dpulsar.log.root.level=info
   I 2020-06-01T20:31:27.118160698Z [conf/pulsar_env.sh] Applying config PULSAR_GC = "-XX:+UseG1GC -XX:MaxGCPauseMillis=10"
   I 2020-06-01T20:31:27.118165936Z [conf/pulsar_env.sh] Applying config PULSAR_MEM = "-Xms2g -Xmx2g -XX:MaxDirectMemorySize=2g -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem"
   I 2020-06-01T20:31:27.141939511Z [conf/pulsar_env.sh] Applying config PULSAR_EXTRA_OPTS = -Dpulsar.log.root.level=info
   I 2020-06-01T20:31:27.141971873Z [conf/pulsar_env.sh] Applying config PULSAR_GC = "-XX:+UseG1GC -XX:MaxGCPauseMillis=10"
   I 2020-06-01T20:31:27.142012430Z [conf/pulsar_env.sh] Applying config PULSAR_MEM = "-Xms2g -Xmx2g -XX:MaxDirectMemorySize=2g -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem"
   E 2020-06-01T20:31:27.934534415Z [AppClassLoader@18b4aac2] info AspectJ Weaver Version 1.9.2 built on Wednesday Oct 24, 2018 at 15:43:33 GMT
   E 2020-06-01T20:31:27.934578455Z [AppClassLoader@18b4aac2] info register classloader sun.misc.Launcher$AppClassLoader@18b4aac2
   E 2020-06-01T20:31:27.934621976Z [AppClassLoader@18b4aac2] info using configuration file:/pulsar/lib/org.apache.pulsar-pulsar-zookeeper-utils-2.5.1.jar!/META-INF/aop.xml
   E 2020-06-01T20:31:27.934630474Z [AppClassLoader@18b4aac2] info using configuration file:/pulsar/lib/org.apache.pulsar-pulsar-zookeeper-2.5.1.jar!/META-INF/aop.xml
   E 2020-06-01T20:31:27.940125061Z [AppClassLoader@18b4aac2] info register aspect org.apache.pulsar.broker.zookeeper.aspectj.ClientCnxnAspect
   E 2020-06-01T20:31:28.218736757Z [AppClassLoader@18b4aac2] info register aspect org.apache.pulsar.zookeeper.FinalRequestProcessorAspect
   E 2020-06-01T20:31:28.223107367Z [AppClassLoader@18b4aac2] info register aspect org.apache.pulsar.zookeeper.ZooKeeperServerAspect
   I 2020-06-01T20:31:30.492087527Z 20:31:30.488 [main] INFO  org.apache.pulsar.broker.PulsarService - Starting Pulsar Broker service; version: '2.5.1'
   I 2020-06-01T20:31:30.494428749Z 20:31:30.494 [main] INFO  org.apache.pulsar.broker.PulsarService - Git Revision c36170c3acc0bd432036baf7d9471a4c34320395
   I 2020-06-01T20:31:30.494766089Z 20:31:30.494 [main] INFO  org.apache.pulsar.broker.PulsarService - Built by guangning <gu...@apache.org> on ubuntu-s-4vcpu-8gb-sfo1-01 at 2020-04-15T08:53:35+0000


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org