You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2019/04/26 01:14:15 UTC
[GitHub] [pulsar] rdhabalia opened a new pull request #4138:
[pulsar-broker]Fix: client-producer can't reconnect due to failed
producer-future on cnx cache
rdhabalia opened a new pull request #4138: [pulsar-broker]Fix: client-producer can't reconnect due to failed producer-future on cnx cache
URL: https://github.com/apache/pulsar/pull/4138
### Motivation
Frequently having issue where broker doesn't allow client to create producer and client keep receiving error: `Producer is already present on the connection`
- It happens when same client-host is connected to one broker for two different topics which are part of different bundles. So, client shares same cnx-C1 with broker-B1 for the two different topics which are part of different bundles.
- Now, one of the bundle and topic-T1 gets unloaded and goes to different broker-B2. while unloading, client try to reconnect to broker B1 and producer-creation fails due to unloading. However, in the unloading process, broker sometimes couldn't clear this failed-producer-future from the cache.
- After sometime, the same bundle and topic T1 again gets unloaded from broker-B2 and comes back to Broker-B1.
- Now, client already has cnx-C1 with broker-B1 (because bundle-2 topics are still connected) and try to create producer (with same producer-id which was created initially at client) for topic-T1 on the same cnx-C1.
- Some how, broker couldn't clear old producer-result from the cnx-C1 cache and it thinks that producer is already connected so, client keeps failing and couldn't create the producer.
**Broker-log**
```
1#(1) Bundle started unloading
10:11:18.980 [pulsar-web-29-18] INFO org.apache.pulsar.broker.namespace.OwnedBundle - Disabling ownership: my-prop/my-cluster/my-ns/0x00000000_0x80000000
:
10:11:19.026 [pulsar-web-29-18] INFO org.apache.pulsar.broker.service.BrokerService - [persistent://my-prop/my-cluster/my-ns/topic-partition-0] Unloading topic
:
10:11:19.029 [pulsar-web-29-18] INFO org.apache.pulsar.broker.service.Producer - Disconnecting producer: Producer{topic=PersistentTopic{topic=persistent://my-prop/my-cluster/my-ns/topic-partition-0}, client=/121.121.121.121:58042, producerName=my-cluster-2897-13224, producerId=1069}
:
#(2) While unloading topic: producer disconnected and client tried again to create producer on same cnx: 121.121.121.121:58042 .. but I think producerFuture is not removed
10:11:19.133 [pulsar-io-21-48] INFO org.apache.pulsar.broker.service.ServerCnx - [/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0] Creating producer. producerId=1069
10:11:19.134 [pulsar-io-21-42] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup yahoo.dht.set.prod-tenzing.su for topic persistent://my-prop/my-cluster/my-ns/topic-partition-0 with error Namespace bundle my-prop/my-cluster/my-ns/0x00000000_0x80000000 is being unloaded
:
10:11:19.134 [pulsar-io-21-42] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup yahoo.dht.set.prod-tenzing.su for topic persistent://my-prop/my-cluster/my-ns/topic-partition-0 with error Namespace bundle my-prop/my-cluster/my-ns/0x00000000_0x80000000 is being unloaded
10:11:19.204 [BookKeeperClientWorker-OrderedExecutor-5-0] INFO org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper - [my-prop/my-cluster/my-ns/persistent/topic-partition-0] [ContactsBackend.BlendedContacts_Index_subscription] Updating cursor info ledgerId=-1 mark-delete=10548168989:116241
:
10:11:19.208 [bookkeeper-ml-workers-OrderedExecutor-19-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://my-prop/my-cluster/my-ns/topic-partition-0] Topic closed
:
10:11:19.211 [pulsar-io-21-27] WARN org.apache.pulsar.broker.service.BrokerService - [persistent://my-prop/my-cluster/my-ns/topic-partition-0] Service unit is not ready when loading the topic
#(3) Topic again assigned to same broker and stale and failed producer-future already exists which doesn't allow producer to reconnect and keep giving error: Producer already connected
:
16:46:19.116 [bookkeeper-ml-workers-OrderedExecutor-19-0] INFO org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - [persistent://my-prop/my-cluster/my-ns/topic-partition-0] [null] setting message-dispatch-rate DispatchRate{dispatchThrottlingRateInMsg=0, dispatchThrottlingRateInByte=0, ratePeriodInSecond=1}
16:46:19.155 [pulsar-io-21-48] WARN org.apache.pulsar.broker.service.ServerCnx - [/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0] Producer is already present on the connection
:
16:46:19.155 [pulsar-io-21-48] WARN org.apache.pulsar.broker.service.ServerCnx - [/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0] Producer is already present on the connection
16:46:19.350 [pulsar-io-21-48] WARN org.apache.pulsar.broker.service.ServerCnx - [/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0] Producer is already present on the connection
16:46:19.751 [pulsar-io-21-48] WARN org.apache.pulsar.broker.service.ServerCnx - [/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0] Producer is already present on the connection
16:46:20.528 [pulsar-io-21-48] WARN org.apache.pulsar.broker.service.ServerCnx - [/121.121.121.121:58042][persistent://my-prop/my-cluster/my-ns/topic-partition-0] Producer is already present on the connection
```
**client-log**
CPP-Client log shows unknown error
```
20190425 16:46:19.155 : WARN cms./var/builds/workspace/260010-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/pulsar/pulsar-client-cpp/lib/ProducerImpl.cc - [persistent://sherpa-prodhd/prod2-sg3/prodhd-NOTIFICATION-ContactsBackend.BlendedContacts_Index/DIFF-partition-0, prod2-sg3-2897-13224] Failed to reconnect producer: UnknownError
```
I couldn't reproduce it in my local env but it happens very frequently to my prod env.
### Modification
- Clean up producer-future from the cache if future is already completed with failure.
- added log once schema-future completes for future troubleshooting to make sure schema-future doesn't get stuck.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services