Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/12/29 15:21:07 UTC

[GitHub] [pulsar-client-go] wolfstudy opened a new issue #697: Reconnection blocked in producer by request timed out

wolfstudy opened a new issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697


   #### Expected behavior
   
   When the reconnection logic is triggered, the producer should be able to reconnect successfully.
   
   #### Actual behavior
   
   When the reconnection logic is triggered, the Go SDK keeps trying to reconnect but never succeeds.
   
   While the reconnection logic is being triggered repeatedly, the Go SDK logs the following:
   
   1. The broker notifies the client to close the producer:
   
   ```
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=info msg="Broker notification of Closed producer: 1" local_addr="1.1.1.2:51860" remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Connection was closed]" cnx="1.1.1.2:51860 -> 1.1.1.3:6650" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=info msg="[Reconnecting to broker in  109.456323ms]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   
   ```
   
   2. `ReceivedSendReceipt` reports processing errors. After this, the `send` path for this topic never recovers, and request timeouts are reported:
   
   ```
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431457 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431462 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431465 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431470 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431473 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431479 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431483 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431488 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431492 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431495 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431497 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431500 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431503 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431504 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431506 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431512 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431519 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431523 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   Dec 29 15:18:08 gg-1-1-1-2 gateway: time="2021-12-29T15:18:08+08:00" level=warning msg="[Got unexpected send receipt for messageID=%+v ledgerId:18625173 entryId:431526 ]" local_addr="1.1.1.2:51860" producerID=1 remote_addr="pulsar://1.1.1.3:6650"
   
   ....
   ....
   ....
   ```
   
   3. The send-timeout logic is triggered and the pending messages fail to be sent:
   
   ```
   Dec 29 15:18:16 gg-1-1-1-2 gateway: time="2021-12-29T15:18:16+08:00" level=info msg="Failing 685 messages" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   ```
   
   4. The producer keeps retrying the reconnection, and every attempt fails with `request timed out`:
   
   ```
   Dec 29 15:18:18 gg-1-1-1-2 gateway: time="2021-12-29T15:18:18+08:00" level=error msg="[Failed to create producer]" error="request timed out" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:18:18 gg-1-1-1-2 gateway: time="2021-12-29T15:18:18+08:00" level=info msg="[Reconnecting to broker in  221.849265ms]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   
   Dec 29 15:18:28 gg-1-1-1-2 gateway: time="2021-12-29T15:18:28+08:00" level=error msg="[Failed to create producer]" error="request timed out" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:18:28 gg-1-1-1-2 gateway: time="2021-12-29T15:18:28+08:00" level=info msg="[Reconnecting to broker in  468.906379ms]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   
   Dec 29 15:18:39 gg-1-1-1-2 gateway: time="2021-12-29T15:18:39+08:00" level=error msg="[Failed to create producer]" error="request timed out" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:18:39 gg-1-1-1-2 gateway: time="2021-12-29T15:18:39+08:00" level=info msg="[Reconnecting to broker in  840.895497ms]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   
   Dec 29 15:18:50 gg-1-1-1-2 gateway: time="2021-12-29T15:18:50+08:00" level=error msg="[Failed to create producer]" error="request timed out" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:18:50 gg-1-1-1-2 gateway: time="2021-12-29T15:18:50+08:00" level=info msg="[Reconnecting to broker in  1.913524483s]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   
   Dec 29 15:19:02 gg-1-1-1-2 gateway: time="2021-12-29T15:19:02+08:00" level=error msg="[Failed to create producer]" error="request timed out" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:19:02 gg-1-1-1-2 gateway: time="2021-12-29T15:19:02+08:00" level=info msg="[Reconnecting to broker in  3.485446s]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   
   ....
   ....
   
   Dec 29 15:22:52 gz-9-139-45-78 gateway: time="2021-12-29T15:22:52+08:00" level=error msg="[Failed to create producer]" error="request timed out" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   Dec 29 15:22:52 gz-9-139-45-78 gateway: time="2021-12-29T15:22:52+08:00" level=info msg="[Reconnecting to broker in  1m3.390120353s]" producerID=1 producer_name=cmq_1.1.1.2_ac48dcea-6164-11ec-8357-5254007f7980 topic="persistent://test-tenant/tst-ns/test-topic-partition-0"
   ```
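
   The retry delays above grow roughly exponentially (about 109ms, 221ms, 468ms, ... up to around a minute). As a rough illustration of that kind of schedule (this is only a sketch of exponential backoff with jitter, not the client's actual backoff code):

   ```go
   package main

   import (
   	"fmt"
   	"math/rand"
   	"time"
   )

   // nextDelay doubles the previous delay, caps it at max, and then adds a
   // small random jitter, which matches the general shape of the
   // "Reconnecting to broker in ..." intervals in the log above.
   func nextDelay(prev, max time.Duration) time.Duration {
   	d := prev * 2
   	if d > max {
   		d = max
   	}
   	jitter := time.Duration(rand.Int63n(int64(d / 5))) // up to +20%
   	return d + jitter
   }

   func main() {
   	d := 100 * time.Millisecond
   	for i := 1; i <= 10; i++ {
   		fmt.Printf("attempt %d: retry in %v\n", i, d)
   		d = nextDelay(d, time.Minute)
   	}
   }
   ```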
   
   #### Steps to reproduce
   
   This issue cannot be reproduced reliably; there are no stable steps to reproduce it.
   
   #### System configuration
   **Pulsar version**: 2.7.2
   **Go SDK**: master branch
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] wolfstudy commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
wolfstudy commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005276588


   > Could this issue possibly be related to #691?
   > 
   > I tried out the client library with that PR applied, but after 24-48hrs I got a problem with clients freezing as they attempted to reconnect. I rolled back to [d5d4903](https://github.com/apache/pulsar-client-go/commit/d5d49031c7da2aa7fa0b9338e63addc6a762281b) and the issue has not reoccurred after ~1 week.
   
   Hello @bschofield, can you provide more log info for this? Thanks.
   
   https://github.com/apache/pulsar-client-go/pull/689 does not seem to be the cause of the problem described in this issue, because the reconnection gets stuck on the producer side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] wolfstudy commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
wolfstudy commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1008911849


   An update from my side:
   
   I ran into a similar problem today; here is some further context:
   
   ![image](https://user-images.githubusercontent.com/20965307/148779792-ff360734-1742-48d7-9851-c2d30b5dfc29.png)
   As can be seen in the picture above, when this problem occurs the Go SDK keeps sending the `cmdProducer` command to the broker, and the command does reach the broker, but the broker never responds to it.
   
   The `Epoch` field is the reconnect count.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] zymap closed issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
zymap closed issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] wolfstudy edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
wolfstudy edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1006366136


   @bschofield Thanks for your work on this. The #691 change overlooks one problem: when the producer is closed, it can leak the reconnect goroutine. I have submitted a new pull request that adds a `closeCh` event-notification channel to listen for the producer being closed, so that when the producer is closed the reconnect goroutine can exit normally. The new pull request is here: https://github.com/apache/pulsar-client-go/pull/700, PTAL.
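
   As a minimal sketch of the idea (the `closeCh` name comes from the description above; everything else here is illustrative and not the actual PR #700 code):

   ```go
   package main

   import (
   	"fmt"
   	"sync"
   	"time"
   )

   // Illustrative producer stand-in: closeCh is closed exactly once when
   // the producer shuts down, so the reconnect goroutine can observe it
   // and return instead of leaking.
   type producerLike struct {
   	closeCh         chan struct{}
   	connectClosedCh chan struct{}
   	closeOnce       sync.Once
   }

   func (p *producerLike) reconnectToBroker() {
   	fmt.Println("reconnecting...")
   	time.Sleep(10 * time.Millisecond)
   }

   func (p *producerLike) runReconnectLoop() {
   	for {
   		select {
   		case <-p.closeCh:
   			fmt.Println("producer closed, reconnect goroutine exits")
   			return
   		case <-p.connectClosedCh:
   			p.reconnectToBroker()
   		}
   	}
   }

   func (p *producerLike) Close() {
   	p.closeOnce.Do(func() { close(p.closeCh) })
   }

   func main() {
   	p := &producerLike{
   		closeCh:         make(chan struct{}),
   		connectClosedCh: make(chan struct{}, 1),
   	}
   	go p.runReconnectLoop()

   	p.connectClosedCh <- struct{}{} // simulate a dropped connection
   	time.Sleep(20 * time.Millisecond)

   	p.Close() // the reconnect goroutine observes closeCh and returns
   	time.Sleep(20 * time.Millisecond)
   }
   ```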
   
   In addition, the problem you encountered seems somewhat different from mine. In my case, for some reason the broker proactively notified the producer and closed the connection, and then every reconnection attempt failed with `request timed out`. Because this problem is hard to reproduce, there is still no way to determine where `p.grabCnx()` is blocked and causing the reconnection to fail.
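
   One generic way to narrow this down next time it happens (nothing specific to pulsar-client-go, just standard Go tooling) is to dump all goroutine stacks from the stuck process, for example on a signal, and look at where the reconnect goroutine is parked:

   ```go
   package main

   import (
   	"os"
   	"os/signal"
   	"runtime/pprof"
   	"syscall"
   )

   // dumpGoroutinesOnSignal prints the full stack of every goroutine to
   // stderr whenever the process receives SIGUSR1, so a producer stuck in
   // p.grabCnx() shows exactly which call it is blocked on.
   func dumpGoroutinesOnSignal() {
   	ch := make(chan os.Signal, 1)
   	signal.Notify(ch, syscall.SIGUSR1)
   	go func() {
   		for range ch {
   			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
   		}
   	}()
   }

   func main() {
   	dumpGoroutinesOnSignal()
   	select {} // stand-in for the real application
   }
   ```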
   
   
   > 2021-12-29 16:13:02 {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
   
   This looks like an explicit error returned by the broker: its operation against the bookie timed out (a `PersistenceError` from the managed ledger).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] sharashka-io commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
sharashka-io commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1004759438


   Could this issue be related to https://github.com/apache/pulsar-client-go/pull/691? I tried out the client library with that PR applied but after ~24hrs I had a problem with the system becoming stuck trying to reconnect. I rolled back to https://github.com/apache/pulsar-client-go/commit/d5d49031c7da2aa7fa0b9338e63addc6a762281b and the issue has not reoccurred.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1006554346


   Thanks a lot for the reply, @wolfstudy.
   
   I agree that the logs I posted don't seem to show exactly the same issue, but I'm not sure that I got the root-cause logs -- as a broader outline of the problem, several of my producers shut down over a period of an hour or so, and the broker/bookie seemed to get into quite a confused state for the affected topics. I had to shut down the entire system and restart it to get things started again.
   
   I'll put some comments on PR #700.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005484225


   > Hello @bschofield, can you provide more log info for this? Thanks.
   
   It is hard for me to know exactly which log lines are relevant, as I have a large number of pods running, each with many producers, but here is an example that looks interesting. This may not be the root cause of the issue.
   
   ```
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.682869,"caller":"messaging/logging_zap.go:67","msg":"[Detected stale connection to broker]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829515,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829886,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830094,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830208,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830392,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830513,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830642,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830602,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830988,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  108.668329ms]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830504,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831107,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  107.590908ms]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.68313,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831586,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  111.142612ms]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7922926,"caller":"messaging/logging_zap.go:62","msg":"[Connecting to broker]","remote_addr":"pulsar://10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7961068,"caller":"messaging/logging_zap.go:62","msg":"[TCP connection established]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.79917,"caller":"messaging/logging_zap.go:62","msg":"[Connection is ready]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.802493,"caller":"messaging/logging_zap.go:82","msg":"Resending [1780] pending batches","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025355,"caller":"messaging/logging_zap.go:82","msg":"Resending [680] pending batches","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025396,"caller":"messaging/logging_zap.go:82","msg":"Resending [1866] pending batches","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8164892,"caller":"messaging/logging_zap.go:62","msg":"[Reconnected producer to broker]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8165426,"caller":"messaging/logging_zap.go:87","msg":"Received ack for [ledgerId:1021695 entryId:47743  739 711] on sequenceId %!v(MISSING) - expected: %!v(MISSING), closing connection","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166032,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166149,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166232,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166335,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.816643,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   ```
   
   I also see some other errors which might be unrelated, e.g.:
   
   ```
   2021-12-29 16:13:02 | {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
   ```
   
   >The #689 does not seem to be the cause of the problem described in this issue, because the reconnection gets stuck on the producer side
   
   I did not suggest that you look at #689; I suggested that you look at #691.
   
   In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you changed `pulsar/producer_partition.go` so that `p.reconnectToBroker()` is called in a separate goroutine.
   
   My (untested) hypothesis is that one of the other `p.*()` calls from the original goroutine could perhaps now be happening at the same time as `p.reconnectToBroker()`:
   
   ```go
   for {
       select {
       case i := <-p.eventsChan:
           switch v := i.(type) {
           case *sendRequest:
               p.internalSend(v)
           case *flushRequest:
               p.internalFlush(v)
           case *closeProducer:
               p.internalClose(v)
               return
           }
       case <-p.batchFlushTicker.C:
           if p.batchBuilder.IsMultiBatches() {
               p.internalFlushCurrentBatches()
           } else {
               p.internalFlushCurrentBatch()
           }
       }
   }
   ```
   
   I was wondering if this could cause the freeze in `p.grabCnx()` that you are seeing.
   
   Again, I haven't tested this idea, but thought it could be worth mentioning since I experienced a similar problem but only with PR #691 applied.
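
   To make the suspected interleaving concrete, here is a small self-contained sketch (a toy stand-in, not pulsar-client-go code) of how spawning the reconnect on its own goroutine lets the event loop keep invoking other `p.*()` methods while the reconnect is still in flight:

   ```go
   package main

   import (
   	"fmt"
   	"sync"
   	"time"
   )

   // fakeProducer only exists to demonstrate the interleaving; none of
   // this is the real pulsar-client-go producer.
   type fakeProducer struct {
   	mu    sync.Mutex
   	state string
   }

   func (p *fakeProducer) reconnectToBroker() {
   	p.mu.Lock()
   	p.state = "reconnecting"
   	p.mu.Unlock()
   	time.Sleep(50 * time.Millisecond) // stands in for a slow grabCnx()
   	p.mu.Lock()
   	p.state = "ready"
   	p.mu.Unlock()
   }

   func (p *fakeProducer) internalSend(msg string) {
   	p.mu.Lock()
   	defer p.mu.Unlock()
   	fmt.Printf("internalSend(%q) observed state=%s\n", msg, p.state)
   }

   func main() {
   	p := &fakeProducer{state: "ready"}

   	// Synchronous reconnect: internalSend can only run after
   	// reconnectToBroker has returned.
   	p.reconnectToBroker()
   	p.internalSend("a") // state=ready

   	// Reconnect on its own goroutine (as in the #691 change): the event
   	// loop keeps handling send requests, so internalSend may now observe
   	// the producer mid-reconnect.
   	go p.reconnectToBroker()
   	time.Sleep(time.Millisecond)
   	p.internalSend("b") // very likely state=reconnecting
   	time.Sleep(100 * time.Millisecond)
   }
   ```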
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1004761478


   Could this issue possibly be related to https://github.com/apache/pulsar-client-go/pull/691?
   
   I tried out the client library with that PR applied, but after 24-48hrs I got a problem with clients freezing as they attempted to reconnect. I rolled back to https://github.com/apache/pulsar-client-go/commit/d5d49031c7da2aa7fa0b9338e63addc6a762281b and the issue has not reoccurred after ~1 week.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005484225


   > Hello @bschofield Can you provide more log info for this, thanks.
   
   It is hard for me to know exactly which log lines are relevant as I have very many pods running, each with a large number of producers, but here is an example that looks interesting. This may not be the root cause of the issue.
   
   ```
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.682869,"caller":"messaging/logging_zap.go:67","msg":"[Detected stale connection to broker]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829515,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829886,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830094,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830208,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830392,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830513,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830642,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830602,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830988,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  108.668329ms]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830504,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831107,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  107.590908ms]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.68313,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831586,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  111.142612ms]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7922926,"caller":"messaging/logging_zap.go:62","msg":"[Connecting to broker]","remote_addr":"pulsar://10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7961068,"caller":"messaging/logging_zap.go:62","msg":"[TCP connection established]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.79917,"caller":"messaging/logging_zap.go:62","msg":"[Connection is ready]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.802493,"caller":"messaging/logging_zap.go:82","msg":"Resending [1780] pending batches","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025355,"caller":"messaging/logging_zap.go:82","msg":"Resending [680] pending batches","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025396,"caller":"messaging/logging_zap.go:82","msg":"Resending [1866] pending batches","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8164892,"caller":"messaging/logging_zap.go:62","msg":"[Reconnected producer to broker]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8165426,"caller":"messaging/logging_zap.go:87","msg":"Received ack for [ledgerId:1021695 entryId:47743  739 711] on sequenceId %!v(MISSING) - expected: %!v(MISSING), closing connection","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166032,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166149,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166232,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166335,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.816643,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   ```
   
   I also see some other errors which might be unrelated problems, e.g. 
   
   ```
   2021-12-29 16:13:02 {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
   ```
   
   >The #689 does not seem to cause the problem described in this issue, because the reconnect stuck occurs on the prouder side
   
   I think there is a little miscommunication here, I was trying to suggest that you look at #691 and not #689.
   
   In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you changed `pulsar/producer_partition.go` so that `p.reconnectToBroker()` is called in a separate goroutine.
   
   My (untested) hypothesis is that one of the other `p.*()` calls from the original goroutine could perhaps now be happening at the same time as `p.reconnectToBroker()`:
   
   	for {
   		select {
   		case i := <-p.eventsChan:
   			switch v := i.(type) {
   			case *sendRequest:
   				p.internalSend(v)
   			case *flushRequest:
   				p.internalFlush(v)
   			case *closeProducer:
   				p.internalClose(v)
   				return
   			}
   		case <-p.batchFlushTicker.C:
   			if p.batchBuilder.IsMultiBatches() {
   				p.internalFlushCurrentBatches()
   			} else {
   				p.internalFlushCurrentBatch()
   			}
   		}
   	}
   
   I was wondering if this could cause the freeze in `p.grabCnx()` that you are seeing.
   
   Again, I haven't tested this idea, but thought it could be worth mentioning since I experienced a similar problem but only with PR #691 applied. If this turns out to be something unrelated, apologies!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1006554346


   Thanks a lot for the reply, @wolfstudy.
   
   I agree that the logs I posted don't seem to show exactly the same issue, but I'm not sure that I got the root-cause logs -- as a broader outline of the problem, several of my producers shut down over a period of an hour or so, and the broker/bookie seemed to get into quite a confused state for the affected topics. I had to shut down the entire system and restart it to get things started again.
   
   I'll put some comments about my concerns on PR #700.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005484225


   > Hello @bschofield Can you provide more log info for this, thanks.
   
   It is hard for me to know exactly which log lines are relevant, but here is an example that looks interesting, particularly the "use of closed network connection":
   
   ```
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.682869,"caller":"messaging/logging_zap.go:67","msg":"[Detected stale connection to broker]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829515,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829886,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830094,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830208,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830392,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830513,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830642,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830602,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830988,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  108.668329ms]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830504,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831107,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  107.590908ms]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.68313,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831586,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  111.142612ms]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7922926,"caller":"messaging/logging_zap.go:62","msg":"[Connecting to broker]","remote_addr":"pulsar://10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7961068,"caller":"messaging/logging_zap.go:62","msg":"[TCP connection established]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.79917,"caller":"messaging/logging_zap.go:62","msg":"[Connection is ready]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.802493,"caller":"messaging/logging_zap.go:82","msg":"Resending [1780] pending batches","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025355,"caller":"messaging/logging_zap.go:82","msg":"Resending [680] pending batches","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025396,"caller":"messaging/logging_zap.go:82","msg":"Resending [1866] pending batches","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8164892,"caller":"messaging/logging_zap.go:62","msg":"[Reconnected producer to broker]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8165426,"caller":"messaging/logging_zap.go:87","msg":"Received ack for [ledgerId:1021695 entryId:47743  739 711] on sequenceId %!v(MISSING) - expected: %!v(MISSING), closing connection","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166032,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166149,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166232,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166335,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.816643,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   ```
   
   I also see some other errors which might be unrelated problems, e.g. 
   
   ```
   2021-12-29 16:13:02 | {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
   ```
   
   >The #689 does not seem to cause the problem described in this issue, because the reconnect stuck occurs on the prouder side
   
   I did not suggest that you looked at #689. I suggested that you look at #691.
   
   In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you changed `pulsar/producer_partition.go` so that `p.reconnectToBroker()` is called in a separate goroutine.
   
   My (untested) hypothesis is that one of the other `p.*()` calls from the original goroutine could perhaps now be happening at the same time as `p.reconnectToBroker()`:
   
   	for {
   		select {
   		case i := <-p.eventsChan:
   			switch v := i.(type) {
   			case *sendRequest:
   				p.internalSend(v)
   			case *flushRequest:
   				p.internalFlush(v)
   			case *closeProducer:
   				p.internalClose(v)
   				return
   			}
   		case <-p.batchFlushTicker.C:
   			if p.batchBuilder.IsMultiBatches() {
   				p.internalFlushCurrentBatches()
   			} else {
   				p.internalFlushCurrentBatch()
   			}
   		}
   	}
   
   I was wondering if this could cause the freeze in `p.grabCnx()` that you are seeing.
   
   Again, I haven't tested this idea, but thought it could be worth mentioning since I experienced a similar problem but only with PR #691 applied.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar-client-go] bschofield edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005484225


   > Hello @bschofield Can you provide more log info for this, thanks.
   
   It is hard for me to know exactly which log lines are relevant, but here is an example that looks interesting. This may not be the root cause of the issue.
   
   ```
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.682869,"caller":"messaging/logging_zap.go:67","msg":"[Detected stale connection to broker]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829515,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829886,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830094,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830208,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830392,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830513,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830642,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830602,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830988,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  108.668329ms]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830504,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831107,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  107.590908ms]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.68313,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831586,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  111.142612ms]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7922926,"caller":"messaging/logging_zap.go:62","msg":"[Connecting to broker]","remote_addr":"pulsar://10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7961068,"caller":"messaging/logging_zap.go:62","msg":"[TCP connection established]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.79917,"caller":"messaging/logging_zap.go:62","msg":"[Connection is ready]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.802493,"caller":"messaging/logging_zap.go:82","msg":"Resending [1780] pending batches","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025355,"caller":"messaging/logging_zap.go:82","msg":"Resending [680] pending batches","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025396,"caller":"messaging/logging_zap.go:82","msg":"Resending [1866] pending batches","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8164892,"caller":"messaging/logging_zap.go:62","msg":"[Reconnected producer to broker]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8165426,"caller":"messaging/logging_zap.go:87","msg":"Received ack for [ledgerId:1021695 entryId:47743  739 711] on sequenceId %!v(MISSING) - expected: %!v(MISSING), closing connection","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166032,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166149,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166232,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166335,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.816643,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   ```
   
   I also see some other errors that might be unrelated problems, e.g.
   
   ```
   2021-12-29 16:13:02 {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
   ```
   
   >The #689 does not seem to cause the problem described in this issue, because the reconnect gets stuck on the producer side
   
   I did not suggest that you look at #689; I suggested that you look at #691.
   
   In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you changed `pulsar/producer_partition.go` so that `p.reconnectToBroker()` is called in a separate goroutine.
   
   My (untested) hypothesis is that one of the other `p.*()` calls from the original goroutine may now run concurrently with `p.reconnectToBroker()`:
   
   	for {
   		select {
   		case i := <-p.eventsChan:
   			switch v := i.(type) {
   			case *sendRequest:
   				p.internalSend(v)
   			case *flushRequest:
   				p.internalFlush(v)
   			case *closeProducer:
   				p.internalClose(v)
   				return
   			}
   		case <-p.batchFlushTicker.C:
   			if p.batchBuilder.IsMultiBatches() {
   				p.internalFlushCurrentBatches()
   			} else {
   				p.internalFlushCurrentBatch()
   			}
   		}
   	}
   
   I was wondering if this could cause the freeze in `p.grabCnx()` that you are seeing.
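   
   To make the idea concrete, here is a minimal, self-contained model of the interleaving I have in mind. It is purely illustrative (the field and method names only loosely mirror the client, and none of this is the actual pulsar-client-go code), but running it with `go run -race` should flag the unsynchronized access to the shared connection field:
   
   ```
   package main
   
   import "time"
   
   type fakeConn struct{ id int }
   
   // producer stands in for partitionProducer: one event-loop goroutine plus,
   // after PR #691, reconnects launched on separate goroutines.
   type producer struct {
       cnx        *fakeConn     // shared state touched by both goroutines
       eventsChan chan struct{} // stands in for sendRequest/flushRequest events
   }
   
   // internalSend models the event-loop side: it reads p.cnx on every send.
   func (p *producer) internalSend() {
       if p.cnx != nil {
           _ = p.cnx.id
       }
   }
   
   // reconnectToBroker models the reconnect side: it replaces p.cnx.
   func (p *producer) reconnectToBroker(i int) {
       p.cnx = &fakeConn{id: i} // unsynchronized write racing with internalSend
   }
   
   // runEventsLoop is the single goroutine that used to own all producer state.
   func (p *producer) runEventsLoop() {
       for range p.eventsChan {
           p.internalSend()
       }
   }
   
   func main() {
       p := &producer{cnx: &fakeConn{}, eventsChan: make(chan struct{})}
       go p.runEventsLoop()
   
       for i := 0; i < 100; i++ {
           go p.reconnectToBroker(i) // reconnect on its own goroutine, as after PR #691
           p.eventsChan <- struct{}{}
       }
       time.Sleep(50 * time.Millisecond) // let the spawned reconnects finish before exiting
   }
   ```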
   
   Again, I haven't tested this idea, but I thought it worth mentioning since I experienced a similar problem, and only with PR #691 applied.
   





[GitHub] [pulsar-client-go] bschofield commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1004761478


   Could this issue possibly be related to https://github.com/apache/pulsar-client-go/pull/691?
   
   I tried out the client library with that PR applied, but after 24-48 hours I got a problem with clients freezing as they attempted to reconnect. I rolled back to https://github.com/apache/pulsar-client-go/commit/d5d49031c7da2aa7fa0b9338e63addc6a762281b and the issue has not recurred.
   





[GitHub] [pulsar-client-go] bschofield edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005484225


   > Hello @bschofield, can you provide more log info for this? Thanks.
   
   It is hard for me to know exactly which log lines are relevant, as I have many pods running, each with a large number of producers, but here is an example that looks interesting. It may not be the root cause of the issue.
   
   ```
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.682869,"caller":"messaging/logging_zap.go:67","msg":"[Detected stale connection to broker]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829515,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829886,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830094,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830208,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830392,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830513,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830642,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830602,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830988,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  108.668329ms]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830504,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831107,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  107.590908ms]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.68313,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831586,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in  111.142612ms]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7922926,"caller":"messaging/logging_zap.go:62","msg":"[Connecting to broker]","remote_addr":"pulsar://10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.7961068,"caller":"messaging/logging_zap.go:62","msg":"[TCP connection established]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.79917,"caller":"messaging/logging_zap.go:62","msg":"[Connection is ready]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.802493,"caller":"messaging/logging_zap.go:82","msg":"Resending [1780] pending batches","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025355,"caller":"messaging/logging_zap.go:82","msg":"Resending [680] pending batches","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025396,"caller":"messaging/logging_zap.go:82","msg":"Resending [1866] pending batches","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
   2021-12-29 16:12:31 {"level":"info","ts":1640794351.8164892,"caller":"messaging/logging_zap.go:62","msg":"[Reconnected producer to broker]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8165426,"caller":"messaging/logging_zap.go:87","msg":"Received ack for [ledgerId:1021695 entryId:47743  739 711] on sequenceId %!v(MISSING) - expected: %!v(MISSING), closing connection","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166032,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166149,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166232,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166335,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   2021-12-29 16:12:31 {"level":"warn","ts":1640794351.816643,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
   ```
   
   I also see some other errors that might be unrelated problems, e.g.
   
   ```
   2021-12-29 16:13:02 {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
   ```
   
   >The #689 does not seem to cause the problem described in this issue, because the reconnect gets stuck on the producer side
   
   I think there is a little miscommunication here; I was trying to suggest that you look at #691, not #689.
   
   In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you changed `pulsar/producer_partition.go` so that `p.reconnectToBroker()` is called in a separate goroutine.
   
   My (untested) hypothesis is that one of the other `p.*()` calls from the original goroutine may now run concurrently with `p.reconnectToBroker()`:
   
   	for {
   		select {
   		case i := <-p.eventsChan:
   			switch v := i.(type) {
   			case *sendRequest:
   				p.internalSend(v)
   			case *flushRequest:
   				p.internalFlush(v)
   			case *closeProducer:
   				p.internalClose(v)
   				return
   			}
   		case <-p.batchFlushTicker.C:
   			if p.batchBuilder.IsMultiBatches() {
   				p.internalFlushCurrentBatches()
   			} else {
   				p.internalFlushCurrentBatch()
   			}
   		}
   	}
   
   I was wondering if this could cause the freeze in `p.grabCnx()` that you are seeing.
   
   Again, I haven't tested this idea, but I thought it worth mentioning since I experienced a similar problem, and only with PR #691 applied.
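   
   If the race is real, one (equally untested) way to rule it out would be to keep reconnection on the event-loop goroutine, for example by feeding it through the same events channel instead of spawning a new goroutine. A rough, self-contained sketch of that shape, again with simplified hypothetical types rather than the real client code:
   
   ```
   package main
   
   import (
       "fmt"
       "time"
   )
   
   type sendRequest struct{ payload string }
   type connectionClosed struct{} // reconnection is just another event
   
   type producer struct {
       eventsChan chan interface{}
       cnx        int // stands in for the broker connection
   }
   
   func (p *producer) internalSend(r *sendRequest) {
       fmt.Printf("send %q on cnx %d\n", r.payload, p.cnx)
   }
   
   func (p *producer) reconnectToBroker() {
       time.Sleep(10 * time.Millisecond) // pretend to grab a new connection
       p.cnx++
   }
   
   // runEventsLoop owns all producer state; because reconnection is handled
   // inline, no other handler can observe a half-reconnected producer.
   func (p *producer) runEventsLoop(done chan struct{}) {
       defer close(done)
       for e := range p.eventsChan {
           switch v := e.(type) {
           case *sendRequest:
               p.internalSend(v)
           case *connectionClosed:
               p.reconnectToBroker()
           }
       }
   }
   
   func main() {
       p := &producer{eventsChan: make(chan interface{})}
       done := make(chan struct{})
       go p.runEventsLoop(done)
   
       p.eventsChan <- &sendRequest{payload: "a"}
       p.eventsChan <- &connectionClosed{}        // broker dropped the connection
       p.eventsChan <- &sendRequest{payload: "b"} // handled only after the reconnect completes
       close(p.eventsChan)
       <-done
   }
   ```
   
   The obvious trade-off is that an in-flight reconnect then blocks sends and flushes on that producer, which I assume is what PR #691 was trying to avoid, so a lock around the shared connection state would be the other option.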
   





[GitHub] [pulsar-client-go] wolfstudy edited a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
wolfstudy edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1006366136


   @bschofield Thanks for your work on this. The #691 change overlooked a problem: when the producer is closed, the reconnect goroutine may leak. So I submitted a new pull request that adds a `closeCh` event-notification channel to listen for the producer being closed, so the reconnect goroutine can exit normally. The new pull request is here: https://github.com/apache/pulsar-client-go/pull/700, PTAL.
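   
   To sketch the idea (simplified, hypothetical names here; see the pull request for the actual change): the reconnect goroutine selects on `closeCh` as well as on its backoff timer, so closing the producer wakes it up and lets it return instead of leaking.
   
   ```
   package main
   
   import (
       "errors"
       "fmt"
       "time"
   )
   
   type producer struct {
       closeCh chan struct{} // closed exactly once when the producer is closed
   }
   
   // grabCnx pretends to re-establish the broker connection; it always fails
   // here so that the retry loop keeps running until the producer is closed.
   func (p *producer) grabCnx() error {
       return errors.New("request timed out")
   }
   
   // reconnectToBroker retries with a fixed backoff, but returns as soon as
   // closeCh is closed, so the goroutine cannot outlive the producer.
   func (p *producer) reconnectToBroker() {
       backoff := 100 * time.Millisecond
       for {
           select {
           case <-p.closeCh:
               fmt.Println("producer closed, stop reconnecting")
               return
           case <-time.After(backoff):
               if err := p.grabCnx(); err == nil {
                   fmt.Println("reconnected")
                   return
               }
           }
       }
   }
   
   func (p *producer) Close() {
       close(p.closeCh) // wakes up any pending reconnect attempt
   }
   
   func main() {
       p := &producer{closeCh: make(chan struct{})}
       go p.reconnectToBroker()
   
       time.Sleep(250 * time.Millisecond) // a couple of failed reconnect attempts
       p.Close()
       time.Sleep(100 * time.Millisecond) // give the goroutine time to observe closeCh and exit
   }
   ```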
   
   In addition, the problem you encountered here seems somewhat different from the one I encountered. In my case, for some reason the broker proactively notified the producer and closed the connection, and then the reconnection attempt failed with `request timed out`. Because the problem is difficult to reproduce, there is still no way to determine where `p.grabCnx()` is blocked and causing the reconnection to fail.





[GitHub] [pulsar-client-go] sharashka-io removed a comment on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
sharashka-io removed a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1004759438


   Could this issue be related to https://github.com/apache/pulsar-client-go/pull/691? I tried out the client library with that PR applied, but after ~24 hours I had a problem with the system becoming stuck trying to reconnect. I rolled back to https://github.com/apache/pulsar-client-go/commit/d5d49031c7da2aa7fa0b9338e63addc6a762281b and the issue has not recurred.





[GitHub] [pulsar-client-go] wolfstudy commented on issue #697: Reconnection blocked in producer by request timed out

Posted by GitBox <gi...@apache.org>.
wolfstudy commented on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1006366136


   @bschofield Thanks for your work on this. The #691 change overlooked a problem: when the producer is closed, the reconnect goroutine may leak. So I submitted a new pull request that adds a `closeCh` event-notification channel to listen for the producer being closed, so the reconnect goroutine can exit normally. The new pull request is here: https://github.com/apache/pulsar-client-go/pull/700, PTAL.
   
   

