Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/01/24 12:48:30 UTC

[GitHub] [pulsar] wuYin opened a new issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

wuYin opened a new issue #9297:
URL: https://github.com/apache/pulsar/issues/9297


   **Describe the bug**
   - Certain issue
     After a broker restarts and its 30s zk session expires, the client takes 51s, and sometimes as long as 1min40s, to reconnect successfully. This recovery time is rather long.
   
   - Flaky issue
     Client reconnects always fail the handshake with the proxy and stay stuck at 1~5
   
   ![image](https://user-images.githubusercontent.com/24536920/105628942-fd158880-5e7a-11eb-98ea-6c32220241db.png)
   
   **To Reproduce**
   1. Use pulsar-helm-chart to deploy a cluster
   - 1\*zk, 3\*brokers, 3\*bookies, 3\*proxies, all with a 2c 2g resource limit
   - 4k QPS, 4MBps in/out; CPU and memory load < 35%, see [3_brokers_loadreport.json.log](https://github.com/apache/pulsar/files/5862180/3_brokers_loadreport.json.log)
   
   2. Manually delete one broker so that it restarts. The topics (bundles) it owns are unloaded, and clients disconnect from the proxy and reconnect with backoff delays of 0.1s, 0.2s, 0.4s, ... A backoff of at least 25.6s is needed before the reconnect succeeds, and 51.2s is common.
   In the flaky case, client reconnects always fail. This has been seen in a production env, but it is not easy to reproduce.
   
   **Expected behavior**
   After a broker restart, clients should reconnect as soon as possible, for example within 40s (30s zk session expiration + 10s reconnect interval).
   
   **Screenshots**
   ![image](https://user-images.githubusercontent.com/24536920/105629682-7d3ded00-5e7f-11eb-9856-eb45a20dae24.png)
   
   
   **Other**
   How can the recovery time be reduced?
   - Reduce the zk session expiration time from the default 30s to 10s, but what is the cost?
   - The client backoff policy is exponential by default. It may need an option to cap the backoff, for example at 5s, so that the backoff becomes 0.1s, 0.2s, 0.4s, ..., 3.2s, 5s, 5s, ...
   - Reduce the broker znode expiration waiting time from the default 10s to 2s. It is configured in [broker-statefulset.yaml#L197](https://github.com/apache/pulsar-helm-chart/blob/master/charts/pulsar/templates/broker-statefulset.yaml#L197)
   - The flaky issue needs to be fixed...
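
The backoff figures in this report can be checked with a little arithmetic. The sketch below is a simplified model, not the client's actual implementation: it assumes a pure doubling backoff starting at 0.1s, no jitter, and connection attempts that take no time themselves. The function names and the 40s outage (30s zk session expiration + 10s) are illustrative assumptions taken from the numbers above.

```python
def backoff_delays_ms(initial_ms=100, cap_ms=None):
    """Yield successive retry delays in ms: doubling, optionally capped."""
    delay = initial_ms
    while True:
        yield delay if cap_ms is None else min(delay, cap_ms)
        delay *= 2


def first_retry_after_ms(outage_ms, initial_ms=100, cap_ms=None):
    """Return the time (ms after disconnect) of the first retry that
    fires at or after outage_ms, i.e. the first retry that can succeed."""
    elapsed = 0
    for delay in backoff_delays_ms(initial_ms, cap_ms):
        elapsed += delay
        if elapsed >= outage_ms:
            return elapsed


if __name__ == "__main__":
    # Uncapped: retries fire at 0.1s, 0.3s, 0.7s, ..., 25.5s, 51.1s, 102.3s.
    print(first_retry_after_ms(40_000))                # 51100
    # A retry that just misses recovery at 51.1s waits another 51.2s.
    print(first_retry_after_ms(52_000))                # 102300
    # Capped at 5s, the first retry after a 40s outage fires much sooner.
    print(first_retry_after_ms(40_000, cap_ms=5_000))  # 41300
```

Under these assumptions the uncapped policy gives 51.1s and 102.3s, which is close to the 51s and 1min40s observed, while a 5s cap would bring the first successful retry to about 41s.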
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] wuYin edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767280833


   @congbobo184 Thanks for the review.
   I'm using pulsar-helm-chart to deploy the cluster. In proxy.conf, the broker connection addresses look like:
   ```
   brokerServiceURL=pulsar://handshake-pulsar-broker:6650
   brokerWebServiceURL=http://handshake-pulsar-broker:8080
   ```
   which are generated by [proxy-configmap.yaml#L37](https://github.com/apache/pulsar-helm-chart/blob/master/charts/pulsar/templates/proxy-configmap.yaml#L37). In the proxy pod:
   ```
   > cat /etc/resolv.conf 
   search psr.svc.cluster.local svc.cluster.local cluster.local
   
   > host handshake-pulsar-broker
   handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.42.32
   handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.43.53 # will be removed
   handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.46.57
   
   > host handshake-pulsar-broker-1.handshake-pulsar-broker.psr.svc.cluster.local  # bundle owner host
   handshake-pulsar-broker-1.handshake-pulsar-broker.psr.svc.cluster.local has address 10.113.43.53
   ```
   For this issue, while broker1 is restarting/terminating, its service DNS record is removed quickly (within 1s).
   The proxy sends the Lookup to the other brokers. Because the zNode for broker1 has not expired yet, the other brokers still return `broker1.xxx.cluster.local`, whose DNS record has already been removed. This finally leads the client to back off and retry the same Lookup.
   
   I think this is reasonable, but there is still a small chance of triggering the flaky case.
   In my production env, I drained a k8s node, which caused a broker to be scheduled to another node, but the client's Lookup still failed even after retrying for 16min.
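
The window described above can be sketched as a small timeline model. This is only an illustration of the reasoning in this comment, not Pulsar code: the `Outage` type, the `lookup_outcome` function, and the 1s / 40s timestamps (DNS record removed within 1s, zNode gone after roughly 30s session expiration + 10s) are assumptions chosen to match the numbers in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    dns_removed_at: float    # seconds after the broker pod is deleted
    znode_expired_at: float  # seconds until the broker's zNode is gone


def lookup_outcome(t: float, outage: Outage) -> str:
    """Classify a Lookup issued t seconds after the broker pod is deleted."""
    if t < outage.dns_removed_at:
        return "old owner returned, DNS record still present"
    if t < outage.znode_expired_at:
        # Other brokers still report the old owner, but its DNS record
        # has been removed, so the client can only back off and retry.
        return "old owner returned, DNS record removed"
    return "old owner no longer returned"


if __name__ == "__main__":
    outage = Outage(dns_removed_at=1.0, znode_expired_at=40.0)
    for t in (0.5, 25.5, 51.1):
        print(t, lookup_outcome(t, outage))
```

In this model every retry between 1s and 40s hits the stale owner, which is why the length of that window, together with the backoff schedule, determines the recovery time in the normal case.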
   





[GitHub] [pulsar] wuYin closed issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin closed issue #9297:
URL: https://github.com/apache/pulsar/issues/9297


   





[GitHub] [pulsar] congbobo184 commented on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
congbobo184 commented on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767215808


   I suspect the cause is the brokerServiceURL configured in the proxy: when the broker it points to is down, the proxy can't look up the topic. So the client fails to connect and its reconnect interval keeps increasing. When the broker restarts, the client can connect.





[GitHub] [pulsar] wuYin commented on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin commented on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-766526568


   @zymap thanks for the review.
   Actually, I'm not sure this ownership double-check PR will solve the flaky issue.
   I'll add a test that simulates the flaky case by removing the /loadbalance/downBroker zNode directly, and fix the existing failing tests.


[GitHub] [pulsar] congbobo184 edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
congbobo184 edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767309798


   ![image](https://user-images.githubusercontent.com/39078850/105804679-5c8aaa00-5fdb-11eb-961f-8dba4c15473a.png)
   Can you provide any proxy log of the lookup command? It seems to be a DNS resolution error in the proxy. I want to make sure whether the proxy does the lookup through another broker.





[GitHub] [pulsar] jiazhai commented on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
jiazhai commented on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-766866111


   @congbobo184 Have you found any suspected cause or possible improvement for the 1st issue?





[GitHub] [pulsar] wuYin removed a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin removed a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-766526568


   @zymap thanks for the review.
   Actually, I'm not sure this ownership double-check PR will solve the flaky issue.
   I'll add a test that simulates the flaky case by removing the /loadbalance/downBroker zNode directly, and fix the existing failing tests.





[GitHub] [pulsar] jiazhai edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
jiazhai edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767256593


   > The client backoff policy is exponential by default. It may need an option to cap the backoff, for example at 5s, so that the backoff becomes 0.1s, 0.2s, 0.4s, ..., 3.2s, 5s, 5s, ...
   
   Regarding this issue, backoff is already supported in the Java client; maybe we need to add this feature to the Go client.





[GitHub] [pulsar] congbobo184 commented on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
congbobo184 commented on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767441426


   It seems difficult to reproduce this problem. I think it may be caused by the zookeeper cache not being updated in time, as in https://github.com/apache/pulsar/pull/8304. It seems that problem could cause the flaky case.





[GitHub] [pulsar] wuYin commented on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin commented on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-768927771


   > @wuYin in the certain case, if the broker is shut down gracefully, we don't need to wait for the zk session timeout. You can look at [apache/pulsar-helm-chart#59](https://github.com/apache/pulsar-helm-chart/pull/59) and upgrade the helm chart to 2.6.1-2 or a higher version. Sorry, I still haven't reproduced the flaky case.
   
   Thanks for finding this detail. I really hadn't noticed it; I'll use a release that contains this feature.
   To avoid unavailability caused by the flaky case, I implemented strict timeout degradation for my application.
   If it happens again and consistently, I'll reopen this issue and add more context.
   Thank you for helping.
   





[GitHub] [pulsar] wuYin edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767339685


   @congbobo184 Sorry for the late reply.
   I scaled the proxy down to a single instance, deleted the broker2 pod, and recorded the proxy debug log: [handshake-pulsar-proxy-0.log](https://github.com/apache/pulsar/files/5871610/handshake-pulsar-proxy-0.log)
   client:
   ![image](https://user-images.githubusercontent.com/24536920/105809537-d5423400-5fe4-11eb-9c5f-e95005a9344c.png)
   





[GitHub] [pulsar] congbobo184 edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
congbobo184 edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-768808066


   @wuYin in the certain case, if the broker is shut down gracefully, we don't need to wait for the zk session timeout. You can look at https://github.com/apache/pulsar-helm-chart/pull/59 and upgrade the helm chart to 2.6.1-2 or a higher version. Sorry, I still haven't reproduced the flaky case.


[GitHub] [pulsar] wuYin edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
wuYin edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767387519


   @congbobo184
   After deleting broker2, the client backs off and retries the connection at 0.1s, 0.2s, 0.4s, ..., 51.2s, then connects successfully.
   This is the normal case, and the reconnect latency can be reduced by configuring the backoff policy.
   But I can't reproduce the flaky case; this is the trouble.





[GitHub] [pulsar] congbobo184 edited a comment on issue #9297: proxy lookup result still contains unreachable owner broker serviceURL after broker restart and 30s zk session expired

Posted by GitBox <gi...@apache.org>.
congbobo184 edited a comment on issue #9297:
URL: https://github.com/apache/pulsar/issues/9297#issuecomment-767353123


   > @congbobo184 Sorry for the late reply.
   > I scaled the proxy down to a single instance, deleted the broker2 pod, and recorded the proxy debug log: [handshake-pulsar-proxy-0.log](https://github.com/apache/pulsar/files/5871610/handshake-pulsar-proxy-0.log)
   > client:
   > ![image](https://user-images.githubusercontent.com/24536920/105809537-d5423400-5fe4-11eb-9c5f-e95005a9344c.png)
   
   After you delete the broker2 pod, does the client always fail to connect and never recover? And when you restart broker2, does the client connect successfully? Is that so?
   
   Looking at the log, I did not find the debug log, so it is not certain whether the proxy lookup succeeded.




