You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/01/24 13:08:41 UTC

[GitHub] [pulsar] wuYin opened a new pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

wuYin opened a new pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298


   Fixes #9297 
   
   ### Motivation
   
   Try to fix flaky case which broker always returned unreachable address to client.
   
   ### Modifications
   
   Before broker response lookup result, check the owner broker is active or not.
   If not, remove the ownership, then redirect to leader broker to find a new active broker to own this bundle. 
   
   ### Verifying this change
   
   - [ ] Make sure that the change passes the CI checks.
   
   
   ### Other
   Since this case can't be reproduced easily, and this change will cause the lookup to increase the latency of the read local cache, so I'm not sure this is the best solution to fix this, if not, ok to close this PR.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
jiazhai commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-767945286


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on a change in pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#discussion_r685767050



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/OwnershipCache.java
##########
@@ -238,7 +239,32 @@ public OwnershipCache(PulsarService pulsar, NamespaceBundleFactory bundleFactory
         }
 
         // If we're not the owner, we need to check if anybody else is
-        return resolveOwnership(path).thenApply(optional -> optional.map(Map.Entry::getKey));
+        CompletableFuture<Optional<NamespaceEphemeralData>> ownerFuture = new CompletableFuture<>();
+        ownershipReadOnlyCache.getAsync(path).whenComplete((owner, ex) -> {
+            if (ex != null) {
+                ownerFuture.completeExceptionally(ex);
+            } else if (!owner.isPresent()) {
+                ownerFuture.complete(Optional.empty());
+            } else {
+                try {
+                    URI ownerURI = new URI(owner.get().getHttpUrl());
+                    String ownerServiceAddr = ownerURI.getHost() + ":" + ownerURI.getPort();
+                    if (pulsar.getLoadManager().get().getAvailableBrokers().contains(ownerServiceAddr)) {
+                        ownerFuture.complete(owner);
+                    } else {
+                        LOG.warn("Namespace {} owned by broker {} which is not active, remove the ownership",
+                                suName.toString(), owner);
+                        removeOwnership(suName);

Review comment:
       here we are in `getOwnerAsync` that is meant to be free from side effects.
   
   I am not sure this is the right fix




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] zymap commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
zymap commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766518711


   Looks like tests failed by this change:
   ```
   Error:  Failures: 
   Error:  org.apache.pulsar.broker.namespace.OwnershipCacheTest.testGetOwner(org.apache.pulsar.broker.namespace.OwnershipCacheTest)
   [INFO]   Run 1: PASS
   Error:    Run 2: OwnershipCacheTest.testGetOwner:202 ? Execution java.lang.NullPointerException
   [INFO] 
   Error:  org.apache.pulsar.broker.namespace.OwnershipCacheTest.testReestablishOwnership(org.apache.pulsar.broker.namespace.OwnershipCacheTest)
   [INFO]   Run 1: PASS
   Error:    Run 2: OwnershipCacheTest.testReestablishOwnership:354 ? Execution java.lang.NullPoin...
   [INFO] 
   Error:    TransactionMetaStoreAssignmentTest.init:34->TransactionMetaStoreTestBase.setup:53 ? SessionExpired
   [INFO] 
   Error:  Tests run: 499, Failures: 3, Errors: 0, Skipped: 1
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] sijie commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
sijie commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-793122271


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] hangc0276 commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-893138704


   @wuYin  Would you please address the conflict? i am starting to release 2.8.1, it all tests passed, we can include this pr in 2.8.1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-852166072


   @wuYin Could you please help check the failed tests? I have moved it to 2.9.0 and tag `release/2.8.1`, if the test passed, we can move it back to 2.8.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
jiazhai commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766684143


   @wuYin this change lgtm. the latency for read local zk cache should be fine. Otherwise it would be hard to fix this issue. 
   It would be great to add a test to cover and protect this change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] 315157973 commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
315157973 commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-956095141


   @wuYin  Hello, please take a look at eolivelli's comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] zymap commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
zymap commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766518711


   Looks like tests failed by this change:
   ```
   Error:  Failures: 
   Error:  org.apache.pulsar.broker.namespace.OwnershipCacheTest.testGetOwner(org.apache.pulsar.broker.namespace.OwnershipCacheTest)
   [INFO]   Run 1: PASS
   Error:    Run 2: OwnershipCacheTest.testGetOwner:202 ? Execution java.lang.NullPointerException
   [INFO] 
   Error:  org.apache.pulsar.broker.namespace.OwnershipCacheTest.testReestablishOwnership(org.apache.pulsar.broker.namespace.OwnershipCacheTest)
   [INFO]   Run 1: PASS
   Error:    Run 2: OwnershipCacheTest.testReestablishOwnership:354 ? Execution java.lang.NullPoin...
   [INFO] 
   Error:    TransactionMetaStoreAssignmentTest.init:34->TransactionMetaStoreTestBase.setup:53 ? SessionExpired
   [INFO] 
   Error:  Tests run: 499, Failures: 3, Errors: 0, Skipped: 1
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
jiazhai commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766489617


   /pulsarbot run-failure-checks


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
jiazhai commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766489617






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] hangc0276 commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-893138704


   @wuYin  Would you please address the conflict? i am starting to release 2.8.1, it all tests passed, we can include this pr in 2.8.1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] wuYin closed pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
wuYin closed pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] wuYin commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
wuYin commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766526674


   @zymap thanks for review
   Actually, I'm not sure this ownership double check PR will solve the flaky Issue
   I'll add test to simulate flaky case by removing the /loadbalance/downBroker zNode directly, and fix existed failure tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] hangc0276 commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-895697754


   @wuYin Could you please help check the failed tests and the conflicts?  I have moved it to  tag release/2.8.2,  if the test passed and conflict addressed, we can move it back to 2.8.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai edited a comment on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
jiazhai edited a comment on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766684143


   @wuYin this change lgtm. the latency for read local zk cache should be fine. Otherwise it would be hard to fix this issue. 
   It would be great to add a test to cover and protect this change. looking forward to it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] wuYin commented on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
wuYin commented on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766526674


   @zymap thanks for review
   Actually, I'm not sure this ownership double check PR will solve the flaky Issue
   I'll add test to simulate flaky case by removing the /loadbalance/downBroker zNode directly, and fix existed failure tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai edited a comment on pull request #9298: [Issue 9297][broker] Try to fix flaky infinite reconnection

Posted by GitBox <gi...@apache.org>.
jiazhai edited a comment on pull request #9298:
URL: https://github.com/apache/pulsar/pull/9298#issuecomment-766684143


   @wuYin this change lgtm. the latency for read local zk cache should be fine. Otherwise it would be hard to fix this issue. 
   It would be great to add a test to cover and protect this change. looking forward to it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org