You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/08/24 07:21:48 UTC

[GitHub] [pulsar] lhotari opened a new pull request, #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

lhotari opened a new pull request, #17254:
URL: https://github.com/apache/pulsar/pull/17254

   ### Motivation
   
   - Fixes issue where leader broker information isn't available and this warning gets logged:
     `WARN  org.apache.pulsar.broker.namespace.NamespaceService - The information about the current leader broker wasn't available. Handling load manager decisions in a decentralized way.`
   
   ### Modifications
   
   - use `LeaderElectionService`'s `readCurrentLeader` method instead of `getCurrentLeader` method. This ensures that the leader broker information is refreshed if it happens to be missing. 
   - refactor existing code so that the asynchronous `readCurrentLeader` method can be used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#issuecomment-1225309386

   > Are we adding load on the metadata service ? With this patch during a restart of the whole cluster we are going to perform ad additional read from zk for every topic lookup in order to find the leader broker. is this correct ?
   
   no. LeaderElectionService caches lookups. There won't be any additional load.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on a diff in pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on code in PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#discussion_r954729783


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java:
##########
@@ -644,6 +646,25 @@ private CompletableFuture<Optional<CandidateBrokerResult>> getCandidateBrokerRes
 
     }
 
+    private CompletableFuture<Optional<LeaderBroker>> getOrWaitForLeader(AtomicInteger retries) {
+        return pulsar.getLeaderElectionService().readCurrentLeader()
+                .handle((leaderBroker, t) -> {
+                    CompletableFuture<Optional<LeaderBroker>> retval;
+                    if ((t != null || !leaderBroker.isPresent()) && retries.getAndIncrement() < 3) {
+                        // retry after a delay of 5 seconds
+                        retval =
+                                CompletableFuture.supplyAsync(() -> null,
+                                                CompletableFuture.delayedExecutor(5, SECONDS))

Review Comment:
   it's not really a problem to use the system's pool for delayed executor. However, delayed executor isn't available in Java 8 and using this will make backporting harder. I'll use Pulsar's existing scheduler for this purpose so that backporting will be easier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari closed pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari closed pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues
URL: https://github.com/apache/pulsar/pull/17254


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on a diff in pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on code in PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#discussion_r954460316


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java:
##########
@@ -644,6 +646,25 @@ private CompletableFuture<Optional<CandidateBrokerResult>> getCandidateBrokerRes
 
     }
 
+    private CompletableFuture<Optional<LeaderBroker>> getOrWaitForLeader(AtomicInteger retries) {
+        return pulsar.getLeaderElectionService().readCurrentLeader()
+                .handle((leaderBroker, t) -> {
+                    CompletableFuture<Optional<LeaderBroker>> retval;
+                    if ((t != null || !leaderBroker.isPresent()) && retries.getAndIncrement() < 3) {
+                        // retry after a delay of 5 seconds
+                        retval =
+                                CompletableFuture.supplyAsync(() -> null,
+                                                CompletableFuture.delayedExecutor(5, SECONDS))
+                                        .thenCompose(__ -> getOrWaitForLeader(retries));
+                    } else {
+                        retval = CompletableFuture.completedFuture(
+                                leaderBroker != null ? leaderBroker : Optional.empty());
+                    }
+                    return retval;
+                })
+                .thenCompose(Function.identity());

Review Comment:
   when using `.handle` that returns a CompletableFuture, it's necessary to "flatten" it with `.thenCompose(Function.identity())`. 
   This is a known gap in CompletableFuture API that there is no `handleCompose` method directly. (an [stackoverflow answer explaining details](https://stackoverflow.com/a/60602429), contains links to other answers) . Java 12+ contains `.exceptionallyCompose`, but no `.handleCompose` which is unfortunate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on a diff in pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on code in PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#discussion_r954727925


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java:
##########
@@ -644,6 +646,25 @@ private CompletableFuture<Optional<CandidateBrokerResult>> getCandidateBrokerRes
 
     }
 
+    private CompletableFuture<Optional<LeaderBroker>> getOrWaitForLeader(AtomicInteger retries) {
+        return pulsar.getLeaderElectionService().readCurrentLeader()
+                .handle((leaderBroker, t) -> {
+                    CompletableFuture<Optional<LeaderBroker>> retval;
+                    if ((t != null || !leaderBroker.isPresent()) && retries.getAndIncrement() < 3) {
+                        // retry after a delay of 5 seconds

Review Comment:
   I'll add some logging



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] eolivelli commented on a diff in pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
eolivelli commented on code in PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#discussion_r953810374


##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java:
##########
@@ -644,6 +646,25 @@ private CompletableFuture<Optional<CandidateBrokerResult>> getCandidateBrokerRes
 
     }
 
+    private CompletableFuture<Optional<LeaderBroker>> getOrWaitForLeader(AtomicInteger retries) {
+        return pulsar.getLeaderElectionService().readCurrentLeader()
+                .handle((leaderBroker, t) -> {
+                    CompletableFuture<Optional<LeaderBroker>> retval;
+                    if ((t != null || !leaderBroker.isPresent()) && retries.getAndIncrement() < 3) {
+                        // retry after a delay of 5 seconds
+                        retval =
+                                CompletableFuture.supplyAsync(() -> null,
+                                                CompletableFuture.delayedExecutor(5, SECONDS))

Review Comment:
   which thread pool will run this execution ? (I guess it would be the system common pool)
   should we use a well-known thread pool ?



##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java:
##########
@@ -644,6 +646,25 @@ private CompletableFuture<Optional<CandidateBrokerResult>> getCandidateBrokerRes
 
     }
 
+    private CompletableFuture<Optional<LeaderBroker>> getOrWaitForLeader(AtomicInteger retries) {
+        return pulsar.getLeaderElectionService().readCurrentLeader()
+                .handle((leaderBroker, t) -> {
+                    CompletableFuture<Optional<LeaderBroker>> retval;
+                    if ((t != null || !leaderBroker.isPresent()) && retries.getAndIncrement() < 3) {
+                        // retry after a delay of 5 seconds
+                        retval =
+                                CompletableFuture.supplyAsync(() -> null,
+                                                CompletableFuture.delayedExecutor(5, SECONDS))
+                                        .thenCompose(__ -> getOrWaitForLeader(retries));
+                    } else {
+                        retval = CompletableFuture.completedFuture(
+                                leaderBroker != null ? leaderBroker : Optional.empty());
+                    }
+                    return retval;
+                })
+                .thenCompose(Function.identity());

Review Comment:
   why do we need this step ?



##########
pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java:
##########
@@ -644,6 +646,25 @@ private CompletableFuture<Optional<CandidateBrokerResult>> getCandidateBrokerRes
 
     }
 
+    private CompletableFuture<Optional<LeaderBroker>> getOrWaitForLeader(AtomicInteger retries) {
+        return pulsar.getLeaderElectionService().readCurrentLeader()
+                .handle((leaderBroker, t) -> {
+                    CompletableFuture<Optional<LeaderBroker>> retval;
+                    if ((t != null || !leaderBroker.isPresent()) && retries.getAndIncrement() < 3) {
+                        // retry after a delay of 5 seconds

Review Comment:
   tracking this delay with a debugger or with a jstack dump is very hard.
   maybe we should log something at INFO level
   otherwise lookup operations will hang without any observable sign



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#issuecomment-1225421219

   I marked this as draft since the tests are still missing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#issuecomment-1236885249

   closing this in favour of #17401


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] KannarFr commented on pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
KannarFr commented on PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#issuecomment-1232850099

   To be sure, could this issue result in 502 during topic creation? I just had this problem.
   
   I had these logs on brokers:
   
   ```Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.053 [pulsar-2-16] WARN org.apache.pulsar.broker.namespace.NamespaceService - The information about the current leader broker wasn't available. Handling load manager decisions in a decentralized way. NamespaceBundle[customtenant/customns/0xc0000000_0xffffffff]
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.055 [pulsar-web-34-3] INFO org.eclipse.jetty.server.RequestLog - 192.168.11.6 - - [31/Aug/2022:07:30:40 +0000] "PUT /admin/v2/persistent/customtenant/customns/customtopic?authoritative=false HTTP/1.1" 307 0 "-" "Jetty/9.4.44.v20210927" 5
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.066 [pulsar-2-4] WARN org.apache.pulsar.broker.namespace.NamespaceService - The information about the current leader broker wasn't available. Handling load manager decisions in a decentralized way. NamespaceBundle[customtenant/customns/0xc0000000_0xffffffff]
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.068 [pulsar-web-34-9] INFO org.eclipse.jetty.server.RequestLog - 192.168.11.6 - - [31/Aug/2022:07:30:40 +0000] "PUT /admin/v2/persistent/customtenant/customns/customtopic?authoritative=false HTTP/1.1" 307 0 "-" "Jetty/9.4.44.v20210927" 4
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.079 [pulsar-2-5] WARN org.apache.pulsar.broker.namespace.NamespaceService - The information about the current leader broker wasn't available. Handling load manager decisions in a decentralized way. NamespaceBundle[customtenant/customns/0xc0000000_0xffffffff]
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.080 [pulsar-web-34-3] INFO org.eclipse.jetty.server.RequestLog - 192.168.11.6 - - [31/Aug/2022:07:30:40 +0000] "PUT /admin/v2/persistent/customtenant/customns/customtopic?authoritative=false HTTP/1.1" 307 0 "-" "Jetty/9.4.44.v20210927" 4
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.090 [pulsar-2-13] WARN org.apache.pulsar.broker.namespace.NamespaceService - The information about the current leader broker wasn't available. Handling load manager decisions in a decentralized way. NamespaceBundle[customtenant/customns/0xc0000000_0xffffffff]
   Aug 31 07:30:40 test-pulsar-c2-n1 pulsar-broker[1956320]: 07:30:40.092 [pulsar-web-34-9] INFO org.eclipse.jetty.server.RequestLog - 192.168.11.6 - - [31/Aug/2022:07:30:40 +0000] "PUT /admin/v2/persistent/customtenant/customns/customtopic?authoritative=false HTTP/1.1" 307 0 "-" "Jetty/9.4.44.v20210927" 4
   ```
   
   Then nothing related to this customtopic even on another broker. And the response was HTTP 502.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] lhotari commented on pull request #17254: [fix][broker] Fix intermittent issue where leader broker information is lost and causes lookup issues

Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17254:
URL: https://github.com/apache/pulsar/pull/17254#issuecomment-1225365180

   > LGTM
   > 
   > this is a very long standing issue
   
   I think I'll need to add some tests to ensure that the logic works. I did notice from the implementation that `readCurrentLeader` isn't a sufficient solution for resolving this issue. I added another commit to retry looking up the leader broker if there's no leader at the moment. I'll evolve this solution based on review feedback.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org