Posted to dev@pulsar.apache.org by Ivan Kelly <iv...@apache.org> on 2021/08/12 18:30:05 UTC

Treating lookup request timeout as equivalent to TooManyRequests

Hi folks,

Rajan's opinion would be particularly useful here, as they appear to
have done most of the TooManyRequests work in the past.

So, the original TooManyRequests works well in the case where you have
a lot of topics and loads of lookups and PMR coming in to the broker.
In this case, most requests need to be served from ZooKeeper. These
requests end up being async, so the broker will keep accepting
requests and sending them to ZooKeeper until it hits TooManyRequests.
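To make the mechanism concrete, here is a minimal sketch (illustrative names only, not Pulsar's actual broker code) of how a broker can cap the number of in-flight async metadata lookups and reject the overflow with a TooManyRequests-style error:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: lookups are served asynchronously from the
// metadata store (ZooKeeper), so the broker only bounds how many are
// in flight at once, rejecting the rest rather than queueing forever.
class LookupLimiter {
    private final Semaphore permits;

    LookupLimiter(int maxConcurrentLookups) {
        this.permits = new Semaphore(maxConcurrentLookups);
    }

    CompletableFuture<String> lookup(String topic) {
        if (!permits.tryAcquire()) {
            // Stand-in for responding with ServerError.TooManyRequests.
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalStateException("TooManyRequests"));
            return failed;
        }
        // Release the permit once the async metadata read completes,
        // whether it succeeded or failed.
        return fetchOwnerFromMetadataStore(topic)
                .whenComplete((owner, t) -> permits.release());
    }

    private CompletableFuture<String> fetchOwnerFromMetadataStore(String topic) {
        // Stand-in for the async ZooKeeper read; returns the owning
        // broker's address immediately in this sketch.
        return CompletableFuture.completedFuture("broker-1:6650");
    }
}
```

The key property is that the broker thread never blocks: it either admits the lookup (async) or fails fast, which is what makes the limit the natural backpressure signal for this workload.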

We have been seeing a slightly different scenario. There are not many
topics (1 topic with 1k partitions), but there are millions of
publishers. So when a rolling restart or something similar happens,
some of the broker's io threads get saturated with lookup requests,
and they won't work through the backlog before the client's timeout
period expires (which, in the current code, results in the client
crashing and retrying). This is the scenario targeted by the lookup
timeout change.
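The client-side failure mode can be sketched as follows (illustrative code, not Pulsar's actual client internals): the lookup response never arrives before the deadline, so the pending request completes exceptionally with a timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo of a lookup whose response never comes back,
// e.g. because the broker's io thread is buried under a backlog.
class LookupTimeoutDemo {
    static String attemptLookup(long timeoutMillis) {
        CompletableFuture<String> pendingLookup = new CompletableFuture<>();
        // orTimeout (Java 9+) completes the future exceptionally with
        // TimeoutException if no response arrives in time.
        pendingLookup.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return pendingLookup.join();
        } catch (CompletionException e) {
            // Before the lookup-timeout change, this surfaced as a fatal
            // error to the application; with it, the client retries the
            // lookup instead.
            return "timed out: " + e.getCause().getClass().getSimpleName();
        }
    }
}
```

The point of the paragraph above is that the retry alone is not enough when the problem is the connection rather than load, which is what the next scenario shows.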

However, we have also seen a case where the brokers were not
overloaded at all, but one broker was just acting up (or the GCP load
balancer was blackholing the connection; we never got to the root
cause). From the client's PoV, this is indistinguishable from the
overloaded-broker scenario. With the lookup timeout change, the lookup
will be retried. However, it will be retried on the same connection,
so it will fail again.

This is the same reason max rejected requests was added in
https://github.com/apache/pulsar/pull/274. My proposal is to extend
it, so that timeouts on lookup-type requests also close the
connection (and hopefully allow establishment of a connection to a
working broker).
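A minimal sketch of the proposed extension (illustrative names, not Pulsar's actual implementation): count lookup failures per connection, and once a threshold is exceeded, close the connection so the next attempt reconnects, ideally landing on a healthy broker.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-connection health tracker, generalizing the
// "max rejected requests" idea from PR #274 to also cover timeouts.
class ConnectionHealthTracker {
    private final int maxFailuresPerConnection;
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean closed = false;

    ConnectionHealthTracker(int maxFailuresPerConnection) {
        this.maxFailuresPerConnection = maxFailuresPerConnection;
    }

    /** Called when a lookup is rejected with TooManyRequests or times out. */
    void onLookupFailure() {
        if (failures.incrementAndGet() > maxFailuresPerConnection) {
            closeConnection();
        }
    }

    /** A successful response proves the connection works; reset the count. */
    void onLookupSuccess() {
        failures.set(0);
    }

    private void closeConnection() {
        // In the real client this would tear down the TCP connection,
        // forcing a fresh one (possibly to a different broker).
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}
```

Resetting the counter on success matters: it distinguishes a transiently busy broker (occasional failures interleaved with successes) from a broken connection (consecutive failures), so only the latter triggers a reconnect.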

-Ivan

Re: Treating lookup request timeout as equivalent to TooManyRequests

Posted by Rajan Dhabalia <rd...@apache.org>.
> so that timeouts on lookup type requests also closes the connection (and
> hopefully allows establishment of connection on a working broker).

+1.

Thanks,
Rajan
