Posted to dev@pulsar.apache.org by Ivan Kelly <iv...@apache.org> on 2021/08/05 18:07:28 UTC

Lack of retries on TooManyRequests

Hi folks,

I'm currently digging into a customer issue we've seen where the retry
logic isn't working well. Basically, they have a ton of producers and
a load of partitions, and when they all connect at the same time, they
bury the brokers for a few minutes. I'm looking at this from the
consumer point of view. The consumer is a Flink pipeline, so one
consumer failing brings down the whole pipeline, meaning all of the
consumers flood back and try to connect again.

Anyhow, the bit that I'm confused about is the logic here:
https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
Basically, TooManyRequests is being treated as a non-retriable error.
I think it should be retried with exponential backoff and jitter, but
I'm wondering if there's any strong reason it was specifically
excluded from retry.
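
For illustration, here is a minimal sketch of the retry policy I have in
mind (not actual client code; doLookup() and TooManyRequestsException are
placeholders):

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch only: doLookup() and TooManyRequestsException stand in for
    // the real client internals.
    void lookupWithRetry() throws InterruptedException {
        long backoffMs = 100;              // initial backoff
        final long maxBackoffMs = 60_000;  // cap the backoff at one minute
        while (true) {
            try {
                doLookup();                // hypothetical lookup call
                return;
            } catch (TooManyRequestsException e) {
                // Full jitter: sleep a random time in [0, backoffMs] so a
                // herd of clients doesn't retry in lockstep after a restart.
                Thread.sleep(ThreadLocalRandom.current().nextLong(backoffMs + 1));
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
    }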

Anyone have any insights?

-Ivan

Re: Lack of retries on TooManyRequests

Posted by Ivan Kelly <iv...@apache.org>.
Thanks Rajan, I will reply on the PR.
https://github.com/apache/pulsar/pull/11627/
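
For anyone skimming the thread, the change under discussion boils down to
something like the sketch below (illustrative only; the enum values mirror
the protocol's ServerError, but the helper itself is not the actual client
code):

    // Sketch: classify TooManyRequests as transient backpressure and let
    // the normal backoff/reconnect path handle it; only genuinely fatal
    // errors should stop the retry loop.
    static boolean isRetriable(ServerError error) {
        switch (error) {
            case AuthenticationError:
            case AuthorizationError:
                return false;        // fatal: retrying won't help
            case TooManyRequests:    // broker (or client) backpressure
            default:
                return true;         // retry with backoff and jitter
        }
    }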

On Wed, Aug 11, 2021 at 10:06 AM Rajan Dhabalia <dh...@gmail.com> wrote:
>
> The history behind introducing the TooManyRequests error is to handle
> backpressure for ZooKeeper by throttling the large number of concurrent
> topic loads during a broker cold restart. Therefore, Pulsar has lookup
> throttling on both the client and server side that slows down lookups,
> because a lookup ultimately triggers topic loading on the server side.
> So, when a client sees TooManyRequests errors, the client should retry
> the operation, and the client will eventually reconnect to the broker.
> TooManyRequests cannot harm the broker because the broker already has a
> safeguard to reject a flood of requests. I am not sure what problem PR
> https://github.com/apache/pulsar/pull/6584 tries to solve, but it
> should not solve it by making TooManyRequests non-retriable.
> TooManyRequests is a retriable error and the client should retry. Also,
> it should definitely not close the producer/consumer due to this error;
> otherwise it can bring down the entire application, which depends on
> the availability of the Pulsar client entities. A Pulsar lookup is an
> operation similar to other operations such as connect, publish,
> subscribe, etc. So, I don't think it needs special treatment with a
> separate timeout config, and we can avoid the complexity introduced in
> PR #11627, which caches and depends on the previously seen exception
> for lookup retry. Anyway, removing TooManyRequests from the
> non-retriable error list will simplify the client behavior, and we can
> avoid the complexity of PR #11627
> <https://github.com/apache/pulsar/pull/11627/>.
>
> Thanks,
> Rajan
>
> On Mon, Aug 9, 2021 at 12:54 AM Ivan Kelly <iv...@apache.org> wrote:
>
> > > Suppose you have about a million topics and your Pulsar cluster goes down
> > > (say, ZK is down). Many millions of producers and consumers are now
> > > anxiously awaiting the cluster to come back. A fun experience for the
> > > first broker that comes up. Don't ask me how I know; ref blame
> > > ServerCnx.java#L429
> > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L429>.
> > > The broker limit was added to get through a cold restart.
> >
> > Ok. Makes sense. The scenarios we've been seeing issues with have had
> > modest numbers of topics.
> >
> > -Ivan
> >

Re: Lack of retries on TooManyRequests

Posted by Rajan Dhabalia <dh...@gmail.com>.
The history behind introducing the TooManyRequests error is to handle
backpressure for ZooKeeper by throttling the large number of concurrent
topic loads during a broker cold restart. Therefore, Pulsar has lookup
throttling on both the client and server side that slows down lookups,
because a lookup ultimately triggers topic loading on the server side.
So, when a client sees TooManyRequests errors, the client should retry
the operation, and the client will eventually reconnect to the broker.
TooManyRequests cannot harm the broker because the broker already has a
safeguard to reject a flood of requests. I am not sure what problem PR
https://github.com/apache/pulsar/pull/6584 tries to solve, but it
should not solve it by making TooManyRequests non-retriable.
TooManyRequests is a retriable error and the client should retry. Also,
it should definitely not close the producer/consumer due to this error;
otherwise it can bring down the entire application, which depends on
the availability of the Pulsar client entities. A Pulsar lookup is an
operation similar to other operations such as connect, publish,
subscribe, etc. So, I don't think it needs special treatment with a
separate timeout config, and we can avoid the complexity introduced in
PR #11627, which caches and depends on the previously seen exception
for lookup retry. Anyway, removing TooManyRequests from the
non-retriable error list will simplify the client behavior, and we can
avoid the complexity of PR #11627
<https://github.com/apache/pulsar/pull/11627/>.

Thanks,
Rajan

On Mon, Aug 9, 2021 at 12:54 AM Ivan Kelly <iv...@apache.org> wrote:

> > Suppose you have about a million topics and your Pulsar cluster goes down
> > (say, ZK is down). Many millions of producers and consumers are now
> > anxiously awaiting the cluster to come back. A fun experience for the
> > first broker that comes up. Don't ask me how I know; ref blame
> > ServerCnx.java#L429
> > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L429>.
> > The broker limit was added to get through a cold restart.
>
> Ok. Makes sense. The scenarios we've been seeing issues with have had
> modest numbers of topics.
>
> -Ivan
>

Re: Lack of retries on TooManyRequests

Posted by Ivan Kelly <iv...@apache.org>.
> Suppose you have about a million topics and your Pulsar cluster goes down
> (say, ZK is down). Many millions of producers and consumers are now
> anxiously awaiting the cluster to come back. A fun experience for the
> first broker that comes up. Don't ask me how I know; ref blame
> ServerCnx.java#L429
> <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L429>.
> The broker limit was added to get through a cold restart.

Ok. Makes sense. The scenarios we've been seeing issues with have had
modest numbers of topics.

-Ivan

Re: Lack of retries on TooManyRequests

Posted by Joe F <jo...@apache.org>.
Suppose you have about a million topics and your Pulsar cluster goes down
(say, ZK is down). Many millions of producers and consumers are now
anxiously awaiting the cluster to come back. A fun experience for the
first broker that comes up. Don't ask me how I know; ref blame
ServerCnx.java#L429
<https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L429>.
The broker limit was added to get through a cold restart.

-j


On Fri, Aug 6, 2021 at 12:29 PM Ivan Kelly <iv...@apache.org> wrote:

> Inline
>
> > In that scenario,
> > should we block, or fail fast and let the application decide, which is
> > what we do today? Also, should we distinguish between the two scenarios,
> > i.e., the broker sending the error vs. the client throwing it internally?
> While I agree that the client limit is different from the broker limit,
> how likely are we to hit the client limit? 50k lookups is a lot. How
> many topics/partitions will a single client be talking to?
>
> Broker level limiting is a funny one. What we've seen is that
> TooManyRequests will only trigger if the server has to go to ZooKeeper
> to look up the topic. Otherwise, if the broker has cached the
> assignment, you'll never hit TooManyRequests, as the handler is pretty
> much synchronous from that point. What is more likely to happen is
> that the request will time out as it is queued in the TCP queue while
> waiting for other lookups to be processed. So TooManyRequests and
> request timeout are basically equivalent in the bad case.
>
> In terms of what the client should do, it should probably be
> configurable. In most cases, the default will be to block. The client
> isn't going to go "oh well, Pulsar is down, time to go home". More
> likely, if we error out, the process will crash, restart, and try the
> same thing again.
>
> -Ivan
>

Re: Lack of retries on TooManyRequests

Posted by Ivan Kelly <iv...@apache.org>.
Inline

> In that scenario,
> should we block, or fail fast and let the application decide, which is
> what we do today? Also, should we distinguish between the two scenarios,
> i.e., the broker sending the error vs. the client throwing it internally?
While I agree that the client limit is different from the broker limit,
how likely are we to hit the client limit? 50k lookups is a lot. How
many topics/partitions will a single client be talking to?

Broker level limiting is a funny one. What we've seen is that
TooManyRequests will only trigger if the server has to go to ZooKeeper
to look up the topic. Otherwise, if the broker has cached the
assignment, you'll never hit TooManyRequests, as the handler is pretty
much synchronous from that point. What is more likely to happen is
that the request will time out as it is queued in the TCP queue while
waiting for other lookups to be processed. So TooManyRequests and
request timeout are basically equivalent in the bad case.

In terms of what the client should do, it should probably be
configurable. In most cases, the default will be to block. The client
isn't going to go "oh well, Pulsar is down, time to go home". More
likely, if we error out, the process will crash, restart, and try the
same thing again.
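
Roughly, the two behaviours would look like this (a sketch; a plain
Semaphore stands in for the client's pending-lookup counter, and the
exception type is a placeholder):

    import java.util.concurrent.Semaphore;

    // Sketch: the semaphore stands in for the client's pending-lookup
    // counter, capped at the client-side limit.
    Semaphore pendingLookups = new Semaphore(50_000);

    void acquireLookupSlot(boolean failFast) throws Exception {
        if (failFast) {
            // Today's behaviour: surface the limit to the application.
            if (!pendingLookups.tryAcquire()) {
                throw new TooManyRequestsException("client lookup limit hit");
            }
        } else {
            // Proposed default: block until a slot frees up.
            pendingLookups.acquire();
        }
    }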

-Ivan

Re: Lack of retries on TooManyRequests

Posted by Jerry Peng <je...@gmail.com>.
Currently, there are two ways to get TooManyRequests errors in the
client.

1) The client enforces a maximum number of pending lookups:

https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L733

The max number can be set when creating the Pulsar client via
"maxLookupRequests()" and defaults to 50000. The client will internally
throw a "TooManyRequests" exception if that limit is exceeded.


2) The broker also enforces a limit on pending lookup requests:

https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L429

If the number of pending lookup requests on the broker exceeds the set
limit, a "TooManyRequests" error is sent back to the client.


I agree. I am not sure why we are not automatically retrying with backoff
on TooManyRequests errors in the client, especially if the error is sent
from the broker. However, what should the behavior of the client be if the
client hits the max pending lookup limit set in the client? In that
scenario, should we block, or fail fast and let the application decide,
which is what we do today? Also, should we distinguish between the two
scenarios, i.e., the broker sending the error vs. the client throwing it
internally?


Best,

Jerry

On Fri, Aug 6, 2021 at 10:50 AM Ivan Kelly <iv...@apache.org> wrote:

> Another strangeness I've seen with partition metadata requests.
>
>
> https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java#L886
>
> It takes a backoff, and it does retries, but the opTimeoutMs variable
> is only decremented by the retry delay, not by the time spent making the
> request. So if the request is timing out after the read timeout (30s)
> or the connect timeout (10s), that time isn't taken from
> opTimeoutMs. I would expect opTimeout to be the maximum time spent
> waiting for a request to complete. This is set to 30s by default.
> However, in this case, it can be much, much more. I suspect that fixing
> this will result in a lot more people seeing timeouts though :/
>
> -Ivan
>
> On Thu, Aug 5, 2021 at 7:07 PM Ivan Kelly <iv...@apache.org> wrote:
> >
> > Hi folks,
> >
> > I'm currently digging into a customer issue we've seen where the retry
> > logic isn't working well. Basically, they have a ton of producers and
> > a load of partitions, and when they all connect at the same time, they
> > bury the brokers for a few minutes. I'm looking at this from the
> > consumer point of view. The consumer is a Flink pipeline, so one
> > consumer failing brings down the whole pipeline, meaning all of the
> > consumers flood back and try to connect again.
> >
> > Anyhow, the bit that I'm confused about is the logic here:
> >
> https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
> > Basically, TooManyRequests is being treated as a non-retriable error.
> > I think it should be retried with exponential backoff and jitter, but
> > I'm wondering if there's any strong reason it was specifically
> > excluded from retry.
> >
> > Anyone have any insights?
> >
> > -Ivan
>

Re: Lack of retries on TooManyRequests

Posted by Ivan Kelly <iv...@apache.org>.
Another strangeness I've seen with partition metadata requests.

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java#L886

It takes a backoff, and it does retries, but the opTimeoutMs variable
is only decremented by the retry delay, not by the time spent making the
request. So if the request is timing out after the read timeout (30s)
or the connect timeout (10s), that time isn't taken from
opTimeoutMs. I would expect opTimeout to be the maximum time spent
waiting for a request to complete. This is set to 30s by default.
However, in this case, it can be much, much more. I suspect that fixing
this will result in a lot more people seeing timeouts though :/
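
What I would expect instead is deadline-based accounting, something along
these lines (a sketch, not the actual PulsarClientImpl code;
getPartitionedTopicMetadata(), RetriableException and the backoff object
are stand-ins):

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch: charge the time spent on each attempt against the operation
    // timeout, not just the backoff delay between attempts.
    void getMetadataWithDeadline(long opTimeoutMs) throws Exception {
        final long deadlineNanos =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(opTimeoutMs);
        while (true) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(
                    deadlineNanos - System.nanoTime());
            if (remainingMs <= 0) {
                throw new TimeoutException("partition metadata request timed out");
            }
            try {
                getPartitionedTopicMetadata(remainingMs); // per-attempt budget
                return;
            } catch (RetriableException e) {
                Thread.sleep(backoff.next()); // hypothetical backoff policy
            }
        }
    }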

-Ivan

On Thu, Aug 5, 2021 at 7:07 PM Ivan Kelly <iv...@apache.org> wrote:
>
> Hi folks,
>
> I'm currently digging into a customer issue we've seen where the retry
> logic isn't working well. Basically, they have a ton of producers and
> a load of partitions, and when they all connect at the same time, they
> bury the brokers for a few minutes. I'm looking at this from the
> consumer point of view. The consumer is a Flink pipeline, so one
> consumer failing brings down the whole pipeline, meaning all of the
> consumers flood back and try to connect again.
>
> Anyhow, the bit that I'm confused about is the logic here:
> https://github.com/apache/pulsar/commit/ca7bb998436a32f7f919567d4b62d42b994363f8
> Basically, TooManyRequests is being treated as a non-retriable error.
> I think it should be retried with exponential backoff and jitter, but
> I'm wondering if there's any strong reason it was specifically
> excluded from retry.
>
> Anyone have any insights?
>
> -Ivan