Posted to dev@pulsar.apache.org by Lari Hotari <lh...@apache.org> on 2023/11/03 09:55:35 UTC

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Hi Girish,

In order to address your problem described in the PIP document [1], it might be necessary to make improvements in how rate limiters apply backpressure in Pulsar.

Pulsar mainly uses TCP/IP connection-level controls for achieving backpressure. The challenge is that Pulsar can share a single TCP/IP connection across multiple producers and consumers. Because of this, multiple producers, consumers and rate limiters may operate on the same connection on the broker, and they will make conflicting decisions, which results in undesired behavior.
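
To make the connection-level mechanism concrete, here is a rough sketch (illustrative only, not the actual broker code; only Channel#config().setAutoRead(boolean) is the real Netty API) of why throttling is all-or-nothing per connection:

import io.netty.channel.ChannelHandlerContext;

// Rough sketch: the broker's only backpressure lever is per Netty channel.
// Pausing reads because ONE producer exceeded its rate limit also pauses
// every other producer and consumer multiplexed on the same connection.
final class ConnectionBackpressureSketch {

    void pause(ChannelHandlerContext ctx) {
        // Stop reading from the socket; TCP flow control then pushes back on the client.
        ctx.channel().config().setAutoRead(false);
    }

    void resume(ChannelHandlerContext ctx) {
        ctx.channel().config().setAutoRead(true);
    }
}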

Regarding the shared TCP/IP connection backpressure issue, Apache Flink had a somewhat similar problem before Flink 1.5. It is described in the "inflicting backpressure" section of this blog post from 2019:
https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
Flink solved the issue of multiplexing multiple streams of data on a single TCP/IP connection in Flink 1.5 by introducing its own flow control mechanism.

The backpressure and rate limiting challenges have been discussed a few times in Pulsar community meetings over the past years. There was also a generic backpressure thread on the dev mailing list [2] in Sep 2022. 
However, we haven't really documented Pulsar's backpressure design, how rate limiters are part of the overall solution, or how we could improve it. 
I think it might be time to do so since there's a requirement to improve rate limiting. I guess that's also the main motivation for PIP-310.

-Lari

1 - https://github.com/apache/pulsar/pull/21399/files
2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271

On 2023/10/19 12:51:14 Girish Sharma wrote:
> Hi,
> Currently, there are only 2 kinds of publish rate limiters - polling based
> and precise. Users have an option to use either one of them in the topic
> publish rate limiter, but the resource group rate limiter only uses polling
> one.
> 
> There are challenges with both the rate limiters and the fact that we can't
> use precise rate limiter in the resource group level.
> 
> Thus, in order to support custom rate limiters, I've created the PIP-310
> 
> This is the discussion thread. Please go through the PIP and provide your
> inputs.
> 
> Link - https://github.com/apache/pulsar/pull/21399
> 
> Regards
> -- 
> Girish Sharma
> 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
On Sat, Nov 4, 2023 at 9:21 PM Lari Hotari <lh...@apache.org> wrote:

> One additional question:
>
> In your use case, do you have multiple producers concurrently producing to
> the same topic from different clients?
>
> The use case is challenging in the current implementation when using topic
> producing rate limiting. The problem is that the different producers will
> be able to send messages at very different rates since there isn't a
> solution to ensure fairness across multiple producers in the topic producer
> rate limiting solution. This is something that should be addressed when
> improving rate limiting.
>

We haven't personally seen this pattern of different logical producers
(app/object) producing to the same topic. I feel like this is an
anti-pattern and goes against the homogeneous nature of a topic.
Even if such use cases arise, they can easily be handled by
moving the different producers to different topics and having the consumers
subscribe to more than one topic in case they need data from all of those
topics.

Regards


>
> -Lari
>
> la 4. marrask. 2023 klo 17.24 Lari Hotari <lh...@apache.org> kirjoitti:
>
> > Replies inline
> >
> > On Fri, 3 Nov 2023 at 20:48, Girish Sharma <sc...@gmail.com>
> > wrote:
> >
> >> Could you please elaborate more on these details? Here are some
> questions:
> >> > 1. What do you mean that it is too strict?
> >> >     - Should the rate limiting allow bursting over the limit for some
> >> time?
> >> >
> >>
> >> That's one of the major use cases, yes.
> >>
> >
> > One possibility would be to improve the existing rate limiter to allow
> > bursting.
> > I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
> > use cases instead of users having to implement their own rate limiter
> > algorithm.
> > The problems you are describing seem to be common to many Pulsar use
> > cases, and therefore, I think they should be handled directly in Pulsar.
> >
> > Optimally, there would be a single solution that abstracts the rate
> > limiting in a way where it does the right thing based on the declarative
> > configuration.
> > I would prefer that over having a pluggable solution for rate limiter
> > implementations.
> >
> > What would help is getting deeper in the design of the rate limiter
> > itself, without limiting ourselves to the existing rate limiter
> > implementation in Pulsar.
> >
> > In textbooks, there are algorithms such as "leaky bucket" [1] and "token
> > bucket" [2]. Both algorithms have several variations and in some ways
> they
> > are very similar algorithms but looking from the different point of view.
> > It would possibly be easier to conceptualize and understand a rate
> limiting
> > algorithm if common algorithm names and implementation choices mentioned
> in
> > textbooks would be referenced in the implementation.
> > It seems that a "token bucket" type of algorithm can be used to implement
> > rate limiting with bursting. In the token bucket algorithm, the size of
> the
> > token bucket defines how large bursts will be allowed. The design could
> > also be something where 2 rate limiters with different type of algorithms
> > and/or configuration parameters are combined to achieve a desired
> behavior.
> > For example, to achieve a rate limiter with bursting and a fixed maximum
> > rate.
> > By default, the token bucket algorithm doesn't enforce a maximum rate for
> > bursts, but that could be achieved by chaining 2 rate limiters if that is
> > really needed.
> >
> > The current Pulsar rate limiter implementation could be implemented in a
> > cleaner way, which would also be more efficient. Instead of having a
> > scheduler call a method once per second, I think that the rate limiter
> > could be implemented in a reactive way where the algorithm is implemented
> > without a scheduler.
> > I wonder if there are others that would be interested in getting down
> into
> > such implementation details?
> >
> > 1 - https://en.wikipedia.org/wiki/Leaky_bucket
> > 2 - https://en.wikipedia.org/wiki/Token_bucket
> >
> >
> > > 2. What type of data loss are you experiencing?
> >> >
> >>
> >> Messages produced by the producers which eventually get timed out due to
> >> rate limiting.
> >>
> >
> > Are you able to slow down producing on the client side? If that is
> > possible, there could be ways to improve client-side back
> > pressure with the Pulsar client. Currently, the client doesn't expose this
> > information until the sending blocks or fails with an exception
> > (ProducerQueueIsFullError). Optimally, the client should slow down the
> rate
> > of producing to the rate that it can actually send to the broker.
> > Just curious if you have considered turning off producing timeouts on the
> > client side completely or making them longer? Would that address the data
> > loss problem?
> > Or is your event/message source "hot" so that you cannot stop or slow it
> > down, and it will just keep on flowing with a certain rate?
> >
> > > I think the core implementation of how the broker fails fast at the
> time
> >> of rate limiting (whether it is by pausing netty channel or a new
> permits
> >> based model) does not change the actual issue I am targeting.
> Multiplexing
> >> has some impact on it - but yet again only limited, and can easily be
> >> fixed
> >> by the client by increasing the connections per broker. Even after
> >> assuming
> >> both these things are somehow "fixed", the fact remains that an
> absolutely
> >> strict rate limiter will lead to the above mentioned data loss for burst
> >> going above the limit and that a poller based rate limiter doesn't
> really
> >> rate limit anything as it allows all produce in the first interval of
> the
> >> next second.
> >>
> >
> > Yes, it makes sense to have bursting configuration parameters in the rate
> > limiter.
> > As mentioned above, I think we could be improving the existing rate
> > limiter in Pulsar to cover 99% of the use case by making it stable and by
> > including the bursting configuration options.
> > Is there additional functionality you feel the rate limiter needs beyond
> > bursting support?
> >
> > One way to workaround the multiplexing problem would be to add a client
> > side option for producers and consumers, where you could specify that the
> > client picks a separate TCP/IP connection that is not shared and isn't
> from
> > the connection pool.
> > Preventing connection multiplexing seems to be the only way to make the
> > current rate limiting deterministic and stable without adding the
> explicit
> > flow control to the Pulsar binary protocol for producers.
> >
> > Are there other community members with input on the design and
> > implementation of an improved rate limiter?
> > I’m eager to continue this conversation and work together towards a
> robust
> > solution.
> >
> > -Lari
> >
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
One additional question:

In your use case, do you have multiple producers concurrently producing to
the same topic from different clients?

The use case is challenging in the current implementation when using topic
producing rate limiting. The problem is that the different producers will
be able to send messages at very different rates since there isn't a
solution to ensure fairness across multiple producers in the topic producer
rate limiting solution. This is something that should be addressed when
improving rate limiting.

-Lari

la 4. marrask. 2023 klo 17.24 Lari Hotari <lh...@apache.org> kirjoitti:

> Replies inline
>
> On Fri, 3 Nov 2023 at 20:48, Girish Sharma <sc...@gmail.com>
> wrote:
>
>> Could you please elaborate more on these details? Here are some questions:
>> > 1. What do you mean that it is too strict?
>> >     - Should the rate limiting allow bursting over the limit for some
>> time?
>> >
>>
>> That's one of the major use cases, yes.
>>
>
> One possibility would be to improve the existing rate limiter to allow
> bursting.
> I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
> use cases instead of users having to implement their own rate limiter
> algorithm.
> The problems you are describing seem to be common to many Pulsar use
> cases, and therefore, I think they should be handled directly in Pulsar.
>
> Optimally, there would be a single solution that abstracts the rate
> limiting in a way where it does the right thing based on the declarative
> configuration.
> I would prefer that over having a pluggable solution for rate limiter
> implementations.
>
> What would help is getting deeper in the design of the rate limiter
> itself, without limiting ourselves to the existing rate limiter
> implementation in Pulsar.
>
> In textbooks, there are algorithms such as "leaky bucket" [1] and "token
> bucket" [2]. Both algorithms have several variations and in some ways they
> are very similar algorithms but looking from the different point of view.
> It would possibly be easier to conceptualize and understand a rate limiting
> algorithm if common algorithm names and implementation choices mentioned in
> textbooks would be referenced in the implementation.
> It seems that a "token bucket" type of algorithm can be used to implement
> rate limiting with bursting. In the token bucket algorithm, the size of the
> token bucket defines how large bursts will be allowed. The design could
> also be something where 2 rate limiters with different type of algorithms
> and/or configuration parameters are combined to achieve a desired behavior.
> For example, to achieve a rate limiter with bursting and a fixed maximum
> rate.
> By default, the token bucket algorithm doesn't enforce a maximum rate for
> bursts, but that could be achieved by chaining 2 rate limiters if that is
> really needed.
>
> The current Pulsar rate limiter implementation could be implemented in a
> cleaner way, which would also be more efficient. Instead of having a
> scheduler call a method once per second, I think that the rate limiter
> could be implemented in a reactive way where the algorithm is implemented
> without a scheduler.
> I wonder if there are others that would be interested in getting down into
> such implementation details?
>
> 1 - https://en.wikipedia.org/wiki/Leaky_bucket
> 2 - https://en.wikipedia.org/wiki/Token_bucket
>
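
For reference, here is a minimal token bucket sketch (purely illustrative, not Pulsar code): tokens are replenished lazily from the elapsed time on each call, so no scheduler is needed, and the bucket capacity determines how large a burst can be absorbed. Chaining two such buckets (one for the long-term average rate with a large capacity, one for the absolute maximum rate with a small capacity) and requiring both to grant tokens gives bursting with a hard cap.

// Minimal token bucket sketch: ratePerSecond = tokens added per second,
// capacity = maximum burst size. Tokens are refilled lazily based on elapsed
// time, so no background scheduler is required for the calculations.
final class TokenBucketSketch {
    private final double ratePerSecond;
    private final double capacity;
    private double tokens;
    private long lastRefillNanos;

    TokenBucketSketch(double ratePerSecond, double capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if the permits were granted; false means the caller should
    // pause (for example by disabling auto-read on the channel) until tokens
    // have been replenished.
    synchronized boolean tryAcquire(long permits) {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= permits) {
            tokens -= permits;
            return true;
        }
        return false;
    }
}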
>
> > 2. What type of data loss are you experiencing?
>> >
>>
>> Messages produced by the producers which eventually get timed out due to
>> rate limiting.
>>
>
> Are you able to slow down producing on the client side? If that is
> possible, there could be ways to improve client-side back
> pressure with the Pulsar client. Currently, the client doesn't expose this
> information until the sending blocks or fails with an exception
> (ProducerQueueIsFullError). Optimally, the client should slow down the rate
> of producing to the rate that it can actually send to the broker.
> Just curious if you have considered turning off producing timeouts on the
> client side completely or making them longer? Would that address the data
> loss problem?
> Or is your event/message source "hot" so that you cannot stop or slow it
> down, and it will just keep on flowing with a certain rate?
>
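
As a side note, the existing producer configuration already offers knobs that trade timeouts for blocking, which is one way to avoid the message loss described above (a sketch with example values, not a recommendation for every workload):

import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class BlockingProducerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://my-tenant/my-ns/my-topic") // example topic
                // 0 disables the send timeout, so throttled sends wait instead of failing.
                .sendTimeout(0, TimeUnit.SECONDS)
                // Block the caller instead of throwing ProducerQueueIsFullError
                // when the pending-message queue fills up while the broker throttles.
                .blockIfQueueFull(true)
                // Bound the client-side buffer so backpressure reaches the application earlier.
                .maxPendingMessages(1000)
                .create();

        producer.send("hello".getBytes());
        producer.close();
        client.close();
    }
}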
> > I think the core implementation of how the broker fails fast at the time
>> of rate limiting (whether it is by pausing netty channel or a new permits
>> based model) does not change the actual issue I am targeting. Multiplexing
>> has some impact on it - but yet again only limited, and can easily be
>> fixed
>> by the client by increasing the connections per broker. Even after
>> assuming
>> both these things are somehow "fixed", the fact remains that an absolutely
>> strict rate limiter will lead to the above mentioned data loss for burst
>> going above the limit and that a poller based rate limiter doesn't really
>> rate limit anything as it allows all produce in the first interval of the
>> next second.
>>
>
> Yes, it makes sense to have bursting configuration parameters in the rate
> limiter.
> As mentioned above, I think we could be improving the existing rate
> limiter in Pulsar to cover 99% of the use case by making it stable and by
> including the bursting configuration options.
> Is there additional functionality you feel the rate limiter needs beyond
> bursting support?
>
> One way to workaround the multiplexing problem would be to add a client
> side option for producers and consumers, where you could specify that the
> client picks a separate TCP/IP connection that is not shared and isn't from
> the connection pool.
> Preventing connection multiplexing seems to be the only way to make the
> current rate limiting deterministic and stable without adding the explicit
> flow control to the Pulsar binary protocol for producers.
>
> Are there other community members with input on the design and
> implementation of an improved rate limiter?
> I’m eager to continue this conversation and work together towards a robust
> solution.
>
> -Lari
>

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Asaf Mesika <as...@gmail.com>.
I just want to add one thing to the mix here.

You can see from the number of plugin interfaces Pulsar has that somebody
"left the door open" for too long.
You'll agree with me that the number of those interfaces is not normal for
any open source software. I know HBase, for example, and Kafka - I've never
seen so many in them.

You can also see the lack of attention to code quality and high-level
design in the poor implementation of the current rate limiter.

The feeling is: I just need this tiny little thing and I don't have time -
so over time Pulsar got into this unmaintainable mess of public APIs, and
some parts are simply unreadable - such as the rate limiters. I *still*
don't understand how rate limiting works in Pulsar, even after reading the
background and browsing quickly through the code.

I can see the people on this thread are highly talented - let's use this to
make Pulsar better, both from a bird's-eye view and for your own
personal requirements.


On Tue, Nov 7, 2023 at 3:26 PM Girish Sharma <sc...@gmail.com>
wrote:

> Hello Lari, replies inline.
>
> I will also be going through some textbook rate limiters (the one you
> shared, plus others) and propose the one that at least suits our needs in
> the next reply.
>
> On Tue, Nov 7, 2023 at 2:49 PM Lari Hotari <lh...@apache.org> wrote:
>
>>
>> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
>> meeting notes can be found at
>> https://github.com/apache/pulsar/wiki/Community-Meetings .
>>
>>
> Would it make sense for me to join this time given that you are skipping
> it?
>
>
>>
>> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
>> metrics via Prometheus. There might be other ways to provide this
>> information for components that could react to this.
>> For example, it could be a system topic where these rate limiters emit
>> events.
>>
>>
> Are there any other system topics than `tenant/namespace/__change_events`?
> While it's an improvement over querying metrics, it would still mean one
> consumer per namespace and would form a cyclic dependency - for example, in
> case a broker is degrading due to misuse of bursting, it might lead to
> delays in the consumption of the event from the __change_events topic.
>
> I agree. I just brought up this example to ensure that your
>> expectation about bursting isn't about controlling the rate limits
>> based on situational information, such as end-to-end latency
>> information.
>> Such a feature could be useful, but it does complicate things.
>> However, I think it's good to keep this on the radar since this might
>> be needed to solve some advanced use cases.
>>
>>
> I still envision auto-scaling to be admin API driven rather than produce
> throughput driven. That way, it remains deterministic in nature. But it
> probably doesn't make sense to even talk about it until (partition)
> scale-down is possible.
>
>
>>
>> >    - A producer(s) is producing at a near constant rate into a topic,
>> with
>> >    equal distribution among partitions. Due to a hiccup in their
>> downstream
>> >    component, the produce rate goes to 0 for a few seconds, and thus, to
>> >    compensate, in the next few seconds, the produce rate tries to
>> double up.
>>
>> Could you also elaborate on details such as what is the current
>> behavior of Pulsar rate limiting / throttling solution and what would
>> be the desired behavior?
>> Just guessing that you mean that the desired behavior would be to
>> allow the produce rate to double up for some time (configurable)?
>> Compared to what rate is it doubled?
>> Please explain in detail what the current and desired behaviors would
>> be so that it's easier to understand the gap.
>>
>
> In all of the 3 cases that I listed, the current behavior, with precise
> rate limiting enabled, is to pause the Netty channel when the throughput
> breaches the set limits. This eventually leads to timeouts on the client
> side when the burst lasts significantly longer than the configured send
> timeout on the producer side.
>
> The desired behavior in all three situations is to have a multiplier-based
> bursting capability for a fixed duration. For example, it could be that a
> Pulsar topic would be able to support 1.5x of the set quota for a burst
> duration of up to 5 minutes. There also needs to be a cooldown period, such
> that it would only accept one such burst every X minutes, say every 1 hour.
>
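
For concreteness, the policy described above boils down to a handful of parameters; a hypothetical holder (illustrative names, not an existing Pulsar API) could look like this:

// Hypothetical policy object capturing the requirement; not an existing Pulsar class.
final class BurstPolicySketch {
    final double burstMultiplier;       // e.g. 1.5  => allow up to 1.5x of the configured quota
    final long maxBurstDurationSeconds; // e.g. 300  => a burst may last up to 5 minutes
    final long burstCooldownSeconds;    // e.g. 3600 => at most one burst per hour

    BurstPolicySketch(double burstMultiplier, long maxBurstDurationSeconds, long burstCooldownSeconds) {
        this.burstMultiplier = burstMultiplier;
        this.maxBurstDurationSeconds = maxBurstDurationSeconds;
        this.burstCooldownSeconds = burstCooldownSeconds;
    }
}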
>
>>
>> >    - In a visitor based produce rate (where produce rate goes up in the
>> day
>> >    and goes down in the night, think in terms of popular website hourly
>> view
>> >    counts pattern) , there are cases when, due to certain
>> external/internal
>> >    triggers, the views - and thus - the produce rate spikes for a few
>> minutes.
>>
>> Again, please explain the current behavior and desired behavior.
>> Explicit example values of number of messages, bandwidth, etc. would
>> also be helpful details.
>>
>
> Adding to what I wrote above, think of this pattern like the following:
> the produce rate slowly increases from ~2MBps at around 4 AM to a known
> peak of about 30MBps by 4 PM and stays around that peak until 9 PM, after
> which it starts decreasing again until it reaches ~2MBps around 2 AM.
> Now, due to some external triggers, maybe a scheduled sale event, at 10 PM
> the quota may spike up to 40MBps for 4-5 minutes and then go back
> down to the usual ~20MBps. Here is a rough image showcasing the trend.
> [image: image.png]
>
>
>>
>> >    It is also important to keep this in check so as to not allow bots
>> to do
>> >    DDOS into your system, while that might be a responsibility of an
>> upstream
>> >    system like API gateway, but we cannot be ignorant about that
>> completely.
>>
>> what would be the desired behavior?
>>
>
> The desired behavior is that the burst support should be short lived (5-10
> minutes) and limited to a fixed number of bursts in a duration (say - 1
> burst per hour). Obviously, all of these should be configurable, maybe at a
> broker level and not a topic level.
>
>
>>
>> >    - In streaming systems, where there are micro batches, there might be
>> >    constant fluctuations in produce rate from time to time, based on
>> batch
>> >    failure or retries.
>>
>> could you share and examples with numbers about this use case too?
>> explaining current behavior and desired behavior?
>>
>
> It is very common in direct input stream Spark jobs to consume messages in
> micro-batches, process the messages together and then move on to the next
> batch. Thus, consumption is not uniform at a second or sub-minute level.
> Now, in this situation, if a few micro-batches fail due to hardware
> issues, the Spark scheduler may schedule more batches than usual together
> after the issue is resolved, leading to increased consumption.
> While I am talking mostly about consumption, the same job's sink may also
> be a Pulsar topic, leading to the same produce pattern.
>
>
>
>>
>>
>> >
>> > In all of these situations, setting the throughput of the topic to be
>> the
>> > absolute maximum of the various spikes observed during the day is very
>> > suboptimal.
>>
>> btw. A plain token bucket algorithm implementation doesn't have an
>> absolute maximum. The maximum average rate is controlled with the rate
>> of tokens added to the bucket. The capacity of the bucket controls how
>> much buffer there is to spend on spikes. If there's a need to also set
>> an absolute maximum rate, 2 token buckets could be chained to handle
>> that case. The second rate limiter could have an average rate of the
>> absolute maximum with a relatively small buffer (token bucket
>> capacity). There might also be more sophisticated algorithms which
>> vary the maximum rate in some way to smoothen spikes, but that might
>> just be completely unnecessary in Pulsar rate limiters.
>> Switching to use a conceptual model of token buckets will make it a
>> lot easier to reason about the implementation when we come to that
>> point. In the design we can try to "meet in the middle" where the
>> requirements and the design and implementation meet so that the
>> implementation is as simple as possible.
>>
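
A sketch of the chaining idea mentioned above (illustrative only; it reuses the TokenBucketSketch outlined earlier in this thread): both buckets must grant the tokens, so the first enforces the long-term average with a large burst capacity while the second caps the absolute peak rate with a small capacity.

// Dual token bucket sketch: a request is allowed only if BOTH buckets grant it.
final class DualTokenBucketSketch {
    private final TokenBucketSketch averageRateBucket; // e.g. 10MB/s with a large capacity for bursts
    private final TokenBucketSketch peakRateBucket;    // e.g. 15MB/s with a small capacity

    DualTokenBucketSketch(TokenBucketSketch averageRateBucket, TokenBucketSketch peakRateBucket) {
        this.averageRateBucket = averageRateBucket;
        this.peakRateBucket = peakRateBucket;
    }

    boolean tryAcquire(long permits) {
        // Simplification: a real implementation would have to return the tokens
        // taken from the first bucket if the second bucket rejects the request.
        return averageRateBucket.tryAcquire(permits) && peakRateBucket.tryAcquire(permits);
    }
}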
>
> I have yet to go in depth on both the textbook rate limiter models, but if
> the token bucket works on buffered tokens, it may not be very useful. Consider
> a situation where the throughput is always at its absolute configured peak,
> thus never accumulating tokens at all. Again - this is my understanding based
> on the above text. I have yet to go through the wiki page with full details.
> In any case, I am very sure that an absolute maximum bursting limit is
> needed per topic, since at the end of the day, we are limited by hardware
> resources like network bandwidth, disk bandwidth, etc.
>
>
>> > Moreover, in each of these situations, once bursting support is present
>> in
>> > the system, it would also need to have proper checks in place to
>> penalize
>> > the producers from trying to mis-use the system. In a true multi-tenant
>> > platform, this is very critical. Thus, blacklisting actually goes hand
>> in
>> > hand here. Explained more below.
>>
>> I don't yet fully understand this part. What would be an example of a
>> case where a producer should be penalized?
>> When would an attempt to mis-use happen and why? How would you detect
>> mis-use and how would you penalize it?
>>
> Suppose the SOP for bursting support is to support a burst of up to 1.5x the
> produce rate for up to 10 minutes, once in a window of 1 hour.
>
> Now if a particular producer is trying to burst to or beyond 1.5x for more
> than 5 minutes - while we are not accepting the requests and the producer
> client is continuously going into timeouts, there should be a way to
> completely blacklist that producer so that even the client stops attempting
> to produce and we unload the topics altogether - thus releasing open
> connections and broker compute.
>
>
> > This would be a much simpler and efficient model if it were reactive,
>> > based/triggered directly from within the rate limiter component. It can
>> be
>> > fast and responsive, which is very critical when trying to prevent the
>> > system from abuse.
>>
>> I agree that it could be simpler and efficient from some perspectives.
>> Do you have examples of what type of behaviors would you like to have
>> for handling this?
>> A concrete example with numbers could be very useful.
>>
>
> I believe the above example is a good one in terms of expectation, with
> numbers.
>
>
>> > It's only async if the producers use `sendAsync` . Moreover, even if it
>> is
>> > async in nature, practically it ends up being quite linear and
>> > well-homogenised. I am speaking from experience of running 1000s of
>> > partitioned topics in production
>>
> >> I'd assume that most high traffic use cases would use asynchronous
> >> sending in one way or another. If you were sending synchronously, it
> >> would be extremely inefficient if messages were sent one-by-one. The
> >> asynchronicity could also come from multiple threads in an
> >> application.
>>
> >> You are right that a well-homogenised workload reduces the severity of
> >> the multiplexing problem. That's also why the problem of multiplexing
> >> hasn't been fixed: the problem isn't visible and it's also very
> >> hard to observe.
> >> In a case where 2 producers with very different rate limiting options
> >> share a connection, it definitely is a problem in the current implementation.
>>
>
> Yes actually, I was talking to the team and we did observe a case where
> there was a client app with 2 producer objects writing to two different
> topics. When one of their topics was breaching quota, the other one was
> also observing rate limiting even though it was under quota. So eventually,
> this surely needs to be solved. One short term solution is to at least not
> share connections across producer objects. But I still feel it's out of
> scope of this discussion here.
>
> What do you think about the quick solution of not sharing connections
> across producer objects? I can raise a github issue explaining the
> situation.
>
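
For what it's worth, the closest existing knob is on the client side: raising the connection pool size spreads producers over more connections, which reduces (but does not eliminate) the chance of two unrelated producers sharing one connection. A sketch:

import org.apache.pulsar.client.api.PulsarClient;

public class LessSharedConnectionsExample {
    public static void main(String[] args) throws Exception {
        // The default is 1 connection per broker, so all producers and consumers
        // created from this client share it. A larger pool spreads them out and
        // reduces head-of-line blocking when one producer gets throttled.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .connectionsPerBroker(8)
                .build();
        // ... create the two producers from this client as before ...
        client.close();
    }
}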
>
>
> --
> Girish Sharma
>

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Closing this discussion thread and the PIP. Apart from the discussion
present in this thread, I presented the detailed requirements in a dev meeting
on 23rd November, and the conclusion was that we will actually go ahead and
implement the requirements in Pulsar itself.
There was a prerequisite of refactoring the rate limiter codebase, which is
already covered by Lari in PIP-322.

I will be creating a new parent PIP soon about the high-level requirements.

Thank you to everyone who participated in the thread and in the discussion at
the dev meeting on the 23rd.

Regards

On Thu, Nov 23, 2023 at 8:26 PM Girish Sharma <sc...@gmail.com>
wrote:

> I've captured our requirements in detail in this document -
> https://docs.google.com/document/d/1-y5nBaC9QuAUHKUGMVVe4By-SmMZIL4w09U1byJBbMc/edit
> Added it to agenda document as well. Will join the meeting and discuss.
>
> Regards
>
> On Wed, Nov 22, 2023 at 10:49 PM Lari Hotari <lh...@apache.org> wrote:
>
>> I have written a long blog post that contains the context, the summary
>> of my view point about PIP-310 and the proposal for proceeding:
>>
>> https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html
>>
>> Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's
>> coordinate on Pulsar Slack's #dev channel if there are issues in joining
>> the meeting.
>> See you tomorrow!
>>
>> -Lari
>>
>> 1 - https://github.com/apache/pulsar/wiki/Community-Meetings
>>
>> On Mon, 20 Nov 2023 at 20:48, Lari Hotari <lh...@apache.org> wrote:
>> >
>> > Hi Girish,
>> >
>> > replies inline and after that there are some updates about my
>> > preparation for the community meeting on Thursday. (there's
>> > https://github.com/lhotari/async-tokenbucket with a PoC for a
>> > low-level high performance token bucket implementation)
>> >
>> > On Sat, 11 Nov 2023 at 17:25, Girish Sharma <sc...@gmail.com>
>> wrote:
>> > > Actually, the capacity is meant to simulate that particular rate
>> limit. if
>> > > we have 2 buckets anyways, the one managing the fixed rate limit part
>> > > shouldn't generally have a capacity more than the fixed rate, right?
>> >
>> > There are multiple ways to model and understand a dual token bucket
>> > implementation.
>> > I view the 2 buckets in a dual token bucket implementation as separate
>> > buckets. They are like an AND rule, so if either bucket is empty,
>> > there will be a need to pause to wait for new tokens.
>> > Since we aren't working with code yet, these comments could be out of
>> context.
>> >
>> > > I think it can be done, especially with that one thing you mentioned
>> about
>> > > holding off filling the second bucket for 10 minutes.. but it does
>> become
>> > > quite complicated in terms of managing the flow of the tokens..
>> because
>> > > while we only fill the second bucket once every 10 minutes, after the
>> 10th
>> > > minute, it needs to be filled continuously for a while (the duration
>> we
>> > > want to support the bursting for).. and the capacity of this second
>> bucket
>> > > also is governed by and exactly matches the burst value.
>> >
>> > There might not be a need for this complexity of the "filling bucket"
>> > in the first place. It was more of a demonstration that it's possible
>> > to implement the desired behavior of limited bursting by tweaking the
>> > basic token bucket algorithm slightly.
>> > I'd rather avoid this additional complexity.
>> >
>> > > Agreed that it is much higher than a single topics' max throughput..
>> but
>> > > the context of my example had multiple topics lying on the same
>> > > broker/bookie ensemble bursting together at the same time because
>> they had
>> > > been saving up on tokens in the bucket.
>> >
>> > Yes, that makes sense.
>> >
>> > > always be a need to overprovision resources. You usually don't want to
>> > > > go beyond 60% or 70% utilization on disk, cpu or network resources
>> so
>> > > > that queues in the system don't start to increase and impacting
>> > > > latencies. In Pulsar/Bookkeeper, the storage solution has a very
>> > > > effective load balancing, especially for writing. In Bookkeeper each
>> > > > ledger (the segment) of a topic selects the "ensemble" and the
>> "write
>> > > > quorum", the set of bookies to write to, when the ledger is opened.
>> > > > The bookkeeper client could also change the ensemble in the middle
>> of
>> > > > a ledger due to some event like a bookie becoming read-only or
>> > > >
>> > >
>> > > While it does do that on complete failure of bookie or a bookie disk,
>> or
>> > > broker going down, degradations aren't handled this well. So if all
>> topics
>> > > in a bookie are bursting due to the fact that they had accumulated
>> tokens,
>> > > then all it will lead to is breach of write latency SLA because at one
>> > > point, the disks/cpu/network etc will start choking. (even after
>> > > considering the 70% utilization i.e. 30% buffer)
>> >
>> > Yes.
>> >
>> > > That's only in the case of the default rate limiter where the
>> tryAcquire
>> > > isn't even implemented.. since the default rate limiter checks for
>> breach
>> > > only at a fixed rate rather than before every produce call. But in
>> case of
>> > > precise rate limiter, the response of `tryAcquire` is respected.
>> >
>> > This is one of many reasons why I think it's better to improve the
>> > maintainability of the current solution and remove the unnecessary
>> > options between "precise" and the default one.
>> >
>> > > True, and actually, due to the fact that pulsar auto distributes
>> topics
>> > > based on load shedding parameters, we can actually focus on a single
>> > > broker's or a single bookie ensemble and assume that it works as we
>> scale
>> > > it. Of course this means that putting a reasonable limit in terms of
>> > > cpu/network/partition/throughput limits at each broker level and
>> pulsar
>> > > provides ways to do that automatically.
>> >
>> > We do have plans to improve the Pulsar so that things would simply
>> > work properly under heavy load. Optimally, things would work without
>> > the need to tune and tweak the system rigorously. These improvements
>> > go beyond rate limiting.
>> >
>> > > While I have shared the core requirements over these threads (fixed
>> rate +
>> > > burst multiplier for upto X duration every Y minutes).. We are
>> finalizing
>> > > the details requirements internally to present. As I replied in my
>> previous
>> > > mail, one outcome of detailed internal discussion was the discovery of
>> > > throughput contention.
>> >
>> > Sounds good. Sharing your experiences is extremely valuable.
>> >
>> > > We do use resource groups for certain namespace level quotas, but
>> even in
>> > > our use case, rate limiter and resource groups are two separate
>> tangents.
>> > > At least for foreseeable future.
>> >
>> > At this point they are separate, but there is a need to improve. Jack
>> > Vanlightly's blog post series "The Architecture Of Serverless Data
>> > Systems" [1] explains the competition in this area very well.
>> > Multi-tenant capacity management and SLAs are at the core of
>> > "serverless" data systems.
>> >
>> > 1 -
>> https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems
>> >
>> > > Will close this before the 23rd.
>> >
>> > Yes. :)
>> >
>> > > This is actually not a resource group limitation, but a general broker
>> > > level example. Even if resource groups weren't in picture, this issue
>> > > remains. The fact remains that since we need to have reasonable broker
>> > > level limits (as a result of our NFR testing etc), there will be
>> clash of
>> > > topics where some topics trying to burst are taking up broker's
>> capacity of
>> > > serving the fixed rate for other topics. This needs special handling
>> even
>> > > after dual token approach. I will have detailed requirement with
>> examples
>> > > by 23rd.
>> >
>> > Sounds good.
>> >
>> > > > That is the one of the reasons for putting a strict check on the
>> amount,
>> > > > > duration and frequency of bursting.
>> > > >
>> > > > Please explain more about this. It would be helpful for me in trying
>> > > > to understand your requirements.
>> > > > One question about the solution of "strict check on amount, duration
>> > > > and frequency of bursting".
>> > > > Do you already have a PoC where you have validated that the solution
>> > > > actually solves the problem?
>> > > >
>> > >
>> > > No, this is still theoretical based on our understanding of rate
>> limiter,
>> > > dual tokens, current pulsar code, etc. If topics are allowed to burst
>> > > without a real duration based limitation, then the chances of more
>> and more
>> > > topics contending for broker's actual capacity is high and thus it
>> > > hinders/contends with (a) a new topic trying to burst and make use
>> of the
>> > > bursting feature with its SLA and (b) another topic, not bursting,
>> way
>> > > within its fixed rate, still being rate limited due to lack of
>> capacity
>> > > (which is taken up by bursting topics) at a broker level.
>> >
>> > There are more factors involved in the bursting behavior and the
>> > applying back pressure in Pulsar.
>> > If there's proper end-to-end flow control and backpressure, a possible
>> > "over commitment" situation could be handled by simply handling the
>> > load as fast as possible. The bursting rates aren't necessarily part
>> > of strict SLAs so it's possible that even SLAs are met while operating
>> > as-fast-as possible.  When the Pulsar load balancing is improved, the
>> > Pulsar load manager could move load off an overloaded broker and that
>> > would help resolve the situation quicker.
>> > Later on, the SLAs should become part of the rate limiting and
>> > capacity/load management decisions within Pulsar. For example, not all
>> > topics need very low latency. When there's information about the
>> > expected latency, it would be possible to prioritize and
>> > optimize the workload in different ways.
>> > When things are properly configured, the Pulsar system shouldn't get
>> > into a state where the system collapses when it gets overloaded. If
>> > brokers crash because of out-of-memory errors, the whole system will
>> > quickly collapse, since recovering from a broker crash will increase
>> > the overall load.
>> > I have doubts that focusing solely on a custom rate limiting
>> > implementation would be impactful. My current assumption is that
>> > adding support for bursting to rate limiting will be sufficient for
>> > making rate limiting good enough for most Pulsar use cases, and we
>> > could continue to improve rate limiting when there's a specific need.
>> > I'll be happy to be proven to be wrong and change my opinion when
>> > there's sufficient evidence that custom rate limiters are needed in
>> > Pulsar.
>> >
>> > > > Could you simply write a PoC by forking apache/pulsar and changing
>> the
>> > > > code directly without having it pluggable initially?
>> > > >
>> > > >
>> > > This would be a last resort :) . At most if things don't go as
>> desired,
>> > > then we would end up doing this and adding plugging logic in a pulsar
>> fork.
>> > > From a PoC perspective, we may try it out soon.
>> >
>> > I think it's useful to work with a PoC since it will show the
>> > practical limits of using rate limiters for handling SLA and capacity
>> > management issues.
>> >
>> > To start improving the current rate limiter implementation in Pulsar
>> > core, I have made a PoC of an asynchronous token bucket implementation
>> > class, which is optimized for performance without causing contention
>> > in a multi-threaded setup.
>> > The repository https://github.com/lhotari/async-tokenbucket contains
>> > the PoC code and performance test that measures the token bucket
>> > calculation logic overhead.
>> > On my laptop, it's possible to achieve about 165M ops/s with 100
>> > threads with a single token bucket. That's to prove that the
>> > calculation logic isn't blocking and has very low contention.  There's
>> > no need for a scheduler to maintain the calculations. A scheduler
>> > would be needed in an actual rate limiter implementation to unblock a
>> > paused Netty channel by setting auto read to true (in the case where
>> > backpressure is applied by setting auto read to false). The
>> > AsyncTokenBucket class is intended to be a pure function without side
>> > effects. The implementation idea is to use the composition pattern to
>> > build higher level rate limiter constructs.
>> >
>> > Let's meet on Thursday at the Pulsar community meeting (schedule at
>> > https://github.com/apache/pulsar/wiki/Community-Meetings) to present
>> > our summaries and decide on the next steps together.
>> >
>> >
>> > -Lari
>>
>
>
> --
> Girish Sharma
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
I've captured our requirements in detail in this document -
https://docs.google.com/document/d/1-y5nBaC9QuAUHKUGMVVe4By-SmMZIL4w09U1byJBbMc/edit
Added it to agenda document as well. Will join the meeting and discuss.

Regards

On Wed, Nov 22, 2023 at 10:49 PM Lari Hotari <lh...@apache.org> wrote:

> I have written a long blog post that contains the context, the summary
> of my view point about PIP-310 and the proposal for proceeding:
>
> https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html
>
> Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's
> coordinate on Pulsar Slack's #dev channel if there are issues in joining
> the meeting.
> See you tomorrow!
>
> -Lari
>
> 1 - https://github.com/apache/pulsar/wiki/Community-Meetings
>
> On Mon, 20 Nov 2023 at 20:48, Lari Hotari <lh...@apache.org> wrote:
> >
> > Hi Girish,
> >
> > replies inline and after that there are some updates about my
> > preparation for the community meeting on Thursday. (there's
> > https://github.com/lhotari/async-tokenbucket with a PoC for a
> > low-level high performance token bucket implementation)
> >
> > On Sat, 11 Nov 2023 at 17:25, Girish Sharma <sc...@gmail.com>
> wrote:
> > > Actually, the capacity is meant to simulate that particular rate
> limit. if
> > > we have 2 buckets anyways, the one managing the fixed rate limit part
> > > shouldn't generally have a capacity more than the fixed rate, right?
> >
> > There are multiple ways to model and understand a dual token bucket
> > implementation.
> > I view the 2 buckets in a dual token bucket implementation as separate
> > buckets. They are like an AND rule, so if either bucket is empty,
> > there will be a need to pause to wait for new tokens.
> > Since we aren't working with code yet, these comments could be out of
> context.
> >
> > > I think it can be done, especially with that one thing you mentioned
> about
> > > holding off filling the second bucket for 10 minutes.. but it does
> become
> > > quite complicated in terms of managing the flow of the tokens.. because
> > > while we only fill the second bucket once every 10 minutes, after the
> 10th
> > > minute, it needs to be filled continuously for a while (the duration we
> > > want to support the bursting for).. and the capacity of this second
> bucket
> > > also is governed by and exactly matches the burst value.
> >
> > There might not be a need for this complexity of the "filling bucket"
> > in the first place. It was more of a demonstration that it's possible
> > to implement the desired behavior of limited bursting by tweaking the
> > basic token bucket algorithm slightly.
> > I'd rather avoid this additional complexity.
> >
> > > Agreed that it is much higher than a single topics' max throughput..
> but
> > > the context of my example had multiple topics lying on the same
> > > broker/bookie ensemble bursting together at the same time because they
> had
> > > been saving up on tokens in the bucket.
> >
> > Yes, that makes sense.
> >
> > > always be a need to overprovision resources. You usually don't want to
> > > > go beyond 60% or 70% utilization on disk, cpu or network resources so
> > > > that queues in the system don't start to increase and impacting
> > > > latencies. In Pulsar/Bookkeeper, the storage solution has a very
> > > > effective load balancing, especially for writing. In Bookkeeper each
> > > > ledger (the segment) of a topic selects the "ensemble" and the "write
> > > > quorum", the set of bookies to write to, when the ledger is opened.
> > > > The bookkeeper client could also change the ensemble in the middle of
> > > > a ledger due to some event like a bookie becoming read-only or
> > > >
> > >
> > > While it does do that on complete failure of bookie or a bookie disk,
> or
> > > broker going down, degradations aren't handled this well. So if all
> topics
> > > in a bookie are bursting due to the fact that they had accumulated
> tokens,
> > > then all it will lead to is breach of write latency SLA because at one
> > > point, the disks/cpu/network etc will start choking. (even after
> > > considering the 70% utilization i.e. 30% buffer)
> >
> > Yes.
> >
> > > That's only in the case of the default rate limiter where the
> tryAcquire
> > > isn't even implemented.. since the default rate limiter checks for
> breach
> > > only at a fixed rate rather than before every produce call. But in
> case of
> > > precise rate limiter, the response of `tryAcquire` is respected.
> >
> > This is one of many reasons why I think it's better to improve the
> > maintainability of the current solution and remove the unnecessary
> > options between "precise" and the default one.
> >
> > > True, and actually, due to the fact that pulsar auto distributes topics
> > > based on load shedding parameters, we can actually focus on a single
> > > broker's or a single bookie ensemble and assume that it works as we
> scale
> > > it. Of course this means that putting a reasonable limit in terms of
> > > cpu/network/partition/throughput limits at each broker level and pulsar
> > > provides ways to do that automatically.
> >
> > We do have plans to improve the Pulsar so that things would simply
> > work properly under heavy load. Optimally, things would work without
> > the need to tune and tweak the system rigorously. These improvements
> > go beyond rate limiting.
> >
> > > While I have shared the core requirements over these threads (fixed
> rate +
> > > burst multiplier for upto X duration every Y minutes).. We are
> finalizing
> > > the details requirements internally to present. As I replied in my
> previous
> > > mail, one outcome of detailed internal discussion was the discovery of
> > > throughput contention.
> >
> > Sounds good. Sharing your experiences is extremely valuable.
> >
> > > We do use resource groups for certain namespace level quotas, but even
> in
> > > our use case, rate limiter and resource groups are two separate
> tangents.
> > > At least for foreseeable future.
> >
> > At this point they are separate, but there is a need to improve. Jack
> > Vanlightly's blog post series "The Architecture Of Serverless Data
> > Systems" [1] explains the competition in this area very well.
> > Multi-tenant capacity management and SLAs are at the core of
> > "serverless" data systems.
> >
> > 1 -
> https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems
> >
> > > Will close this before the 23rd.
> >
> > Yes. :)
> >
> > > This is actually not a resource group limitation, but a general broker
> > > level example. Even if resource groups weren't in picture, this issue
> > > remains. The fact remains that since we need to have reasonable broker
> > > level limits (as a result of our NFR testing etc), there will be clash
> of
> > > topics where some topics trying to burst are taking up broker's
> capacity of
> > > serving the fixed rate for other topics. This needs special handling
> even
> > > after dual token approach. I will have detailed requirement with
> examples
> > > by 23rd.
> >
> > Sounds good.
> >
> > > > That is the one of the reasons for putting a strict check on the
> amount,
> > > > > duration and frequency of bursting.
> > > >
> > > > Please explain more about this. It would be helpful for me in trying
> > > > to understand your requirements.
> > > > One question about the solution of "strict check on amount, duration
> > > > and frequency of bursting".
> > > > Do you already have a PoC where you have validated that the solution
> > > > actually solves the problem?
> > > >
> > >
> > > No, this is still theoretical based on our understanding of rate
> limiter,
> > > dual tokens, current pulsar code, etc. If topics are allowed to burst
> > > without a real duration based limitation, then the chances of more and
> more
> > > topics contending for broker's actual capacity is high and thus it
> > > hinders/contends with (a) a new topic trying to burst and make use of
> the
> > > bursting feature with its SLA and (b) another topic, not bursting,
> way
> > > within its fixed rate, still being rate limited due to lack of capacity
> > > (which is taken up by bursting topics) at a broker level.
> >
> > There are more factors involved in the bursting behavior and the
> > applying back pressure in Pulsar.
> > If there's proper end-to-end flow control and backpressure, a possible
> > "over commitment" situation could be handled by simply handling the
> > load as fast as possible. The bursting rates aren't necessarily part
> > of strict SLAs so it's possible that even SLAs are met while operating
> > as-fast-as possible.  When the Pulsar load balancing is improved, the
> > Pulsar load manager could move load off an overloaded broker and that
> > would help resolve the situation quicker.
> > Later on, the SLAs should become part of the rate limiting and
> > capacity/load management decisions within Pulsar. For example, not all
> > topics need very low latency. When there's information about the
> > expected latency, it would be possible to prioritize and
> > optimize the workload in different ways.
> > When things are properly configured, the Pulsar system shouldn't get
> > into a state where the system collapses when it gets overloaded. If
> > brokers crash because of out-of-memory errors, the whole system will
> > quickly collapse, since recovering from a broker crash will increase
> > the overall load.
> > I have doubts that focusing solely on a custom rate limiting
> > implementation would be impactful. My current assumption is that
> > adding support for bursting to rate limiting will be sufficient for
> > making rate limiting good enough for most Pulsar use cases, and we
> > could continue to improve rate limiting when there's a specific need.
> > I'll be happy to be proven to be wrong and change my opinion when
> > there's sufficient evidence that custom rate limiters are needed in
> > Pulsar.
> >
> > > > Could you simply write a PoC by forking apache/pulsar and changing
> the
> > > > code directly without having it pluggable initially?
> > > >
> > > >
> > > This would be a last resort :) . At most if things don't go as desired,
> > > then we would end up doing this and adding plugging logic in a pulsar
> fork.
> > > From a PoC perspective, we may try it out soon.
> >
> > I think it's useful to work with a PoC since it will show the
> > practical limits of using rate limiters for handling SLA and capacity
> > management issues.
> >
> > To start improving the current rate limiter implementation in Pulsar
> > core, I have made a PoC of an asynchronous token bucket implementation
> > class, which is optimized for performance without causing contention
> > in a multi-threaded setup.
> > The repository https://github.com/lhotari/async-tokenbucket contains
> > the PoC code and performance test that measures the token bucket
> > calculation logic overhead.
> > On my laptop, it's possible to achieve about 165M ops/s with 100
> > threads with a single token bucket. That's to prove that the
> > calculation logic isn't blocking and has very low contention.  There's
> > no need for a scheduler to maintain the calculations. A scheduler
> > would be needed in an actual rate limiter implementation to unblock a
> > paused Netty channel by setting auto read to true (in the case where
> > backpressure is applied by setting auto read to false). The
> > AsyncTokenBucket class is intended to be a pure function without side
> > effects. The implementation idea is to use the composition pattern to
> > build higher level rate limiter constructs.
> >
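
To make the division of responsibilities concrete, here is a rough sketch (illustrative only, not the actual AsyncTokenBucket PoC): the token bucket math stays scheduler-free, and a scheduler is involved only to re-enable auto-read on a channel that was paused.

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.netty.channel.Channel;

// Sketch: the bucket calculations are lazy; the scheduler never drives them,
// it only re-enables reads after a backpressure pause.
final class ChannelRateLimiterSketch {
    private final TokenBucketSketch bucket;           // as sketched earlier in the thread
    private final ScheduledExecutorService scheduler;

    ChannelRateLimiterSketch(TokenBucketSketch bucket, ScheduledExecutorService scheduler) {
        this.bucket = bucket;
        this.scheduler = scheduler;
    }

    void onMessagePublished(Channel channel, long msgSizeInBytes, long pauseMillis) {
        if (!bucket.tryAcquire(msgSizeInBytes)) {
            // Backpressure: stop reading from this connection for a while.
            channel.config().setAutoRead(false);
            scheduler.schedule(() -> channel.config().setAutoRead(true),
                    pauseMillis, TimeUnit.MILLISECONDS);
        }
    }
}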
> > Let's meet on Thursday at the Pulsar community meeting (schedule at
> > https://github.com/apache/pulsar/wiki/Community-Meetings) to present
> > our summaries and decide on the next steps together.
> >
> >
> > -Lari
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
I have written a long blog post that contains the context, the summary
of my viewpoint about PIP-310 and the proposal for proceeding:
https://codingthestreams.com/pulsar/2023/11/22/pulsar-slos-and-rate-limiting.html

Let's discuss this tomorrow in the Pulsar community meeting [1]. Let's
coordinate on Pulsar Slack's #dev channel if there are issues in joining
the meeting.
See you tomorrow!

-Lari

1 - https://github.com/apache/pulsar/wiki/Community-Meetings

On Mon, 20 Nov 2023 at 20:48, Lari Hotari <lh...@apache.org> wrote:
>
> Hi Girish,
>
> replies inline and after that there are some updates about my
> preparation for the community meeting on Thursday. (there's
> https://github.com/lhotari/async-tokenbucket with a PoC for a
> low-level high performance token bucket implementation)
>
> On Sat, 11 Nov 2023 at 17:25, Girish Sharma <sc...@gmail.com> wrote:
> > Actually, the capacity is meant to simulate that particular rate limit. if
> > we have 2 buckets anyways, the one managing the fixed rate limit part
> > shouldn't generally have a capacity more than the fixed rate, right?
>
> There are multiple ways to model and understand a dual token bucket
> implementation.
> I view the 2 buckets in a dual token bucket implementation as separate
> buckets. They are like an AND rule, so if either bucket is empty,
> there will be a need to pause to wait for new tokens.
> Since we aren't working with code yet, these comments could be out of context.
>
> > I think it can be done, especially with that one thing you mentioned about
> > holding off filling the second bucket for 10 minutes.. but it does become
> > quite complicated in terms of managing the flow of the tokens.. because
> > while we only fill the second bucket once every 10 minutes, after the 10th
> > minute, it needs to be filled continuously for a while (the duration we
> > want to support the bursting for).. and the capacity of this second bucket
> > also is governed by and exactly matches the burst value.
>
> There might not be a need for this complexity of the "filling bucket"
> in the first place. It was more of a demonstration that it's possible
> to implement the desired behavior of limited bursting by tweaking the
> basic token bucket algorithm slightly.
> I'd rather avoid this additional complexity.
>
> > Agreed that it is much higher than a single topics' max throughput.. but
> > the context of my example had multiple topics lying on the same
> > broker/bookie ensemble bursting together at the same time because they had
> > been saving up on tokens in the bucket.
>
> Yes, that makes sense.
>
> > always be a need to overprovision resources. You usually don't want to
> > > go beyond 60% or 70% utilization on disk, cpu or network resources so
> > > that queues in the system don't start to increase and impacting
> > > latencies. In Pulsar/Bookkeeper, the storage solution has a very
> > > effective load balancing, especially for writing. In Bookkeeper each
> > > ledger (the segment) of a topic selects the "ensemble" and the "write
> > > quorum", the set of bookies to write to, when the ledger is opened.
> > > The bookkeeper client could also change the ensemble in the middle of
> > > a ledger due to some event like a bookie becoming read-only or
> > >
> >
> > While it does do that on complete failure of bookie or a bookie disk, or
> > broker going down, degradations aren't handled this well. So if all topics
> > in a bookie are bursting due to the fact that they had accumulated tokens,
> > then all it will lead to is breach of write latency SLA because at one
> > point, the disks/cpu/network etc will start choking. (even after
> > considering the 70% utilization i.e. 30% buffer)
>
> Yes.
>
> > That's only in the case of the default rate limiter where the tryAcquire
> > isn't even implemented.. since the default rate limiter checks for breach
> > only at a fixed rate rather than before every produce call. But in case of
> > precise rate limiter, the response of `tryAcquire` is respected.
>
> This is one of many reasons why I think it's better to improve the
> maintainability of the current solution and remove the unnecessary
> options between "precise" and the default one.
>
> > True, and actually, due to the fact that pulsar auto distributes topics
> > based on load shedding parameters, we can actually focus on a single
> > broker's or a single bookie ensemble and assume that it works as we scale
> > it. Of course this means that putting a reasonable limit in terms of
> > cpu/network/partition/throughput limits at each broker level and pulsar
> > provides ways to do that automatically.
>
> We do have plans to improve the Pulsar so that things would simply
> work properly under heavy load. Optimally, things would work without
> the need to tune and tweak the system rigorously. These improvements
> go beyond rate limiting.
>
> > While I have shared the core requirements over these threads (fixed rate +
> > burst multiplier for upto X duration every Y minutes).. We are finalizing
> > the detailed requirements internally to present. As I replied in my previous
> > mail, one outcome of detailed internal discussion was the discovery of
> > throughput contention.
>
> Sounds good. Sharing your experiences is extremely valuable.
>
> > We do use resource groups for certain namespace level quotas, but even in
> > our use case, rate limiter and resource groups are two separate tangents.
> > At least for foreseeable future.
>
> At this point they are separate, but there is a need to improve. Jack
> Vanlightly's blog post series "The Architecture Of Serverless Data
> Systems" [1] explains the competition in this area very well.
> Multi-tenant capacity management and SLAs are at the core of
> "serverless" data systems.
>
> 1 - https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems
>
> > Will close this before the 23rd.
>
> Yes. :)
>
> > This is actually not a resource group limitation, but a general broker
> > level example. Even if resource groups weren't in picture, this issue
> > remains. The fact remains that since we need to have reasonable broker
> > level limits (as a result of our NFR testing etc), there will be clash of
> > topics where some topics trying to burst are taking up broker's capacity of
> > serving the fixed rate for other topics. This needs special handling even
> > after dual token approach. I will have detailed requirement with examples
> > by 23rd.
>
> Sounds good.
>
> > > That is the one of the reasons for putting a strict check on the amount,
> > > > duration and frequency of bursting.
> > >
> > > Please explain more about this. It would be helpful for me in trying
> > > to understand your requirements.
> > > One question about the solution of "strict check on amount, duration
> > > and frequency of bursting".
> > > Do you already have a PoC where you have validated that the solution
> > > actually solves the problem?
> > >
> >
> > No, this is still theoretical based on our understanding of rate limiter,
> > dual tokens, current pulsar code, etc. If topics are allowed to burst
> > without a real duration based limitation, then the chances of more and more
> > topics contending for the broker's actual capacity are high, and thus it
> > hinders/contends with (a) a new topic trying to burst and make use of the
> > bursting feature with its SLA, and (b) another topic, not bursting, way
> > within its fixed rate, still being rate limited due to lack of capacity
> > (which is taken up by bursting topics) at a broker level.
>
> There are more factors involved in the bursting behavior and the
> applying back pressure in Pulsar.
> If there's proper end-to-end flow control and backpressure, a possible
> "over commitment" situation could be handled by simply handling the
> load as fast as possible. The bursting rates aren't necessarily part
> of strict SLAs so it's possible that even SLAs are met while operating
> as-fast-as possible.  When the Pulsar load balancing is improved, the
> Pulsar load manager could move load off an overloaded broker and that
> would help resolve the situation quicker.
> Later on, the SLAs should become part of the rate limiting and
> capacity/load management decisions within Pulsar. For example, not all
> topics need very low latency. When there's information about the
> expected latency, it would be possible to prioritize and
> optimize the workload in different ways.
> When things are properly configured, the Pulsar system shouldn't get
> into a state where the system collapses when it gets overloaded. If
> brokers crash because of out-of-memory errors, the whole system will
> quickly collapse, since recovering from a broker crash will increase
> the overall load.
> I have doubts that focusing solely on a custom rate limiting
> implementation would be impactful. My current assumption is that
> adding support for bursting to rate limiting will be sufficient for
> making rate limiting good enough for most Pulsar use cases, and we
> could continue to improve rate limiting when there's a specific need.
> I'll be happy to be proven to be wrong and change my opinion when
> there's sufficient evidence that custom rate limiters are needed in
> Pulsar.
>
> > > Could you simply write a PoC by forking apache/pulsar and changing the
> > > code directly without having it pluggable initially?
> > >
> > >
> > This would be a last resort :) . At most if things don't go as desired,
> > then we would end up doing this and adding plugging logic in a pulsar fork.
> > From a PoC perspective, we may try it out soon.
>
> I think it's useful to work with a PoC since it will show the
> practical limits of using rate limiters for handling SLA and capacity
> management issues.
>
> To start improving the current rate limiter implementation in Pulsar
> core, I have made a PoC of an asynchronous token bucket implementation
> class, which is optimized for performance without causing contention
> in a multi-threaded setup.
> The repository https://github.com/lhotari/async-tokenbucket contains
> the PoC code and performance test that measures the token bucket
> calculation logic overhead.
> On my laptop, it's possible to achieve about 165M ops/s with 100
> threads with a single token bucket. That's to prove that the
> calculation logic isn't blocking and has very low contention.  There's
> no need for a scheduler to maintain the calculations. A scheduler
> would be needed in an actual rate limiter implementation to unblock a
> paused Netty channel by setting auto read to true (in the case where
> backpressure is applied by setting auto read to false). The
> AsyncTokenBucket class is intended to be a pure function without side
> effects. The implementation idea is to use the composition pattern to
> build higher level rate limiter constructs.
>
> Let's meet on Thursday at the Pulsar community meeting (schedule at
> https://github.com/apache/pulsar/wiki/Community-Meetings) to present
> our summaries and decide on the next steps together.
>
>
> -Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

replies inline and after that there are some updates about my
preparation for the community meeting on Thursday. (there's
https://github.com/lhotari/async-tokenbucket with a PoC for a
low-level high performance token bucket implementation)

On Sat, 11 Nov 2023 at 17:25, Girish Sharma <sc...@gmail.com> wrote:
> Actually, the capacity is meant to simulate that particular rate limit. if
> we have 2 buckets anyways, the one managing the fixed rate limit part
> shouldn't generally have a capacity more than the fixed rate, right?

There are multiple ways to model and understand a dual token bucket
implementation.
I view the 2 buckets in a dual token bucket implementation as separate
buckets. They are like an AND rule, so if either bucket is empty,
there will be a need to pause to wait for new tokens.
Since we aren't working with code yet, these comments could be out of context.
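
Still, to make the AND rule concrete, here's roughly what I have in mind as a
sketch (my illustration for this discussion only, not code from Pulsar or from
the PoC repo; the class and field names are made up, the 10MB/s and 15MB/s
numbers are just the ones from our example):

class DualTokenBucket {
    // bucket 1 enforces the long-term average rate (e.g. 10MB/s),
    // bucket 2 enforces the short-term maximum rate (e.g. 15MB/s)
    private final long averageRate;     // tokens (bytes) added per second to bucket 1
    private final long averageCapacity; // max tokens bucket 1 can hold
    private final long burstRate;       // tokens (bytes) added per second to bucket 2
    private final long burstCapacity;   // max tokens bucket 2 can hold
    private long averageTokens;
    private long burstTokens;
    private long lastRefillNanos = System.nanoTime();

    DualTokenBucket(long averageRate, long averageCapacity,
                    long burstRate, long burstCapacity) {
        this.averageRate = averageRate;
        this.averageCapacity = averageCapacity;
        this.burstRate = burstRate;
        this.burstCapacity = burstCapacity;
        // both buckets start full in this sketch
        this.averageTokens = averageCapacity;
        this.burstTokens = burstCapacity;
    }

    // AND rule: the message is only allowed when BOTH buckets have enough
    // tokens; if either bucket is empty, the caller has to pause and wait.
    synchronized boolean tryConsume(long bytes) {
        refill();
        if (averageTokens >= bytes && burstTokens >= bytes) {
            averageTokens -= bytes;
            burstTokens -= bytes;
            return true;
        }
        return false;
    }

    private void refill() {
        long now = System.nanoTime();
        long elapsedMillis = (now - lastRefillNanos) / 1_000_000L;
        if (elapsedMillis > 0) {
            // fractional tokens are dropped here to keep the sketch short
            averageTokens = Math.min(averageCapacity,
                    averageTokens + elapsedMillis * averageRate / 1000L);
            burstTokens = Math.min(burstCapacity,
                    burstTokens + elapsedMillis * burstRate / 1000L);
            lastRefillNanos += elapsedMillis * 1_000_000L;
        }
    }
}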

> I think it can be done, especially with that one thing you mentioned about
> holding off filling the second bucket for 10 minutes.. but it does become
> quite complicated in terms of managing the flow of the tokens.. because
> while we only fill the second bucket once every 10 minutes, after the 10th
> minute, it needs to be filled continuously for a while (the duration we
> want to support the bursting for).. and the capacity of this second bucket
> also is governed by and exactly matches the burst value.

There might not be a need for this complexity of the "filling bucket"
in the first place. It was more of a demonstration that it's possible
to implement the desired behavior of limited bursting by tweaking the
basic token bucket algorithm slightly.
I'd rather avoid this additional complexity.

> Agreed that it is much higher than a single topic's max throughput.. but
> the context of my example had multiple topics lying on the same
> broker/bookie ensemble bursting together at the same time because they had
> been saving up on tokens in the bucket.

Yes, that makes sense.

> always be a need to overprovision resources. You usually don't want to
> > go beyond 60% or 70% utilization on disk, cpu or network resources so
> > that queues in the system don't start to increase and impact
> > latencies. In Pulsar/Bookkeeper, the storage solution has a very
> > effective load balancing, especially for writing. In Bookkeeper each
> > ledger (the segment) of a topic selects the "ensemble" and the "write
> > quorum", the set of bookies to write to, when the ledger is opened.
> > The bookkeeper client could also change the ensemble in the middle of
> > a ledger due to some event like a bookie becoming read-only or
> >
>
> While it does do that on complete failure of bookie or a bookie disk, or
> broker going down, degradations aren't handled this well. So if all topics
> in a bookie are bursting due to the fact that they had accumulated tokens,
> then all it will lead to is breach of write latency SLA because at one
> point, the disks/cpu/network etc will start choking. (even after
> considering the 70% utilization i.e. 30% buffer)

Yes.

> That's only in the case of the default rate limiter where the tryAcquire
> isn't even implemented.. since the default rate limiter checks for breach
> only at a fixed rate rather than before every produce call. But in case of
> precise rate limiter, the response of `tryAcquire` is respected.

This is one of many reasons why I think it's better to improve the
maintainability of the current solution and remove the unnecessary
options between "precise" and the default one.

> True, and actually, due to the fact that pulsar auto distributes topics
> based on load shedding parameters, we can actually focus on a single
> broker's or a single bookie ensemble and assume that it works as we scale
> it. Of course this means that putting a reasonable limit in terms of
> cpu/network/partition/throughput limits at each broker level and pulsar
> provides ways to do that automatically.

We do have plans to improve Pulsar so that things would simply
work properly under heavy load. Optimally, things would work without
the need to tune and tweak the system rigorously. These improvements
go beyond rate limiting.

> While I have shared the core requirements over these threads (fixed rate +
> burst multiplier for upto X duration every Y minutes).. We are finalizing
> the detailed requirements internally to present. As I replied in my previous
> mail, one outcome of detailed internal discussion was the discovery of
> throughput contention.

Sounds good. Sharing your experiences is extremely valuable.

> We do use resource groups for certain namespace level quotas, but even in
> our use case, rate limiter and resource groups are two separate tangents.
> At least for foreseeable future.

At this point they are separate, but there is a need to improve. Jack
Vanlightly's blog post series "The Architecture Of Serverless Data
Systems" [1] explains the competition in this area very well.
Multi-tenant capacity management and SLAs are at the core of
"serverless" data systems.

1 - https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems

> Will close this before the 23rd.

Yes. :)

> This is actually not a resource group limitation, but a general broker
> level example. Even if resource groups weren't in picture, this issue
> remains. The fact remains that since we need to have reasonable broker
> level limits (as a result of our NFR testing etc), there will be clash of
> topics where some topics trying to burst are taking up broker's capacity of
> serving the fixed rate for other topics. This needs special handling even
> after dual token approach. I will have detailed requirement with examples
> by 23rd.

Sounds good.

> > That is the one of the reasons for putting a strict check on the amount,
> > > duration and frequency of bursting.
> >
> > Please explain more about this. It would be helpful for me in trying
> > to understand your requirements.
> > One question about the solution of "strict check on amount, duration
> > and frequency of bursting".
> > Do you already have a PoC where you have validated that the solution
> > actually solves the problem?
> >
>
> No, this is still theoretical based on our understanding of rate limiter,
> dual tokens, current pulsar code, etc. If topics are allowed to burst
> without a real duration based limitation, then the chances of more and more
> topics contending for the broker's actual capacity are high, and thus it
> hinders/contends with (a) a new topic trying to burst and make use of the
> bursting feature with its SLA, and (b) another topic, not bursting, way
> within its fixed rate, still being rate limited due to lack of capacity
> (which is taken up by bursting topics) at a broker level.

There are more factors involved in the bursting behavior and in
applying backpressure in Pulsar.
If there's proper end-to-end flow control and backpressure, a possible
"over commitment" situation could be handled by simply handling the
load as fast as possible. The bursting rates aren't necessarily part
of strict SLAs so it's possible that even SLAs are met while operating
as fast as possible. When the Pulsar load balancing is improved, the
Pulsar load manager could move load off an overloaded broker and that
would help resolve the situation quicker.
Later on, the SLAs should become part of the rate limiting and
capacity/load management decisions within Pulsar. For example, not all
topics need very low latency. When there's information about the
expected latency, it would be possible to prioritize and
optimize the workload in different ways.
When things are properly configured, the Pulsar system shouldn't get
into a state where the system collapses when it gets overloaded. If
brokers crash because of out-of-memory errors, the whole system will
quickly collapse, since recovering from a broker crash will increase
the overall load.
I have doubts that focusing solely on a custom rate limiting
implementation would be impactful. My current assumption is that
adding support for bursting to rate limiting will be sufficient for
making rate limiting good enough for most Pulsar use cases, and we
could continue to improve rate limiting when there's a specific need.
I'll be happy to be proven to be wrong and change my opinion when
there's sufficient evidence that custom rate limiters are needed in
Pulsar.

> > Could you simply write a PoC by forking apache/pulsar and changing the
> > code directly without having it pluggable initially?
> >
> >
> This would be a last resort :) . At most if things don't go as desired,
> then we would end up doing this and adding plugging logic in a pulsar fork.
> From a PoC perspective, we may try it out soon.

I think it's useful to work with a PoC since it will show the
practical limits of using rate limiters for handling SLA and capacity
management issues.

To start improving the current rate limiter implementation in Pulsar
core, I have made a PoC of an asynchronous token bucket implementation
class, which is optimized for performance without causing contention
in a multi-threaded setup.
The repository https://github.com/lhotari/async-tokenbucket contains
the PoC code and performance test that measures the token bucket
calculation logic overhead.
On my laptop, it's possible to achieve about 165M ops/s with 100
threads with a single token bucket. That's to prove that the
calculation logic isn't blocking and has very low contention.  There's
no need for a scheduler to maintain the calculations. A scheduler
would be needed in an actual rate limiter implementation to unblock a
paused Netty channel by setting auto read to true (in the case where
backpressure is applied by setting auto read to false). The
AsyncTokenBucket class is intended to behave like a pure function without side
effects. The implementation idea is to use the composition pattern to
build higher level rate limiter constructs.
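
For anyone who doesn't have time to dig into the repository before the meeting,
the rough idea is something like the sketch below. This is my simplified
illustration of the concept, not the actual AsyncTokenBucket code (the PoC
also keeps the consolidation step non-blocking, which this sketch does not):

import java.util.concurrent.atomic.LongAdder;

class LowContentionTokenBucket {
    private final long ratePerSecond; // tokens (bytes) earned per second
    private final long capacity;      // maximum tokens the bucket can hold
    // the hot path only touches this LongAdder, so producers don't contend
    private final LongAdder pendingConsumed = new LongAdder();
    private long lastUpdateNanos = System.nanoTime();
    private volatile long tokens;

    LowContentionTokenBucket(long ratePerSecond, long capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
    }

    // called for every message: no locks, no scheduler
    void consume(long amount) {
        pendingConsumed.add(amount);
    }

    // called only when a throttling decision is needed,
    // e.g. whether to set auto read to false on the channel
    boolean containsTokens() {
        updateTokens();
        return tokens > 0;
    }

    // lazily consolidates earned and consumed tokens from the elapsed time;
    // synchronized only to keep the sketch short
    private synchronized void updateTokens() {
        long now = System.nanoTime();
        long elapsedMillis = (now - lastUpdateNanos) / 1_000_000L;
        if (elapsedMillis > 0) {
            lastUpdateNanos += elapsedMillis * 1_000_000L;
            long earned = elapsedMillis * ratePerSecond / 1000L;
            tokens = Math.min(capacity, tokens + earned - pendingConsumed.sumThenReset());
        }
    }
}

The token count can go negative in this sketch; that lines up with the earlier
point that a message which has already been read is counted against the
limiter regardless, and a deficit simply means a longer pause.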

Let's meet on Thursday at the Pulsar community meeting (schedule at
https://github.com/apache/pulsar/wiki/Community-Meetings) to present
our summaries and decide on the next steps together.


-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
One final reply from me before the holidays :)


On Sat, Nov 11, 2023 at 4:00 PM Lari Hotari <La...@hotari.net> wrote:

> Hi Girish,
>
> replies inline.
>
> > Hello Lari, replies inline. It's festive season here so I might be late
> in
> > the next reply.
>
> I'll have limited availability next week so possibly not replying
> until the following week. We have the next Pulsar community meeting on
> November 23rd, so let's wrap up all of this preparation by then.
>

Sounds good.


>
> > > How would the rate limiter know if 5MB traffic is degraded traffic
> > > which would need to be allowed to burst?
> > >
> >
> > That's not what I was implying. I was trying to question the need for
> > 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst
> > of 15MBps? But isn't this bucket only taking are of the delta beyond
> 10MBps?
> > Moreover, once 2 minutes are elapsed, which bucket is ensuring that the
> > rate is only allowed to go upto 10MBps?
>
> I guess there are multiple ways to model things. I was thinking of a
> model where the first bucket (and the "filling bucket" to add "earned"
> tokens with some interval) handles all logic related to the average
> rate limiting of 10MBps, including the bursting capacity which a token
> bucket inherently has. As we know, the token bucket doesn't have a
> rate limit when there are available tokens in the bucket. That's the
>

Actually, the capacity is meant to simulate that particular rate limit. If
we have 2 buckets anyways, the one managing the fixed rate limit part
shouldn't generally have a capacity more than the fixed rate, right?



> reason why there's the separate independent bucket with a relatively
> short token capacity to enforce the maximum rate limit of 15MBps.
> There needs to be tokens in both buckets for traffic to flow freely.
> It's like an AND and not an OR (in literature there are also examples
> of this where tokens are taken from another bucket when the main one
> runs out).
>

I think it can be done, especially with that one thing you mentioned about
holding off filling the second bucket for 10 minutes.. but it does become
quite complicated in terms of managing the flow of the tokens.. because
while we only fill the second bucket once every 10 minutes, after the 10th
minute, it needs to be filled continuously for a while (the duration we
want to support the bursting for).. and the capacity of this second bucket
also is governed by and exactly matches the burst value.
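
Just to put rough numbers on the example we've been using (10MBps fixed rate,
bursting to 15MBps for ~2 minutes, topped up every 10 minutes) -
back-of-the-envelope from my side:

    capacity for the full burst rate over the burst window:   15 MB/s * 120 s = 1800 MB
    capacity for only the delta beyond the fixed rate:        (15 - 10) MB/s * 120 s = 600 MB
    left-over tokens in a 10 minute window at 5 MB/s actual:  (10 - 5) MB/s * 600 s = 3000 MB

so after even one quiet 10 minute window, either variant of the bucket ends up
completely full, which is exactly the token flow I find hard to reason about.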


> > general approach of where the tokens are initially required to be earned
> > might be helpful to tackle cold starts, but beyond that, a topic doing
> > 5MBps to accumulate enough tokens to burst for a few minutes in the
> future
> > doesn't really translate to the physical world, does it? The 1:1
> > translation here is to an SSD where the topic's data is actually being
> > written to - So my bursting upto 15MBps doesn't really depend on the fact
> > that I was only doing 5 MBps in the last few minutes (and thus,
> > accumulating the remaining 5MBps worth tokens towards the burst) - now
> does
> > it? The SSD won't really gain the ability to allow for burst just because
> > there was low throughput in the last few minutes. Not even from a space
> POV
> > either.
>
> This example of the SSD isn't concrete in terms of Pulsar and
> Bookkeeper. The capacity of a single SSD should and must be
> significantly higher than the maximum write throughput of a single
> topic. Therefore a single SSD shouldn't be a bottleneck. There will
>

Agreed that it is much higher than a single topic's max throughput.. but
the context of my example had multiple topics lying on the same
broker/bookie ensemble bursting together at the same time because they had
been saving up on tokens in the bucket.

> always be a need to overprovision resources. You usually don't want to
> go beyond 60% or 70% utilization on disk, cpu or network resources so
> that queues in the system don't start to increase and impact
> latencies. In Pulsar/Bookkeeper, the storage solution has a very
> effective load balancing, especially for writing. In Bookkeeper each
> ledger (the segment) of a topic selects the "ensemble" and the "write
> quorum", the set of bookies to write to, when the ledger is opened.
> The bookkeeper client could also change the ensemble in the middle of
> a ledger due to some event like a bookie becoming read-only or
>

While it does do that on complete failure of bookie or a bookie disk, or
broker going down, degradations aren't handled this well. So if all topics
in a bookie are bursting due to the fact that they had accumulated tokens,
then all it will lead to is breach of write latency SLA because at one
point, the disks/cpu/network etc will start choking. (even after
considering the 70% utilization i.e. 30% buffer)


> > now read the netty channel, parse the message and figure this out before
> > checking for rate limiting.. At which point, rate limiting isn't really
> > doing anything if the broker is reading every message anyway. This may
> also
>
> This is how it works currently: the message is read, the rate limiter
> is updated and the message is sent regardless of the result of the
> rate limiter "tryAcquire" call. In asynchronous flows this makes
>

That's only in the case of the default rate limiter where the tryAcquire
isn't even implemented.. since the default rate limiter checks for breach
only at a fixed rate rather than before every produce call. But in the case of
the precise rate limiter, the response of `tryAcquire` is respected.


>
> > At no point I am trying to go beyond the purview of a single broker here.
> > Unless, by system, you meant a single broker itself. For which, I talk
> > about a broker level rate limiter further below in the example.
>
> My point was that there's a need to take a holistic approach of
> capacity management. The Pulsar system is a distributed system and the
> storage system is separate, unlike storage on Kafka. There isn't an
> equivalent of the Pulsar load balancing in Kafka. In the Kafka world,
> you can calculate the capacity of a single broker. In Pulsar, the
> model is very different.
> If a broker is overloaded, the Pulsar load balancer can quickly shed
> load from that broker. I feel that focusing on a single broker and
>

True, and actually, due to the fact that pulsar auto distributes topics
based on load shedding parameters, we can actually focus on a single
broker or a single bookie ensemble and assume that it works as we scale
it. Of course this means putting reasonable limits in terms of
cpu/network/partition/throughput at each broker level, and pulsar
provides ways to do that automatically.


> protecting the resources on it won't be helpful in the end. It's not
> to say that things cannot be improved from a single broker
> perspective. One essential role of rate limiters in capacity management is
> to prevent a single resource from becoming overloaded. The Pulsar load
> balancer / load manager plays a key role in Pulsar.
>
> The other important role of rate limiters is about handling end
> application expectation of the service level. I'm not sure if someone
> remembers the example from the  Google SRE book about a too good
> service level which caused a lot of problems and outages when there
> finally was a service level degradation. The problem was that
> application got coupled to the behavior that there aren't any
> disruptions in the service. IIRC, the solution was to inject failures
> to production periodically to make the service consumers get used to
> service disruptions and design their application to have the required
> resilience. I think it was Netflix that introduced the chaos monkey
> and also ran it in production to ensure that service consumers are
> sufficiently resilient to service disruptions.
>
> In the context of Pulsar, it is possible that messaging applications
> get coupled and depend on a service level that isn't well defined.
> From this perspective, the key role of rate limiters is to ensure that
> the service provider and service consumer have a way to express the
> service level objective (SLO) in the system. If the rate isn't capped,
> there's the risk that service consumers get depended on very high
> rates which might only be achievable in a highly overprovisioned
> system.
>

+1 All platforms should induce chaos from time to time and have SLAs that
make sense rather than trying to be too good to be true.


> I wonder if you would be interested in sharing your requirements or
> context around SLOs and how you see the relationship to rate limiters?
>

While I have shared the core requirements over these threads (fixed rate +
burst multiplier for upto X duration every Y minutes).. We are finalizing
the detailed requirements internally to present. As I replied in my previous
mail, one outcome of detailed internal discussion was the discovery of
throughput contention.


> > Just as a side note, in practice, resource group level rate limiting is
> > very very imprecise due to the inherent nature that it has to sync broker
> > level data between multiple brokers at a frequency.. thus the data is
> > always stale and the produce goes way beyond the set limit, even when
> > resource groups use the precise rate limiter today.
>
> I think that it's a good starting point for further improvements based
> on feedback. It's currently in early stages.
> Based on what you have shared about your use cases for rate limiting,
> I'd assume that you would go beyond a single broker in the future.
>

We do use resource groups for certain namespace level quotas, but even in
our use case, rate limiter and resource groups are two separate tangents.
At least for foreseeable future.


> It would be interesting to hear more of practical details of what
> makes your use case so special that you need a custom rate limiter and
> it couldn't be covered by a generic rate limiting which is configured
> to your use case and requirements. I guess I just didn't want to keep
> on repeating the same things again and again and instead I've wanted
> to learn more about your use cases and work together with you to solve
> your requirements and attempt to generalize it in a way that we could
> have the implementation directly in Pulsar core.
> As we discussed these 2 attempts aren't conflicting. We might end up
> having a generic rate limiter that suits most of your requirements
> and for some requirements you might need that pluggable rate limiter.
> However, throughout all these messages, there haven't yet been any
> concrete examples that make your requirements so special that they
> couldn't be covered directly in Pulsar core. Perhaps that's something
> that you are also iterating on and you might be learning new ways to
> handle your use cases and requirements while we make progress and
> small steps on improving the current rate limiter and the
> maintainability of it in preparation for further improvements.
>

Will close this before the 23rd.


> > I did not want to confuse you here by bringing in more than one broker
> into
> > picture. In any case, resource groups are not a solution here. What you
> are
> > suggesting is that we put a check on a namespace or tenant by putting in
> a
> > resource group level rate limiting - all that does is restrict the scope
> of
> > contention of capacity within that tenant or namespace. Moreover,
> resource
> > group level rate limiting is not precise. In another internal discussion
> > about finalizing the requirements from our side, we found that the
> example
> > i provided where multiple topics are bursting at the same time and are
> > contending for the capacity (globally or within a resource group), there
> > will be situation where the contention is now leading to throttling of a
> > topic which is not even trying to burst - since the broker level/resource
> > group level rate limiter will just block any and all produce requests
> after
> > the permits have been exhausted.. the bursting topics could have very
> well
> > exhausted those in the first 800ms of a second while a non bursting topic
> > actually should have had the right to produce in the remaining 200ms of a
> > second.
>
> Thanks for sharing this feedback about resource groups. The problem
>

This is actually not a resource group limitation, but a general broker
level example. Even if resource groups weren't in the picture, this issue
remains. The fact remains that since we need to have reasonable broker
level limits (as a result of our NFR testing etc), there will be a clash of
topics where some topics trying to burst take up the broker's capacity for
serving the fixed rate of other topics. This needs special handling even
after the dual token approach. I will have detailed requirements with examples
by the 23rd.

> That is the one of the reasons for putting a strict check on the amount,
> > duration and frequency of bursting.
>
> Please explain more about this. It would be helpful for me in trying
> to understand your requirements.
> One question about the solution of "strict check on amount, duration
> and frequency of bursting".
> Do you already have a PoC where you have validated that the solution
> actually solves the problem?
>

No, this is still theoretical based on our understanding of rate limiter,
dual tokens, current pulsar code, etc. If topics are allowed to burst
without a real duration based limitation, then the chances of more and more
topics contending for the broker's actual capacity are high, and thus it
hinders/contends with (a) a new topic trying to burst and make use of the
bursting feature with its SLA, and (b) another topic, not bursting, way
within its fixed rate, still being rate limited due to lack of capacity
(which is taken up by bursting topics) at a broker level.



> Could you simply write a PoC by forking apache/pulsar and changing the
> code directly without having it pluggable initially?
>
>
This would be a last resort :) . At most if things don't go as desired,
then we would end up doing this and adding plugging logic in a pulsar fork.
From a PoC perspective, we may try it out soon.


> > Would love to hear a plan from your point of view and what you envision.
>
> Thanks, I hope I was able to express most of it in these long emails.
> I'm having a break next week and after that I was thinking of
> summarizing this discussion from my viewpoint and then meet in the
> Pulsar community meeting on November 23rd to discuss the summary and
> conclusions and the path forward. Perhaps you could also prepare in a
>

Sounds like a plan!


> similar way where you summarize your viewpoints and we discuss this on
> Nov 23rd in the Pulsar community meeting together with everyone who is
> interested to participate. If we have completed the preparation before
> the meeting, we could possibly already exchange our summaries
> asynchronously before the meeting. Girish, Would this work for you?
>
Yes, we can exchange it before 23rd. I can come back on my final
requirements and plan by end of next week.

Regards
-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <La...@hotari.net>.
Hi Girish,

replies inline.

> Hello Lari, replies inline. It's festive season here so I might be late in
> the next reply.

I'll have limited availability next week so possibly not replying
until the following week. We have the next Pulsar community meeting on
November 23rd, so let's wrap up all of this preparation by then.

> > How would the rate limiter know if 5MB traffic is degraded traffic
> > which would need to be allowed to burst?
> >
>
> That's not what I was implying. I was trying to question the need for
> 1800MB worth of capacity. I am assuming this is to allow a 2 minute burst
> of 15MBps? But isn't this bucket only taking care of the delta beyond 10MBps?
> Moreover, once 2 minutes are elapsed, which bucket is ensuring that the
> rate is only allowed to go upto 10MBps?

I guess there are multiple ways to model things. I was thinking of a
model where the first bucket (and the "filling bucket" to add "earned"
tokens with some interval) handles all logic related to the average
rate limiting of 10MBps, including the bursting capacity which a token
bucket inherently has. As we know, the token bucket doesn't have a
rate limit when there are available tokens in the bucket. That's the
reason why there's the separate independent bucket with a relatively
short token capacity to enforce the maximum rate limit of 15MBps.
There needs to be tokens in both buckets for traffic to flow freely.
It's like an AND and not an OR (in literature there are also examples
of this where tokens are taken from another bucket when the main one
runs out).

> > Without "earning the right to burst", there would have to be some
> > other way to detect whether there's a need to burst.
>
> Why does the rate limiter need to decide if there is a need to burst? It is
> dependent on the incoming message rate. Since the rate limiter has no
> knowledge of what is going to happen in future, it cannot assume that the
> messages beyond a fixed rate (10MB in our example) can be held/paused until
> next second - thus deciding that there is no need to burst right now.

To explain this, I'll first take a quote from your email about the
bursting use cases, sent on Nov 7th:
> Adding to what I wrote above, think of this pattern like the following: the produce rate slowly increases from ~2MBps at around 4 AM to a known peak of about 30MBps by 4 PM and
> stays around that peak until 9 PM after which is again starts decreasing until it reaches ~2MBps around 2 AM.
> Now, due to some external triggers, maybe a scheduled sale event, at 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go back down to the usual ~20MBps .

Let's observe these cases.
(When we get further, it will be helpful to have a set of well defined
representative concrete use case / scenario descriptions with concrete
examples of expected behavior. Failure / chaos scenarios and
operations (such as software upgrade) scenarios should also be
covered.)

> Why does the rate limiter need to decide if there is a need to burst? It is
> dependent on the incoming message rate. Since the rate limiter has no
This is a very good question. When we look at these examples, it feels
that we'd need to define what the actual rules for the bursting are
and how bursting is decided.
It seems that there are two types of ways: "earning the right to burst"
with the token bucket so that bursting could happen whenever there are
extra tokens in the bucket; or having a way to detect that there's a
need to burst. That relates directly to SLAs. If there's an SLA where
message processing end-to-end latencies must be low, that usually
would be violated if large backlogs pile up. I think you said
something about avoiding getting into this at all. I agree that this
would add complexity, but for the sake of explaining the possible
cases for the need for bursting, I think that it's necessary to cover
the end-to-end aspects.


> While I understand that earning the right to burst model works well in
> practice, another approach to think about this is that the bucket,
> initially, starts filled. Moreover, the tokens here are being filled into
> the bucket due to the available disk, cpu and network bandwidth.. The

I completely agree. If it's fine to rely on the token bucket for
bursting, this simplifies a lot of things. As you are mentioning,
there could be multiple ways to fill/add tokens to the bucket.

> general approach of where the tokens are initially required to be earned
> might be helpful to tackle cold starts, but beyond that, a topic doing
> 5MBps to accumulate enough tokens to burst for a few minutes in the future
> doesn't really translate to the physical world, does it? The 1:1
> translation here is to an SSD where the topic's data is actually being
> written to - So my bursting upto 15MBps doesn't really depend on the fact
> that I was only doing 5 MBps in the last few minutes (and thus,
> accumulating the remaining 5MBps worth tokens towards the burst) - now does
> it? The SSD won't really gain the ability to allow for burst just because
> there was low throughput in the last few minutes. Not even from a space POV
> either.

This example of the SSD isn't concrete in terms of Pulsar and
Bookkeeper. The capacity of a single SSD should and must be
significantly higher than the maximum write throughput of a single
topic. Therefore a single SSD shouldn't be a bottleneck. There will
always be a need to overprovision resources. You usually don't want to
go beyond 60% or 70% utilization on disk, cpu or network resources so
that queues in the system don't start to increase and impact
latencies. In Pulsar/Bookkeeper, the storage solution has a very
effective load balancing, especially for writing. In Bookkeeper each
ledger (the segment) of a topic selects the "ensemble" and the "write
quorum", the set of bookies to write to, when the ledger is opened.
The bookkeeper client could also change the ensemble in the middle of
a ledger due to some event like a bookie becoming read-only or
unavailable. This effectively load balances the writes across all
bookies.
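
As a concrete example of what this means: with a configuration of
ensembleSize=3, writeQuorum=2 and ackQuorum=2, each ledger picks 3 bookies and
every entry is written to 2 of them, so consecutive entries get striped across
the ensemble, and a new ledger (or an ensemble change) can pick a completely
different set of bookies. (The numbers here are just an example, not a
recommendation.)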

> In practice this level of cross component coordination would never result
> in a responsive and spontaneous system.. unless each message comes along
> with a "I waited in queue for this long" time and on the server side, we

Yes, the total end-to-end waiting time could be calculated from the
timestamp information when there's a way to estimate the clock diff
between the client and broker clocks. Since we are dealing with
queues, queuing theory could be applied to estimate queue sizes.
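
Roughly:

    estimated waiting time ~= (broker receive time) - (publish timestamp) - (estimated clock offset)

where the clock offset between the client and the broker could be estimated
from request/response round trips, similar to what NTP does. This is only a
sketch of the idea; I'm assuming the client-side publish timestamp in the
message metadata would be usable as the starting point.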

> now read the netty channel, parse the message and figure this out before
> checking for rate limiting.. At which point, rate limiting isn't really
> doing anything if the broker is reading every message anyway. This may also

This is how it works currently: the message is read, the rate limiter
is updated and the message is sent regardless of the result of the
rate limiter "tryAcquire" call. In asynchronous flows this makes
sense. The problem is more the naming of methods which makes it hard
to reason about.
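
As a sketch of that asynchronous pattern (my illustration, not the actual
Pulsar code; the TokenBucket interface and the fixed 100ms pause are just
placeholders):

import io.netty.channel.ChannelHandlerContext;
import java.util.concurrent.TimeUnit;

class PublishThrottlingSketch {
    interface TokenBucket {
        void consume(long tokens);
        boolean containsTokens();
    }

    private final TokenBucket tokenBucket;

    PublishThrottlingSketch(TokenBucket tokenBucket) {
        this.tokenBucket = tokenBucket;
    }

    void handlePublish(ChannelHandlerContext ctx, long msgSizeInBytes, Runnable persistMessage) {
        // the message has already been read from the channel, so it is
        // counted against the limiter and processed regardless
        tokenBucket.consume(msgSizeInBytes);
        persistMessage.run();

        if (!tokenBucket.containsTokens()) {
            // backpressure: stop reading further messages from this connection
            ctx.channel().config().setAutoRead(false);
            // resume later on the Netty event loop; a fixed pause is used here
            // for illustration, a real implementation would derive it from the
            // token deficit
            ctx.executor().schedule(
                    () -> ctx.channel().config().setAutoRead(true),
                    100, TimeUnit.MILLISECONDS);
        }
    }
}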

> require prioritization of messages as some messages from a different
> producer to the same partition may have waited longer than others.. This is
> out of scope of a rate limiter at this point.

This is a good observation. Yes, I agree that this is not part of the
scope at this point.

> Rate limiter is a major facilitator and guardian of capacity planning. So,
> it does relate here.

+1

> At no point I am trying to go beyond the purview of a single broker here.
> Unless, by system, you meant a single broker itself. For which, I talk
> about a broker level rate limiter further below in the example.

My point was that there's a need to take a holistic approach of
capacity management. The Pulsar system is a distributed system and the
storage system is separate, unlike storage on Kafka. There isn't an
equivalent of the Pulsar load balancing in Kafka. In the Kafka world,
you can calculate the capacity of a single broker. In Pulsar, the
model is very different.
If a broker is overloaded, the Pulsar load balancer can quickly shed
load from that broker. I feel that focusing on a single broker and
protecting the resources on it won't be helpful in the end. It's not
to say that things cannot be improved from a single broker
perspective. One essential role of rate limiters in capacity management is
to prevent a single resource from becoming overloaded. The Pulsar load
balancer / load manager plays a key role in Pulsar.

The other important role of rate limiters is about handling end
application expectation of the service level. I'm not sure if someone
remembers the example from the  Google SRE book about a too good
service level which caused a lot of problems and outages when there
finally was a service level degradation. The problem was that
application got coupled to the behavior that there aren't any
disruptions in the service. IIRC, the solution was to inject failures
to production periodically to make the service consumers get used to
service disruptions and design their application to have the required
resilience. I think it was Netflix that introduced the chaos monkey
and also ran it in production to ensure that service consumers are
sufficiently resilient to service disruptions.

In the context of Pulsar, it is possible that messaging applications
get coupled and depend on a service level that isn't well defined.
From this perspective, the key role of rate limiters is to ensure that
the service provider and service consumer have a way to express the
service level objective (SLO) in the system. If the rate isn't capped,
there's the risk that service consumers get depended on very high
rates which might only be achievable in a highly overprovisioned
system.

I wonder if you would be interested in sharing your requirements or
context around SLOs and how you see the relationship to rate limiters?

> Just as a side note, in practice, resource group level rate limiting is
> very very imprecise due to the inherent nature that it has to sync broker
> level data between multiple brokers are a frequency.. thus the data is
> always stale and the produce goes way beyond the set limit, even when
> resource groups use the precise rate limiter today.

I think that it's a good starting point for further improvements based
on feedback. It's currently in early stages.
Based on what you have shared about your use cases for rate limiting,
I'd assume that you would go beyond a single broker in the future.

> In the very next sentence it also says "so a key takeaway from this was
> though bursting was great it solved some of the customers problems um we
> had it wasn't sufficient enough and we had tightly coupled how partition
> level capacity to admission control right so we realized that removing the
> admission control from partition from partition level would be the most
> beneficial thing for customers and for us as well and letting the partition
> burst always so we implemented something called as a global admission
> control where we moved our mission control up"
> Looks like the fact that the partition wasn't able to work due to lack of
> tokens was a problem for them? I admit I haven't gone through the full
> video yet.

I think that it provides a good description of what kind of
flexibility service consumers (tenant in a multi-tenant) system could
be looking for.
The particular bursting solution could be thought of as a sort of
serverless autoscaling solution from the service consumer's
perspective. The service consumer pays for what they use.
This has a lot of business value and customers are willing to pay for
such flexibility. From the service provider's perspective, there's a
proper foundation for SLOs that can be operationalized in a way where
the system can be optimally provisioned and scaled when capacity
demand increases. In a multi-tenant system, the peak loads of
different customers smoothen out a lot of the peaks and the load is
fairly steady. AWS DynamoDB and Confluent Kora are examples of such
multi-tenant systems where the "control plane" is essential to the
operations and running at scale.

> > Token bucket based solutions are widely used in the industry. I don't
> > see a reason why we should be choosing some other conceptual model as
> > the basis.
> >
>
> Not another conceptual model, but we should be using the token bucket in a
> way that works for real world situations..

I completely agree.

> I agree with you here. That's why I mentioned the ideal state of rate
> limiter in pulsar OSS version. I see that you did not comment on that in
> your reply though..

I guess you are referring to this comment of yours:

> I know and almost all OSS projects around the messaging world have very
> very basic rate limiters. That is exactly why I am originally requesting
> a pluggable rate limiter to support our use case as that's not a common
> thing to have (the level of complexity) in the OSS project itself. Although, I
> also acknowledge the need of a lot of refactoring and changes in the
> based design as we have been discussing here.

It would be interesting to hear more of practical details of what
makes your use case so special that you need a custom rate limiter and
it couldn't be covered by a generic rate limiting which is configured
to your use case and requirements. I guess I just didn't want to keep
on repeating the same things again and again and instead I've wanted
to learn more about your use cases and work together with you to solve
your requirements and attempt to generalize it in a way that we could
have the implementation directly in Pulsar core.
As we discussed these 2 attempts aren't conflicting. We might end up
having a generic rate limiter that suits most of your requirements
and for some requirements you might need that pluggable rate limiter.
However, throughout all these messages, there haven't yet been any
concrete examples that make your requirements so special that they
couldn't be covered directly in Pulsar core. Perhaps that's something
that you are also iterating on and you might be learning new ways to
handle your use cases and requirements while we make progress and
small steps on improving the current rate limiter and the
maintainability of it in preparation for further improvements.

> > Optimally the pause would be such a duration that there's enough
> > tokens to send the next message.
> > One possible way to estimate the size of the next message is to use
> > the average size or a rolling average size as the basis of the
> > calculation.
> > It might be as short as the minimum time to pause for the timer.
> > Instead of using a generic scheduler for pausing, Netty's scheduler
> > should be used.
> > It is very efficient for high volumes. If there's a bottleneck, it's
> > possible to have some minimum duration to pause.
> >
>
> I still don't see any advantage here. What are we trying to save here?
> Honestly, any kind of guessing inside the code will lead to weird anomalies
> when running in a big system with various different usecases.

Yes, that's a good question. The thought I had was about minimizing
latency caused by rate limiting. Rate limiting will impact latency
since things will have to be slowed down so I agree that it's not
really a useful goal to minimize the pause. I guess a better solution
is to explicitly configure the minimum available tokens when the pause
is scheduled to end or simply by pausing for a configurable amount of
time at a time, let's say 100ms or 1 second.
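
Something like this is what I have in mind for the first option (sketch only;
the names and numbers are made up):

final class RateLimiterMath {
    // how long to pause so that at least minimumTokensAfterPause tokens are
    // available when reads resume; currentTokens can be negative when the
    // bucket has been overdrawn by messages that were already read
    static long calculatePauseMillis(long currentTokens, long minimumTokensAfterPause,
                                     long ratePerSecond) {
        long deficit = minimumTokensAfterPause - currentTokens;
        if (deficit <= 0) {
            return 0; // enough tokens already, no need to pause
        }
        // time needed to earn the missing tokens at the configured rate
        return deficit * 1000L / ratePerSecond;
    }
}

For example, with the bucket overdrawn by 5MB, a 1MB minimum and a 10MB/s
rate, this would give a pause of (1MB + 5MB) / 10MB/s = 600ms.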

> I did not want to confuse you here by bringing in more than one broker into
> picture. In any case, resource groups are not a solution here. What you are
> suggesting is that we put a check on a namespace or tenant by putting in a
> resource group level rate limiting - all that does is restrict the scope of
> contention of capacity within that tenant or namespace. Moreover, resource
> group level rate limiting is not precise. In another internal discussion
> about finalizing the requirements from our side, we found that the example
> i provided where multiple topics are bursting at the same time and are
> contending for the capacity (globally or within a resource group), there
> will be situation where the contention is now leading to throttling of a
> topic which is not even trying to burst - since the broker level/resource
> group level rate limiter will just block any and all produce requests after
> the permits have been exhausted.. the bursting topics could have very well
> exhausted those in the first 800ms of a second while a non bursting topic
> actually should have had the right to produce in the remaining 200ms of a
> second.

Thanks for sharing this feedback about resource groups. The problem
sounds similar to what was described as the reason to create AWS
DynamoDB Global Admission Control.
There seem to be some gaps in the resource groups design that it's
not able to solve the problem that such solution should solve (when
comparing to the AWS DynamoDB case)?

> That is the one of the reasons for putting a strict check on the amount,
> duration and frequency of bursting.

Please explain more about this. It would be helpful for me in trying
to understand your requirements.
One question about the solution of "strict check on amount, duration
and frequency of bursting".
Do you already have a PoC where you have validated that the solution
actually solves the problem?
Could you simply write a PoC by forking apache/pulsar and changing the
code directly without having it pluggable initially?

> That's not the only difference, even ignoring the bug.. the default one has
> a scheduler running at a lower than 1 second interval. It's only in this
> scheduled method that it checks if the limit has breached and blocks the
> auto-read.. So if a produce is able to produce whatever they like before
> the first trigger of this scheduler, then there is practically no rate
> limiting. Now, I don't know if we can call this a contract.. it's more of an
> implicit one .. or rather, if there is any rate limiting contract in
> documentation, this default one certainly breaks that.

Yes, you are right. I had forgotten and I completely missed that code
path when the default rate limiter is used in publish throttling.
Thanks for providing the analysis. I wouldn't call it a contract. It
should be possible to fix such bugs. However, it's true that behavior
that is considered a bug is part of a contract for someone. I think
it's called Hyrum's laws which states this too.
Obviously it's the Pulsar community and PMC that decides whether the
default rate limiter could be refactored. It's pretty bad if a project
gets into a state where changes cannot be made.
I believe that we can find a model where 99% of the users will be
happy. Can't we just require the remaining 1% to adapt? :)


> As mentioned in the comment above, resource groups shouldn't be discussed
> here.. they are out of scope of the discussion involving partition level
> rate limiting and I did not intend to bring them into the discussion.
> I would like to see a comment on my proposal about the state of pulsar rate
> limiter. I believe that's true "meeting in the middle".

Yes, we don't need to dive deep into details. However, since rate
limiting in Pulsar won't be sufficient on its own to handle capacity
management challenges, I think it's useful to at least introduce the
concepts so that with the common background information, it's easier
to refer to the role of resource groups when something in the rate
limiter doesn't belong there but instead belongs to the resource group
part of capacity management in Pulsar. Similarly referring to the
Pulsar load balancing / load manager and defining and clarifying its
role in capacity management will also help holistic improvements to
Pulsar. That's the end goal that many of us have and I believe it's
also important for you when you are operating Pulsar at scale.

> Thank you for painstakingly replying to these long emails.. This probably
> is the longest thread in pulsar ML in recent times :)

Likewise, you have been doing your part really well. Thanks! This has
been a valuable discussion. It is hard to measure the value of
learning and gaining understanding of a problem. The value will show
up in the future when we are better equipped to solve the right
problems instead of solving the problems that we'd like to solve. :)

> Would love to hear a plan from your point of view and what you envision.

Thanks, I hope I was able to express most of it in these long emails.
I'm having a break next week and after that I was thinking of
summarizing this discussion from my viewpoint and then meet in the
Pulsar community meeting on November 23rd to discuss the summary and
conclusions and the path forward. Perhaps you could also prepare in a
similar way where you summarize your viewpoints and we discuss this on
Nov 23rd in the Pulsar community meeting together with everyone who is
interested to participate. If we have completed the preparation before
the meeting, we could possibly already exchange our summaries
asynchronously before the meeting. Girish, Would this work for you?

-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari, replies inline. It's festive season here so I might be late in
the next reply.


On Fri, Nov 10, 2023 at 4:51 PM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> >
> >
> > > tokens would be added to the bucket as are consumed by the actual
> > > traffic. The left over tokens 10MB - actual rate would go to a
> > > separate filling bucket that gets poured into the actual bucket every
> > > 10 minutes.
> > > This first bucket with this separate "filling bucket" would handle the
> > > bursting up to 1800MB.
> > >
> >
> > But this isn't the requirement? Let's assume that the actual traffic has
> > been 5MB for a while and this 1800MB capacity bucket is all filled up
> now..
> > What's the real use here for that at all?
>
> How would the rate limiter know if 5MB traffic is degraded traffic
> which would need to be allowed to burst?
>

That's not what I was implying. I was trying to question the need for
1800MB worth of capacity. I am assuming this is to allow a 2 minute burst
of 15MBps? But isn't this bucket only taking care of the delta beyond 10MBps?
Moreover, once 2 minutes are elapsed, which bucket is ensuring that the
rate is only allowed to go upto 10MBps?


> > I think this approach of thinking about rate limiter - "earning the right
> > to burst by letting tokens remain into the bucket, (by doing lower than
> > 10MB for a while)" does not fit well in a messaging use case in real
> > world, or theoretic.
>
> It might feel like it doesn't fit well, but this is how most rate
> limiters work. The model works very well in practice.
> Without "earning the right to burst", there would have to be some
> other way to detect whether there's a need to burst.
>

Why does the rate limiter need to decide if there is a need to burst? It is
dependent on the incoming message rate. Since the rate limiter has no
knowledge of what is going to happen in future, it cannot assume that the
messages beyond a fixed rate (10MB in our example) can be held/paused until
next second - thus deciding that there is no need to burst right now.

While I understand that the earning-the-right-to-burst model works well in
practice, another way to think about this is that the bucket starts out
filled. Moreover, the tokens here are effectively backed by the available
disk, CPU and network bandwidth. The general approach where the tokens
initially need to be earned might be helpful to tackle cold starts, but
beyond that, a topic doing 5MBps in order to accumulate enough tokens to
burst for a few minutes in the future doesn't really translate to the
physical world, does it? The 1:1 translation here is to the SSD where the
topic's data is actually being written - my bursting up to 15MBps doesn't
really depend on the fact that I was only doing 5MBps in the last few
minutes (and thus accumulating the remaining 5MBps worth of tokens towards
the burst), does it? The SSD won't gain the ability to absorb a burst just
because there was low throughput in the last few minutes - not even from a
space point of view.



> The need to burst could be detected by calculating the time that the
> message to be sent has been in queues until it's about to be sent.
> In other words, from the end-to-end latency.
>

In practice, this level of cross-component coordination would never result
in a responsive and spontaneous system, unless each message comes along
with an "I waited in the queue for this long" timestamp and, on the server
side, we read the netty channel, parse the message and figure this out
before checking for rate limiting. At that point, rate limiting isn't really
doing anything if the broker is reading every message anyway. This may also
require prioritization of messages, as some messages from a different
producer to the same partition may have waited longer than others. This is
out of scope for a rate limiter at this point.

> Imagine a 100 of such topics going around with similar configuration of
> > fixed+burst limits and were doing way lower than the fixed rate for the
> > past couple of hours. Now that they've earned enough tokens, if they all
> > start bursting, this will bring down the system, which is probably not
> > capable of supporting simultaneous peaks of all possible topics at all.
>
> This is the challenge of capacity management and end-to-end flow
> control and backpressure.
>

A rate limiter is a major facilitator and guardian of capacity planning,
so it does relate here.


> With proper system wide capacity management and end-to-end back
> pressure, the system won't collapse.
>

At no point am I trying to go beyond the purview of a single broker here,
unless by "system" you meant a single broker itself - for which I talk
about a broker level rate limiter further below in the example.


> In Pulsar, we have "PIP 82: Tenant and namespace level rate limiting"
> [4] which introduced the "resource group" concept. There is the
> resourcegroup resource in the Admin REST API [5]. There's also a
> resource called "resource-quota" in the Admin REST API [6], but that
> seems to be abandoned.
>

I am well aware of resource groups. I personally do not want to go into
discussions about resource groups here as I am only talking about rate
limiting at a partition level, which does not need intra broker
communication or resource group feature for it to work fully.

Just as a side note, in practice, resource group level rate limiting is
very imprecise due to the inherent need to sync broker level data between
multiple brokers at a given frequency; thus the data is always stale and
the produce rate goes way beyond the set limit, even when resource groups
use the precise rate limiter today.


> > Now of course we can utilize a broker level fixed rate limiter to not
> allow
> > the overall throughput of the system to go beyond a number, but at that
> > point - all the earning semantic goes for a toss anyway since the
> behavior
> > would be unknown wrt which topics are now going through with bursting and
> > which are being blocked due to the broker level fixed rate limiting.
> >
> > As such, letting topics loose would not sit well with any sort of SLA
> > guarantees to the end user.
> >
> > Moreover, contrary to the earning tokens logic, in reality a topic
> _should_
> > be allowed to burst upto the SOP/SLA as soon as produce starts in the
> > topic. It shouldn't _have_ to wait for tokens to fill up as it does
> > below-fixed-rate for a while before it is allowed to burst. This is
> because
> > there is no real benefit or reason to not let the topic do such as the
> > hardware is already present and the topic is already provisioned
> > (partitions, broker spread) accordingly, assuming the burst.
> >
> > In an algorithmic/academic/literature setting, token bucket sounds really
> > promising.. but a platform with SLA to users would not run like that.
>
> AWS DynamoDB uses token bucket based algorithms. Please check the
> video I referred in one of the previous comments.
> at 10:26 [1] "The simplest way to do this workload isolation is the
> token bucket"
> at 13:05 [2] "partitions will get throttled because they're constantly
> running out of tokens from their bucket even though they had burst
> capacity because after the burst burst capacity is over you had to
> still take time to fill capacity in the token bucket while other
> partitions actually had capacity in their token buckets so..."
>

In the very next sentence it also says "so a key takeaway from this was
though bursting was great it solved some of the customers problems um we
had it wasn't sufficient enough and we had tightly coupled how partition
level capacity to admission control right so we realized that removing the
admission control from partition from partition level would be the most
beneficial thing for customers and for us as well and letting the partition
burst always so we implemented something called as a global admission
control where we moved our mission control up"
Looks like the fact that the partition wasn't able to work due to lack of
tokens was a problem for them? I admit I haven't gone through the full
video yet.


> at 13:53 [3] "GAC (Global Admission Control) is a service that builds
> on the same token bucket mechanism so when a request router receives a
> request it basically requests the GAC to kind of get some tokens and
> once request router uses the local tokens to make admission control
> decisions if it runs out of tokens it can request for more tokens"
>
> 1 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=10m26s
> 2 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=13m05s
> 3 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=13m54s
>
> Token bucket based solutions are widely used in the industry. I don't
> see a reason why we should be choosing some other conceptual model as
> the basis.
>

Not another conceptual model, but we should be using the token bucket in a
way that works for real-world situations.


> >
> >
> >
> > > In the current rate limiters in Pulsar, the implementation is not
> > > optimized to how Pulsar uses rate limiting. There's no need to use a
> > > scheduler for adding "permits" as it's called in the current rate
> > > limiter. The new tokens to add can be calculated based on the elapsed
> > > time. Resuming from the blocking state (auto read disabled -> enabled)
> > > requires a scheduler. The duration of the pause should be calculated
> > >
> >
> > The precise one already does this. There is no scheduler for adding the
> > tokens, only scheduler to enable back the auto-read on netty channel.
>
> Please check the source code again [1]. The scheduler is used to do
> something that is equivalent of "adding the tokens".
> The "createTask" method schedules the task to call the "renew" method.
>
> 1 -
> https://github.com/apache/pulsar/blob/3c067ce28025e116146977118312a1471ba284f5/pulsar-common/src/main/java/org/apache/pulsar/common/util/RateLimiter.java#L260-L278
>
> > It looks like your inclination is to simply remove the default "poller"
> > implementation and just let precise one in the code base?
>
> No it's not. It's about replacing the solution with an implementation
> that is using the conceptual model of the token bucket. This will make
> the implementation more maintainable and we will be able to clean up
> the abstractions. Concepts and abstractions are the key of interface
> design.
> In the current state, the rate limiter is very hard to reason about.
> There's no reason to keep it in the code base. I'll explain more about
> that later.
> Besides the maintainability, another clear problem in the current rate
> limiter is the heavy usage of synchronization and the already
> mentioned unnecessary use of the scheduler.
>

I agree with you here. That's why I mentioned the ideal end state of the
rate limiter in the Pulsar OSS version. I see that you did not comment on
that in your reply though.


>
> > This part is a bit unclear to me. When are you suggesting to enable back
> > the auto-read on the channel? Ideally, it should open back up the next
> > second. Time to the next second could vary from 1ms to 999ms.. I do not
> > think that starting an on demand scheduler of delta ms would be accurate
> > enough to achieve this.. A newly created scheduled task is anyways not
> this
> > accurate.. With Java virtual threads coming up, I believe having one, or
> N,
> > constantly running scheduled tasks of 1 second interval would not be
> heavy
> > at all.
> > In fact, even in Java 8, as per my benchmarks, it doesn't seem to take up
> > much frames at all.
>
> Optimally the pause would be such a duration that there's enough
> tokens to send the next message.
> One possible way to estimate the size of the next message is to use
> the average size or a rolling average size as the basis of the
> calculation.
> It might be as short as the minimum time to pause for the timer.
> Instead of using a generic scheduler for pausing, Netty's scheduler
> should be used.
> It is very efficient for high volumes. If there's a bottleneck, it's
> possible to have some minimum duration to pause.
>

I still don't see any advantage here. What are we trying to save?
Honestly, any kind of guessing inside the code will lead to weird anomalies
when running in a big system with various different use cases.


>
> > "10 minutes" was just an example. Part of the problem is explained
> above..
> > If a topic is allowed to just burst forever (irrespective of the fact
> that
> > it has earned the tokens or not) - it will start competing with other
> > topics that are also bursting. This is a problem from a capacity planning
> > point of view. Generally this is how capacity planning works
> >
> >    1. We will first do redline testing to determine how much MBps a set
> of
> >    single broker and an ensemble of bookies can achieve given a hardware
> >    configuration of broker and bookies.
> >    2. Assuming linear horizontal scaling, we then setup enough brokers
> and
> >    bookies in order to sustain a required throughput in the cluster.
> Suppose,
> >    for this example, the cluster can now do 300MBps
> >
> > Now, assuming 50% bursting is allowed in a topic - does that mean the sum
> > of fixed rate of all topics in the cluster should not exceed 200MBps? -
> If
> > yes - then there is literally no difference in this situation vs just
> > having a 50% elevated fixed rate of all the topic with 0 bursting
> support.
> > To actually take advantage of this bursting feature, the capacity
> reserved
> > to handle topic bursts would not be equal to the allowed bursting, but
> much
> > less. For example, in this cluster that can do 300MBps, we can let the
> sum
> > of fixed rate of all topics go up to 270MBps and keep just 10% of the
> > hardware (30MBps) for bursting needs. This way, we are able to sustain
> more
> > topics in the same cluster, with support of burst. This would only work
> if
> > all topics do not burst together at the same time - and that's where the
> > need to restrict duration and frequency of burst comes into play.
> >
> > Ofcourse, there can be other ways to model this problem and requirement..
> > and additional checks in place, but I believe this approach is the most
> > deterministic approach with a good balance between efficient use of
> > hardware and not being completely stringent wrt throughput on the topics.
>
> I commented on these capacity management aspects in one of my previous
> emails as well as in an earlier comment in this email.
> Based on what you are describing as the scenario, I think you are
> looking for system wide capacity management.
> The Pulsar Resource Groups (PIP-82) [1] already has solutions in this
> area and could be developed further.
> AWS DynamoDb's Global Admission Control is a good example from the
> industry how this problem was solve there [2].
> It might be worth improving and updating the PIP-82 design based on
> the learnings from Confluent Kora paper and AWS DynamoDB Admission
> control paper and presentation.
> PIP-82 focuses a lot on rate limiting. However, there's also a need
> for something what Confluent Kora paper explains in "5.2.2 Dynamic
> Quota Management" with  "The tenant quotas are auto-tuned
> proportionally to their total quota allocation on the broker. This
> mechanism ensures fair sharing of resources among tenants during
> temporary overload and re-uses the quota enforcement mechanism for
> backpressure."
>
> 1 -
> https://github.com/apache/pulsar/wiki/PIP-82:-Tenant-and-namespace-level-rate-limiting
> 2 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=13m54s
>
>
I did not want to confuse you here by bringing more than one broker into
the picture. In any case, resource groups are not a solution here. What you
are suggesting is that we put a check on a namespace or tenant via resource
group level rate limiting - all that does is restrict the scope of the
capacity contention to that tenant or namespace. Moreover, resource group
level rate limiting is not precise. In another internal discussion about
finalizing the requirements from our side, we found that in the example I
provided - where multiple topics are bursting at the same time and
contending for capacity (globally or within a resource group) - there will
be situations where the contention leads to throttling of a topic which is
not even trying to burst, since the broker level/resource group level rate
limiter will just block any and all produce requests after the permits have
been exhausted. The bursting topics could very well have exhausted those
permits in the first 800ms of a second, while a non-bursting topic actually
should have had the right to produce in the remaining 200ms of that second.

That is one of the reasons for putting a strict check on the amount,
duration and frequency of bursting.
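
To make that contention scenario concrete, here is a trivial, purely
illustrative sketch (the 100MB budget is just an example number, and this
is not Pulsar code) of why a single shared bucket ends up punishing the
well-behaved topic:

public class SharedBucketContention {
    // One shared per-second budget at the broker/resource group level,
    // drained by whichever topics get to it first.
    static long sharedTokens = 100 * 1024 * 1024; // 100MB for this second

    static boolean tryConsume(long bytes) {
        if (sharedTokens < bytes) {
            return false;
        }
        sharedTokens -= bytes;
        return true;
    }

    public static void main(String[] args) {
        // Bursting topics consume the whole budget in the first ~800ms...
        tryConsume(100L * 1024 * 1024);
        // ...so a topic publishing well within its own fixed rate is still
        // throttled for the remaining ~200ms of the second.
        System.out.println(tryConsume(1024 * 1024)); // prints: false
    }
}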


> It's not a breaking change to replace the existing rate limiters.
> The rate limiter implementation is an internal implementation detail.
> It would become a breaking change if the feature "contract" would be
> changed drastically.
> In this case, it's possible to make the "precise" option "precise" and
> perhaps simulate the
> way how the default rate limiter works.
> I think that besides CPU consumption, the problem with the default
> rate limiter is simply that it
> contains a bug. The bug is in the tryAcquire method [1]. The default
> handling should simply match what there is for
> isDispatchOrPrecisePublishRateLimiter.
> After this, the only difference between precise and default is in the
> renew method where there's this calculation:
>

That's not the only difference, even ignoring the bug. The default one has
a scheduler running at an interval shorter than 1 second, and it's only in
this scheduled method that it checks whether the limit has been breached
and blocks the auto-read. So if a producer is able to push whatever it
likes before the first trigger of this scheduler, then there is practically
no rate limiting. Now, I don't know if we can call this a contract - it's
more of an implicit one - or rather, if there is any rate limiting contract
in the documentation, this default one certainly breaks it.


> > I know and almost all OSS projects around the messaging world have very
> > very basic rate limiters. That is exactly why I am originally requesting
> a
> > pluggable rate limiter to support our use case as that's not a common
> thing
> > to have (the level of complexity) in the OSS project itself. Although, I
> > also acknowledge the need of a lot of refactoring and changes in the
> based
> > design as we have been discussing here.
>
> I appreciate your efforts on initiating PIP-310 and getting the ball
> rolling. Since you are operating Pulsar at scale, your contributions
> and feedback are very valuable in improving Pulsar's capacity manage.
> I happen to have a different view of how a custom rate limiter
> implemented with the possible pluggable interface could help with
> overall capacity management in Pulsar.
> We need to go beyond PIP-310 in solving multi-tenant capacity
> management/SOP/SLA challenges with Pulsar. The resource groups work
> started with PIP-82 is a good starting point, but there's a need to
> improve and revisit the design to be able to meet the competition, the
> closed source Confluent Kora.
>

As mentioned in the comment above, resource groups shouldn't be discussed
here. They are out of scope of a discussion involving partition level rate
limiting, and I did not intend to bring them into the discussion.
I would like to see a comment on my proposal about the end state of the
Pulsar rate limiter. I believe that's the true "meeting in the middle".


>
> Thanks for providing such detailed and useful feedback! I think that
> this has already been a valuable interaction.
>

Thank you for painstakingly replying to these long emails. This probably
is the longest thread on the Pulsar ML in recent times :)


>
> The improvements happen one step at a time. We can make things happen
> when we work together. I'm looking forward to that!
>

Would love to hear a plan from your point of view and what you envision.

>
> -Lari
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

replies inline.

> > I'd pick the first bucket for handling the 10MB rate.
> > The capacity of the first bucket would be 15MB * 120=1800MB. The fill
> > would happen in special way. I'm not sure if Bucket4J has this at all.
> > So describing the way of adding tokens to the bucket: the tokens in
> > the bucket would remain the same when the rate is <10MB. As many
> >
>
> How is this special behavior (tokens in bucket remaining the same when rate
> is <10MB) achieved? I would assume that to even figure out that the rate is
> less than 10MB, there is some counter going around?

It's possible to be creative in implementing the token bucket algorithm.
New tokens will be added at the configured rate. When the actual
traffic rate is less than the token bucket's fill rate,
there will be left over tokens. One way to implement this is to
calculate the amount of new tokens
before using the tokens. The immediately used tokens are subtracted
from the new tokens and the left over tokens are added to the
separate "filling bucket". I explained earlier that this filling
bucket would be poured into the actual bucket every 10 minutes in the
example scenario.
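
As a rough illustration (not existing Pulsar code; all of the names below
are made up), the deferred fill could look something like this in Java:

class BurstableTokenBucket {
    private final long ratePerSecond;   // e.g. 10MB/s configured rate
    private final long capacity;        // e.g. 1800MB burst capacity
    private long tokens;                // tokens usable right now
    private long fillingBucket;         // earned tokens, not yet usable
    private long lastNanos = System.nanoTime();
    private long lastPourNanos = lastNanos;

    BurstableTokenBucket(long ratePerSecond, long capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
    }

    // Consume tokens for a message; returns false when throttling is needed.
    synchronized boolean tryConsume(long messageBytes) {
        long now = System.nanoTime();
        long newTokens = (now - lastNanos) * ratePerSecond / 1_000_000_000L;
        lastNanos = now;
        // Add back only as many tokens as the traffic consumes right now,
        // so the bucket level stays flat while the rate is below the limit.
        tokens += Math.min(newTokens, messageBytes);
        // The surplus (configured rate minus actual rate) is deferred.
        fillingBucket += Math.max(0, newTokens - messageBytes);
        if (now - lastPourNanos >= 600_000_000_000L) { // pour every 10 min
            tokens = Math.min(capacity, tokens + fillingBucket);
            fillingBucket = 0;
            lastPourNanos = now;
        }
        tokens -= messageBytes;
        return tokens >= 0;
    }
}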

>
>
> > tokens would be added to the bucket as are consumed by the actual
> > traffic. The left over tokens 10MB - actual rate would go to a
> > separate filling bucket that gets poured into the actual bucket every
> > 10 minutes.
> > This first bucket with this separate "filling bucket" would handle the
> > bursting up to 1800MB.
> >
>
> But this isn't the requirement? Let's assume that the actual traffic has
> been 5MB for a while and this 1800MB capacity bucket is all filled up now..
> What's the real use here for that at all?

How would the rate limiter know if 5MB traffic is degraded traffic
which would need to be allowed to burst?

> I think this approach of thinking about rate limiter - "earning the right
> to burst by letting tokens remain into the bucket, (by doing lower than
> 10MB for a while)" doesn't not fit well in a messaging use case in real
> world, or theoretic.

It might feel like it doesn't fit well, but this is how most rate
limiters work. The model works very well in practice.
Without "earning the right to burst", there would have to be some
other way to detect whether there's a need to burst.
The need to burst could be detected by calculating the time that the
message to be sent has been in queues until it's about to be sent.
In other words, from the end-to-end latency.

> For a 10MB topic, if the actual produce has been , say, 5MB for a long
> while, this shouldn't give the right to that topic to burst to 15MB for as
> much as tokens are present.. This is purely due to the fact that this will
> then start stressing the network and bookie disks.

Well, there's always an option to not configure bursting this way or
limiting the maximum rate of bursting, like it is possible in a dual
token bucket implementation.

> Imagine a 100 of such topics going around with similar configuration of
> fixed+burst limits and were doing way lower than the fixed rate for the
> past couple of hours. Now that they've earned enough tokens, if they all
> start bursting, this will bring down the system, which is probably not
> capable of supporting simultaneous peaks of all possible topics at all.

This is the challenge of capacity management and end-to-end flow
control and backpressure.
With proper system wide capacity management and end-to-end back
pressure, the system won't collapse.

As reference, let's take a look at the Confluent Kora paper [1]:
In 5.2.1 Back Pressure and Auto-Tuning:
"Backpressure is achieved via auto-tuning tenant quotas on the broker
such that the combined tenant usage remains below the broker-wide
limit. The tenant quotas are auto-tuned proportionally to their total
quota allocation on the broker. This mechanism ensures fair sharing of
resources among tenants during temporary overload and re-uses the
quota enforcement mechanism for backpressure."
"The broker-wide limits are generally defined by benchmarking brokers
across clouds. The CPU-related limit is unique because there is no
easy way to measure and attribute CPU usage to a tenant. Instead, the
quota is defined as the clock time the broker spends processing
requests and connections, and the safe limit is variable. So to
protect CPU, request backpressure is triggered when request queues
reach a certain threshold."
In 5.2.2 Dynamic Quota Management:
"A straightforward method for distributing tenant-level quotas among
the brokers hosting the tenant is to statically divide the quota
evenly across the brokers. This static approach, which was deployed
initially, worked reasonably well on lower subscribed clusters."...
"Kora addresses this issue by using a dynamic quota mechanism that
adjusts bandwidth distribution based on a tenant’s bandwidth
consumption. This is achieved through the use of a shared quota
service to manage quota distribution, a design similar to that used by
other storage systems "

1 - https://www.vldb.org/pvldb/vol16/p3822-povzner.pdf

Regarding system wide capacity management, the concept of dynamic
quota management is interesting. "The tenant quotas are auto-tuned
proportionally to their total quota allocation on the broker. This
mechanism ensures fair sharing of resources among tenants during
temporary overload and re-uses the quota enforcement mechanism for
backpressure."

I referred to useful resources about capacity management in one of my
previous emails. For example, the "Amazon DynamoDB: A Scalable, Predictably
Performant, and Fully Managed NoSQL Database Service" paper and
presentation video [2] are really useful. The "Global Admission Control"
token bucket
based implementation and bursting solution is explained starting at
10:25 in the YouTube video [3].

2 - https://www.usenix.org/conference/atc22/presentation/elhemali
3 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=625s

In Pulsar, we have "PIP 82: Tenant and namespace level rate limiting"
[4] which introduced the "resource group" concept. There is the
resourcegroup resource in the Admin REST API [5]. There's also a
resource called "resource-quota" in the Admin REST API [6], but that
seems to be abandoned.

4 - https://github.com/apache/pulsar/wiki/PIP-82:-Tenant-and-namespace-level-rate-limiting
5 - https://pulsar.apache.org/admin-rest-api/?version=3.1.1#tag/resourcegroups
6 - https://pulsar.apache.org/admin-rest-api/?version=3.1.1#tag/resource-quotas


> Now of course we can utilize a broker level fixed rate limiter to not allow
> the overall throughput of the system to go beyond a number, but at that
> point - all the earning semantic goes for a toss anyway since the behavior
> would be unknown wrt which topics are now going through with bursting and
> which are being blocked due to the broker level fixed rate limiting.
>
> As such, letting topics loose would not sit well with any sort of SLA
> guarantees to the end user.
>
> Moreover, contrary to the earning tokens logic, in reality a topic _should_
> be allowed to burst upto the SOP/SLA as soon as produce starts in the
> topic. It shouldn't _have_ to wait for tokens to fill up as it does
> below-fixed-rate for a while before it is allowed to burst. This is because
> there is no real benefit or reason to not let the topic do such as the
> hardware is already present and the topic is already provisioned
> (partitions, broker spread) accordingly, assuming the burst.
>
> In an algorithmic/academic/literature setting, token bucket sounds really
> promising.. but a platform with SLA to users would not run like that.

AWS DynamoDB uses token bucket based algorithms. Please check the
video I referred in one of the previous comments.
at 10:26 [1] "The simplest way to do this workload isolation is the
token bucket"
at 13:05 [2] "partitions will get throttled because they're constantly
running out of tokens from their bucket even though they had burst
capacity because after the burst burst capacity is over you had to
still take time to fill capacity in the token bucket while other
partitions actually had capacity in their token buckets so..."
at 13:53 [3] "GAC (Global Admission Control) is a service that builds
on the same token bucket mechanism so when a request router receives a
request it basically requests the GAC to kind of get some tokens and
once request router uses the local tokens to make admission control
decisions if it runs out of tokens it can request for more tokens"

1 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=10m26s
2 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=13m05s
3 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=13m54s

Token bucket based solutions are widely used in the industry. I don't
see a reason why we should be choosing some other conceptual model as
the basis.
You can always explain a token bucket algorithm in a different way.
Someone might say that it's a sliding window based algorithm. The main
benefit of token bucket is the conceptual model which is very helpful
when a larger group of people, such as an opensource community, works
on a common solution. Concepts are the starting point for
abstractions. That's why it matters a lot.

>
>
>
> > In the current rate limiters in Pulsar, the implementation is not
> > optimized to how Pulsar uses rate limiting. There's no need to use a
> > scheduler for adding "permits" as it's called in the current rate
> > limiter. The new tokens to add can be calculated based on the elapsed
> > time. Resuming from the blocking state (auto read disabled -> enabled)
> > requires a scheduler. The duration of the pause should be calculated
> >
>
> The precise one already does this. There is no scheduler for adding the
> tokens, only scheduler to enable back the auto-read on netty channel.

Please check the source code again [1]. The scheduler is used to do
something that is equivalent of "adding the tokens".
The "createTask" method schedules the task to call the "renew" method.

1 - https://github.com/apache/pulsar/blob/3c067ce28025e116146977118312a1471ba284f5/pulsar-common/src/main/java/org/apache/pulsar/common/util/RateLimiter.java#L260-L278

> It looks like your inclination is to simply remove the default "poller"
> implementation and just let precise one in the code base?

No it's not. It's about replacing the solution with an implementation
that is using the conceptual model of the token bucket. This will make
the implementation more maintainable and we will be able to clean up
the abstractions. Concepts and abstractions are the key of interface
design.
In the current state, the rate limiter is very hard to reason about.
There's no reason to keep it in the code base. I'll explain more about
that later.
Besides the maintainability, another clear problem in the current rate
limiter is the heavy usage of synchronization and the already
mentioned unnecessary use of the scheduler.

> This part is a bit unclear to me. When are you suggesting to enable back
> the auto-read on the channel? Ideally, it should open back up the next
> second. Time to the next second could vary from 1ms to 999ms.. I do not
> think that starting an on demand scheduler of delta ms would be accurate
> enough to achieve this.. A newly created scheduled task is anyways not this
> accurate.. With Java virtual threads coming up, I believe having one, or N,
> constantly running scheduled tasks of 1 second interval would not be heavy
> at all.
> In fact, even in Java 8, as per my benchmarks, it doesn't seem to take up
> much frames at all.

Optimally the pause would be such a duration that there's enough
tokens to send the next message.
One possible way to estimate the size of the next message is to use
the average size or a rolling average size as the basis of the
calculation.
It might be as short as the minimum time to pause for the timer.
Instead of using a generic scheduler for pausing, Netty's scheduler
should be used.
It is very efficient for high volumes. If there's a bottleneck, it's
possible to have some minimum duration to pause.
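
A rough sketch of how the pause could be computed and scheduled (this is a
hypothetical helper, not the current Pulsar code; only the Netty calls,
config().setAutoRead(...) and eventLoop().schedule(...), are real APIs):

import io.netty.channel.Channel;
import java.util.concurrent.TimeUnit;

final class PublishThrottle {
    // Called after tokens have been consumed; tokens may be negative here.
    static void throttleIfNeeded(Channel channel, long tokens,
                                 long ratePerSecond, long avgMessageSize) {
        if (tokens >= 0) {
            return; // still within the rate, nothing to do
        }
        // Pause roughly until the deficit plus one average message is earned.
        long deficitBytes = -tokens + avgMessageSize;
        long pauseNanos = deficitBytes * 1_000_000_000L / ratePerSecond;
        channel.config().setAutoRead(false);  // stop reading produce requests
        channel.eventLoop().schedule(
                () -> channel.config().setAutoRead(true),
                Math.max(pauseNanos, 1_000_000L), // minimum pause, e.g. 1ms
                TimeUnit.NANOSECONDS);
    }
}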

> "10 minutes" was just an example. Part of the problem is explained above..
> If a topic is allowed to just burst forever (irrespective of the fact that
> it has earned the tokens or not) - it will start competing with other
> topics that are also bursting. This is a problem from a capacity planning
> point of view. Generally this is how capacity planning works
>
>    1. We will first do redline testing to determine how much MBps a set of
>    single broker and an ensemble of bookies can achieve given a hardware
>    configuration of broker and bookies.
>    2. Assuming linear horizontal scaling, we then setup enough brokers and
>    bookies in order to sustain a required throughput in the cluster. Suppose,
>    for this example, the cluster can now do 300MBps
>
> Now, assuming 50% bursting is allowed in a topic - does that mean the sum
> of fixed rate of all topics in the cluster should not exceed 200MBps? - If
> yes - then there is literally no difference in this situation vs just
> having a 50% elevated fixed rate of all the topic with 0 bursting support.
> To actually take advantage of this bursting feature, the capacity reserved
> to handle topic bursts would not be equal to the allowed bursting, but much
> less. For example, in this cluster that can do 300MBps, we can let the sum
> of fixed rate of all topics go up to 270MBps and keep just 10% of the
> hardware (30MBps) for bursting needs. This way, we are able to sustain more
> topics in the same cluster, with support of burst. This would only work if
> all topics do not burst together at the same time - and that's where the
> need to restrict duration and frequency of burst comes into play.
>
> Ofcourse, there can be other ways to model this problem and requirement..
> and additional checks in place, but I believe this approach is the most
> deterministic approach with a good balance between efficient use of
> hardware and not being completely stringent wrt throughput on the topics.

I commented on these capacity management aspects in one of my previous
emails as well as in an earlier comment in this email.
Based on what you are describing as the scenario, I think you are
looking for system wide capacity management.
The Pulsar Resource Groups (PIP-82) [1] already has solutions in this
area and could be developed further.
AWS DynamoDB's Global Admission Control is a good example from the
industry of how this problem was solved there [2].
It might be worth improving and updating the PIP-82 design based on
the learnings from Confluent Kora paper and AWS DynamoDB Admission
control paper and presentation.
PIP-82 focuses a lot on rate limiting. However, there's also a need
for something like what the Confluent Kora paper explains in "5.2.2 Dynamic
Quota Management": "The tenant quotas are auto-tuned
proportionally to their total quota allocation on the broker. This
mechanism ensures fair sharing of resources among tenants during
temporary overload and re-uses the quota enforcement mechanism for
backpressure."

1 - https://github.com/apache/pulsar/wiki/PIP-82:-Tenant-and-namespace-level-rate-limiting
2 - https://www.youtube.com/watch?v=9AkgiEJ_dA4&t=13m54s

> > The TCP/IP channel wouldn't be put on pause in that solution as the
> > primary way to handle flow control.
> > Yes, there are alternatives. It's the permit-based flow control. The
> > server would send out manageable amount of permits at a time. There
> > are challenges in making this work efficiently with low amounts of
> > memory. That's one of the tradeoffs.
> >
>
> Moreover, isn't this based on the assumption that the clients respect the
> tokens and actually behave in a good manner?
> If there is an implementation of a client that is not respecting the tokens
> (or, in this case, the lack of remaining tokens with the client), (very
> much possible as it can be an older java client itself) - then wouldn't the
> only way to actually throttle the client is to pause the TCP/IP channel?

Yes, it requires well-behaving clients. With multiplexing, pausing the
TCP/IP channel cannot be the primary way to handle flow control, as we
have discussed. Pausing reads on the TCP/IP channel remains a way to
apply backpressure when it's really necessary.

> > I have described the benefits in some former comments. The code is not
> > maintainable at the moment other than by the people that have worked
> > with the code. Another point is getting rid of the separate "precise"
> > option. That leaks implementation details all the way to Pulsar users
> > that don't need to know what a "precise rate limiter" is. The default
> > Pulsar core rate limiter should do what it promises to do. That would
> >
>
> Is this even achievable? i.e. to completely remove the poller one and take
> away the knowledge about the precise one from users? This basically would
> then be a breaking change and would probably wait for pulsar 4.0.0 release?

It's not a breaking change to replace the existing rate limiters.
The rate limiter implementation is an internal implementation detail.
It would only become a breaking change if the feature "contract" were
changed drastically.
In this case, it's possible to make the "precise" option actually precise
and perhaps simulate the way the default rate limiter works.
I think that besides CPU consumption, the problem with the default
rate limiter is simply that it contains a bug. The bug is in the
tryAcquire method [1]. The default handling should simply match what
there is for isDispatchOrPrecisePublishRateLimiter.
After this, the only difference between precise and default is in the
renew method where there's this calculation:
acquiredPermits = isDispatchOrPrecisePublishRateLimiter ? Math.max(0,
acquiredPermits - permits) : 0;
Again, setting it to 0 doesn't make much sense to me. We could possibly
replace this with a solution that allows some bursting, since that's
what I think happens with the default implementation. It's
possible to check in some test case or simulated scenario that there
are no breaking-change aspects when the rate limiter gets replaced.
The new implementation can be compared to the old one.

1 - https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-common/src/main/java/org/apache/pulsar/common/util/RateLimiter.java#L177-L204
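
To paraphrase that calculation (a simplified sketch, not the actual Pulsar
class), the difference between the two branches is whether the overshoot is
carried over or forgotten:

final class RenewBehavior {
    // "precise" stands for isDispatchOrPrecisePublishRateLimiter.
    static long renew(long acquiredPermits, long permitsPerPeriod, boolean precise) {
        if (precise) {
            // Carry the overshoot into the next period, so the long-term
            // average stays at the configured rate.
            return Math.max(0, acquiredPermits - permitsPerPeriod);
        }
        // Default: forget the overshoot, so a producer that raced ahead of
        // the scheduler starts the next period with a clean slate.
        return 0;
    }
}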

>
>
> > be the starting point to match the existing feature set. After that
> > it's possible to make it the dual token bucket internally so that it's
> > possible to set a separate max rate since when the bucket is larger,
> > the bursts don't have a limit in a single token bucket (where the
> > bucket has a large capacity, therefore allows bursts when traffic
> > temporarily drops below configured rate). Yes, the precise rate
> > limiter is something like having a bucket capacity for 1 second and
> > that's why it doesn't allow bursts.
> >
>
> I believe here I (my organization) would have a difference in opinion due
> to the difference in goal and priority. I guess it's kind of a clash
> between your vision of the rate limiter in pulsar vs our goal right now and
> I am not sure how to proceed at this point.
> What I can promise is that I (my org) will take this to completion with
> respect to making the internal rate limiter supporting our use cases, but
> the way you envision it might not sit well with what we want and when we
> want it.

> I know and almost all OSS projects around the messaging world have very
> very basic rate limiters. That is exactly why I am originally requesting a
> pluggable rate limiter to support our use case as that's not a common thing
> to have (the level of complexity) in the OSS project itself. Although, I
> also acknowledge the need of a lot of refactoring and changes in the based
> design as we have been discussing here.

I appreciate your efforts on initiating PIP-310 and getting the ball
rolling. Since you are operating Pulsar at scale, your contributions
and feedback are very valuable in improving Pulsar's capacity management.
I happen to have a different view of how a custom rate limiter
implemented with the possible pluggable interface could help with
overall capacity management in Pulsar.
We need to go beyond PIP-310 in solving multi-tenant capacity
management/SOP/SLA challenges with Pulsar. The resource groups work
started with PIP-82 is a good starting point, but there's a need to
improve and revisit the design to be able to meet the competition, the
closed-source Confluent Kora.

Thanks for providing such detailed and useful feedback! I think that
this has already been a valuable interaction.

The improvements happen one step at a time. We can make things happen
when we work together. I'm looking forward to that!

-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari, replies inline

On Thu, Nov 9, 2023 at 6:50 AM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> replies inline.
>
> On Thu, 9 Nov 2023 at 00:29, Girish Sharma <sc...@gmail.com>
> wrote:
> > While dual-rate dual token bucket looks promising, there is still some
> > challenge with respect to allowing a certain peak burst for/up to a
> bigger
> > duration. I am explaining it below:
>
> > Assume a 10MBps topic. Bursting support of 1.5x upto 2 minutes, once
> every
> > 10 minute interval.
>
> It's possible to have many ways to model a dual token buckets.
> When there are tokens in the bucket, they are consumed as fast as
> possible. This is why there is a need for the second token bucket
> which is used to rate limit the traffic to the absolute maximum rate.
> Technically the second bucket rate limits the average rate for a short
> time window.
>
> I'd pick the first bucket for handling the 10MB rate.
> The capacity of the first bucket would be 15MB * 120=1800MB. The fill
> would happen in special way. I'm not sure if Bucket4J has this at all.
> So describing the way of adding tokens to the bucket: the tokens in
> the bucket would remain the same when the rate is <10MB. As many
>

How is this special behavior (tokens in bucket remaining the same when rate
is <10MB) achieved? I would assume that to even figure out that the rate is
less than 10MB, there is some counter going around?


> tokens would be added to the bucket as are consumed by the actual
> traffic. The left over tokens 10MB - actual rate would go to a
> separate filling bucket that gets poured into the actual bucket every
> 10 minutes.
> This first bucket with this separate "filling bucket" would handle the
> bursting up to 1800MB.
>

But this isn't the requirement? Let's assume that the actual traffic has
been 5MB for a while and this 1800MB capacity bucket is all filled up now.
What's the real use for that at all?


> The second bucket would solely enforce the 1.5x limit of 15MB rate
> with a small capacity bucket which enforces the average rate for a
> short time window.
> There's one nuance here. The bursting support will only allow bursting
> if the average rate has been lower than 10MBps for the tokens to use
> for the bursting to be usable.
> It would be possible that for example 50% of the tokens would be
> immediately available and 50% of the tokens are made available in the
> "filling bucket" that gets poured into the actual bucket every 10
> minutes. Without having some way to earn the burst, I don't think that
> there's a reasonable way to make things usable. The 10MB limit
> wouldn't have an actual meaning unless that is used to "earn" the
> tokens to be used for the burst.
>
>
I think this approach of thinking about the rate limiter - "earning the
right to burst by letting tokens remain in the bucket (by doing lower than
10MB for a while)" - doesn't fit well in a messaging use case, in the real
world or in theory.
For a 10MB topic, if the actual produce rate has been, say, 5MB for a long
while, this shouldn't give that topic the right to burst to 15MB for as
long as tokens are present. This is purely due to the fact that this will
then start stressing the network and bookie disks.
Imagine a 100 such topics with a similar configuration of fixed+burst
limits that were doing way lower than the fixed rate for the past couple of
hours. Now that they've earned enough tokens, if they all start bursting,
this will bring down the system, which is probably not capable of
supporting simultaneous peaks of all possible topics at once.

Now of course we can utilize a broker level fixed rate limiter to not allow
the overall throughput of the system to go beyond a number, but at that
point - all the earning semantics go for a toss anyway, since the behavior
would be unknown wrt which topics are now going through with bursting and
which are being blocked due to the broker level fixed rate limiting.

As such, letting topics loose would not sit well with any sort of SLA
guarantees to the end user.

Moreover, contrary to the earning-tokens logic, in reality a topic _should_
be allowed to burst up to the SOP/SLA as soon as produce starts in the
topic. It shouldn't _have_ to wait for tokens to fill up by staying below
the fixed rate for a while before it is allowed to burst. This is because
there is no real benefit or reason to not let the topic do so, as the
hardware is already present and the topic is already provisioned
(partitions, broker spread) accordingly, assuming the burst.

In an algorithmic/academic/literature setting, the token bucket sounds
really promising, but a platform with an SLA to its users would not run
like that.



> In the current rate limiters in Pulsar, the implementation is not
> optimized to how Pulsar uses rate limiting. There's no need to use a
> scheduler for adding "permits" as it's called in the current rate
> limiter. The new tokens to add can be calculated based on the elapsed
> time. Resuming from the blocking state (auto read disabled -> enabled)
> requires a scheduler. The duration of the pause should be calculated
>

The precise one already does this. There is no scheduler for adding the
tokens, only a scheduler to re-enable auto-read on the netty channel.
It looks like your inclination is to simply remove the default "poller"
implementation and just leave the precise one in the code base?



> based on the rate and the average message size. Because of the nature
> of asynchronous rate limiting in Pulsar topic publish throttling, the
> tokens can go to a negative value and it also does. The calculation of
> the pause could also take this into count. The rate limiting will be
> very accurate and efficient in this way since the scheduler will only
>

This part is a bit unclear to me. When are you suggesting to re-enable
auto-read on the channel? Ideally, it should open back up at the next
second. Time to the next second could vary from 1ms to 999ms, and I do not
think that starting an on-demand scheduler of delta ms would be accurate
enough to achieve this. A newly created scheduled task is anyway not this
accurate. With Java virtual threads coming up, I believe having one, or N,
constantly running scheduled tasks at a 1 second interval would not be
heavy at all.
In fact, even in Java 8, as per my benchmarks, it doesn't seem to take up
many frames at all.

> be needed when the token bucket runs out of tokens and there's really
> a need to throttle. This is the change that I would do to the current
> implementation in an experiment and see how things behave with the
> revisited solution. Eliminating the precise rate limiter and having
> just a single rate limiter would be part of this.
> I think that the code base has multiple usages of the rate limiter.
> The dispatching rate limiting might require some variation. IIRC, Rate
> limiting is also used for some other reasons in the code base in a
> blocking manner. For example, unloading of bundles is rate limited.
> Working on the code base will reveal this.
>
>
Yes, I believe we should only touch publish paths right now.

> Yes, it might require some rounds of experimentation. Although in
> general, I think it's a fairly straightforward problem to solve as
> long as the requirements could be adapted to some small details that
> make sense for bursting in the context of rate limiting. The detail is
> that the long time average rate shouldn't go over the configured rate
> even with the bursts. That's why the tokens usable for the burst
> should be "earned". I'm not sure if it even is necessary to enforce
> that a burst can happen once in every 10 minutes. Why does it matter
> if the average rate limit is controlled?
>

"10 minutes" was just an example. Part of the problem is explained above..
If a topic is allowed to just burst forever (irrespective of the fact that
it has earned the tokens or not) - it will start competing with other
topics that are also bursting. This is a problem from a capacity planning
point of view. Generally this is how capacity planning works

   1. We will first do redline testing to determine how much MBps a set of
   single broker and an ensemble of bookies can achieve given a hardware
   configuration of broker and bookies.
   2. Assuming linear horizontal scaling, we then setup enough brokers and
   bookies in order to sustain a required throughput in the cluster. Suppose,
   for this example, the cluster can now do 300MBps

Now, assuming 50% bursting is allowed in a topic - does that mean the sum
of the fixed rates of all topics in the cluster should not exceed 200MBps?
If yes, then there is literally no difference in this situation vs just
having a 50% elevated fixed rate on all the topics with no bursting support.
To actually take advantage of this bursting feature, the capacity reserved
to handle topic bursts would not be equal to the allowed bursting, but much
less. For example, in this cluster that can do 300MBps, we can let the sum
of the fixed rates of all topics go up to 270MBps and keep just 10% of the
hardware (30MBps) for bursting needs. This way, we are able to sustain more
topics in the same cluster, with support for bursts. This would only work if
all topics do not burst together at the same time - and that's where the
need to restrict the duration and frequency of bursts comes into play.

Of course, there can be other ways to model this problem and requirement,
and additional checks in place, but I believe this approach is the most
deterministic one, with a good balance between efficient use of hardware
and not being completely stringent wrt throughput on the topics.
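
To make the arithmetic concrete (the per-topic numbers below are only
illustrative assumptions, not figures from the PIP): with a 10MBps fixed
rate per topic and 50% bursting allowed,

  27 topics x 10MBps fixed     = 270MBps committed
  300MBps - 270MBps            =  30MBps burst headroom
  30MBps / 5MBps burst per topic =  6 topics bursting at the same time

so the burst headroom is oversubscribed roughly 27:6, which is exactly why
the duration and frequency of bursts need to be bounded.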


> > I think there is a need for a new protocol message from server to client
> > indicating rate-limiting (say, a 429 response, with an optional
> > retry-after). That should handle a lot of these things. For instance, we
> > can then build configurable logic in the client to stop producing all
> > together after a sufficient number of 429 received from the server. There
> > can also be a "BLOCKED" protocol that the server could indicate to the
> > client.
>
> Yes. I think Kafka has something like that. It's mentioned in the
> Kafka Quotas [1] doc.
> I believe that the needed explicit producer flow control could cover
> this. In Pulsar, we need to add explicit producer flow control to
> address the connection multiplexing problem.
>
> 1 - https://docs.confluent.io/kafka/design/quotas.html


Yup, the way Kafka sends back a delay response to the clients.

> When you say - adding a permit-based flow control - even if this is
> > implemented, multiplexing is still an issue as the tcp/ip channel itself
> is
> > put on pause at the netty level. Is there any other way of rate limiting
> > and rejecting packets from a channel selectively so as to contain the
> rate
> > limiting effect only to the specific partition out of all the partitions
> > being shared in the channel?
>
> The TCP/IP channel wouldn't be put on pause in that solution as the
> primary way to handle flow control.
> Yes, there are alternatives. It's the permit-based flow control. The
> server would send out manageable amount of permits at a time. There
> are challenges in making this work efficiently with low amounts of
> memory. That's one of the tradeoffs.
>

Moreover, isn't this based on the assumption that the clients respect the
tokens and actually behave in a good manner?
If there is an implementation of a client that does not respect the tokens
(or, in this case, the lack of remaining tokens on the client side) - very
much possible, as it can be an older Java client itself - then wouldn't the
only way to actually throttle the client be to pause the TCP/IP channel?


> > I will do so. Does this need a PIP? To reiterate, I will be opening a GH
> > issue on the lines of "don't share connection to the broker across
> producer
> > objects"
>
> I think it could start with an issue that describes the problem. The
> PIP is eventually needed to add the new option to the producer builder
> and consumer builder interfaces.
>
>
I will keep a note of this and take this up once I am free from this
current PIP, whichever direction it takes.



>
> > Are you suggesting that we first go ahead and convert the rate limiter in
> > pulsar to a simple, single-token based approach?
> > I personally do not see any benefit in this apart from code refactoring.
> > The precise rate limiter is basically doing that already- all be it
> > refilling only every 1 second, rather than distributing the tokens across
> > that second.
>
> I have described the benefits in some former comments. The code is not
> maintainable at the moment other than by the people that have worked
> with the code. Another point is getting rid of the separate "precise"
> option. That leaks implementation details all the way to Pulsar users
> that don't need to know what a "precise rate limiter" is. The default
> Pulsar core rate limiter should do what it promises to do. That would
>

Is this even achievable? i.e., to completely remove the poller one and take
away the knowledge about the precise one from users? This would basically
then be a breaking change and would probably wait for the Pulsar 4.0.0
release?


> be the starting point to match the existing feature set. After that
> it's possible to make it the dual token bucket internally so that it's
> possible to set a separate max rate since when the bucket is larger,
> the bursts don't have a limit in a single token bucket (where the
> bucket has a large capacity, therefore allows bursts when traffic
> temporarily drops below configured rate). Yes, the precise rate
> limiter is something like having a bucket capacity for 1 second and
> that's why it doesn't allow bursts.
>

I believe here I (my organization) would have a difference in opinion due
to the difference in goals and priorities. I guess it's kind of a clash
between your vision of the rate limiter in Pulsar vs our goal right now,
and I am not sure how to proceed at this point.
What I can promise is that I (my org) will take this to completion with
respect to making the internal rate limiter support our use cases, but
the way you envision it might not sit well with what we want and when we
want it.



>
> > > algorithm is. That could be a starting point until we start covering
> > > the advanced cases which require a dual token bucket algorithm.
> > > If someone has the bandwidth, it would be fine to start experimenting
> > > in this area with code to learn more.
> > >
> > >
> > I think this is a big assumption. Since this is a critical use case in my
> > organisation, I will have to contribute everything here myself. Now I do
> > understand that reviews can take time in the OSS world, but this can't be
> > left at "simple token based approach and then letting anyone pick and
> > explore to extend/enhance it to dual token approach". This probably is
> one
> > of the main reasons why pluggability is important here :)
>
> If you check many other projects, most of them use token bucket based
> approaches. That's why I'm confident that it is a feasible solution.
>

I know, and almost all OSS projects in the messaging world have very
basic rate limiters. That is exactly why I originally requested a
pluggable rate limiter to support our use case, as that level of
complexity is not a common thing to have in the OSS project itself.
Although, I also acknowledge the need for a lot of refactoring and changes
in the base design, as we have been discussing here.

Generally, this is how I see the end state in the OSS Pulsar version: the
code has been refactored, there are no implementation details leaked to the
user, and the rate limiter is efficient and based on a single token bucket
model without any bursting. But it also provides a pluggable rate limiter -
and that is where individual complex use cases can be implemented and
maintained internally by the users/organizations, as sketched below.
Of course, they can choose to open source them, but I don't think pulling
them back into Pulsar is the right approach with respect to future
contributions and ease of maintaining the module.
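
For illustration, the kind of pluggable hook I have in mind is along these
lines (just a sketch - the interface name and method signatures below are
made up and are not the actual PIP-310 proposal):

public interface PublishRateLimiterPlugin {
    // Called when the topic-level limits are created or updated.
    void update(double publishRateInMsgs, double publishRateInBytes);

    // Called on the publish path; returning false means the produce request
    // should be throttled (for example by pausing channel auto-read).
    boolean tryAcquire(int numMessages, long numBytes);

    // Releases resources (schedulers, metrics) when the topic unloads.
    void close();
}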

-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

replies inline.

On Thu, 9 Nov 2023 at 00:29, Girish Sharma <sc...@gmail.com> wrote:
> While dual-rate dual token bucket looks promising, there is still some
> challenge with respect to allowing a certain peak burst for/up to a bigger
> duration. I am explaining it below:

> Assume a 10MBps topic. Bursting support of 1.5x upto 2 minutes, once every
> 10 minute interval.

It's possible to model a dual token bucket setup in many ways.
When there are tokens in the bucket, they are consumed as fast as
possible. This is why there is a need for the second token bucket
which is used to rate limit the traffic to the absolute maximum rate.
Technically the second bucket rate limits the average rate for a short
time window.

I'd pick the first bucket for handling the 10MB rate.
The capacity of the first bucket would be 15MB * 120 = 1800MB. The fill
would happen in a special way. I'm not sure if Bucket4J has this at all.
So, describing the way of adding tokens to the bucket: the tokens in
the bucket would remain the same when the rate is <10MB. As many
tokens would be added to the bucket as are consumed by the actual
traffic. The left over tokens (10MB minus the actual rate) would go to a
separate filling bucket that gets poured into the actual bucket every
10 minutes.
This first bucket with this separate "filling bucket" would handle the
bursting up to 1800MB.
The second bucket would solely enforce the 1.5x limit of 15MB rate
with a small capacity bucket which enforces the average rate for a
short time window.
There's one nuance here. The bursting support will only allow bursting
if the average rate has been lower than 10MBps, so that there are unused
tokens available for the burst.
It would be possible that for example 50% of the tokens would be
immediately available and 50% of the tokens are made available in the
"filling bucket" that gets poured into the actual bucket every 10
minutes. Without having some way to earn the burst, I don't think that
there's a reasonable way to make things usable. The 10MB limit
wouldn't have an actual meaning unless that is used to "earn" the
tokens to be used for the burst.
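
As a rough sketch of the dual bucket check (illustrative Java only, not an
existing Pulsar or Bucket4J API; the deferred "filling bucket" refinement
described above is left out for brevity):

class DualTokenBucket {
    static final class Bucket {
        final long ratePerSec, capacity;
        long tokens;
        long lastNanos = System.nanoTime();
        Bucket(long ratePerSec, long capacity, long initialTokens) {
            this.ratePerSec = ratePerSec;
            this.capacity = capacity;
            this.tokens = initialTokens;
        }
        boolean tryConsume(long bytes) {
            long now = System.nanoTime();
            tokens = Math.min(capacity,
                    tokens + (now - lastNanos) * ratePerSec / 1_000_000_000L);
            lastNanos = now;
            tokens -= bytes;    // Pulsar-style: consume first,
            return tokens >= 0; // throttle only after going negative
        }
    }

    // Bucket A: 10MB/s fill, 1800MB capacity, starts empty (burst is "earned").
    private final Bucket burst = new Bucket(10_000_000, 1_800_000_000L, 0);
    // Bucket B: 15MB/s fill with roughly one second of capacity (absolute max).
    private final Bucket maxRate = new Bucket(15_000_000, 15_000_000, 15_000_000);

    // A publish stays within limits only when both buckets agree.
    boolean tryConsume(long messageBytes) {
        boolean withinBurst = burst.tryConsume(messageBytes);
        boolean withinMaxRate = maxRate.tryConsume(messageBytes);
        return withinBurst && withinMaxRate;
    }
}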

One other detail of topic publish throttling in Pulsar is that the
actual throttling happens only after the limit has already been exceeded.
This is because Pulsar's network handling uses Netty, where you cannot
block. In token bucket terms, the tokens are always consumed first, and
only after that is there a chance to pause message publishing.
In code, you can find this at
https://github.com/apache/pulsar/blob/c0eec1e46edeb46c888fa28f27b199ea7e7a1574/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1794
and by digging down from there.

In the current rate limiters in Pulsar, the implementation is not
optimized for how Pulsar uses rate limiting. There's no need to use a
scheduler for adding "permits", as they are called in the current rate
limiter. The new tokens to add can be calculated based on the elapsed
time. Resuming from the blocking state (auto-read disabled -> enabled)
requires a scheduler. The duration of the pause should be calculated
based on the rate and the average message size. Because of the
asynchronous nature of rate limiting in Pulsar topic publish throttling,
the token count can go negative, and in practice it does. The calculation
of the pause could also take this into account. Rate limiting will be
very accurate and efficient this way, since the scheduler is only needed
when the token bucket runs out of tokens and there's really a need to
throttle. This is the change I would make to the current implementation
as an experiment, to see how things behave with the revisited solution.
Eliminating the precise rate limiter and having just a single rate
limiter would be part of this.
I think that the code base has multiple usages of the rate limiter.
The dispatch rate limiting might require some variation. IIRC, rate
limiting is also used for some other purposes in the code base in a
blocking manner; for example, unloading of bundles is rate limited.
Working on the code base will reveal this.
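
As an illustration of what that scheduler-free accounting could look
like, here's a minimal sketch (not the actual broker code; the class and
method names are made up, and overflow handling is omitted):

    class AsyncTokenBucket {
        private final long ratePerSecond;   // configured bytes/s
        private final long capacity;        // maximum tokens that can accumulate
        private long tokens;
        private long lastRefillNanos;

        AsyncTokenBucket(long ratePerSecond, long capacity, long nowNanos) {
            this.ratePerSecond = ratePerSecond;
            this.capacity = capacity;
            this.tokens = capacity;
            this.lastRefillNanos = nowNanos;
        }

        /** Consume tokens for a message that was already received; the count may go negative. */
        synchronized boolean consumeAndCheck(long bytes, long nowNanos) {
            long newTokens = (nowNanos - lastRefillNanos) * ratePerSecond / 1_000_000_000L;
            if (newTokens > 0) {
                tokens = Math.min(capacity, tokens + newTokens);
                lastRefillNanos = nowNanos;
            }
            tokens -= bytes;    // always consume: the bytes are already on the broker
            return tokens >= 0; // false -> caller disables auto-read on the connection
        }

        /** Pause length before re-enabling auto-read, based on the deficit and average message size. */
        synchronized long pauseNanos(long avgMessageSizeBytes) {
            long deficit = Math.max(0, -tokens) + avgMessageSizeBytes;
            return deficit * 1_000_000_000L / ratePerSecond;
        }
    }

In this sketch a scheduler would only get involved when consumeAndCheck()
returns false, to re-enable auto-read after roughly pauseNanos().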

> While the number of events in the system topic would be fairly low per
> namespace, the issue is that this system topic lies on the same broker
> where the actual topic/partitions exist and those partitions are leading to
> degradation of this particular broker.

Yes, that could be possible but perhaps unlikely.

> Agreed, there is a challenge, as it's not as straightforward as I've
> demonstrated in the example above.

Yes, it might require some rounds of experimentation. In general,
though, I think it's a fairly straightforward problem to solve, as long
as the requirements can be adapted to a few small details that make sense
for bursting in the context of rate limiting. The key detail is that the
long-term average rate shouldn't go over the configured rate even with
the bursts. That's why the tokens usable for the burst should be
"earned". I'm not sure it is even necessary to enforce that a burst can
only happen once every 10 minutes. Why does it matter, as long as the
average rate is controlled?

> There is clearly a gap with respect to understanding what the common
> requirements are. I am here speaking only of my requirements. Some of these
> may contradict requirements of others who haven't spoken up so far. Would
> we be changing the rate limiter once again when someone else comes up with

I'll just repeat that Pulsar isn't so special that it would require
very special rate limiting. The basics should be sufficient, as they are
for most other similar projects.

> new requirements? For instance, the "exponential decay" mentioned by Rajan
> is completely different from any token based approach.

"exponential decay" is usually mentioned in the context of rate
limiting retries. for exampl, the exponential back off.
I'd like to see examples of "exponential decay" in the context of
actual messaging / traffic rate limiting before we FOMO about this.

> I think there is a need for a new protocol message from server to client
> indicating rate-limiting (say, a 429 response, with an optional
> retry-after). That should handle a lot of these things. For instance, we
> can then build configurable logic in the client to stop producing all
> together after a sufficient number of 429 received from the server. There
> can also be a "BLOCKED" protocol that the server could indicate to the
> client.

Yes. I think Kafka has something like that. It's mentioned in the
Kafka Quotas [1] doc.
I believe that the needed explicit producer flow control could cover
this. In Pulsar, we need to add explicit producer flow control to
address the connection multiplexing problem.

1 - https://docs.confluent.io/kafka/design/quotas.html

> From our experience, inferring these things via client-broker clock sync
> assumption would not hold true. We regularly notice clock differences
> between clients and brokers in our setup.

The Pulsar binary protocol could have a way to calculate the
approximate clock difference between the client and the broker so that
end-to-end latencies could be calculated on the broker side from
client side timestamps.
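
For what it's worth, a simple NTP-style exchange would probably be
sufficient for an approximate offset. A purely hypothetical sketch
(nothing like this exists in the Pulsar protocol today):

    final class ClockOffsetEstimator {
        /**
         * t1 = client clock when the probe was sent, t2 = broker clock when it arrived,
         * t3 = broker clock when the reply was sent, t4 = client clock when the reply arrived.
         * Returns the estimated broker-minus-client clock offset in milliseconds.
         */
        static long estimateOffsetMillis(long t1, long t2, long t3, long t4) {
            return ((t2 - t1) + (t3 - t4)) / 2;
        }

        /** Broker-side end-to-end latency estimate from a client-side publish timestamp. */
        static long endToEndLatencyMillis(long brokerNowMillis, long clientPublishMillis,
                                          long offsetMillis) {
            return brokerNowMillis - (clientPublishMillis + offsetMillis);
        }
    }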

> When you say - adding a permit-based flow control - even if this is
> implemented, multiplexing is still an issue as the tcp/ip channel itself is
> put on pause at the netty level. Is there any other way of rate limiting
> and rejecting packets from a channel selectively so as to contain the rate
> limiting effect only to the specific partition out of all the partitions
> being shared in the channel?

The TCP/IP channel wouldn't be put on pause in that solution as the
primary way to handle flow control.
Yes, there are alternatives. It's the permit-based flow control. The
server would send out a manageable amount of permits at a time. There
are challenges in making this work efficiently with low amounts of
memory. That's one of the tradeoffs.

> When I was doing poller vs precise testing wrt CPU, network, broker
> latencies etc, one of the reason precise was much more CPU efficient and
> had minimal impact on broker latencies was due to the fact that the netty
> channel was being paused precisely.

Yes. I have similar experiences, and this has also been reported. Btw,
the precise rate limiter had some bugs in the past, but there were some
contributions from the community that made it usable. I also contributed
to that area in reviews and some minor PRs. One of the big challenges to
overcome was to understand how the solution works. That's why I think
that the concepts should match common reference points so that it stays
maintainable.

> > control for producers and revisiting the broker side backpressure/flow
> > control and therefore it won't happen quickly.
> > Please go ahead and create a GH issue and share your context. That
> > will be very helpful.
> >
> >
> I will do so. Does this need a PIP? To reiterate, I will be opening a GH
> issue on the lines of "don't share connection to the broker across producer
> objects"

I think it could start with an issue that describes the problem. The
PIP is eventually needed to add the new option to the producer builder
and consumer builder interfaces.


> Are you suggesting that we first go ahead and convert the rate limiter in
> pulsar to a simple, single-token based approach?
> I personally do not see any benefit in this apart from code refactoring.
> The precise rate limiter is basically doing that already- all be it
> refilling only every 1 second, rather than distributing the tokens across
> that second.

I have described the benefits in some earlier comments. The code is not
maintainable at the moment by anyone other than the people who have
worked with it. Another point is getting rid of the separate "precise"
option, which leaks implementation details all the way to Pulsar users
who don't need to know what a "precise rate limiter" is. The default
Pulsar core rate limiter should do what it promises to do. That would be
the starting point to match the existing feature set. After that, it's
possible to make it a dual token bucket internally so that a separate max
rate can be set: in a single token bucket, once the bucket capacity is
large, the bursts don't have a rate limit (a large-capacity bucket allows
bursts whenever traffic has temporarily dropped below the configured
rate). Yes, the precise rate limiter is something like having a bucket
with one second of capacity, and that's why it doesn't allow bursts.

> > algorithm is. That could be a starting point until we start covering
> > the advanced cases which require a dual token bucket algorithm.
> > If someone has the bandwidth, it would be fine to start experimenting
> > in this area with code to learn more.
> >
> >
> I think this is a big assumption. Since this is a critical use case in my
> organisation, I will have to contribute everything here myself. Now I do
> understand that reviews can take time in the OSS world, but this can't be
> left at "simple token based approach and then letting anyone pick and
> explore to extend/enhance it to dual token approach". This probably is one
> of the main reasons why pluggability is important here :)

If you check many other projects, most of them use token bucket based
approaches. That's why I'm confident that it is a feasible solution.
It's always possible to be creative and make a variation that covers
the majority of Pulsar use cases. Pulsar is OSS, and if there's a gap in
the solution, you don't even need pluggability. :)

We are soon at the point where it's time to deliver. :) As mentioned
before, I am busy with other things in the upcoming weeks, so I'll
leave the exact schedule for my contributions open for now.
Thanks for the great conversations. It has been helpful in many ways in making
progress. Looking forward to the next step and what that could
possibly be.


-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari,
I've now gone through a bunch of rate limiting algorithms, along with the
dual-rate, dual token bucket algorithm.
Reply inline.

On Tue, Nov 7, 2023 at 11:32 PM Lari Hotari <lh...@apache.org> wrote:

>
> Bucket4J documentation gives some good ideas and it's shows how the
> token bucket algorithm could be varied. For example, the "Refill
> styles" section [1] is useful to read as an inspiration.
> In network routers, there's a concept of "dual token bucket"
> algorithms and by googling you can find both Cisco and Juniper
> documentation referencing this.
> I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
>
> 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
>
>
While the dual-rate dual token bucket looks promising, there is still a
challenge with respect to allowing a certain peak burst for up to a
longer duration. I explain it below:

Assume a 10MBps topic, with bursting support of 1.5x for up to 2 minutes,
once every 10-minute interval.

The first bucket has the capacity of the consistent, long-term peak that
the topic would observe, so basically
`limit.capacity(10_000_000).refillGreedy(1_000_000, ofMillis(100))`. Here,
I am not refilling every millisecond because the topic will never receive
uniform, homogeneous traffic at a millisecond level. The pattern would
always be a batch of one or more messages in an instant, then nothing for
the next few instants, then another batch of one or more messages, and so
on.
This first bucket is good enough to handle rate limiting in normal cases
without bursting support. This is also better than the current precise
rate limiter in Pulsar, which allows produce hotspotting within a second,
whereas this one spreads the quota over 100ms time windows.

The second bucket would have to have a slower and less granular refill
rate - think of something like
`limit.capacity(5_000_000).refillGreedy(5_000_000, ofMinutes(10))`.
Now the problem with this approach is that it would allow only 1 second's
worth of an additional 5MBps on the topic every 10 minutes.

Suppose I change the refill rate to
`limit.capacity(5_000_000).refillGreedy(5_000_000, ofSeconds(1))` - then
it does allow an additional 5MBps burst every second, but now there is no
duration cap (the 2-minute cap).

I will have to think this through and do some more math to figure out if
this is possible at all using the dual-rate dual token bucket method.
Inputs welcome.
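
For reference, the classic dual token bucket shape (average rate with a
large burst capacity, plus a small peak-rate bucket) would look roughly
like this in the Bucket4J builder style used in the snippets above
(assuming Bucket4J 8.x; the numbers are only illustrative):

    import io.github.bucket4j.Bucket;
    import java.time.Duration;

    class DualTokenBucketSketch {
        // Average rate 10MB/s. The 600MB capacity is the net drawdown of a 2-minute
        // burst at 15MB/s: (15MB/s - 10MB/s) * 120s. The second limit caps the peak
        // at 15MB (1.5x) in any one-second window.
        private final Bucket bucket = Bucket.builder()
                .addLimit(limit -> limit.capacity(600_000_000)
                        .refillGreedy(10_000_000, Duration.ofSeconds(1)))
                .addLimit(limit -> limit.capacity(15_000_000)
                        .refillGreedy(15_000_000, Duration.ofSeconds(1)))
                .build();

        boolean tryPublish(long messageSizeInBytes) {
            // a consume must fit within *both* limits
            return bucket.tryConsume(messageSizeInBytes);
        }
    }

Since every consume has to satisfy both limits, this allows a burst of
roughly 2 minutes at 1.5x once the large bucket has filled up, but it
still cannot express the "only once every 10 minutes" cooldown directly -
which is exactly the gap described above.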

>
> >>
> >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> >> metrics via Prometheus. There might be other ways to provide this
> >> information for components that could react to this.
> >> For example, it could a be system topic where these rate limiters emit
> events.
> >>
> >
> > Are there any other system topics than
> `tenent/namespace/__change_events` . While it's an improvement over
> querying metrics, it would still mean one consumer per namespace and would
> form a cyclic dependency - for example, in case a broker is degrading due
> to mis-use of bursting, it might lead to delays in the consumption of the
> event from the __change_events topic.
>
> The number of events is a fairly low volume so the possible
> degradation wouldn't necessarily impact the event communications for
> this purpose. I'm not sure if there are currently any public event
> topics in Pulsar that are supported in a way that an application could
> directly read from the topic.  However since this is an advanced case,
> I would assume that we don't need to focus on this in the first phase
> of improving rate limiters.
>
>
While the number of events in the system topic would be fairly low per
namespace, the issue is that this system topic lives on the same broker
where the actual topic/partitions exist, and those partitions are what is
degrading this particular broker.



> I agree. It's better to exclude this from the discussion at the moment
> so that it doesn't cause confusion. Modifying the rate limits up and
> down automatically with some component could be considered out of
> scope in the first phase. However, that might be controversial since
> the externally observable behavior of rate limiting with bursting
> support seems to be behave in a way where the rate limit changes
> automatically. The "auto scaling" aspect of rate limiting, modifying
> the rate limits up and down, might be necessary as part of a rate
> limiter implementation eventually. More of that later.
>
Not to drag this topic out in this discussion, but doesn't this basically
mean that there is no "rate limiting" at all? Basically as good as
setting it to `-1` and just letting the hardware handle it as best it
can.


> >
> > The desired behavior in all three situations is to have a multiplier
> based bursting capability for a fixed duration. For example, it could be
> that a pulsar topic would be able to support 1.5x of the set quota for a
> burst duration of up to 5 minutes. There also needs to be a cooldown period
> in such a case that it would only accept one such burst every X minutes,
> say every 1 hour.
>
>
> Thanks for sharing this concrete example. It will help a lot when
> starting to design a solution which achieves this. I think that a
> token bucket algorithm based design can achieve something very close
> to what you are describing. In the first part I shared the references
> to Bucket4J's "refill styles" and the concept of "dual token bucket"
> algorithms.
> One possible solution is a dual token bucket where the first bucket
> takes care of the burst of up to 5 minutes. The second token bucket
> can cover the maximum rate of 1.5x. The cooldown period could be
> handled in a way where the token bucket fills up the bucket once in
> every 60 minutes. There could be implementation challenges related to
> adjusting the enforced rate up and down since as you had also
> observed, the token bucket algorithm will let all buffered tokens be
> used.
>

Agreed, there is a challenge, as it's not as straightforward as I've
demonstrated in the example above.


> > Suppose the SOP for bursting support is to support a burst of upto 1.5x
> produce rate for upto 10 minutes, once in a window of 1 hour.
> >
> > Now if a particular producer is trying to burst to or beyond 1.5x for
> more than 5 minutes - while we are not accepting the requests and the
> producer client is continuously going into timeouts, there should be a way
> to completely blacklist that producer so that even the client stops
> attempting to produce and we unload the topics all together - thus,
> releasing open connections and broker compute.
>
>
> It seems that this part of the requirements is more challenging to
> cover. We don't have to give up on trying to make progress on this
> side of your requirements. It could be helpful to do thought
> experiments on how some of the conditions could be detected. For
> example, how would you detect if a particular producer is going beyond
> the configured policy? It does seem intriguing to start adding the
> support to the rate limiting area. The rate limiter has the
> information about when bursting is actively happening. I would favor a
> solution where the rate limiter could only emit events and not make
> decisions about blocking a producer. However, I'm open to changing my
> opinion on that detail of the design if it turns out to be a common
> requirement, and it fits the overall architecture of the broker.
>

There is clearly a gap with respect to understanding what the common
requirements are. I am here speaking only of my requirements. Some of these
may contradict requirements of others who haven't spoken up so far. Would
we be changing the rate limiter once again when someone else comes up with
new requirements? For instance, the "exponential decay" mentioned by Rajan
is completely different from any token based approach.


>
> The information about timeouts is on the client side. Perhaps there
> are ways to pass the timeout configuration information along with the
> producer creation, for example. The Pulsar binary protocol could have
> some sort of clock sync so that client and broker clock difference
> could be calculated with sufficient accuracy. However, the broker
> cannot know when messages timeout that aren't at the broker at all. If
> the end-to-end latency of sent messages start approaching the timeout
> value, it's a good indication on the broker side that the client side
> is approaching the state where messages will start timing out.
>

I think there is a need for a new protocol message from server to client
indicating rate limiting (say, a 429 response, with an optional
retry-after). That should handle a lot of these things. For instance, we
could then build configurable logic in the client to stop producing
altogether after a sufficient number of 429s received from the server.
There could also be a "BLOCKED" protocol message that the server could
send to the client.

In our experience, inferring these things based on an assumption of
client-broker clock sync would not hold true. We regularly notice clock
differences between clients and brokers in our setup.


> > What do you think about the quick solution of not sharing connections
> across producer objects? I can raise a github issue explaining the
> situation.
>
> It should be very straightforward to add a flag to producer and
> consumer builder to use an isolated (non-shared) broker connection.
> It's not optimal to add such flags, but I haven't heard of other ways
> to solve this challenge where the Pulsar binary protocol doesn't have
> to be modified or the backpressure solution revisited in the broker.
> The proper solution requires doing both, adding a permit-based flow
>

When you say "adding permit-based flow control" - even if this is
implemented, multiplexing is still an issue, as the TCP/IP channel itself
is put on pause at the Netty level. Is there any other way of rate
limiting and rejecting packets from a channel selectively, so as to
contain the rate limiting effect to only the specific partition out of
all the partitions sharing the channel?

When I was doing poller vs precise testing wrt CPU, network, broker
latencies, etc., one of the reasons precise was much more CPU efficient
and had minimal impact on broker latencies was that the Netty channel was
being paused precisely.


> control for producers and revisiting the broker side backpressure/flow
> control and therefore it won't happen quickly.
> Please go ahead and create a GH issue and share your context. That
> will be very helpful.
>
>
I will do so. Does this need a PIP? To reiterate, I will be opening a GH
issue along the lines of "don't share the connection to the broker across
producer objects".



> On the lowest level of unit tests, it will be helpful to have a
> solution where the clock source used in unit test can be run quickly
> so that simulated scenarios don't take a long time to run, but could
> cover the essential features.
>
>
Agreed.


> It might be hard to get started on the rate limiter implementation by
> looking at the existing code. The reason of the double methods in the
> existing interface is due to it covering 2 completely different ways
> of handling the rate limiting and trying to handle that in a single
> concept. What is actually desired for at least the producing rate
> limiting is that it's an asynchronous rate limiting where any messages
> that have already arrived to the broker will be handled. The token
> count could go to a negative value when implementing this in the token
> bucket algorithm. If it goes below 0, the producers for the topic
> should be backpressured by toggling the auto-read state and it should
> schedule a job to resume the auto-reading after there are at least a
> configurable amount of tokens available. There should be no need to
> use the scheduler to add tokens to the bucket. Whenever tokens are
> used, the new token count can be calculated based on the time since
> tokens were last updated, by default the average max rate defines how
> many tokens are added. The token count cannot increase larger than the
> token bucket capacity. This is how simple a plain token bucket
>

Are you suggesting that we first go ahead and convert the rate limiter in
Pulsar to a simple, single token bucket based approach?
I personally do not see any benefit in this apart from code refactoring.
The precise rate limiter is basically doing that already - albeit
refilling only every 1 second, rather than distributing the tokens across
that second.



> algorithm is. That could be a starting point until we start covering
> the advanced cases which require a dual token bucket algorithm.
> If someone has the bandwidth, it would be fine to start experimenting
> in this area with code to learn more.
>
>
I think this is a big assumption. Since this is a critical use case in my
organisation, I will have to contribute everything here myself. Now I do
understand that reviews can take time in the OSS world, but this can't be
left at "a simple token based approach, and then letting anyone pick it
up and explore extending/enhancing it into a dual token approach". This
probably is one of the main reasons why pluggability is important here :)


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Rajan,

Thank you for sharing your opinion. It appears you are in favor of
pluggable interfaces for rate limiters. I would like to offer a
perspective on why we should defer the pluggability aspect of rate
limiters. If there is a real need, it can be considered later.

Currently, there's a pressing need to enhance the built-in rate
limiters in Pulsar. The current state is not good; our default rate
limiter is nearly ineffective and unusable at scale due to its high
CPU usage and overhead. Additionally, while the precise rate limiter
doesn't have CPU issues, it does lack the capability for configuring
an average rate over an extended period.

I propose we eliminate the precise rate limiter as a distinct option
and converge towards a single, configurable rate limiter solution.
Using terms like "precise" in our configuration options exposes
unnecessary implementation details and is a practice we should avoid.

Our abstractions should be designed to allow for underlying
improvements without compromising the established "contract" of the
feature. For rate limiting, we can maintain the current external
behavior while completely overhauling the internal workings.

Introducing "bursting" features, which enable average rate
calculations over longer durations, will necessitate additional
configuration options to define this time window and possibly a
separate maximum rate.

Let's prioritize refining the core rate-limiting capabilities of
Pulsar. Afterwards, we can revisit the idea of pluggable rate limiters
if they still seem necessary.
Concentrating on Pulsar's built-in rate limiters will also help us
identify the interfaces needed for pluggable rate limiters.
Moreover, by focusing on the actual rate limiting behavior and
collating Pulsar user requirements, we can potentially design generic
solutions that address the majority of needs within Pulsar's core.

Hence, our immediate goal should be to advance these improvements and
integrate them into Pulsar's core as soon as possible.

-Lari


On Wed, 8 Nov 2023 at 02:13, Rajan Dhabalia <rd...@apache.org> wrote:
>
> Hi Lari/Girish,
>
> I am sorry for jumping late in the discussion but I would like to
> acknowledge the requirement of pluggable publish rate-limiter and I had
> also asked it during implementation of publish rate limiter as well. There
> are trade-offs between different rate-limiter implementations based on
> accuracy, n/w usage, simplification and user should be able to choose one
> based on the requirement. However, we don't have correct and extensible
> Publish rate limiter interface right now, and before making it pluggable we
> have to make sure that it should support any type of implementation for
> example: token based or sliding-window based throttling, support of various
> decaying functions (eg: exponential decay:
> https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen such
> interface details and design in the PIP:
> https://github.com/apache/pulsar/pull/21399/. So, I would encourage to work
> towards building pluggable rate-limiter but current PIP is not ready as it
> doesn't cover such generic interfaces that can support different types of
> implementation.
>
> Thanks,
> Rajan
>
> On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari <lh...@apache.org> wrote:
>
> > Hi Girish,
> >
> > Replies inline.
> >
> > On Tue, 7 Nov 2023 at 15:26, Girish Sharma <sc...@gmail.com>
> > wrote:
> > >
> > > Hello Lari, replies inline.
> > >
> > > I will also be going through some textbook rate limiters (the one you
> > shared, plus others) and propose the one that at least suits our needs in
> > the next reply.
> >
> >
> > sounds good. I've been also trying to find more rate limiter resources
> > that could be useful for our design.
> >
> > Bucket4J documentation gives some good ideas and it's shows how the
> > token bucket algorithm could be varied. For example, the "Refill
> > styles" section [1] is useful to read as an inspiration.
> > In network routers, there's a concept of "dual token bucket"
> > algorithms and by googling you can find both Cisco and Juniper
> > documentation referencing this.
> > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
> >
> > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
> >
> > >>
> > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> > >> meeting notes can be found at
> > >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> > >>
> > >
> > > Would it make sense for me to join this time given that you are skipping
> > it?
> >
> > Yes, it's worth joining regularly when one is participating in Pulsar
> > core development. There's usually a chance to discuss all topics that
> > Pulsar community members bring up to discussion. A few times there
> > haven't been any participants and in that case, it's good to ask on
> > the #dev channel on Pulsar Slack whether others are joining the
> > meeting.
> >
> > >>
> > >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> > >> metrics via Prometheus. There might be other ways to provide this
> > >> information for components that could react to this.
> > >> For example, it could a be system topic where these rate limiters emit
> > events.
> > >>
> > >
> > > Are there any other system topics than
> > `tenent/namespace/__change_events` . While it's an improvement over
> > querying metrics, it would still mean one consumer per namespace and would
> > form a cyclic dependency - for example, in case a broker is degrading due
> > to mis-use of bursting, it might lead to delays in the consumption of the
> > event from the __change_events topic.
> >
> > The number of events is a fairly low volume so the possible
> > degradation wouldn't necessarily impact the event communications for
> > this purpose. I'm not sure if there are currently any public event
> > topics in Pulsar that are supported in a way that an application could
> > directly read from the topic.  However since this is an advanced case,
> > I would assume that we don't need to focus on this in the first phase
> > of improving rate limiters.
> >
> > >
> > >
> > >> I agree. I just brought up this example to ensure that your
> > >> expectation about bursting isn't about controlling the rate limits
> > >> based on situational information, such as end-to-end latency
> > >> information.
> > >> Such a feature could be useful, but it does complicate things.
> > >> However, I think it's good to keep this on the radar since this might
> > >> be needed to solve some advanced use cases.
> > >>
> > >
> > > I still envision auto-scaling to be admin API driven rather than produce
> > throughput driven. That way, it remains deterministic in nature. But it
> > probably doesn't make sense to even talk about it until (partition)
> > scale-down is possible.
> >
> >
> > I agree. It's better to exclude this from the discussion at the moment
> > so that it doesn't cause confusion. Modifying the rate limits up and
> > down automatically with some component could be considered out of
> > scope in the first phase. However, that might be controversial since
> > the externally observable behavior of rate limiting with bursting
> > support seems to be behave in a way where the rate limit changes
> > automatically. The "auto scaling" aspect of rate limiting, modifying
> > the rate limits up and down, might be necessary as part of a rate
> > limiter implementation eventually. More of that later.
> >
> > >
> > > In all of the 3 cases that I listed, the current behavior, with precise
> > rate limiting enabled, is to pause the netty channel in case the throughput
> > breaches the set limits. This eventually leads to timeout at the client
> > side in case the burst is significantly greater than the configured timeout
> > on the producer side.
> >
> >
> > Makes sense.
> >
> > >
> > > The desired behavior in all three situations is to have a multiplier
> > based bursting capability for a fixed duration. For example, it could be
> > that a pulsar topic would be able to support 1.5x of the set quota for a
> > burst duration of up to 5 minutes. There also needs to be a cooldown period
> > in such a case that it would only accept one such burst every X minutes,
> > say every 1 hour.
> >
> >
> > Thanks for sharing this concrete example. It will help a lot when
> > starting to design a solution which achieves this. I think that a
> > token bucket algorithm based design can achieve something very close
> > to what you are describing. In the first part I shared the references
> > to Bucket4J's "refill styles" and the concept of "dual token bucket"
> > algorithms.
> > One possible solution is a dual token bucket where the first bucket
> > takes care of the burst of up to 5 minutes. The second token bucket
> > can cover the maximum rate of 1.5x. The cooldown period could be
> > handled in a way where the token bucket fills up the bucket once in
> > every 60 minutes. There could be implementation challenges related to
> > adjusting the enforced rate up and down since as you had also
> > observed, the token bucket algorithm will let all buffered tokens be
> > used.
> >
> > >
> > >
> > >>
> > >>
> > >> >    - In a visitor based produce rate (where produce rate goes up in
> > the day
> > >> >    and goes down in the night, think in terms of popular website
> > hourly view
> > >> >    counts pattern) , there are cases when, due to certain
> > external/internal
> > >> >    triggers, the views - and thus - the produce rate spikes for a few
> > minutes.
> > >>
> > >> Again, please explain the current behavior and desired behavior.
> > >> Explicit example values of number of messages, bandwidth, etc. would
> > >> also be helpful details.
> > >
> > >
> > > Adding to what I wrote above, think of this pattern like the following:
> > the produce rate slowly increases from ~2MBps at around 4 AM to a known
> > peak of about 30MBps by 4 PM and stays around that peak until 9 PM after
> > which is again starts decreasing until it reaches ~2MBps around 2 AM.
> > > Now, due to some external triggers, maybe a scheduled sale event, at
> > 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go
> > back down to the usual ~20MBps . Here is a rough image showcasing the trend.
> > >
> > >
> > >>
> > >>
> > >> >    It is also important to keep this in check so as to not allow bots
> > to do
> > >> >    DDOS into your system, while that might be a responsibility of an
> > upstream
> > >> >    system like API gateway, but we cannot be ignorant about that
> > completely.
> > >>
> > >> what would be the desired behavior?
> > >
> > >
> > > The desired behavior is that the burst support should be short lived
> > (5-10 minutes) and limited to a fixed number of bursts in a duration (say -
> > 1 burst per hour). Obviously, all of these should be configurable, maybe at
> > a broker level and not a topic level.
> >
> >
> > That can be handled with a dual token bucket algorithm based rate limiter.
> >
> > >
> > > I have yet to go in depth on both the textbook rate limiter models, but
> > if token bucket works on buffered tokens, it may not be much useful.
> > Consider a situation where the throughput is always at its absolute
> > configured peak, thus not reserving tokens at all. Again - this is the
> > understanding based on the above text. I am yet to go through the wiki page
> > with full details. In any case, I am very sure that an absolute maximum
> > bursting limit is needed per topic, since at the end of the day, we are
> > limited by hardware resources like network bandwidth, disk bandwidth etc.
> >
> >
> > Yes, if the actual throughput is always at the rate limiter's maximum
> > average rate, the tokens wouldn't be reserved at all in the default
> > token bucket algorithm. Is that a problem?
> > It might not be a problem at all. Spare tokens would be buffered
> > whenever the throughput is lower and therefore bursting could help
> > catch up to the average rate over a longer period of time.
> >
> > Btw. The challenge of no bursting when the actual throughput is
> > already at the peak rate is the original reason why I introduced the
> > concept of automatically adjusting the rate limit up and down based on
> > end-to-end latency information. If the rate limiter had a feedback
> > loop, it could know whether it needs to do bursting to prevent the
> > latencies from increasing. That was one of the reasons why I was
> > thinking about the automatic adjustments of the rate limit. However,
> > it's better to keep that out of the first phase for now.
> >
> > I believe that we will find many ways to apply the token bucket
> > algorithm to meet our requirements. A lot of flexibility can be added
> > with dual token buckets and by introducing new ways of adding/filling
> > tokens to the buckets. It's also possible to tweak a lot of small
> > details if that is necessary, such as smoothening out burst rates and
> > so on. However, it's better to keep things simple when possible.
> >
> >
> > > Suppose the SOP for bursting support is to support a burst of upto 1.5x
> > produce rate for upto 10 minutes, once in a window of 1 hour.
> > >
> > > Now if a particular producer is trying to burst to or beyond 1.5x for
> > more than 5 minutes - while we are not accepting the requests and the
> > producer client is continuously going into timeouts, there should be a way
> > to completely blacklist that producer so that even the client stops
> > attempting to produce and we unload the topics all together - thus,
> > releasing open connections and broker compute.
> >
> >
> > It seems that this part of the requirements is more challenging to
> > cover. We don't have to give up on trying to make progress on this
> > side of your requirements. It could be helpful to do thought
> > experiments on how some of the conditions could be detected. For
> > example, how would you detect if a particular producer is going beyond
> > the configured policy? It does seem intriguing to start adding the
> > support to the rate limiting area. The rate limiter has the
> > information about when bursting is actively happening. I would favor a
> > solution where the rate limiter could only emit events and not make
> > decisions about blocking a producer. However, I'm open to changing my
> > opinion on that detail of the design if it turns out to be a common
> > requirement, and it fits the overall architecture of the broker.
> >
> > The information about timeouts is on the client side. Perhaps there
> > are ways to pass the timeout configuration information along with the
> > producer creation, for example. The Pulsar binary protocol could have
> > some sort of clock sync so that client and broker clock difference
> > could be calculated with sufficient accuracy. However, the broker
> > cannot know when messages timeout that aren't at the broker at all. If
> > the end-to-end latency of sent messages start approaching the timeout
> > value, it's a good indication on the broker side that the client side
> > is approaching the state where messages will start timing out.
> >
> > In general, I think we should keep these requirements covered in the
> > design so that the rate limiter implementation could be extended to
> > support the use case in a way or another in the later phases.
> >
> >
> > >> > It's only async if the producers use `sendAsync` . Moreover, even if
> > it is
> > >> > async in nature, practically it ends up being quite linear and
> > >> > well-homogenised. I am speaking from experience of running 1000s of
> > >> > partitioned topics in production
> > >>
> > >> I'd assume that most high traffic use cases would use asynchronous
> > >> sending in a way or another. If you'd be sending synchronously, it
> > >> would be extremely inefficient if messages are sent one-by-one. The
> > >> asyncronouscity could also come from multiple threads in an
> > >> application.
> > >>
> > >> You are right that a well homogenised workload reduces the severity of
> > >> the multiplexing problem. That's also why the problem of multiplexing
> > >> hasn't been fixed since the problem isn't visibile and it's also very
> > >> hard to observe.
> > >> In a case where 2 producers with very different rate limiting options
> > >> share a connection, it definetely is a problem in the current
> > >
> > >
> > > Yes actually, I was talking to the team and we did observe a case where
> > there was a client app with 2 producer objects writing to two different
> > topics. When one of their topics was breaching quota, the other one was
> > also observing rate limiting even though it was under quota. So eventually,
> > this surely needs to be solved. One short term solution is to at least not
> > share connections across producer objects. But I still feel it's out of
> > scope of this discussion here.
> >
> > Thanks for sharing. The real stories from operating Pulsar at scale
> > are always interesting to hear!
> >
> > > What do you think about the quick solution of not sharing connections
> > across producer objects? I can raise a github issue explaining the
> > situation.
> >
> > It should be very straightforward to add a flag to producer and
> > consumer builder to use an isolated (non-shared) broker connection.
> > It's not optimal to add such flags, but I haven't heard of other ways
> > to solve this challenge where the Pulsar binary protocol doesn't have
> > to be modified or the backpressure solution revisited in the broker.
> > The proper solution requires doing both, adding a permit-based flow
> > control for producers and revisiting the broker side backpressure/flow
> > control and therefore it won't happen quickly.
> > Please go ahead and create a GH issue and share your context. That
> > will be very helpful.
> >
> > Some thoughts about proceeding to making things a reality so that it
> > gets implemented.
> >
> > One important part of development will be how we could test or
> > simulate a scenario and validate that the rate limiting algorithm
> > makes sense.
> > On the lowest level of unit tests, it will be helpful to have a
> > solution where the clock source used in unit test can be run quickly
> > so that simulated scenarios don't take a long time to run, but could
> > cover the essential features.
> >
> > It might be hard to get started on the rate limiter implementation by
> > looking at the existing code. The reason of the double methods in the
> > existing interface is due to it covering 2 completely different ways
> > of handling the rate limiting and trying to handle that in a single
> > concept. What is actually desired for at least the producing rate
> > limiting is that it's an asynchronous rate limiting where any messages
> > that have already arrived to the broker will be handled. The token
> > count could go to a negative value when implementing this in the token
> > bucket algorithm. If it goes below 0, the producers for the topic
> > should be backpressured by toggling the auto-read state and it should
> > schedule a job to resume the auto-reading after there are at least a
> > configurable amount of tokens available. There should be no need to
> > use the scheduler to add tokens to the bucket. Whenever tokens are
> > used, the new token count can be calculated based on the time since
> > tokens were last updated, by default the average max rate defines how
> > many tokens are added. The token count cannot increase larger than the
> > token bucket capacity. This is how simple a plain token bucket
> > algorithm is. That could be a starting point until we start covering
> > the advanced cases which require a dual token bucket algorithm.
> > If someone has the bandwidth, it would be fine to start experimenting
> > in this area with code to learn more.
> >
> > After trying something out, it's easier to be more confident about the
> > design of dual token bucket algorithm. That's one of the assumptions
> > that I'm making that dual token bucket would be sufficient.
> >
> > This has been a great discussion so far and it feels like we are
> > getting to the necessary level of detail.
> > Thanks for sharing many details of your requirements and what you have
> > experienced and learned when operating Pulsar at scale.
> > I feel excited about the upcoming improvements for rate limiting in
> > Pulsar. From my side I have limited availability for participating in
> > this work for the next 2-3 weeks. After that, I'll be ready to help in
> > making the improved rate limiting a reality.
> >
> > -Lari
> >

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

Replies inline.

> > The current state of rate limiting is not acceptable in Pulsar. We
> > need to fix things in the core.
> >
>
> I wouldn't say that it's not acceptable. The precise one works as expected
> as a basic rate limiter. Its only when there are complex requirements, the
> current rate limiters fail.

Just clarifying: I consider the situation with the default rate limiter
to be suboptimal. The CPU overhead is significant, and you must
explicitly enable the "precise" rate limiters to resolve this issue,
which is not very obvious to any Pulsar user. I don't think that this
situation makes sense. If there's an abstraction of a rate limiter, it
should be efficient and usable in the default configuration of Pulsar.

>  The key here being "complex requirement". I am with Rajan here that
> whatever improvements we do to the core built-in rate limiter, would always
> miss one or the other complex requirement.

I haven't yet seen very complex requirements that relate directly to
the rate limiter. The scope could expand to the area of capacity
management, and I'm pretty sure that it gets there when we go further.
Capacity management is a broader concern than rate limiting. We all
know that capacity management is necessary in multi-tenant systems.
Rate limiting and throttling is one way to handle that. When going to
more complex requirements, it might be useful to go beyond rate limiting also in
the conceptual design.

For example DynamoDB has the concept of capacity units (CU).
DynamoDB's conceptual design for capacity management is well described
in a paper "Amazon DynamoDB: A Scalable, Predictably Performant, and
Fully Managed NoSQL Database Service" and a related presentation [1].
There's also other related blog posts such as "Surprising Scalability
of Multitenancy" [2] and "The Road To Serverless: Multi-tenancy" [3]
which have been inspirational to me. The paper "Kora: A Cloud-Native
Event Streaming Platform For Kafka" [4] is also a very useful read to
learn about serverless capacity management.

Capacity management goes beyond rate limiting since it has a tight
relation to end-to-end flow control, load balancing, service levels and
possible auto-scaling solutions. One of the goals of capacity
management in a multi-tenant
system is to address the dreaded "noisy neighbor" problem in a cost
optimal and efficient way.

1 - https://www.usenix.org/conference/atc22/presentation/elhemali
2 - https://brooker.co.za/blog/2023/03/23/economics.html
3 - https://me.0xffff.me/dbaas3.html
4 - https://www.vldb.org/pvldb/vol16/p3822-povzner.pdf

>
> I feel like we need both the things here. We can work on improving the
> built in rate limiter which does not try to solve all of my needs like
> blacklisting support, limiting the number of bursts in a window etc. The
> built in one can be improved with respect to code refactoring,
> extensibility, modularization etc. along with implementing a more well
> known rate limiting algorithm, like token bucket.
> Along with this, the newly improved interface can now make way for
> pluggable support.

Yes, I agree. Improving the rate limiter and exposing an interface
aren't mutually exclusive.


> This is assuming that we do not improve the default build in rate limiter
> at all. Think of this like AuthorizationProvider - there is a built in one.
> it's good enough, but each organization would have its own requirements wrt
> how they handle authorization, and thus, most likely, any organization with
> well defined AuthN/AuthZ constructs would be plugging their own providers
> there.

I don't think that this is a valid comparison. The current rate
limiter has an explicit "contract". The user sets the maximum rate in
bytes and/or messages and the rate limiter takes care of enforcing
that limit. It's hard to see why that "contract" would have too many
interpretations of what it means.
Another reason is that I haven't seen any other messaging product
where there would be a need to add support for a user-provided rate
limiter algorithm. What makes Pulsar such a special case that it would
be needed here?
For authentication and authorization, it's a completely different
story. The abstractions require that you pick a specific
implementation for your way of doing authentication and authorization.
Many other systems out there do it in a somewhat similar way to Pulsar.

> The key thing here would be to make the public interface as minimal as
> possible while allowing for custom complex rate limiters as well. I believe
> this is easily doable without actually making the internal code more
> complex.

It's doable, but a different question is whether this is necessary in
the end. We'll see over time how we can improve the Pulsar core rate
limiter and whether there's a need to override it.
The current interfaces will change when the Pulsar core rate limiter
is improved. This work won't easily meet in the middle unless we start
by improving the core rate limiter.

What Pulsar does right now might have to change. In Pulsar, there's no
actual control of the client application, like there is in Kafka Quotas
[1]. You mentioned that you want to block individual producers in your
complex requirements. That isn't supported in the current model of
rate limiters in Pulsar.
Does it even belong there? There are many questions to cover when
extending the current model. It's not only about a "pluggable rate
limiter" interface.

1 - https://docs.confluent.io/kafka/design/quotas.html

> In fact, the way I envision the pluggable proposal, things become simpler
> with respect to code flow, code ownership and custom if/else.

Yes, that might work for you. For the Pulsar open source project, it's
better to have a high-quality rate limiter implementation which covers
the majority of use cases.
I don't see why this wouldn't be possible. Your expectations of what a
pluggable rate limiter can do haven't been defined. Adding support for
doing more than what a rate limiter does would be a larger change.
It's hard to see why there would be a need for a better variation
of the token bucket algorithm when many of the most advanced systems
(such as network routers) use the token bucket algorithm for rate
limiting. It's always possible that the messaging domain is very
distinct...

> I will give another example here, In big organizations, where pulsar is
> actually being used at scale - and not just in terms of QPS/MBps, but also
> in terms of number of teams, tenants, namespaces, number of unique features
> being used, there always would be an in-house schema registry. Thus, while
> pulsar already has a built in schema service and registry - it is important
> that it also supports custom ones. This does not speak badly about the
> based pulsar package, but actually speaks more about the adaptability of
> the product.

Sure, that makes sense for such features. I just don't see what you
see with rate limiting and the variations that organizations need for
it. My opinion might change once I learn more about your use case and
what the "complex requirements" are that cannot be covered by the core
functionality of a rate limiter. I noticed that a new message arrived
while I was writing this email. Perhaps it contains more details about
your specific requirements and how they could impact the design?

> I really respect and appreciate the discussions we have had. One of the
> problems I've had in the pulsar community earlier is the lack of
> participation. But I am getting a lot of participation this time, so its
> really good.

Thanks. Yes, we need a more active Pulsar community. It's improving
every day. :)

> I am willing to take this forward to improve the default rate limiters, but
> since we would _have_ to meeting somewhere in the middle, at the end of all
> this - our organizational requirements would still remain unfulfilled until
> we build _all_ of the things that I have spoken about.

Exactly, that's the challenge I referred to above with the pluggable
interface design. Meeting in the middle won't happen easily unless
there's a focus on improving the core rate limiter, at least
simultaneously.

> Just as Rajan was late to the discussion and he pointed out that they also
> needed custom rate limiter a while back, there may he others who are either
> unknownst to this mailing list, or are yet to look into rate limiter all
> together who may find that whatever we have built/modified/improved is
> still lacking.

It will be interesting to hear what the details of Rajan's
requirements for a custom rate limiter were.
Perhaps those requirements could also be taken into account in
improving the default core rate limiter?

> I personally would like to work on improving the interface and making it
> pluggable first. This is a smaller task both from a design and coding
> perspective. Meanwhile, I will create another proposal for improving the
> built in rate limiter. Since we have had a lot of discussion about how to
> improve the rate limiter, and I will continue discussing which rate limiter
> works best in my opinion, I think we can be in a liberty to take a bit of
> extra time in discussing and closing the improved rate limiter design. Of
> course, I will keep the interface definition in mind while proposing the
> improved rate limiter and vice versa.

I accept that you take that perspective, and it's understandable that
it seems like the straightest path for you.
We'll see how things go, and eventually we will all adapt on the way.
Anyway, we are making progress at the moment. :)

-Lari



Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari, while I have yet to reply to your yesterday's email, I am
trying to wrap up this discussion about the need for a pluggable rate
limiter, so I'm replying to this first.

Comments inline


On Wed, Nov 8, 2023 at 5:35 PM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> > On Asaf's comment on too many public interfaces in Pulsar and no other
> > Apache software having so many public interfaces - I would like to ask,
> has
> > that brought in any con though? For this particular use case, I feel like
> > having it has a public interface would actually improve the code quality
> > and design as the usage would be checked and changes would go through
> > scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> > Asaf - what are your thoughts on this? Are you okay with making the
> > PublishRateLimiter pluggable with a better interface?
>
> As a reply to this question, I'd like to highlight my previous email
> in this thread where I responded to Rajan.
> The current state of rate limiting is not acceptable in Pulsar. We
> need to fix things in the core.
>

I wouldn't say that it's not acceptable. The precise one works as
expected as a basic rate limiter. It's only when there are complex
requirements that the current rate limiters fail.
The key here being "complex requirements". I am with Rajan here that
whatever improvements we make to the core built-in rate limiter would
always miss one or another complex requirement.

I feel like we need both things here. We can work on improving the
built-in rate limiter, which does not have to try to solve all of my
needs like blacklisting support, limiting the number of bursts in a
window, etc. The built-in one can be improved with respect to code
refactoring, extensibility, modularization, etc., along with implementing
a better-known rate limiting algorithm, like the token bucket.
Along with this, the newly improved interface can then make way for
pluggable support.


>
> One of the downsides of adding a lot of pluggability and variation is
> the additional fragmentation and complexity that it adds to Pulsar. It
> doesn't really make sense if you need to install an external library
> to make the rate limiting feature usable in Pulsar. Rate limiting is
>

This is assuming that we do not improve the default built-in rate limiter
at all. Think of this like AuthorizationProvider - there is a built-in
one. It's good enough, but each organization has its own requirements for
how it handles authorization, and thus, most likely, any organization
with well-defined AuthN/AuthZ constructs would plug in its own provider
there.

The key thing here would be to make the public interface as minimal as
possible while allowing for custom complex rate limiters as well. I believe
this is easily doable without actually making the internal code more
complex.
In fact, the way I envision the pluggable proposal, things become simpler
with respect to code flow, code ownership and custom if/else.
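
Just to illustrate what I mean by "minimal" (this is a purely
hypothetical shape I'm sketching here, not the existing
PublishRateLimiter interface and not the exact interface proposed in the
PIP):

    public interface PluggablePublishRateLimiter extends AutoCloseable {

        /** Record an incoming publish; return false if the connection should stop reading for now. */
        boolean tryAcquire(int numMessages, long numBytes);

        /** Called whenever the effective policy changes (broker/namespace/topic level). */
        void update(double maxPublishRateMessages, double maxPublishRateBytes);

        /** How long the broker should wait before re-enabling reads once throttled. */
        long throttlingPauseMillis();

        @Override
        void close();
    }

The broker would keep owning the Netty auto-read toggling and the
scheduling; the plugin would only answer the questions above, so the
internal code flow stays the same regardless of which implementation is
plugged in.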



> only needed when operating Pulsar at scale. The core feature must
> support this to make any sense.
>

I will give another example here. In big organizations, where Pulsar is
actually being used at scale - not just in terms of QPS/MBps, but also in
terms of the number of teams, tenants, namespaces, and unique features
being used - there will always be an in-house schema registry. Thus,
while Pulsar already has a built-in schema service and registry, it is
important that it also supports custom ones. This does not speak badly of
the base Pulsar package; it actually speaks to the adaptability of the
product.


>
> The side product of this could be the pluggable interface if the core
> feature cannot be extended to cover the requirements that Pulsar users
> really have.
> Girish, it has been extremely valuable that you have shared concrete
> examples of your rate limiting related requirements. I'm confident
> that we will be able to cover the majority of your requirements in
> Pulsar core. It might be the case that when you try out the new
> features, it might actually cover your needs sufficiently.
>

I really respect and appreciate the discussions we have had. One of the
problems I've had in the Pulsar community earlier is the lack of
participation, but I am getting a lot of participation this time, so it's
really good.

I am willing to take this forward and improve the default rate limiters, but
since we would _have_ to meet somewhere in the middle, at the end of all
this our organizational requirements would still remain unfulfilled until
we build _all_ of the things that I have spoken about.

Just as Rajan joined the discussion late and pointed out that they also
needed a custom rate limiter a while back, there may be others who are
either not on this mailing list, or are yet to look into rate limiting
altogether, who may find that whatever we have built/modified/improved is
still lacking.


>
> Let's keep the focus on improving the rate limiting in Pulsar core.
> The possible pluggable interface could follow.
>

I personally would like to work on improving the interface and making it
pluggable first. This is a smaller task from both a design and a coding
perspective. Meanwhile, I will create another proposal for improving the
built-in rate limiter. Since we have had a lot of discussion about how to
improve the rate limiter, and I will continue discussing which rate limiter
works best in my opinion, I think we can take the liberty of spending a bit
of extra time discussing and closing the improved rate limiter design. Of
course, I will keep the interface definition in mind while proposing the
improved rate limiter, and vice versa.

What are your thoughts here?


>
>
> -Lari
>
> On Wed, 8 Nov 2023 at 10:46, Girish Sharma <sc...@gmail.com>
> wrote:
> >
> > Hello Rajan,
> > I haven't updated the PIP with a better interface for PublishRateLimiter
> > yet as the discussion here in this thread went in a different direction.
> >
> > Personally, I agree with you that even if we choose one algorithm and
> > improve the built-in rate limiter, it still may not suit all use cases as
> > you have mentioned.
> >
> > On Asaf's comment on too many public interfaces in Pulsar and no other
> > Apache software having so many public interfaces - I would like to ask,
> has
> > that brought in any con though? For this particular use case, I feel like
> > having it has a public interface would actually improve the code quality
> > and design as the usage would be checked and changes would go through
> > scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> > Asaf - what are your thoughts on this? Are you okay with making the
> > PublishRateLimiter pluggable with a better interface?
> >
> >
> >
> >
> >
> > On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia <rd...@apache.org>
> wrote:
> >
> > > Hi Lari/Girish,
> > >
> > > I am sorry for jumping late in the discussion but I would like to
> > > acknowledge the requirement of pluggable publish rate-limiter and I had
> > > also asked it during implementation of publish rate limiter as well.
> There
> > > are trade-offs between different rate-limiter implementations based on
> > > accuracy, n/w usage, simplification and user should be able to choose
> one
> > > based on the requirement. However, we don't have correct and extensible
> > > Publish rate limiter interface right now, and before making it
> pluggable we
> > > have to make sure that it should support any type of implementation for
> > > example: token based or sliding-window based throttling, support of
> various
> > > decaying functions (eg: exponential decay:
> > > https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen
> > > such
> > > interface details and design in the PIP:
> > > https://github.com/apache/pulsar/pull/21399/. So, I would encourage to
> > > work
> > > towards building pluggable rate-limiter but current PIP is not ready
> as it
> > > doesn't cover such generic interfaces that can support different types
> of
> > > implementation.
> > >
> > > Thanks,
> > > Rajan
> > >
> > > On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari <lh...@apache.org>
> wrote:
> > >
> > > > Hi Girish,
> > > >
> > > > Replies inline.
> > > >
> > > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma <sc...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hello Lari, replies inline.
> > > > >
> > > > > I will also be going through some textbook rate limiters (the one
> you
> > > > shared, plus others) and propose the one that at least suits our
> needs in
> > > > the next reply.
> > > >
> > > >
> > > > sounds good. I've been also trying to find more rate limiter
> resources
> > > > that could be useful for our design.
> > > >
> > > > Bucket4J documentation gives some good ideas and it's shows how the
> > > > token bucket algorithm could be varied. For example, the "Refill
> > > > styles" section [1] is useful to read as an inspiration.
> > > > In network routers, there's a concept of "dual token bucket"
> > > > algorithms and by googling you can find both Cisco and Juniper
> > > > documentation referencing this.
> > > > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
> > > >
> > > > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> > > > 2 -
> https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
> > > >
> > > > >>
> > > > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> > > > >> meeting notes can be found at
> > > > >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> > > > >>
> > > > >
> > > > > Would it make sense for me to join this time given that you are
> > > skipping
> > > > it?
> > > >
> > > > Yes, it's worth joining regularly when one is participating in Pulsar
> > > > core development. There's usually a chance to discuss all topics that
> > > > Pulsar community members bring up to discussion. A few times there
> > > > haven't been any participants and in that case, it's good to ask on
> > > > the #dev channel on Pulsar Slack whether others are joining the
> > > > meeting.
> > > >
> > > > >>
> > > > >> ok. btw. "metrics" doesn't necessarily mean providing the rate
> limiter
> > > > >> metrics via Prometheus. There might be other ways to provide this
> > > > >> information for components that could react to this.
> > > > >> For example, it could a be system topic where these rate limiters
> emit
> > > > events.
> > > > >>
> > > > >
> > > > > Are there any other system topics than
> > > > `tenent/namespace/__change_events` . While it's an improvement over
> > > > querying metrics, it would still mean one consumer per namespace and
> > > would
> > > > form a cyclic dependency - for example, in case a broker is
> degrading due
> > > > to mis-use of bursting, it might lead to delays in the consumption
> of the
> > > > event from the __change_events topic.
> > > >
> > > > The number of events is a fairly low volume so the possible
> > > > degradation wouldn't necessarily impact the event communications for
> > > > this purpose. I'm not sure if there are currently any public event
> > > > topics in Pulsar that are supported in a way that an application
> could
> > > > directly read from the topic.  However since this is an advanced
> case,
> > > > I would assume that we don't need to focus on this in the first phase
> > > > of improving rate limiters.
> > > >
> > > > >
> > > > >
> > > > >> I agree. I just brought up this example to ensure that your
> > > > >> expectation about bursting isn't about controlling the rate limits
> > > > >> based on situational information, such as end-to-end latency
> > > > >> information.
> > > > >> Such a feature could be useful, but it does complicate things.
> > > > >> However, I think it's good to keep this on the radar since this
> might
> > > > >> be needed to solve some advanced use cases.
> > > > >>
> > > > >
> > > > > I still envision auto-scaling to be admin API driven rather than
> > > produce
> > > > throughput driven. That way, it remains deterministic in nature. But
> it
> > > > probably doesn't make sense to even talk about it until (partition)
> > > > scale-down is possible.
> > > >
> > > >
> > > > I agree. It's better to exclude this from the discussion at the
> moment
> > > > so that it doesn't cause confusion. Modifying the rate limits up and
> > > > down automatically with some component could be considered out of
> > > > scope in the first phase. However, that might be controversial since
> > > > the externally observable behavior of rate limiting with bursting
> > > > support seems to be behave in a way where the rate limit changes
> > > > automatically. The "auto scaling" aspect of rate limiting, modifying
> > > > the rate limits up and down, might be necessary as part of a rate
> > > > limiter implementation eventually. More of that later.
> > > >
> > > > >
> > > > > In all of the 3 cases that I listed, the current behavior, with
> precise
> > > > rate limiting enabled, is to pause the netty channel in case the
> > > throughput
> > > > breaches the set limits. This eventually leads to timeout at the
> client
> > > > side in case the burst is significantly greater than the configured
> > > timeout
> > > > on the producer side.
> > > >
> > > >
> > > > Makes sense.
> > > >
> > > > >
> > > > > The desired behavior in all three situations is to have a
> multiplier
> > > > based bursting capability for a fixed duration. For example, it
> could be
> > > > that a pulsar topic would be able to support 1.5x of the set quota
> for a
> > > > burst duration of up to 5 minutes. There also needs to be a cooldown
> > > period
> > > > in such a case that it would only accept one such burst every X
> minutes,
> > > > say every 1 hour.
> > > >
> > > >
> > > > Thanks for sharing this concrete example. It will help a lot when
> > > > starting to design a solution which achieves this. I think that a
> > > > token bucket algorithm based design can achieve something very close
> > > > to what you are describing. In the first part I shared the references
> > > > to Bucket4J's "refill styles" and the concept of "dual token bucket"
> > > > algorithms.
> > > > One possible solution is a dual token bucket where the first bucket
> > > > takes care of the burst of up to 5 minutes. The second token bucket
> > > > can cover the maximum rate of 1.5x. The cooldown period could be
> > > > handled in a way where the token bucket fills up the bucket once in
> > > > every 60 minutes. There could be implementation challenges related to
> > > > adjusting the enforced rate up and down since as you had also
> > > > observed, the token bucket algorithm will let all buffered tokens be
> > > > used.
> > > >
> > > > >
> > > > >
> > > > >>
> > > > >>
> > > > >> >    - In a visitor based produce rate (where produce rate goes
> up in
> > > > the day
> > > > >> >    and goes down in the night, think in terms of popular website
> > > > hourly view
> > > > >> >    counts pattern) , there are cases when, due to certain
> > > > external/internal
> > > > >> >    triggers, the views - and thus - the produce rate spikes for
> a
> > > few
> > > > minutes.
> > > > >>
> > > > >> Again, please explain the current behavior and desired behavior.
> > > > >> Explicit example values of number of messages, bandwidth, etc.
> would
> > > > >> also be helpful details.
> > > > >
> > > > >
> > > > > Adding to what I wrote above, think of this pattern like the
> following:
> > > > the produce rate slowly increases from ~2MBps at around 4 AM to a
> known
> > > > peak of about 30MBps by 4 PM and stays around that peak until 9 PM
> after
> > > > which is again starts decreasing until it reaches ~2MBps around 2 AM.
> > > > > Now, due to some external triggers, maybe a scheduled sale event,
> at
> > > > 10PM, the quota may spike up to 40MBps for 4-5 minutes and then
> again go
> > > > back down to the usual ~20MBps . Here is a rough image showcasing the
> > > trend.
> > > > >
> > > > >
> > > > >>
> > > > >>
> > > > >> >    It is also important to keep this in check so as to not allow
> > > bots
> > > > to do
> > > > >> >    DDOS into your system, while that might be a responsibility
> of an
> > > > upstream
> > > > >> >    system like API gateway, but we cannot be ignorant about that
> > > > completely.
> > > > >>
> > > > >> what would be the desired behavior?
> > > > >
> > > > >
> > > > > The desired behavior is that the burst support should be short
> lived
> > > > (5-10 minutes) and limited to a fixed number of bursts in a duration
> > > (say -
> > > > 1 burst per hour). Obviously, all of these should be configurable,
> maybe
> > > at
> > > > a broker level and not a topic level.
> > > >
> > > >
> > > > That can be handled with a dual token bucket algorithm based rate
> > > limiter.
> > > >
> > > > >
> > > > > I have yet to go in depth on both the textbook rate limiter
> models, but
> > > > if token bucket works on buffered tokens, it may not be much useful.
> > > > Consider a situation where the throughput is always at its absolute
> > > > configured peak, thus not reserving tokens at all. Again - this is
> the
> > > > understanding based on the above text. I am yet to go through the
> wiki
> > > page
> > > > with full details. In any case, I am very sure that an absolute
> maximum
> > > > bursting limit is needed per topic, since at the end of the day, we
> are
> > > > limited by hardware resources like network bandwidth, disk bandwidth
> etc.
> > > >
> > > >
> > > > Yes, if the actual throughput is always at the rate limiter's maximum
> > > > average rate, the tokens wouldn't be reserved at all in the default
> > > > token bucket algorithm. Is that a problem?
> > > > It might not be a problem at all. Spare tokens would be buffered
> > > > whenever the throughput is lower and therefore bursting could help
> > > > catch up to the average rate over a longer period of time.
> > > >
> > > > Btw. The challenge of no bursting when the actual throughput is
> > > > already at the peak rate is the original reason why I introduced the
> > > > concept of automatically adjusting the rate limit up and down based
> on
> > > > end-to-end latency information. If the rate limiter had a feedback
> > > > loop, it could know whether it needs to do bursting to prevent the
> > > > latencies from increasing. That was one of the reasons why I was
> > > > thinking about the automatic adjustments of the rate limit. However,
> > > > it's better to keep that out of the first phase for now.
> > > >
> > > > I believe that we will find many ways to apply the token bucket
> > > > algorithm to meet our requirements. A lot of flexibility can be added
> > > > with dual token buckets and by introducing new ways of adding/filling
> > > > tokens to the buckets. It's also possible to tweak a lot of small
> > > > details if that is necessary, such as smoothening out burst rates and
> > > > so on. However, it's better to keep things simple when possible.
> > > >
> > > >
> > > > > Suppose the SOP for bursting support is to support a burst of upto
> 1.5x
> > > > produce rate for upto 10 minutes, once in a window of 1 hour.
> > > > >
> > > > > Now if a particular producer is trying to burst to or beyond 1.5x
> for
> > > > more than 5 minutes - while we are not accepting the requests and the
> > > > producer client is continuously going into timeouts, there should be
> a
> > > way
> > > > to completely blacklist that producer so that even the client stops
> > > > attempting to produce and we unload the topics all together - thus,
> > > > releasing open connections and broker compute.
> > > >
> > > >
> > > > It seems that this part of the requirements is more challenging to
> > > > cover. We don't have to give up on trying to make progress on this
> > > > side of your requirements. It could be helpful to do thought
> > > > experiments on how some of the conditions could be detected. For
> > > > example, how would you detect if a particular producer is going
> beyond
> > > > the configured policy? It does seem intriguing to start adding the
> > > > support to the rate limiting area. The rate limiter has the
> > > > information about when bursting is actively happening. I would favor
> a
> > > > solution where the rate limiter could only emit events and not make
> > > > decisions about blocking a producer. However, I'm open to changing my
> > > > opinion on that detail of the design if it turns out to be a common
> > > > requirement, and it fits the overall architecture of the broker.
> > > >
> > > > The information about timeouts is on the client side. Perhaps there
> > > > are ways to pass the timeout configuration information along with the
> > > > producer creation, for example. The Pulsar binary protocol could have
> > > > some sort of clock sync so that client and broker clock difference
> > > > could be calculated with sufficient accuracy. However, the broker
> > > > cannot know when messages timeout that aren't at the broker at all.
> If
> > > > the end-to-end latency of sent messages start approaching the timeout
> > > > value, it's a good indication on the broker side that the client side
> > > > is approaching the state where messages will start timing out.
> > > >
> > > > In general, I think we should keep these requirements covered in the
> > > > design so that the rate limiter implementation could be extended to
> > > > support the use case in a way or another in the later phases.
> > > >
> > > >
> > > > >> > It's only async if the producers use `sendAsync` . Moreover,
> even if
> > > > it is
> > > > >> > async in nature, practically it ends up being quite linear and
> > > > >> > well-homogenised. I am speaking from experience of running
> 1000s of
> > > > >> > partitioned topics in production
> > > > >>
> > > > >> I'd assume that most high traffic use cases would use asynchronous
> > > > >> sending in a way or another. If you'd be sending synchronously, it
> > > > >> would be extremely inefficient if messages are sent one-by-one.
> The
> > > > >> asyncronouscity could also come from multiple threads in an
> > > > >> application.
> > > > >>
> > > > >> You are right that a well homogenised workload reduces the
> severity of
> > > > >> the multiplexing problem. That's also why the problem of
> multiplexing
> > > > >> hasn't been fixed since the problem isn't visibile and it's also
> very
> > > > >> hard to observe.
> > > > >> In a case where 2 producers with very different rate limiting
> options
> > > > >> share a connection, it definetely is a problem in the current
> > > > >
> > > > >
> > > > > Yes actually, I was talking to the team and we did observe a case
> where
> > > > there was a client app with 2 producer objects writing to two
> different
> > > > topics. When one of their topics was breaching quota, the other one
> was
> > > > also observing rate limiting even though it was under quota. So
> > > eventually,
> > > > this surely needs to be solved. One short term solution is to at
> least
> > > not
> > > > share connections across producer objects. But I still feel it's out
> of
> > > > scope of this discussion here.
> > > >
> > > > Thanks for sharing. The real stories from operating Pulsar at scale
> > > > are always interesting to hear!
> > > >
> > > > > What do you think about the quick solution of not sharing
> connections
> > > > across producer objects? I can raise a github issue explaining the
> > > > situation.
> > > >
> > > > It should be very straightforward to add a flag to producer and
> > > > consumer builder to use an isolated (non-shared) broker connection.
> > > > It's not optimal to add such flags, but I haven't heard of other ways
> > > > to solve this challenge where the Pulsar binary protocol doesn't have
> > > > to be modified or the backpressure solution revisited in the broker.
> > > > The proper solution requires doing both, adding a permit-based flow
> > > > control for producers and revisiting the broker side
> backpressure/flow
> > > > control and therefore it won't happen quickly.
> > > > Please go ahead and create a GH issue and share your context. That
> > > > will be very helpful.
> > > >
> > > > Some thoughts about proceeding to making things a reality so that it
> > > > gets implemented.
> > > >
> > > > One important part of development will be how we could test or
> > > > simulate a scenario and validate that the rate limiting algorithm
> > > > makes sense.
> > > > On the lowest level of unit tests, it will be helpful to have a
> > > > solution where the clock source used in unit test can be run quickly
> > > > so that simulated scenarios don't take a long time to run, but could
> > > > cover the essential features.
> > > >
> > > > It might be hard to get started on the rate limiter implementation by
> > > > looking at the existing code. The reason of the double methods in the
> > > > existing interface is due to it covering 2 completely different ways
> > > > of handling the rate limiting and trying to handle that in a single
> > > > concept. What is actually desired for at least the producing rate
> > > > limiting is that it's an asynchronous rate limiting where any
> messages
> > > > that have already arrived to the broker will be handled. The token
> > > > count could go to a negative value when implementing this in the
> token
> > > > bucket algorithm. If it goes below 0, the producers for the topic
> > > > should be backpressured by toggling the auto-read state and it should
> > > > schedule a job to resume the auto-reading after there are at least a
> > > > configurable amount of tokens available. There should be no need to
> > > > use the scheduler to add tokens to the bucket. Whenever tokens are
> > > > used, the new token count can be calculated based on the time since
> > > > tokens were last updated, by default the average max rate defines how
> > > > many tokens are added. The token count cannot increase larger than
> the
> > > > token bucket capacity. This is how simple a plain token bucket
> > > > algorithm is. That could be a starting point until we start covering
> > > > the advanced cases which require a dual token bucket algorithm.
> > > > If someone has the bandwidth, it would be fine to start experimenting
> > > > in this area with code to learn more.
> > > >
> > > > After trying something out, it's easier to be more confident about
> the
> > > > design of dual token bucket algorithm. That's one of the assumptions
> > > > that I'm making that dual token bucket would be sufficient.
> > > >
> > > > This has been a great discussion so far and it feels like we are
> > > > getting to the necessary level of detail.
> > > > Thanks for sharing many details of your requirements and what you
> have
> > > > experienced and learned when operating Pulsar at scale.
> > > > I feel excited about the upcoming improvements for rate limiting in
> > > > Pulsar. From my side I have limited availability for participating in
> > > > this work for the next 2-3 weeks. After that, I'll be ready to help
> in
> > > > making the improved rate limiting a reality.
> > > >
> > > > -Lari
> > > >
> > >
> >
> >
> > --
> > Girish Sharma
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

> On Asaf's comment on too many public interfaces in Pulsar and no other
> Apache software having so many public interfaces - I would like to ask, has
> that brought in any con though? For this particular use case, I feel like
> having it has a public interface would actually improve the code quality
> and design as the usage would be checked and changes would go through
> scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> Asaf - what are your thoughts on this? Are you okay with making the
> PublishRateLimiter pluggable with a better interface?

As a reply to this question, I'd like to highlight my previous email
in this thread where I responded to Rajan.
The current state of rate limiting is not acceptable in Pulsar. We
need to fix things in the core.

One of the downsides of adding a lot of pluggability and variation is
the additional fragmentation and complexity that it adds to Pulsar. It
doesn't really make sense if you need to install an external library
to make the rate limiting feature usable in Pulsar. Rate limiting is
only needed when operating Pulsar at scale. The core feature must
support this to make any sense.

The side product of this could be the pluggable interface if the core
feature cannot be extended to cover the requirements that Pulsar users
really have.
Girish, it has been extremely valuable that you have shared concrete
examples of your rate limiting related requirements. I'm confident
that we will be able to cover the majority of your requirements in
Pulsar core. It might be the case that when you try out the new
features, it might actually cover your needs sufficiently.

Let's keep the focus on improving the rate limiting in Pulsar core.
The possible pluggable interface could follow.


-Lari

On Wed, 8 Nov 2023 at 10:46, Girish Sharma <sc...@gmail.com> wrote:
>
> Hello Rajan,
> I haven't updated the PIP with a better interface for PublishRateLimiter
> yet as the discussion here in this thread went in a different direction.
>
> Personally, I agree with you that even if we choose one algorithm and
> improve the built-in rate limiter, it still may not suit all use cases as
> you have mentioned.
>
> On Asaf's comment on too many public interfaces in Pulsar and no other
> Apache software having so many public interfaces - I would like to ask, has
> that brought in any con though? For this particular use case, I feel like
> having it has a public interface would actually improve the code quality
> and design as the usage would be checked and changes would go through
> scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
> Asaf - what are your thoughts on this? Are you okay with making the
> PublishRateLimiter pluggable with a better interface?
>
>
>
>
>
> On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia <rd...@apache.org> wrote:
>
> > Hi Lari/Girish,
> >
> > I am sorry for jumping late in the discussion but I would like to
> > acknowledge the requirement of pluggable publish rate-limiter and I had
> > also asked it during implementation of publish rate limiter as well. There
> > are trade-offs between different rate-limiter implementations based on
> > accuracy, n/w usage, simplification and user should be able to choose one
> > based on the requirement. However, we don't have correct and extensible
> > Publish rate limiter interface right now, and before making it pluggable we
> > have to make sure that it should support any type of implementation for
> > example: token based or sliding-window based throttling, support of various
> > decaying functions (eg: exponential decay:
> > https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen
> > such
> > interface details and design in the PIP:
> > https://github.com/apache/pulsar/pull/21399/. So, I would encourage to
> > work
> > towards building pluggable rate-limiter but current PIP is not ready as it
> > doesn't cover such generic interfaces that can support different types of
> > implementation.
> >
> > Thanks,
> > Rajan
> >
> > On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari <lh...@apache.org> wrote:
> >
> > > Hi Girish,
> > >
> > > Replies inline.
> > >
> > > On Tue, 7 Nov 2023 at 15:26, Girish Sharma <sc...@gmail.com>
> > > wrote:
> > > >
> > > > Hello Lari, replies inline.
> > > >
> > > > I will also be going through some textbook rate limiters (the one you
> > > shared, plus others) and propose the one that at least suits our needs in
> > > the next reply.
> > >
> > >
> > > sounds good. I've been also trying to find more rate limiter resources
> > > that could be useful for our design.
> > >
> > > Bucket4J documentation gives some good ideas and it's shows how the
> > > token bucket algorithm could be varied. For example, the "Refill
> > > styles" section [1] is useful to read as an inspiration.
> > > In network routers, there's a concept of "dual token bucket"
> > > algorithms and by googling you can find both Cisco and Juniper
> > > documentation referencing this.
> > > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
> > >
> > > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> > > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
> > >
> > > >>
> > > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> > > >> meeting notes can be found at
> > > >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> > > >>
> > > >
> > > > Would it make sense for me to join this time given that you are
> > skipping
> > > it?
> > >
> > > Yes, it's worth joining regularly when one is participating in Pulsar
> > > core development. There's usually a chance to discuss all topics that
> > > Pulsar community members bring up to discussion. A few times there
> > > haven't been any participants and in that case, it's good to ask on
> > > the #dev channel on Pulsar Slack whether others are joining the
> > > meeting.
> > >
> > > >>
> > > >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> > > >> metrics via Prometheus. There might be other ways to provide this
> > > >> information for components that could react to this.
> > > >> For example, it could a be system topic where these rate limiters emit
> > > events.
> > > >>
> > > >
> > > > Are there any other system topics than
> > > `tenent/namespace/__change_events` . While it's an improvement over
> > > querying metrics, it would still mean one consumer per namespace and
> > would
> > > form a cyclic dependency - for example, in case a broker is degrading due
> > > to mis-use of bursting, it might lead to delays in the consumption of the
> > > event from the __change_events topic.
> > >
> > > The number of events is a fairly low volume so the possible
> > > degradation wouldn't necessarily impact the event communications for
> > > this purpose. I'm not sure if there are currently any public event
> > > topics in Pulsar that are supported in a way that an application could
> > > directly read from the topic.  However since this is an advanced case,
> > > I would assume that we don't need to focus on this in the first phase
> > > of improving rate limiters.
> > >
> > > >
> > > >
> > > >> I agree. I just brought up this example to ensure that your
> > > >> expectation about bursting isn't about controlling the rate limits
> > > >> based on situational information, such as end-to-end latency
> > > >> information.
> > > >> Such a feature could be useful, but it does complicate things.
> > > >> However, I think it's good to keep this on the radar since this might
> > > >> be needed to solve some advanced use cases.
> > > >>
> > > >
> > > > I still envision auto-scaling to be admin API driven rather than
> > produce
> > > throughput driven. That way, it remains deterministic in nature. But it
> > > probably doesn't make sense to even talk about it until (partition)
> > > scale-down is possible.
> > >
> > >
> > > I agree. It's better to exclude this from the discussion at the moment
> > > so that it doesn't cause confusion. Modifying the rate limits up and
> > > down automatically with some component could be considered out of
> > > scope in the first phase. However, that might be controversial since
> > > the externally observable behavior of rate limiting with bursting
> > > support seems to be behave in a way where the rate limit changes
> > > automatically. The "auto scaling" aspect of rate limiting, modifying
> > > the rate limits up and down, might be necessary as part of a rate
> > > limiter implementation eventually. More of that later.
> > >
> > > >
> > > > In all of the 3 cases that I listed, the current behavior, with precise
> > > rate limiting enabled, is to pause the netty channel in case the
> > throughput
> > > breaches the set limits. This eventually leads to timeout at the client
> > > side in case the burst is significantly greater than the configured
> > timeout
> > > on the producer side.
> > >
> > >
> > > Makes sense.
> > >
> > > >
> > > > The desired behavior in all three situations is to have a multiplier
> > > based bursting capability for a fixed duration. For example, it could be
> > > that a pulsar topic would be able to support 1.5x of the set quota for a
> > > burst duration of up to 5 minutes. There also needs to be a cooldown
> > period
> > > in such a case that it would only accept one such burst every X minutes,
> > > say every 1 hour.
> > >
> > >
> > > Thanks for sharing this concrete example. It will help a lot when
> > > starting to design a solution which achieves this. I think that a
> > > token bucket algorithm based design can achieve something very close
> > > to what you are describing. In the first part I shared the references
> > > to Bucket4J's "refill styles" and the concept of "dual token bucket"
> > > algorithms.
> > > One possible solution is a dual token bucket where the first bucket
> > > takes care of the burst of up to 5 minutes. The second token bucket
> > > can cover the maximum rate of 1.5x. The cooldown period could be
> > > handled in a way where the token bucket fills up the bucket once in
> > > every 60 minutes. There could be implementation challenges related to
> > > adjusting the enforced rate up and down since as you had also
> > > observed, the token bucket algorithm will let all buffered tokens be
> > > used.
> > >
> > > >
> > > >
> > > >>
> > > >>
> > > >> >    - In a visitor based produce rate (where produce rate goes up in
> > > the day
> > > >> >    and goes down in the night, think in terms of popular website
> > > hourly view
> > > >> >    counts pattern) , there are cases when, due to certain
> > > external/internal
> > > >> >    triggers, the views - and thus - the produce rate spikes for a
> > few
> > > minutes.
> > > >>
> > > >> Again, please explain the current behavior and desired behavior.
> > > >> Explicit example values of number of messages, bandwidth, etc. would
> > > >> also be helpful details.
> > > >
> > > >
> > > > Adding to what I wrote above, think of this pattern like the following:
> > > the produce rate slowly increases from ~2MBps at around 4 AM to a known
> > > peak of about 30MBps by 4 PM and stays around that peak until 9 PM after
> > > which is again starts decreasing until it reaches ~2MBps around 2 AM.
> > > > Now, due to some external triggers, maybe a scheduled sale event, at
> > > 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go
> > > back down to the usual ~20MBps . Here is a rough image showcasing the
> > trend.
> > > >
> > > >
> > > >>
> > > >>
> > > >> >    It is also important to keep this in check so as to not allow
> > bots
> > > to do
> > > >> >    DDOS into your system, while that might be a responsibility of an
> > > upstream
> > > >> >    system like API gateway, but we cannot be ignorant about that
> > > completely.
> > > >>
> > > >> what would be the desired behavior?
> > > >
> > > >
> > > > The desired behavior is that the burst support should be short lived
> > > (5-10 minutes) and limited to a fixed number of bursts in a duration
> > (say -
> > > 1 burst per hour). Obviously, all of these should be configurable, maybe
> > at
> > > a broker level and not a topic level.
> > >
> > >
> > > That can be handled with a dual token bucket algorithm based rate
> > limiter.
> > >
> > > >
> > > > I have yet to go in depth on both the textbook rate limiter models, but
> > > if token bucket works on buffered tokens, it may not be much useful.
> > > Consider a situation where the throughput is always at its absolute
> > > configured peak, thus not reserving tokens at all. Again - this is the
> > > understanding based on the above text. I am yet to go through the wiki
> > page
> > > with full details. In any case, I am very sure that an absolute maximum
> > > bursting limit is needed per topic, since at the end of the day, we are
> > > limited by hardware resources like network bandwidth, disk bandwidth etc.
> > >
> > >
> > > Yes, if the actual throughput is always at the rate limiter's maximum
> > > average rate, the tokens wouldn't be reserved at all in the default
> > > token bucket algorithm. Is that a problem?
> > > It might not be a problem at all. Spare tokens would be buffered
> > > whenever the throughput is lower and therefore bursting could help
> > > catch up to the average rate over a longer period of time.
> > >
> > > Btw. The challenge of no bursting when the actual throughput is
> > > already at the peak rate is the original reason why I introduced the
> > > concept of automatically adjusting the rate limit up and down based on
> > > end-to-end latency information. If the rate limiter had a feedback
> > > loop, it could know whether it needs to do bursting to prevent the
> > > latencies from increasing. That was one of the reasons why I was
> > > thinking about the automatic adjustments of the rate limit. However,
> > > it's better to keep that out of the first phase for now.
> > >
> > > I believe that we will find many ways to apply the token bucket
> > > algorithm to meet our requirements. A lot of flexibility can be added
> > > with dual token buckets and by introducing new ways of adding/filling
> > > tokens to the buckets. It's also possible to tweak a lot of small
> > > details if that is necessary, such as smoothening out burst rates and
> > > so on. However, it's better to keep things simple when possible.
> > >
> > >
> > > > Suppose the SOP for bursting support is to support a burst of upto 1.5x
> > > produce rate for upto 10 minutes, once in a window of 1 hour.
> > > >
> > > > Now if a particular producer is trying to burst to or beyond 1.5x for
> > > more than 5 minutes - while we are not accepting the requests and the
> > > producer client is continuously going into timeouts, there should be a
> > way
> > > to completely blacklist that producer so that even the client stops
> > > attempting to produce and we unload the topics all together - thus,
> > > releasing open connections and broker compute.
> > >
> > >
> > > It seems that this part of the requirements is more challenging to
> > > cover. We don't have to give up on trying to make progress on this
> > > side of your requirements. It could be helpful to do thought
> > > experiments on how some of the conditions could be detected. For
> > > example, how would you detect if a particular producer is going beyond
> > > the configured policy? It does seem intriguing to start adding the
> > > support to the rate limiting area. The rate limiter has the
> > > information about when bursting is actively happening. I would favor a
> > > solution where the rate limiter could only emit events and not make
> > > decisions about blocking a producer. However, I'm open to changing my
> > > opinion on that detail of the design if it turns out to be a common
> > > requirement, and it fits the overall architecture of the broker.
> > >
> > > The information about timeouts is on the client side. Perhaps there
> > > are ways to pass the timeout configuration information along with the
> > > producer creation, for example. The Pulsar binary protocol could have
> > > some sort of clock sync so that client and broker clock difference
> > > could be calculated with sufficient accuracy. However, the broker
> > > cannot know when messages timeout that aren't at the broker at all. If
> > > the end-to-end latency of sent messages start approaching the timeout
> > > value, it's a good indication on the broker side that the client side
> > > is approaching the state where messages will start timing out.
> > >
> > > In general, I think we should keep these requirements covered in the
> > > design so that the rate limiter implementation could be extended to
> > > support the use case in a way or another in the later phases.
> > >
> > >
> > > >> > It's only async if the producers use `sendAsync` . Moreover, even if
> > > it is
> > > >> > async in nature, practically it ends up being quite linear and
> > > >> > well-homogenised. I am speaking from experience of running 1000s of
> > > >> > partitioned topics in production
> > > >>
> > > >> I'd assume that most high traffic use cases would use asynchronous
> > > >> sending in a way or another. If you'd be sending synchronously, it
> > > >> would be extremely inefficient if messages are sent one-by-one. The
> > > >> asyncronouscity could also come from multiple threads in an
> > > >> application.
> > > >>
> > > >> You are right that a well homogenised workload reduces the severity of
> > > >> the multiplexing problem. That's also why the problem of multiplexing
> > > >> hasn't been fixed since the problem isn't visibile and it's also very
> > > >> hard to observe.
> > > >> In a case where 2 producers with very different rate limiting options
> > > >> share a connection, it definetely is a problem in the current
> > > >
> > > >
> > > > Yes actually, I was talking to the team and we did observe a case where
> > > there was a client app with 2 producer objects writing to two different
> > > topics. When one of their topics was breaching quota, the other one was
> > > also observing rate limiting even though it was under quota. So
> > eventually,
> > > this surely needs to be solved. One short term solution is to at least
> > not
> > > share connections across producer objects. But I still feel it's out of
> > > scope of this discussion here.
> > >
> > > Thanks for sharing. The real stories from operating Pulsar at scale
> > > are always interesting to hear!
> > >
> > > > What do you think about the quick solution of not sharing connections
> > > across producer objects? I can raise a github issue explaining the
> > > situation.
> > >
> > > It should be very straightforward to add a flag to producer and
> > > consumer builder to use an isolated (non-shared) broker connection.
> > > It's not optimal to add such flags, but I haven't heard of other ways
> > > to solve this challenge where the Pulsar binary protocol doesn't have
> > > to be modified or the backpressure solution revisited in the broker.
> > > The proper solution requires doing both, adding a permit-based flow
> > > control for producers and revisiting the broker side backpressure/flow
> > > control and therefore it won't happen quickly.
> > > Please go ahead and create a GH issue and share your context. That
> > > will be very helpful.
> > >
> > > Some thoughts about proceeding to making things a reality so that it
> > > gets implemented.
> > >
> > > One important part of development will be how we could test or
> > > simulate a scenario and validate that the rate limiting algorithm
> > > makes sense.
> > > On the lowest level of unit tests, it will be helpful to have a
> > > solution where the clock source used in unit test can be run quickly
> > > so that simulated scenarios don't take a long time to run, but could
> > > cover the essential features.
> > >
> > > It might be hard to get started on the rate limiter implementation by
> > > looking at the existing code. The reason of the double methods in the
> > > existing interface is due to it covering 2 completely different ways
> > > of handling the rate limiting and trying to handle that in a single
> > > concept. What is actually desired for at least the producing rate
> > > limiting is that it's an asynchronous rate limiting where any messages
> > > that have already arrived to the broker will be handled. The token
> > > count could go to a negative value when implementing this in the token
> > > bucket algorithm. If it goes below 0, the producers for the topic
> > > should be backpressured by toggling the auto-read state and it should
> > > schedule a job to resume the auto-reading after there are at least a
> > > configurable amount of tokens available. There should be no need to
> > > use the scheduler to add tokens to the bucket. Whenever tokens are
> > > used, the new token count can be calculated based on the time since
> > > tokens were last updated, by default the average max rate defines how
> > > many tokens are added. The token count cannot increase larger than the
> > > token bucket capacity. This is how simple a plain token bucket
> > > algorithm is. That could be a starting point until we start covering
> > > the advanced cases which require a dual token bucket algorithm.
> > > If someone has the bandwidth, it would be fine to start experimenting
> > > in this area with code to learn more.
> > >
> > > After trying something out, it's easier to be more confident about the
> > > design of dual token bucket algorithm. That's one of the assumptions
> > > that I'm making that dual token bucket would be sufficient.
> > >
> > > This has been a great discussion so far and it feels like we are
> > > getting to the necessary level of detail.
> > > Thanks for sharing many details of your requirements and what you have
> > > experienced and learned when operating Pulsar at scale.
> > > I feel excited about the upcoming improvements for rate limiting in
> > > Pulsar. From my side I have limited availability for participating in
> > > this work for the next 2-3 weeks. After that, I'll be ready to help in
> > > making the improved rate limiting a reality.
> > >
> > > -Lari
> > >
> >
>
>
> --
> Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Rajan,
I haven't updated the PIP with a better interface for PublishRateLimiter
yet as the discussion here in this thread went in a different direction.

Personally, I agree with you that even if we choose one algorithm and
improve the built-in rate limiter, it still may not suit all use cases as
you have mentioned.

On Asaf's comment on too many public interfaces in Pulsar and no other
Apache software having so many public interfaces - I would like to ask, has
that brought in any real downside, though? For this particular use case, I feel like
having it as a public interface would actually improve the code quality
and design as the usage would be checked and changes would go through
scrutiny (unlike how the current PublishRateLimiter evolved unchecked).
Asaf - what are your thoughts on this? Are you okay with making the
PublishRateLimiter pluggable with a better interface?





On Wed, Nov 8, 2023 at 5:43 AM Rajan Dhabalia <rd...@apache.org> wrote:

> Hi Lari/Girish,
>
> I am sorry for jumping late in the discussion but I would like to
> acknowledge the requirement of pluggable publish rate-limiter and I had
> also asked it during implementation of publish rate limiter as well. There
> are trade-offs between different rate-limiter implementations based on
> accuracy, n/w usage, simplification and user should be able to choose one
> based on the requirement. However, we don't have correct and extensible
> Publish rate limiter interface right now, and before making it pluggable we
> have to make sure that it should support any type of implementation for
> example: token based or sliding-window based throttling, support of various
> decaying functions (eg: exponential decay:
> https://en.wikipedia.org/wiki/Exponential_decay), etc.. I haven't seen
> such
> interface details and design in the PIP:
> https://github.com/apache/pulsar/pull/21399/. So, I would encourage to
> work
> towards building pluggable rate-limiter but current PIP is not ready as it
> doesn't cover such generic interfaces that can support different types of
> implementation.
>
> Thanks,
> Rajan
>
> On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari <lh...@apache.org> wrote:
>
> > Hi Girish,
> >
> > Replies inline.
> >
> > On Tue, 7 Nov 2023 at 15:26, Girish Sharma <sc...@gmail.com>
> > wrote:
> > >
> > > Hello Lari, replies inline.
> > >
> > > I will also be going through some textbook rate limiters (the one you
> > shared, plus others) and propose the one that at least suits our needs in
> > the next reply.
> >
> >
> > sounds good. I've been also trying to find more rate limiter resources
> > that could be useful for our design.
> >
> > Bucket4J documentation gives some good ideas and it's shows how the
> > token bucket algorithm could be varied. For example, the "Refill
> > styles" section [1] is useful to read as an inspiration.
> > In network routers, there's a concept of "dual token bucket"
> > algorithms and by googling you can find both Cisco and Juniper
> > documentation referencing this.
> > I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
> >
> > 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> > 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
> >
> > >>
> > >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> > >> meeting notes can be found at
> > >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> > >>
> > >
> > > Would it make sense for me to join this time given that you are
> skipping
> > it?
> >
> > Yes, it's worth joining regularly when one is participating in Pulsar
> > core development. There's usually a chance to discuss all topics that
> > Pulsar community members bring up to discussion. A few times there
> > haven't been any participants and in that case, it's good to ask on
> > the #dev channel on Pulsar Slack whether others are joining the
> > meeting.
> >
> > >>
> > >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> > >> metrics via Prometheus. There might be other ways to provide this
> > >> information for components that could react to this.
> > >> For example, it could a be system topic where these rate limiters emit
> > events.
> > >>
> > >
> > > Are there any other system topics than
> > `tenent/namespace/__change_events` . While it's an improvement over
> > querying metrics, it would still mean one consumer per namespace and
> would
> > form a cyclic dependency - for example, in case a broker is degrading due
> > to mis-use of bursting, it might lead to delays in the consumption of the
> > event from the __change_events topic.
> >
> > The number of events is a fairly low volume so the possible
> > degradation wouldn't necessarily impact the event communications for
> > this purpose. I'm not sure if there are currently any public event
> > topics in Pulsar that are supported in a way that an application could
> > directly read from the topic.  However since this is an advanced case,
> > I would assume that we don't need to focus on this in the first phase
> > of improving rate limiters.
> >
> > >
> > >
> > >> I agree. I just brought up this example to ensure that your
> > >> expectation about bursting isn't about controlling the rate limits
> > >> based on situational information, such as end-to-end latency
> > >> information.
> > >> Such a feature could be useful, but it does complicate things.
> > >> However, I think it's good to keep this on the radar since this might
> > >> be needed to solve some advanced use cases.
> > >>
> > >
> > > I still envision auto-scaling to be admin API driven rather than
> produce
> > throughput driven. That way, it remains deterministic in nature. But it
> > probably doesn't make sense to even talk about it until (partition)
> > scale-down is possible.
> >
> >
> > I agree. It's better to exclude this from the discussion at the moment
> > so that it doesn't cause confusion. Modifying the rate limits up and
> > down automatically with some component could be considered out of
> > scope in the first phase. However, that might be controversial since
> > the externally observable behavior of rate limiting with bursting
> > support seems to be behave in a way where the rate limit changes
> > automatically. The "auto scaling" aspect of rate limiting, modifying
> > the rate limits up and down, might be necessary as part of a rate
> > limiter implementation eventually. More of that later.
> >
> > >
> > > In all of the 3 cases that I listed, the current behavior, with precise
> > rate limiting enabled, is to pause the netty channel in case the
> throughput
> > breaches the set limits. This eventually leads to timeout at the client
> > side in case the burst is significantly greater than the configured
> timeout
> > on the producer side.
> >
> >
> > Makes sense.
> >
> > >
> > > The desired behavior in all three situations is to have a multiplier
> > based bursting capability for a fixed duration. For example, it could be
> > that a pulsar topic would be able to support 1.5x of the set quota for a
> > burst duration of up to 5 minutes. There also needs to be a cooldown
> period
> > in such a case that it would only accept one such burst every X minutes,
> > say every 1 hour.
> >
> >
> > Thanks for sharing this concrete example. It will help a lot when
> > starting to design a solution which achieves this. I think that a
> > token bucket algorithm based design can achieve something very close
> > to what you are describing. In the first part I shared the references
> > to Bucket4J's "refill styles" and the concept of "dual token bucket"
> > algorithms.
> > One possible solution is a dual token bucket where the first bucket
> > takes care of the burst of up to 5 minutes. The second token bucket
> > can cover the maximum rate of 1.5x. The cooldown period could be
> > handled in a way where the token bucket fills up the bucket once in
> > every 60 minutes. There could be implementation challenges related to
> > adjusting the enforced rate up and down since as you had also
> > observed, the token bucket algorithm will let all buffered tokens be
> > used.
> >
> > >
> > >
> > >>
> > >>
> > >> >    - In a visitor based produce rate (where produce rate goes up in
> > the day
> > >> >    and goes down in the night, think in terms of popular website
> > hourly view
> > >> >    counts pattern) , there are cases when, due to certain
> > external/internal
> > >> >    triggers, the views - and thus - the produce rate spikes for a
> few
> > minutes.
> > >>
> > >> Again, please explain the current behavior and desired behavior.
> > >> Explicit example values of number of messages, bandwidth, etc. would
> > >> also be helpful details.
> > >
> > >
> > > Adding to what I wrote above, think of this pattern like the following:
> > the produce rate slowly increases from ~2MBps at around 4 AM to a known
> > peak of about 30MBps by 4 PM and stays around that peak until 9 PM after
> > which is again starts decreasing until it reaches ~2MBps around 2 AM.
> > > Now, due to some external triggers, maybe a scheduled sale event, at
> > 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go
> > back down to the usual ~20MBps . Here is a rough image showcasing the
> trend.
> > >
> > >
> > >>
> > >>
> > >> >    It is also important to keep this in check so as to not allow
> bots
> > to do
> > >> >    DDOS into your system, while that might be a responsibility of an
> > upstream
> > >> >    system like API gateway, but we cannot be ignorant about that
> > completely.
> > >>
> > >> what would be the desired behavior?
> > >
> > >
> > > The desired behavior is that the burst support should be short lived
> > (5-10 minutes) and limited to a fixed number of bursts in a duration
> (say -
> > 1 burst per hour). Obviously, all of these should be configurable, maybe
> at
> > a broker level and not a topic level.
> >
> >
> > That can be handled with a dual token bucket algorithm based rate
> limiter.
> >
> > >
> > > I have yet to go in depth on both the textbook rate limiter models, but
> > if token bucket works on buffered tokens, it may not be much useful.
> > Consider a situation where the throughput is always at its absolute
> > configured peak, thus not reserving tokens at all. Again - this is the
> > understanding based on the above text. I am yet to go through the wiki
> page
> > with full details. In any case, I am very sure that an absolute maximum
> > bursting limit is needed per topic, since at the end of the day, we are
> > limited by hardware resources like network bandwidth, disk bandwidth etc.
> >
> >
> > Yes, if the actual throughput is always at the rate limiter's maximum
> > average rate, the tokens wouldn't be reserved at all in the default
> > token bucket algorithm. Is that a problem?
> > It might not be a problem at all. Spare tokens would be buffered
> > whenever the throughput is lower and therefore bursting could help
> > catch up to the average rate over a longer period of time.
> >
> > Btw. The challenge of no bursting when the actual throughput is
> > already at the peak rate is the original reason why I introduced the
> > concept of automatically adjusting the rate limit up and down based on
> > end-to-end latency information. If the rate limiter had a feedback
> > loop, it could know whether it needs to do bursting to prevent the
> > latencies from increasing. That was one of the reasons why I was
> > thinking about the automatic adjustments of the rate limit. However,
> > it's better to keep that out of the first phase for now.
> >
> > I believe that we will find many ways to apply the token bucket
> > algorithm to meet our requirements. A lot of flexibility can be added
> > with dual token buckets and by introducing new ways of adding/filling
> > tokens to the buckets. It's also possible to tweak a lot of small
> > details if that is necessary, such as smoothening out burst rates and
> > so on. However, it's better to keep things simple when possible.
> >
> >
> > > Suppose the SOP for bursting support is to support a burst of upto 1.5x
> > produce rate for upto 10 minutes, once in a window of 1 hour.
> > >
> > > Now if a particular producer is trying to burst to or beyond 1.5x for
> > more than 5 minutes - while we are not accepting the requests and the
> > producer client is continuously going into timeouts, there should be a
> way
> > to completely blacklist that producer so that even the client stops
> > attempting to produce and we unload the topics all together - thus,
> > releasing open connections and broker compute.
> >
> >
> > It seems that this part of the requirements is more challenging to
> > cover. We don't have to give up on trying to make progress on this
> > side of your requirements. It could be helpful to do thought
> > experiments on how some of the conditions could be detected. For
> > example, how would you detect if a particular producer is going beyond
> > the configured policy? It does seem intriguing to start adding the
> > support to the rate limiting area. The rate limiter has the
> > information about when bursting is actively happening. I would favor a
> > solution where the rate limiter could only emit events and not make
> > decisions about blocking a producer. However, I'm open to changing my
> > opinion on that detail of the design if it turns out to be a common
> > requirement, and it fits the overall architecture of the broker.
> >
> > The information about timeouts is on the client side. Perhaps there
> > are ways to pass the timeout configuration information along with the
> > producer creation, for example. The Pulsar binary protocol could have
> > some sort of clock sync so that client and broker clock difference
> > could be calculated with sufficient accuracy. However, the broker
> > cannot know when messages timeout that aren't at the broker at all. If
> > the end-to-end latency of sent messages start approaching the timeout
> > value, it's a good indication on the broker side that the client side
> > is approaching the state where messages will start timing out.
> >
> > In general, I think we should keep these requirements covered in the
> > design so that the rate limiter implementation could be extended to
> > support the use case in a way or another in the later phases.
> >
> >
> > >> > It's only async if the producers use `sendAsync` . Moreover, even if
> > it is
> > >> > async in nature, practically it ends up being quite linear and
> > >> > well-homogenised. I am speaking from experience of running 1000s of
> > >> > partitioned topics in production
> > >>
> > >> I'd assume that most high traffic use cases would use asynchronous
> > >> sending in a way or another. If you'd be sending synchronously, it
> > >> would be extremely inefficient if messages are sent one-by-one. The
> >> asynchronicity could also come from multiple threads in an
> > >> application.
> > >>
> > >> You are right that a well homogenised workload reduces the severity of
> > >> the multiplexing problem. That's also why the problem of multiplexing
> >> hasn't been fixed since the problem isn't visible and it's also very
> >> hard to observe.
> >> In a case where 2 producers with very different rate limiting options
> >> share a connection, it definitely is a problem in the current
> > >
> > >
> > > Yes actually, I was talking to the team and we did observe a case where
> > there was a client app with 2 producer objects writing to two different
> > topics. When one of their topics was breaching quota, the other one was
> > also observing rate limiting even though it was under quota. So
> eventually,
> > this surely needs to be solved. One short term solution is to at least
> not
> > share connections across producer objects. But I still feel it's out of
> > scope of this discussion here.
> >
> > Thanks for sharing. The real stories from operating Pulsar at scale
> > are always interesting to hear!
> >
> > > What do you think about the quick solution of not sharing connections
> > across producer objects? I can raise a github issue explaining the
> > situation.
> >
> > It should be very straightforward to add a flag to producer and
> > consumer builder to use an isolated (non-shared) broker connection.
> > It's not optimal to add such flags, but I haven't heard of other ways
> > to solve this challenge where the Pulsar binary protocol doesn't have
> > to be modified or the backpressure solution revisited in the broker.
> > The proper solution requires doing both, adding a permit-based flow
> > control for producers and revisiting the broker side backpressure/flow
> > control and therefore it won't happen quickly.
> > Please go ahead and create a GH issue and share your context. That
> > will be very helpful.
> >
> > Some thoughts about proceeding to making things a reality so that it
> > gets implemented.
> >
> > One important part of development will be how we could test or
> > simulate a scenario and validate that the rate limiting algorithm
> > makes sense.
> > On the lowest level of unit tests, it will be helpful to have a
> > solution where the clock source used in unit test can be run quickly
> > so that simulated scenarios don't take a long time to run, but could
> > cover the essential features.
> >
> > It might be hard to get started on the rate limiter implementation by
> > looking at the existing code. The reason of the double methods in the
> > existing interface is due to it covering 2 completely different ways
> > of handling the rate limiting and trying to handle that in a single
> > concept. What is actually desired for at least the producing rate
> > limiting is that it's an asynchronous rate limiting where any messages
> > that have already arrived to the broker will be handled. The token
> > count could go to a negative value when implementing this in the token
> > bucket algorithm. If it goes below 0, the producers for the topic
> > should be backpressured by toggling the auto-read state and it should
> > schedule a job to resume the auto-reading after there are at least a
> > configurable amount of tokens available. There should be no need to
> > use the scheduler to add tokens to the bucket. Whenever tokens are
> > used, the new token count can be calculated based on the time since
> > tokens were last updated, by default the average max rate defines how
> > many tokens are added. The token count cannot increase larger than the
> > token bucket capacity. This is how simple a plain token bucket
> > algorithm is. That could be a starting point until we start covering
> > the advanced cases which require a dual token bucket algorithm.
> > If someone has the bandwidth, it would be fine to start experimenting
> > in this area with code to learn more.
> >
> > After trying something out, it's easier to be more confident about the
> > design of dual token bucket algorithm. That's one of the assumptions
> > that I'm making that dual token bucket would be sufficient.
> >
> > This has been a great discussion so far and it feels like we are
> > getting to the necessary level of detail.
> > Thanks for sharing many details of your requirements and what you have
> > experienced and learned when operating Pulsar at scale.
> > I feel excited about the upcoming improvements for rate limiting in
> > Pulsar. From my side I have limited availability for participating in
> > this work for the next 2-3 weeks. After that, I'll be ready to help in
> > making the improved rate limiting a reality.
> >
> > -Lari
> >
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Rajan Dhabalia <rd...@apache.org>.
Hi Lari/Girish,

I am sorry for jumping into the discussion late, but I would like to
acknowledge the requirement for a pluggable publish rate limiter; I had
also asked for it during the implementation of the publish rate limiter.
There are trade-offs between different rate limiter implementations in
terms of accuracy, network usage and simplicity, and the user should be
able to choose one based on their requirements. However, we don't have a
correct and extensible publish rate limiter interface right now, and before
making it pluggable we have to make sure that it can support any type of
implementation, for example: token based or sliding-window based throttling,
support for various decay functions (eg: exponential decay:
https://en.wikipedia.org/wiki/Exponential_decay), etc. I haven't seen such
interface details and design in the PIP:
https://github.com/apache/pulsar/pull/21399/. So, I would encourage working
towards building a pluggable rate limiter, but the current PIP is not ready
as it doesn't cover such generic interfaces that can support different
types of implementation.

Thanks,
Rajan

On Tue, Nov 7, 2023 at 10:02 AM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> Replies inline.
>
> On Tue, 7 Nov 2023 at 15:26, Girish Sharma <sc...@gmail.com>
> wrote:
> >
> > Hello Lari, replies inline.
> >
> > I will also be going through some textbook rate limiters (the one you
> shared, plus others) and propose the one that at least suits our needs in
> the next reply.
>
>
> sounds good. I've been also trying to find more rate limiter resources
> that could be useful for our design.
>
> Bucket4J documentation gives some good ideas and it shows how the
> token bucket algorithm could be varied. For example, the "Refill
> styles" section [1] is useful to read as an inspiration.
> In network routers, there's a concept of "dual token bucket"
> algorithms and by googling you can find both Cisco and Juniper
> documentation referencing this.
> I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].
>
> 1 - https://bucket4j.com/8.6.0/toc.html#refill-types
> 2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24
>
> >>
> >> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> >> meeting notes can be found at
> >> https://github.com/apache/pulsar/wiki/Community-Meetings .
> >>
> >
> > Would it make sense for me to join this time given that you are skipping
> it?
>
> Yes, it's worth joining regularly when one is participating in Pulsar
> core development. There's usually a chance to discuss all topics that
> Pulsar community members bring up to discussion. A few times there
> haven't been any participants and in that case, it's good to ask on
> the #dev channel on Pulsar Slack whether others are joining the
> meeting.
>
> >>
> >> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> >> metrics via Prometheus. There might be other ways to provide this
> >> information for components that could react to this.
> >> For example, it could be a system topic where these rate limiters emit
> events.
> >>
> >
> > Are there any other system topics than
> `tenent/namespace/__change_events` . While it's an improvement over
> querying metrics, it would still mean one consumer per namespace and would
> form a cyclic dependency - for example, in case a broker is degrading due
> to mis-use of bursting, it might lead to delays in the consumption of the
> event from the __change_events topic.
>
> The number of events is a fairly low volume so the possible
> degradation wouldn't necessarily impact the event communications for
> this purpose. I'm not sure if there are currently any public event
> topics in Pulsar that are supported in a way that an application could
> directly read from the topic.  However since this is an advanced case,
> I would assume that we don't need to focus on this in the first phase
> of improving rate limiters.
>
> >
> >
> >> I agree. I just brought up this example to ensure that your
> >> expectation about bursting isn't about controlling the rate limits
> >> based on situational information, such as end-to-end latency
> >> information.
> >> Such a feature could be useful, but it does complicate things.
> >> However, I think it's good to keep this on the radar since this might
> >> be needed to solve some advanced use cases.
> >>
> >
> > I still envision auto-scaling to be admin API driven rather than produce
> throughput driven. That way, it remains deterministic in nature. But it
> probably doesn't make sense to even talk about it until (partition)
> scale-down is possible.
>
>
> I agree. It's better to exclude this from the discussion at the moment
> so that it doesn't cause confusion. Modifying the rate limits up and
> down automatically with some component could be considered out of
> scope in the first phase. However, that might be controversial since
> the externally observable behavior of rate limiting with bursting
> support seems to behave in a way where the rate limit changes
> automatically. The "auto scaling" aspect of rate limiting, modifying
> the rate limits up and down, might be necessary as part of a rate
> limiter implementation eventually. More of that later.
>
> >
> > In all of the 3 cases that I listed, the current behavior, with precise
> rate limiting enabled, is to pause the netty channel in case the throughput
> breaches the set limits. This eventually leads to timeout at the client
> side in case the burst is significantly greater than the configured timeout
> on the producer side.
>
>
> Makes sense.
>
> >
> > The desired behavior in all three situations is to have a multiplier
> based bursting capability for a fixed duration. For example, it could be
> that a pulsar topic would be able to support 1.5x of the set quota for a
> burst duration of up to 5 minutes. There also needs to be a cooldown period
> in such a case that it would only accept one such burst every X minutes,
> say every 1 hour.
>
>
> Thanks for sharing this concrete example. It will help a lot when
> starting to design a solution which achieves this. I think that a
> token bucket algorithm based design can achieve something very close
> to what you are describing. In the first part I shared the references
> to Bucket4J's "refill styles" and the concept of "dual token bucket"
> algorithms.
> One possible solution is a dual token bucket where the first bucket
> takes care of the burst of up to 5 minutes. The second token bucket
> can cover the maximum rate of 1.5x. The cooldown period could be
> handled in a way where the token bucket fills up the bucket once in
> every 60 minutes. There could be implementation challenges related to
> adjusting the enforced rate up and down since as you had also
> observed, the token bucket algorithm will let all buffered tokens be
> used.
>
> >
> >
> >>
> >>
> >> >    - In a visitor based produce rate (where produce rate goes up in
> the day
> >> >    and goes down in the night, think in terms of popular website
> hourly view
> >> >    counts pattern) , there are cases when, due to certain
> external/internal
> >> >    triggers, the views - and thus - the produce rate spikes for a few
> minutes.
> >>
> >> Again, please explain the current behavior and desired behavior.
> >> Explicit example values of number of messages, bandwidth, etc. would
> >> also be helpful details.
> >
> >
> > Adding to what I wrote above, think of this pattern like the following:
> the produce rate slowly increases from ~2MBps at around 4 AM to a known
> peak of about 30MBps by 4 PM and stays around that peak until 9 PM after
> which it again starts decreasing until it reaches ~2MBps around 2 AM.
> > Now, due to some external triggers, maybe a scheduled sale event, at
> 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go
> back down to the usual ~20MBps . Here is a rough image showcasing the trend.
> >
> >
> >>
> >>
> >> >    It is also important to keep this in check so as to not allow bots
> to do
> >> >    DDOS into your system, while that might be a responsibility of an
> upstream
> >> >    system like API gateway, but we cannot be ignorant about that
> completely.
> >>
> >> what would be the desired behavior?
> >
> >
> > The desired behavior is that the burst support should be short lived
> (5-10 minutes) and limited to a fixed number of bursts in a duration (say -
> 1 burst per hour). Obviously, all of these should be configurable, maybe at
> a broker level and not a topic level.
>
>
> That can be handled with a dual token bucket algorithm based rate limiter.
>
> >
> > I have yet to go in depth on both the textbook rate limiter models, but
> if token bucket works on buffered tokens, it may not be much useful.
> Consider a situation where the throughput is always at its absolute
> configured peak, thus not reserving tokens at all. Again - this is the
> understanding based on the above text. I am yet to go through the wiki page
> with full details. In any case, I am very sure that an absolute maximum
> bursting limit is needed per topic, since at the end of the day, we are
> limited by hardware resources like network bandwidth, disk bandwidth etc.
>
>
> Yes, if the actual throughput is always at the rate limiter's maximum
> average rate, the tokens wouldn't be reserved at all in the default
> token bucket algorithm. Is that a problem?
> It might not be a problem at all. Spare tokens would be buffered
> whenever the throughput is lower and therefore bursting could help
> catch up to the average rate over a longer period of time.
>
> Btw. The challenge of no bursting when the actual throughput is
> already at the peak rate is the original reason why I introduced the
> concept of automatically adjusting the rate limit up and down based on
> end-to-end latency information. If the rate limiter had a feedback
> loop, it could know whether it needs to do bursting to prevent the
> latencies from increasing. That was one of the reasons why I was
> thinking about the automatic adjustments of the rate limit. However,
> it's better to keep that out of the first phase for now.
>
> I believe that we will find many ways to apply the token bucket
> algorithm to meet our requirements. A lot of flexibility can be added
> with dual token buckets and by introducing new ways of adding/filling
> tokens to the buckets. It's also possible to tweak a lot of small
> details if that is necessary, such as smoothening out burst rates and
> so on. However, it's better to keep things simple when possible.
>
>
> > Suppose the SOP for bursting support is to support a burst of upto 1.5x
> produce rate for upto 10 minutes, once in a window of 1 hour.
> >
> > Now if a particular producer is trying to burst to or beyond 1.5x for
> more than 5 minutes - while we are not accepting the requests and the
> producer client is continuously going into timeouts, there should be a way
> to completely blacklist that producer so that even the client stops
> attempting to produce and we unload the topics all together - thus,
> releasing open connections and broker compute.
>
>
> It seems that this part of the requirements is more challenging to
> cover. We don't have to give up on trying to make progress on this
> side of your requirements. It could be helpful to do thought
> experiments on how some of the conditions could be detected. For
> example, how would you detect if a particular producer is going beyond
> the configured policy? It does seem intriguing to start adding the
> support to the rate limiting area. The rate limiter has the
> information about when bursting is actively happening. I would favor a
> solution where the rate limiter could only emit events and not make
> decisions about blocking a producer. However, I'm open to changing my
> opinion on that detail of the design if it turns out to be a common
> requirement, and it fits the overall architecture of the broker.
>
> The information about timeouts is on the client side. Perhaps there
> are ways to pass the timeout configuration information along with the
> producer creation, for example. The Pulsar binary protocol could have
> some sort of clock sync so that client and broker clock difference
> could be calculated with sufficient accuracy. However, the broker
> cannot know when messages timeout that aren't at the broker at all. If
> the end-to-end latency of sent messages start approaching the timeout
> value, it's a good indication on the broker side that the client side
> is approaching the state where messages will start timing out.
>
> In general, I think we should keep these requirements covered in the
> design so that the rate limiter implementation could be extended to
> support the use case in a way or another in the later phases.
>
>
> >> > It's only async if the producers use `sendAsync` . Moreover, even if
> it is
> >> > async in nature, practically it ends up being quite linear and
> >> > well-homogenised. I am speaking from experience of running 1000s of
> >> > partitioned topics in production
> >>
> >> I'd assume that most high traffic use cases would use asynchronous
> >> sending in a way or another. If you'd be sending synchronously, it
> >> would be extremely inefficient if messages are sent one-by-one. The
> >> asynchronicity could also come from multiple threads in an
> >> application.
> >>
> >> You are right that a well homogenised workload reduces the severity of
> >> the multiplexing problem. That's also why the problem of multiplexing
> >> hasn't been fixed since the problem isn't visible and it's also very
> >> hard to observe.
> >> In a case where 2 producers with very different rate limiting options
> >> share a connection, it definitely is a problem in the current
> >
> >
> > Yes actually, I was talking to the team and we did observe a case where
> there was a client app with 2 producer objects writing to two different
> topics. When one of their topics was breaching quota, the other one was
> also observing rate limiting even though it was under quota. So eventually,
> this surely needs to be solved. One short term solution is to at least not
> share connections across producer objects. But I still feel it's out of
> scope of this discussion here.
>
> Thanks for sharing. The real stories from operating Pulsar at scale
> are always interesting to hear!
>
> > What do you think about the quick solution of not sharing connections
> across producer objects? I can raise a github issue explaining the
> situation.
>
> It should be very straightforward to add a flag to producer and
> consumer builder to use an isolated (non-shared) broker connection.
> It's not optimal to add such flags, but I haven't heard of other ways
> to solve this challenge where the Pulsar binary protocol doesn't have
> to be modified or the backpressure solution revisited in the broker.
> The proper solution requires doing both, adding a permit-based flow
> control for producers and revisiting the broker side backpressure/flow
> control and therefore it won't happen quickly.
> Please go ahead and create a GH issue and share your context. That
> will be very helpful.
>
> Some thoughts about proceeding to making things a reality so that it
> gets implemented.
>
> One important part of development will be how we could test or
> simulate a scenario and validate that the rate limiting algorithm
> makes sense.
> On the lowest level of unit tests, it will be helpful to have a
> solution where the clock source used in unit test can be run quickly
> so that simulated scenarios don't take a long time to run, but could
> cover the essential features.
>
> It might be hard to get started on the rate limiter implementation by
> looking at the existing code. The reason of the double methods in the
> existing interface is due to it covering 2 completely different ways
> of handling the rate limiting and trying to handle that in a single
> concept. What is actually desired for at least the producing rate
> limiting is that it's an asynchronous rate limiting where any messages
> that have already arrived to the broker will be handled. The token
> count could go to a negative value when implementing this in the token
> bucket algorithm. If it goes below 0, the producers for the topic
> should be backpressured by toggling the auto-read state and it should
> schedule a job to resume the auto-reading after there are at least a
> configurable amount of tokens available. There should be no need to
> use the scheduler to add tokens to the bucket. Whenever tokens are
> used, the new token count can be calculated based on the time since
> tokens were last updated, by default the average max rate defines how
> many tokens are added. The token count cannot increase larger than the
> token bucket capacity. This is how simple a plain token bucket
> algorithm is. That could be a starting point until we start covering
> the advanced cases which require a dual token bucket algorithm.
> If someone has the bandwidth, it would be fine to start experimenting
> in this area with code to learn more.
>
> After trying something out, it's easier to be more confident about the
> design of dual token bucket algorithm. That's one of the assumptions
> that I'm making that dual token bucket would be sufficient.
>
> This has been a great discussion so far and it feels like we are
> getting to the necessary level of detail.
> Thanks for sharing many details of your requirements and what you have
> experienced and learned when operating Pulsar at scale.
> I feel excited about the upcoming improvements for rate limiting in
> Pulsar. From my side I have limited availability for participating in
> this work for the next 2-3 weeks. After that, I'll be ready to help in
> making the improved rate limiting a reality.
>
> -Lari
>

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

Replies inline.

On Tue, 7 Nov 2023 at 15:26, Girish Sharma <sc...@gmail.com> wrote:
>
> Hello Lari, replies inline.
>
> I will also be going through some textbook rate limiters (the one you shared, plus others) and propose the one that at least suits our needs in the next reply.


sounds good. I've been also trying to find more rate limiter resources
that could be useful for our design.

Bucket4J documentation gives some good ideas and it shows how the
token bucket algorithm could be varied. For example, the "Refill
styles" section [1] is useful to read as an inspiration.
In network routers, there's a concept of "dual token bucket"
algorithms and by googling you can find both Cisco and Juniper
documentation referencing this.
I also asked ChatGPT-4 to explain "dual token bucket" algorithm [2].

1 - https://bucket4j.com/8.6.0/toc.html#refill-types
2 - https://chat.openai.com/share/d4f4f740-f675-4233-964e-2910a7c8ed24

>>
>> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
>> meeting notes can be found at
>> https://github.com/apache/pulsar/wiki/Community-Meetings .
>>
>
> Would it make sense for me to join this time given that you are skipping it?

Yes, it's worth joining regularly when one is participating in Pulsar
core development. There's usually a chance to discuss all topics that
Pulsar community members bring up for discussion. A few times there
haven't been any participants and in that case, it's good to ask on
the #dev channel on Pulsar Slack whether others are joining the
meeting.

>>
>> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
>> metrics via Prometheus. There might be other ways to provide this
>> information for components that could react to this.
>> For example, it could be a system topic where these rate limiters emit events.
>>
>
> Are there any other system topics than `tenent/namespace/__change_events` . While it's an improvement over querying metrics, it would still mean one consumer per namespace and would form a cyclic dependency - for example, in case a broker is degrading due to mis-use of bursting, it might lead to delays in the consumption of the event from the __change_events topic.

The number of events is a fairly low volume so the possible
degradation wouldn't necessarily impact the event communications for
this purpose. I'm not sure if there are currently any public event
topics in Pulsar that are supported in a way that an application could
directly read from the topic.  However since this is an advanced case,
I would assume that we don't need to focus on this in the first phase
of improving rate limiters.

>
>
>> I agree. I just brought up this example to ensure that your
>> expectation about bursting isn't about controlling the rate limits
>> based on situational information, such as end-to-end latency
>> information.
>> Such a feature could be useful, but it does complicate things.
>> However, I think it's good to keep this on the radar since this might
>> be needed to solve some advanced use cases.
>>
>
> I still envision auto-scaling to be admin API driven rather than produce throughput driven. That way, it remains deterministic in nature. But it probably doesn't make sense to even talk about it until (partition) scale-down is possible.


I agree. It's better to exclude this from the discussion at the moment
so that it doesn't cause confusion. Modifying the rate limits up and
down automatically with some component could be considered out of
scope in the first phase. However, that might be controversial since
the externally observable behavior of rate limiting with bursting
support seems to behave in a way where the rate limit changes
automatically. The "auto scaling" aspect of rate limiting, modifying
the rate limits up and down, might be necessary as part of a rate
limiter implementation eventually. More of that later.

>
> In all of the 3 cases that I listed, the current behavior, with precise rate limiting enabled, is to pause the netty channel in case the throughput breaches the set limits. This eventually leads to timeout at the client side in case the burst is significantly greater than the configured timeout on the producer side.


Makes sense.

>
> The desired behavior in all three situations is to have a multiplier based bursting capability for a fixed duration. For example, it could be that a pulsar topic would be able to support 1.5x of the set quota for a burst duration of up to 5 minutes. There also needs to be a cooldown period in such a case that it would only accept one such burst every X minutes, say every 1 hour.


Thanks for sharing this concrete example. It will help a lot when
starting to design a solution which achieves this. I think that a
token bucket algorithm based design can achieve something very close
to what you are describing. In the first part I shared the references
to Bucket4J's "refill styles" and the concept of "dual token bucket"
algorithms.
One possible solution is a dual token bucket where the first bucket
takes care of the burst of up to 5 minutes. The second token bucket
can cover the maximum rate of 1.5x. The cooldown period could be
handled in a way where the token bucket fills up the bucket once in
every 60 minutes. There could be implementation challenges related to
adjusting the enforced rate up and down since as you had also
observed, the token bucket algorithm will let all buffered tokens be
used.
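
Just to put rough numbers on that idea (purely illustrative, using your
1.5x / 5 minute example): if the average rate R is 20 MB/s, the first
bucket would refill at R = 20 MB/s and would need a capacity of at least
(1.5R - R) * 300s = 0.5 * 20 MB/s * 300s = 3000 MB worth of tokens to
sustain the full 5 minute burst, while the second bucket would refill at
1.5R = 30 MB/s with a small capacity so that the instantaneous rate can
never exceed 1.5x.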

>
>
>>
>>
>> >    - In a visitor based produce rate (where produce rate goes up in the day
>> >    and goes down in the night, think in terms of popular website hourly view
>> >    counts pattern) , there are cases when, due to certain external/internal
>> >    triggers, the views - and thus - the produce rate spikes for a few minutes.
>>
>> Again, please explain the current behavior and desired behavior.
>> Explicit example values of number of messages, bandwidth, etc. would
>> also be helpful details.
>
>
> Adding to what I wrote above, think of this pattern like the following: the produce rate slowly increases from ~2MBps at around 4 AM to a known peak of about 30MBps by 4 PM and stays around that peak until 9 PM after which it again starts decreasing until it reaches ~2MBps around 2 AM.
> Now, due to some external triggers, maybe a scheduled sale event, at 10PM, the quota may spike up to 40MBps for 4-5 minutes and then again go back down to the usual ~20MBps . Here is a rough image showcasing the trend.
>
>
>>
>>
>> >    It is also important to keep this in check so as to not allow bots to do
>> >    DDOS into your system, while that might be a responsibility of an upstream
>> >    system like API gateway, but we cannot be ignorant about that completely.
>>
>> what would be the desired behavior?
>
>
> The desired behavior is that the burst support should be short lived (5-10 minutes) and limited to a fixed number of bursts in a duration (say - 1 burst per hour). Obviously, all of these should be configurable, maybe at a broker level and not a topic level.


That can be handled with a dual token bucket algorithm based rate limiter.

>
> I have yet to go in depth on both the textbook rate limiter models, but if token bucket works on buffered tokens, it may not be much useful. Consider a situation where the throughput is always at its absolute configured peak, thus not reserving tokens at all. Again - this is the understanding based on the above text. I am yet to go through the wiki page with full details. In any case, I am very sure that an absolute maximum bursting limit is needed per topic, since at the end of the day, we are limited by hardware resources like network bandwidth, disk bandwidth etc.


Yes, if the actual throughput is always at the rate limiter's maximum
average rate, the tokens wouldn't be reserved at all in the default
token bucket algorithm. Is that a problem?
It might not be a problem at all. Spare tokens would be buffered
whenever the throughput is lower and therefore bursting could help
catch up to the average rate over a longer period of time.

Btw. The challenge of no bursting when the actual throughput is
already at the peak rate is the original reason why I introduced the
concept of automatically adjusting the rate limit up and down based on
end-to-end latency information. If the rate limiter had a feedback
loop, it could know whether it needs to do bursting to prevent the
latencies from increasing. That was one of the reasons why I was
thinking about the automatic adjustments of the rate limit. However,
it's better to keep that out of the first phase for now.

I believe that we will find many ways to apply the token bucket
algorithm to meet our requirements. A lot of flexibility can be added
with dual token buckets and by introducing new ways of adding/filling
tokens to the buckets. It's also possible to tweak a lot of small
details if that is necessary, such as smoothening out burst rates and
so on. However, it's better to keep things simple when possible.


> Suppose the SOP for bursting support is to support a burst of upto 1.5x produce rate for upto 10 minutes, once in a window of 1 hour.
>
> Now if a particular producer is trying to burst to or beyond 1.5x for more than 5 minutes - while we are not accepting the requests and the producer client is continuously going into timeouts, there should be a way to completely blacklist that producer so that even the client stops attempting to produce and we unload the topics all together - thus, releasing open connections and broker compute.


It seems that this part of the requirements is more challenging to
cover. We don't have to give up on trying to make progress on this
side of your requirements. It could be helpful to do thought
experiments on how some of the conditions could be detected. For
example, how would you detect if a particular producer is going beyond
the configured policy? It does seem intriguing to start adding the
support to the rate limiting area. The rate limiter has the
information about when bursting is actively happening. I would favor a
solution where the rate limiter could only emit events and not make
decisions about blocking a producer. However, I'm open to changing my
opinion on that detail of the design if it turns out to be a common
requirement, and it fits the overall architecture of the broker.
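
To illustrate what "emit events, not decisions" could mean in practice,
it could be something along these lines (the interface and method names
below are completely made up, just a sketch of the idea):

interface RateLimiterEventListener {
    // called when a topic starts consuming its burst allowance
    void onBurstStarted(String topic, double currentRate, double configuredAverageRate);

    // called when a topic has been throttled beyond the configured policy,
    // so that an external or pluggable component can decide to e.g. blacklist
    void onPolicyExceeded(String topic, long throttledDurationMillis);
}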

The information about timeouts is on the client side. Perhaps there
are ways to pass the timeout configuration information along with the
producer creation, for example. The Pulsar binary protocol could have
some sort of clock sync so that client and broker clock difference
could be calculated with sufficient accuracy. However, the broker
cannot know when messages that haven't even reached the broker will time out. If
the end-to-end latency of sent messages starts approaching the timeout
value, it's a good indication on the broker side that the client side
is approaching the state where messages will start timing out.

In general, I think we should keep these requirements covered in the
design so that the rate limiter implementation could be extended to
support the use case in one way or another in the later phases.


>> > It's only async if the producers use `sendAsync` . Moreover, even if it is
>> > async in nature, practically it ends up being quite linear and
>> > well-homogenised. I am speaking from experience of running 1000s of
>> > partitioned topics in production
>>
>> I'd assume that most high traffic use cases would use asynchronous
>> sending in a way or another. If you'd be sending synchronously, it
>> would be extremely inefficient if messages are sent one-by-one. The
>> asynchronicity could also come from multiple threads in an
>> application.
>>
>> You are right that a well homogenised workload reduces the severity of
>> the multiplexing problem. That's also why the problem of multiplexing
>> hasn't been fixed since the problem isn't visible and it's also very
>> hard to observe.
>> In a case where 2 producers with very different rate limiting options
>> share a connection, it definitely is a problem in the current
>
>
> Yes actually, I was talking to the team and we did observe a case where there was a client app with 2 producer objects writing to two different topics. When one of their topics was breaching quota, the other one was also observing rate limiting even though it was under quota. So eventually, this surely needs to be solved. One short term solution is to at least not share connections across producer objects. But I still feel it's out of scope of this discussion here.

Thanks for sharing. The real stories from operating Pulsar at scale
are always interesting to hear!

> What do you think about the quick solution of not sharing connections across producer objects? I can raise a github issue explaining the situation.

It should be very straightforward to add a flag to producer and
consumer builder to use an isolated (non-shared) broker connection.
It's not optimal to add such flags, but I haven't heard of other ways
to solve this challenge where the Pulsar binary protocol doesn't have
to be modified or the backpressure solution revisited in the broker.
The proper solution requires doing both, adding a permit-based flow
control for producers and revisiting the broker side backpressure/flow
control and therefore it won't happen quickly.
Please go ahead and create a GH issue and share your context. That
will be very helpful.
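
To sketch what such a flag could look like from the client API side (the
`isolatedConnection` name below is purely hypothetical and does not exist
today, so it's left commented out; as far as I remember, the closest
existing knob is ClientBuilder#connectionsPerBroker, which only grows the
shared pool rather than isolating a single producer):

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class IsolatedConnectionExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/my-topic")
                // .isolatedConnection(true)  // hypothetical builder flag discussed above
                .create();

        producer.send("hello".getBytes());
        producer.close();
        client.close();
    }
}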

Some thoughts about proceeding to making things a reality so that it
gets implemented.

One important part of development will be how we could test or
simulate a scenario and validate that the rate limiting algorithm
makes sense.
On the lowest level of unit tests, it will be helpful to have a
solution where the clock source used in unit test can be run quickly
so that simulated scenarios don't take a long time to run, but could
cover the essential features.

It might be hard to get started on the rate limiter implementation by
looking at the existing code. The reason for the duplicate methods in the
existing interface is that it covers 2 completely different ways
of handling the rate limiting and tries to handle that in a single
concept. What is actually desired, at least for the producing rate
limiting, is asynchronous rate limiting where any messages
that have already arrived at the broker will be handled.
count could go to a negative value when implementing this in the token
bucket algorithm. If it goes below 0, the producers for the topic
should be backpressured by toggling the auto-read state and it should
schedule a job to resume the auto-reading after there are at least a
configurable amount of tokens available. There should be no need to
use the scheduler to add tokens to the bucket. Whenever tokens are
used, the new token count can be calculated based on the time since
tokens were last updated, by default the average max rate defines how
many tokens are added. The token count cannot grow beyond the
token bucket capacity. This is how simple a plain token bucket
algorithm is. That could be a starting point until we start covering
the advanced cases which require a dual token bucket algorithm.
If someone has the bandwidth, it would be fine to start experimenting
in this area with code to learn more.
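
As a starting point for such an experiment, a minimal, untested sketch of
that plain token bucket could look roughly like this (all names are made
up; the clock is injectable so unit tests can run simulated time quickly):

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Sketch only: never rejects a message that has already arrived, but tells
// the caller when to pause reading (toggle auto-read off) and when to resume.
class AsyncTokenBucketSketch {
    private final long ratePerSecond;      // average max rate, tokens added per second
    private final long capacity;           // maximum number of buffered tokens
    private final long resumeThreshold;    // resume auto-read once tokens reach this level
    private final LongSupplier clockNanos; // injectable clock source for tests
    private long tokens;
    private long lastUpdateNanos;

    AsyncTokenBucketSketch(long ratePerSecond, long capacity, long resumeThreshold,
                           LongSupplier clockNanos) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.resumeThreshold = resumeThreshold;
        this.clockNanos = clockNanos;
        this.tokens = capacity;
        this.lastUpdateNanos = clockNanos.getAsLong();
    }

    // Consume tokens for a message that has already arrived; the count may go negative.
    synchronized boolean consumeAndCheckIfThrottlingNeeded(long permits) {
        refill();
        tokens -= permits;
        return tokens < 0; // caller toggles auto-read off when this returns true
    }

    // Schedule resuming auto-read once enough tokens have accumulated again.
    synchronized void scheduleResume(ScheduledExecutorService scheduler, Runnable resumeAutoRead) {
        refill();
        long missing = Math.max(0, resumeThreshold - tokens);
        long delayNanos = missing * TimeUnit.SECONDS.toNanos(1) / ratePerSecond;
        scheduler.schedule(resumeAutoRead, delayNanos, TimeUnit.NANOSECONDS);
    }

    // Tokens are added lazily based on elapsed time; no background timer needed.
    private void refill() {
        long now = clockNanos.getAsLong();
        long newTokens = (now - lastUpdateNanos) * ratePerSecond / TimeUnit.SECONDS.toNanos(1);
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            lastUpdateNanos = now;
        }
    }
}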

After trying something out, it's easier to be more confident about the
design of the dual token bucket algorithm. One of the assumptions
I'm making is that a dual token bucket would be sufficient.

This has been a great discussion so far and it feels like we are
getting to the necessary level of detail.
Thanks for sharing many details of your requirements and what you have
experienced and learned when operating Pulsar at scale.
I feel excited about the upcoming improvements for rate limiting in
Pulsar. From my side I have limited availability for participating in
this work for the next 2-3 weeks. After that, I'll be ready to help in
making the improved rate limiting a reality.

-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari, replies inline.

I will also be going through some textbook rate limiters (the one you
shared, plus others) and propose the one that at least suits our needs in
the next reply.

On Tue, Nov 7, 2023 at 2:49 PM Lari Hotari <lh...@apache.org> wrote:

>
> It is bi-weekly on Thursdays. The meeting calendar, zoom link and
> meeting notes can be found at
> https://github.com/apache/pulsar/wiki/Community-Meetings .
>
>
Would it make sense for me to join this time given that you are skipping it?


>
> ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
> metrics via Prometheus. There might be other ways to provide this
> information for components that could react to this.
> For example, it could be a system topic where these rate limiters emit
> events.
>
>
Are there any other system topics than `tenant/namespace/__change_events`?
While it's an improvement over querying metrics, it would still mean one
consumer per namespace and would form a cyclic dependency - for example, in
case a broker is degrading due to mis-use of bursting, it might lead to
delays in the consumption of the event from the __change_events topic.

I agree. I just brought up this example to ensure that your
> expectation about bursting isn't about controlling the rate limits
> based on situational information, such as end-to-end latency
> information.
> Such a feature could be useful, but it does complicate things.
> However, I think it's good to keep this on the radar since this might
> be needed to solve some advanced use cases.
>
>
I still envision auto-scaling to be admin API driven rather than produce
throughput driven. That way, it remains deterministic in nature. But it
probably doesn't make sense to even talk about it until (partition)
scale-down is possible.


>
> >    - A producer(s) is producing at a near constant rate into a topic,
> with
> >    equal distribution among partitions. Due to a hiccup in their
> downstream
> >    component, the produce rate goes to 0 for a few seconds, and thus, to
> >    compensate, in the next few seconds, the produce rate tries to double
> up.
>
> Could you also elaborate on details such as what is the current
> behavior of Pulsar rate limiting / throttling solution and what would
> be the desired behavior?
> Just guessing that you mean that the desired behavior would be to
> allow the produce rate to double up for some time (configurable)?
> Compared to what rate is it doubled?
> Please explain in detail what the current and desired behaviors would
> be so that it's easier to understand the gap.
>

In all of the 3 cases that I listed, the current behavior, with precise
rate limiting enabled, is to pause the netty channel in case the throughput
breaches the set limits. This eventually leads to timeouts on the client
side in case the burst lasts significantly longer than the configured send
timeout on the producer side.

The desired behavior in all three situations is to have a multiplier based
bursting capability for a fixed duration. For example, it could be that a
pulsar topic would be able to support 1.5x of the set quota for a burst
duration of up to 5 minutes. There also needs to be a cooldown period in
such a case so that it would only accept one such burst every X minutes, say
every 1 hour.


>
> >    - In a visitor based produce rate (where produce rate goes up in the
> day
> >    and goes down in the night, think in terms of popular website hourly
> view
> >    counts pattern) , there are cases when, due to certain
> external/internal
> >    triggers, the views - and thus - the produce rate spikes for a few
> minutes.
>
> Again, please explain the current behavior and desired behavior.
> Explicit example values of number of messages, bandwidth, etc. would
> also be helpful details.
>

Adding to what I wrote above, think of this pattern like the following: the
produce rate slowly increases from ~2MBps at around 4 AM to a known peak of
about 30MBps by 4 PM and stays around that peak until 9 PM after which it
again starts decreasing until it reaches ~2MBps around 2 AM.
Now, due to some external triggers, maybe a scheduled sale event, at 10PM,
the quota may spike up to 40MBps for 4-5 minutes and then again go back
down to the usual ~20MBps . Here is a rough image showcasing the trend.
[image: image.png]


>
> >    It is also important to keep this in check so as to not allow bots to
> do
> >    DDOS into your system, while that might be a responsibility of an
> upstream
> >    system like API gateway, but we cannot be ignorant about that
> completely.
>
> what would be the desired behavior?
>

The desired behavior is that the burst support should be short lived (5-10
minutes) and limited to a fixed number of bursts in a duration (say - 1
burst per hour). Obviously, all of these should be configurable, maybe at a
broker level and not a topic level.


>
> >    - In streaming systems, where there are micro batches, there might be
> >    constant fluctuations in produce rate from time to time, based on
> batch
> >    failure or retries.
>
> could you share an example with numbers about this use case too?
> explaining current behavior and desired behavior?
>

It is very common for Direct Input Stream Spark jobs to consume messages in
micro batches, process the messages together and then move on to the next
batch. Thus, consumption is not uniform at a second or sub-minute level.
Now, in this situation, if a few micro-batches fail due to hardware
issues, the Spark scheduler may schedule more batches than usual together
after the issue is resolved, leading to increased consumption.
While I am talking mostly about consumption, the same job's sink may also be a
Pulsar topic, leading to the same produce pattern.



>
>
> >
> > In all of these situations, setting the throughput of the topic to be the
> > absolute maximum of the various spikes observed during the day is very
> > suboptimal.
>
> btw. A plain token bucket algorithm implementation doesn't have an
> absolute maximum. The maximum average rate is controlled with the rate
> of tokens added to the bucket. The capacity of the bucket controls how
> much buffer there is to spend on spikes. If there's a need to also set
> an absolute maximum rate, 2 token buckets could be chained to handle
> that case. The second rate limiter could have an average rate of the
> absolute maximum with a relatively small buffer (token bucket
> capacity). There might also be more sophisticated algorithms which
> vary the maximum rate in some way to smoothen spikes, but that might
> just be completely unnecessary in Pulsar rate limiters.
> Switching to use a conceptual model of token buckets will make it a
> lot easier to reason about the implementation when we come to that
> point. In the design we can try to "meet in the middle" where the
> requirements and the design and implementation meet so that the
> implementation is as simple as possible.
>

I have yet to go in depth on both the textbook rate limiter models, but if
token bucket works on buffered tokens, it may not be very useful. Consider
a situation where the throughput is always at its absolute configured peak,
thus not reserving tokens at all. Again - this is the understanding based
on the above text. I am yet to go through the wiki page with full details.
In any case, I am very sure that an absolute maximum bursting limit is
needed per topic, since at the end of the day, we are limited by hardware
resources like network bandwidth, disk bandwidth etc.


> > Moreover, in each of these situations, once bursting support is present
> in
> > the system, it would also need to have proper checks in place to penalize
> > the producers from trying to mis-use the system. In a true multi-tenant
> > platform, this is very critical. Thus, blacklisting actually goes hand in
> > hand here. Explained more below.
>
> I don't yet fully understand this part. What would be an example of a
> case where a producer should be penalized?
> When would an attempt to mis-use happen and why? How would you detect
> mis-use and how would you penalize it?
>
Suppose the SOP for bursting support is to support a burst of up to 1.5x
produce rate for up to 10 minutes, once in a window of 1 hour.

Now if a particular producer is trying to burst to or beyond 1.5x for more
than 5 minutes - while we are not accepting the requests and the producer
client is continuously going into timeouts, there should be a way to
completely blacklist that producer so that even the client stops attempting
to produce and we unload the topics altogether - thus, releasing open
connections and broker compute.


> This would be a much simpler and efficient model if it were reactive,
> > based/triggered directly from within the rate limiter component. It can
> be
> > fast and responsive, which is very critical when trying to prevent the
> > system from abuse.
>
> I agree that it could be simpler and efficient from some perspectives.
> Do you have examples of what type of behaviors would you like to have
> for handling this?
> A concrete example with numbers could be very useful.
>

I believe the above example is a good one in terms of expectation, with
numbers.


> > It's only async if the producers use `sendAsync` . Moreover, even if it
> is
> > async in nature, practically it ends up being quite linear and
> > well-homogenised. I am speaking from experience of running 1000s of
> > partitioned topics in production
>
> I'd assume that most high traffic use cases would use asynchronous
> sending in a way or another. If you'd be sending synchronously, it
> would be extremely inefficient if messages are sent one-by-one. The
> asynchronicity could also come from multiple threads in an
> application.
>
> You are right that a well homogenised workload reduces the severity of
> the multiplexing problem. That's also why the problem of multiplexing
> hasn't been fixed since the problem isn't visible and it's also very
> hard to observe.
> In a case where 2 producers with very different rate limiting options
> share a connection, it definitely is a problem in the current
>

Yes actually, I was talking to the team and we did observe a case where
there was a client app with 2 producer objects writing to two different
topics. When one of their topics was breaching quota, the other one was
also observing rate limiting even though it was under quota. So eventually,
this surely needs to be solved. One short term solution is to at least not
share connections across producer objects. But I still feel it's out of
scope of this discussion here.

What do you think about the quick solution of not sharing connections
across producer objects? I can raise a github issue explaining the
situation.



-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

I think we are starting to get into the concrete details of rate
limiters and how we could start improving the existing feature.
It is very helpful that you are sharing your insight and experience of
operating Pulsar at scale.
Replies inline.

On Mon, 6 Nov 2023 at 15:37, Girish Sharma <sc...@gmail.com> wrote:
> Is this every thursday? I am willing to meet at a separate time as well if
> enough folks with a viewpoint on this can meet together. I assume that the
> community meeting has a much bigger agenda with detailed discussions not
> possible?

It is bi-weekly on Thursdays. The meeting calendar, zoom link and
meeting notes can be found at
https://github.com/apache/pulsar/wiki/Community-Meetings .

> It is all good, as long as the final goal is met within reasonable
> timelines.

+1

> Well, the blacklisting use case is a very specific use case. I am
> explaining below why that can't be done using metrics and a separate
> blacklisting API.

ok. btw. "metrics" doesn't necessarily mean providing the rate limiter
metrics via Prometheus. There might be other ways to provide this
information for components that could react to this.
For example, it could be a system topic where these rate limiters emit events.

> This actually might be a blessing in disguise, at least for RateLimiter and
> PublishRateLimiter.java, being an internal interface, it has gone out of
> hand and unchecked. Explained more below.

I agree. It is hard to reason about the existing solution.


> I would like to keep auto-scaling out of scope for this discussion. That
> opens up another huge can of worms, specially given the gaps in proper
> scale down support in pulsar.

I agree. I just brought up this example to ensure that your
expectation about bursting isn't about controlling the rate limits
based on situational information, such as end-to-end latency
information.
Such a feature could be useful, but it does complicate things.
However, I think it's good to keep this on the radar since this might
be needed to solve some advanced use cases.

> > I don't know what "bursting" means for you. Would it be possible to
> > provide concrete examples of desired behavior? That would be very
> > helpful in making progress.
> >
> >
> Here are a few different use cases:

These use cases clarify your requirements a lot. Thanks for sharing.

>    - A producer(s) is producing at a near constant rate into a topic, with
>    equal distribution among partitions. Due to a hiccup in their downstream
>    component, the produce rate goes to 0 for a few seconds, and thus, to
>    compensate, in the next few seconds, the produce rate tries to double up.

Could you also elaborate on details such as what is the current
behavior of Pulsar rate limiting / throttling solution and what would
be the desired behavior?
Just guessing that you mean that the desired behavior would be to
allow the produce rate to double up for some time (configurable)?
Compared to what rate is it doubled?
Please explain in detail what the current and desired behaviors would
be so that it's easier to understand the gap.

>    - In a visitor based produce rate (where produce rate goes up in the day
>    and goes down in the night, think in terms of popular website hourly view
>    counts pattern) , there are cases when, due to certain external/internal
>    triggers, the views - and thus - the produce rate spikes for a few minutes.

Again, please explain the current behavior and desired behavior.
Explicit example values of number of messages, bandwidth, etc. would
also be helpful details.

>    It is also important to keep this in check so as to not allow bots to do
>    DDOS into your system, while that might be a responsibility of an upstream
>    system like API gateway, but we cannot be ignorant about that completely.

what would be the desired behavior?

>    - In streaming systems, where there are micro batches, there might be
>    constant fluctuations in produce rate from time to time, based on batch
>    failure or retries.

Could you share examples with numbers for this use case too,
explaining the current behavior and the desired behavior?


>
> In all of these situations, setting the throughput of the topic to be the
> absolute maximum of the various spikes observed during the day is very
> suboptimal.

btw. A plain token bucket algorithm implementation doesn't have an
absolute maximum. The maximum average rate is controlled with the rate
of tokens added to the bucket. The capacity of the bucket controls how
much buffer there is to spend on spikes. If there's a need to also set
an absolute maximum rate, 2 token buckets could be chained to handle
that case. The second rate limiter could have an average rate of the
absolute maximum with a relatively small buffer (token bucket
capacity). There might also be more sophisticated algorithms which
vary the maximum rate in some way to smoothen spikes, but that might
just be completely unnecessary in Pulsar rate limiters.
Switching to use a conceptual model of token buckets will make it a
lot easier to reason about the implementation when we come to that
point. In the design we can try to "meet in the middle" where the
requirements and the design and implementation meet so that the
implementation is as simple as possible.
A token bucket is very simple and it can be implemented in a way where
no timer is used to add tokens to the bucket. You only need a timer
(scheduler) to unblock the flow once the bucket has run out of tokens.
In the normal case where the traffic doesn't exceed the limits, the
token count can be updated lazily each time tokens are about to be
consumed from the bucket. I know I'm
jumping too much ahead already. The reason for this is that we aren't
really dealing with a very complex domain. Since the scope of adding
bursting support to rate limiting is not broad, this could be
implemented in a short time frame and we can really make this happen.
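
To make this concrete, here is a rough, hypothetical sketch of a
scheduler-free token bucket in Java (names are illustrative only, not
existing Pulsar classes; precision and overflow handling are omitted for
brevity). Tokens are replenished lazily from the elapsed time whenever a
publish tries to consume them; a scheduler would only be needed to resume
a paused connection later.

// Hypothetical sketch, not existing Pulsar code.
public class TokenBucket {
    private final long ratePerSecond; // tokens added per second (average rate)
    private final long capacity;      // bucket size, i.e. the allowed burst
    private long tokens;
    private long lastRefillNanos;

    public TokenBucket(long ratePerSecond, long capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Lazily refill from elapsed time, then try to consume. No timer is
    // needed on this path; fractional-token and overflow handling omitted.
    public synchronized boolean tryConsume(long permits) {
        long now = System.nanoTime();
        long newTokens = (now - lastRefillNanos) * ratePerSecond / 1_000_000_000L;
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            lastRefillNanos = now;
        }
        if (tokens >= permits) {
            tokens -= permits;
            return true;
        }
        return false;
    }
}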

> Moreover, in each of these situations, once bursting support is present in
> the system, it would also need to have proper checks in place to penalize
> the producers from trying to mis-use the system. In a true multi-tenant
> platform, this is very critical. Thus, blacklisting actually goes hand in
> hand here. Explained more below.

I don't yet fully understand this part. What would be an example of a
case where a producer should be penalized?
When would an attempt to mis-use happen and why? How would you detect
mis-use and how would you penalize it?

> > It's interesting that you mention that you would like to improve the
> > PublishRateLimiter interface.
> > How would you change it?
> >
> >
> The current interface of PublishRateLimiter has duplicate methods. I am
> assuming after an initial implementation (poller), the next implementation
> simply added more methods into the interface rather than actually using the
> ones already existing.
> For instance, there are both `tryAcquire` and `incrementPublishCount`
> methods, there are both `checkPublishRate` and `isPublishRateExceeded`.
> Then there is the issue of misplaced responsibilities where when its
> precise, the complete responsibility of checking and responding back with
> whether its rate limited or not lies with the `PrecisePublishLimiter.java`
> but when its poller based, there is some logic inside
> `PublishRateLimiterImpl` and rest of the logic is spread across
> `AbstractTopic.java`

+100. Yes, we will need to deal with this. The abstractions are poor
and need to be replaced. I'd rather have a single abstraction for rate
limiting by messages/bytes and have it support bursting configuration.
There shouldn't be a need to have a separate polling implementation
and a "precise" implementation. Those are smells of technical debt
that needs to be addressed.
Proper abstractions don't leak implementation details such ad
"polling" or "precise".

> > short and medium term blacklisting of topics based on breach of rate
> > > limiter beyond a given SOP. I feel this is very very specific to our
> > > organization right now to be included inside pulsar itself.
> >
> > This is outside of rate limiters. IIRC, there have been some
> > discussions in the community that an API for blocking individual
> > producers or producing to specific topics could be useful in some
> > cases.
> > An external component could observe metrics and control blocking if
> > there's an API for doing so.
> >
> >
> Actually, putting this in an external component that's based off of metrics
> is not a scalable or responsive solution.
> First of all, it puts a lot of pressure on the metrics system (prometheus)
> where we are now querying 1000s of metrics every minute/sub-minute
> uselessly. Since majority of the time, not even a single topic may need
> blacklisting, this is very very inefficient.
> Secondly, it makes the design such that this external component now needs
> to be in sync about the existing topics and their rate set inside the
> pulsar's zookeeper. This also puts extra pressure on zk-reads.
> Lastly, the response time for blacklisting the topic increases a lot in
> this approach.

I commented about this above. "Metrics" could mean using a system
topic for rate limiter events and that could eliminate some of your
concerns.

> This would be a much simpler and efficient model if it were reactive,
> based/triggered directly from within the rate limiter component. It can be
> fast and responsive, which is very critical when trying to prevent the
> system from abuse.

I agree that it could be simpler and more efficient from some perspectives.
Do you have examples of what type of behaviors would you like to have
for handling this?
A concrete example with numbers could be very useful.

> It's only async if the producers use `sendAsync` . Moreover, even if it is
> async in nature, practically it ends up being quite linear and
> well-homogenised. I am speaking from experience of running 1000s of
> partitioned topics in production

I'd assume that most high traffic use cases would use asynchronous
sending in one way or another. Sending synchronously, one message at a
time, would be extremely inefficient. The asynchronicity could also
come from multiple threads in an application.

You are right that a well homogenised workload reduces the severity of
the multiplexing problem. That's also why the multiplexing problem
hasn't been fixed: it isn't very visible and it's also hard to observe.
In a case where 2 producers with very different rate limiting options
share a connection, it definitely is a problem in the current
solution. Similarly if there's a message processing application where
the consumer and producer share the connection. It's not necessarily a
big problem since backpressure happens in different directions of the
traffic. However, there are some reports about actual problems with
this. For example, Tao Jiuming brought up some of this in his reply to
the thread [1].

1 - https://lists.apache.org/thread/pz42sokfzgl9fwnmqp6np0q5w6y1vgvv

> I am happy if the discussion concludes eitherways - a pluggable
> implementation or a 99% use-case capturing configurable rate limiter. But
> from what I've seen, the participation in OSS threads can be very random
> and thus, I am afraid it might take a while before more folks pitch in
> their inputs and a clear direction of discussion is formed.

Let's make this happen! Looking forward to hearing more about the
detailed examples of current behavior and desired behavior.


-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari, inline once again.


On Mon, Nov 6, 2023 at 5:44 PM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> Replies inline. We are getting into a very detailed discussion. We
> could also discuss this topic in one of the upcoming Pulsar Community
> meetings. However, I might miss the next meeting that is scheduled
> this Thursday.
>

Is this every thursday? I am willing to meet at a separate time as well if
enough folks with a viewpoint on this can meet together. I assume that the
community meeting has a much bigger agenda with detailed discussions not
possible?



> Although I am currently opposed to your proposal PIP-310, I am
> supportive of solving your problems related to rate limiting. :)
> Let's continue the discussion since that is necessary so that we could
> make progress. I hope this makes sense from your perspective.
>
>
It is all good, as long as the final goal is met within reasonable
timelines.


>
> I acknowledge that there are different usages, but my assumption is
> that we could implement a generic solution that could be configured to
> handle each specific use case.
> I haven't yet seen any evidence that the requirements in your case are
> so special that it justifies adding a pluggable interface for rate
>

Well, the blacklisting use case is a very specific use case. I am
explaining below why that can't be done using metrics and a separate
blacklisting API.


> limiters. Exposing yet another pluggable interface in Pulsar will add
> complexity without gains. Each supported public interface is a
> maintenance burden if we care about the quality of the exposed
> interfaces and put effort in ensuring that the interfaces are
> supported in future versions. Exposing an interface will also lock
> down or slow down some future refactorings.
>

This actually might be a blessing in disguise, at least for RateLimiter and
PublishRateLimiter.java, being an internal interface, it has gone out of
hand and unchecked. Explained more below.


> One concrete example of this is the desired behavior of bursting. In
> token bucket rate limiting, bursting is about using the buffered
> tokens in the "token bucket" and having a configurable limit for the
> buffer (the "bucket"). This buffer will usually only contain tokens
> when the actual rate has been lower than the configured maximum rate
> for some duration.
>
> However, there could be an expectation for a different type of
> bursting which is more like "auto scaling" of the rate limit in a way
> where the end-to-end latency of the produced messages
> is taken into account. The expected behavior might be about scaling
> the rate temporarily to a higher rate so that the queues can be
>

I would like to keep auto-scaling out of scope for this discussion. That
opens up another huge can of worms, especially given the gaps in proper
scale-down support in Pulsar.


>
> I don't know what "bursting" means for you. Would it be possible to
> provide concrete examples of desired behavior? That would be very
> helpful in making progress.
>
>
Here are a few different use cases:

   - A producer(s) is producing at a near constant rate into a topic, with
   equal distribution among partitions. Due to a hiccup in their downstream
   component, the produce rate goes to 0 for a few seconds, and thus, to
   compensate, in the next few seconds, the produce rate tries to double up.
   - In a visitor based produce rate (where produce rate goes up in the day
   and goes down in the night, think in terms of popular website hourly view
   counts pattern) , there are cases when, due to certain external/internal
   triggers, the views - and thus - the produce rate spikes for a few minutes.
   It is also important to keep this in check so as to not allow bots to do
   DDOS into your system, while that might be a responsibility of an upstream
   system like API gateway, but we cannot be ignorant about that completely.
   - In streaming systems, where there are micro batches, there might be
   constant fluctuations in produce rate from time to time, based on batch
   failure or retries.

In all of these situations, setting the throughput of the topic to be the
absolute maximum of the various spikes observed during the day is very
suboptimal.

Moreover, in each of these situations, once bursting support is present in
the system, it would also need to have proper checks in place to penalize
the producers from trying to mis-use the system. In a true multi-tenant
platform, this is very critical. Thus, blacklisting actually goes hand in
hand here. Explained more below.



> It's interesting that you mention that you would like to improve the
> PublishRateLimiter interface.
> How would you change it?
>
>
The current interface of PublishRateLimiter has duplicate methods. I am
assuming after an initial implementation (poller), the next implementation
simply added more methods into the interface rather than actually using the
ones already existing.
For instance, there are both `tryAcquire` and `incrementPublishCount`
methods, there are both `checkPublishRate` and `isPublishRateExceeded`.
Then there is the issue of misplaced responsibilities: when it's
precise, the complete responsibility of checking and responding back with
whether it's rate limited or not lies with `PrecisePublishLimiter.java`,
but when it's poller based, there is some logic inside
`PublishRateLimiterImpl` and the rest of the logic is spread across
`AbstractTopic.java`

> short and medium term blacklisting of topics based on breach of rate
> > limiter beyond a given SOP. I feel this is very very specific to our
> > organization right now to be included inside pulsar itself.
>
> This is outside of rate limiters. IIRC, there have been some
> discussions in the community that an API for blocking individual
> producers or producing to specific topics could be useful in some
> cases.
> An external component could observe metrics and control blocking if
> there's an API for doing so.
>
>
Actually, putting this in an external component that's based off of metrics
is not a scalable or responsive solution.
First of all, it puts a lot of pressure on the metrics system (prometheus)
where we are now querying 1000s of metrics every minute/sub-minute
uselessly. Since majority of the time, not even a single topic may need
blacklisting, this is very very inefficient.
Secondly, it makes the design such that this external component now needs
to be in sync about the existing topics and their rate set inside the
pulsar's zookeeper. This also puts extra pressure on zk-reads.
Lastly, the response time for blacklisting the topic increases a lot in
this approach.

This would be a much simpler and efficient model if it were reactive,
based/triggered directly from within the rate limiter component. It can be
fast and responsive, which is very critical when trying to prevent the
system from abuse.



> > Actually, by default, and for 99% of the cases, multiplexing isn't an
> issue
> > assuming:
> > * A single producer object is producing to a single topic (one or more
> > partition)
> > * Produce is happening in a round robin manner (by default)
>
> Connection multiplexing issue is also a problem in the cases you
> listed since multiple partitions might be served by the same broker
> and the connection to that broker could be shared. Producing in a
> round-robin manner does not eliminate the issue because the sending
> process is asynchronous in the background. Therefore, it is a problem
>

It's only async if the producers use `sendAsync` . Moreover, even if it is
async in nature, practically it ends up being quite linear and
well-homogenised. I am speaking from experience of running 1000s of
partitioned topics in production


> Let's continue in getting into more details of the intended behavior
> and define what bursting really means and how we believe that solves a
> problem with "hot source" producers.
> Makes sense?
>
>
I am happy if the discussion concludes either way - a pluggable
implementation or a 99% use-case-capturing configurable rate limiter. But
from what I've seen, the participation in OSS threads can be very random
and thus, I am afraid it might take a while before more folks pitch in
their inputs and a clear direction of discussion is formed.


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

Replies inline. We are getting into a very detailed discussion. We
could also discuss this topic in one of the upcoming Pulsar Community
meetings. However, I might miss the next meeting that is scheduled
this Thursday.
Although I am currently opposed to your proposal PIP-310, I am
supportive of solving your problems related to rate limiting. :)
Let's continue the discussion since that is necessary so that we could
make progress. I hope this makes sense from your perspective.

On Sat, 4 Nov 2023 at 17:53, Girish Sharma <sc...@gmail.com> wrote:
>
> There are challenges in this. As explained in the PIP, there are several
> different usages of rate limiter, stats, unloading, etc. While I am open to
> having a burstable rate limiter in pulsar out of box, it might complicate
> things considering backward compatibility etc. More on this below.
>

I acknowledge that there are different usages, but my assumption is
that we could implement a generic solution that could be configured to
handle each specific use case.
I haven't yet seen any evidence that the requirements in your case are
so special that it justifies adding a pluggable interface for rate
limiters. Exposing yet another pluggable interface in Pulsar will add
complexity without gains. Each supported public interface is a
maintenance burden if we care about the quality of the exposed
interfaces and put effort in ensuring that the interfaces are
supported in future versions. Exposing an interface will also lock
down or slow down some future refactorings.
There will be a need to refactor and improve rate limiters as part of
the flow control and back pressure improvements within the Pulsar
broker. I'd rather keep the rate limiter internal interfaces an
internal implementation detail instead of leaking the details to an
exposed public interface. That's why we should primarily look for a
generic solution. I hope we could put effort in looking into the
characteristics of your requirements and attempt to sketch a design
for a generic solution that could be configured for your purposes.

> > The problems you are describing seem to be common to many Pulsar use cases,
> > and therefore, I think they should be handled directly in Pulsar.
> >
>
> I personally haven't seen many burstability related discussions; so this
> feature might actually not be that useful for all current Pulsar users.

In general there are not many advanced discussions on the mailing list
about the Pulsar internals.
It may also be difficult for others to recognize that specific
behaviors could be resolved by enhancing flow control, back pressure,
and rate limiting/throttling mechanisms.

> I would personally suggest we tackle this problem in parts so that it's
> available incrementally over versions rather than making the scope so big
> that it takes pulsar 4.0 for these features to land.

Sure, quick delivery time is the goal of everyone. Before talking
about schedules, we should be able to discuss the use case and the
design of the desired type of rate limiting and throttling in more
depth.

One concrete example of this is the desired behavior of bursting. In
token bucket rate limiting, bursting is about using the buffered
tokens in the "token bucket" and having a configurable limit for the
buffer (the "bucket"). This buffer will usually only contain tokens
when the actual rate has been lower than the configured maximum rate
for some duration.

However, there could be an expectation for a different type of
bursting which is more like "auto scaling" of the rate limit in a way
where the end-to-end latency of the produced messages
is taken into account. The expected behavior might be about scaling
the rate temporarily to a higher rate so that the queues can be
cleared and that the latency of the messages being sent stay under a
target latency. The current
org.apache.pulsar.broker.service.PublishRateLimiter interface cannot
control aspects that would be needed to handle this type of bursting
where we actually need to scale up the rate limit based on end-to-end
feedback . The PublishRateLimiter interface doesn't have feedback
loops currently.

I don't know what "bursting" means for you. Would it be possible to
provide concrete examples of desired behavior? That would be very
helpful in making progress.

> Yes, we were also thinking on the same terms once this is pluggable. The
> idea was to have some numbers and real world usage backing an
> implementation of rate limiter before merging it back into pulsar. Any
> decision we would take right now would be limited only by theoretical
> discussion of the implementation and our assumption that it covers 99% of
> the use cases, probably just like how the precise and poller ones came into
> being.

This will remain a theoretical discussion without concrete examples of
desired behavior or what problem we want to solve.
Initially, we need to have a good grasp on the problem we want to
solve. The solution might change while we experiment and iterate. It
will be possible to improve over time, even after releasing the first
version as part of a Pulsar feature release.
The problem and intention don't usually change that much. Assuming
that there's some declarative configuration that configures a rate
limiter, that public "interface" is fairly stable.
Besides the current config of max throughput bytes and messages, there
could be parameters for defining the configuration of handling bursts
and other features. Bursting config could be something like the
maximum rate allowed during a burst and how long the burst can last at
that rate (there are many ways to express the size of the "token
bucket"; another would be to define the size explicitly).
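
As one illustration only (these field names are made up, not an agreed
configuration), such a declarative config could be shaped like:

// Illustrative only: one possible shape for declarative bursting config.
public class PublishRateWithBurst {
    public long rateInMsgs;              // average message rate (existing concept)
    public long rateInBytes;             // average byte rate (existing concept)
    public long burstMaxRateInMsgs;      // maximum rate allowed while bursting
    public long burstMaxRateInBytes;
    public long burstMaxDurationSeconds; // how long a burst may last at the max
                                         // rate; together these imply the
                                         // token bucket capacity
}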

One of the challenges of the existing rate limiter design in the
Pulsar code base is that it's hard to reason about, because it doesn't
align with textbook algorithms such as the token bucket algorithm.
Basing the design on the token bucket algorithm, and making the
concepts in the implementation match it, would make the code more
understandable.
It would be good to hear if others feel this way or not.

> > The current Pulsar rate limiter implementation could be implemented in a
> > cleaner way, which would also be more efficient. Instead of having a
> > scheduler call a method once per second, I think that the rate limiter
> > could be implemented in a reactive way where the algorithm is implemented
> > without a scheduler.
> > I wonder if there are others that would be interested in getting down into
> > such implementation details?
> >
>
> That would be touching RateLimiter.java class, while my goal is to
> improve/touch the outer classes, mainly around the interface
> PublishRateLimiter.java

It's interesting that you mention that you would like to improve the
PublishRateLimiter interface.
How would you change it?

> With this discussion taking a much bigger turn, how or where do we limit
> this PIP's scope? I am happy to work on follow up PIPs which may arrive out
> of this discussion.

We will be able to evaluate that later, after we have made some
conclusions in this discussion.

> > Mostly, the sources are hot sources and can't be slowed down. The lack of
> clear error message (client-server protocol limitation) is also another
> issue that I was planning to tackle in another PIP.

This is an important detail.

> There are a few other custom things. For example, there would be cases of
> short and medium term blacklisting of topics based on breach of rate
> limiter beyond a given SOP. I feel this is very very specific to our
> organization right now to be included inside pulsar itself.

This is outside of rate limiters. IIRC, there have been some
discussions in the community that an API for blocking individual
producers or producing to specific topics could be useful in some
cases.
An external component could observe metrics and control blocking if
there's an API for doing so.

> Actually, by default, and for 99% of the cases, multiplexing isn't an issue
> assuming:
> * A single producer object is producing to a single topic (one or more
> partition)
> * Produce is happening in a round robin manner (by default)

Connection multiplexing issue is also a problem in the cases you
listed since multiple partitions might be served by the same broker
and the connection to that broker could be shared. Producing in a
round-robin manner does not eliminate the issue because the sending
process is asynchronous in the background. Therefore, it is a problem
that should be addressed to get consistent behavior in all cases.
Having a way to isolate connections for each producer / consumer would
be the workaround before solving the producer flow control properly in
the Pulsar binary protocol so that it supports connection
multiplexing.

> Again, I would love this to land in pieces so that TAT for actual usage is
> much faster. What do you suggest from that perspective?

Since we don't backport new features to existing maintenance branches,
the only way to get this into a released version is to target the
master branch so that it gets included in the next feature release.
Our release policy is documented at
https://pulsar.apache.org/contribute/release-policy/ . The goal is to
release from master branch every 3 months.
Once the feature is included, we can implement improvements and bug
fixes in patch releases as needed.

Let's continue in getting into more details of the intended behavior
and define what bursting really means and how we believe that solves a
problem with "hot source" producers.
Makes sense?


-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Replies inline.

On Sat, Nov 4, 2023 at 8:55 PM Lari Hotari <lh...@apache.org> wrote:

>
> One possibility would be to improve the existing rate limiter to allow
> bursting.
> I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
> use cases instead of having one implementing their own rate limiter
> algorithm.
>

There are challenges in this. As explained in the PIP, there are several
different usages of rate limiter, stats, unloading, etc. While I am open to
having a burstable rate limiter in pulsar out of box, it might complicate
things considering backward compatibility etc. More on this below.


> The problems you are describing seem to be common to many Pulsar use cases,
> and therefore, I think they should be handled directly in Pulsar.
>

I personally haven't seen many burstability related discussions; so this
feature might actually not be that useful for all current Pulsar users.


>
> Optimally, there would be a single solution that abstracts the rate
> limiting in a way where it does the right thing based on the declarative
> configuration.
> I would prefer that over having a pluggable solution for rate limiter
> implementations.
>
> What would help is getting deeper in the design of the rate limiter itself,
> without limiting ourselves to the existing rate limiter implementation in
> Pulsar.
>

I would personally suggest we tackle this problem in parts so that it's
available incrementally over versions rather than making the scope so big
that it takes pulsar 4.0 for these features to land.


>
> In textbooks, there are algorithms such as "leaky bucket" [1] and "token
> bucket" [2]. Both algorithms have several variations and in some ways they
> are very similar algorithms but looking from the different point of view.
> It would possibly be easier to conceptualize and understand a rate limiting
> algorithm if common algorithm names and implementation choices mentioned in
> textbooks would be referenced in the implementation.
> It seems that a "token bucket" type of algorithm can be used to implement
> rate limiting with bursting. In the token bucket algorithm, the size of the
> token bucket defines how large bursts will be allowed. The design could
> also be something where 2 rate limiters with different type of algorithms
> and/or configuration parameters are combined to achieve a desired behavior.
> For example, to achieve a rate limiter with bursting and a fixed maximum
> rate.
> By default, the token bucket algorithm doesn't enforce a maximum rate for
> bursts, but that could be achieved by chaining 2 rate limiters if that is
> really needed.
>

Yes, we were also thinking on the same terms once this is pluggable. The
idea was to have some numbers and real world usage backing an
implementation of rate limiter before merging it back into pulsar. Any
decision we would take right now would be limited only by theoretical
discussion of the implementation and our assumption that it covers 99% of
the use cases, probably just like how the precise and poller ones came into
being.


>
> The current Pulsar rate limiter implementation could be implemented in a
> cleaner way, which would also be more efficient. Instead of having a
> scheduler call a method once per second, I think that the rate limiter
> could be implemented in a reactive way where the algorithm is implemented
> without a scheduler.
> I wonder if there are others that would be interested in getting down into
> such implementation details?
>

That would be touching RateLimiter.java class, while my goal is to
improve/touch the outer classes, mainly around the interface
PublishRateLimiter.java

With this discussion taking a much bigger turn, how or where do we limit
this PIP's scope? I am happy to work on follow-up PIPs which may arise out
of this discussion.


> Are you able to slow down producing on the client side? If that is
> possible, there could be ways to improve ways to do client side back
> pressure with Pulsar Client. Currently, the client doesn't expose this
> information until the sending blocks or fails with an exception
> (ProducerQueueIsFullError). Optimally, the client should slow down the rate
> of producing to the rate that it can actually send to the broker.
> Just curious if you have considered turning off producing timeouts on the
> client side completely or making them longer? Would that address the data
> loss problem?
> Or is your event/message source "hot" so that you cannot stop or slow it
> down, and it will just keep on flowing with a certain rate?
>
> Mostly, the sources are hot sources and can't be slowed down. The lack of
clear error message (client-server protocol limitation) is also another
issue that I was planning to tackle in another PIP.


Yes, it makes sense to have bursting configuration parameters in the rate
> limiter.
> As mentioned above, I think we could be improving the existing rate limiter
> in Pulsar to cover 99% of the use case by making it stable and by including
> the bursting configuration options.
> Is there additional functionality you feel the rate limiter needs beyond
> bursting support?
>
>
There are a few other custom things. For example, there would be cases of
short and medium term blacklisting of topics based on breach of rate
limiter beyond a given SOP. I feel this is very very specific to our
organization right now to be included inside pulsar itself.


> One way to workaround the multiplexing problem would be to add a client
> side option for producers and consumers, where you could specify that the
> client picks a separate TCP/IP connection that is not shared and isn't from
> the connection pool.
> Preventing connection multiplexing seems to be the only way to make the
> current rate limiting deterministic and stable without adding the explicit
> flow control to the Pulsar binary protocol for producers.
>

Actually, by default, and for 99% of the cases, multiplexing isn't an issue
assuming:
* A single producer object is producing to a single topic (one or more
partition)
* Produce is happening in a round robin manner (by default)

Due to these assumptions, it is more than likely that all partitions are
doing uniform QPS and MBps, so toggling auto-read off at the netty layer
doesn't have that drastic an impact on the rate limiting aspect.



>
> Are there other community members with input on the design and
> implementation of an improved rate limiter?
> I’m eager to continue this conversation and work together towards a robust
> solution.
>

Again, I would love this to land in pieces so that TAT for actual usage is
much faster. What do you suggest from that perspective?


>
> -Lari
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Replies inline

On Fri, 3 Nov 2023 at 20:48, Girish Sharma <sc...@gmail.com> wrote:

> Could you please elaborate more on these details? Here are some questions:
> > 1. What do you mean that it is too strict?
> >     - Should the rate limiting allow bursting over the limit for some
> time?
> >
>
> That's one of the major use cases, yes.
>

One possibility would be to improve the existing rate limiter to allow
bursting.
I think that Pulsar's out-of-the-box rate limiter should cover 99% of the
use cases instead of having one implementing their own rate limiter
algorithm.
The problems you are describing seem to be common to many Pulsar use cases,
and therefore, I think they should be handled directly in Pulsar.

Optimally, there would be a single solution that abstracts the rate
limiting in a way where it does the right thing based on the declarative
configuration.
I would prefer that over having a pluggable solution for rate limiter
implementations.

What would help is getting deeper in the design of the rate limiter itself,
without limiting ourselves to the existing rate limiter implementation in
Pulsar.

In textbooks, there are algorithms such as "leaky bucket" [1] and "token
bucket" [2]. Both algorithms have several variations and in some ways they
are very similar algorithms but looking from the different point of view.
It would possibly be easier to conceptualize and understand a rate limiting
algorithm if common algorithm names and implementation choices mentioned in
textbooks would be referenced in the implementation.
It seems that a "token bucket" type of algorithm can be used to implement
rate limiting with bursting. In the token bucket algorithm, the size of the
token bucket defines how large bursts will be allowed. The design could
also be something where 2 rate limiters with different type of algorithms
and/or configuration parameters are combined to achieve a desired behavior.
For example, to achieve a rate limiter with bursting and a fixed maximum
rate.
By default, the token bucket algorithm doesn't enforce a maximum rate for
bursts, but that could be achieved by chaining 2 rate limiters if that is
really needed.
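
As a rough illustration of the chaining idea (hypothetical code,
assuming a simple tryConsume-style token bucket; the names are made up
and this is not existing Pulsar code):

// The first bucket enforces the long-term average rate with a large burst
// capacity; the second caps the absolute maximum rate with a small capacity.
interface RateLimiterBucket {
    boolean tryConsume(long permits);
}

class DualTokenBucketLimiter {
    private final RateLimiterBucket averageRateBucket; // e.g. 100 MB/s, large capacity
    private final RateLimiterBucket absoluteMaxBucket; // e.g. 150 MB/s, small capacity

    DualTokenBucketLimiter(RateLimiterBucket averageRateBucket,
                           RateLimiterBucket absoluteMaxBucket) {
        this.averageRateBucket = averageRateBucket;
        this.absoluteMaxBucket = absoluteMaxBucket;
    }

    // A publish is allowed only if both buckets have tokens. A real
    // implementation would also avoid consuming from the first bucket when
    // the second rejects (by checking both first, or refunding); that is
    // omitted here for brevity.
    boolean tryConsume(long permits) {
        return averageRateBucket.tryConsume(permits)
                && absoluteMaxBucket.tryConsume(permits);
    }
}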

The current Pulsar rate limiter implementation could be implemented in a
cleaner way, which would also be more efficient. Instead of having a
scheduler call a method once per second, I think that the rate limiter
could be implemented in a reactive way where the algorithm is implemented
without a scheduler.
I wonder if there are others that would be interested in getting down into
such implementation details?

1 - https://en.wikipedia.org/wiki/Leaky_bucket
2 - https://en.wikipedia.org/wiki/Token_bucket


> 2. What type of data loss are you experiencing?
> >
>
> Messages produced by the producers which eventually get timed out due to
> rate limiting.
>

Are you able to slow down producing on the client side? If that is
possible, there could be ways to improve ways to do client side back
pressure with Pulsar Client. Currently, the client doesn't expose this
information until the sending blocks or fails with an exception
(ProducerQueueIsFullError). Optimally, the client should slow down the rate
of producing to the rate that it can actually send to the broker.
Just curious if you have considered turning off producing timeouts on the
client side completely or making them longer? Would that address the data
loss problem?
Or is your event/message source "hot" so that you cannot stop or slow it
down, and it will just keep on flowing with a certain rate?

> I think the core implementation of how the broker fails fast at the time
> of rate limiting (whether it is by pausing netty channel or a new permits
> based model) does not change the actual issue I am targeting. Multiplexing
> has some impact on it - but yet again only limited, and can easily be fixed
> by the client by increasing the connections per broker. Even after assuming
> both these things are somehow "fixed", the fact remains that an absolutely
> strict rate limiter will lead to the above mentioned data loss for burst
> going above the limit and that a poller based rate limiter doesn't really
> rate limit anything as it allows all produce in the first interval of the
> next second.
>

Yes, it makes sense to have bursting configuration parameters in the rate
limiter.
As mentioned above, I think we could be improving the existing rate limiter
in Pulsar to cover 99% of the use case by making it stable and by including
the bursting configuration options.
Is there additional functionality you feel the rate limiter needs beyond
bursting support?

One way to workaround the multiplexing problem would be to add a client
side option for producers and consumers, where you could specify that the
client picks a separate TCP/IP connection that is not shared and isn't from
the connection pool.
Preventing connection multiplexing seems to be the only way to make the
current rate limiting deterministic and stable without adding the explicit
flow control to the Pulsar binary protocol for producers.

Are there other community members with input on the design and
implementation of an improved rate limiter?
I’m eager to continue this conversation and work together towards a robust
solution.

-Lari

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari, replies inline.


On Fri, Nov 3, 2023 at 11:13 PM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> Thanks for the questions. I'll reply to them
>
> > does this sharing of the same tcp/ip connection happen across partitions
> as
> > well (assuming both the partitions of the topic are on the same broker)?
> > i.e. producer 127.0.0.1 for partition
> > `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> > partition `persistent://tenant/ns/topic0-partition1` share the same
> tcp/ip
> > connection assuming both are on broker-0 ?
>
> The Pulsar Java client would be sharing the same TCP/IP connection to a
> single broker when using the default setting of connectionsPerBroker = 1.
> It could be using a different connection if connectionsPerBroker > 1.
>
> Thanks for clarifying this.

Could you please elaborate more on these details? Here are some questions:
> 1. What do you mean that it is too strict?
>     - Should the rate limiting allow bursting over the limit for some time?
>

That's one of the major use cases, yes.


> 2. What type of data loss are you experiencing?
>

Messages produced by the producers which eventually get timed out due to
rate limiting.


> 3. What is the root cause of the data loss?
>    - Do you mean that the system performance degrades and data loss is due
> to not being able to produce from client to the broker quickly enough and
> data loss happens because messages cannot be forwarded from the client to
> the broker?
>

No, the system performance decreases in the case of poller based rate
limiters. In the precise one, it's purely the broker pausing the netty
channel's auto-read property. If the producer goes beyond the set
throughput for longer than the send timeout, it starts observing
timeouts, and the timed-out messages are essentially lost.
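
For illustration, that broker-side pause is conceptually just toggling
Netty's autoRead flag on the connection; a simplified sketch (not the
actual broker code):

import io.netty.channel.Channel;

// Simplified illustration of connection-level throttling via Netty's
// autoRead; not the actual Pulsar broker code.
class ConnectionThrottle {
    void pause(Channel channel) {
        // Stop reading further commands from this TCP connection. Every
        // producer and consumer multiplexed on the connection is paused,
        // which is why connection sharing makes rate limiting coarse-grained.
        channel.config().setAutoRead(false);
    }

    void resume(Channel channel) {
        channel.config().setAutoRead(true);
    }
}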

As mentioned in my previous email, there has been discussions about
> improving producer flow control. One of the solution ideas that was
> discussed in a Pulsar community meeting in January was to add explicit flow
> control to producers, somewhat similar to how there are "permits" as the
> flow control for consumers. The permits would be based on byte size
> (instead of number of messages). With explicit flow control in the
> protocol, the rate limiting will also be effective and deterministic and
> the issues that Tao Jiuming was explaining could also be resolved. It also
> would solve the producer/consumer multiplexing on a single TCP/IP
> connection when flow control and rate limiting isn't based on the TCP/IP
> level (and toggling the Netty channel's auto read property).
>
> I think the core implementation of how the broker fails fast at the time
of rate limiting (whether it is by pausing netty channel or a new permits
based model) does not change the actual issue I am targeting. Multiplexing
has some impact on it - but yet again only limited, and can easily be fixed
by the client by increasing the connections per broker. Even after assuming
both these things are somehow "fixed", the fact remains that an absolutely
strict rate limiter will lead to the above mentioned data loss for burst
going above the limit and that a poller based rate limiter doesn't really
rate limit anything as it allows all produce in the first interval of the
next second.


> Let's continue discussion, since I think that this is an important
> improvement area. Together we could find a good solution that works for
> multiple use cases and addresses existing challenges in producer flow
> control and rate limiting.
>
> -Lari
>
> On 2023/11/03 11:16:37 Girish Sharma wrote:
> > Hello Lari,
> > Thanks for bringing this to my attention. I went through the links, but
> > does this sharing of the same tcp/ip connection happen across partitions
> as
> > well (assuming both the partitions of the topic are on the same broker)?
> > i.e. producer 127.0.0.1 for partition
> > `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> > partition `persistent://tenant/ns/topic0-partition1` share the same
> tcp/ip
> > connection assuming both are on broker-0 ?
> >
> > In general, the major use case behind this PIP for me and my organization
> > is about supporting produce spikes. We do not want to allocate absolute
> > maximum throughput for a topic which would not even be utilized 99.99% of
> > the time. Thus, for a topic that stays constantly at 100MBps and goes to
> > 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth
> of
> > resources 100% of the time. The poller based rate limiter is also not
> good
> > here as it would allow over use of hardware without a check, degrading
> the
> > system.
> >
> > @Asif, I have been sick these last 10 days, but will be updating the PIP
> > with the discussed changes early next week.
> >
> > Regards
> >
> > On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari <lh...@apache.org> wrote:
> >
> > > Hi Girish,
> > >
> > > In order to address your problem described in the PIP document [1], it
> > > might be necessary to make improvements in how rate limiters apply
> > > backpressure in Pulsar.
> > >
> > > Pulsar uses mainly TCP/IP connection level controls for achieving
> > > backpressure. The challenge is that Pulsar can share a single TCP/IP
> > > connection across multiple producers and consumers. Because of this,
> there
> > > could be multiple producers and consumers and rate limiters operating
> on
> > > the same connection on the broker, and they will do conflicting
> decisions,
> > > which results in undesired behavior.
> > >
> > > Regarding the shared TCP/IP connection backpressure issue, Apache Flink
> > > had a somewhat similar problem before Flink 1.5. It is described in the
> > > "inflicting backpressure" section of this blog post from 2019:
> > >
> > >
> https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
> > > Flink solved the issue of multiplexing multiple streams of data on a
> > > single TCP/IP connection in Flink 1.5 by introducing it's own flow
> control
> > > mechanism.
> > >
> > > The backpressure and rate limiting challenges have been discussed a few
> > > times in Pulsar community meetings over the past years. There was also
> a
> > > generic backpressure thread on the dev mailing list [2] in Sep 2022.
> > > However, we haven't really documented Pulsar's backpressure design and
> how
> > > rate limiters are part of the overall solution and how we could
> improve.
> > > I think it might be time to do so since there's a requirement to
> improve
> > > rate limiting. I guess that's the main motivation also for PIP-310.
> > >
> > > -Lari
> > >
> > > 1 - https://github.com/apache/pulsar/pull/21399/files
> > > 2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271
> > >
> > > On 2023/10/19 12:51:14 Girish Sharma wrote:
> > > > Hi,
> > > > Currently, there are only 2 kinds of publish rate limiters - polling
> > > based
> > > > and precise. Users have an option to use either one of them in the
> topic
> > > > publish rate limiter, but the resource group rate limiter only uses
> > > polling
> > > > one.
> > > >
> > > > There are challenges with both the rate limiters and the fact that we
> > > can't
> > > > use precise rate limiter in the resource group level.
> > > >
> > > > Thus, in order to support custom rate limiters, I've created the
> PIP-310
> > > >
> > > > This is the discussion thread. Please go through the PIP and provide
> your
> > > > inputs.
> > > >
> > > > Link - https://github.com/apache/pulsar/pull/21399
> > > >
> > > > Regards
> > > > --
> > > > Girish Sharma
> > > >
> > >
> >
> >
> > --
> > Girish Sharma
> >
>


-- 
Girish Sharma

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Lari Hotari <lh...@apache.org>.
Hi Girish,

Thanks for the questions. I'll reply to them

> does this sharing of the same tcp/ip connection happen across partitions as
> well (assuming both the partitions of the topic are on the same broker)?
> i.e. producer 127.0.0.1 for partition
> `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> partition `persistent://tenant/ns/topic0-partition1` share the same tcp/ip
> connection assuming both are on broker-0 ?

The Pulsar Java client would be sharing the same TCP/IP connection to a single broker when using the default setting of connectionsPerBroker = 1. It could be using a different connection if connectionsPerBroker > 1.
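
For reference, connectionsPerBroker is set on the client builder; a minimal
example (the service URL is just a placeholder):

import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class ClientConnectionExample {
    public static void main(String[] args) throws PulsarClientException {
        // Raising connectionsPerBroker makes it less likely that unrelated
        // producers and consumers share one TCP/IP connection (default is 1).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .connectionsPerBroker(4)
                .build();
        client.close();
    }
}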

> In general, the major use case behind this PIP for me and my organization
> is about supporting produce spikes. We do not want to allocate absolute
> maximum throughput for a topic which would not even be utilized 99.99% of
> the time. Thus, for a topic that stays constantly at 100MBps and goes to
> 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
> resources 100% of the time. The poller based rate limiter is also not good
> here as it would allow over use of hardware without a check, degrading the
> system.

I'm trying to understand your use case better. 
In the PIP-310 document it says:
> Precise rate limiter fixes the above two issues, but introduces another challenge - the rate 
> limiting is too strict.
> * This leads potential for data loss in case there are sudden spikes in produce and client side
>    produce queue breaches.
> * The produce latencies increase exponentially in case produce breaches the set throughput
>    even for small windows.

Could you please elaborate more on these details? Here are some questions:
1. What do you mean that it is too strict? 
    - Should the rate limiting allow bursting over the limit for some time?
2. What type of data loss are you experiencing? 
3. What is the root cause of the data loss?
   - Do you mean that the system performance degrades and data loss is due to not being able to produce from client to the broker quickly enough and data loss happens because messages cannot be forwarded from the client to the broker?

Once there's a common understanding of the problem, it's easier to design the solution together in the Pulsar community. One possibility is that PIP-310 already solves your problem. Another possibility is that we need to improve producer flow control. I currently feel that it is needed, but I might be wrong.

As mentioned in my previous email, there has been discussions about improving producer flow control. One of the solution ideas that was discussed in a Pulsar community meeting in January was to add explicit flow control to producers, somewhat similar to how there are "permits" as the flow control for consumers. The permits would be based on byte size (instead of number of messages). With explicit flow control in the protocol, the rate limiting will also be effective and deterministic and the issues that Tao Jiuming was explaining could also be resolved. It also would solve the producer/consumer multiplexing on a single TCP/IP connection when flow control and rate limiting isn't based on the TCP/IP level (and toggling the Netty channel's auto read property).
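
As a purely hypothetical sketch of the byte-based permit idea (no such protocol command exists today; the names are made up), the producer-side accounting could look roughly like this:

// Hypothetical sketch of producer-side flow control based on byte permits;
// illustrates the idea only, not an existing protocol or client API.
class ProducerBytePermits {
    private long permitBytes; // bytes the broker has allowed us to send

    // Called when the broker grants more byte permits.
    synchronized void onPermitsGranted(long bytes) {
        permitBytes += bytes;
        notifyAll();
    }

    // Block the send path until enough byte permits are available.
    synchronized void acquire(long messageBytes) throws InterruptedException {
        while (permitBytes < messageBytes) {
            wait();
        }
        permitBytes -= messageBytes;
    }
}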

Let's continue discussion, since I think that this is an important improvement area. Together we could find a good solution that works for multiple use cases and addresses existing challenges in producer flow control and rate limiting. 

-Lari

On 2023/11/03 11:16:37 Girish Sharma wrote:
> Hello Lari,
> Thanks for bringing this to my attention. I went through the links, but
> does this sharing of the same tcp/ip connection happen across partitions as
> well (assuming both the partitions of the topic are on the same broker)?
> i.e. producer 127.0.0.1 for partition
> `persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
> partition `persistent://tenant/ns/topic0-partition1` share the same tcp/ip
> connection assuming both are on broker-0 ?
> 
> In general, the major use case behind this PIP for me and my organization
> is about supporting produce spikes. We do not want to allocate absolute
> maximum throughput for a topic which would not even be utilized 99.99% of
> the time. Thus, for a topic that stays constantly at 100MBps and goes to
> 150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
> resources 100% of the time. The poller based rate limiter is also not good
> here as it would allow over use of hardware without a check, degrading the
> system.
> 
> @Asif, I have been sick these last 10 days, but will be updating the PIP
> with the discussed changes early next week.
> 
> Regards
> 
> On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari <lh...@apache.org> wrote:
> 
> > Hi Girish,
> >
> > In order to address your problem described in the PIP document [1], it
> > might be necessary to make improvements in how rate limiters apply
> > backpressure in Pulsar.
> >
> > Pulsar uses mainly TCP/IP connection level controls for achieving
> > backpressure. The challenge is that Pulsar can share a single TCP/IP
> > connection across multiple producers and consumers. Because of this, there
> > could be multiple producers and consumers and rate limiters operating on
> > the same connection on the broker, and they will do conflicting decisions,
> > which results in undesired behavior.
> >
> > Regarding the shared TCP/IP connection backpressure issue, Apache Flink
> > had a somewhat similar problem before Flink 1.5. It is described in the
> > "inflicting backpressure" section of this blog post from 2019:
> >
> > https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
> > Flink solved the issue of multiplexing multiple streams of data on a
> > single TCP/IP connection in Flink 1.5 by introducing it's own flow control
> > mechanism.
> >
> > The backpressure and rate limiting challenges have been discussed a few
> > times in Pulsar community meetings over the past years. There was also a
> > generic backpressure thread on the dev mailing list [2] in Sep 2022.
> > However, we haven't really documented Pulsar's backpressure design and how
> > rate limiters are part of the overall solution and how we could improve.
> > I think it might be time to do so since there's a requirement to improve
> > rate limiting. I guess that's the main motivation also for PIP-310.
> >
> > -Lari
> >
> > 1 - https://github.com/apache/pulsar/pull/21399/files
> > 2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271
> >
> > On 2023/10/19 12:51:14 Girish Sharma wrote:
> > > Hi,
> > > Currently, there are only 2 kinds of publish rate limiters - polling
> > based
> > > and precise. Users have an option to use either one of them in the topic
> > > publish rate limiter, but the resource group rate limiter only uses
> > polling
> > > one.
> > >
> > > There are challenges with both the rate limiters and the fact that we
> > can't
> > > use precise rate limiter in the resource group level.
> > >
> > > Thus, in order to support custom rate limiters, I've created the PIP-310
> > >
> > > This is the discussion thread. Please go through the PIP and provide your
> > > inputs.
> > >
> > > Link - https://github.com/apache/pulsar/pull/21399
> > >
> > > Regards
> > > --
> > > Girish Sharma
> > >
> >
> 
> 
> -- 
> Girish Sharma
> 

Re: [DISCUSS] PIP-310: Support custom publish rate limiters

Posted by Girish Sharma <sc...@gmail.com>.
Hello Lari,
Thanks for bringing this to my attention. I went through the links, but
does this sharing of the same tcp/ip connection happen across partitions as
well (assuming both the partitions of the topic are on the same broker)?
i.e. producer 127.0.0.1 for partition
`persistent://tenant/ns/topic0-partition0` and producer 127.0.0.1 for
partition `persistent://tenant/ns/topic0-partition1` share the same tcp/ip
connection assuming both are on broker-0 ?

In general, the major use case behind this PIP for me and my organization
is about supporting produce spikes. We do not want to allocate absolute
maximum throughput for a topic which would not even be utilized 99.99% of
the time. Thus, for a topic that stays constantly at 100MBps and goes to
150MBps only once in a blue moon, it's unwise to allocate 150MBps worth of
resources 100% of the time. The poller based rate limiter is also not good
here as it would allow over use of hardware without a check, degrading the
system.

@Asif, I have been sick these last 10 days, but will be updating the PIP
with the discussed changes early next week.

Regards

On Fri, Nov 3, 2023 at 3:25 PM Lari Hotari <lh...@apache.org> wrote:

> Hi Girish,
>
> In order to address your problem described in the PIP document [1], it
> might be necessary to make improvements in how rate limiters apply
> backpressure in Pulsar.
>
> Pulsar uses mainly TCP/IP connection level controls for achieving
> backpressure. The challenge is that Pulsar can share a single TCP/IP
> connection across multiple producers and consumers. Because of this, there
> could be multiple producers and consumers and rate limiters operating on
> the same connection on the broker, and they will do conflicting decisions,
> which results in undesired behavior.
>
> Regarding the shared TCP/IP connection backpressure issue, Apache Flink
> had a somewhat similar problem before Flink 1.5. It is described in the
> "inflicting backpressure" section of this blog post from 2019:
>
> https://flink.apache.org/2019/06/05/flink-network-stack.html#inflicting-backpressure-1
> Flink solved the issue of multiplexing multiple streams of data on a
> single TCP/IP connection in Flink 1.5 by introducing it's own flow control
> mechanism.
>
> The backpressure and rate limiting challenges have been discussed a few
> times in Pulsar community meetings over the past years. There was also a
> generic backpressure thread on the dev mailing list [2] in Sep 2022.
> However, we haven't really documented Pulsar's backpressure design and how
> rate limiters are part of the overall solution and how we could improve.
> I think it might be time to do so since there's a requirement to improve
> rate limiting. I guess that's the main motivation also for PIP-310.
>
> -Lari
>
> 1 - https://github.com/apache/pulsar/pull/21399/files
> 2 - https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271
>
> On 2023/10/19 12:51:14 Girish Sharma wrote:
> > Hi,
> > Currently, there are only 2 kinds of publish rate limiters - polling
> based
> > and precise. Users have an option to use either one of them in the topic
> > publish rate limiter, but the resource group rate limiter only uses
> polling
> > one.
> >
> > There are challenges with both the rate limiters and the fact that we
> can't
> > use precise rate limiter in the resource group level.
> >
> > Thus, in order to support custom rate limiters, I've created the PIP-310
> >
> > This is the discussion thread. Please go through the PIP and provide your
> > inputs.
> >
> > Link - https://github.com/apache/pulsar/pull/21399
> >
> > Regards
> > --
> > Girish Sharma
> >
>


-- 
Girish Sharma