Posted to dev@kafka.apache.org by Gokul Ramanan Subramanian <go...@gmail.com> on 2020/04/01 15:28:16 UTC

[DISCUSS] KIP-578: Add configuration to limit number of partitions

Hi.

I have opened KIP-578, intended to provide a mechanism to limit the number
of partitions in a Kafka cluster. Kindly provide feedback on the KIP which
you can find at
https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions

I want to especially thank Stanislav Kozlovski, who helped formulate
some aspects of the KIP.

Many thanks,

Gokul.

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Alex,

904. For multi-tenancy, yes, the limits in this KIP will apply on a
first-come, first-served basis. If and when multi-tenancy-based limiting
rolls out, it would probably introduce a different way to specify
per-client limits on partitions. I suppose those will apply in addition to,
not instead of, the broker-level limits.

905. Yes, it is possible to have a scenario where the brokers in only one
rack have partition capacity left (as imposed by KIP-578), and therefore
all replicas for a new topic partition get assigned to that rack. In
practice, though, this would only be a problem if the limit were set too
low. The partition limits should be set high enough that a cluster nearing
them is already at risk of availability loss, so I would treat replica
assignment near the partition limits as an edge case that should rarely
happen in practice. Further, if a replica reassignment is not possible
because the plan is considered invalid in light of the limits, and the
administrator believes the limit is too low, they can increase it (just
like any other dynamic config) and retry the reassignment, as in the
sketch below.
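
For illustration, here is a minimal sketch of how an administrator might
raise the cluster-wide default of the proposed max.broker.partitions config
through the standard Admin API before retrying the reassignment. The config
name is the one proposed in this KIP; the bootstrap address and the new
value are placeholders.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RaisePartitionLimit {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // "" selects the cluster-wide default for broker configs.
                ConfigResource allBrokers =
                    new ConfigResource(ConfigResource.Type.BROKER, "");
                // max.broker.partitions is the dynamic config proposed in KIP-578.
                AlterConfigOp raiseLimit = new AlterConfigOp(
                    new ConfigEntry("max.broker.partitions", "2000"),
                    AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Collections.singletonMap(
                        allBrokers, Collections.singletonList(raiseLimit)))
                    .all().get();
            }
        }
    }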

906. The case of internal topics and auto-topic creation will be addressed
by KIP-590, which will modify the broker to send a request to the
controller for topic creation rather than writing directly to ZooKeeper.
The fact that we don't handle internal topics is not a feature, it is a
"bug". Therefore, there is no plan to support allowlists/denylists of
topics that can simply bypass the partition limits.

907. If by update you mean a Kafka version upgrade, I have explained this
in the migration plan. Basically, if a broker already has more partitions
than the limit at the point when the limit is applied, the broker will
continue to work as expected, i.e. it will not check that it exceeds the
limit and crash. Instead, it will simply start denying new requests to
create partitions. In other words, having as many partitions as the limit
behaves the same as having more partitions than the limit.
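
To make that concrete, here is a purely illustrative sketch of the check;
the class and method names are hypothetical, not actual broker code. The
point is that the validation runs only on partition-creation paths, so a
broker that already exceeds the limit keeps serving its existing partitions
and merely rejects further creations. It throws PolicyViolationException
only because reusing that exception is discussed later in this thread; the
KIP may settle on a different error.

    import org.apache.kafka.common.errors.PolicyViolationException;

    final class PartitionLimitCheck {
        // Hypothetical helper, for illustration only.
        static void validateNewPartitions(int currentPartitionsOnBroker,
                                          int requestedNewPartitions,
                                          int maxBrokerPartitions) {
            // Invoked only when partitions are being created; existing
            // assignments are never re-validated, so an over-limit broker
            // keeps working and simply cannot take on more partitions.
            if (currentPartitionsOnBroker + requestedNewPartitions > maxBrokerPartitions) {
                throw new PolicyViolationException(
                    "Creating " + requestedNewPartitions + " partition(s) would exceed "
                        + "max.broker.partitions=" + maxBrokerPartitions);
            }
        }
    }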

On Fri, Jun 12, 2020 at 10:57 AM Alexandre Dupriez <
alexandre.dupriez@gmail.com> wrote:

> Hi Gokul,
>
> Thank you for the answers and the data provided to illustrate the use case.
> A couple of additional questions.
>
> 904. If multi-tenancy is addressed in a future KIP, how smooth would
> be the upgrade path? For example, the introduced configuration
> parameters still apply, right? We would still maintain a first-come
> first-served pattern when topics are created?
>
> 905. The current built-in assignment tool prioritises balance between
> racks over brokers. In the version you propose, the limit on partition
> count would take precedence over attempts to balance between racks.
> Could it lead to a situation where it results in all partitions of a
> topic being assigned in a single data center, if brokers in other
> racks are "full"? Since it can potentially weaken the availability
> guarantees for that topic (and maybe durability and/or consumer
> performance with additional cross-rack traffic), how would we want to
> handle the case? It may be worth warning users that the resulting
> guarantees differ from what is offered by an "unlimited" assignment
> plan in such cases? Also, let's keep in mind that some plans generated
> by existing rebalancing tools could become invalid (w.r.t to the
> configured limits).
>
> 906. The limits do not apply to internal topics. What about
> framework-level topics from other tools and extensions? (connect,
> streams, confluent metrics, tiered storage, etc.) Is blacklisting
> possible?
>
> 907. What happens if one of the dynamic limit is violated at update
> time? (sorry if it's already explained in the KIP, may have missed it)
>
> Thanks,
> Alexandre
>
> Le dim. 3 mai 2020 à 20:20, Gokul Ramanan Subramanian
> <go...@gmail.com> a écrit :
> >
> > Thanks Stanislav. Apologies about the long absence from this thread.
> >
> > I would prefer having per-user max partition limits in a separate KIP. I
> > don't see this as an MVP for this KIP. I will add this as an alternative
> > approach into the KIP.
> >
> > I was in a double mind about whether or not to impose the partition limit
> > for internal topics as well. I can be convinced both ways. On the one
> hand,
> > internal topics should be purely internal i.e. users of a cluster should
> > not have to care about them. In this sense, the partition limit should
> not
> > apply to internal topics. On the other hand, Kafka allows configuring
> > internal topics by specifying their replication factor etc. Therefore,
> they
> > don't feel all that internal to me. In any case, I'll modify the KIP to
> > exclude internal topics.
> >
> > I'll also add to the KIP the alternative approach Tom suggested around
> > using topic policies to limit partitions, and explain why it does not
> help
> > to solve the problem that the KIP is trying to address (as I have done
> in a
> > previous correspondence on this thread).
> >
> > Cheers.
> >
> > On Fri, Apr 24, 2020 at 4:24 PM Stanislav Kozlovski <
> stanislav@confluent.io>
> > wrote:
> >
> > > Thanks for the KIP, Gokul!
> > >
> > > I like the overall premise - I think it's more user-friendly to have
> > > configs for this than to have users implement their own config policy
> -> so
> > > unless it's very complex to implement, it seems worth it.
> > > I agree that having the topic policy on the CreatePartitions path makes
> > > sense as well.
> > >
> > > Multi-tenancy was a good point. It would be interesting to see how
> easy it
> > > is to extend the max partition limit to a per-user basis. Perhaps this
> can
> > > be done in a follow-up KIP, as a natural extension of the feature.
> > >
> > > I'm wondering whether there's a need to enforce this on internal
> topics,
> > > though. Given they're internal and critical to the function of Kafka, I
> > > believe we'd rather always ensure they're created, regardless if over
> some
> > > user-set limit. It brings up the question of forward compatibility -
> what
> > > happens if a user's cluster is at the maximum partition capacity, yet
> a new
> > > release of Kafka introduces a new topic (e.g KIP-500)?
> > >
> > > Best,
> > > Stanislav
> > >
> > > On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
> > > gokul2411s@gmail.com> wrote:
> > >
> > > > Hi Tom.
> > > >
> > > > With KIP-578, we are not trying to model the load on each partition,
> and
> > > > come up with an exact limit on what the cluster or broker can handle
> in
> > > > terms of number of partitions. We understand that not all partitions
> are
> > > > equal, and the actual load per partition varies based on the message
> > > size,
> > > > throughput, whether the broker is a leader for that partition or not
> etc.
> > > >
> > > > What we are trying to achieve with KIP-578 is to disallow a
> pathological
> > > > number of partitions that will surely put the cluster in bad shape.
> For
> > > > example, in KIP-578's appendix, we have described a case where we
> could
> > > not
> > > > delete a topic with 30k partitions, because the brokers could not
> > > > handle all the work that needed to be done. We have also described
> how
> > > > a producer performance test with 10k partitions observed basically 0
> > > > throughput. In these cases, having a limit on number of partitions
> > > > would allow the cluster to produce a graceful error message at topic
> > > > creation time, and prevent the cluster from entering a pathological
> > > state.
> > > > These are not just hypotheticals. We definitely see many of these
> > > > pathological cases happen in production, and we would like to avoid
> them.
> > > >
> > > > The actual limit on number of partitions is something we do not want
> to
> > > > suggest in the KIP. The limit will depend on various tests that
> owners of
> > > > their clusters will have to perform, including perf tests,
> identifying
> > > > topic creation / deletion times, etc. For example, the tests we did
> for
> > > the
> > > > KIP-578 appendix were enough to convince us that we should not have
> > > > anywhere close to 10k partitions on the setup we describe there.
> > > >
> > > > What we want to do with KIP-578 is provide the flexibility to set a
> limit
> > > > on number of partitions based on tests cluster owners choose to
> perform.
> > > > Cluster owners can do the tests however often they wish and
> dynamically
> > > > adjust the limit on number of partitions. For example, we found in
> our
> > > > production environment that we don't want to have more than 1k
> partitions
> > > > on an m5.large EC2 broker instances, or more than 300 partitions on a
> > > > t3.medium EC2 broker, for typical produce / consume use cases.
> > > >
> > > > Cluster owners are free to not configure the limit on number of
> > > partitions
> > > > if they don't want to spend the time coming up with a limit. The
> limit
> > > > defaults to INT32_MAX, which is basically infinity in this context,
> and
> > > > should be practically backwards compatible with current behavior.
> > > >
> > > > Further, the limit on number of partitions should not come in the
> way of
> > > > rebalancing tools under normal operation. For example, if the
> partition
> > > > limit per broker is set to 1k, unless the number of partitions comes
> > > close
> > > > to 1k, there should be no impact on rebalancing tools. Only when the
> > > number
> > > > of partitions comes close to 1k, will rebalancing tools be impacted,
> but
> > > at
> > > > that point, the cluster is already at its limit of functioning (per
> some
> > > > definition that was used to set the limit in the first place).
> > > >
> > > > Finally, I want to end this long email by suggesting that the
> partition
> > > > assignment algorithm itself does not consider the load on various
> > > > partitions before assigning partitions to brokers. In other words, it
> > > > treats all partitions as equal. The idea of having a limit on number
> of
> > > > partitions is not mis-aligned with this tenet.
> > > >
> > > > Thanks.
> > > >
> > > > On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com>
> wrote:
> > > >
> > > > > Hi Gokul,
> > > > >
> > > > > the partition assignment algorithm needs to be aware of the
> partition
> > > > > > limits.
> > > > > >
> > > > >
> > > > > I agree, if you have limits then anything doing reassignment would
> need
> > > > > some way of knowing what they were. But the thing is that I'm not
> > > really
> > > > > sure how you would decide what the limits ought to be.
> > > > >
> > > > >
> > > > > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > > > > > with 10, 20 and 30 partitions each respectively, and a limit of
> 40
> > > > > > partitions on each broker enforced via a configurable policy
> class
> > > (the
> > > > > one
> > > > > > you recommended). While the policy class may accept a topic
> creation
> > > > > > request for 11 partitions with a replication factor of 2 each
> > > (because
> > > > it
> > > > > > is satisfiable), the non-pluggable partition assignment
> algorithm (in
> > > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > > know
> > > > > not
> > > > > > to assign the 11th partition to broker 3 because it would run
> out of
> > > > > > partition capacity otherwise.
> > > > > >
> > > > >
> > > > > I know this is only a toy example, but I think it also serves to
> > > > illustrate
> > > > > my point above. How has a limit of 40 partitions been arrived at?
> In
> > > real
> > > > > life different partitions will impart a different load on a broker,
> > > > > depending on all sorts of factors (which topics they're for, the
> > > > throughput
> > > > > and message size for those topics, etc). By saying that a broker
> should
> > > > not
> > > > > have more than 40 partitions assigned I think you're making a big
> > > > > assumption that all partitions have the same weight. You're also
> > > limiting
> > > > > the search space for finding an acceptable assignment. Cluster
> > > balancers
> > > > > usually use some kind of heuristic optimisation algorithm for
> figuring
> > > > out
> > > > > assignments of partitions to brokers, and it could be that the
> best (or
> > > > at
> > > > > least a good enough) solution requires assigning the least loaded
> 41
> > > > > partitions to one broker.
> > > > >
> > > > > The point I'm trying to make here is whatever limit is chosen it's
> > > > probably
> > > > > been chosen fairly arbitrarily. Even if it's been chosen based on
> some
> > > > > empirical evidence of how a particular cluster behaves it's likely
> that
> > > > > that evidence will become obsolete as the cluster evolves to serve
> the
> > > > > needs of the business running it (e.g. some hot topic gets
> > > repartitioned,
> > > > > messages get compressed with some new algorithm, some new topics
> need
> > > to
> > > > be
> > > > > created). For this reason I think the problem you're trying to
> solve
> > > via
> > > > > policy (whether that was implemented in a pluggable way or not) is
> > > really
> > > > > better solved by automating the cluster balancing and having that
> > > cluster
> > > > > balancer be able to reason about when the cluster has too few
> brokers
> > > for
> > > > > the number of partitions, rather than placing some limit on the
> sizing
> > > > and
> > > > > shape of the cluster up front and then hobbling the cluster
> balancer to
> > > > > work within that.
> > > > >
> > > > > I think it might be useful to describe in the KIP how users would
> be
> > > > > expected to arrive at values for these configs (both on day 1 and
> in an
> > > > > evolving production cluster), when this solution might be better
> than
> > > > using
> > > > > a cluster balancer and/or why cluster balancers can't be trusted to
> > > avoid
> > > > > overloading brokers.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > > >
> > > > > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > > > > gokul2411s@gmail.com> wrote:
> > > > >
> > > > > > This is good reference Tom. I did not consider this approach at
> all.
> > > I
> > > > am
> > > > > > happy to learn about it now.
> > > > > >
> > > > > > However, I think that partition limits are not "yet another"
> policy
> > > > > > configuration. Instead, they are fundamental to partition
> assignment.
> > > > > i.e.
> > > > > > the partition assignment algorithm needs to be aware of the
> partition
> > > > > > limits. To illustrate this, imagine that you have 3 brokers (1,
> 2 and
> > > > 3),
> > > > > > with 10, 20 and 30 partitions each respectively, and a limit of
> 40
> > > > > > partitions on each broker enforced via a configurable policy
> class
> > > (the
> > > > > one
> > > > > > you recommended). While the policy class may accept a topic
> creation
> > > > > > request for 11 partitions with a replication factor of 2 each
> > > (because
> > > > it
> > > > > > is satisfiable), the non-pluggable partition assignment
> algorithm (in
> > > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > > know
> > > > > not
> > > > > > to assign the 11th partition to broker 3 because it would run
> out of
> > > > > > partition capacity otherwise.
> > > > > >
> > > > > > To achieve the ideal end that you are imagining (and I can
> totally
> > > > > > understand where you are coming from vis-a-vis the extensibility
> of
> > > > your
> > > > > > solution wrt the one in the KIP), that would require extracting
> the
> > > > > > partition assignment logic itself into a pluggable class, and for
> > > which
> > > > > we
> > > > > > could provide a custom implementation. I am afraid that would add
> > > > > > complexity that I am not sure we want to undertake.
> > > > > >
> > > > > > Do you see sense in what I am saying?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tbentley@redhat.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Gokul,
> > > > > > >
> > > > > > > Leaving aside the question of how Kafka scales, I think the
> > > proposed
> > > > > > > solution, limiting the number of partitions in a cluster or
> > > > per-broker,
> > > > > > is
> > > > > > > a policy which ought to be addressable via the pluggable
> policies
> > > > (e.g.
> > > > > > > create.topic.policy.class.name). Unfortunately although
> there's a
> > > > > policy
> > > > > > > for topic creation, it's currently not possible to enforce a
> policy
> > > > on
> > > > > > > partition increase. It would be more flexible to be able
> enforce
> > > this
> > > > > > kind
> > > > > > > of thing via a pluggable policy, and it would also avoid the
> > > > situation
> > > > > > > where different people each want to have a config which
> addresses
> > > > some
> > > > > > > specific use case or problem that they're experiencing.
> > > > > > >
> > > > > > > Quite a while ago I proposed KIP-201 to solve this issue with
> > > > policies
> > > > > > > being easily circumvented, but it didn't really make any
> progress.
> > > > I've
> > > > > > > looked at it again in some detail more recently and I think
> > > something
> > > > > > might
> > > > > > > be possible following the work to make all ZK writes happen on
> the
> > > > > > > controller.
> > > > > > >
> > > > > > > Of course, this is just my take on it.
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi.
> > > > > > > >
> > > > > > > > For the sake of expediting the discussion, I have created a
> > > > prototype
> > > > > > PR:
> > > > > > > > https://github.com/apache/kafka/pull/8499. Eventually, (if
> and)
> > > > when
> > > > > > the
> > > > > > > > KIP is accepted, I'll modify this to add the full
> implementation
> > > > and
> > > > > > > tests
> > > > > > > > etc. in there.
> > > > > > > >
> > > > > > > > Would appreciate if a Kafka committer could share their
> thoughts,
> > > > so
> > > > > > > that I
> > > > > > > > can more confidently start the voting thread.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for your comments Alex.
> > > > > > > > >
> > > > > > > > > The KIP proposes using two configurations max.partitions
> and
> > > > > > > > > max.broker.partitions. It does not enforce their use. The
> > > default
> > > > > > > values
> > > > > > > > > are pretty large (INT MAX), therefore should be
> non-intrusive.
> > > > > > > > >
> > > > > > > > > In multi-tenant environments and in partition assignment
> and
> > > > > > > rebalancing,
> > > > > > > > > the admin could (a) use the default values which would
> yield
> > > > > similar
> > > > > > > > > behavior to now, (b) set very high values that they know is
> > > > > > sufficient,
> > > > > > > > (c)
> > > > > > > > > dynamically re-adjust the values should the business
> > > requirements
> > > > > > > change.
> > > > > > > > > Note that the two configurations are cluster-wide, so they
> can
> > > be
> > > > > > > updated
> > > > > > > > > without restarting the brokers.
> > > > > > > > >
> > > > > > > > > The quota system in Kafka seems to be geared towards
> limiting
> > > > > traffic
> > > > > > > for
> > > > > > > > > specific clients or users, or in the case of replication,
> to
> > > > > leaders
> > > > > > > and
> > > > > > > > > followers. The quota configuration itself is very similar
> to
> > > the
> > > > > one
> > > > > > > > > introduced in this KIP i.e. just a few configuration
> options to
> > > > > > specify
> > > > > > > > the
> > > > > > > > > quota. The main difference is that the quota system is far
> more
> > > > > > > > > heavy-weight because it needs to be applied to traffic
> that is
> > > > > > flowing
> > > > > > > > > in/out constantly. Whereas in this KIP, we want to limit
> number
> > > > of
> > > > > > > > > partition replicas, which gets modified rarely by
> comparison
> > > in a
> > > > > > > typical
> > > > > > > > > cluster.
> > > > > > > > >
> > > > > > > > > Hope this addresses your comments.
> > > > > > > > >
> > > > > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> Hi Gokul,
> > > > > > > > >>
> > > > > > > > >> Thanks for the KIP.
> > > > > > > > >>
> > > > > > > > >> From what I understand, the objective of the new
> configuration
> > > > is
> > > > > to
> > > > > > > > >> protect a cluster from an overload driven by an excessive
> > > number
> > > > > of
> > > > > > > > >> partitions independently from the load handled on the
> > > partitions
> > > > > > > > >> themselves. As such, the approach uncouples the data-path
> load
> > > > > from
> > > > > > > > >> the number of unit of distributions of throughput and
> intends
> > > to
> > > > > > avoid
> > > > > > > > >> the degradation of performance exhibited in the test
> results
> > > > > > provided
> > > > > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > > > > >>
> > > > > > > > >> Couple of comments:
> > > > > > > > >>
> > > > > > > > >> 900. Multi-tenancy - one concern I would have with a
> cluster
> > > and
> > > > > > > > >> broker-level configuration is that it is possible for a
> user
> > > to
> > > > > > > > >> consume a large proportions of the allocatable partitions
> > > within
> > > > > the
> > > > > > > > >> configured limit, leaving other users with not enough
> > > partitions
> > > > > to
> > > > > > > > >> satisfy their requirements.
> > > > > > > > >>
> > > > > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> > > > upper-bound
> > > > > > on
> > > > > > > > >> resource consumptions is via client/user quotas. Could
> this
> > > > > > framework
> > > > > > > > >> be leveraged to add this limit?
> > > > > > > > >>
> > > > > > > > >> 902. Partition assignment - one potential problem with
> the new
> > > > > > > > >> repartitioning scheme is that if a subset of brokers have
> > > > reached
> > > > > > > > >> their number of assignable partitions, yet their data
> path is
> > > > > > > > >> under-loaded, new topics and/or partitions will be
> assigned
> > > > > > > > >> exclusively to other brokers, which could increase the
> > > > likelihood
> > > > > of
> > > > > > > > >> data-path load imbalance. Fundamentally, the isolation of
> the
> > > > > > > > >> constraint on the number of partitions from the data-path
> > > > > throughput
> > > > > > > > >> can have conflicting requirements.
> > > > > > > > >>
> > > > > > > > >> 903. Rebalancing - as a corollary to 902, external tools
> used
> > > to
> > > > > > > > >> balance ingress throughput may adopt an incremental
> approach
> > > in
> > > > > > > > >> partition re-assignment to redistribute load, and could
> hit
> > > the
> > > > > > limit
> > > > > > > > >> on the number of partitions on a broker when a (too)
> > > > conservative
> > > > > > > > >> limit is used, thereby over-constraining the objective
> > > function
> > > > > and
> > > > > > > > >> reducing the migration path.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Alexandre
> > > > > > > > >>
> > > > > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > > > > >> <go...@gmail.com> a écrit :
> > > > > > > > >> >
> > > > > > > > >> > Hi. Requesting you to take a look at this KIP and
> provide
> > > > > > feedback.
> > > > > > > > >> >
> > > > > > > > >> > Thanks. Regards.
> > > > > > > > >> >
> > > > > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan
> Subramanian <
> > > > > > > > >> > gokul2411s@gmail.com> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Hi.
> > > > > > > > >> > >
> > > > > > > > >> > > I have opened KIP-578, intended to provide a
> mechanism to
> > > > > limit
> > > > > > > the
> > > > > > > > >> number
> > > > > > > > >> > > of partitions in a Kafka cluster. Kindly provide
> feedback
> > > on
> > > > > the
> > > > > > > KIP
> > > > > > > > >> which
> > > > > > > > >> > > you can find at
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > > > > >> > >
> > > > > > > > >> > > I want to specially thank Stanislav Kozlovski who
> helped
> > > in
> > > > > > > > >> formulating
> > > > > > > > >> > > some aspects of the KIP.
> > > > > > > > >> > >
> > > > > > > > >> > > Many thanks,
> > > > > > > > >> > >
> > > > > > > > >> > > Gokul.
> > > > > > > > >> > >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Boyang,

Tom and I discussed earlier in this thread whether we could use the topic
creation policy for limiting the number of partitions. Basically, we
realized that while we can use it to limit the number of partitions, we
cannot use it to assign partitions in a way that respects the limit. In
light of KIP-590, I am thinking of modifying KIP-578 to simply state that
we won't apply the partition limit in the case of auto-topic creation,
where the Metadata API writes directly to ZK. Once KIP-590 lands, users
will get partition limiting during auto-topic creation naturally, without
any changes required, because the Metadata API will call CreateTopics under
the covers. What do you think?

I have updated the KIP to indicate that we will bump the versions of the
affected APIs. For older clients, I was assuming that if the new broker
(with the KIP-578 changes) throws the new exception and it gets sent to an
older client that cannot interpret it, it will be converted to some generic
exception. Is this not the case? I do not see a way to send a different
type of exception just for older clients; could you kindly explain? Also,
all of this might be simplified if we just reuse the existing
PolicyViolationException, as Ismael suggested.
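
For what it's worth, on the client side the rejection would surface as the
cause of an ExecutionException on the Admin futures, regardless of which
error code we settle on. Below is a minimal sketch assuming the
PolicyViolationException route above (the topic name, partition count, and
bootstrap address are placeholders); with a brand-new error code, an older
client would presumably see the generic fallback exception instead.

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.errors.PolicyViolationException;

    public class CreateTopicWithLimitHandling {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                NewTopic topic = new NewTopic("my-topic", 100, (short) 3);
                try {
                    admin.createTopics(Collections.singleton(topic)).all().get();
                } catch (ExecutionException e) {
                    if (e.getCause() instanceof PolicyViolationException) {
                        // The broker refused the request because the partition
                        // limit would be exceeded; the cluster itself stays healthy.
                        System.err.println("Partition limit reached: "
                            + e.getCause().getMessage());
                    } else {
                        throw e;
                    }
                }
            }
        }
    }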

Thanks.

On Wed, Jun 24, 2020 at 3:26 AM Boyang Chen <re...@gmail.com>
wrote:

> Hi Gokul,
>
> Thanks for the excellent KIP. I was recently driving the rollout of KIP-590
> <
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-590%3A+Redirect+Zookeeper+Mutation+Protocols+to+The+Controller
> >
> and
> proposed to fix the hole of the bypassing of topic creation policy when
> applying Metadata auto topic creation. As far as I understand, this KIP
> will take care of this problem specifically towards partition limit, so I
> was wondering whether it makes sense to integrate this logic with topic
> creation policy, and just enforce that entirely on the Metadata RPC auto
> topic creation case?
>
> Also, for the API exception part, could we state clearly that we are going
> to bump the version of mentioned RPCs so that new clients could handle the
> new error code RESOURCE_LIMIT_REACHED? Besides, I believe we are also
> going to enforce this rule towards older clients, could you describe the
> corresponding error code to be returned in the API exception part as well?
>
> Let me know if my proposal makes sense, thank you!
>
> Boyang
>
> On Fri, Jun 12, 2020 at 2:57 AM Alexandre Dupriez <
> alexandre.dupriez@gmail.com> wrote:
>
> > Hi Gokul,
> >
> > Thank you for the answers and the data provided to illustrate the use
> case.
> > A couple of additional questions.
> >
> > 904. If multi-tenancy is addressed in a future KIP, how smooth would
> > be the upgrade path? For example, the introduced configuration
> > parameters still apply, right? We would still maintain a first-come
> > first-served pattern when topics are created?
> >
> > 905. The current built-in assignment tool prioritises balance between
> > racks over brokers. In the version you propose, the limit on partition
> > count would take precedence over attempts to balance between racks.
> > Could it lead to a situation where it results in all partitions of a
> > topic being assigned in a single data center, if brokers in other
> > racks are "full"? Since it can potentially weaken the availability
> > guarantees for that topic (and maybe durability and/or consumer
> > performance with additional cross-rack traffic), how would we want to
> > handle the case? It may be worth warning users that the resulting
> > guarantees differ from what is offered by an "unlimited" assignment
> > plan in such cases? Also, let's keep in mind that some plans generated
> > by existing rebalancing tools could become invalid (w.r.t to the
> > configured limits).
> >
> > 906. The limits do not apply to internal topics. What about
> > framework-level topics from other tools and extensions? (connect,
> > streams, confluent metrics, tiered storage, etc.) Is blacklisting
> > possible?
> >
> > 907. What happens if one of the dynamic limit is violated at update
> > time? (sorry if it's already explained in the KIP, may have missed it)
> >
> > Thanks,
> > Alexandre
> >
> > Le dim. 3 mai 2020 à 20:20, Gokul Ramanan Subramanian
> > <go...@gmail.com> a écrit :
> > >
> > > Thanks Stanislav. Apologies about the long absence from this thread.
> > >
> > > I would prefer having per-user max partition limits in a separate KIP.
> I
> > > don't see this as an MVP for this KIP. I will add this as an
> alternative
> > > approach into the KIP.
> > >
> > > I was in a double mind about whether or not to impose the partition
> limit
> > > for internal topics as well. I can be convinced both ways. On the one
> > hand,
> > > internal topics should be purely internal i.e. users of a cluster
> should
> > > not have to care about them. In this sense, the partition limit should
> > not
> > > apply to internal topics. On the other hand, Kafka allows configuring
> > > internal topics by specifying their replication factor etc. Therefore,
> > they
> > > don't feel all that internal to me. In any case, I'll modify the KIP to
> > > exclude internal topics.
> > >
> > > I'll also add to the KIP the alternative approach Tom suggested around
> > > using topic policies to limit partitions, and explain why it does not
> > help
> > > to solve the problem that the KIP is trying to address (as I have done
> > in a
> > > previous correspondence on this thread).
> > >
> > > Cheers.
> > >
> > > On Fri, Apr 24, 2020 at 4:24 PM Stanislav Kozlovski <
> > stanislav@confluent.io>
> > > wrote:
> > >
> > > > Thanks for the KIP, Gokul!
> > > >
> > > > I like the overall premise - I think it's more user-friendly to have
> > > > configs for this than to have users implement their own config policy
> > -> so
> > > > unless it's very complex to implement, it seems worth it.
> > > > I agree that having the topic policy on the CreatePartitions path
> makes
> > > > sense as well.
> > > >
> > > > Multi-tenancy was a good point. It would be interesting to see how
> > easy it
> > > > is to extend the max partition limit to a per-user basis. Perhaps
> this
> > can
> > > > be done in a follow-up KIP, as a natural extension of the feature.
> > > >
> > > > I'm wondering whether there's a need to enforce this on internal
> > topics,
> > > > though. Given they're internal and critical to the function of
> Kafka, I
> > > > believe we'd rather always ensure they're created, regardless if over
> > some
> > > > user-set limit. It brings up the question of forward compatibility -
> > what
> > > > happens if a user's cluster is at the maximum partition capacity, yet
> > a new
> > > > release of Kafka introduces a new topic (e.g KIP-500)?
> > > >
> > > > Best,
> > > > Stanislav
> > > >
> > > > On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
> > > > gokul2411s@gmail.com> wrote:
> > > >
> > > > > Hi Tom.
> > > > >
> > > > > With KIP-578, we are not trying to model the load on each
> partition,
> > and
> > > > > come up with an exact limit on what the cluster or broker can
> handle
> > in
> > > > > terms of number of partitions. We understand that not all
> partitions
> > are
> > > > > equal, and the actual load per partition varies based on the
> message
> > > > size,
> > > > > throughput, whether the broker is a leader for that partition or
> not
> > etc.
> > > > >
> > > > > What we are trying to achieve with KIP-578 is to disallow a
> > pathological
> > > > > number of partitions that will surely put the cluster in bad shape.
> > For
> > > > > example, in KIP-578's appendix, we have described a case where we
> > could
> > > > not
> > > > > delete a topic with 30k partitions, because the brokers could not
> > > > > handle all the work that needed to be done. We have also described
> > how
> > > > > a producer performance test with 10k partitions observed basically
> 0
> > > > > throughput. In these cases, having a limit on number of partitions
> > > > > would allow the cluster to produce a graceful error message at
> topic
> > > > > creation time, and prevent the cluster from entering a pathological
> > > > state.
> > > > > These are not just hypotheticals. We definitely see many of these
> > > > > pathological cases happen in production, and we would like to avoid
> > them.
> > > > >
> > > > > The actual limit on number of partitions is something we do not
> want
> > to
> > > > > suggest in the KIP. The limit will depend on various tests that
> > owners of
> > > > > their clusters will have to perform, including perf tests,
> > identifying
> > > > > topic creation / deletion times, etc. For example, the tests we did
> > for
> > > > the
> > > > > KIP-578 appendix were enough to convince us that we should not have
> > > > > anywhere close to 10k partitions on the setup we describe there.
> > > > >
> > > > > What we want to do with KIP-578 is provide the flexibility to set a
> > limit
> > > > > on number of partitions based on tests cluster owners choose to
> > perform.
> > > > > Cluster owners can do the tests however often they wish and
> > dynamically
> > > > > adjust the limit on number of partitions. For example, we found in
> > our
> > > > > production environment that we don't want to have more than 1k
> > partitions
> > > > > on an m5.large EC2 broker instances, or more than 300 partitions
> on a
> > > > > t3.medium EC2 broker, for typical produce / consume use cases.
> > > > >
> > > > > Cluster owners are free to not configure the limit on number of
> > > > partitions
> > > > > if they don't want to spend the time coming up with a limit. The
> > limit
> > > > > defaults to INT32_MAX, which is basically infinity in this context,
> > and
> > > > > should be practically backwards compatible with current behavior.
> > > > >
> > > > > Further, the limit on number of partitions should not come in the
> > way of
> > > > > rebalancing tools under normal operation. For example, if the
> > partition
> > > > > limit per broker is set to 1k, unless the number of partitions
> comes
> > > > close
> > > > > to 1k, there should be no impact on rebalancing tools. Only when
> the
> > > > number
> > > > > of partitions comes close to 1k, will rebalancing tools be
> impacted,
> > but
> > > > at
> > > > > that point, the cluster is already at its limit of functioning (per
> > some
> > > > > definition that was used to set the limit in the first place).
> > > > >
> > > > > Finally, I want to end this long email by suggesting that the
> > partition
> > > > > assignment algorithm itself does not consider the load on various
> > > > > partitions before assigning partitions to brokers. In other words,
> it
> > > > > treats all partitions as equal. The idea of having a limit on
> number
> > of
> > > > > partitions is not mis-aligned with this tenet.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com>
> > wrote:
> > > > >
> > > > > > Hi Gokul,
> > > > > >
> > > > > > the partition assignment algorithm needs to be aware of the
> > partition
> > > > > > > limits.
> > > > > > >
> > > > > >
> > > > > > I agree, if you have limits then anything doing reassignment
> would
> > need
> > > > > > some way of knowing what they were. But the thing is that I'm not
> > > > really
> > > > > > sure how you would decide what the limits ought to be.
> > > > > >
> > > > > >
> > > > > > > To illustrate this, imagine that you have 3 brokers (1, 2 and
> 3),
> > > > > > > with 10, 20 and 30 partitions each respectively, and a limit of
> > 40
> > > > > > > partitions on each broker enforced via a configurable policy
> > class
> > > > (the
> > > > > > one
> > > > > > > you recommended). While the policy class may accept a topic
> > creation
> > > > > > > request for 11 partitions with a replication factor of 2 each
> > > > (because
> > > > > it
> > > > > > > is satisfiable), the non-pluggable partition assignment
> > algorithm (in
> > > > > > > AdminUtils.assignReplicasToBrokers and a few other places) has
> to
> > > > know
> > > > > > not
> > > > > > > to assign the 11th partition to broker 3 because it would run
> > out of
> > > > > > > partition capacity otherwise.
> > > > > > >
> > > > > >
> > > > > > I know this is only a toy example, but I think it also serves to
> > > > > illustrate
> > > > > > my point above. How has a limit of 40 partitions been arrived at?
> > In
> > > > real
> > > > > > life different partitions will impart a different load on a
> broker,
> > > > > > depending on all sorts of factors (which topics they're for, the
> > > > > throughput
> > > > > > and message size for those topics, etc). By saying that a broker
> > should
> > > > > not
> > > > > > have more than 40 partitions assigned I think you're making a big
> > > > > > assumption that all partitions have the same weight. You're also
> > > > limiting
> > > > > > the search space for finding an acceptable assignment. Cluster
> > > > balancers
> > > > > > usually use some kind of heuristic optimisation algorithm for
> > figuring
> > > > > out
> > > > > > assignments of partitions to brokers, and it could be that the
> > best (or
> > > > > at
> > > > > > least a good enough) solution requires assigning the least loaded
> > 41
> > > > > > partitions to one broker.
> > > > > >
> > > > > > The point I'm trying to make here is whatever limit is chosen
> it's
> > > > > probably
> > > > > > been chosen fairly arbitrarily. Even if it's been chosen based on
> > some
> > > > > > empirical evidence of how a particular cluster behaves it's
> likely
> > that
> > > > > > that evidence will become obsolete as the cluster evolves to
> serve
> > the
> > > > > > needs of the business running it (e.g. some hot topic gets
> > > > repartitioned,
> > > > > > messages get compressed with some new algorithm, some new topics
> > need
> > > > to
> > > > > be
> > > > > > created). For this reason I think the problem you're trying to
> > solve
> > > > via
> > > > > > policy (whether that was implemented in a pluggable way or not)
> is
> > > > really
> > > > > > better solved by automating the cluster balancing and having that
> > > > cluster
> > > > > > balancer be able to reason about when the cluster has too few
> > brokers
> > > > for
> > > > > > the number of partitions, rather than placing some limit on the
> > sizing
> > > > > and
> > > > > > shape of the cluster up front and then hobbling the cluster
> > balancer to
> > > > > > work within that.
> > > > > >
> > > > > > I think it might be useful to describe in the KIP how users would
> > be
> > > > > > expected to arrive at values for these configs (both on day 1 and
> > in an
> > > > > > evolving production cluster), when this solution might be better
> > than
> > > > > using
> > > > > > a cluster balancer and/or why cluster balancers can't be trusted
> to
> > > > avoid
> > > > > > overloading brokers.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > >
> > > > > > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > > > > > gokul2411s@gmail.com> wrote:
> > > > > >
> > > > > > > This is good reference Tom. I did not consider this approach at
> > all.
> > > > I
> > > > > am
> > > > > > > happy to learn about it now.
> > > > > > >
> > > > > > > However, I think that partition limits are not "yet another"
> > policy
> > > > > > > configuration. Instead, they are fundamental to partition
> > assignment.
> > > > > > i.e.
> > > > > > > the partition assignment algorithm needs to be aware of the
> > partition
> > > > > > > limits. To illustrate this, imagine that you have 3 brokers (1,
> > 2 and
> > > > > 3),
> > > > > > > with 10, 20 and 30 partitions each respectively, and a limit of
> > 40
> > > > > > > partitions on each broker enforced via a configurable policy
> > class
> > > > (the
> > > > > > one
> > > > > > > you recommended). While the policy class may accept a topic
> > creation
> > > > > > > request for 11 partitions with a replication factor of 2 each
> > > > (because
> > > > > it
> > > > > > > is satisfiable), the non-pluggable partition assignment
> > algorithm (in
> > > > > > > AdminUtils.assignReplicasToBrokers and a few other places) has
> to
> > > > know
> > > > > > not
> > > > > > > to assign the 11th partition to broker 3 because it would run
> > out of
> > > > > > > partition capacity otherwise.
> > > > > > >
> > > > > > > To achieve the ideal end that you are imagining (and I can
> > totally
> > > > > > > understand where you are coming from vis-a-vis the
> extensibility
> > of
> > > > > your
> > > > > > > solution wrt the one in the KIP), that would require extracting
> > the
> > > > > > > partition assignment logic itself into a pluggable class, and
> for
> > > > which
> > > > > > we
> > > > > > > could provide a custom implementation. I am afraid that would
> add
> > > > > > > complexity that I am not sure we want to undertake.
> > > > > > >
> > > > > > > Do you see sense in what I am saying?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <
> tbentley@redhat.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Gokul,
> > > > > > > >
> > > > > > > > Leaving aside the question of how Kafka scales, I think the
> > > > proposed
> > > > > > > > solution, limiting the number of partitions in a cluster or
> > > > > per-broker,
> > > > > > > is
> > > > > > > > a policy which ought to be addressable via the pluggable
> > policies
> > > > > (e.g.
> > > > > > > > create.topic.policy.class.name). Unfortunately although
> > there's a
> > > > > > policy
> > > > > > > > for topic creation, it's currently not possible to enforce a
> > policy
> > > > > on
> > > > > > > > partition increase. It would be more flexible to be able
> > enforce
> > > > this
> > > > > > > kind
> > > > > > > > of thing via a pluggable policy, and it would also avoid the
> > > > > situation
> > > > > > > > where different people each want to have a config which
> > addresses
> > > > > some
> > > > > > > > specific use case or problem that they're experiencing.
> > > > > > > >
> > > > > > > > Quite a while ago I proposed KIP-201 to solve this issue with
> > > > > policies
> > > > > > > > being easily circumvented, but it didn't really make any
> > progress.
> > > > > I've
> > > > > > > > looked at it again in some detail more recently and I think
> > > > something
> > > > > > > might
> > > > > > > > be possible following the work to make all ZK writes happen
> on
> > the
> > > > > > > > controller.
> > > > > > > >
> > > > > > > > Of course, this is just my take on it.
> > > > > > > >
> > > > > > > > Kind regards,
> > > > > > > >
> > > > > > > > Tom
> > > > > > > >
> > > > > > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi.
> > > > > > > > >
> > > > > > > > > For the sake of expediting the discussion, I have created a
> > > > > prototype
> > > > > > > PR:
> > > > > > > > > https://github.com/apache/kafka/pull/8499. Eventually, (if
> > and)
> > > > > when
> > > > > > > the
> > > > > > > > > KIP is accepted, I'll modify this to add the full
> > implementation
> > > > > and
> > > > > > > > tests
> > > > > > > > > etc. in there.
> > > > > > > > >
> > > > > > > > > Would appreciate if a Kafka committer could share their
> > thoughts,
> > > > > so
> > > > > > > > that I
> > > > > > > > > can more confidently start the voting thread.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian
> <
> > > > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for your comments Alex.
> > > > > > > > > >
> > > > > > > > > > The KIP proposes using two configurations max.partitions
> > and
> > > > > > > > > > max.broker.partitions. It does not enforce their use. The
> > > > default
> > > > > > > > values
> > > > > > > > > > are pretty large (INT MAX), therefore should be
> > non-intrusive.
> > > > > > > > > >
> > > > > > > > > > In multi-tenant environments and in partition assignment
> > and
> > > > > > > > rebalancing,
> > > > > > > > > > the admin could (a) use the default values which would
> > yield
> > > > > > similar
> > > > > > > > > > behavior to now, (b) set very high values that they know
> is
> > > > > > > sufficient,
> > > > > > > > > (c)
> > > > > > > > > > dynamically re-adjust the values should the business
> > > > requirements
> > > > > > > > change.
> > > > > > > > > > Note that the two configurations are cluster-wide, so
> they
> > can
> > > > be
> > > > > > > > updated
> > > > > > > > > > without restarting the brokers.
> > > > > > > > > >
> > > > > > > > > > The quota system in Kafka seems to be geared towards
> > limiting
> > > > > > traffic
> > > > > > > > for
> > > > > > > > > > specific clients or users, or in the case of replication,
> > to
> > > > > > leaders
> > > > > > > > and
> > > > > > > > > > followers. The quota configuration itself is very similar
> > to
> > > > the
> > > > > > one
> > > > > > > > > > introduced in this KIP i.e. just a few configuration
> > options to
> > > > > > > specify
> > > > > > > > > the
> > > > > > > > > > quota. The main difference is that the quota system is
> far
> > more
> > > > > > > > > > heavy-weight because it needs to be applied to traffic
> > that is
> > > > > > > flowing
> > > > > > > > > > in/out constantly. Whereas in this KIP, we want to limit
> > number
> > > > > of
> > > > > > > > > > partition replicas, which gets modified rarely by
> > comparison
> > > > in a
> > > > > > > > typical
> > > > > > > > > > cluster.
> > > > > > > > > >
> > > > > > > > > > Hope this addresses your comments.
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi Gokul,
> > > > > > > > > >>
> > > > > > > > > >> Thanks for the KIP.
> > > > > > > > > >>
> > > > > > > > > >> From what I understand, the objective of the new
> > configuration
> > > > > is
> > > > > > to
> > > > > > > > > >> protect a cluster from an overload driven by an
> excessive
> > > > number
> > > > > > of
> > > > > > > > > >> partitions independently from the load handled on the
> > > > partitions
> > > > > > > > > >> themselves. As such, the approach uncouples the
> data-path
> > load
> > > > > > from
> > > > > > > > > >> the number of unit of distributions of throughput and
> > intends
> > > > to
> > > > > > > avoid
> > > > > > > > > >> the degradation of performance exhibited in the test
> > results
> > > > > > > provided
> > > > > > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > > > > > >>
> > > > > > > > > >> Couple of comments:
> > > > > > > > > >>
> > > > > > > > > >> 900. Multi-tenancy - one concern I would have with a
> > cluster
> > > > and
> > > > > > > > > >> broker-level configuration is that it is possible for a
> > user
> > > > to
> > > > > > > > > >> consume a large proportions of the allocatable
> partitions
> > > > within
> > > > > > the
> > > > > > > > > >> configured limit, leaving other users with not enough
> > > > partitions
> > > > > > to
> > > > > > > > > >> satisfy their requirements.
> > > > > > > > > >>
> > > > > > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> > > > > upper-bound
> > > > > > > on
> > > > > > > > > >> resource consumptions is via client/user quotas. Could
> > this
> > > > > > > framework
> > > > > > > > > >> be leveraged to add this limit?
> > > > > > > > > >>
> > > > > > > > > >> 902. Partition assignment - one potential problem with
> > the new
> > > > > > > > > >> repartitioning scheme is that if a subset of brokers
> have
> > > > > reached
> > > > > > > > > >> their number of assignable partitions, yet their data
> > path is
> > > > > > > > > >> under-loaded, new topics and/or partitions will be
> > assigned
> > > > > > > > > >> exclusively to other brokers, which could increase the
> > > > > likelihood
> > > > > > of
> > > > > > > > > >> data-path load imbalance. Fundamentally, the isolation
> of
> > the
> > > > > > > > > >> constraint on the number of partitions from the
> data-path
> > > > > > throughput
> > > > > > > > > >> can have conflicting requirements.
> > > > > > > > > >>
> > > > > > > > > >> 903. Rebalancing - as a corollary to 902, external tools
> > used
> > > > to
> > > > > > > > > >> balance ingress throughput may adopt an incremental
> > approach
> > > > in
> > > > > > > > > >> partition re-assignment to redistribute load, and could
> > hit
> > > > the
> > > > > > > limit
> > > > > > > > > >> on the number of partitions on a broker when a (too)
> > > > > conservative
> > > > > > > > > >> limit is used, thereby over-constraining the objective
> > > > function
> > > > > > and
> > > > > > > > > >> reducing the migration path.
> > > > > > > > > >>
> > > > > > > > > >> Thanks,
> > > > > > > > > >> Alexandre
> > > > > > > > > >>
> > > > > > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > > > > > >> <go...@gmail.com> a écrit :
> > > > > > > > > >> >
> > > > > > > > > >> > Hi. Requesting you to take a look at this KIP and
> > provide
> > > > > > > feedback.
> > > > > > > > > >> >
> > > > > > > > > >> > Thanks. Regards.
> > > > > > > > > >> >
> > > > > > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan
> > Subramanian <
> > > > > > > > > >> > gokul2411s@gmail.com> wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > Hi.
> > > > > > > > > >> > >
> > > > > > > > > >> > > I have opened KIP-578, intended to provide a
> > mechanism to
> > > > > > limit
> > > > > > > > the
> > > > > > > > > >> number
> > > > > > > > > >> > > of partitions in a Kafka cluster. Kindly provide
> > feedback
> > > > on
> > > > > > the
> > > > > > > > KIP
> > > > > > > > > >> which
> > > > > > > > > >> > > you can find at
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > > > > > >> > >
> > > > > > > > > >> > > I want to specially thank Stanislav Kozlovski who
> > helped
> > > > in
> > > > > > > > > >> formulating
> > > > > > > > > >> > > some aspects of the KIP.
> > > > > > > > > >> > >
> > > > > > > > > >> > > Many thanks,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Gokul.
> > > > > > > > > >> > >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Stanislav
> > > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Boyang Chen <re...@gmail.com>.
Hi Gokul,

Thanks for the excellent KIP. I was recently driving the rollout of KIP-590
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-590%3A+Redirect+Zookeeper+Mutation+Protocols+to+The+Controller>
and proposed to close the hole where Metadata auto topic creation bypasses
the topic creation policy. As far as I understand, this KIP will take care
of this problem specifically for the partition limit, so I was wondering
whether it makes sense to integrate this logic with the topic creation
policy and just enforce that entirely for the Metadata RPC auto topic
creation case?

Also, for the API exception part, could we state clearly that we are going
to bump the versions of the mentioned RPCs so that new clients can handle
the new error code RESOURCE_LIMIT_REACHED? Besides, I believe we are also
going to enforce this rule for older clients; could you describe the
corresponding error code to be returned in the API exception part as well?

Let me know if my proposal makes sense, thank you!

Boyang

On Fri, Jun 12, 2020 at 2:57 AM Alexandre Dupriez <
alexandre.dupriez@gmail.com> wrote:

> Hi Gokul,
>
> Thank you for the answers and the data provided to illustrate the use case.
> A couple of additional questions.
>
> 904. If multi-tenancy is addressed in a future KIP, how smooth would
> be the upgrade path? For example, the introduced configuration
> parameters still apply, right? We would still maintain a first-come
> first-served pattern when topics are created?
>
> 905. The current built-in assignment tool prioritises balance between
> racks over brokers. In the version you propose, the limit on partition
> count would take precedence over attempts to balance between racks.
> Could it lead to a situation where it results in all partitions of a
> topic being assigned in a single data center, if brokers in other
> racks are "full"? Since it can potentially weaken the availability
> guarantees for that topic (and maybe durability and/or consumer
> performance with additional cross-rack traffic), how would we want to
> handle the case? It may be worth warning users that the resulting
> guarantees differ from what is offered by an "unlimited" assignment
> plan in such cases? Also, let's keep in mind that some plans generated
> by existing rebalancing tools could become invalid (w.r.t to the
> configured limits).
>
> 906. The limits do not apply to internal topics. What about
> framework-level topics from other tools and extensions? (connect,
> streams, confluent metrics, tiered storage, etc.) Is blacklisting
> possible?
>
> 907. What happens if one of the dynamic limit is violated at update
> time? (sorry if it's already explained in the KIP, may have missed it)
>
> Thanks,
> Alexandre
>
> Le dim. 3 mai 2020 à 20:20, Gokul Ramanan Subramanian
> <go...@gmail.com> a écrit :
> >
> > Thanks Stanislav. Apologies about the long absence from this thread.
> >
> > I would prefer having per-user max partition limits in a separate KIP. I
> > don't see this as an MVP for this KIP. I will add this as an alternative
> > approach into the KIP.
> >
> > I was in a double mind about whether or not to impose the partition limit
> > for internal topics as well. I can be convinced both ways. On the one
> hand,
> > internal topics should be purely internal i.e. users of a cluster should
> > not have to care about them. In this sense, the partition limit should
> not
> > apply to internal topics. On the other hand, Kafka allows configuring
> > internal topics by specifying their replication factor etc. Therefore,
> they
> > don't feel all that internal to me. In any case, I'll modify the KIP to
> > exclude internal topics.
> >
> > I'll also add to the KIP the alternative approach Tom suggested around
> > using topic policies to limit partitions, and explain why it does not
> help
> > to solve the problem that the KIP is trying to address (as I have done
> in a
> > previous correspondence on this thread).
> >
> > Cheers.
> >
> > On Fri, Apr 24, 2020 at 4:24 PM Stanislav Kozlovski <
> stanislav@confluent.io>
> > wrote:
> >
> > > Thanks for the KIP, Gokul!
> > >
> > > I like the overall premise - I think it's more user-friendly to have
> > > configs for this than to have users implement their own config policy
> -> so
> > > unless it's very complex to implement, it seems worth it.
> > > I agree that having the topic policy on the CreatePartitions path makes
> > > sense as well.
> > >
> > > Multi-tenancy was a good point. It would be interesting to see how
> easy it
> > > is to extend the max partition limit to a per-user basis. Perhaps this
> can
> > > be done in a follow-up KIP, as a natural extension of the feature.
> > >
> > > I'm wondering whether there's a need to enforce this on internal
> topics,
> > > though. Given they're internal and critical to the function of Kafka, I
> > > believe we'd rather always ensure they're created, regardless if over
> some
> > > user-set limit. It brings up the question of forward compatibility -
> what
> > > happens if a user's cluster is at the maximum partition capacity, yet
> a new
> > > release of Kafka introduces a new topic (e.g KIP-500)?
> > >
> > > Best,
> > > Stanislav
> > >
> > > On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
> > > gokul2411s@gmail.com> wrote:
> > >
> > > > Hi Tom.
> > > >
> > > > With KIP-578, we are not trying to model the load on each partition,
> and
> > > > come up with an exact limit on what the cluster or broker can handle
> in
> > > > terms of number of partitions. We understand that not all partitions
> are
> > > > equal, and the actual load per partition varies based on the message
> > > size,
> > > > throughput, whether the broker is a leader for that partition or not
> etc.
> > > >
> > > > What we are trying to achieve with KIP-578 is to disallow a
> pathological
> > > > number of partitions that will surely put the cluster in bad shape.
> For
> > > > example, in KIP-578's appendix, we have described a case where we
> could
> > > not
> > > > delete a topic with 30k partitions, because the brokers could not
> > > > handle all the work that needed to be done. We have also described
> how
> > > > a producer performance test with 10k partitions observed basically 0
> > > > throughput. In these cases, having a limit on number of partitions
> > > > would allow the cluster to produce a graceful error message at topic
> > > > creation time, and prevent the cluster from entering a pathological
> > > state.
> > > > These are not just hypotheticals. We definitely see many of these
> > > > pathological cases happen in production, and we would like to avoid
> them.
> > > >
> > > > The actual limit on number of partitions is something we do not want
> to
> > > > suggest in the KIP. The limit will depend on various tests that
> owners of
> > > > their clusters will have to perform, including perf tests,
> identifying
> > > > topic creation / deletion times, etc. For example, the tests we did
> for
> > > the
> > > > KIP-578 appendix were enough to convince us that we should not have
> > > > anywhere close to 10k partitions on the setup we describe there.
> > > >
> > > > What we want to do with KIP-578 is provide the flexibility to set a
> limit
> > > > on number of partitions based on tests cluster owners choose to
> perform.
> > > > Cluster owners can do the tests however often they wish and
> dynamically
> > > > adjust the limit on number of partitions. For example, we found in
> our
> > > > production environment that we don't want to have more than 1k
> partitions
> > > > on an m5.large EC2 broker instances, or more than 300 partitions on a
> > > > t3.medium EC2 broker, for typical produce / consume use cases.
> > > >
> > > > Cluster owners are free to not configure the limit on number of
> > > partitions
> > > > if they don't want to spend the time coming up with a limit. The
> limit
> > > > defaults to INT32_MAX, which is basically infinity in this context,
> and
> > > > should be practically backwards compatible with current behavior.
> > > >
> > > > Further, the limit on number of partitions should not come in the
> way of
> > > > rebalancing tools under normal operation. For example, if the
> partition
> > > > limit per broker is set to 1k, unless the number of partitions comes
> > > close
> > > > to 1k, there should be no impact on rebalancing tools. Only when the
> > > number
> > > > of partitions comes close to 1k, will rebalancing tools be impacted,
> but
> > > at
> > > > that point, the cluster is already at its limit of functioning (per
> some
> > > > definition that was used to set the limit in the first place).
> > > >
> > > > Finally, I want to end this long email by suggesting that the
> partition
> > > > assignment algorithm itself does not consider the load on various
> > > > partitions before assigning partitions to brokers. In other words, it
> > > > treats all partitions as equal. The idea of having a limit on number
> of
> > > > partitions is not mis-aligned with this tenet.
> > > >
> > > > Thanks.
> > > >
> > > > On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com>
> wrote:
> > > >
> > > > > Hi Gokul,
> > > > >
> > > > > the partition assignment algorithm needs to be aware of the
> partition
> > > > > > limits.
> > > > > >
> > > > >
> > > > > I agree, if you have limits then anything doing reassignment would
> need
> > > > > some way of knowing what they were. But the thing is that I'm not
> > > really
> > > > > sure how you would decide what the limits ought to be.
> > > > >
> > > > >
> > > > > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > > > > > with 10, 20 and 30 partitions each respectively, and a limit of
> 40
> > > > > > partitions on each broker enforced via a configurable policy
> class
> > > (the
> > > > > one
> > > > > > you recommended). While the policy class may accept a topic
> creation
> > > > > > request for 11 partitions with a replication factor of 2 each
> > > (because
> > > > it
> > > > > > is satisfiable), the non-pluggable partition assignment
> algorithm (in
> > > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > > know
> > > > > not
> > > > > > to assign the 11th partition to broker 3 because it would run
> out of
> > > > > > partition capacity otherwise.
> > > > > >
> > > > >
> > > > > I know this is only a toy example, but I think it also serves to
> > > > illustrate
> > > > > my point above. How has a limit of 40 partitions been arrived at?
> In
> > > real
> > > > > life different partitions will impart a different load on a broker,
> > > > > depending on all sorts of factors (which topics they're for, the
> > > > throughput
> > > > > and message size for those topics, etc). By saying that a broker
> should
> > > > not
> > > > > have more than 40 partitions assigned I think you're making a big
> > > > > assumption that all partitions have the same weight. You're also
> > > limiting
> > > > > the search space for finding an acceptable assignment. Cluster
> > > balancers
> > > > > usually use some kind of heuristic optimisation algorithm for
> figuring
> > > > out
> > > > > assignments of partitions to brokers, and it could be that the
> best (or
> > > > at
> > > > > least a good enough) solution requires assigning the least loaded
> 41
> > > > > partitions to one broker.
> > > > >
> > > > > The point I'm trying to make here is whatever limit is chosen it's
> > > > probably
> > > > > been chosen fairly arbitrarily. Even if it's been chosen based on
> some
> > > > > empirical evidence of how a particular cluster behaves it's likely
> that
> > > > > that evidence will become obsolete as the cluster evolves to serve
> the
> > > > > needs of the business running it (e.g. some hot topic gets
> > > repartitioned,
> > > > > messages get compressed with some new algorithm, some new topics
> need
> > > to
> > > > be
> > > > > created). For this reason I think the problem you're trying to
> solve
> > > via
> > > > > policy (whether that was implemented in a pluggable way or not) is
> > > really
> > > > > better solved by automating the cluster balancing and having that
> > > cluster
> > > > > balancer be able to reason about when the cluster has too few
> brokers
> > > for
> > > > > the number of partitions, rather than placing some limit on the
> sizing
> > > > and
> > > > > shape of the cluster up front and then hobbling the cluster
> balancer to
> > > > > work within that.
> > > > >
> > > > > I think it might be useful to describe in the KIP how users would
> be
> > > > > expected to arrive at values for these configs (both on day 1 and
> in an
> > > > > evolving production cluster), when this solution might be better
> than
> > > > using
> > > > > a cluster balancer and/or why cluster balancers can't be trusted to
> > > avoid
> > > > > overloading brokers.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > > >
> > > > > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > > > > gokul2411s@gmail.com> wrote:
> > > > >
> > > > > > This is good reference Tom. I did not consider this approach at
> all.
> > > I
> > > > am
> > > > > > happy to learn about it now.
> > > > > >
> > > > > > However, I think that partition limits are not "yet another"
> policy
> > > > > > configuration. Instead, they are fundamental to partition
> assignment.
> > > > > i.e.
> > > > > > the partition assignment algorithm needs to be aware of the
> partition
> > > > > > limits. To illustrate this, imagine that you have 3 brokers (1,
> 2 and
> > > > 3),
> > > > > > with 10, 20 and 30 partitions each respectively, and a limit of
> 40
> > > > > > partitions on each broker enforced via a configurable policy
> class
> > > (the
> > > > > one
> > > > > > you recommended). While the policy class may accept a topic
> creation
> > > > > > request for 11 partitions with a replication factor of 2 each
> > > (because
> > > > it
> > > > > > is satisfiable), the non-pluggable partition assignment
> algorithm (in
> > > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > > know
> > > > > not
> > > > > > to assign the 11th partition to broker 3 because it would run
> out of
> > > > > > partition capacity otherwise.
> > > > > >
> > > > > > To achieve the ideal end that you are imagining (and I can
> totally
> > > > > > understand where you are coming from vis-a-vis the extensibility
> of
> > > > your
> > > > > > solution wrt the one in the KIP), that would require extracting
> the
> > > > > > partition assignment logic itself into a pluggable class, and for
> > > which
> > > > > we
> > > > > > could provide a custom implementation. I am afraid that would add
> > > > > > complexity that I am not sure we want to undertake.
> > > > > >
> > > > > > Do you see sense in what I am saying?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tbentley@redhat.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Gokul,
> > > > > > >
> > > > > > > Leaving aside the question of how Kafka scales, I think the
> > > proposed
> > > > > > > solution, limiting the number of partitions in a cluster or
> > > > per-broker,
> > > > > > is
> > > > > > > a policy which ought to be addressable via the pluggable
> policies
> > > > (e.g.
> > > > > > > create.topic.policy.class.name). Unfortunately although
> there's a
> > > > > policy
> > > > > > > for topic creation, it's currently not possible to enforce a
> policy
> > > > on
> > > > > > > partition increase. It would be more flexible to be able
> enforce
> > > this
> > > > > > kind
> > > > > > > of thing via a pluggable policy, and it would also avoid the
> > > > situation
> > > > > > > where different people each want to have a config which
> addresses
> > > > some
> > > > > > > specific use case or problem that they're experiencing.
> > > > > > >
> > > > > > > Quite a while ago I proposed KIP-201 to solve this issue with
> > > > policies
> > > > > > > being easily circumvented, but it didn't really make any
> progress.
> > > > I've
> > > > > > > looked at it again in some detail more recently and I think
> > > something
> > > > > > might
> > > > > > > be possible following the work to make all ZK writes happen on
> the
> > > > > > > controller.
> > > > > > >
> > > > > > > Of course, this is just my take on it.
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi.
> > > > > > > >
> > > > > > > > For the sake of expediting the discussion, I have created a
> > > > prototype
> > > > > > PR:
> > > > > > > > https://github.com/apache/kafka/pull/8499. Eventually, (if
> and)
> > > > when
> > > > > > the
> > > > > > > > KIP is accepted, I'll modify this to add the full
> implementation
> > > > and
> > > > > > > tests
> > > > > > > > etc. in there.
> > > > > > > >
> > > > > > > > Would appreciate if a Kafka committer could share their
> thoughts,
> > > > so
> > > > > > > that I
> > > > > > > > can more confidently start the voting thread.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks for your comments Alex.
> > > > > > > > >
> > > > > > > > > The KIP proposes using two configurations max.partitions
> and
> > > > > > > > > max.broker.partitions. It does not enforce their use. The
> > > default
> > > > > > > values
> > > > > > > > > are pretty large (INT MAX), therefore should be
> non-intrusive.
> > > > > > > > >
> > > > > > > > > In multi-tenant environments and in partition assignment
> and
> > > > > > > rebalancing,
> > > > > > > > > the admin could (a) use the default values which would
> yield
> > > > > similar
> > > > > > > > > behavior to now, (b) set very high values that they know is
> > > > > > sufficient,
> > > > > > > > (c)
> > > > > > > > > dynamically re-adjust the values should the business
> > > requirements
> > > > > > > change.
> > > > > > > > > Note that the two configurations are cluster-wide, so they
> can
> > > be
> > > > > > > updated
> > > > > > > > > without restarting the brokers.
> > > > > > > > >
> > > > > > > > > The quota system in Kafka seems to be geared towards
> limiting
> > > > > traffic
> > > > > > > for
> > > > > > > > > specific clients or users, or in the case of replication,
> to
> > > > > leaders
> > > > > > > and
> > > > > > > > > followers. The quota configuration itself is very similar
> to
> > > the
> > > > > one
> > > > > > > > > introduced in this KIP i.e. just a few configuration
> options to
> > > > > > specify
> > > > > > > > the
> > > > > > > > > quota. The main difference is that the quota system is far
> more
> > > > > > > > > heavy-weight because it needs to be applied to traffic
> that is
> > > > > > flowing
> > > > > > > > > in/out constantly. Whereas in this KIP, we want to limit
> number
> > > > of
> > > > > > > > > partition replicas, which gets modified rarely by
> comparison
> > > in a
> > > > > > > typical
> > > > > > > > > cluster.
> > > > > > > > >
> > > > > > > > > Hope this addresses your comments.
> > > > > > > > >
> > > > > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> Hi Gokul,
> > > > > > > > >>
> > > > > > > > >> Thanks for the KIP.
> > > > > > > > >>
> > > > > > > > >> From what I understand, the objective of the new
> configuration
> > > > is
> > > > > to
> > > > > > > > >> protect a cluster from an overload driven by an excessive
> > > number
> > > > > of
> > > > > > > > >> partitions independently from the load handled on the
> > > partitions
> > > > > > > > >> themselves. As such, the approach uncouples the data-path
> load
> > > > > from
> > > > > > > > >> the number of unit of distributions of throughput and
> intends
> > > to
> > > > > > avoid
> > > > > > > > >> the degradation of performance exhibited in the test
> results
> > > > > > provided
> > > > > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > > > > >>
> > > > > > > > >> Couple of comments:
> > > > > > > > >>
> > > > > > > > >> 900. Multi-tenancy - one concern I would have with a
> cluster
> > > and
> > > > > > > > >> broker-level configuration is that it is possible for a
> user
> > > to
> > > > > > > > >> consume a large proportions of the allocatable partitions
> > > within
> > > > > the
> > > > > > > > >> configured limit, leaving other users with not enough
> > > partitions
> > > > > to
> > > > > > > > >> satisfy their requirements.
> > > > > > > > >>
> > > > > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> > > > upper-bound
> > > > > > on
> > > > > > > > >> resource consumptions is via client/user quotas. Could
> this
> > > > > > framework
> > > > > > > > >> be leveraged to add this limit?
> > > > > > > > >>
> > > > > > > > >> 902. Partition assignment - one potential problem with
> the new
> > > > > > > > >> repartitioning scheme is that if a subset of brokers have
> > > > reached
> > > > > > > > >> their number of assignable partitions, yet their data
> path is
> > > > > > > > >> under-loaded, new topics and/or partitions will be
> assigned
> > > > > > > > >> exclusively to other brokers, which could increase the
> > > > likelihood
> > > > > of
> > > > > > > > >> data-path load imbalance. Fundamentally, the isolation of
> the
> > > > > > > > >> constraint on the number of partitions from the data-path
> > > > > throughput
> > > > > > > > >> can have conflicting requirements.
> > > > > > > > >>
> > > > > > > > >> 903. Rebalancing - as a corollary to 902, external tools
> used
> > > to
> > > > > > > > >> balance ingress throughput may adopt an incremental
> approach
> > > in
> > > > > > > > >> partition re-assignment to redistribute load, and could
> hit
> > > the
> > > > > > limit
> > > > > > > > >> on the number of partitions on a broker when a (too)
> > > > conservative
> > > > > > > > >> limit is used, thereby over-constraining the objective
> > > function
> > > > > and
> > > > > > > > >> reducing the migration path.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Alexandre
> > > > > > > > >>
> > > > > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > > > > >> <go...@gmail.com> a écrit :
> > > > > > > > >> >
> > > > > > > > >> > Hi. Requesting you to take a look at this KIP and
> provide
> > > > > > feedback.
> > > > > > > > >> >
> > > > > > > > >> > Thanks. Regards.
> > > > > > > > >> >
> > > > > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan
> Subramanian <
> > > > > > > > >> > gokul2411s@gmail.com> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Hi.
> > > > > > > > >> > >
> > > > > > > > >> > > I have opened KIP-578, intended to provide a
> mechanism to
> > > > > limit
> > > > > > > the
> > > > > > > > >> number
> > > > > > > > >> > > of partitions in a Kafka cluster. Kindly provide
> feedback
> > > on
> > > > > the
> > > > > > > KIP
> > > > > > > > >> which
> > > > > > > > >> > > you can find at
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > > > > >> > >
> > > > > > > > >> > > I want to specially thank Stanislav Kozlovski who
> helped
> > > in
> > > > > > > > >> formulating
> > > > > > > > >> > > some aspects of the KIP.
> > > > > > > > >> > >
> > > > > > > > >> > > Many thanks,
> > > > > > > > >> > >
> > > > > > > > >> > > Gokul.
> > > > > > > > >> > >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Alexandre Dupriez <al...@gmail.com>.
Hi Gokul,

Thank you for the answers and the data provided to illustrate the use case.
A couple of additional questions.

904. If multi-tenancy is addressed in a future KIP, how smooth would
be the upgrade path? For example, the introduced configuration
parameters still apply, right? We would still maintain a first-come
first-served pattern when topics are created?

905. The current built-in assignment tool prioritises balance between
racks over brokers. In the version you propose, the limit on partition
count would take precedence over attempts to balance between racks.
Could this lead to a situation where all partitions of a
topic are assigned in a single data center, if brokers in other
racks are "full"? Since it can potentially weaken the availability
guarantees for that topic (and maybe durability and/or consumer
performance with additional cross-rack traffic), how would we want to
handle the case? It may be worth warning users that the resulting
guarantees differ from what is offered by an "unlimited" assignment
plan in such cases? Also, let's keep in mind that some plans generated
by existing rebalancing tools could become invalid (w.r.t. the
configured limits).

906. The limits do not apply to internal topics. What about
framework-level topics from other tools and extensions? (connect,
streams, confluent metrics, tiered storage, etc.) Is blacklisting
possible?

907. What happens if one of the dynamic limits is violated at update
time? (sorry if it's already explained in the KIP, may have missed it)
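
For reference on 907: the KIP describes max.partitions and max.broker.partitions
as dynamic, cluster-wide configs, so presumably an update would be issued like
any other dynamic broker default, roughly as in the sketch below. The config name
comes from the proposal; the value 2000 and the bootstrap address are illustrative,
and how the broker reacts to an update that the current state already violates is
exactly what the question asks.

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaisePartitionLimit {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // An empty resource name addresses the cluster-wide broker default.
            ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp setLimit = new AlterConfigOp(
                new ConfigEntry("max.broker.partitions", "2000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                Collections.singletonMap(clusterDefault, Collections.singletonList(setLimit));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}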

Thanks,
Alexandre

Le dim. 3 mai 2020 à 20:20, Gokul Ramanan Subramanian
<go...@gmail.com> a écrit :
>
> Thanks Stanislav. Apologies about the long absence from this thread.
>
> I would prefer having per-user max partition limits in a separate KIP. I
> don't see this as an MVP for this KIP. I will add this as an alternative
> approach into the KIP.
>
> I was in a double mind about whether or not to impose the partition limit
> for internal topics as well. I can be convinced both ways. On the one hand,
> internal topics should be purely internal i.e. users of a cluster should
> not have to care about them. In this sense, the partition limit should not
> apply to internal topics. On the other hand, Kafka allows configuring
> internal topics by specifying their replication factor etc. Therefore, they
> don't feel all that internal to me. In any case, I'll modify the KIP to
> exclude internal topics.
>
> I'll also add to the KIP the alternative approach Tom suggested around
> using topic policies to limit partitions, and explain why it does not help
> to solve the problem that the KIP is trying to address (as I have done in a
> previous correspondence on this thread).
>
> Cheers.
>
> On Fri, Apr 24, 2020 at 4:24 PM Stanislav Kozlovski <st...@confluent.io>
> wrote:
>
> > Thanks for the KIP, Gokul!
> >
> > I like the overall premise - I think it's more user-friendly to have
> > configs for this than to have users implement their own config policy -> so
> > unless it's very complex to implement, it seems worth it.
> > I agree that having the topic policy on the CreatePartitions path makes
> > sense as well.
> >
> > Multi-tenancy was a good point. It would be interesting to see how easy it
> > is to extend the max partition limit to a per-user basis. Perhaps this can
> > be done in a follow-up KIP, as a natural extension of the feature.
> >
> > I'm wondering whether there's a need to enforce this on internal topics,
> > though. Given they're internal and critical to the function of Kafka, I
> > believe we'd rather always ensure they're created, regardless if over some
> > user-set limit. It brings up the question of forward compatibility - what
> > happens if a user's cluster is at the maximum partition capacity, yet a new
> > release of Kafka introduces a new topic (e.g KIP-500)?
> >
> > Best,
> > Stanislav
> >
> > On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com> wrote:
> >
> > > Hi Tom.
> > >
> > > With KIP-578, we are not trying to model the load on each partition, and
> > > come up with an exact limit on what the cluster or broker can handle in
> > > terms of number of partitions. We understand that not all partitions are
> > > equal, and the actual load per partition varies based on the message
> > size,
> > > throughput, whether the broker is a leader for that partition or not etc.
> > >
> > > What we are trying to achieve with KIP-578 is to disallow a pathological
> > > number of partitions that will surely put the cluster in bad shape. For
> > > example, in KIP-578's appendix, we have described a case where we could
> > not
> > > delete a topic with 30k partitions, because the brokers could not
> > > handle all the work that needed to be done. We have also described how
> > > a producer performance test with 10k partitions observed basically 0
> > > throughput. In these cases, having a limit on number of partitions
> > > would allow the cluster to produce a graceful error message at topic
> > > creation time, and prevent the cluster from entering a pathological
> > state.
> > > These are not just hypotheticals. We definitely see many of these
> > > pathological cases happen in production, and we would like to avoid them.
> > >
> > > The actual limit on number of partitions is something we do not want to
> > > suggest in the KIP. The limit will depend on various tests that owners of
> > > their clusters will have to perform, including perf tests, identifying
> > > topic creation / deletion times, etc. For example, the tests we did for
> > the
> > > KIP-578 appendix were enough to convince us that we should not have
> > > anywhere close to 10k partitions on the setup we describe there.
> > >
> > > What we want to do with KIP-578 is provide the flexibility to set a limit
> > > on number of partitions based on tests cluster owners choose to perform.
> > > Cluster owners can do the tests however often they wish and dynamically
> > > adjust the limit on number of partitions. For example, we found in our
> > > production environment that we don't want to have more than 1k partitions
> > > on an m5.large EC2 broker instances, or more than 300 partitions on a
> > > t3.medium EC2 broker, for typical produce / consume use cases.
> > >
> > > Cluster owners are free to not configure the limit on number of
> > partitions
> > > if they don't want to spend the time coming up with a limit. The limit
> > > defaults to INT32_MAX, which is basically infinity in this context, and
> > > should be practically backwards compatible with current behavior.
> > >
> > > Further, the limit on number of partitions should not come in the way of
> > > rebalancing tools under normal operation. For example, if the partition
> > > limit per broker is set to 1k, unless the number of partitions comes
> > close
> > > to 1k, there should be no impact on rebalancing tools. Only when the
> > number
> > > of partitions comes close to 1k, will rebalancing tools be impacted, but
> > at
> > > that point, the cluster is already at its limit of functioning (per some
> > > definition that was used to set the limit in the first place).
> > >
> > > Finally, I want to end this long email by suggesting that the partition
> > > assignment algorithm itself does not consider the load on various
> > > partitions before assigning partitions to brokers. In other words, it
> > > treats all partitions as equal. The idea of having a limit on number of
> > > partitions is not mis-aligned with this tenet.
> > >
> > > Thanks.
> > >
> > > On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com> wrote:
> > >
> > > > Hi Gokul,
> > > >
> > > > the partition assignment algorithm needs to be aware of the partition
> > > > > limits.
> > > > >
> > > >
> > > > I agree, if you have limits then anything doing reassignment would need
> > > > some way of knowing what they were. But the thing is that I'm not
> > really
> > > > sure how you would decide what the limits ought to be.
> > > >
> > > >
> > > > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > > > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > > > partitions on each broker enforced via a configurable policy class
> > (the
> > > > one
> > > > > you recommended). While the policy class may accept a topic creation
> > > > > request for 11 partitions with a replication factor of 2 each
> > (because
> > > it
> > > > > is satisfiable), the non-pluggable partition assignment algorithm (in
> > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > know
> > > > not
> > > > > to assign the 11th partition to broker 3 because it would run out of
> > > > > partition capacity otherwise.
> > > > >
> > > >
> > > > I know this is only a toy example, but I think it also serves to
> > > illustrate
> > > > my point above. How has a limit of 40 partitions been arrived at? In
> > real
> > > > life different partitions will impart a different load on a broker,
> > > > depending on all sorts of factors (which topics they're for, the
> > > throughput
> > > > and message size for those topics, etc). By saying that a broker should
> > > not
> > > > have more than 40 partitions assigned I think you're making a big
> > > > assumption that all partitions have the same weight. You're also
> > limiting
> > > > the search space for finding an acceptable assignment. Cluster
> > balancers
> > > > usually use some kind of heuristic optimisation algorithm for figuring
> > > out
> > > > assignments of partitions to brokers, and it could be that the best (or
> > > at
> > > > least a good enough) solution requires assigning the least loaded 41
> > > > partitions to one broker.
> > > >
> > > > The point I'm trying to make here is whatever limit is chosen it's
> > > probably
> > > > been chosen fairly arbitrarily. Even if it's been chosen based on some
> > > > empirical evidence of how a particular cluster behaves it's likely that
> > > > that evidence will become obsolete as the cluster evolves to serve the
> > > > needs of the business running it (e.g. some hot topic gets
> > repartitioned,
> > > > messages get compressed with some new algorithm, some new topics need
> > to
> > > be
> > > > created). For this reason I think the problem you're trying to solve
> > via
> > > > policy (whether that was implemented in a pluggable way or not) is
> > really
> > > > better solved by automating the cluster balancing and having that
> > cluster
> > > > balancer be able to reason about when the cluster has too few brokers
> > for
> > > > the number of partitions, rather than placing some limit on the sizing
> > > and
> > > > shape of the cluster up front and then hobbling the cluster balancer to
> > > > work within that.
> > > >
> > > > I think it might be useful to describe in the KIP how users would be
> > > > expected to arrive at values for these configs (both on day 1 and in an
> > > > evolving production cluster), when this solution might be better than
> > > using
> > > > a cluster balancer and/or why cluster balancers can't be trusted to
> > avoid
> > > > overloading brokers.
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > >
> > > > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > > > gokul2411s@gmail.com> wrote:
> > > >
> > > > > This is good reference Tom. I did not consider this approach at all.
> > I
> > > am
> > > > > happy to learn about it now.
> > > > >
> > > > > However, I think that partition limits are not "yet another" policy
> > > > > configuration. Instead, they are fundamental to partition assignment.
> > > > i.e.
> > > > > the partition assignment algorithm needs to be aware of the partition
> > > > > limits. To illustrate this, imagine that you have 3 brokers (1, 2 and
> > > 3),
> > > > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > > > partitions on each broker enforced via a configurable policy class
> > (the
> > > > one
> > > > > you recommended). While the policy class may accept a topic creation
> > > > > request for 11 partitions with a replication factor of 2 each
> > (because
> > > it
> > > > > is satisfiable), the non-pluggable partition assignment algorithm (in
> > > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> > know
> > > > not
> > > > > to assign the 11th partition to broker 3 because it would run out of
> > > > > partition capacity otherwise.
> > > > >
> > > > > To achieve the ideal end that you are imagining (and I can totally
> > > > > understand where you are coming from vis-a-vis the extensibility of
> > > your
> > > > > solution wrt the one in the KIP), that would require extracting the
> > > > > partition assignment logic itself into a pluggable class, and for
> > which
> > > > we
> > > > > could provide a custom implementation. I am afraid that would add
> > > > > complexity that I am not sure we want to undertake.
> > > > >
> > > > > Do you see sense in what I am saying?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tb...@redhat.com>
> > > wrote:
> > > > >
> > > > > > Hi Gokul,
> > > > > >
> > > > > > Leaving aside the question of how Kafka scales, I think the
> > proposed
> > > > > > solution, limiting the number of partitions in a cluster or
> > > per-broker,
> > > > > is
> > > > > > a policy which ought to be addressable via the pluggable policies
> > > (e.g.
> > > > > > create.topic.policy.class.name). Unfortunately although there's a
> > > > policy
> > > > > > for topic creation, it's currently not possible to enforce a policy
> > > on
> > > > > > partition increase. It would be more flexible to be able enforce
> > this
> > > > > kind
> > > > > > of thing via a pluggable policy, and it would also avoid the
> > > situation
> > > > > > where different people each want to have a config which addresses
> > > some
> > > > > > specific use case or problem that they're experiencing.
> > > > > >
> > > > > > Quite a while ago I proposed KIP-201 to solve this issue with
> > > policies
> > > > > > being easily circumvented, but it didn't really make any progress.
> > > I've
> > > > > > looked at it again in some detail more recently and I think
> > something
> > > > > might
> > > > > > be possible following the work to make all ZK writes happen on the
> > > > > > controller.
> > > > > >
> > > > > > Of course, this is just my take on it.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > > > gokul2411s@gmail.com> wrote:
> > > > > >
> > > > > > > Hi.
> > > > > > >
> > > > > > > For the sake of expediting the discussion, I have created a
> > > prototype
> > > > > PR:
> > > > > > > https://github.com/apache/kafka/pull/8499. Eventually, (if and)
> > > when
> > > > > the
> > > > > > > KIP is accepted, I'll modify this to add the full implementation
> > > and
> > > > > > tests
> > > > > > > etc. in there.
> > > > > > >
> > > > > > > Would appreciate if a Kafka committer could share their thoughts,
> > > so
> > > > > > that I
> > > > > > > can more confidently start the voting thread.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > > > > gokul2411s@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks for your comments Alex.
> > > > > > > >
> > > > > > > > The KIP proposes using two configurations max.partitions and
> > > > > > > > max.broker.partitions. It does not enforce their use. The
> > default
> > > > > > values
> > > > > > > > are pretty large (INT MAX), therefore should be non-intrusive.
> > > > > > > >
> > > > > > > > In multi-tenant environments and in partition assignment and
> > > > > > rebalancing,
> > > > > > > > the admin could (a) use the default values which would yield
> > > > similar
> > > > > > > > behavior to now, (b) set very high values that they know is
> > > > > sufficient,
> > > > > > > (c)
> > > > > > > > dynamically re-adjust the values should the business
> > requirements
> > > > > > change.
> > > > > > > > Note that the two configurations are cluster-wide, so they can
> > be
> > > > > > updated
> > > > > > > > without restarting the brokers.
> > > > > > > >
> > > > > > > > The quota system in Kafka seems to be geared towards limiting
> > > > traffic
> > > > > > for
> > > > > > > > specific clients or users, or in the case of replication, to
> > > > leaders
> > > > > > and
> > > > > > > > followers. The quota configuration itself is very similar to
> > the
> > > > one
> > > > > > > > introduced in this KIP i.e. just a few configuration options to
> > > > > specify
> > > > > > > the
> > > > > > > > quota. The main difference is that the quota system is far more
> > > > > > > > heavy-weight because it needs to be applied to traffic that is
> > > > > flowing
> > > > > > > > in/out constantly. Whereas in this KIP, we want to limit number
> > > of
> > > > > > > > partition replicas, which gets modified rarely by comparison
> > in a
> > > > > > typical
> > > > > > > > cluster.
> > > > > > > >
> > > > > > > > Hope this addresses your comments.
> > > > > > > >
> > > > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Hi Gokul,
> > > > > > > >>
> > > > > > > >> Thanks for the KIP.
> > > > > > > >>
> > > > > > > >> From what I understand, the objective of the new configuration
> > > is
> > > > to
> > > > > > > >> protect a cluster from an overload driven by an excessive
> > number
> > > > of
> > > > > > > >> partitions independently from the load handled on the
> > partitions
> > > > > > > >> themselves. As such, the approach uncouples the data-path load
> > > > from
> > > > > > > >> the number of unit of distributions of throughput and intends
> > to
> > > > > avoid
> > > > > > > >> the degradation of performance exhibited in the test results
> > > > > provided
> > > > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > > > >>
> > > > > > > >> Couple of comments:
> > > > > > > >>
> > > > > > > >> 900. Multi-tenancy - one concern I would have with a cluster
> > and
> > > > > > > >> broker-level configuration is that it is possible for a user
> > to
> > > > > > > >> consume a large proportions of the allocatable partitions
> > within
> > > > the
> > > > > > > >> configured limit, leaving other users with not enough
> > partitions
> > > > to
> > > > > > > >> satisfy their requirements.
> > > > > > > >>
> > > > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> > > upper-bound
> > > > > on
> > > > > > > >> resource consumptions is via client/user quotas. Could this
> > > > > framework
> > > > > > > >> be leveraged to add this limit?
> > > > > > > >>
> > > > > > > >> 902. Partition assignment - one potential problem with the new
> > > > > > > >> repartitioning scheme is that if a subset of brokers have
> > > reached
> > > > > > > >> their number of assignable partitions, yet their data path is
> > > > > > > >> under-loaded, new topics and/or partitions will be assigned
> > > > > > > >> exclusively to other brokers, which could increase the
> > > likelihood
> > > > of
> > > > > > > >> data-path load imbalance. Fundamentally, the isolation of the
> > > > > > > >> constraint on the number of partitions from the data-path
> > > > throughput
> > > > > > > >> can have conflicting requirements.
> > > > > > > >>
> > > > > > > >> 903. Rebalancing - as a corollary to 902, external tools used
> > to
> > > > > > > >> balance ingress throughput may adopt an incremental approach
> > in
> > > > > > > >> partition re-assignment to redistribute load, and could hit
> > the
> > > > > limit
> > > > > > > >> on the number of partitions on a broker when a (too)
> > > conservative
> > > > > > > >> limit is used, thereby over-constraining the objective
> > function
> > > > and
> > > > > > > >> reducing the migration path.
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Alexandre
> > > > > > > >>
> > > > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > > > >> <go...@gmail.com> a écrit :
> > > > > > > >> >
> > > > > > > >> > Hi. Requesting you to take a look at this KIP and provide
> > > > > feedback.
> > > > > > > >> >
> > > > > > > >> > Thanks. Regards.
> > > > > > > >> >
> > > > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > > > > > >> > gokul2411s@gmail.com> wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi.
> > > > > > > >> > >
> > > > > > > >> > > I have opened KIP-578, intended to provide a mechanism to
> > > > limit
> > > > > > the
> > > > > > > >> number
> > > > > > > >> > > of partitions in a Kafka cluster. Kindly provide feedback
> > on
> > > > the
> > > > > > KIP
> > > > > > > >> which
> > > > > > > >> > > you can find at
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > > > >> > >
> > > > > > > >> > > I want to specially thank Stanislav Kozlovski who helped
> > in
> > > > > > > >> formulating
> > > > > > > >> > > some aspects of the KIP.
> > > > > > > >> > >
> > > > > > > >> > > Many thanks,
> > > > > > > >> > >
> > > > > > > >> > > Gokul.
> > > > > > > >> > >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best,
> > Stanislav
> >

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Thanks Stanislav. Apologies about the long absence from this thread.

I would prefer having per-user max partition limits in a separate KIP; I
don't see them as part of the MVP for this KIP. I will add this as an alternative
approach in the KIP.

I was of two minds about whether or not to impose the partition limit
on internal topics as well. I can be convinced both ways. On the one hand,
internal topics should be purely internal, i.e. users of a cluster should
not have to care about them, and in that sense the partition limit should not
apply to internal topics. On the other hand, Kafka allows configuring
internal topics by specifying their replication factor etc., so they
don't feel all that internal to me. In any case, I'll modify the KIP to
exclude internal topics.

I'll also add to the KIP the alternative approach Tom suggested around
using topic policies to limit partitions, and explain why it does not help
to solve the problem that the KIP is trying to address (as I have done in a
previous correspondence on this thread).

Cheers.

On Fri, Apr 24, 2020 at 4:24 PM Stanislav Kozlovski <st...@confluent.io>
wrote:

> Thanks for the KIP, Gokul!
>
> I like the overall premise - I think it's more user-friendly to have
> configs for this than to have users implement their own config policy -> so
> unless it's very complex to implement, it seems worth it.
> I agree that having the topic policy on the CreatePartitions path makes
> sense as well.
>
> Multi-tenancy was a good point. It would be interesting to see how easy it
> is to extend the max partition limit to a per-user basis. Perhaps this can
> be done in a follow-up KIP, as a natural extension of the feature.
>
> I'm wondering whether there's a need to enforce this on internal topics,
> though. Given they're internal and critical to the function of Kafka, I
> believe we'd rather always ensure they're created, regardless if over some
> user-set limit. It brings up the question of forward compatibility - what
> happens if a user's cluster is at the maximum partition capacity, yet a new
> release of Kafka introduces a new topic (e.g KIP-500)?
>
> Best,
> Stanislav
>
> On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com> wrote:
>
> > Hi Tom.
> >
> > With KIP-578, we are not trying to model the load on each partition, and
> > come up with an exact limit on what the cluster or broker can handle in
> > terms of number of partitions. We understand that not all partitions are
> > equal, and the actual load per partition varies based on the message
> size,
> > throughput, whether the broker is a leader for that partition or not etc.
> >
> > What we are trying to achieve with KIP-578 is to disallow a pathological
> > number of partitions that will surely put the cluster in bad shape. For
> > example, in KIP-578's appendix, we have described a case where we could
> not
> > delete a topic with 30k partitions, because the brokers could not
> > handle all the work that needed to be done. We have also described how
> > a producer performance test with 10k partitions observed basically 0
> > throughput. In these cases, having a limit on number of partitions
> > would allow the cluster to produce a graceful error message at topic
> > creation time, and prevent the cluster from entering a pathological
> state.
> > These are not just hypotheticals. We definitely see many of these
> > pathological cases happen in production, and we would like to avoid them.
> >
> > The actual limit on number of partitions is something we do not want to
> > suggest in the KIP. The limit will depend on various tests that owners of
> > their clusters will have to perform, including perf tests, identifying
> > topic creation / deletion times, etc. For example, the tests we did for
> the
> > KIP-578 appendix were enough to convince us that we should not have
> > anywhere close to 10k partitions on the setup we describe there.
> >
> > What we want to do with KIP-578 is provide the flexibility to set a limit
> > on number of partitions based on tests cluster owners choose to perform.
> > Cluster owners can do the tests however often they wish and dynamically
> > adjust the limit on number of partitions. For example, we found in our
> > production environment that we don't want to have more than 1k partitions
> > on an m5.large EC2 broker instances, or more than 300 partitions on a
> > t3.medium EC2 broker, for typical produce / consume use cases.
> >
> > Cluster owners are free to not configure the limit on number of
> partitions
> > if they don't want to spend the time coming up with a limit. The limit
> > defaults to INT32_MAX, which is basically infinity in this context, and
> > should be practically backwards compatible with current behavior.
> >
> > Further, the limit on number of partitions should not come in the way of
> > rebalancing tools under normal operation. For example, if the partition
> > limit per broker is set to 1k, unless the number of partitions comes
> close
> > to 1k, there should be no impact on rebalancing tools. Only when the
> number
> > of partitions comes close to 1k, will rebalancing tools be impacted, but
> at
> > that point, the cluster is already at its limit of functioning (per some
> > definition that was used to set the limit in the first place).
> >
> > Finally, I want to end this long email by suggesting that the partition
> > assignment algorithm itself does not consider the load on various
> > partitions before assigning partitions to brokers. In other words, it
> > treats all partitions as equal. The idea of having a limit on number of
> > partitions is not mis-aligned with this tenet.
> >
> > Thanks.
> >
> > On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com> wrote:
> >
> > > Hi Gokul,
> > >
> > > the partition assignment algorithm needs to be aware of the partition
> > > > limits.
> > > >
> > >
> > > I agree, if you have limits then anything doing reassignment would need
> > > some way of knowing what they were. But the thing is that I'm not
> really
> > > sure how you would decide what the limits ought to be.
> > >
> > >
> > > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > > partitions on each broker enforced via a configurable policy class
> (the
> > > one
> > > > you recommended). While the policy class may accept a topic creation
> > > > request for 11 partitions with a replication factor of 2 each
> (because
> > it
> > > > is satisfiable), the non-pluggable partition assignment algorithm (in
> > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> know
> > > not
> > > > to assign the 11th partition to broker 3 because it would run out of
> > > > partition capacity otherwise.
> > > >
> > >
> > > I know this is only a toy example, but I think it also serves to
> > illustrate
> > > my point above. How has a limit of 40 partitions been arrived at? In
> real
> > > life different partitions will impart a different load on a broker,
> > > depending on all sorts of factors (which topics they're for, the
> > throughput
> > > and message size for those topics, etc). By saying that a broker should
> > not
> > > have more than 40 partitions assigned I think you're making a big
> > > assumption that all partitions have the same weight. You're also
> limiting
> > > the search space for finding an acceptable assignment. Cluster
> balancers
> > > usually use some kind of heuristic optimisation algorithm for figuring
> > out
> > > assignments of partitions to brokers, and it could be that the best (or
> > at
> > > least a good enough) solution requires assigning the least loaded 41
> > > partitions to one broker.
> > >
> > > The point I'm trying to make here is whatever limit is chosen it's
> > probably
> > > been chosen fairly arbitrarily. Even if it's been chosen based on some
> > > empirical evidence of how a particular cluster behaves it's likely that
> > > that evidence will become obsolete as the cluster evolves to serve the
> > > needs of the business running it (e.g. some hot topic gets
> repartitioned,
> > > messages get compressed with some new algorithm, some new topics need
> to
> > be
> > > created). For this reason I think the problem you're trying to solve
> via
> > > policy (whether that was implemented in a pluggable way or not) is
> really
> > > better solved by automating the cluster balancing and having that
> cluster
> > > balancer be able to reason about when the cluster has too few brokers
> for
> > > the number of partitions, rather than placing some limit on the sizing
> > and
> > > shape of the cluster up front and then hobbling the cluster balancer to
> > > work within that.
> > >
> > > I think it might be useful to describe in the KIP how users would be
> > > expected to arrive at values for these configs (both on day 1 and in an
> > > evolving production cluster), when this solution might be better than
> > using
> > > a cluster balancer and/or why cluster balancers can't be trusted to
> avoid
> > > overloading brokers.
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > >
> > > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > > gokul2411s@gmail.com> wrote:
> > >
> > > > This is good reference Tom. I did not consider this approach at all.
> I
> > am
> > > > happy to learn about it now.
> > > >
> > > > However, I think that partition limits are not "yet another" policy
> > > > configuration. Instead, they are fundamental to partition assignment.
> > > i.e.
> > > > the partition assignment algorithm needs to be aware of the partition
> > > > limits. To illustrate this, imagine that you have 3 brokers (1, 2 and
> > 3),
> > > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > > partitions on each broker enforced via a configurable policy class
> (the
> > > one
> > > > you recommended). While the policy class may accept a topic creation
> > > > request for 11 partitions with a replication factor of 2 each
> (because
> > it
> > > > is satisfiable), the non-pluggable partition assignment algorithm (in
> > > > AdminUtils.assignReplicasToBrokers and a few other places) has to
> know
> > > not
> > > > to assign the 11th partition to broker 3 because it would run out of
> > > > partition capacity otherwise.
> > > >
> > > > To achieve the ideal end that you are imagining (and I can totally
> > > > understand where you are coming from vis-a-vis the extensibility of
> > your
> > > > solution wrt the one in the KIP), that would require extracting the
> > > > partition assignment logic itself into a pluggable class, and for
> which
> > > we
> > > > could provide a custom implementation. I am afraid that would add
> > > > complexity that I am not sure we want to undertake.
> > > >
> > > > Do you see sense in what I am saying?
> > > >
> > > > Thanks.
> > > >
> > > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tb...@redhat.com>
> > wrote:
> > > >
> > > > > Hi Gokul,
> > > > >
> > > > > Leaving aside the question of how Kafka scales, I think the
> proposed
> > > > > solution, limiting the number of partitions in a cluster or
> > per-broker,
> > > > is
> > > > > a policy which ought to be addressable via the pluggable policies
> > (e.g.
> > > > > create.topic.policy.class.name). Unfortunately although there's a
> > > policy
> > > > > for topic creation, it's currently not possible to enforce a policy
> > on
> > > > > partition increase. It would be more flexible to be able enforce
> this
> > > > kind
> > > > > of thing via a pluggable policy, and it would also avoid the
> > situation
> > > > > where different people each want to have a config which addresses
> > some
> > > > > specific use case or problem that they're experiencing.
> > > > >
> > > > > Quite a while ago I proposed KIP-201 to solve this issue with
> > policies
> > > > > being easily circumvented, but it didn't really make any progress.
> > I've
> > > > > looked at it again in some detail more recently and I think
> something
> > > > might
> > > > > be possible following the work to make all ZK writes happen on the
> > > > > controller.
> > > > >
> > > > > Of course, this is just my take on it.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > > gokul2411s@gmail.com> wrote:
> > > > >
> > > > > > Hi.
> > > > > >
> > > > > > For the sake of expediting the discussion, I have created a
> > prototype
> > > > PR:
> > > > > > https://github.com/apache/kafka/pull/8499. Eventually, (if and)
> > when
> > > > the
> > > > > > KIP is accepted, I'll modify this to add the full implementation
> > and
> > > > > tests
> > > > > > etc. in there.
> > > > > >
> > > > > > Would appreciate if a Kafka committer could share their thoughts,
> > so
> > > > > that I
> > > > > > can more confidently start the voting thread.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > > > gokul2411s@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks for your comments Alex.
> > > > > > >
> > > > > > > The KIP proposes using two configurations max.partitions and
> > > > > > > max.broker.partitions. It does not enforce their use. The
> default
> > > > > values
> > > > > > > are pretty large (INT MAX), therefore should be non-intrusive.
> > > > > > >
> > > > > > > In multi-tenant environments and in partition assignment and
> > > > > rebalancing,
> > > > > > > the admin could (a) use the default values which would yield
> > > similar
> > > > > > > behavior to now, (b) set very high values that they know is
> > > > sufficient,
> > > > > > (c)
> > > > > > > dynamically re-adjust the values should the business
> requirements
> > > > > change.
> > > > > > > Note that the two configurations are cluster-wide, so they can
> be
> > > > > updated
> > > > > > > without restarting the brokers.
> > > > > > >
> > > > > > > The quota system in Kafka seems to be geared towards limiting
> > > traffic
> > > > > for
> > > > > > > specific clients or users, or in the case of replication, to
> > > leaders
> > > > > and
> > > > > > > followers. The quota configuration itself is very similar to
> the
> > > one
> > > > > > > introduced in this KIP i.e. just a few configuration options to
> > > > specify
> > > > > > the
> > > > > > > quota. The main difference is that the quota system is far more
> > > > > > > heavy-weight because it needs to be applied to traffic that is
> > > > flowing
> > > > > > > in/out constantly. Whereas in this KIP, we want to limit number
> > of
> > > > > > > partition replicas, which gets modified rarely by comparison
> in a
> > > > > typical
> > > > > > > cluster.
> > > > > > >
> > > > > > > Hope this addresses your comments.
> > > > > > >
> > > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > > >
> > > > > > >> Hi Gokul,
> > > > > > >>
> > > > > > >> Thanks for the KIP.
> > > > > > >>
> > > > > > >> From what I understand, the objective of the new configuration
> > is
> > > to
> > > > > > >> protect a cluster from an overload driven by an excessive
> number
> > > of
> > > > > > >> partitions independently from the load handled on the
> partitions
> > > > > > >> themselves. As such, the approach uncouples the data-path load
> > > from
> > > > > > >> the number of unit of distributions of throughput and intends
> to
> > > > avoid
> > > > > > >> the degradation of performance exhibited in the test results
> > > > provided
> > > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > > >>
> > > > > > >> Couple of comments:
> > > > > > >>
> > > > > > >> 900. Multi-tenancy - one concern I would have with a cluster
> and
> > > > > > >> broker-level configuration is that it is possible for a user
> to
> > > > > > >> consume a large proportions of the allocatable partitions
> within
> > > the
> > > > > > >> configured limit, leaving other users with not enough
> partitions
> > > to
> > > > > > >> satisfy their requirements.
> > > > > > >>
> > > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> > upper-bound
> > > > on
> > > > > > >> resource consumptions is via client/user quotas. Could this
> > > > framework
> > > > > > >> be leveraged to add this limit?
> > > > > > >>
> > > > > > >> 902. Partition assignment - one potential problem with the new
> > > > > > >> repartitioning scheme is that if a subset of brokers have
> > reached
> > > > > > >> their number of assignable partitions, yet their data path is
> > > > > > >> under-loaded, new topics and/or partitions will be assigned
> > > > > > >> exclusively to other brokers, which could increase the
> > likelihood
> > > of
> > > > > > >> data-path load imbalance. Fundamentally, the isolation of the
> > > > > > >> constraint on the number of partitions from the data-path
> > > throughput
> > > > > > >> can have conflicting requirements.
> > > > > > >>
> > > > > > >> 903. Rebalancing - as a corollary to 902, external tools used
> to
> > > > > > >> balance ingress throughput may adopt an incremental approach
> in
> > > > > > >> partition re-assignment to redistribute load, and could hit
> the
> > > > limit
> > > > > > >> on the number of partitions on a broker when a (too)
> > conservative
> > > > > > >> limit is used, thereby over-constraining the objective
> function
> > > and
> > > > > > >> reducing the migration path.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Alexandre
> > > > > > >>
> > > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > > >> <go...@gmail.com> a écrit :
> > > > > > >> >
> > > > > > >> > Hi. Requesting you to take a look at this KIP and provide
> > > > feedback.
> > > > > > >> >
> > > > > > >> > Thanks. Regards.
> > > > > > >> >
> > > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > > > > >> > gokul2411s@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > > Hi.
> > > > > > >> > >
> > > > > > >> > > I have opened KIP-578, intended to provide a mechanism to
> > > limit
> > > > > the
> > > > > > >> number
> > > > > > >> > > of partitions in a Kafka cluster. Kindly provide feedback
> on
> > > the
> > > > > KIP
> > > > > > >> which
> > > > > > >> > > you can find at
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > > >> > >
> > > > > > >> > > I want to specially thank Stanislav Kozlovski who helped
> in
> > > > > > >> formulating
> > > > > > >> > > some aspects of the KIP.
> > > > > > >> > >
> > > > > > >> > > Many thanks,
> > > > > > >> > >
> > > > > > >> > > Gokul.
> > > > > > >> > >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Stanislav Kozlovski <st...@confluent.io>.
Thanks for the KIP, Gokul!

I like the overall premise - I think it's more user-friendly to have
configs for this than to have users implement their own config policy, so
unless it's very complex to implement, it seems worth it.
I agree that having the topic policy on the CreatePartitions path makes
sense as well.

Multi-tenancy was a good point. It would be interesting to see how easy it
is to extend the max partition limit to a per-user basis. Perhaps this can
be done in a follow-up KIP, as a natural extension of the feature.

I'm wondering whether there's a need to enforce this on internal topics,
though. Given they're internal and critical to the function of Kafka, I
believe we'd rather always ensure they're created, regardless of whether
that goes over some user-set limit. It brings up the question of forward
compatibility - what happens if a user's cluster is at the maximum partition
capacity, yet a new release of Kafka introduces a new topic (e.g. KIP-500)?

Best,
Stanislav

On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <
gokul2411s@gmail.com> wrote:

> Hi Tom.
>
> With KIP-578, we are not trying to model the load on each partition, and
> come up with an exact limit on what the cluster or broker can handle in
> terms of number of partitions. We understand that not all partitions are
> equal, and the actual load per partition varies based on the message size,
> throughput, whether the broker is a leader for that partition or not etc.
>
> What we are trying to achieve with KIP-578 is to disallow a pathological
> number of partitions that will surely put the cluster in bad shape. For
> example, in KIP-578's appendix, we have described a case where we could not
> delete a topic with 30k partitions, because the brokers could not
> handle all the work that needed to be done. We have also described how
> a producer performance test with 10k partitions observed basically 0
> throughput. In these cases, having a limit on number of partitions
> would allow the cluster to produce a graceful error message at topic
> creation time, and prevent the cluster from entering a pathological state.
> These are not just hypotheticals. We definitely see many of these
> pathological cases happen in production, and we would like to avoid them.
>
> The actual limit on number of partitions is something we do not want to
> suggest in the KIP. The limit will depend on various tests that owners of
> their clusters will have to perform, including perf tests, identifying
> topic creation / deletion times, etc. For example, the tests we did for the
> KIP-578 appendix were enough to convince us that we should not have
> anywhere close to 10k partitions on the setup we describe there.
>
> What we want to do with KIP-578 is provide the flexibility to set a limit
> on number of partitions based on tests cluster owners choose to perform.
> Cluster owners can do the tests however often they wish and dynamically
> adjust the limit on number of partitions. For example, we found in our
> production environment that we don't want to have more than 1k partitions
> on an m5.large EC2 broker instances, or more than 300 partitions on a
> t3.medium EC2 broker, for typical produce / consume use cases.
>
> Cluster owners are free to not configure the limit on number of partitions
> if they don't want to spend the time coming up with a limit. The limit
> defaults to INT32_MAX, which is basically infinity in this context, and
> should be practically backwards compatible with current behavior.
>
> Further, the limit on number of partitions should not come in the way of
> rebalancing tools under normal operation. For example, if the partition
> limit per broker is set to 1k, unless the number of partitions comes close
> to 1k, there should be no impact on rebalancing tools. Only when the number
> of partitions comes close to 1k, will rebalancing tools be impacted, but at
> that point, the cluster is already at its limit of functioning (per some
> definition that was used to set the limit in the first place).
>
> Finally, I want to end this long email by suggesting that the partition
> assignment algorithm itself does not consider the load on various
> partitions before assigning partitions to brokers. In other words, it
> treats all partitions as equal. The idea of having a limit on number of
> partitions is not mis-aligned with this tenet.
>
> Thanks.
>
> On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi Gokul,
> >
> > the partition assignment algorithm needs to be aware of the partition
> > > limits.
> > >
> >
> > I agree, if you have limits then anything doing reassignment would need
> > some way of knowing what they were. But the thing is that I'm not really
> > sure how you would decide what the limits ought to be.
> >
> >
> > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > partitions on each broker enforced via a configurable policy class (the
> > one
> > > you recommended). While the policy class may accept a topic creation
> > > request for 11 partitions with a replication factor of 2 each (because
> it
> > > is satisfiable), the non-pluggable partition assignment algorithm (in
> > > AdminUtils.assignReplicasToBrokers and a few other places) has to know
> > not
> > > to assign the 11th partition to broker 3 because it would run out of
> > > partition capacity otherwise.
> > >
> >
> > I know this is only a toy example, but I think it also serves to
> illustrate
> > my point above. How has a limit of 40 partitions been arrived at? In real
> > life different partitions will impart a different load on a broker,
> > depending on all sorts of factors (which topics they're for, the
> throughput
> > and message size for those topics, etc). By saying that a broker should
> not
> > have more than 40 partitions assigned I think you're making a big
> > assumption that all partitions have the same weight. You're also limiting
> > the search space for finding an acceptable assignment. Cluster balancers
> > usually use some kind of heuristic optimisation algorithm for figuring
> out
> > assignments of partitions to brokers, and it could be that the best (or
> at
> > least a good enough) solution requires assigning the least loaded 41
> > partitions to one broker.
> >
> > The point I'm trying to make here is whatever limit is chosen it's
> probably
> > been chosen fairly arbitrarily. Even if it's been chosen based on some
> > empirical evidence of how a particular cluster behaves it's likely that
> > that evidence will become obsolete as the cluster evolves to serve the
> > needs of the business running it (e.g. some hot topic gets repartitioned,
> > messages get compressed with some new algorithm, some new topics need to
> be
> > created). For this reason I think the problem you're trying to solve via
> > policy (whether that was implemented in a pluggable way or not) is really
> > better solved by automating the cluster balancing and having that cluster
> > balancer be able to reason about when the cluster has too few brokers for
> > the number of partitions, rather than placing some limit on the sizing
> and
> > shape of the cluster up front and then hobbling the cluster balancer to
> > work within that.
> >
> > I think it might be useful to describe in the KIP how users would be
> > expected to arrive at values for these configs (both on day 1 and in an
> > evolving production cluster), when this solution might be better than
> using
> > a cluster balancer and/or why cluster balancers can't be trusted to avoid
> > overloading brokers.
> >
> > Kind regards,
> >
> > Tom
> >
> >
> > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com> wrote:
> >
> > > This is good reference Tom. I did not consider this approach at all. I
> am
> > > happy to learn about it now.
> > >
> > > However, I think that partition limits are not "yet another" policy
> > > configuration. Instead, they are fundamental to partition assignment.
> > i.e.
> > > the partition assignment algorithm needs to be aware of the partition
> > > limits. To illustrate this, imagine that you have 3 brokers (1, 2 and
> 3),
> > > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > > partitions on each broker enforced via a configurable policy class (the
> > one
> > > you recommended). While the policy class may accept a topic creation
> > > request for 11 partitions with a replication factor of 2 each (because
> it
> > > is satisfiable), the non-pluggable partition assignment algorithm (in
> > > AdminUtils.assignReplicasToBrokers and a few other places) has to know
> > not
> > > to assign the 11th partition to broker 3 because it would run out of
> > > partition capacity otherwise.
> > >
> > > To achieve the ideal end that you are imagining (and I can totally
> > > understand where you are coming from vis-a-vis the extensibility of
> your
> > > solution wrt the one in the KIP), that would require extracting the
> > > partition assignment logic itself into a pluggable class, and for which
> > we
> > > could provide a custom implementation. I am afraid that would add
> > > complexity that I am not sure we want to undertake.
> > >
> > > Do you see sense in what I am saying?
> > >
> > > Thanks.
> > >
> > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tb...@redhat.com>
> wrote:
> > >
> > > > Hi Gokul,
> > > >
> > > > Leaving aside the question of how Kafka scales, I think the proposed
> > > > solution, limiting the number of partitions in a cluster or
> per-broker,
> > > is
> > > > a policy which ought to be addressable via the pluggable policies
> (e.g.
> > > > create.topic.policy.class.name). Unfortunately although there's a
> > policy
> > > > for topic creation, it's currently not possible to enforce a policy
> on
> > > > partition increase. It would be more flexible to be able enforce this
> > > kind
> > > > of thing via a pluggable policy, and it would also avoid the
> situation
> > > > where different people each want to have a config which addresses
> some
> > > > specific use case or problem that they're experiencing.
> > > >
> > > > Quite a while ago I proposed KIP-201 to solve this issue with
> policies
> > > > being easily circumvented, but it didn't really make any progress.
> I've
> > > > looked at it again in some detail more recently and I think something
> > > might
> > > > be possible following the work to make all ZK writes happen on the
> > > > controller.
> > > >
> > > > Of course, this is just my take on it.
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > > gokul2411s@gmail.com> wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > For the sake of expediting the discussion, I have created a
> prototype
> > > PR:
> > > > > https://github.com/apache/kafka/pull/8499. Eventually, (if and)
> when
> > > the
> > > > > KIP is accepted, I'll modify this to add the full implementation
> and
> > > > tests
> > > > > etc. in there.
> > > > >
> > > > > Would appreciate if a Kafka committer could share their thoughts,
> so
> > > > that I
> > > > > can more confidently start the voting thread.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > > gokul2411s@gmail.com> wrote:
> > > > >
> > > > > > Thanks for your comments Alex.
> > > > > >
> > > > > > The KIP proposes using two configurations max.partitions and
> > > > > > max.broker.partitions. It does not enforce their use. The default
> > > > values
> > > > > > are pretty large (INT MAX), therefore should be non-intrusive.
> > > > > >
> > > > > > In multi-tenant environments and in partition assignment and
> > > > rebalancing,
> > > > > > the admin could (a) use the default values which would yield
> > similar
> > > > > > behavior to now, (b) set very high values that they know is
> > > sufficient,
> > > > > (c)
> > > > > > dynamically re-adjust the values should the business requirements
> > > > change.
> > > > > > Note that the two configurations are cluster-wide, so they can be
> > > > updated
> > > > > > without restarting the brokers.
> > > > > >
> > > > > > The quota system in Kafka seems to be geared towards limiting
> > traffic
> > > > for
> > > > > > specific clients or users, or in the case of replication, to
> > leaders
> > > > and
> > > > > > followers. The quota configuration itself is very similar to the
> > one
> > > > > > introduced in this KIP i.e. just a few configuration options to
> > > specify
> > > > > the
> > > > > > quota. The main difference is that the quota system is far more
> > > > > > heavy-weight because it needs to be applied to traffic that is
> > > flowing
> > > > > > in/out constantly. Whereas in this KIP, we want to limit number
> of
> > > > > > partition replicas, which gets modified rarely by comparison in a
> > > > typical
> > > > > > cluster.
> > > > > >
> > > > > > Hope this addresses your comments.
> > > > > >
> > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > > alexandre.dupriez@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Gokul,
> > > > > >>
> > > > > >> Thanks for the KIP.
> > > > > >>
> > > > > >> From what I understand, the objective of the new configuration
> is
> > to
> > > > > >> protect a cluster from an overload driven by an excessive number
> > of
> > > > > >> partitions independently from the load handled on the partitions
> > > > > >> themselves. As such, the approach uncouples the data-path load
> > from
> > > > > >> the number of unit of distributions of throughput and intends to
> > > avoid
> > > > > >> the degradation of performance exhibited in the test results
> > > provided
> > > > > >> with the KIP by setting an upper-bound on that number.
> > > > > >>
> > > > > >> Couple of comments:
> > > > > >>
> > > > > >> 900. Multi-tenancy - one concern I would have with a cluster and
> > > > > >> broker-level configuration is that it is possible for a user to
> > > > > >> consume a large proportions of the allocatable partitions within
> > the
> > > > > >> configured limit, leaving other users with not enough partitions
> > to
> > > > > >> satisfy their requirements.
> > > > > >>
> > > > > >> 901. Quotas - an approach in Apache Kafka to set-up an
> upper-bound
> > > on
> > > > > >> resource consumptions is via client/user quotas. Could this
> > > framework
> > > > > >> be leveraged to add this limit?
> > > > > >>
> > > > > >> 902. Partition assignment - one potential problem with the new
> > > > > >> repartitioning scheme is that if a subset of brokers have
> reached
> > > > > >> their number of assignable partitions, yet their data path is
> > > > > >> under-loaded, new topics and/or partitions will be assigned
> > > > > >> exclusively to other brokers, which could increase the
> likelihood
> > of
> > > > > >> data-path load imbalance. Fundamentally, the isolation of the
> > > > > >> constraint on the number of partitions from the data-path
> > throughput
> > > > > >> can have conflicting requirements.
> > > > > >>
> > > > > >> 903. Rebalancing - as a corollary to 902, external tools used to
> > > > > >> balance ingress throughput may adopt an incremental approach in
> > > > > >> partition re-assignment to redistribute load, and could hit the
> > > limit
> > > > > >> on the number of partitions on a broker when a (too)
> conservative
> > > > > >> limit is used, thereby over-constraining the objective function
> > and
> > > > > >> reducing the migration path.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Alexandre
> > > > > >>
> > > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > > >> <go...@gmail.com> a écrit :
> > > > > >> >
> > > > > >> > Hi. Requesting you to take a look at this KIP and provide
> > > feedback.
> > > > > >> >
> > > > > >> > Thanks. Regards.
> > > > > >> >
> > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > > > >> > gokul2411s@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > Hi.
> > > > > >> > >
> > > > > >> > > I have opened KIP-578, intended to provide a mechanism to
> > limit
> > > > the
> > > > > >> number
> > > > > >> > > of partitions in a Kafka cluster. Kindly provide feedback on
> > the
> > > > KIP
> > > > > >> which
> > > > > >> > > you can find at
> > > > > >> > >
> > > > > >> > >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > >> > >
> > > > > >> > > I want to specially thank Stanislav Kozlovski who helped in
> > > > > >> formulating
> > > > > >> > > some aspects of the KIP.
> > > > > >> > >
> > > > > >> > > Many thanks,
> > > > > >> > >
> > > > > >> > > Gokul.
> > > > > >> > >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Hi Tom.

With KIP-578, we are not trying to model the load on each partition, and
come up with an exact limit on what the cluster or broker can handle in
terms of number of partitions. We understand that not all partitions are
equal, and the actual load per partition varies based on the message size,
throughput, whether the broker is a leader for that partition or not etc.

What we are trying to achieve with KIP-578 is to disallow a pathological
number of partitions that will surely put the cluster in bad shape. For
example, in KIP-578's appendix, we have described a case where we could not
delete a topic with 30k partitions, because the brokers could not
handle all the work that needed to be done. We have also described how
a producer performance test with 10k partitions achieved essentially zero
throughput. In these cases, having a limit on number of partitions
would allow the cluster to produce a graceful error message at topic
creation time, and prevent the cluster from entering a pathological state.
These are not just hypotheticals. We definitely see many of these
pathological cases happen in production, and we would like to avoid them.

The actual limit on number of partitions is something we do not want to
suggest in the KIP. The limit will depend on various tests that owners of
their clusters will have to perform, including perf tests, identifying
topic creation / deletion times, etc. For example, the tests we did for the
KIP-578 appendix were enough to convince us that we should not have
anywhere close to 10k partitions on the setup we describe there.
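
For illustration, one of the simpler tests a cluster owner could run is to
time topic creation and deletion at a candidate partition count through the
Java Admin API. The sketch below is a rough outline only; the topic name,
partition count, replication factor and bootstrap address are placeholders,
not recommendations from the KIP.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicLifecycleTimer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Placeholder topic and sizing; use the partition count being evaluated.
            NewTopic candidate = new NewTopic("partition-limit-probe", 10_000, (short) 3);

            long start = System.nanoTime();
            admin.createTopics(Collections.singleton(candidate)).all().get();
            long created = System.nanoTime();
            admin.deleteTopics(Collections.singleton("partition-limit-probe")).all().get();
            long deleted = System.nanoTime();

            // Creation and deletion latency at this partition count is one input
            // into choosing a value for the proposed limits.
            System.out.printf("create: %d ms, delete: %d ms%n",
                    (created - start) / 1_000_000, (deleted - created) / 1_000_000);
        }
    }
}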

What we want to do with KIP-578 is provide the flexibility to set a limit
on number of partitions based on tests cluster owners choose to perform.
Cluster owners can do the tests however often they wish and dynamically
adjust the limit on number of partitions. For example, we found in our
production environment that we don't want to have more than 1k partitions
on an m5.large EC2 broker instance, or more than 300 partitions on a
t3.medium EC2 broker, for typical produce / consume use cases.
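
To make the "dynamically adjust" part concrete, here is a minimal sketch of
how such a cluster-wide default could be changed at runtime through the
standard incremental alter-configs path. Note that max.broker.partitions is
the configuration this KIP proposes and does not exist in Kafka today; the
bootstrap address and the value of 1000 are placeholders.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class AdjustPartitionLimit {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // An empty resource name addresses the cluster-wide default for all brokers.
            ConfigResource clusterDefault =
                    new ConfigResource(ConfigResource.Type.BROKER, "");
            // max.broker.partitions is the config proposed by KIP-578, not an existing one.
            AlterConfigOp setLimit = new AlterConfigOp(
                    new ConfigEntry("max.broker.partitions", "1000"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Map.of(clusterDefault, Collections.singleton(setLimit))).all().get();
        }
    }
}

Since no broker restart is involved, the same path can be used to tighten or
relax the limit as the fleet or the instance types change.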

Cluster owners are free to not configure the limit on number of partitions
if they don't want to spend the time coming up with a limit. The limit
defaults to INT32_MAX, which is basically infinity in this context, and
should be practically backwards compatible with current behavior.

Further, the limit on number of partitions should not get in the way of
rebalancing tools under normal operation. For example, if the partition
limit per broker is set to 1k, unless the number of partitions comes close
to 1k, there should be no impact on rebalancing tools. Only when the number
of partitions comes close to 1k will rebalancing tools be impacted, but at
that point, the cluster is already at its limit of functioning (per some
definition that was used to set the limit in the first place).

Finally, I want to end this long email by suggesting that the partition
assignment algorithm itself does not consider the load on various
partitions before assigning partitions to brokers. In other words, it
treats all partitions as equal. The idea of having a limit on number of
partitions is not mis-aligned with this tenet.

Thanks.

On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tb...@redhat.com> wrote:

> Hi Gokul,
>
> the partition assignment algorithm needs to be aware of the partition
> > limits.
> >
>
> I agree, if you have limits then anything doing reassignment would need
> some way of knowing what they were. But the thing is that I'm not really
> sure how you would decide what the limits ought to be.
>
>
> > To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > partitions on each broker enforced via a configurable policy class (the
> one
> > you recommended). While the policy class may accept a topic creation
> > request for 11 partitions with a replication factor of 2 each (because it
> > is satisfiable), the non-pluggable partition assignment algorithm (in
> > AdminUtils.assignReplicasToBrokers and a few other places) has to know
> not
> > to assign the 11th partition to broker 3 because it would run out of
> > partition capacity otherwise.
> >
>
> I know this is only a toy example, but I think it also serves to illustrate
> my point above. How has a limit of 40 partitions been arrived at? In real
> life different partitions will impart a different load on a broker,
> depending on all sorts of factors (which topics they're for, the throughput
> and message size for those topics, etc). By saying that a broker should not
> have more than 40 partitions assigned I think you're making a big
> assumption that all partitions have the same weight. You're also limiting
> the search space for finding an acceptable assignment. Cluster balancers
> usually use some kind of heuristic optimisation algorithm for figuring out
> assignments of partitions to brokers, and it could be that the best (or at
> least a good enough) solution requires assigning the least loaded 41
> partitions to one broker.
>
> The point I'm trying to make here is whatever limit is chosen it's probably
> been chosen fairly arbitrarily. Even if it's been chosen based on some
> empirical evidence of how a particular cluster behaves it's likely that
> that evidence will become obsolete as the cluster evolves to serve the
> needs of the business running it (e.g. some hot topic gets repartitioned,
> messages get compressed with some new algorithm, some new topics need to be
> created). For this reason I think the problem you're trying to solve via
> policy (whether that was implemented in a pluggable way or not) is really
> better solved by automating the cluster balancing and having that cluster
> balancer be able to reason about when the cluster has too few brokers for
> the number of partitions, rather than placing some limit on the sizing and
> shape of the cluster up front and then hobbling the cluster balancer to
> work within that.
>
> I think it might be useful to describe in the KIP how users would be
> expected to arrive at values for these configs (both on day 1 and in an
> evolving production cluster), when this solution might be better than using
> a cluster balancer and/or why cluster balancers can't be trusted to avoid
> overloading brokers.
>
> Kind regards,
>
> Tom
>
>
> On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com> wrote:
>
> > This is good reference Tom. I did not consider this approach at all. I am
> > happy to learn about it now.
> >
> > However, I think that partition limits are not "yet another" policy
> > configuration. Instead, they are fundamental to partition assignment.
> i.e.
> > the partition assignment algorithm needs to be aware of the partition
> > limits. To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> > with 10, 20 and 30 partitions each respectively, and a limit of 40
> > partitions on each broker enforced via a configurable policy class (the
> one
> > you recommended). While the policy class may accept a topic creation
> > request for 11 partitions with a replication factor of 2 each (because it
> > is satisfiable), the non-pluggable partition assignment algorithm (in
> > AdminUtils.assignReplicasToBrokers and a few other places) has to know
> not
> > to assign the 11th partition to broker 3 because it would run out of
> > partition capacity otherwise.
> >
> > To achieve the ideal end that you are imagining (and I can totally
> > understand where you are coming from vis-a-vis the extensibility of your
> > solution wrt the one in the KIP), that would require extracting the
> > partition assignment logic itself into a pluggable class, and for which
> we
> > could provide a custom implementation. I am afraid that would add
> > complexity that I am not sure we want to undertake.
> >
> > Do you see sense in what I am saying?
> >
> > Thanks.
> >
> > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tb...@redhat.com> wrote:
> >
> > > Hi Gokul,
> > >
> > > Leaving aside the question of how Kafka scales, I think the proposed
> > > solution, limiting the number of partitions in a cluster or per-broker,
> > is
> > > a policy which ought to be addressable via the pluggable policies (e.g.
> > > create.topic.policy.class.name). Unfortunately although there's a
> policy
> > > for topic creation, it's currently not possible to enforce a policy on
> > > partition increase. It would be more flexible to be able enforce this
> > kind
> > > of thing via a pluggable policy, and it would also avoid the situation
> > > where different people each want to have a config which addresses some
> > > specific use case or problem that they're experiencing.
> > >
> > > Quite a while ago I proposed KIP-201 to solve this issue with policies
> > > being easily circumvented, but it didn't really make any progress. I've
> > > looked at it again in some detail more recently and I think something
> > might
> > > be possible following the work to make all ZK writes happen on the
> > > controller.
> > >
> > > Of course, this is just my take on it.
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > > gokul2411s@gmail.com> wrote:
> > >
> > > > Hi.
> > > >
> > > > For the sake of expediting the discussion, I have created a prototype
> > PR:
> > > > https://github.com/apache/kafka/pull/8499. Eventually, (if and) when
> > the
> > > > KIP is accepted, I'll modify this to add the full implementation and
> > > tests
> > > > etc. in there.
> > > >
> > > > Would appreciate if a Kafka committer could share their thoughts, so
> > > that I
> > > > can more confidently start the voting thread.
> > > >
> > > > Thanks.
> > > >
> > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > > gokul2411s@gmail.com> wrote:
> > > >
> > > > > Thanks for your comments Alex.
> > > > >
> > > > > The KIP proposes using two configurations max.partitions and
> > > > > max.broker.partitions. It does not enforce their use. The default
> > > values
> > > > > are pretty large (INT MAX), therefore should be non-intrusive.
> > > > >
> > > > > In multi-tenant environments and in partition assignment and
> > > rebalancing,
> > > > > the admin could (a) use the default values which would yield
> similar
> > > > > behavior to now, (b) set very high values that they know is
> > sufficient,
> > > > (c)
> > > > > dynamically re-adjust the values should the business requirements
> > > change.
> > > > > Note that the two configurations are cluster-wide, so they can be
> > > updated
> > > > > without restarting the brokers.
> > > > >
> > > > > The quota system in Kafka seems to be geared towards limiting
> traffic
> > > for
> > > > > specific clients or users, or in the case of replication, to
> leaders
> > > and
> > > > > followers. The quota configuration itself is very similar to the
> one
> > > > > introduced in this KIP i.e. just a few configuration options to
> > specify
> > > > the
> > > > > quota. The main difference is that the quota system is far more
> > > > > heavy-weight because it needs to be applied to traffic that is
> > flowing
> > > > > in/out constantly. Whereas in this KIP, we want to limit number of
> > > > > partition replicas, which gets modified rarely by comparison in a
> > > typical
> > > > > cluster.
> > > > >
> > > > > Hope this addresses your comments.
> > > > >
> > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > > alexandre.dupriez@gmail.com> wrote:
> > > > >
> > > > >> Hi Gokul,
> > > > >>
> > > > >> Thanks for the KIP.
> > > > >>
> > > > >> From what I understand, the objective of the new configuration is
> to
> > > > >> protect a cluster from an overload driven by an excessive number
> of
> > > > >> partitions independently from the load handled on the partitions
> > > > >> themselves. As such, the approach uncouples the data-path load
> from
> > > > >> the number of unit of distributions of throughput and intends to
> > avoid
> > > > >> the degradation of performance exhibited in the test results
> > provided
> > > > >> with the KIP by setting an upper-bound on that number.
> > > > >>
> > > > >> Couple of comments:
> > > > >>
> > > > >> 900. Multi-tenancy - one concern I would have with a cluster and
> > > > >> broker-level configuration is that it is possible for a user to
> > > > >> consume a large proportions of the allocatable partitions within
> the
> > > > >> configured limit, leaving other users with not enough partitions
> to
> > > > >> satisfy their requirements.
> > > > >>
> > > > >> 901. Quotas - an approach in Apache Kafka to set-up an upper-bound
> > on
> > > > >> resource consumptions is via client/user quotas. Could this
> > framework
> > > > >> be leveraged to add this limit?
> > > > >>
> > > > >> 902. Partition assignment - one potential problem with the new
> > > > >> repartitioning scheme is that if a subset of brokers have reached
> > > > >> their number of assignable partitions, yet their data path is
> > > > >> under-loaded, new topics and/or partitions will be assigned
> > > > >> exclusively to other brokers, which could increase the likelihood
> of
> > > > >> data-path load imbalance. Fundamentally, the isolation of the
> > > > >> constraint on the number of partitions from the data-path
> throughput
> > > > >> can have conflicting requirements.
> > > > >>
> > > > >> 903. Rebalancing - as a corollary to 902, external tools used to
> > > > >> balance ingress throughput may adopt an incremental approach in
> > > > >> partition re-assignment to redistribute load, and could hit the
> > limit
> > > > >> on the number of partitions on a broker when a (too) conservative
> > > > >> limit is used, thereby over-constraining the objective function
> and
> > > > >> reducing the migration path.
> > > > >>
> > > > >> Thanks,
> > > > >> Alexandre
> > > > >>
> > > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > > >> <go...@gmail.com> a écrit :
> > > > >> >
> > > > >> > Hi. Requesting you to take a look at this KIP and provide
> > feedback.
> > > > >> >
> > > > >> > Thanks. Regards.
> > > > >> >
> > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > > >> > gokul2411s@gmail.com> wrote:
> > > > >> >
> > > > >> > > Hi.
> > > > >> > >
> > > > >> > > I have opened KIP-578, intended to provide a mechanism to
> limit
> > > the
> > > > >> number
> > > > >> > > of partitions in a Kafka cluster. Kindly provide feedback on
> the
> > > KIP
> > > > >> which
> > > > >> > > you can find at
> > > > >> > >
> > > > >> > >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > >> > >
> > > > >> > > I want to specially thank Stanislav Kozlovski who helped in
> > > > >> formulating
> > > > >> > > some aspects of the KIP.
> > > > >> > >
> > > > >> > > Many thanks,
> > > > >> > >
> > > > >> > > Gokul.
> > > > >> > >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Tom Bentley <tb...@redhat.com>.
Hi Gokul,

the partition assignment algorithm needs to be aware of the partition
> limits.
>

I agree, if you have limits then anything doing reassignment would need
some way of knowing what they were. But the thing is that I'm not really
sure how you would decide what the limits ought to be.


> To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> with 10, 20 and 30 partitions each respectively, and a limit of 40
> partitions on each broker enforced via a configurable policy class (the one
> you recommended). While the policy class may accept a topic creation
> request for 11 partitions with a replication factor of 2 each (because it
> is satisfiable), the non-pluggable partition assignment algorithm (in
> AdminUtils.assignReplicasToBrokers and a few other places) has to know not
> to assign the 11th partition to broker 3 because it would run out of
> partition capacity otherwise.
>

I know this is only a toy example, but I think it also serves to illustrate
my point above. How has a limit of 40 partitions been arrived at? In real
life different partitions will impart a different load on a broker,
depending on all sorts of factors (which topics they're for, the throughput
and message size for those topics, etc). By saying that a broker should not
have more than 40 partitions assigned I think you're making a big
assumption that all partitions have the same weight. You're also limiting
the search space for finding an acceptable assignment. Cluster balancers
usually use some kind of heuristic optimisation algorithm for figuring out
assignments of partitions to brokers, and it could be that the best (or at
least a good enough) solution requires assigning the least loaded 41
partitions to one broker.
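
For illustration only, here is a toy greedy heuristic with made-up
per-partition weights (not any real balancer's algorithm). It shows how an
assignment that balances estimated load can be very unbalanced in raw
partition counts, which is the kind of solution a hard per-broker cap would
exclude.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy greedy heuristic: place each partition on the broker with the lowest
// estimated load so far. The weights are invented purely for illustration
// (one "hot" partition and sixteen light ones across three brokers).
public class GreedyBalancerSketch {
    public static void main(String[] args) {
        List<Integer> brokers = List.of(1, 2, 3);
        Map<Integer, Double> load = new HashMap<>();
        Map<Integer, Integer> count = new HashMap<>();
        brokers.forEach(b -> { load.put(b, 0.0); count.put(b, 0); });

        double[] weights = new double[17];
        weights[0] = 9.0;                              // one hot partition
        Arrays.fill(weights, 1, weights.length, 1.0);  // sixteen light partitions

        for (double w : weights) {
            int target = load.entrySet().stream()
                    .min(Map.Entry.comparingByValue())
                    .get().getKey();
            load.merge(target, w, Double::sum);
            count.merge(target, 1, Integer::sum);
        }
        // Loads end up close to even (roughly 9 vs 8 vs 8) while partition
        // counts are very uneven (1 vs 8 vs 8); a hard per-broker count cap
        // could rule this near-balanced assignment out.
        System.out.println("load: " + load + ", count: " + count);
    }
}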

The point I'm trying to make here is that whatever limit is chosen, it's
probably been chosen fairly arbitrarily. Even if it's been chosen based on some
empirical evidence of how a particular cluster behaves it's likely that
that evidence will become obsolete as the cluster evolves to serve the
needs of the business running it (e.g. some hot topic gets repartitioned,
messages get compressed with some new algorithm, some new topics need to be
created). For this reason I think the problem you're trying to solve via
policy (whether that was implemented in a pluggable way or not) is really
better solved by automating the cluster balancing and having that cluster
balancer be able to reason about when the cluster has too few brokers for
the number of partitions, rather than placing some limit on the sizing and
shape of the cluster up front and then hobbling the cluster balancer to
work within that.

I think it might be useful to describe in the KIP how users would be
expected to arrive at values for these configs (both on day 1 and in an
evolving production cluster), when this solution might be better than using
a cluster balancer and/or why cluster balancers can't be trusted to avoid
overloading brokers.

Kind regards,

Tom


On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <
gokul2411s@gmail.com> wrote:

> This is good reference Tom. I did not consider this approach at all. I am
> happy to learn about it now.
>
> However, I think that partition limits are not "yet another" policy
> configuration. Instead, they are fundamental to partition assignment. i.e.
> the partition assignment algorithm needs to be aware of the partition
> limits. To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
> with 10, 20 and 30 partitions each respectively, and a limit of 40
> partitions on each broker enforced via a configurable policy class (the one
> you recommended). While the policy class may accept a topic creation
> request for 11 partitions with a replication factor of 2 each (because it
> is satisfiable), the non-pluggable partition assignment algorithm (in
> AdminUtils.assignReplicasToBrokers and a few other places) has to know not
> to assign the 11th partition to broker 3 because it would run out of
> partition capacity otherwise.
>
> To achieve the ideal end that you are imagining (and I can totally
> understand where you are coming from vis-a-vis the extensibility of your
> solution wrt the one in the KIP), that would require extracting the
> partition assignment logic itself into a pluggable class, and for which we
> could provide a custom implementation. I am afraid that would add
> complexity that I am not sure we want to undertake.
>
> Do you see sense in what I am saying?
>
> Thanks.
>
> On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi Gokul,
> >
> > Leaving aside the question of how Kafka scales, I think the proposed
> > solution, limiting the number of partitions in a cluster or per-broker,
> is
> > a policy which ought to be addressable via the pluggable policies (e.g.
> > create.topic.policy.class.name). Unfortunately although there's a policy
> > for topic creation, it's currently not possible to enforce a policy on
> > partition increase. It would be more flexible to be able enforce this
> kind
> > of thing via a pluggable policy, and it would also avoid the situation
> > where different people each want to have a config which addresses some
> > specific use case or problem that they're experiencing.
> >
> > Quite a while ago I proposed KIP-201 to solve this issue with policies
> > being easily circumvented, but it didn't really make any progress. I've
> > looked at it again in some detail more recently and I think something
> might
> > be possible following the work to make all ZK writes happen on the
> > controller.
> >
> > Of course, this is just my take on it.
> >
> > Kind regards,
> >
> > Tom
> >
> > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com> wrote:
> >
> > > Hi.
> > >
> > > For the sake of expediting the discussion, I have created a prototype
> PR:
> > > https://github.com/apache/kafka/pull/8499. Eventually, (if and) when
> the
> > > KIP is accepted, I'll modify this to add the full implementation and
> > tests
> > > etc. in there.
> > >
> > > Would appreciate if a Kafka committer could share their thoughts, so
> > that I
> > > can more confidently start the voting thread.
> > >
> > > Thanks.
> > >
> > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > > gokul2411s@gmail.com> wrote:
> > >
> > > > Thanks for your comments Alex.
> > > >
> > > > The KIP proposes using two configurations max.partitions and
> > > > max.broker.partitions. It does not enforce their use. The default
> > values
> > > > are pretty large (INT MAX), therefore should be non-intrusive.
> > > >
> > > > In multi-tenant environments and in partition assignment and
> > rebalancing,
> > > > the admin could (a) use the default values which would yield similar
> > > > behavior to now, (b) set very high values that they know is
> sufficient,
> > > (c)
> > > > dynamically re-adjust the values should the business requirements
> > change.
> > > > Note that the two configurations are cluster-wide, so they can be
> > updated
> > > > without restarting the brokers.
> > > >
> > > > The quota system in Kafka seems to be geared towards limiting traffic
> > for
> > > > specific clients or users, or in the case of replication, to leaders
> > and
> > > > followers. The quota configuration itself is very similar to the one
> > > > introduced in this KIP i.e. just a few configuration options to
> specify
> > > the
> > > > quota. The main difference is that the quota system is far more
> > > > heavy-weight because it needs to be applied to traffic that is
> flowing
> > > > in/out constantly. Whereas in this KIP, we want to limit number of
> > > > partition replicas, which gets modified rarely by comparison in a
> > typical
> > > > cluster.
> > > >
> > > > Hope this addresses your comments.
> > > >
> > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > > alexandre.dupriez@gmail.com> wrote:
> > > >
> > > >> Hi Gokul,
> > > >>
> > > >> Thanks for the KIP.
> > > >>
> > > >> From what I understand, the objective of the new configuration is to
> > > >> protect a cluster from an overload driven by an excessive number of
> > > >> partitions independently from the load handled on the partitions
> > > >> themselves. As such, the approach uncouples the data-path load from
> > > >> the number of unit of distributions of throughput and intends to
> avoid
> > > >> the degradation of performance exhibited in the test results
> provided
> > > >> with the KIP by setting an upper-bound on that number.
> > > >>
> > > >> Couple of comments:
> > > >>
> > > >> 900. Multi-tenancy - one concern I would have with a cluster and
> > > >> broker-level configuration is that it is possible for a user to
> > > >> consume a large proportions of the allocatable partitions within the
> > > >> configured limit, leaving other users with not enough partitions to
> > > >> satisfy their requirements.
> > > >>
> > > >> 901. Quotas - an approach in Apache Kafka to set-up an upper-bound
> on
> > > >> resource consumptions is via client/user quotas. Could this
> framework
> > > >> be leveraged to add this limit?
> > > >>
> > > >> 902. Partition assignment - one potential problem with the new
> > > >> repartitioning scheme is that if a subset of brokers have reached
> > > >> their number of assignable partitions, yet their data path is
> > > >> under-loaded, new topics and/or partitions will be assigned
> > > >> exclusively to other brokers, which could increase the likelihood of
> > > >> data-path load imbalance. Fundamentally, the isolation of the
> > > >> constraint on the number of partitions from the data-path throughput
> > > >> can have conflicting requirements.
> > > >>
> > > >> 903. Rebalancing - as a corollary to 902, external tools used to
> > > >> balance ingress throughput may adopt an incremental approach in
> > > >> partition re-assignment to redistribute load, and could hit the
> limit
> > > >> on the number of partitions on a broker when a (too) conservative
> > > >> limit is used, thereby over-constraining the objective function and
> > > >> reducing the migration path.
> > > >>
> > > >> Thanks,
> > > >> Alexandre
> > > >>
> > > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > > >> <go...@gmail.com> a écrit :
> > > >> >
> > > >> > Hi. Requesting you to take a look at this KIP and provide
> feedback.
> > > >> >
> > > >> > Thanks. Regards.
> > > >> >
> > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > > >> > gokul2411s@gmail.com> wrote:
> > > >> >
> > > >> > > Hi.
> > > >> > >
> > > >> > > I have opened KIP-578, intended to provide a mechanism to limit
> > the
> > > >> number
> > > >> > > of partitions in a Kafka cluster. Kindly provide feedback on the
> > KIP
> > > >> which
> > > >> > > you can find at
> > > >> > >
> > > >> > >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > >> > >
> > > >> > > I want to specially thank Stanislav Kozlovski who helped in
> > > >> formulating
> > > >> > > some aspects of the KIP.
> > > >> > >
> > > >> > > Many thanks,
> > > >> > >
> > > >> > > Gokul.
> > > >> > >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
This is a good reference, Tom. I did not consider this approach at all. I am
happy to learn about it now.

However, I think that partition limits are not "yet another" policy
configuration. Instead, they are fundamental to partition assignment. i.e.
the partition assignment algorithm needs to be aware of the partition
limits. To illustrate this, imagine that you have 3 brokers (1, 2 and 3),
with 10, 20 and 30 partitions each respectively, and a limit of 40
partitions on each broker enforced via a configurable policy class (the one
you recommended). While the policy class may accept a topic creation
request for 11 partitions with a replication factor of 2 each (because it
is satisfiable), the non-pluggable partition assignment algorithm (in
AdminUtils.assignReplicasToBrokers and a few other places) has to know not
to assign the 11th partition to broker 3 because it would run out of
partition capacity otherwise.
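
To sketch what that awareness would look like inside the assignment code (a
simplification for illustration only, not the actual Scala AdminUtils logic),
every replica placement step would have to filter out brokers that are
already at their limit, along these lines:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch only: a limit-aware assignment step must know, for every broker,
// how many replicas it already hosts and what the per-broker limit is, so it
// can exclude full brokers before choosing a placement. (Real assignment also
// spreads replicas of a partition across distinct brokers and racks.)
public class LimitAwareAssignmentSketch {

    static List<Integer> brokersWithCapacity(Map<Integer, Integer> replicasPerBroker,
                                             int maxBrokerPartitions) {
        List<Integer> eligible = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : replicasPerBroker.entrySet()) {
            if (e.getValue() < maxBrokerPartitions) {
                eligible.add(e.getKey());
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        // The toy numbers from above: brokers 1, 2 and 3 currently host 10, 20
        // and 30 replicas, with a per-broker limit of 40. All three qualify.
        System.out.println(brokersWithCapacity(Map.of(1, 10, 2, 20, 3, 30), 40));
        // Once broker 3 reaches the limit, it must no longer be offered to the
        // placement loop; that is precisely what a topic-creation policy alone
        // cannot communicate to the assignment code.
        System.out.println(brokersWithCapacity(Map.of(1, 10, 2, 20, 3, 40), 40));
    }
}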

To achieve the ideal end that you are imagining (and I can totally
understand where you are coming from vis-a-vis the extensibility of your
solution wrt the one in the KIP), that would require extracting the
partition assignment logic itself into a pluggable class, for which we
could provide a custom implementation. I am afraid that would add
complexity that I am not sure we want to undertake.

Do you see sense in what I am saying?

Thanks.

On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tb...@redhat.com> wrote:

> Hi Gokul,
>
> Leaving aside the question of how Kafka scales, I think the proposed
> solution, limiting the number of partitions in a cluster or per-broker, is
> a policy which ought to be addressable via the pluggable policies (e.g.
> create.topic.policy.class.name). Unfortunately although there's a policy
> for topic creation, it's currently not possible to enforce a policy on
> partition increase. It would be more flexible to be able enforce this kind
> of thing via a pluggable policy, and it would also avoid the situation
> where different people each want to have a config which addresses some
> specific use case or problem that they're experiencing.
>
> Quite a while ago I proposed KIP-201 to solve this issue with policies
> being easily circumvented, but it didn't really make any progress. I've
> looked at it again in some detail more recently and I think something might
> be possible following the work to make all ZK writes happen on the
> controller.
>
> Of course, this is just my take on it.
>
> Kind regards,
>
> Tom
>
> On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com> wrote:
>
> > Hi.
> >
> > For the sake of expediting the discussion, I have created a prototype PR:
> > https://github.com/apache/kafka/pull/8499. Eventually, (if and) when the
> > KIP is accepted, I'll modify this to add the full implementation and
> tests
> > etc. in there.
> >
> > Would appreciate if a Kafka committer could share their thoughts, so
> that I
> > can more confidently start the voting thread.
> >
> > Thanks.
> >
> > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com> wrote:
> >
> > > Thanks for your comments Alex.
> > >
> > > The KIP proposes using two configurations max.partitions and
> > > max.broker.partitions. It does not enforce their use. The default
> values
> > > are pretty large (INT MAX), therefore should be non-intrusive.
> > >
> > > In multi-tenant environments and in partition assignment and
> rebalancing,
> > > the admin could (a) use the default values which would yield similar
> > > behavior to now, (b) set very high values that they know is sufficient,
> > (c)
> > > dynamically re-adjust the values should the business requirements
> change.
> > > Note that the two configurations are cluster-wide, so they can be
> updated
> > > without restarting the brokers.
> > >
> > > The quota system in Kafka seems to be geared towards limiting traffic
> for
> > > specific clients or users, or in the case of replication, to leaders
> and
> > > followers. The quota configuration itself is very similar to the one
> > > introduced in this KIP i.e. just a few configuration options to specify
> > the
> > > quota. The main difference is that the quota system is far more
> > > heavy-weight because it needs to be applied to traffic that is flowing
> > > in/out constantly. Whereas in this KIP, we want to limit number of
> > > partition replicas, which gets modified rarely by comparison in a
> typical
> > > cluster.
> > >
> > > Hope this addresses your comments.
> > >
> > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > > alexandre.dupriez@gmail.com> wrote:
> > >
> > >> Hi Gokul,
> > >>
> > >> Thanks for the KIP.
> > >>
> > >> From what I understand, the objective of the new configuration is to
> > >> protect a cluster from an overload driven by an excessive number of
> > >> partitions independently from the load handled on the partitions
> > >> themselves. As such, the approach uncouples the data-path load from
> > >> the number of unit of distributions of throughput and intends to avoid
> > >> the degradation of performance exhibited in the test results provided
> > >> with the KIP by setting an upper-bound on that number.
> > >>
> > >> Couple of comments:
> > >>
> > >> 900. Multi-tenancy - one concern I would have with a cluster and
> > >> broker-level configuration is that it is possible for a user to
> > >> consume a large proportions of the allocatable partitions within the
> > >> configured limit, leaving other users with not enough partitions to
> > >> satisfy their requirements.
> > >>
> > >> 901. Quotas - an approach in Apache Kafka to set-up an upper-bound on
> > >> resource consumptions is via client/user quotas. Could this framework
> > >> be leveraged to add this limit?
> > >>
> > >> 902. Partition assignment - one potential problem with the new
> > >> repartitioning scheme is that if a subset of brokers have reached
> > >> their number of assignable partitions, yet their data path is
> > >> under-loaded, new topics and/or partitions will be assigned
> > >> exclusively to other brokers, which could increase the likelihood of
> > >> data-path load imbalance. Fundamentally, the isolation of the
> > >> constraint on the number of partitions from the data-path throughput
> > >> can have conflicting requirements.
> > >>
> > >> 903. Rebalancing - as a corollary to 902, external tools used to
> > >> balance ingress throughput may adopt an incremental approach in
> > >> partition re-assignment to redistribute load, and could hit the limit
> > >> on the number of partitions on a broker when a (too) conservative
> > >> limit is used, thereby over-constraining the objective function and
> > >> reducing the migration path.
> > >>
> > >> Thanks,
> > >> Alexandre
> > >>
> > >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> > >> <go...@gmail.com> a écrit :
> > >> >
> > >> > Hi. Requesting you to take a look at this KIP and provide feedback.
> > >> >
> > >> > Thanks. Regards.
> > >> >
> > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > >> > gokul2411s@gmail.com> wrote:
> > >> >
> > >> > > Hi.
> > >> > >
> > >> > > I have opened KIP-578, intended to provide a mechanism to limit
> the
> > >> number
> > >> > > of partitions in a Kafka cluster. Kindly provide feedback on the
> KIP
> > >> which
> > >> > > you can find at
> > >> > >
> > >> > >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > >> > >
> > >> > > I want to specially thank Stanislav Kozlovski who helped in
> > >> formulating
> > >> > > some aspects of the KIP.
> > >> > >
> > >> > > Many thanks,
> > >> > >
> > >> > > Gokul.
> > >> > >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Tom Bentley <tb...@redhat.com>.
Hi Gokul,

Leaving aside the question of how Kafka scales, I think the proposed
solution, limiting the number of partitions in a cluster or per-broker, is
a policy which ought to be addressable via the pluggable policies (e.g.
create.topic.policy.class.name). Unfortunately although there's a policy
for topic creation, it's currently not possible to enforce a policy on
partition increase. It would be more flexible to be able to enforce this kind
of thing via a pluggable policy, and it would also avoid the situation
where different people each want to have a config which addresses some
specific use case or problem that they're experiencing.
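
For what it's worth, here is a minimal sketch of such a policy. It assumes a
hypothetical max.partitions.per.topic key that reaches the policy through its
configure() map; the key name is invented for illustration, and as noted this
hook only fires on topic creation, not on later partition increases.

import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

// Sketch of a partition-limiting policy that could be wired in via
// create.topic.policy.class.name. The "max.partitions.per.topic" key is
// hypothetical and only illustrates how a limit might reach the policy.
public class PartitionLimitPolicy implements CreateTopicPolicy {

    private int maxPartitionsPerTopic = Integer.MAX_VALUE;

    @Override
    public void configure(Map<String, ?> configs) {
        Object limit = configs.get("max.partitions.per.topic");
        if (limit != null) {
            maxPartitionsPerTopic = Integer.parseInt(limit.toString());
        }
    }

    @Override
    public void validate(RequestMetadata request) throws PolicyViolationException {
        // numPartitions() may be null when an explicit replica assignment is given.
        Integer numPartitions = request.numPartitions();
        if (numPartitions != null && numPartitions > maxPartitionsPerTopic) {
            throw new PolicyViolationException("Topic " + request.topic() + " requests "
                    + numPartitions + " partitions, above the allowed maximum of "
                    + maxPartitionsPerTopic);
        }
    }

    @Override
    public void close() {}
}

Such a class would be enabled by setting create.topic.policy.class.name to its
fully qualified name in the broker properties.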

Quite a while ago I proposed KIP-201 to solve this issue with policies
being easily circumvented, but it didn't really make any progress. I've
looked at it again in some detail more recently and I think something might
be possible following the work to make all ZK writes happen on the
controller.

Of course, this is just my take on it.

Kind regards,

Tom

On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <
gokul2411s@gmail.com> wrote:

> Hi.
>
> For the sake of expediting the discussion, I have created a prototype PR:
> https://github.com/apache/kafka/pull/8499. Eventually, (if and) when the
> KIP is accepted, I'll modify this to add the full implementation and tests
> etc. in there.
>
> Would appreciate if a Kafka committer could share their thoughts, so that I
> can more confidently start the voting thread.
>
> Thanks.
>
> On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com> wrote:
>
> > Thanks for your comments Alex.
> >
> > The KIP proposes using two configurations max.partitions and
> > max.broker.partitions. It does not enforce their use. The default values
> > are pretty large (INT MAX), therefore should be non-intrusive.
> >
> > In multi-tenant environments and in partition assignment and rebalancing,
> > the admin could (a) use the default values which would yield similar
> > behavior to now, (b) set very high values that they know is sufficient,
> (c)
> > dynamically re-adjust the values should the business requirements change.
> > Note that the two configurations are cluster-wide, so they can be updated
> > without restarting the brokers.
> >
> > The quota system in Kafka seems to be geared towards limiting traffic for
> > specific clients or users, or in the case of replication, to leaders and
> > followers. The quota configuration itself is very similar to the one
> > introduced in this KIP i.e. just a few configuration options to specify
> the
> > quota. The main difference is that the quota system is far more
> > heavy-weight because it needs to be applied to traffic that is flowing
> > in/out constantly. Whereas in this KIP, we want to limit the number of
> > partition replicas, which gets modified rarely by comparison in a typical
> > cluster.
> >
> > Hope this addresses your comments.
> >
> > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> > alexandre.dupriez@gmail.com> wrote:
> >
> >> Hi Gokul,
> >>
> >> Thanks for the KIP.
> >>
> >> From what I understand, the objective of the new configuration is to
> >> protect a cluster from an overload driven by an excessive number of
> >> partitions independently from the load handled on the partitions
> >> themselves. As such, the approach uncouples the data-path load from
> >> the number of units of distribution of throughput and intends to avoid
> >> the degradation of performance exhibited in the test results provided
> >> with the KIP by setting an upper-bound on that number.
> >>
> >> Couple of comments:
> >>
> >> 900. Multi-tenancy - one concern I would have with a cluster and
> >> broker-level configuration is that it is possible for a user to
> >> consume a large proportion of the allocatable partitions within the
> >> configured limit, leaving other users with not enough partitions to
> >> satisfy their requirements.
> >>
> >> 901. Quotas - an approach in Apache Kafka to set-up an upper-bound on
> >> resource consumptions is via client/user quotas. Could this framework
> >> be leveraged to add this limit?
> >>
> >> 902. Partition assignment - one potential problem with the new
> >> repartitioning scheme is that if a subset of brokers have reached
> >> their number of assignable partitions, yet their data path is
> >> under-loaded, new topics and/or partitions will be assigned
> >> exclusively to other brokers, which could increase the likelihood of
> >> data-path load imbalance. Fundamentally, the isolation of the
> >> constraint on the number of partitions from the data-path throughput
> >> can have conflicting requirements.
> >>
> >> 903. Rebalancing - as a corollary to 902, external tools used to
> >> balance ingress throughput may adopt an incremental approach in
> >> partition re-assignment to redistribute load, and could hit the limit
> >> on the number of partitions on a broker when a (too) conservative
> >> limit is used, thereby over-constraining the objective function and
> >> reducing the migration path.
> >>
> >> Thanks,
> >> Alexandre
> >>
> >> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> >> <go...@gmail.com> a écrit :
> >> >
> >> > Hi. Requesting you to take a look at this KIP and provide feedback.
> >> >
> >> > Thanks. Regards.
> >> >
> >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> >> > gokul2411s@gmail.com> wrote:
> >> >
> >> > > Hi.
> >> > >
> >> > > I have opened KIP-578, intended to provide a mechanism to limit the
> >> number
> >> > > of partitions in a Kafka cluster. Kindly provide feedback on the KIP
> >> which
> >> > > you can find at
> >> > >
> >> > >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> >> > >
> >> > > I want to specially thank Stanislav Kozlovski who helped in
> >> formulating
> >> > > some aspects of the KIP.
> >> > >
> >> > > Many thanks,
> >> > >
> >> > > Gokul.
> >> > >
> >>
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Hi.

For the sake of expediting the discussion, I have created a prototype PR:
https://github.com/apache/kafka/pull/8499. Eventually, (if and) when the
KIP is accepted, I'll modify this to add the full implementation and tests
etc. in there.

Would appreciate if a Kafka committer could share their thoughts, so that I
can more confidently start the voting thread.

Thanks.

On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <
gokul2411s@gmail.com> wrote:

> Thanks for your comments Alex.
>
> The KIP proposes using two configurations max.partitions and
> max.broker.partitions. It does not enforce their use. The default values
> are pretty large (INT MAX), therefore should be non-intrusive.
>
> In multi-tenant environments and in partition assignment and rebalancing,
> the admin could (a) use the default values which would yield similar
> behavior to now, (b) set very high values that they know is sufficient, (c)
> dynamically re-adjust the values should the business requirements change.
> Note that the two configurations are cluster-wide, so they can be updated
> without restarting the brokers.
>
> The quota system in Kafka seems to be geared towards limiting traffic for
> specific clients or users, or in the case of replication, to leaders and
> followers. The quota configuration itself is very similar to the one
> introduced in this KIP i.e. just a few configuration options to specify the
> quota. The main difference is that the quota system is far more
> heavy-weight because it needs to be applied to traffic that is flowing
> in/out constantly. Whereas in this KIP, we want to limit the number of
> partition replicas, which gets modified rarely by comparison in a typical
> cluster.
>
> Hope this addresses your comments.
>
> On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
> alexandre.dupriez@gmail.com> wrote:
>
>> Hi Gokul,
>>
>> Thanks for the KIP.
>>
>> From what I understand, the objective of the new configuration is to
>> protect a cluster from an overload driven by an excessive number of
>> partitions independently from the load handled on the partitions
>> themselves. As such, the approach uncouples the data-path load from
>> the number of units of distribution of throughput and intends to avoid
>> the degradation of performance exhibited in the test results provided
>> with the KIP by setting an upper-bound on that number.
>>
>> Couple of comments:
>>
>> 900. Multi-tenancy - one concern I would have with a cluster and
>> broker-level configuration is that it is possible for a user to
>> consume a large proportion of the allocatable partitions within the
>> configured limit, leaving other users with not enough partitions to
>> satisfy their requirements.
>>
>> 901. Quotas - an approach in Apache Kafka to set-up an upper-bound on
>> resource consumptions is via client/user quotas. Could this framework
>> be leveraged to add this limit?
>>
>> 902. Partition assignment - one potential problem with the new
>> repartitioning scheme is that if a subset of brokers have reached
>> their number of assignable partitions, yet their data path is
>> under-loaded, new topics and/or partitions will be assigned
>> exclusively to other brokers, which could increase the likelihood of
>> data-path load imbalance. Fundamentally, the isolation of the
>> constraint on the number of partitions from the data-path throughput
>> can have conflicting requirements.
>>
>> 903. Rebalancing - as a corollary to 902, external tools used to
>> balance ingress throughput may adopt an incremental approach in
>> partition re-assignment to redistribute load, and could hit the limit
>> on the number of partitions on a broker when a (too) conservative
>> limit is used, thereby over-constraining the objective function and
>> reducing the migration path.
>>
>> Thanks,
>> Alexandre
>>
>> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
>> <go...@gmail.com> a écrit :
>> >
>> > Hi. Requesting you to take a look at this KIP and provide feedback.
>> >
>> > Thanks. Regards.
>> >
>> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
>> > gokul2411s@gmail.com> wrote:
>> >
>> > > Hi.
>> > >
>> > > I have opened KIP-578, intended to provide a mechanism to limit the
>> number
>> > > of partitions in a Kafka cluster. Kindly provide feedback on the KIP
>> which
>> > > you can find at
>> > >
>> > >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
>> > >
>> > > I want to specially thank Stanislav Kozlovski who helped in
>> formulating
>> > > some aspects of the KIP.
>> > >
>> > > Many thanks,
>> > >
>> > > Gokul.
>> > >
>>
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Thanks for your comments Alex.

The KIP proposes two configurations, max.partitions and
max.broker.partitions. It does not enforce their use: the default values
are very large (INT MAX) and should therefore be non-intrusive.

In multi-tenant environments and in partition assignment and rebalancing,
the admin could (a) use the default values which would yield similar
behavior to now, (b) set very high values that they know is sufficient, (c)
dynamically re-adjust the values should the business requirements change.
Note that the two configurations are cluster-wide, so they can be updated
without restarting the brokers.
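
As a rough illustration of such a dynamic update through the Admin client
(assuming the proposed configs land as dynamic cluster-wide broker configs;
the bootstrap server and the value 4000 are made up):

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetPartitionLimits {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // An empty broker id targets the cluster-wide default (the equivalent
            // of --entity-type brokers --entity-default in kafka-configs.sh).
            ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp setLimit = new AlterConfigOp(
                new ConfigEntry("max.broker.partitions", "4000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                Collections.singletonMap(clusterDefault, Collections.singletonList(setLimit));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}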

The quota system in Kafka seems to be geared towards limiting traffic for
specific clients or users, or in the case of replication, to leaders and
followers. The quota configuration itself is very similar to the one
introduced in this KIP, i.e. just a few configuration options to specify the
quota. The main difference is that the quota system is far more
heavyweight because it needs to be applied to traffic that is flowing
in and out constantly, whereas in this KIP we want to limit the number of
partition replicas, which by comparison changes rarely in a typical
cluster.

Hope this addresses your comments.

On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <
alexandre.dupriez@gmail.com> wrote:

> Hi Gokul,
>
> Thanks for the KIP.
>
> From what I understand, the objective of the new configuration is to
> protect a cluster from an overload driven by an excessive number of
> partitions independently from the load handled on the partitions
> themselves. As such, the approach uncouples the data-path load from
> the number of units of distribution of throughput and intends to avoid
> the degradation of performance exhibited in the test results provided
> with the KIP by setting an upper-bound on that number.
>
> Couple of comments:
>
> 900. Multi-tenancy - one concern I would have with a cluster and
> broker-level configuration is that it is possible for a user to
> consume a large proportion of the allocatable partitions within the
> configured limit, leaving other users with not enough partitions to
> satisfy their requirements.
>
> 901. Quotas - an approach in Apache Kafka to set-up an upper-bound on
> resource consumptions is via client/user quotas. Could this framework
> be leveraged to add this limit?
>
> 902. Partition assignment - one potential problem with the new
> repartitioning scheme is that if a subset of brokers have reached
> their number of assignable partitions, yet their data path is
> under-loaded, new topics and/or partitions will be assigned
> exclusively to other brokers, which could increase the likelihood of
> data-path load imbalance. Fundamentally, the isolation of the
> constraint on the number of partitions from the data-path throughput
> can have conflicting requirements.
>
> 903. Rebalancing - as a corollary to 902, external tools used to
> balance ingress throughput may adopt an incremental approach in
> partition re-assignment to redistribute load, and could hit the limit
> on the number of partitions on a broker when a (too) conservative
> limit is used, thereby over-constraining the objective function and
> reducing the migration path.
>
> Thanks,
> Alexandre
>
> Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
> <go...@gmail.com> a écrit :
> >
> > Hi. Requesting you to take a look at this KIP and provide feedback.
> >
> > Thanks. Regards.
> >
> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com> wrote:
> >
> > > Hi.
> > >
> > > I have opened KIP-578, intended to provide a mechanism to limit the
> number
> > > of partitions in a Kafka cluster. Kindly provide feedback on the KIP
> which
> > > you can find at
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > >
> > > I want to specially thank Stanislav Kozlovski who helped in formulating
> > > some aspects of the KIP.
> > >
> > > Many thanks,
> > >
> > > Gokul.
> > >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Alexandre Dupriez <al...@gmail.com>.
Hi Gokul,

Thanks for the KIP.

From what I understand, the objective of the new configuration is to
protect a cluster from an overload driven by an excessive number of
partitions, independently of the load handled on the partitions
themselves. As such, the approach uncouples the data-path load from
the number of units of distribution of throughput and intends to avoid
the degradation of performance exhibited in the test results provided
with the KIP by setting an upper bound on that number.

Couple of comments:

900. Multi-tenancy - one concern I would have with a cluster and
broker-level configuration is that it is possible for a user to
consume a large proportion of the allocatable partitions within the
configured limit, leaving other users with not enough partitions to
satisfy their requirements.

901. Quotas - an approach in Apache Kafka to set up an upper bound on
resource consumption is via client/user quotas. Could this framework
be leveraged to add this limit?

902. Partition assignment - one potential problem with the new
repartitioning scheme is that if a subset of brokers have reached
their limit of assignable partitions, yet their data path is
under-loaded, new topics and/or partitions will be assigned
exclusively to other brokers, which could increase the likelihood of
data-path load imbalance. Fundamentally, the isolation of the
constraint on the number of partitions from the data-path throughput
can have conflicting requirements.

903. Rebalancing - as a corollary to 902, external tools used to
balance ingress throughput may adopt an incremental approach in
partition re-assignment to redistribute load, and could hit the limit
on the number of partitions on a broker when a (too) conservative
limit is used, thereby over-constraining the objective function and
reducing the migration path.

Thanks,
Alexandre

Le jeu. 9 avr. 2020 à 00:19, Gokul Ramanan Subramanian
<go...@gmail.com> a écrit :
>
> Hi. Requesting you to take a look at this KIP and provide feedback.
>
> Thanks. Regards.
>
> On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com> wrote:
>
> > Hi.
> >
> > I have opened KIP-578, intended to provide a mechanism to limit the number
> > of partitions in a Kafka cluster. Kindly provide feedback on the KIP which
> > you can find at
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> >
> > I want to specially thank Stanislav Kozlovski who helped in formulating
> > some aspects of the KIP.
> >
> > Many thanks,
> >
> > Gokul.
> >

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Hi. Requesting you to take a look at this KIP and provide feedback.

Thanks. Regards.

On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <
gokul2411s@gmail.com> wrote:

> Hi.
>
> I have opened KIP-578, intended to provide a mechanism to limit the number
> of partitions in a Kafka cluster. Kindly provide feedback on the KIP which
> you can find at
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
>
> I want to specially thank Stanislav Kozlovski who helped in formulating
> some aspects of the KIP.
>
> Many thanks,
>
> Gokul.
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Ismael,

I have updated the KIP to reuse the PolicyViolationException. It is not
clear to me that there is value in either having a new exception type or in
renaming the existing exception. For error codes, we'll just use
POLICY_VIOLATION. This should also work well for existing clients, since
they have already seen these exceptions and error codes from policies.
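
To make the client-side behaviour concrete, here is a small sketch of how an
existing Admin client caller would see the rejection; the topic name and
partition count are made up:

import java.util.Collections;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.PolicyViolationException;

public class CreateTopicExample {
    static void createOrReport(Admin admin) throws InterruptedException {
        try {
            admin.createTopics(Collections.singleton(new NewTopic("orders", 100, (short) 3)))
                 .all().get();
        } catch (ExecutionException e) {
            if (e.getCause() instanceof PolicyViolationException) {
                // The cluster/broker partition limit (or an existing topic policy)
                // rejected the request with the POLICY_VIOLATION error code.
                System.err.println("Partition limit reached: " + e.getCause().getMessage());
            } else {
                throw new RuntimeException(e.getCause());
            }
        }
    }
}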

As you mention, the actual performance figures are probably better with
trunk than for the version I used in the tests mentioned in the KIP. And as
Colin mentions in another thread, the goal is to have Kafka support many
more partitions in the future. I have tagged my performance experiments
with a certain Kafka version so as not to misrepresent Kafka's performance
overall. Also, I think that the number of partitions is going to be a
moving limit as faster and better Kafka versions come up. The experiments
in the KIP were aimed at convincing the reader that we need some way to
limit the number of partitions, rather than to specify a specific limit.

Boyang,

I have updated the KIP to reflect that until KIP-590, the limit won't apply
to auto-created topics via the Metadata API. After KIP-590, the Metadata API
can also throw the PolicyViolationException (once it starts relying on
CreateTopics).

Do you think it makes sense to bump up the API versions if they are going
to throw an exception type they could already throw in the past? Or do we bump
up the API version only when there is a different reason why the API could
throw a given exception?

Thanks for your comments and feedback.

On Mon, Jun 29, 2020 at 3:38 AM Boyang Chen <re...@gmail.com>
wrote:

> Hey Gokul, thanks for the reply.
>
> It is true that the Metadata API will call CreateTopic under the cover. The
> key guarantee we need to provide is to properly propagate the exact error
> code to the original client. So either we are going to introduce a new
> error code or reuse an existing one, the Metadata RPC should be able to
> handle it. Could we mention this guarantee in the KIP, once the error code
> discussion gets converged?
>
> On Wed, Jun 24, 2020 at 6:35 AM Ismael Juma <is...@juma.me.uk> wrote:
>
> > Error code names are safe to rename at the moment as they are in an
> > internal package. The exception class is in a public package though. I
> was
> > thinking that PolicyViolationException could be a subclass of the more
> > generic exception. This approach would mean that older clients would
> > understand the error code, etc. I didn't think through all the details,
> but
> > worth considering.
> >
> > With regards to the version, OK. I expect trunk to do better than what
> was
> > described there on both replication overhead and topic creation/deletion
> > performance.
> >
> > Ismael
> >
> > On Tue, Jun 23, 2020 at 11:10 PM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com> wrote:
> >
> > > Ismael,
> > >
> > > I am open to using any error code and am not attached to one TBH. Colin
> > had
> > > suggested creating a new resource code called RESOURCE_LIMIT_EXCEEDED.
> I
> > am
> > > happy to reuse the error code corresponding to PolicyViolation. Is it
> > safe
> > > to rename errors and corresponding exception names? If so, I'd prefer
> > > reusing the existing code as well.
> > >
> > > For performance testing results that I added to this KIP, I used Kafka
> > > 2.3.1, which was very close to trunk at the time I tested. We have seen
> > > similar issues with Kafka 2.4.1. Please note that for the tests done in
> > the
> > > KIP, especially the Produce performance tests, we probably could have
> > > gotten higher performance, but the focus was on comparing performance
> > > across a different number of partitions, for a given configuration,
> > rather
> > > than trying to find out the best performance possible for the "right"
> > > number of partitions.
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Jun 24, 2020 at 4:07 AM Ismael Juma <is...@juma.me.uk> wrote:
> > >
> > > > Thanks for the KIP. A couple of questions:
> > > >
> > > > 1. Have we considered reusing the existing PolicyViolation error code
> > and
> > > > renaming it? This would make it simpler to handle on the client.
> > > >
> > > > 2. What version was used for the perf section? I think master should
> do
> > > > better than what's described there.
> > > >
> > > > Ismael
> > > >
> > > > On Wed, Apr 1, 2020, 8:28 AM Gokul Ramanan Subramanian <
> > > > gokul2411s@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > I have opened KIP-578, intended to provide a mechanism to limit the
> > > > number
> > > > > of partitions in a Kafka cluster. Kindly provide feedback on the
> KIP
> > > > which
> > > > > you can find at
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > >
> > > > > I want to specially thank Stanislav Kozlovski who helped in
> > formulating
> > > > > some aspects of the KIP.
> > > > >
> > > > > Many thanks,
> > > > >
> > > > > Gokul.
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Boyang Chen <re...@gmail.com>.
Hey Gokul, thanks for the reply.

It is true that the Metadata API will call CreateTopic under the covers. The
key guarantee we need to provide is to properly propagate the exact error
code to the original client. So whether we introduce a new
error code or reuse an existing one, the Metadata RPC should be able to
handle it. Could we mention this guarantee in the KIP, once the error code
discussion converges?

On Wed, Jun 24, 2020 at 6:35 AM Ismael Juma <is...@juma.me.uk> wrote:

> Error code names are safe to rename at the moment as they are in an
> internal package. The exception class is in a public package though. I was
> thinking that PolicyViolationException could be a subclass of the more
> generic exception. This approach would mean that older clients would
> understand the error code, etc. I didn't think through all the details, but
> worth considering.
>
> With regards to the version, OK. I expect trunk to do better than what was
> described there on both replication overhead and topic creation/deletion
> performance.
>
> Ismael
>
> On Tue, Jun 23, 2020 at 11:10 PM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com> wrote:
>
> > Ismael,
> >
> > I am open to using any error code and am not attached to one TBH. Colin
> had
> > suggested creating a new resource code called RESOURCE_LIMIT_EXCEEDED. I
> am
> > happy to reuse the error code corresponding to PolicyViolation. Is it
> safe
> > to rename errors and corresponding exception names? If so, I'd prefer
> > reusing the existing code as well.
> >
> > For performance testing results that I added to this KIP, I used Kafka
> > 2.3.1, which was very close to trunk at the time I tested. We have seen
> > similar issues with Kafka 2.4.1. Please note that for the tests done in
> the
> > KIP, especially the Produce performance tests, we probably could have
> > gotten higher performance, but the focus was on comparing performance
> > across a different number of partitions, for a given configuration,
> rather
> > than trying to find out the best performance possible for the "right"
> > number of partitions.
> >
> > Thanks.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Jun 24, 2020 at 4:07 AM Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Thanks for the KIP. A couple of questions:
> > >
> > > 1. Have we considered reusing the existing PolicyViolation error code
> and
> > > renaming it? This would make it simpler to handle on the client.
> > >
> > > 2. What version was used for the perf section? I think master should do
> > > better than what's described there.
> > >
> > > Ismael
> > >
> > > On Wed, Apr 1, 2020, 8:28 AM Gokul Ramanan Subramanian <
> > > gokul2411s@gmail.com>
> > > wrote:
> > >
> > > > Hi.
> > > >
> > > > I have opened KIP-578, intended to provide a mechanism to limit the
> > > number
> > > > of partitions in a Kafka cluster. Kindly provide feedback on the KIP
> > > which
> > > > you can find at
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > >
> > > > I want to specially thank Stanislav Kozlovski who helped in
> formulating
> > > > some aspects of the KIP.
> > > >
> > > > Many thanks,
> > > >
> > > > Gokul.
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Ismael Juma <is...@juma.me.uk>.
Error code names are safe to rename at the moment as they are in an
internal package. The exception class is in a public package though. I was
thinking that PolicyViolationException could be a subclass of the more
generic exception. This approach would mean that older clients would
understand the error code, etc. I didn't think through all the details, but
worth considering.
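
Purely as a sketch of that idea (the class name is a placeholder, not a
proposal):

import org.apache.kafka.common.errors.ApiException;

// Hypothetical more generic parent exception; the name is a placeholder.
public class ResourceLimitReachedException extends ApiException {
    public ResourceLimitReachedException(String message) {
        super(message);
    }
}

// PolicyViolationException would then extend ResourceLimitReachedException instead
// of ApiException directly, so older clients keep mapping the shared error code to
// an exception type they already understand.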

With regards to the version, OK. I expect trunk to do better than what was
described there on both replication overhead and topic creation/deletion
performance.

Ismael

On Tue, Jun 23, 2020 at 11:10 PM Gokul Ramanan Subramanian <
gokul2411s@gmail.com> wrote:

> Ismael,
>
> I am open to using any error code and am not attached to one TBH. Colin had
> suggested creating a new resource code called RESOURCE_LIMIT_EXCEEDED. I am
> happy to reuse the error code corresponding to PolicyViolation. Is it safe
> to rename errors and corresponding exception names? If so, I'd prefer
> reusing the existing code as well.
>
> For performance testing results that I added to this KIP, I used Kafka
> 2.3.1, which was very close to trunk at the time I tested. We have seen
> similar issues with Kafka 2.4.1. Please note that for the tests done in the
> KIP, especially the Produce performance tests, we probably could have
> gotten higher performance, but the focus was on comparing performance
> across a different number of partitions, for a given configuration, rather
> than trying to find out the best performance possible for the "right"
> number of partitions.
>
> Thanks.
>
>
>
>
>
>
>
>
>
> On Wed, Jun 24, 2020 at 4:07 AM Ismael Juma <is...@juma.me.uk> wrote:
>
> > Thanks for the KIP. A couple of questions:
> >
> > 1. Have we considered reusing the existing PolicyViolation error code and
> > renaming it? This would make it simpler to handle on the client.
> >
> > 2. What version was used for the perf section? I think master should do
> > better than what's described there.
> >
> > Ismael
> >
> > On Wed, Apr 1, 2020, 8:28 AM Gokul Ramanan Subramanian <
> > gokul2411s@gmail.com>
> > wrote:
> >
> > > Hi.
> > >
> > > I have opened KIP-578, intended to provide a mechanism to limit the
> > number
> > > of partitions in a Kafka cluster. Kindly provide feedback on the KIP
> > which
> > > you can find at
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > >
> > > I want to specially thank Stanislav Kozlovski who helped in formulating
> > > some aspects of the KIP.
> > >
> > > Many thanks,
> > >
> > > Gokul.
> > >
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Gokul Ramanan Subramanian <go...@gmail.com>.
Ismael,

I am open to using any error code and am not attached to one TBH. Colin had
suggested creating a new error code called RESOURCE_LIMIT_EXCEEDED. I am
happy to reuse the error code corresponding to PolicyViolation. Is it safe
to rename errors and corresponding exception names? If so, I'd prefer
reusing the existing code as well.

For performance testing results that I added to this KIP, I used Kafka
2.3.1, which was very close to trunk at the time I tested. We have seen
similar issues with Kafka 2.4.1. Please note that for the tests done in the
KIP, especially the Produce performance tests, we probably could have
gotten higher performance, but the focus was on comparing performance
across different numbers of partitions, for a given configuration, rather
than trying to find out the best performance possible for the "right"
number of partitions.

Thanks.









On Wed, Jun 24, 2020 at 4:07 AM Ismael Juma <is...@juma.me.uk> wrote:

> Thanks for the KIP. A couple of questions:
>
> 1. Have we considered reusing the existing PolicyViolation error code and
> renaming it? This would make it simpler to handle on the client.
>
> 2. What version was used for the perf section? I think master should do
> better than what's described there.
>
> Ismael
>
> On Wed, Apr 1, 2020, 8:28 AM Gokul Ramanan Subramanian <
> gokul2411s@gmail.com>
> wrote:
>
> > Hi.
> >
> > I have opened KIP-578, intended to provide a mechanism to limit the
> number
> > of partitions in a Kafka cluster. Kindly provide feedback on the KIP
> which
> > you can find at
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> >
> > I want to specially thank Stanislav Kozlovski who helped in formulating
> > some aspects of the KIP.
> >
> > Many thanks,
> >
> > Gokul.
> >
>

Re: [DISCUSS] KIP-578: Add configuration to limit number of partitions

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks for the KIP. A couple of questions:

1. Have we considered reusing the existing PolicyViolation error code and
renaming it? This would make it simpler to handle on the client.

2. What version was used for the perf section? I think master should do
better than what's described there.

Ismael

On Wed, Apr 1, 2020, 8:28 AM Gokul Ramanan Subramanian <go...@gmail.com>
wrote:

> Hi.
>
> I have opened KIP-578, intended to provide a mechanism to limit the number
> of partitions in a Kafka cluster. Kindly provide feedback on the KIP which
> you can find at
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
>
> I want to specially thank Stanislav Kozlovski who helped in formulating
> some aspects of the KIP.
>
> Many thanks,
>
> Gokul.
>