Posted to dev@kafka.apache.org by Artem Livshits <al...@confluent.io.INVALID> on 2021/11/04 17:37:33 UTC

[DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Hello,

This is the discussion thread for
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
.

The proposal is a bug fix for
https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
client config change, therefore we have a KIP to discuss.

-Artem

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hello,

During implementation it turned out that the existing Partitioner.partition
method doesn't have enough arguments to accurately estimate the record size
in bytes (e.g. it doesn't receive the headers, cannot take compression into
account, etc.).  So I'm proposing to add a new Partitioner.partition method
that takes a callback that can be used to estimate the record size.
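
To illustrate the shape of the change (this is only a rough sketch of the
idea, not the final API; the exact signature and names are in the KIP, and
IntSupplier below is just a placeholder for the size-estimation callback):

    import java.io.Closeable;
    import java.util.function.IntSupplier;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.Configurable;

    public interface Partitioner extends Configurable, Closeable {
        // Existing method, unchanged.
        int partition(String topic, Object key, byte[] keyBytes,
                      Object value, byte[] valueBytes, Cluster cluster);

        // Sketch of the proposed overload: the callback lets the partitioner
        // ask the producer for the estimated record size in bytes (including
        // headers and the effect of compression) instead of computing it itself.
        default int partition(String topic, Object key, byte[] keyBytes,
                              Object value, byte[] valueBytes, Cluster cluster,
                              IntSupplier estimatedSizeInBytes) {
            return partition(topic, key, keyBytes, value, valueBytes, cluster);
        }
    }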

I've updated the KIP accordingly:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner

-Artem

On Mon, Nov 8, 2021 at 5:42 PM Artem Livshits <al...@confluent.io>
wrote:

> Hi Luke, Justine,
>
> Thank you for feedback and questions. I've added clarification to the KIP.
>
> > there will be some period of time the distribution is not even.
>
> That's correct.  There would be a small temporary imbalance, but over time
> the distribution should be uniform.
>
> > 1. This paragraph is a little confusing, because there's no "batch mode"
> or "non-batch mode", right?
>
> Updated the wording to not use "batch mode"
>
> > 2. In motivation, you mentioned 1 drawback of current
> UniformStickyPartitioner is
>
> The problem with the current implementation is that it switches once a new
> batch is created which may happen after the first record when linger.ms=0.
> The new implementation won't switch after the batch, so even if the first
> record got sent out in a batch, the second record would be produced to the
> same partition.  Once we have 5 batches in-flight, the new records will
> pile up in the accumulator.
>
> > I was curious about how the logic automatically switches here.
>
> Added some clarifications to the KIP.  Basically, because we can only have
> 5 in-flight batches, as soon as the first 5 are in-flight, the records
> start piling up in the accumulator.  If the rate is low, records get sent
> quickly (e.g. if we have latency 50ms, and produce less than 20 rec / sec,
> then each record will often get sent in its own batch, because a batch
> would often complete before a new record arrives).  If the rate is high,
> then the first few records get sent quickly, but then records will batch
> together until one of the in-flight batches completes, the higher the rate
> is (or the higher latency is), the larger the batches are.
>
> This is not a new logic, btw, this is how it works now, the new logic just
> helps to utilize this better by giving the partition an opportunity to hit
> 5 in-flight and start accumulating sooner.  KIP-782 will make this even
> better, so batches could also grow beyond 16KB if production rate is high.
>
> -Artem
>
>
> On Mon, Nov 8, 2021 at 11:56 AM Justine Olshan
> <jo...@confluent.io.invalid> wrote:
>
>> Hi Artem,
>> Thanks for working on improving the Sticky Partitioner!
>>
>> I had a few questions about this portion:
>>
>> *The batching will continue until either an in-flight batch completes or
>> we
>> hit the N bytes and move to the next partition.  This way it takes just 5
>> records to get to batching mode, not 5 x number of partition records, and
>> the batching mode will stay longer as we'll be batching while waiting for
>> a
>> request to be completed.  As the production rate accelerates, the logic
>> will automatically switch to use larger batches to sustain higher
>> throughput.*
>>
>> *If one of the brokers has higher latency the records for the partitions
>> hosted on that broker are going to form larger batches, but it's still
>> going to be the same *amount records* sent less frequently in larger
>> batches, the logic automatically adapts to that.*
>>
>> I was curious about how the logic automatically switches here. It seems
>> like we are just adding *partitioner.sticky.batch.size *which seems like a
>> static value. Can you go into more detail about this logic? Or clarify
>> something I may have missed.
>>
>> On Mon, Nov 8, 2021 at 1:34 AM Luke Chen <sh...@gmail.com> wrote:
>>
>> > Thanks Artem,
>> > It's much better now.
>> > I've got your idea. In KIP-480: Sticky Partitioner, we'll change
>> partition
>> > (call partitioner) when either 1 of below condition match
>> > 1. the batch is full
>> > 2. when linger.ms is up
>> > But, you are changing the definition, into a
>> > "partitioner.sticky.batch.size" size is reached.
>> >
>> > It'll fix the uneven distribution issue, because we did the sent out
>> size
>> > calculation in the producer side.
>> > But it might have another issue that when the producer rate is low,
>> there
>> > will be some period of time the distribution is not even. Ex:
>> > tp-1: 12KB
>> > tp-2: 0KB
>> > tp-3: 0KB
>> > tp-4: 0KB
>> > while the producer is still keeping sending records into tp-1 (because
>> we
>> > haven't reached the 16KB threshold)
>> > Maybe the user should set a good value to
>> "partitioner.sticky.batch.size"
>> > to fix this issue?
>> >
>> > Some comment to the KIP:
>> > 1. This paragraph is a little confusing, because there's no "batch
>> mode" or
>> > "non-batch mode", right?
>> >
>> > > The batching will continue until either an in-flight batch completes
>> or
>> > we hit the N bytes and move to the next partition.  This way it takes
>> just
>> > 5 records to get to batching mode, not 5 x number of partition records,
>> and
>> > the batching mode will stay longer as we'll be batching while waiting
>> for a
>> > request to be completed.
>> >
>> > Even with linger.ms=0, before the sender thread is ready, we're always
>> > batching (accumulating) records into batches. So I think the "batch
>> mode"
>> > description is confusing. And that's why I asked you if you have some
>> kind
>> > of "batch switch" here.
>> >
>> > 2. In motivation, you mentioned 1 drawback of current
>> > UniformStickyPartitioner is "the sticky partitioner doesn't create
>> batches
>> > as efficiently", because it sent out a batch with only 1 record (under
>> > linger.ms=0). But I can't tell how you fix this un-efficient issue in
>> the
>> > proposed solution. I still see we sent 1 record within 1 batch. Could
>> you
>> > explain more here?
>> >
>> > Thank you.
>> > Luke
>> >
>> > On Sat, Nov 6, 2021 at 6:41 AM Artem Livshits
>> > <al...@confluent.io.invalid> wrote:
>> >
>> > > Hi Luke,
>> > >
>> > > Thank you for your feedback.  I've updated the KIP with your
>> suggestions.
>> > >
>> > > 1. Updated with a better example.
>> > > 2. I removed the reference to ClassicDefaultPartitioner, it was
>> probably
>> > > confusing.
>> > > 3. The logic doesn't rely on checking batches, I've updated the
>> proposal
>> > to
>> > > make it more explicit.
>> > > 4. The primary issue (uneven distribution) is described in the linked
>> > jira,
>> > > copied an example from jira into the KIP as well.
>> > >
>> > > -Artem
>> > >
>> > >
>> > > On Thu, Nov 4, 2021 at 8:34 PM Luke Chen <sh...@gmail.com> wrote:
>> > >
>> > > > Hi Artem,
>> > > > Thanks for the KIP! And thanks for reminding me to complete KIP-782,
>> > > soon.
>> > > > :)
>> > > >
>> > > > Back to the KIP, I have some comments:
>> > > > 1. You proposed to have a new config:
>> "partitioner.sticky.batch.size",
>> > > but
>> > > > I can't see how we're going to use it to make the partitioner
>> better.
>> > > > Please explain more in KIP (with an example will be better as
>> > suggestion
>> > > > (4))
>> > > > 2. In the "Proposed change" section, you take an example to use
>> > > > "ClassicDefaultPartitioner", is that referring to the current
>> default
>> > > > sticky partitioner? I think it'd better you name your proposed
>> > partition
>> > > > with a different name for distinguish between the default one and
>> new
>> > > one.
>> > > > (Although after implementation, we are going to just use the same
>> name)
>> > > > 3. So, if my understanding is correct, you're going to have a
>> "batch"
>> > > > switch, and before the in-flight is full, it's disabled. Otherwise,
>> > we'll
>> > > > enable it. Is that right? Sorry, I don't see any advantage of having
>> > this
>> > > > batch switch. Could you explain more?
>> > > > 4. I think it should be more clear if you can have a clear real
>> example
>> > > in
>> > > > the motivation section, to describe what issue we faced using
>> current
>> > > > sticky partitioner. And in proposed changes section, using the same
>> > > > example, to describe more detail about how you fix this issue with
>> your
>> > > > way.
>> > > >
>> > > > Thank you.
>> > > > Luke
>> > > >
>> > > > On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
>> > > > <al...@confluent.io.invalid> wrote:
>> > > >
>> > > > > Hello,
>> > > > >
>> > > > > This is the discussion thread for
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> > > > > .
>> > > > >
>> > > > > The proposal is a bug fix for
>> > > > > https://issues.apache.org/jira/browse/KAFKA-10888, but it does
>> > > include a
>> > > > > client config change, therefore we have a KIP to discuss.
>> > > > >
>> > > > > -Artem
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Luke, Justine,

Thank you for the feedback and questions. I've added clarifications to the KIP.

> there will be some period of time the distribution is not even.

That's correct.  There would be a small temporary imbalance, but over time
the distribution should be uniform.

> 1. This paragraph is a little confusing, because there's no "batch mode"
or "non-batch mode", right?

Updated the wording to not use "batch mode".

> 2. In motivation, you mentioned 1 drawback of current
UniformStickyPartitioner is

The problem with the current implementation is that it switches partitions
once a new batch is created, which may happen after the first record when
linger.ms=0.  The new implementation won't switch right after a batch is
created, so even if the first record got sent out in its own batch, the
second record would still be produced to the same partition.  Once we have
5 batches in-flight, new records will pile up in the accumulator.
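
To make the scenario concrete, here is a minimal sketch of the producer
settings involved (standard config names; the bootstrap address is just a
placeholder):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // linger.ms=0: a batch is ready to send as soon as it has one record.
    props.put(ProducerConfig.LINGER_MS_CONFIG, "0");
    // Default is already 5: at most 5 produce requests (and therefore at
    // most 5 batches of a given partition) can be in flight to a broker;
    // after that, new records pile up in the accumulator and batch together.
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5");
    // props is then passed to new KafkaProducer<>(props) together with the
    // key/value serializer settings.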

> I was curious about how the logic automatically switches here.

Added some clarifications to the KIP.  Basically, because we can only have
5 in-flight batches, as soon as the first 5 are in-flight, records start
piling up in the accumulator.  If the rate is low, records get sent quickly
(e.g. if we have 50ms latency and produce less than 20 rec / sec, then each
record will often get sent in its own batch, because a batch would often
complete before a new record arrives).  If the rate is high, then the first
few records get sent quickly, but then records batch together until one of
the in-flight batches completes; the higher the rate (or the latency), the
larger the batches.
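
Rough arithmetic behind that example (the 50 ms latency and 20 rec/sec come
from the numbers above; the high rate is just an arbitrary contrast):

    // Records arriving during one request round trip roughly bound the batch size.
    double latencySec = 0.050;                   // 50 ms request latency
    double lowRate = 20.0;                       // records per second
    double highRate = 2000.0;                    // arbitrary high rate for contrast
    double lowPerTrip = lowRate * latencySec;    // = 1.0 -> ~1 record per batch
    double highPerTrip = highRate * latencySec;  // = 100 -> much larger batches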

This is not new logic, btw; this is how it works now.  The new logic just
helps to utilize this better by giving the partition an opportunity to hit
5 in-flight batches and start accumulating sooner.  KIP-782 will make this
even better, so batches could also grow beyond 16KB if the production rate
is high.

-Artem


On Mon, Nov 8, 2021 at 11:56 AM Justine Olshan <jo...@confluent.io.invalid>
wrote:

> Hi Artem,
> Thanks for working on improving the Sticky Partitioner!
>
> I had a few questions about this portion:
>
> *The batching will continue until either an in-flight batch completes or we
> hit the N bytes and move to the next partition.  This way it takes just 5
> records to get to batching mode, not 5 x number of partition records, and
> the batching mode will stay longer as we'll be batching while waiting for a
> request to be completed.  As the production rate accelerates, the logic
> will automatically switch to use larger batches to sustain higher
> throughput.*
>
> *If one of the brokers has higher latency the records for the partitions
> hosted on that broker are going to form larger batches, but it's still
> going to be the same *amount records* sent less frequently in larger
> batches, the logic automatically adapts to that.*
>
> I was curious about how the logic automatically switches here. It seems
> like we are just adding *partitioner.sticky.batch.size *which seems like a
> static value. Can you go into more detail about this logic? Or clarify
> something I may have missed.
>
> On Mon, Nov 8, 2021 at 1:34 AM Luke Chen <sh...@gmail.com> wrote:
>
> > Thanks Artem,
> > It's much better now.
> > I've got your idea. In KIP-480: Sticky Partitioner, we'll change
> partition
> > (call partitioner) when either 1 of below condition match
> > 1. the batch is full
> > 2. when linger.ms is up
> > But, you are changing the definition, into a
> > "partitioner.sticky.batch.size" size is reached.
> >
> > It'll fix the uneven distribution issue, because we did the sent out size
> > calculation in the producer side.
> > But it might have another issue that when the producer rate is low, there
> > will be some period of time the distribution is not even. Ex:
> > tp-1: 12KB
> > tp-2: 0KB
> > tp-3: 0KB
> > tp-4: 0KB
> > while the producer is still keeping sending records into tp-1 (because we
> > haven't reached the 16KB threshold)
> > Maybe the user should set a good value to "partitioner.sticky.batch.size"
> > to fix this issue?
> >
> > Some comment to the KIP:
> > 1. This paragraph is a little confusing, because there's no "batch mode"
> or
> > "non-batch mode", right?
> >
> > > The batching will continue until either an in-flight batch completes or
> > we hit the N bytes and move to the next partition.  This way it takes
> just
> > 5 records to get to batching mode, not 5 x number of partition records,
> and
> > the batching mode will stay longer as we'll be batching while waiting
> for a
> > request to be completed.
> >
> > Even with linger.ms=0, before the sender thread is ready, we're always
> > batching (accumulating) records into batches. So I think the "batch mode"
> > description is confusing. And that's why I asked you if you have some
> kind
> > of "batch switch" here.
> >
> > 2. In motivation, you mentioned 1 drawback of current
> > UniformStickyPartitioner is "the sticky partitioner doesn't create
> batches
> > as efficiently", because it sent out a batch with only 1 record (under
> > linger.ms=0). But I can't tell how you fix this un-efficient issue in
> the
> > proposed solution. I still see we sent 1 record within 1 batch. Could you
> > explain more here?
> >
> > Thank you.
> > Luke
> >
> > On Sat, Nov 6, 2021 at 6:41 AM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hi Luke,
> > >
> > > Thank you for your feedback.  I've updated the KIP with your
> suggestions.
> > >
> > > 1. Updated with a better example.
> > > 2. I removed the reference to ClassicDefaultPartitioner, it was
> probably
> > > confusing.
> > > 3. The logic doesn't rely on checking batches, I've updated the
> proposal
> > to
> > > make it more explicit.
> > > 4. The primary issue (uneven distribution) is described in the linked
> > jira,
> > > copied an example from jira into the KIP as well.
> > >
> > > -Artem
> > >
> > >
> > > On Thu, Nov 4, 2021 at 8:34 PM Luke Chen <sh...@gmail.com> wrote:
> > >
> > > > Hi Artem,
> > > > Thanks for the KIP! And thanks for reminding me to complete KIP-782,
> > > soon.
> > > > :)
> > > >
> > > > Back to the KIP, I have some comments:
> > > > 1. You proposed to have a new config:
> "partitioner.sticky.batch.size",
> > > but
> > > > I can't see how we're going to use it to make the partitioner better.
> > > > Please explain more in KIP (with an example will be better as
> > suggestion
> > > > (4))
> > > > 2. In the "Proposed change" section, you take an example to use
> > > > "ClassicDefaultPartitioner", is that referring to the current default
> > > > sticky partitioner? I think it'd better you name your proposed
> > partition
> > > > with a different name for distinguish between the default one and new
> > > one.
> > > > (Although after implementation, we are going to just use the same
> name)
> > > > 3. So, if my understanding is correct, you're going to have a "batch"
> > > > switch, and before the in-flight is full, it's disabled. Otherwise,
> > we'll
> > > > enable it. Is that right? Sorry, I don't see any advantage of having
> > this
> > > > batch switch. Could you explain more?
> > > > 4. I think it should be more clear if you can have a clear real
> example
> > > in
> > > > the motivation section, to describe what issue we faced using current
> > > > sticky partitioner. And in proposed changes section, using the same
> > > > example, to describe more detail about how you fix this issue with
> your
> > > > way.
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > > On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > This is the discussion thread for
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > .
> > > > >
> > > > > The proposal is a bug fix for
> > > > > https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > > include a
> > > > > client config change, therefore we have a KIP to discuss.
> > > > >
> > > > > -Artem
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Justine Olshan <jo...@confluent.io.INVALID>.
Hi Artem,
Thanks for working on improving the Sticky Partitioner!

I had a few questions about this portion:

*The batching will continue until either an in-flight batch completes or we
hit the N bytes and move to the next partition.  This way it takes just 5
records to get to batching mode, not 5 x number of partition records, and
the batching mode will stay longer as we'll be batching while waiting for a
request to be completed.  As the production rate accelerates, the logic
will automatically switch to use larger batches to sustain higher
throughput.*

*If one of the brokers has higher latency, the records for the partitions
hosted on that broker are going to form larger batches, but it's still
going to be the same *amount of records* sent less frequently in larger
batches; the logic automatically adapts to that.*

I was curious about how the logic automatically switches here. It seems
like we are just adding *partitioner.sticky.batch.size*, which looks like a
static value. Can you go into more detail about this logic? Or clarify
something I may have missed.
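
(For context, my reading of the KIP is that the threshold would be set
roughly like the sketch below; the 16384 default here is only my
assumption, mirroring batch.size, not something the KIP states:)

    import java.util.Properties;

    Properties props = new Properties();
    // Bytes produced to one partition before the partitioner switches to
    // the next one.  The value is my guess, mirroring batch.size.
    props.put("partitioner.sticky.batch.size", "16384");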

On Mon, Nov 8, 2021 at 1:34 AM Luke Chen <sh...@gmail.com> wrote:

> Thanks Artem,
> It's much better now.
> I've got your idea. In KIP-480: Sticky Partitioner, we'll change partition
> (call partitioner) when either 1 of below condition match
> 1. the batch is full
> 2. when linger.ms is up
> But, you are changing the definition, into a
> "partitioner.sticky.batch.size" size is reached.
>
> It'll fix the uneven distribution issue, because we did the sent out size
> calculation in the producer side.
> But it might have another issue that when the producer rate is low, there
> will be some period of time the distribution is not even. Ex:
> tp-1: 12KB
> tp-2: 0KB
> tp-3: 0KB
> tp-4: 0KB
> while the producer is still keeping sending records into tp-1 (because we
> haven't reached the 16KB threshold)
> Maybe the user should set a good value to "partitioner.sticky.batch.size"
> to fix this issue?
>
> Some comment to the KIP:
> 1. This paragraph is a little confusing, because there's no "batch mode" or
> "non-batch mode", right?
>
> > The batching will continue until either an in-flight batch completes or
> we hit the N bytes and move to the next partition.  This way it takes just
> 5 records to get to batching mode, not 5 x number of partition records, and
> the batching mode will stay longer as we'll be batching while waiting for a
> request to be completed.
>
> Even with linger.ms=0, before the sender thread is ready, we're always
> batching (accumulating) records into batches. So I think the "batch mode"
> description is confusing. And that's why I asked you if you have some kind
> of "batch switch" here.
>
> 2. In motivation, you mentioned 1 drawback of current
> UniformStickyPartitioner is "the sticky partitioner doesn't create batches
> as efficiently", because it sent out a batch with only 1 record (under
> linger.ms=0). But I can't tell how you fix this un-efficient issue in the
> proposed solution. I still see we sent 1 record within 1 batch. Could you
> explain more here?
>
> Thank you.
> Luke
>
> On Sat, Nov 6, 2021 at 6:41 AM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hi Luke,
> >
> > Thank you for your feedback.  I've updated the KIP with your suggestions.
> >
> > 1. Updated with a better example.
> > 2. I removed the reference to ClassicDefaultPartitioner, it was probably
> > confusing.
> > 3. The logic doesn't rely on checking batches, I've updated the proposal
> to
> > make it more explicit.
> > 4. The primary issue (uneven distribution) is described in the linked
> jira,
> > copied an example from jira into the KIP as well.
> >
> > -Artem
> >
> >
> > On Thu, Nov 4, 2021 at 8:34 PM Luke Chen <sh...@gmail.com> wrote:
> >
> > > Hi Artem,
> > > Thanks for the KIP! And thanks for reminding me to complete KIP-782,
> > soon.
> > > :)
> > >
> > > Back to the KIP, I have some comments:
> > > 1. You proposed to have a new config: "partitioner.sticky.batch.size",
> > but
> > > I can't see how we're going to use it to make the partitioner better.
> > > Please explain more in KIP (with an example will be better as
> suggestion
> > > (4))
> > > 2. In the "Proposed change" section, you take an example to use
> > > "ClassicDefaultPartitioner", is that referring to the current default
> > > sticky partitioner? I think it'd better you name your proposed
> partition
> > > with a different name for distinguish between the default one and new
> > one.
> > > (Although after implementation, we are going to just use the same name)
> > > 3. So, if my understanding is correct, you're going to have a "batch"
> > > switch, and before the in-flight is full, it's disabled. Otherwise,
> we'll
> > > enable it. Is that right? Sorry, I don't see any advantage of having
> this
> > > batch switch. Could you explain more?
> > > 4. I think it should be more clear if you can have a clear real example
> > in
> > > the motivation section, to describe what issue we faced using current
> > > sticky partitioner. And in proposed changes section, using the same
> > > example, to describe more detail about how you fix this issue with your
> > > way.
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Hello,
> > > >
> > > > This is the discussion thread for
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > .
> > > >
> > > > The proposal is a bug fix for
> > > > https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > include a
> > > > client config change, therefore we have a KIP to discuss.
> > > >
> > > > -Artem
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Luke Chen <sh...@gmail.com>.
Thanks Artem,
It's much better now.
I've got your idea. In KIP-480: Sticky Partitioner, we change the partition
(call the partitioner) when either one of the conditions below matches:
1. the batch is full
2. linger.ms is up
But you are changing the switching condition into: the
"partitioner.sticky.batch.size" size is reached.

It'll fix the uneven distribution issue, because we do the sent-bytes
calculation on the producer side.
But it might have another issue: when the producer rate is low, there will
be some period of time when the distribution is not even. Ex:
tp-1: 12KB
tp-2: 0KB
tp-3: 0KB
tp-4: 0KB
while the producer keeps sending records into tp-1 (because we haven't
reached the 16KB threshold).
Maybe the user should set a good value for "partitioner.sticky.batch.size"
to fix this issue?
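
To check my understanding, the proposed switching rule is roughly like this
sketch (my own pseudo-code, not code from the KIP; the names are made up):

    // Switch to the next partition only after "partitioner.sticky.batch.size"
    // bytes have been produced to the current one.  With a low rate this is
    // why tp-1 can sit at 12KB while tp-2..tp-4 get nothing for a while.
    class StickySwitchSketch {
        private long bytesProducedToCurrent = 0;
        private final long stickyBatchSize = 16 * 1024; // partitioner.sticky.batch.size

        int nextPartition(int recordSizeInBytes, int currentPartition, int numPartitions) {
            bytesProducedToCurrent += recordSizeInBytes;
            if (bytesProducedToCurrent >= stickyBatchSize) {
                bytesProducedToCurrent = 0;
                return (currentPartition + 1) % numPartitions;
            }
            return currentPartition;
        }
    }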

Some comments on the KIP:
1. This paragraph is a little confusing, because there's no "batch mode" or
"non-batch mode", right?

> The batching will continue until either an in-flight batch completes or
we hit the N bytes and move to the next partition.  This way it takes just
5 records to get to batching mode, not 5 x number of partition records, and
the batching mode will stay longer as we'll be batching while waiting for a
request to be completed.

Even with linger.ms=0, before the sender thread is ready, we're always
batching (accumulating) records into batches. So I think the "batch mode"
description is confusing. And that's why I asked you if you have some kind
of "batch switch" here.

2. In the motivation, you mentioned that 1 drawback of the current
UniformStickyPartitioner is "the sticky partitioner doesn't create batches
as efficiently", because it sends out a batch with only 1 record (under
linger.ms=0). But I can't tell how you fix this inefficiency in the
proposed solution. I still see we send 1 record within 1 batch. Could you
explain more here?

Thank you.
Luke

On Sat, Nov 6, 2021 at 6:41 AM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hi Luke,
>
> Thank you for your feedback.  I've updated the KIP with your suggestions.
>
> 1. Updated with a better example.
> 2. I removed the reference to ClassicDefaultPartitioner, it was probably
> confusing.
> 3. The logic doesn't rely on checking batches, I've updated the proposal to
> make it more explicit.
> 4. The primary issue (uneven distribution) is described in the linked jira,
> copied an example from jira into the KIP as well.
>
> -Artem
>
>
> On Thu, Nov 4, 2021 at 8:34 PM Luke Chen <sh...@gmail.com> wrote:
>
> > Hi Artem,
> > Thanks for the KIP! And thanks for reminding me to complete KIP-782,
> soon.
> > :)
> >
> > Back to the KIP, I have some comments:
> > 1. You proposed to have a new config: "partitioner.sticky.batch.size",
> but
> > I can't see how we're going to use it to make the partitioner better.
> > Please explain more in KIP (with an example will be better as suggestion
> > (4))
> > 2. In the "Proposed change" section, you take an example to use
> > "ClassicDefaultPartitioner", is that referring to the current default
> > sticky partitioner? I think it'd better you name your proposed partition
> > with a different name for distinguish between the default one and new
> one.
> > (Although after implementation, we are going to just use the same name)
> > 3. So, if my understanding is correct, you're going to have a "batch"
> > switch, and before the in-flight is full, it's disabled. Otherwise, we'll
> > enable it. Is that right? Sorry, I don't see any advantage of having this
> > batch switch. Could you explain more?
> > 4. I think it should be more clear if you can have a clear real example
> in
> > the motivation section, to describe what issue we faced using current
> > sticky partitioner. And in proposed changes section, using the same
> > example, to describe more detail about how you fix this issue with your
> > way.
> >
> > Thank you.
> > Luke
> >
> > On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hello,
> > >
> > > This is the discussion thread for
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > .
> > >
> > > The proposal is a bug fix for
> > > https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> include a
> > > client config change, therefore we have a KIP to discuss.
> > >
> > > -Artem
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Luke,

Thank you for your feedback.  I've updated the KIP with your suggestions.

1. Updated with a better example.
2. I removed the reference to ClassicDefaultPartitioner, it was probably
confusing.
3. The logic doesn't rely on checking batches; I've updated the proposal to
make it more explicit.
4. The primary issue (uneven distribution) is described in the linked jira;
I copied an example from the jira into the KIP as well.

-Artem


On Thu, Nov 4, 2021 at 8:34 PM Luke Chen <sh...@gmail.com> wrote:

> Hi Artem,
> Thanks for the KIP! And thanks for reminding me to complete KIP-782, soon.
> :)
>
> Back to the KIP, I have some comments:
> 1. You proposed to have a new config: "partitioner.sticky.batch.size", but
> I can't see how we're going to use it to make the partitioner better.
> Please explain more in KIP (with an example will be better as suggestion
> (4))
> 2. In the "Proposed change" section, you take an example to use
> "ClassicDefaultPartitioner", is that referring to the current default
> sticky partitioner? I think it'd better you name your proposed partition
> with a different name for distinguish between the default one and new one.
> (Although after implementation, we are going to just use the same name)
> 3. So, if my understanding is correct, you're going to have a "batch"
> switch, and before the in-flight is full, it's disabled. Otherwise, we'll
> enable it. Is that right? Sorry, I don't see any advantage of having this
> batch switch. Could you explain more?
> 4. I think it should be more clear if you can have a clear real example in
> the motivation section, to describe what issue we faced using current
> sticky partitioner. And in proposed changes section, using the same
> example, to describe more detail about how you fix this issue with your
> way.
>
> Thank you.
> Luke
>
> On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hello,
> >
> > This is the discussion thread for
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > .
> >
> > The proposal is a bug fix for
> > https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
> > client config change, therefore we have a KIP to discuss.
> >
> > -Artem
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Luke Chen <sh...@gmail.com>.
Hi Artem,
Thanks for the KIP! And thanks for reminding me to complete KIP-782, soon.
:)

Back to the KIP, I have some comments:
1. You proposed to have a new config, "partitioner.sticky.batch.size", but
I can't see how we're going to use it to make the partitioner better.
Please explain more in the KIP (an example would be better, as in
suggestion (4)).
2. In the "Proposed change" section, you use "ClassicDefaultPartitioner" as
an example; is that referring to the current default sticky partitioner? I
think it'd be better to give your proposed partitioner a different name, to
distinguish between the default one and the new one. (Although after
implementation, we are going to just use the same name.)
3. So, if my understanding is correct, you're going to have a "batch"
switch, and before the in-flight requests are full, it's disabled;
otherwise, we'll enable it. Is that right? Sorry, I don't see any advantage
of having this batch switch. Could you explain more?
4. I think it would be clearer if you had a concrete, real example in the
motivation section to describe what issue we face using the current sticky
partitioner, and then, in the proposed changes section, used the same
example to describe in more detail how you fix this issue with your
approach.

Thank you.
Luke

On Fri, Nov 5, 2021 at 1:38 AM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hello,
>
> This is the discussion thread for
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> .
>
> The proposal is a bug fix for
> https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
> client config change, therefore we have a KIP to discuss.
>
> -Artem
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi all,

Thank you to everybody who contributed to this discussion. We've come a
long way, and now the "rejected alternatives" section is almost as big as
the actual proposal (which indicates that it's a well-thought-through
solution :-)).  If there are no further considerations, I'll start the vote
in the next couple of days.

-Artem

On Mon, Mar 14, 2022 at 6:19 PM Artem Livshits <al...@confluent.io>
wrote:

> Hi Jun,
>
> 33. Sounds good.  Updated the KIP.
>
> -Artem
>
> On Mon, Mar 14, 2022 at 5:45 PM Jun Rao <ju...@confluent.io.invalid> wrote:
>
>> Hi, Artem,
>>
>> 33. We introduced onNewBatch() primarily for the sticky partitioner. It
>> seems to be a very subtle API to explain and to use properly. If we can't
>> find any convincing usage, it's probably better to deprecate it so that we
>> could keep the API clean.
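>>
>> (For reference, the hook in question is the default method that KIP-480
>> added to the Partitioner interface, roughly the following; deprecating it
>> would just mean marking it @Deprecated:)
>>
>>     // From org.apache.kafka.clients.producer.Partitioner (added by KIP-480).
>>     default void onNewBatch(String topic, Cluster cluster, int prevPartition) {
>>     }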
>>
>> Thanks,
>>
>> Jun
>>
>> On Mon, Mar 14, 2022 at 1:36 PM Artem Livshits
>> <al...@confluent.io.invalid> wrote:
>>
>> > Hi Jun,
>> >
>> > 33.  That's an interesting point.  Technically, onNewBatch is just a
>> > way to pass some signal to the partitioner; the sticky partitioner uses
>> > this signal in a way that is suboptimal, but in theory someone could use
>> > it for something else.
>> >
>> > -Artem
>> >
>> > On Mon, Mar 14, 2022 at 9:11 AM Jun Rao <ju...@confluent.io.invalid>
>> wrote:
>> >
>> > > Hi, Artem,
>> > >
>> > > Thanks for the reply. A couple of more comments.
>> > >
>> > > 32. We could defer the metrics until we know what to add.
>> > >
>> > > 33. Since we are deprecating DefaultPartitioner and
>> > > UniformStickyPartitioner, should we deprecate Partitioner.onNewBatch()
>> > > too, given its unexpected side effect?
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
>> > > <al...@confluent.io.invalid> wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > 32. Good point.  Do you think it's ok to defer the metrics until we
>> run
>> > > > some benchmarks so that we get a better idea of what metrics we
>> need?
>> > > >
>> > > > -Artem
>> > > >
>> > > > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao <ju...@confluent.io.invalid>
>> > > wrote:
>> > > >
>> > > > > Hi, Artem.
>> > > > >
>> > > > > Thanks for the reply. One more comment.
>> > > > >
>> > > > > 32. Do we need to add any new metric on the producer? For
>> example, if
>> > > > > partitioner.availability.timeout.ms is > 0, it might be useful to
>> > know
>> > > > the
>> > > > > number of unavailable partitions.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
>> > > > > <al...@confluent.io.invalid> wrote:
>> > > > >
>> > > > > > Hi Jun,
>> > > > > >
>> > > > > > 30.  Clarified.
>> > > > > >
>> > > > > > 31. I plan to do some benchmarking once implementation is
>> finished,
>> > > > I'll
>> > > > > > update the KIP with the results once I have them.  The reason to
>> > make
>> > > > it
>> > > > > > default is that it won't be used otherwise and we won't know if
>> > it's
>> > > > good
>> > > > > > or not in practical workloads.
>> > > > > >
>> > > > > > -Artem
>> > > > > >
>> > > > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao
>> <jun@confluent.io.invalid
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi, Artem,
>> > > > > > >
>> > > > > > > Thanks for the updated KIP. A couple of more comments.
>> > > > > > >
>> > > > > > > 30. For the 3 new configs, it would be useful to make it clear
>> > that
>> > > > > they
>> > > > > > > are only relevant when the partitioner class is null.
>> > > > > > >
>> > > > > > > 31. partitioner.adaptive.partitioning.enable : I am wondering
>> > > whether
>> > > > > it
>> > > > > > > should default to true. This is a more complex behavior than
>> > > "uniform
>> > > > > > > sticky" and may take some time to get right. If we do want to
>> > > enable
>> > > > it
>> > > > > > by
>> > > > > > > default, it would be useful to validate it with some test
>> > results.
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
>> > > > > > > <al...@confluent.io.invalid> wrote:
>> > > > > > >
>> > > > > > > > Thank you for feedback, I've discussed this offline with
>> some
>> > of
>> > > > the
>> > > > > > > folks
>> > > > > > > > and updated the KIP.  The main change is that now instead of
>> > > using
>> > > > > > > > DefaultPartitioner and UniformStickyPartitioners as flags,
>> in
>> > the
>> > > > new
>> > > > > > > > proposal the default partitioner is null, so if no custom
>> > > > partitioner
>> > > > > > is
>> > > > > > > > specified then the partitioning logic is implemented in
>> > > > > KafkaProducer.
>> > > > > > > > Compatibility section is updated as well.  Also the
>> > configuration
>> > > > > > options
>> > > > > > > > are renamed to be more consistent.
>> > > > > > > >
>> > > > > > > > -Artem
>> > > > > > > >
>> > > > > > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <
>> showuon@gmail.com>
>> > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Artem,
>> > > > > > > > >
>> > > > > > > > > Thanks for your explanation and update to the KIP.
>> > > > > > > > > Some comments:
>> > > > > > > > >
>> > > > > > > > > 5. In the description for `enable.adaptive.partitioning`,
>> the
>> > > > > `false`
>> > > > > > > > case,
>> > > > > > > > > you said:
>> > > > > > > > > > the producer will try to distribute messages uniformly.
>> > > > > > > > > I think we should describe the possible skewing
>> distribution.
>> > > > > > > Otherwise,
>> > > > > > > > > user might be confused about why adaptive partitioning is
>> > > > > important.
>> > > > > > > > >
>> > > > > > > > > 6. In the description for `
>> partition.availability.timeout.ms
>> > `,
>> > > I
>> > > > > > think
>> > > > > > > > we
>> > > > > > > > > should mention in the last sentence about if
>> > > > > > > > `enable.adaptive.partitioning`
>> > > > > > > > > is disabled this logic is also disabled.
>> > > > > > > > >
>> > > > > > > > > 7. Similar thoughts as Ismael, I think we should have a
>> POC
>> > and
>> > > > > test
>> > > > > > to
>> > > > > > > > > prove that this adaptive partitioning algorithm can have
>> > better
>> > > > > > uniform
>> > > > > > > > > partitioning, compared with original sticky one.
>> > > > > > > > >
>> > > > > > > > > Thank you.
>> > > > > > > > > Luke
>> > > > > > > > >
>> > > > > > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <
>> > ismael@juma.me.uk>
>> > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Regarding `3`, we should only deprecate it if we're sure
>> > the
>> > > > new
>> > > > > > > > approach
>> > > > > > > > > > handles all cases better. Are we confident about that
>> for
>> > > both
>> > > > of
>> > > > > > the
>> > > > > > > > > > previous partitioners?
>> > > > > > > > > >
>> > > > > > > > > > Ismael
>> > > > > > > > > >
>> > > > > > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
>> > > > > > > <djacot@confluent.io.invalid
>> > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi Artem,
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks for the KIP! I have a few comments:
>> > > > > > > > > > >
>> > > > > > > > > > > 1. In the preamble of the proposed change section,
>> there
>> > is
>> > > > > > still a
>> > > > > > > > > > > mention of the
>> > > > > > > > > > > -1 approach. My understanding is that we have moved
>> away
>> > > from
>> > > > > it
>> > > > > > > now.
>> > > > > > > > > > >
>> > > > > > > > > > > 2. I am a bit concerned by the trick suggested about
>> the
>> > > > > > > > > > > DefaultPartitioner and
>> > > > > > > > > > > the UniformStickyPartitioner. I do agree that
>> > implementing
>> > > > the
>> > > > > > > logic
>> > > > > > > > in
>> > > > > > > > > > the
>> > > > > > > > > > > producer itself is a good thing. However, it is weird
>> > from
>> > > a
>> > > > > user
>> > > > > > > > > > > perspective
>> > > > > > > > > > > that he can set a class as partitioner that is not
>> used
>> > in
>> > > > the
>> > > > > > > end. I
>> > > > > > > > > > > think that
>> > > > > > > > > > > this will be confusing for our users. Have we
>> considered
>> > > > > changing
>> > > > > > > the
>> > > > > > > > > > > default
>> > > > > > > > > > > value of partitioner.class to null to indicate that
>> the
>> > new
>> > > > > > > built-in
>> > > > > > > > > > > partitioner
>> > > > > > > > > > > must be used? By default, the built-in partitioner
>> would
>> > be
>> > > > > used
>> > > > > > > > unless
>> > > > > > > > > > the
>> > > > > > > > > > > user explicitly specify one. The downside is that the
>> new
>> > > > > default
>> > > > > > > > > > behavior
>> > > > > > > > > > > would not work if the user explicitly specify the
>> > > partitioner
>> > > > > but
>> > > > > > > we
>> > > > > > > > > > could
>> > > > > > > > > > > mitigate this with my next point.
>> > > > > > > > > > >
>> > > > > > > > > > > 3. Related to the previous point, I think that we
>> could
>> > > > > deprecate
>> > > > > > > > both
>> > > > > > > > > > the
>> > > > > > > > > > > DefaultPartitioner and the UniformStickyPartitioner. I
>> > > would
>> > > > > also
>> > > > > > > > add a
>> > > > > > > > > > > warning if one of them is explicitly provided by the
>> user
>> > > to
>> > > > > > inform
>> > > > > > > > > them
>> > > > > > > > > > > that they should switch to the new built-in one. I am
>> > > pretty
>> > > > > sure
>> > > > > > > > that
>> > > > > > > > > > most
>> > > > > > > > > > > of the folks use the default configuration anyway.
>> > > > > > > > > > >
>> > > > > > > > > > > 4. It would be great if we could explain why the -1
>> way
>> > was
>> > > > > > > rejected.
>> > > > > > > > > At
>> > > > > > > > > > > the moment, the rejected alternative only explain the
>> > idea
>> > > > but
>> > > > > > does
>> > > > > > > > not
>> > > > > > > > > > > say why we rejected it.
>> > > > > > > > > > >
>> > > > > > > > > > > Best,
>> > > > > > > > > > > David
>> > > > > > > > > > >
>> > > > > > > > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
>> > > > > > > > > > > <al...@confluent.io.invalid> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > Hi Jun,
>> > > > > > > > > > > >
>> > > > > > > > > > > > 2. Removed the option from the KIP.  Now the sticky
>> > > > > > partitioning
>> > > > > > > > > > > threshold
>> > > > > > > > > > > > is hardcoded to batch.size.
>> > > > > > > > > > > >
>> > > > > > > > > > > > 20. Added the corresponding wording to the KIP.
>> > > > > > > > > > > >
>> > > > > > > > > > > > -Artem
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao
>> > > > > > <jun@confluent.io.invalid
>> > > > > > > >
>> > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi, Artem,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Thanks for the reply.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 1. Sounds good.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 2. If we don't expect users to change it, we
>> probably
>> > > > could
>> > > > > > > just
>> > > > > > > > > > leave
>> > > > > > > > > > > out
>> > > > > > > > > > > > > the new config. In general, it's easy to add a new
>> > > > config,
>> > > > > > but
>> > > > > > > > hard
>> > > > > > > > > > to
>> > > > > > > > > > > > > remove an existing config.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 20. The two new configs
>> enable.adaptive.partitioning
>> > > and
>> > > > > > > > > > > > > partition.availability.timeout.ms only apply to
>> the
>> > > two
>> > > > > > > built-in
>> > > > > > > > > > > > > partitioners DefaultPartitioner and
>> > > > > UniformStickyPartitioner,
>> > > > > > > > > right?
>> > > > > > > > > > It
>> > > > > > > > > > > > > would be useful to document that in the KIP.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Jun
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
>> > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Jun,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Thank you for the suggestions.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 1. As we discussed offline, we can hardcode the
>> > logic
>> > > > for
>> > > > > > > > > > > > > > DefaultPartitioner and UniformStickyPartitioner
>> in
>> > > the
>> > > > > > > > > > KafkaProducer
>> > > > > > > > > > > > > (i.e.
>> > > > > > > > > > > > > > the DefaultPartitioner.partition won't get
>> called,
>> > > > > instead
>> > > > > > > > > > > KafkaProducer
>> > > > > > > > > > > > > > would check if the partitioner is an instance of
>> > > > > > > > > DefaultPartitioner
>> > > > > > > > > > > and
>> > > > > > > > > > > > > > then run the actual partitioning logic itself).
>> > Then
>> > > > the
>> > > > > > > > change
>> > > > > > > > > to
>> > > > > > > > > > > the
>> > > > > > > > > > > > > > Partitioner wouldn't be required.  I'll update
>> the
>> > > KIP
>> > > > to
>> > > > > > > > reflect
>> > > > > > > > > > > that.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 2. I don't expect users to change this too
>> often,
>> > as
>> > > > > > changing
>> > > > > > > > it
>> > > > > > > > > > > would
>> > > > > > > > > > > > > > require a bit of studying of the production
>> > patterns.
>> > > > > As a
>> > > > > > > > > general
>> > > > > > > > > > > > > > principle, if I can think of a model that
>> requires
>> > a
>> > > > > > > deviation
>> > > > > > > > > from
>> > > > > > > > > > > > > > default, I tend to add a configuration option.
>> It
>> > > > could
>> > > > > be
>> > > > > > > > that
>> > > > > > > > > > > it'll
>> > > > > > > > > > > > > > never get used in practice, but I cannot prove
>> > that.
>> > > > I'm
>> > > > > > ok
>> > > > > > > > with
>> > > > > > > > > > > > > removing
>> > > > > > > > > > > > > > the option, let me know what you think.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > -Artem
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
>> > > > > > > > <jun@confluent.io.invalid
>> > > > > > > > > >
>> > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi, Artem,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Thanks for the reply. A few more comments.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 1. Since we control the implementation and the
>> > > usage
>> > > > of
>> > > > > > > > > > > > > > DefaultPartitioner,
>> > > > > > > > > > > > > > > another way is to instantiate the
>> > > DefaultPartitioner
>> > > > > > with a
>> > > > > > > > > > special
>> > > > > > > > > > > > > > > constructor, which allows it to have more
>> access
>> > to
>> > > > > > > internal
>> > > > > > > > > > > > > information.
>> > > > > > > > > > > > > > > Then we could just change the behavior of
>> > > > > > > DefaultPartitioner
>> > > > > > > > > > such
>> > > > > > > > > > > that
>> > > > > > > > > > > > > > it
>> > > > > > > > > > > > > > > can use the internal infoamtion when choosing
>> the
>> > > > > > > partition.
>> > > > > > > > > This
>> > > > > > > > > > > seems
>> > > > > > > > > > > > > > > more intuitive than having DefaultPartitioner
>> > > return
>> > > > -1
>> > > > > > > > > > partition.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 2. I guess partitioner.sticky.batch.size is
>> > > > introduced
>> > > > > > > > because
>> > > > > > > > > > the
>> > > > > > > > > > > > > > > effective batch size could be less than
>> > batch.size
>> > > > and
>> > > > > we
>> > > > > > > > want
>> > > > > > > > > to
>> > > > > > > > > > > align
>> > > > > > > > > > > > > > > partition switching with the effective batch
>> > size.
>> > > > How
>> > > > > > > would
>> > > > > > > > a
>> > > > > > > > > > user
>> > > > > > > > > > > > > know
>> > > > > > > > > > > > > > > the effective batch size to set
>> > > > > > > partitioner.sticky.batch.size
>> > > > > > > > > > > properly?
>> > > > > > > > > > > > > > If
>> > > > > > > > > > > > > > > the user somehow knows the effective batch
>> size,
>> > > does
>> > > > > > > setting
>> > > > > > > > > > > > > batch.size
>> > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > the effective batch size achieve the same
>> result?
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > 4. Thanks for the explanation. Makes sense to
>> me.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Jun
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Jun
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
>> > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hi Jun,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 1. Updated the KIP to add a couple
>> paragraphs
>> > > about
>> > > > > > > > > > > implementation
>> > > > > > > > > > > > > > > > necessities in the Proposed Changes section.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 2. Sorry if my reply was confusing, what I
>> > meant
>> > > to
>> > > > > say
>> > > > > > > > (and
>> > > > > > > > > I
>> > > > > > > > > > > > > > elaborated
>> > > > > > > > > > > > > > > > on that in point #3) is that there could be
>> > > > patterns
>> > > > > > for
>> > > > > > > > > which
>> > > > > > > > > > > 16KB
>> > > > > > > > > > > > > > > > wouldn't be the most effective setting,
>> thus it
>> > > > would
>> > > > > > be
>> > > > > > > > good
>> > > > > > > > > > to
>> > > > > > > > > > > make
>> > > > > > > > > > > > > > it
>> > > > > > > > > > > > > > > > configurable.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > 4. We could use broker readiness timeout.
>> But
>> > > I'm
>> > > > > not
>> > > > > > > sure
>> > > > > > > > > it
>> > > > > > > > > > > would
>> > > > > > > > > > > > > > > > correctly model the broker load.  The
>> problem
>> > is
>> > > > that
>> > > > > > > > latency
>> > > > > > > > > > is
>> > > > > > > > > > > not
>> > > > > > > > > > > > > an
>> > > > > > > > > > > > > > > > accurate measure of throughput, we could
>> have 2
>> > > > > brokers
>> > > > > > > > that
>> > > > > > > > > > have
>> > > > > > > > > > > > > equal
>> > > > > > > > > > > > > > > > throughput but one has higher latency (so it
>> > > takes
>> > > > > > larger
>> > > > > > > > > > batches
>> > > > > > > > > > > > > less
>> > > > > > > > > > > > > > > > frequently, but still takes the same load).
>> > > > > > > Latency-based
>> > > > > > > > > > logic
>> > > > > > > > > > > is
>> > > > > > > > > > > > > > > likely
>> > > > > > > > > > > > > > > > to send less data to the broker with higher
>> > > > latency.
>> > > > > > > Using
>> > > > > > > > > the
>> > > > > > > > > > > queue
>> > > > > > > > > > > > > > > size
>> > > > > > > > > > > > > > > > would adapt to throughput, regardless of
>> > latency
>> > > > > (which
>> > > > > > > > could
>> > > > > > > > > > be
>> > > > > > > > > > > > > just a
>> > > > > > > > > > > > > > > > result of deployment topology), so that's
>> the
>> > > model
>> > > > > > > chosen
>> > > > > > > > in
>> > > > > > > > > > the
>> > > > > > > > > > > > > > > > proposal.  The
>> > partition.availability.timeout.ms
>> > > > > logic
>> > > > > > > > > > > approaches
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > model
>> > > > > > > > > > > > > > > > from a slightly different angle, say we
>> have a
>> > > > > > > requirement
>> > > > > > > > to
>> > > > > > > > > > > deliver
>> > > > > > > > > > > > > > > > messages via brokers that have a certain
>> > latency,
>> > > > > then
>> > > > > > > > > > > > > > > > partition.availability.timeout.ms could be
>> > used
>> > > to
>> > > > > > tune
>> > > > > > > > > that.
>> > > > > > > > > > > > > Latency
>> > > > > > > > > > > > > > > is
>> > > > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > much more volatile metric than throughput
>> > > (latency
>> > > > > > > depends
>> > > > > > > > on
>> > > > > > > > > > > > > external
>> > > > > > > > > > > > > > > > load, on capacity, on deployment topology,
>> on
>> > > > jitter
>> > > > > in
>> > > > > > > > > > network,
>> > > > > > > > > > > on
>> > > > > > > > > > > > > > > jitter
>> > > > > > > > > > > > > > > > in disk, etc.) and I think it would be best
>> to
>> > > > leave
>> > > > > > > > > > > latency-based
>> > > > > > > > > > > > > > > > thresholds configurable to tune for the
>> > > > environment.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > -Artem
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
>> > > > > > > > > > > <ju...@confluent.io.invalid>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hi, Artem,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks for the reply. A few more comments.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > 1. Perhaps you could elaborate a bit more
>> on
>> > > how
>> > > > > the
>> > > > > > > > > producer
>> > > > > > > > > > > > > > > determines
>> > > > > > > > > > > > > > > > > the partition if the partitioner returns
>> -1.
>> > > This
>> > > > > > will
>> > > > > > > > help
>> > > > > > > > > > > > > > understand
>> > > > > > > > > > > > > > > > why
>> > > > > > > > > > > > > > > > > encapsulating that logic as a partitioner
>> is
>> > > not
>> > > > > > clean.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > 2. I am not sure that I understand this
>> part.
>> > > If
>> > > > > > 15.5KB
>> > > > > > > > is
>> > > > > > > > > > more
>> > > > > > > > > > > > > > > > efficient,
>> > > > > > > > > > > > > > > > > could we just set batch.size to 15.5KB?
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > 4. Yes, we could add a switch (or a
>> variant
>> > of
>> > > > the
>> > > > > > > > > > > partitioner) for
>> > > > > > > > > > > > > > > > > enabling this behavior. Also, choosing
>> > > partitions
>> > > > > > based
>> > > > > > > > on
>> > > > > > > > > > > broker
>> > > > > > > > > > > > > > > > readiness
>> > > > > > > > > > > > > > > > > can be made in a smoother way. For
>> example,
>> > we
>> > > > > could
>> > > > > > > > track
>> > > > > > > > > > the
>> > > > > > > > > > > last
>> > > > > > > > > > > > > > > time
>> > > > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > > broker has drained any batches from the
>> > > > > accumulator.
>> > > > > > We
>> > > > > > > > can
>> > > > > > > > > > > then
>> > > > > > > > > > > > > > select
>> > > > > > > > > > > > > > > > > partitions from brokers proportionally to
>> the
>> > > > > inverse
>> > > > > > > of
>> > > > > > > > > that
>> > > > > > > > > > > time.
>> > > > > > > > > > > > > > > This
>> > > > > > > > > > > > > > > > > seems smoother than a cutoff based on a
>> > > > > > > > > > > > > > > > partition.availability.timeout.ms
>> > > > > > > > > > > > > > > > >  threshold.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Jun
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem
>> > Livshits
>> > > > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Hello Luke, Jun,
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Thank you for your feedback.  I've added
>> > the
>> > > > > > Rejected
>> > > > > > > > > > > Alternative
>> > > > > > > > > > > > > > > > section
>> > > > > > > > > > > > > > > > > > that may clarify some of the questions
>> > w.r.t.
>> > > > > > > returning
>> > > > > > > > > -1.
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.
>> > The
>> > > > > > problem
>> > > > > > > > is
>> > > > > > > > > > > that a
>> > > > > > > > > > > > > > > > > significant
>> > > > > > > > > > > > > > > > > > part of the logic needs to be in the
>> > producer
>> > > > > > > (because
>> > > > > > > > it
>> > > > > > > > > > now
>> > > > > > > > > > > > > uses
>> > > > > > > > > > > > > > > > > > information about brokers that only the
>> > > > producer
>> > > > > > > > knows),
>> > > > > > > > > so
>> > > > > > > > > > > > > > > > encapsulation
>> > > > > > > > > > > > > > > > > > of the logic within the default
>> partitioner
>> > > > isn't
>> > > > > > as
>> > > > > > > > > clean.
>> > > > > > > > > > > > >  I've
>> > > > > > > > > > > > > > > > added
>> > > > > > > > > > > > > > > > > > the Rejected Alternative section that
>> > > documents
>> > > > > an
>> > > > > > > > > attempt
>> > > > > > > > > > to
>> > > > > > > > > > > > > keep
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > encapsulation by providing new
>> callbacks to
>> > > the
>> > > > > > > > > > partitioner.
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > 2. The meaning of the
>> > > > > partitioner.sticky.batch.size
>> > > > > > > is
>> > > > > > > > > > > explained
>> > > > > > > > > > > > > in
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > Uniform Sticky Batch Size section.
>> > > Basically,
>> > > > we
>> > > > > > > track
>> > > > > > > > > the
>> > > > > > > > > > > > > amount
>> > > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > > bytes
>> > > > > > > > > > > > > > > > > > produced to the partition and if it
>> exceeds
>> > > > > > > > > > > > > > > > partitioner.sticky.batch.size
>> > > > > > > > > > > > > > > > > > then we switch to the next partition.
>> As
>> > far
>> > > > as
>> > > > > > the
>> > > > > > > > > reason
>> > > > > > > > > > > to
>> > > > > > > > > > > > > make
>> > > > > > > > > > > > > > > it
>> > > > > > > > > > > > > > > > > > different from batch.size, I think Luke
>> > > > answered
>> > > > > > this
>> > > > > > > > > with
>> > > > > > > > > > > the
>> > > > > > > > > > > > > > > question
>> > > > > > > > > > > > > > > > > #3
>> > > > > > > > > > > > > > > > > > -- what if the load pattern is such that
>> > > 15.5KB
>> > > > > > would
>> > > > > > > > be
>> > > > > > > > > > more
>> > > > > > > > > > > > > > > efficient
>> > > > > > > > > > > > > > > > > > than 16KB?
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > 3. I think it's hard to have one size
>> that
>> > > > would
>> > > > > > fit
>> > > > > > > > all
>> > > > > > > > > > > > > patterns.
>> > > > > > > > > > > > > > > > E.g.
>> > > > > > > > > > > > > > > > > if
>> > > > > > > > > > > > > > > > > > the load pattern is such that there is
>> > linger
>> > > > and
>> > > > > > the
>> > > > > > > > app
>> > > > > > > > > > > fills
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > batch
>> > > > > > > > > > > > > > > > > > before linger expires, then having 16KB
>> > would
>> > > > > most
>> > > > > > > > likely
>> > > > > > > > > > > > > > synchronize
>> > > > > > > > > > > > > > > > > > batching and partition switching, so
>> each
>> > > > > partition
>> > > > > > > > would
>> > > > > > > > > > > get a
>> > > > > > > > > > > > > > full
>> > > > > > > > > > > > > > > > > > batch.  If load pattern is such that
>> there
>> > > are
>> > > > a
>> > > > > > few
>> > > > > > > > > > > non-complete
>> > > > > > > > > > > > > > > > batches
>> > > > > > > > > > > > > > > > > > go out before a larger batch starts to
>> > fill,
>> > > > then
>> > > > > > it
>> > > > > > > > may
>> > > > > > > > > > > actually
>> > > > > > > > > > > > > > be
>> > > > > > > > > > > > > > > > > > beneficial to make slightly larger (e.g.
>> > > > > linger=0,
>> > > > > > > > first
>> > > > > > > > > > few
>> > > > > > > > > > > > > > records
>> > > > > > > > > > > > > > > go
>> > > > > > > > > > > > > > > > > in
>> > > > > > > > > > > > > > > > > > the first batch, then next few records
>> go
>> > to
>> > > > > second
>> > > > > > > > > batch,
>> > > > > > > > > > > and so
>> > > > > > > > > > > > > > on,
>> > > > > > > > > > > > > > > > > until
>> > > > > > > > > > > > > > > > > > 5 in-flight, then larger batch would
>> form
>> > > while
>> > > > > > > waiting
>> > > > > > > > > for
>> > > > > > > > > > > > > broker
>> > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > > > > respond, but the partition switch would
>> > > happen
>> > > > > > before
>> > > > > > > > the
>> > > > > > > > > > > larger
>> > > > > > > > > > > > > > > batch
>> > > > > > > > > > > > > > > > is
>> > > > > > > > > > > > > > > > > > full).
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > 4. There are a couple of reasons for
>> > > > introducing
>> > > > > > > > > > > > > > > > > > partition.availability.timeout.ms.
>> Luke's
>> > > an
>> > > > > > Jun's
>> > > > > > > > > > > questions
>> > > > > > > > > > > > > are
>> > > > > > > > > > > > > > > > > slightly
>> > > > > > > > > > > > > > > > > > different, so I'm going to separate
>> > replies.
>> > > > > > > > > > > > > > > > > > (Luke) Is the queue size a good enough
>> > > > signal?  I
>> > > > > > > think
>> > > > > > > > > > it's
>> > > > > > > > > > > a
>> > > > > > > > > > > > > good
>> > > > > > > > > > > > > > > > > default
>> > > > > > > > > > > > > > > > > > signal as it tries to preserve general
>> > > fairness
>> > > > > and
>> > > > > > > not
>> > > > > > > > > > > overreact
>> > > > > > > > > > > > > > on
>> > > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > broker's state at each moment in time.
>> But
>> > > > > because
>> > > > > > > > it's
>> > > > > > > > > > > smooth,
>> > > > > > > > > > > > > it
>> > > > > > > > > > > > > > > may
>> > > > > > > > > > > > > > > > > not
>> > > > > > > > > > > > > > > > > > be reactive enough to instantaneous
>> latency
>> > > > > jumps.
>> > > > > > > For
>> > > > > > > > > > > > > > > > latency-sensitive
>> > > > > > > > > > > > > > > > > > workloads, it may be desirable to react
>> > > faster
>> > > > > > when a
>> > > > > > > > > > broker
>> > > > > > > > > > > > > > becomes
>> > > > > > > > > > > > > > > > > > unresponsive (but that may make the
>> > > > distribution
>> > > > > > > really
>> > > > > > > > > > > choppy),
>> > > > > > > > > > > > > so
>> > > > > > > > > > > > > > > > > > partition.availability.timeout.ms
>> provides
>> > > an
>> > > > > > > > > opportunity
>> > > > > > > > > > to
>> > > > > > > > > > > > > tune
>> > > > > > > > > > > > > > > > > > adaptiveness.
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > (Jun) Can we just not assign partitions
>> to
>> > > > > brokers
>> > > > > > > that
>> > > > > > > > > are
>> > > > > > > > > > > not
>> > > > > > > > > > > > > > > ready?
>> > > > > > > > > > > > > > > > > > Switching partitions purely based on
>> > current
>> > > > > broker
>> > > > > > > > > > readiness
>> > > > > > > > > > > > > > > > information
>> > > > > > > > > > > > > > > > > > can really skew workload I think (or at
>> > > least I
>> > > > > > > > couldn't
>> > > > > > > > > > > build a
>> > > > > > > > > > > > > > > model
>> > > > > > > > > > > > > > > > > that
>> > > > > > > > > > > > > > > > > > proves that over time it's going to be
>> > > > generally
>> > > > > > > > fair), I
>> > > > > > > > > > > feel
>> > > > > > > > > > > > > that
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > algorithm should try to be fair in
>> general
>> > > and
>> > > > > use
>> > > > > > > > > smoother
>> > > > > > > > > > > > > signals
>> > > > > > > > > > > > > > > by
>> > > > > > > > > > > > > > > > > > default (e.g. a broker with choppier
>> > latency
>> > > > may
>> > > > > > get
>> > > > > > > > much
>> > > > > > > > > > > less
>> > > > > > > > > > > > > load
>> > > > > > > > > > > > > > > > even
>> > > > > > > > > > > > > > > > > > though it can handle throughput, it then
>> > may
>> > > > > > > > potentially
>> > > > > > > > > > skew
>> > > > > > > > > > > > > > > > > consumption),
>> > > > > > > > > > > > > > > > > > note that the queue-size-based logic
>> uses
>> > > > > > > probabilities
>> > > > > > > > > (so
>> > > > > > > > > > > we
>> > > > > > > > > > > > > > don't
>> > > > > > > > > > > > > > > > > fully
>> > > > > > > > > > > > > > > > > > remove brokers, just make it less
>> likely)
>> > and
>> > > > > > > relative
>> > > > > > > > > info
>> > > > > > > > > > > > > rather
>> > > > > > > > > > > > > > > > than a
>> > > > > > > > > > > > > > > > > > threshold (so if all brokers are
>> heavily,
>> > but
>> > > > > > equally
>> > > > > > > > > > loaded,
>> > > > > > > > > > > > > they
>> > > > > > > > > > > > > > > will
>> > > > > > > > > > > > > > > > > get
>> > > > > > > > > > > > > > > > > > equal distribution, rather than get
>> removed
>> > > > > because
>> > > > > > > > they
>> > > > > > > > > > > exceed
>> > > > > > > > > > > > > > some
>> > > > > > > > > > > > > > > > > > threshold).  So at the very least, I
>> would
>> > > like
>> > > > > > this
>> > > > > > > > > logic
>> > > > > > > > > > > to be
>> > > > > > > > > > > > > > > turned
>> > > > > > > > > > > > > > > > > off
>> > > > > > > > > > > > > > > > > > by default as it's hard to predict what
>> it
>> > > > could
>> > > > > do
>> > > > > > > > with
>> > > > > > > > > > > > > different
>> > > > > > > > > > > > > > > > > patterns
>> > > > > > > > > > > > > > > > > > (which means that there would need to be
>> > some
>> > > > > > > > > > > configuration).  We
>> > > > > > > > > > > > > > > could
>> > > > > > > > > > > > > > > > > > just not use brokers that are not ready,
>> > but
>> > > > > > again, I
>> > > > > > > > > think
>> > > > > > > > > > > that
>> > > > > > > > > > > > > > it's
>> > > > > > > > > > > > > > > > > good
>> > > > > > > > > > > > > > > > > > to try to be fair under normal
>> > circumstances,
>> > > > so
>> > > > > if
>> > > > > > > > > > normally
>> > > > > > > > > > > > > > brokers
>> > > > > > > > > > > > > > > > can
>> > > > > > > > > > > > > > > > > > respond under some
>> > > > > > partition.availability.timeout.ms
>> > > > > > > > > > > threshold
>> > > > > > > > > > > > > and
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > application works well with those
>> > latencies,
>> > > > then
>> > > > > > we
>> > > > > > > > > could
>> > > > > > > > > > > > > > distribute
>> > > > > > > > > > > > > > > > > data
>> > > > > > > > > > > > > > > > > > equally between brokers that don't
>> exceed
>> > the
>> > > > > > > > latencies.
>> > > > > > > > > > The
>> > > > > > > > > > > > > > value,
>> > > > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > > > course, would depend on the environment
>> and
>> > > app
>> > > > > > > > > > requirements,
>> > > > > > > > > > > > > hence
>> > > > > > > > > > > > > > > > it's
>> > > > > > > > > > > > > > > > > > configurable.
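
(As an illustration of the queue-size-based probabilities mentioned above; the
names are assumed and the real logic would live in the producer's accumulator:)

    import java.util.concurrent.ThreadLocalRandom;

    public class QueueSizeWeighting {
        // queueSize[i] = number of batches currently queued for the leader of
        // candidate partition i; heavier queues get a lower, but non-zero, weight
        static int choosePartition(int[] queueSize) {
            double[] weight = new double[queueSize.length];
            double total = 0;
            for (int i = 0; i < queueSize.length; i++) {
                weight[i] = 1.0 / (queueSize[i] + 1);  // +1 avoids division by zero
                total += weight[i];
            }
            double r = ThreadLocalRandom.current().nextDouble(total);
            for (int i = 0; i < weight.length; i++) {
                r -= weight[i];
                if (r <= 0)
                    return i;
            }
            return queueSize.length - 1;  // fallback for rounding at the edge
        }
    }
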
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > 10. Added a statement at the beginning
>> of
>> > the
>> > > > > > > proposed
>> > > > > > > > > > > changes.
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > -Artem
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
>> > > > > > > > > > > <jun@confluent.io.invalid
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Hi, Artem,
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Thanks for the KIP. A few comments
>> below.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > 1. I agree with Luke that having the
>> > > > > partitioner
>> > > > > > > > > > returning
>> > > > > > > > > > > -1
>> > > > > > > > > > > > > is
>> > > > > > > > > > > > > > > kind
>> > > > > > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > > > > weird. Could we just change the
>> > > > implementation
>> > > > > of
>> > > > > > > > > > > > > > > DefaultPartitioner
>> > > > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > > new behavior?
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > 2. partitioner.sticky.batch.size:
>> Similar
>> > > > > > question
>> > > > > > > to
>> > > > > > > > > > > Luke. I
>> > > > > > > > > > > > > am
>> > > > > > > > > > > > > > > not
>> > > > > > > > > > > > > > > > > sure
>> > > > > > > > > > > > > > > > > > > why we want to introduce this new
>> > > > > configuration.
>> > > > > > > > Could
>> > > > > > > > > we
>> > > > > > > > > > > just
>> > > > > > > > > > > > > > use
>> > > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > > existing batch.size?
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > 4. I also agree with Luke that it's
>> not
>> > > clear
>> > > > > why
>> > > > > > > we
>> > > > > > > > > need
>> > > > > > > > > > > > > > > > > > > partition.availability.timeout.ms.
>> The
>> > KIP
>> > > > > says
>> > > > > > > the
>> > > > > > > > > > broker
>> > > > > > > > > > > > > > "would
>> > > > > > > > > > > > > > > > not
>> > > > > > > > > > > > > > > > > be
>> > > > > > > > > > > > > > > > > > > chosen until the broker is able to
>> accept
>> > > the
>> > > > > > next
>> > > > > > > > > ready
>> > > > > > > > > > > batch
>> > > > > > > > > > > > > > from
>> > > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > > partition". If we are keeping track of
>> > > this,
>> > > > > > could
>> > > > > > > we
>> > > > > > > > > > just
>> > > > > > > > > > > > > avoid
>> > > > > > > > > > > > > > > > > > assigning
>> > > > > > > > > > > > > > > > > > > records to partitions whose leader is
>> not
>> > > > able
>> > > > > to
>> > > > > > > > > accept
>> > > > > > > > > > > the
>> > > > > > > > > > > > > next
>> > > > > > > > > > > > > > > > > batch?
>> > > > > > > > > > > > > > > > > > If
>> > > > > > > > > > > > > > > > > > > we do that, perhaps we don't need
>> > > > > > > > > > > > > > > partition.availability.timeout.ms.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > 10. Currently, partitioner.class
>> defaults
>> > > to
>> > > > > > > > > > > > > DefaultPartitioner,
>> > > > > > > > > > > > > > > > which
>> > > > > > > > > > > > > > > > > > uses
>> > > > > > > > > > > > > > > > > > > StickyPartitioner when the key is
>> > > specified.
>> > > > > > Since
>> > > > > > > > this
>> > > > > > > > > > KIP
>> > > > > > > > > > > > > > > improves
>> > > > > > > > > > > > > > > > > upon
>> > > > > > > > > > > > > > > > > > > StickyPartitioner, it would be useful
>> to
>> > > make
>> > > > > the
>> > > > > > > new
>> > > > > > > > > > > behavior
>> > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > default
>> > > > > > > > > > > > > > > > > > > and document that in the KIP.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Jun
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke
>> > Chen <
>> > > > > > > > > > > showuon@gmail.com>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Hi Artem,
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Also, one more thing I think you
>> need
>> > to
>> > > > > know.
>> > > > > > > > > > > > > > > > > > > > As this bug KAFKA-7572 <
>> > > > > > > > > > > > > > > > > >
>> > > > https://issues.apache.org/jira/browse/KAFKA-7572
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > mentioned, sometimes the custom
>> > > partitioner
>> > > > > > would
>> > > > > > > > > > return
>> > > > > > > > > > > > > > negative
>> > > > > > > > > > > > > > > > > > > partition
>> > > > > > > > > > > > > > > > > > > > id accidentally.
>> > > > > > > > > > > > > > > > > > > > If it returned -1, how could you
>> know
>> > if
>> > > it
>> > > > > is
>> > > > > > > > > expected
>> > > > > > > > > > > or
>> > > > > > > > > > > > > not
>> > > > > > > > > > > > > > > > > > expected?
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Thanks.
>> > > > > > > > > > > > > > > > > > > > Luke
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke
>> > > Chen <
>> > > > > > > > > > > showuon@gmail.com
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Hi Artem,
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Thanks for the update. I have some
>> > > > > questions
>> > > > > > > > about
>> > > > > > > > > > it:
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > 1. Could you explain why you need
>> the
>> > > > > > > > `partitioner`
>> > > > > > > > > > > return
>> > > > > > > > > > > > > > -1?
>> > > > > > > > > > > > > > > In
>> > > > > > > > > > > > > > > > > > which
>> > > > > > > > > > > > > > > > > > > > > case we need it? And how it is
>> used
>> > in
>> > > > your
>> > > > > > > KIP?
>> > > > > > > > > > > > > > > > > > > > > 2. What does the
>> > > > > > > "partitioner.sticky.batch.size"
>> > > > > > > > > > mean?
>> > > > > > > > > > > In
>> > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > > > > "Configuration" part, you didn't
>> > > explain
>> > > > > it.
>> > > > > > > And
>> > > > > > > > > > > default to
>> > > > > > > > > > > > > > 0,
>> > > > > > > > > > > > > > > I
>> > > > > > > > > > > > > > > > > > guess
>> > > > > > > > > > > > > > > > > > > > it's
>> > > > > > > > > > > > > > > > > > > > > the same as current behavior for
>> > > backward
>> > > > > > > > > > > compatibility,
>> > > > > > > > > > > > > > right?
>> > > > > > > > > > > > > > > > You
>> > > > > > > > > > > > > > > > > > > > should
>> > > > > > > > > > > > > > > > > > > > > mention it.
>> > > > > > > > > > > > > > > > > > > > > 3. I'm thinking we can have a
>> > threshold
>> > > > to
>> > > > > > the
>> > > > > > > > > > > > > > > > > > > > > "partitioner.sticky.batch.size".
>> > Let's
>> > > > say,
>> > > > > > we
>> > > > > > > > > > already
>> > > > > > > > > > > > > > > accumulate
>> > > > > > > > > > > > > > > > > > > 15.5KB
>> > > > > > > > > > > > > > > > > > > > in
>> > > > > > > > > > > > > > > > > > > > > partition1, and sent. So when next
>> > > batch
>> > > > > > > created,
>> > > > > > > > > in
>> > > > > > > > > > > your
>> > > > > > > > > > > > > > > current
>> > > > > > > > > > > > > > > > > > > design,
>> > > > > > > > > > > > > > > > > > > > > we still stick to partition1,
>> until
>> > > 16KB
>> > > > > > > reached,
>> > > > > > > > > and
>> > > > > > > > > > > then
>> > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > > > create
>> > > > > > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > > > > > new
>> > > > > > > > > > > > > > > > > > > > > batch to change to next partition,
>> > ex:
>> > > > > > > > partition2.
>> > > > > > > > > > But
>> > > > > > > > > > > I
>> > > > > > > > > > > > > > think
>> > > > > > > > > > > > > > > if
>> > > > > > > > > > > > > > > > > we
>> > > > > > > > > > > > > > > > > > > set
>> > > > > > > > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > > > > > > threshold to 95% (for example), we
>> > can
>> > > > know
>> > > > > > > > > previous
>> > > > > > > > > > > 15.5KB
>> > > > > > > > > > > > > > > > already
>> > > > > > > > > > > > > > > > > > > > exceeds
>> > > > > > > > > > > > > > > > > > > > > the threshold so that we can
>> directly
>> > > > > create
>> > > > > > > new
>> > > > > > > > > > batch
>> > > > > > > > > > > for
>> > > > > > > > > > > > > > next
>> > > > > > > > > > > > > > > > > > > records.
>> > > > > > > > > > > > > > > > > > > > > This way should be able to make it
>> > more
>> > > > > > > > efficient.
>> > > > > > > > > > > WDYT?
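
(A tiny sketch of that threshold idea; the 95% figure is just the example from
this message and the helper is hypothetical:)

    // Switch to the next partition once the bytes produced to the current one
    // cross a fraction of partitioner.sticky.batch.size, instead of waiting for
    // the full value.
    static boolean shouldSwitchPartition(long producedBytes, int stickyBatchSize) {
        final double threshold = 0.95;  // example value from the discussion
        return producedBytes >= threshold * stickyBatchSize;
    }
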
>> > > > > > > > > > > > > > > > > > > > > 4. I think the improved queuing
>> logic
>> > > > > should
>> > > > > > be
>> > > > > > > > > good
>> > > > > > > > > > > > > enough.
>> > > > > > > > > > > > > > I
>> > > > > > > > > > > > > > > > > can't
>> > > > > > > > > > > > > > > > > > > get
>> > > > > > > > > > > > > > > > > > > > > the benefit of having `
>> > > > > > > > > > > partition.availability.timeout.ms`
>> > > > > > > > > > > > > > > > config.
>> > > > > > > > > > > > > > > > > In
>> > > > > > > > > > > > > > > > > > > > > short, you want to make the
>> > partitioner
>> > > > > take
>> > > > > > > the
>> > > > > > > > > > broker
>> > > > > > > > > > > > > load
>> > > > > > > > > > > > > > > into
>> > > > > > > > > > > > > > > > > > > > > consideration. We can just improve
>> > that
>> > > > in
>> > > > > > the
>> > > > > > > > > > queuing
>> > > > > > > > > > > > > logic
>> > > > > > > > > > > > > > > (and
>> > > > > > > > > > > > > > > > > you
>> > > > > > > > > > > > > > > > > > > > > already did it). Why should we add
>> > the
>> > > > > > config?
>> > > > > > > > > Could
>> > > > > > > > > > > you
>> > > > > > > > > > > > > use
>> > > > > > > > > > > > > > > some
>> > > > > > > > > > > > > > > > > > > > examples
>> > > > > > > > > > > > > > > > > > > > > to explain why we need it.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Thank you.
>> > > > > > > > > > > > > > > > > > > > > Luke
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM
>> Artem
>> > > > > > Livshits
>> > > > > > > > > > > > > > > > > > > > > <al...@confluent.io.invalid>
>> > > wrote:
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > >> Hello,
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > > >> Please add your comments about
>> the
>> > > KIP.
>> > > > > If
>> > > > > > > > there
>> > > > > > > > > > are
>> > > > > > > > > > > no
>> > > > > > > > > > > > > > > > > > > considerations,
>> > > > > > > > > > > > > > > > > > > > >> I'll put it up for vote in the
>> next
>> > > few
>> > > > > > days.
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > > >> -Artem
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM
>> Artem
>> > > > > > Livshits
>> > > > > > > <
>> > > > > > > > > > > > > > > > > > alivshits@confluent.io
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > >> wrote:
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > > >> > Hello,
>> > > > > > > > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > > > > > > > >> > After trying a few prototypes,
>> > I've
>> > > > made
>> > > > > > > some
>> > > > > > > > > > > changes to
>> > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > > > public
>> > > > > > > > > > > > > > > > > > > > >> > interface.  Please see the
>> updated
>> > > > > > document
>> > > > > > > > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> > > > > > > > > > > > > > > > > > > > >> > .
>> > > > > > > > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > > > > > > > >> > -Artem
>> > > > > > > > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM
>> > > Artem
>> > > > > > > > Livshits <
>> > > > > > > > > > > > > > > > > > > > alivshits@confluent.io>
>> > > > > > > > > > > > > > > > > > > > >> > wrote:
>> > > > > > > > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > > > > > > > >> >> Hello,
>> > > > > > > > > > > > > > > > > > > > >> >>
>> > > > > > > > > > > > > > > > > > > > >> >> This is the discussion thread
>> for
>> > > > > > > > > > > > > > > > > > > > >> >>
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> > > > > > > > > > > > > > > > > > > > >> >> .
>> > > > > > > > > > > > > > > > > > > > >> >>
>> > > > > > > > > > > > > > > > > > > > >> >> The proposal is a bug fix for
>> > > > > > > > > > > > > > > > > > > > >> >>
>> > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-10888,
>> > > > > > > > > > > but
>> > > > > > > > > > > > > > it
>> > > > > > > > > > > > > > > > does
>> > > > > > > > > > > > > > > > > > > > >> include a
>> > > > > > > > > > > > > > > > > > > > >> >> client config change,
>> therefore
>> > we
>> > > > > have a
>> > > > > > > KIP
>> > > > > > > > > to
>> > > > > > > > > > > > > discuss.
>> > > > > > > > > > > > > > > > > > > > >> >>
>> > > > > > > > > > > > > > > > > > > > >> >> -Artem
>> > > > > > > > > > > > > > > > > > > > >> >>
>> > > > > > > > > > > > > > > > > > > > >> >
>> > > > > > > > > > > > > > > > > > > > >>
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

33. Sounds good.  Updated the KIP.

-Artem

On Mon, Mar 14, 2022 at 5:45 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> 33. We introduced onNewBatch() primarily for the sticky partitioner. It
> seems to be a very subtle API to explain and to use properly. If we can't
> find any convincing usage, it's probably better to deprecate it so that we
> could keep the API clean.
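
(For reference, the deprecation itself would be a small change to the
Partitioner interface, roughly along these lines; this is a sketch only, and
the final javadoc wording would come from the KIP:)

    import java.io.Closeable;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.Configurable;

    public interface Partitioner extends Configurable, Closeable {

        int partition(String topic, Object key, byte[] keyBytes,
                      Object value, byte[] valueBytes, Cluster cluster);

        void close();

        /**
         * Notifies the partitioner that a new batch is about to be created.
         * @deprecated the built-in partitioning logic would no longer rely on
         * this hook, so there is no known convincing usage left.
         */
        @Deprecated
        default void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        }
    }
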
>
> Thanks,
>
> Jun
>
> On Mon, Mar 14, 2022 at 1:36 PM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hi Jun,
> >
> > 33.  That's an interesting point.  Technically, onNewBatch is just a way
> > to pass some signal to the partitioner; the sticky partitioner uses this
> > signal in a way that is suboptimal, but in theory someone could use it for
> > something else.
> >
> > -Artem
> >
> > On Mon, Mar 14, 2022 at 9:11 AM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply. A couple of more comments.
> > >
> > > 32. We could defer the metrics until we know what to add.
> > >
> > > 33. Since we are deprecating DefaultPartitioner and
> > > UniformStickyPartitioner, should we depreciate Partitioner.onNewBatch()
> > too
> > > given its unexpected side effect?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 32. Good point.  Do you think it's ok to defer the metrics until we
> run
> > > > some benchmarks so that we get a better idea of what metrics we need?
> > > >
> > > > -Artem
> > > >
> > > > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Artem.
> > > > >
> > > > > Thanks for the reply. One more comment.
> > > > >
> > > > > 32. Do we need to add any new metric on the producer? For example,
> if
> > > > > partitioner.availability.timeout.ms is > 0, it might be useful to
> > know
> > > > the
> > > > > number of unavailable partitions.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > 30.  Clarified.
> > > > > >
> > > > > > 31. I plan to do some benchmarking once the implementation is
> > > > > > finished; I'll update the KIP with the results once I have them.  The
> > > > > > reason to make it the default is that it won't be used otherwise and
> > > > > > we won't know if it's good or not in practical workloads.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao
> <jun@confluent.io.invalid
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Artem,
> > > > > > >
> > > > > > > Thanks for the updated KIP. A couple of more comments.
> > > > > > >
> > > > > > > 30. For the 3 new configs, it would be useful to make it clear
> > that
> > > > > they
> > > > > > > are only relevant when the partitioner class is null.
> > > > > > >
> > > > > > > 31. partitioner.adaptive.partitioning.enable: I am wondering whether
> > > > > > > it should default to true. This is a more complex behavior than
> > > > > > > "uniform sticky" and may take some time to get right. If we do want
> > > > > > > to enable it by default, it would be useful to validate it with some
> > > > > > > test results.
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > > > Thank you for the feedback; I've discussed this offline with some
> > > > > > > > of the folks and updated the KIP.  The main change is that instead
> > > > > > > > of using DefaultPartitioner and UniformStickyPartitioner as flags,
> > > > > > > > in the new proposal the default partitioner is null, so if no
> > > > > > > > custom partitioner is specified then the partitioning logic is
> > > > > > > > implemented in KafkaProducer.  The Compatibility section is updated
> > > > > > > > as well.  Also, the configuration options are renamed to be more
> > > > > > > > consistent.
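
(To illustrate what that looks like from the user's side, here is a sketch;
the option names follow this thread and could still change before the KIP is
finalized, and the bootstrap address is just a placeholder:)

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class BuiltInPartitionerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      StringSerializer.class.getName());
            // partitioner.class is left unset (null), so the partitioning logic
            // built into KafkaProducer is used
            props.put("partitioner.adaptive.partitioning.enable", "true");
            props.put("partition.availability.timeout.ms", "0");
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // produce records as usual; no partitioner plugin is involved
            }
        }
    }
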
> > > > > > > >
> > > > > > > > -Artem
> > > > > > > >
> > > > > > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <showuon@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Artem,
> > > > > > > > >
> > > > > > > > > Thanks for your explanation and update to the KIP.
> > > > > > > > > Some comments:
> > > > > > > > >
> > > > > > > > > 5. In the description for `enable.adaptive.partitioning`,
> the
> > > > > `false`
> > > > > > > > case,
> > > > > > > > > you said:
> > > > > > > > > > the producer will try to distribute messages uniformly.
> > > > > > > > > I think we should describe the possible skew in the
> > > > > > > > > distribution.  Otherwise, users might be confused about why
> > > > > > > > > adaptive partitioning is important.
> > > > > > > > >
> > > > > > > > > 6. In the description for `partition.availability.timeout.ms`, I
> > > > > > > > > think we should mention in the last sentence that if
> > > > > > > > > `enable.adaptive.partitioning` is disabled, this logic is also
> > > > > > > > > disabled.
> > > > > > > > >
> > > > > > > > > 7. Similar to Ismael's thoughts, I think we should have a POC and
> > > > > > > > > tests to prove that this adaptive partitioning algorithm achieves
> > > > > > > > > better uniform partitioning compared with the original sticky one.
> > > > > > > > >
> > > > > > > > > Thank you.
> > > > > > > > > Luke
> > > > > > > > >
> > > > > > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <
> > ismael@juma.me.uk>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Regarding `3`, we should only deprecate it if we're sure
> > the
> > > > new
> > > > > > > > approach
> > > > > > > > > > handles all cases better. Are we confident about that for
> > > both
> > > > of
> > > > > > the
> > > > > > > > > > previous partitioners?
> > > > > > > > > >
> > > > > > > > > > Ismael
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> > > > > > > <djacot@confluent.io.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Artem,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP! I have a few comments:
> > > > > > > > > > >
> > > > > > > > > > > 1. In the preamble of the proposed change section,
> there
> > is
> > > > > > still a
> > > > > > > > > > > mention of the
> > > > > > > > > > > -1 approach. My understanding is that we have moved
> away
> > > from
> > > > > it
> > > > > > > now.
> > > > > > > > > > >
> > > > > > > > > > > 2. I am a bit concerned by the trick suggested about
> the
> > > > > > > > > > > DefaultPartitioner and
> > > > > > > > > > > the UniformStickyPartitioner. I do agree that
> > implementing
> > > > the
> > > > > > > logic
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > producer itself is a good thing. However, it is weird from a
> > > > > > > > > > > user perspective that they can set a partitioner class that
> > > > > > > > > > > is not used in the end. I
> > > > > > > > > > > think that
> > > > > > > > > > > this will be confusing for our users. Have we
> considered
> > > > > changing
> > > > > > > the
> > > > > > > > > > > default
> > > > > > > > > > > value of partitioner.class to null to indicate that the
> > new
> > > > > > > built-in
> > > > > > > > > > > partitioner
> > > > > > > > > > > must be used? By default, the built-in partitioner
> would
> > be
> > > > > used
> > > > > > > > unless
> > > > > > > > > > the
> > > > > > > > > > > user explicitly specify one. The downside is that the
> new
> > > > > default
> > > > > > > > > > behavior
> > > > > > > > > > > would not work if the user explicitly specify the
> > > partitioner
> > > > > but
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > mitigate this with my next point.
> > > > > > > > > > >
> > > > > > > > > > > 3. Related to the previous point, I think that we could
> > > > > deprecate
> > > > > > > > both
> > > > > > > > > > the
> > > > > > > > > > > DefaultPartitioner and the UniformStickyPartitioner. I
> > > would
> > > > > also
> > > > > > > > add a
> > > > > > > > > > > warning if one of them is explicitly provided by the
> user
> > > to
> > > > > > inform
> > > > > > > > > them
> > > > > > > > > > > that they should switch to the new built-in one. I am
> > > pretty
> > > > > sure
> > > > > > > > that
> > > > > > > > > > most
> > > > > > > > > > > of the folks use the default configuration anyway.
> > > > > > > > > > >
> > > > > > > > > > > 4. It would be great if we could explain why the -1 way was
> > > > > > > > > > > rejected.  At the moment, the rejected alternative only
> > > > > > > > > > > explains the idea but does not say why we rejected it.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > 2. Removed the option from the KIP.  Now the sticky
> > > > > > partitioning
> > > > > > > > > > > threshold
> > > > > > > > > > > > is hardcoded to batch.size.
> > > > > > > > > > > >
> > > > > > > > > > > > 20. Added the corresponding wording to the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > -Artem
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Sounds good.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. If we don't expect users to change it, we
> probably
> > > > could
> > > > > > > just
> > > > > > > > > > leave
> > > > > > > > > > > out
> > > > > > > > > > > > > the new config. In general, it's easy to add a new
> > > > config,
> > > > > > but
> > > > > > > > hard
> > > > > > > > > > to
> > > > > > > > > > > > > remove an existing config.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 20. The two new configs
> enable.adaptive.partitioning
> > > and
> > > > > > > > > > > > > partition.availability.timeout.ms only apply to
> the
> > > two
> > > > > > > built-in
> > > > > > > > > > > > > partitioners DefaultPartitioner and
> > > > > UniformStickyPartitioner,
> > > > > > > > > right?
> > > > > > > > > > It
> > > > > > > > > > > > > would be useful to document that in the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jun
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you for the suggestions.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. As we discussed offline, we can hardcode the
> > logic
> > > > for
> > > > > > > > > > > > > > DefaultPartitioner and UniformStickyPartitioner
> in
> > > the
> > > > > > > > > > KafkaProducer
> > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > > the DefaultPartitioner.partition won't get
> called,
> > > > > instead
> > > > > > > > > > > KafkaProducer
> > > > > > > > > > > > > > would check if the partitioner is an instance of
> > > > > > > > > DefaultPartitioner
> > > > > > > > > > > and
> > > > > > > > > > > > > > then run the actual partitioning logic itself).
> > Then
> > > > the
> > > > > > > > change
> > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > > > > Partitioner wouldn't be required.  I'll update
> the
> > > KIP
> > > > to
> > > > > > > > reflect
> > > > > > > > > > > that.
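
(A rough sketch of how that check could look inside KafkaProducer; this is a
fragment with made-up helper names such as builtInPartition, not the actual
producer source:)

    // Bypass the partitioner plugin for the two built-in classes and run the
    // producer's own partitioning logic instead.
    private int partition(ProducerRecord<K, V> record, byte[] keyBytes,
                          byte[] valueBytes, Cluster cluster) {
        if (record.partition() != null)
            return record.partition();

        if (partitioner instanceof DefaultPartitioner
                || partitioner instanceof UniformStickyPartitioner) {
            // built-in uniform sticky logic implemented by the producer itself;
            // DefaultPartitioner.partition() is never called in this case
            return builtInPartition(record.topic(), keyBytes, cluster);
        }

        // a custom user-supplied partitioner keeps working exactly as before
        return partitioner.partition(record.topic(), record.key(), keyBytes,
                                     record.value(), valueBytes, cluster);
    }
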
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. I don't expect users to change this too often,
> > as
> > > > > > changing
> > > > > > > > it
> > > > > > > > > > > would
> > > > > > > > > > > > > > require a bit of studying of the production
> > patterns.
> > > > > As a
> > > > > > > > > general
> > > > > > > > > > > > > > principle, if I can think of a model that
> requires
> > a
> > > > > > > deviation
> > > > > > > > > from
> > > > > > > > > > > > > > default, I tend to add a configuration option.
> It
> > > > could
> > > > > be
> > > > > > > > that
> > > > > > > > > > > it'll
> > > > > > > > > > > > > > never get used in practice, but I cannot prove
> > that.
> > > > I'm
> > > > > > ok
> > > > > > > > with
> > > > > > > > > > > > > removing
> > > > > > > > > > > > > > the option, let me know what you think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Since we control the implementation and the
> > > usage
> > > > of
> > > > > > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > > > > another way is to instantiate the
> > > DefaultPartitioner
> > > > > > with a
> > > > > > > > > > special
> > > > > > > > > > > > > > > constructor, which allows it to have more
> access
> > to
> > > > > > > internal
> > > > > > > > > > > > > information.
> > > > > > > > > > > > > > > Then we could just change the behavior of
> > > > > > > DefaultPartitioner
> > > > > > > > > > such
> > > > > > > > > > > that
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > can use the internal information when choosing the
> > > > > > > > > > > > > > > partition.  This seems more intuitive than having
> > > > > > > > > > > > > > > DefaultPartitioner return partition -1.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. I guess partitioner.sticky.batch.size is
> > > > introduced
> > > > > > > > because
> > > > > > > > > > the
> > > > > > > > > > > > > > > effective batch size could be less than
> > batch.size
> > > > and
> > > > > we
> > > > > > > > want
> > > > > > > > > to
> > > > > > > > > > > align
> > > > > > > > > > > > > > > partition switching with the effective batch
> > size.
> > > > How
> > > > > > > would
> > > > > > > > a
> > > > > > > > > > user
> > > > > > > > > > > > > know
> > > > > > > > > > > > > > > the effective batch size to set
> > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > properly?
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > the user somehow knows the effective batch
> size,
> > > does
> > > > > > > setting
> > > > > > > > > > > > > batch.size
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the effective batch size achieve the same
> result?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. Thanks for the explanation. Makes sense to
> me.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Updated the KIP to add a couple paragraphs
> > > about
> > > > > > > > > > > implementation
> > > > > > > > > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. Sorry if my reply was confusing, what I
> > meant
> > > to
> > > > > say
> > > > > > > > (and
> > > > > > > > > I
> > > > > > > > > > > > > > elaborated
> > > > > > > > > > > > > > > > on that in point #3) is that there could be
> > > > patterns
> > > > > > for
> > > > > > > > > which
> > > > > > > > > > > 16KB
> > > > > > > > > > > > > > > > wouldn't be the most effective setting, thus
> it
> > > > would
> > > > > > be
> > > > > > > > good
> > > > > > > > > > to
> > > > > > > > > > > make
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. We could use a broker readiness timeout.  But
> > > > > > > > > > > > > > > > I'm not sure it would correctly model the broker
> > > > > > > > > > > > > > > > load.  The problem is that latency is not an
> > > > > > > > > > > > > > > > accurate measure of throughput: we could have 2
> > > > > > > > > > > > > > > > brokers that have equal throughput but one has
> > > > > > > > > > > > > > > > higher latency (so it takes larger batches less
> > > > > > > > > > > > > > > > frequently, but still takes the same load).
> > > > > > > Latency-based
> > > > > > > > > > logic
> > > > > > > > > > > is
> > > > > > > > > > > > > > > likely
> > > > > > > > > > > > > > > > to send less data to the broker with higher
> > > > latency.
> > > > > > > Using
> > > > > > > > > the
> > > > > > > > > > > queue
> > > > > > > > > > > > > > > size
> > > > > > > > > > > > > > > > would adapt to throughput, regardless of
> > latency
> > > > > (which
> > > > > > > > could
> > > > > > > > > > be
> > > > > > > > > > > > > just a
> > > > > > > > > > > > > > > > result of deployment topology), so that's the
> > > model
> > > > > > > chosen
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > > > > > proposal.  The
> > partition.availability.timeout.ms
> > > > > logic
> > > > > > > > > > > approaches
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > > from a slightly different angle, say we have
> a
> > > > > > > requirement
> > > > > > > > to
> > > > > > > > > > > deliver
> > > > > > > > > > > > > > > > messages via brokers that have a certain
> > latency,
> > > > > then
> > > > > > > > > > > > > > > > partition.availability.timeout.ms could be
> > used
> > > to
> > > > > > tune
> > > > > > > > > that.
> > > > > > > > > > > > > Latency
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > much more volatile metric than throughput
> > > (latency
> > > > > > > depends
> > > > > > > > on
> > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > load, on capacity, on deployment topology, on
> > > > jitter
> > > > > in
> > > > > > > > > > network,
> > > > > > > > > > > on
> > > > > > > > > > > > > > > jitter
> > > > > > > > > > > > > > > > in disk, etc.) and I think it would be best
> to
> > > > leave
> > > > > > > > > > > latency-based
> > > > > > > > > > > > > > > > thresholds configurable to tune for the
> > > > environment.
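
(As a small illustration of how partition.availability.timeout.ms could act as
that latency-based knob; the helper and its inputs are hypothetical, and the
semantics follow this thread:)

    // A broker that has not been able to accept a ready batch for longer than
    // the timeout is temporarily skipped when choosing partitions; a value of 0
    // disables the check.
    static boolean brokerUsable(long lastReadyAcceptMs, long nowMs,
                                long availabilityTimeoutMs) {
        if (availabilityTimeoutMs <= 0)
            return true;
        return nowMs - lastReadyAcceptMs <= availabilityTimeoutMs;
    }
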
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > > > > > > > > <ju...@confluent.io.invalid>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. Perhaps you could elaborate a bit more
> on
> > > how
> > > > > the
> > > > > > > > > producer
> > > > > > > > > > > > > > > determines
> > > > > > > > > > > > > > > > > the partition if the partitioner returns
> -1.
> > > This
> > > > > > will
> > > > > > > > help
> > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > why
> > > > > > > > > > > > > > > > > encapsulating that logic as a partitioner
> is
> > > not
> > > > > > clean.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. I am not sure that I understand this
> part.
> > > If
> > > > > > 15.5KB
> > > > > > > > is
> > > > > > > > > > more
> > > > > > > > > > > > > > > > efficient,
> > > > > > > > > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 4. Yes, we could add a switch (or a variant
> > of
> > > > the
> > > > > > > > > > > partitioner) for
> > > > > > > > > > > > > > > > > enabling this behavior. Also, choosing
> > > partitions
> > > > > > based
> > > > > > > > on
> > > > > > > > > > > broker
> > > > > > > > > > > > > > > > readiness
> > > > > > > > > > > > > > > > > can be made in a smoother way. For example,
> > we
> > > > > could
> > > > > > > > track
> > > > > > > > > > the
> > > > > > > > > > > last
> > > > > > > > > > > > > > > time
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > broker has drained any batches from the
> > > > > accumulator.
> > > > > > We
> > > > > > > > can
> > > > > > > > > > > then
> > > > > > > > > > > > > > select
> > > > > > > > > > > > > > > > > partitions from brokers proportionally to
> the
> > > > > inverse
> > > > > > > of
> > > > > > > > > that
> > > > > > > > > > > time.
> > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > > > > > > > > > partition.availability.timeout.ms
> > > > > > > > > > > > > > > > >  threshold.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem
> > Livshits
> > > > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you for your feedback.  I've added
> > the
> > > > > > Rejected
> > > > > > > > > > > Alternative
> > > > > > > > > > > > > > > > section
> > > > > > > > > > > > > > > > > > that may clarify some of the questions
> > w.r.t.
> > > > > > > returning
> > > > > > > > > -1.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.
> > The
> > > > > > problem
> > > > > > > > is
> > > > > > > > > > > that a
> > > > > > > > > > > > > > > > > significant
> > > > > > > > > > > > > > > > > > part of the logic needs to be in the
> > producer
> > > > > > > (because
> > > > > > > > it
> > > > > > > > > > now
> > > > > > > > > > > > > uses
> > > > > > > > > > > > > > > > > > information about brokers that only the
> > > > producer
> > > > > > > > knows),
> > > > > > > > > so
> > > > > > > > > > > > > > > > encapsulation
> > > > > > > > > > > > > > > > > > of the logic within the default
> partitioner
> > > > isn't
> > > > > > as
> > > > > > > > > clean.
> > > > > > > > > > > > >  I've
> > > > > > > > > > > > > > > > added
> > > > > > > > > > > > > > > > > > the Rejected Alternative section that
> > > documents
> > > > > an
> > > > > > > > > attempt
> > > > > > > > > > to
> > > > > > > > > > > > > keep
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > encapsulation by providing new callbacks
> to
> > > the
> > > > > > > > > > partitioner.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2. The meaning of the
> > > > > partitioner.sticky.batch.size
> > > > > > > is
> > > > > > > > > > > explained
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > Uniform Sticky Batch Size section.
> > > Basically,
> > > > we
> > > > > > > track
> > > > > > > > > the
> > > > > > > > > > > > > amount
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > bytes
> > > > > > > > > > > > > > > > > > produced to the partition and if it
> exceeds
> > > > > > > > > > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > > > > > > > > then we switch to the next partition.  As
> > far
> > > > as
> > > > > > the
> > > > > > > > > reason
> > > > > > > > > > > to
> > > > > > > > > > > > > make
> > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > different from batch.size, I think Luke
> > > > answered
> > > > > > this
> > > > > > > > > with
> > > > > > > > > > > the
> > > > > > > > > > > > > > > question
> > > > > > > > > > > > > > > > > #3
> > > > > > > > > > > > > > > > > > -- what if the load pattern is such that
> > > 15.5KB
> > > > > > would
> > > > > > > > be
> > > > > > > > > > more
> > > > > > > > > > > > > > > efficient
> > > > > > > > > > > > > > > > > > than 16KB?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 3. I think it's hard to have one size
> that
> > > > would
> > > > > > fit
> > > > > > > > all
> > > > > > > > > > > > > patterns.
> > > > > > > > > > > > > > > > E.g.
> > > > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > > the load pattern is such that there is
> > linger
> > > > and
> > > > > > the
> > > > > > > > app
> > > > > > > > > > > fills
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > batch
> > > > > > > > > > > > > > > > > > before linger expires, then having 16KB
> > would
> > > > > most
> > > > > > > > likely
> > > > > > > > > > > > > > synchronize
> > > > > > > > > > > > > > > > > > batching and partition switching, so each
> > > > > partition
> > > > > > > > would
> > > > > > > > > > > get a
> > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > batch.  If load pattern is such that
> there
> > > are
> > > > a
> > > > > > few
> > > > > > > > > > > non-complete
> > > > > > > > > > > > > > > > batches
> > > > > > > > > > > > > > > > > > go out before a larger batch starts to
> > fill,
> > > > then
> > > > > > it
> > > > > > > > may
> > > > > > > > > > > actually
> be beneficial to make slightly larger (e.g. linger=0, first few records go
> in the first batch, then next few records go to second batch, and so on,
> until 5 in-flight, then larger batch would form while waiting for broker to
> respond, but the partition switch would happen before the larger batch is
> full).
>
> 4. There are a couple of reasons for introducing
> partition.availability.timeout.ms.  Luke's an Jun's questions are slightly
> different, so I'm going to separate replies.
> (Luke) Is the queue size a good enough signal?  I think it's a good default
> signal as it tries to preserve general fairness and not overreact on the
> broker's state at each moment in time.  But because it's smooth, it may not
> be reactive enough to instantaneous latency jumps.  For latency-sensitive
> workloads, it may be desirable to react faster when a broker becomes
> unresponsive (but that may make the distribution really choppy), so
> partition.availability.timeout.ms provides an opportunity to tune
> adaptiveness.
>
> (Jun) Can we just not assign partitions to brokers that are not ready?
> Switching partitions purely based on current broker readiness information
> can really skew workload I think (or at least I couldn't build a model that
> proves that over time it's going to be generally fair), I feel that the
> algorithm should try to be fair in general and use smoother signals by
> default (e.g. a broker with choppier latency may get much less load even
> though it can handle throughput, it then may potentially skew consumption),
> note that the queue-size-based logic uses probabilities (so we don't fully
> remove brokers, just make it less likely) and relative info rather than a
> threshold (so if all brokers are heavily, but equally loaded, they will get
> equal distribution, rather than get removed because they exceed some
> threshold).  So at the very least, I would like this logic to be turned off
> by default as it's hard to predict what it could do with different patterns
> (which means that there would need to be some configuration).  We could
> just not use brokers that are not ready, but again, I think that it's good
> to try to be fair under normal circumstances, so if normally brokers can
> respond under some partition.availability.timeout.ms threshold and the
> application works well with those latencies, then we could distribute data
> equally between brokers that don't exceed the latencies.  The value, of
> course, would depend on the environment and app requirements, hence it's
> configurable.
>
> 10. Added a statement at the beginning of the proposed changes.
>
> -Artem
>
> On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <jun@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the KIP. A few comments below.
> >
> > 1. I agree with Luke that having the partitioner returning -1 is kind of
> > weird. Could we just change the implementation of DefaultPartitioner to
> > the new behavior?
> >
> > 2. partitioner.sticky.batch.size: Similar question to Luke. I am not sure
> > why we want to introduce this new configuration. Could we just use the
> > existing batch.size?
> >
> > 4. I also agree with Luke that it's not clear why we need
> > partition.availability.timeout.ms. The KIP says the broker "would not be
> > chosen until the broker is able to accept the next ready batch from the
> > partition". If we are keeping track of this, could we just avoid
> > assigning records to partitions whose leader is not able to accept the
> > next batch? If we do that, perhaps we don't need
> > partition.availability.timeout.ms.
> >
> > 10. Currently, partitioner.class defaults to DefaultPartitioner, which
> > uses StickyPartitioner when the key is specified. Since this KIP improves
> > upon StickyPartitioner, it would be useful to make the new behavior the
> > default and document that in the KIP.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <showuon@gmail.com> wrote:
> >
> > > Hi Artem,
> > >
> > > Also, one more thing I think you need to know.
> > > As this bug KAFKA-7572 <https://issues.apache.org/jira/browse/KAFKA-7572>
> > > mentioned, sometimes the custom partitioner would return negative
> > > partition id accidentally.
> > > If it returned -1, how could you know if it is expected or not expected?
> > >
> > > Thanks.
> > > Luke
> > >
> > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <showuon@gmail.com> wrote:
> > >
> > > > Hi Artem,
> > > >
> > > > Thanks for the update. I have some questions about it:
> > > >
> > > > 1. Could you explain why you need the `partitioner` return -1? In
> > > > which case we need it? And how it is used in your KIP?
> > > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > > "Configuration" part, you didn't explain it. And default to 0, I guess
> > > > it's the same as current behavior for backward compatibility, right?
> > > > You should mention it.
> > > > 3. I'm thinking we can have a threshold to the
> > > > "partitioner.sticky.batch.size". Let's say, we already accumulate
> > > > 15.5KB in partition1, and sent. So when next batch created, in your
> > > > current design, we still stick to partition1, until 16KB reached, and
> > > > then we create a new batch to change to next partition, ex: partition2.
> > > > But I think if we set a threshold to 95% (for example), we can know
> > > > previous 15.5KB already exceeds the threshold so that we can directly
> > > > create new batch for next records. This way should be able to make it
> > > > more efficient. WDYT?
> > > > 4. I think the improved queuing logic should be good enough. I can't
> > > > get the benefit of having `partition.availability.timeout.ms` config.
> > > > In short, you want to make the partitioner take the broker load into
> > > > consideration. We can just improve that in the queuing logic (and you
> > > > already did it). Why should we add the config? Could you use some
> > > > examples to explain why we need it.
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Please add your comments about the KIP.  If there are no
> > > > > considerations, I'll put it up for vote in the next few days.
> > > > >
> > > > > -Artem
> > > > >
> > > > > On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits
> > > > > <alivshits@confluent.io> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > After trying a few prototypes, I've made some changes to the
> > > > > > public interface.  Please see the updated document
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > .
> > > > > >
> > > > > > -Artem

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

33. We introduced onNewBatch() primarily for the sticky partitioner. It
seems to be a very subtle API to explain and to use properly. If we can't
find any convincing usage, it's probably better to deprecate it so that we
could keep the API clean.
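
For reference, the deprecation being discussed would amount to something
like the sketch below (illustrative only -- the exact annotation placement
and javadoc wording would be settled when the KIP and PR are updated):

    package org.apache.kafka.clients.producer;

    import java.io.Closeable;

    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.Configurable;

    public interface Partitioner extends Configurable, Closeable {

        // Compute the partition for the given record.
        int partition(String topic, Object key, byte[] keyBytes,
                      Object value, byte[] valueBytes, Cluster cluster);

        // Called when the partitioner is closed.
        void close();

        // The callback under discussion: it only exists so the sticky
        // partitioner can react to batch creation, so it would carry
        // @Deprecated once the built-in partitioning logic no longer needs it.
        @Deprecated
        default void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        }
    }

Existing third-party partitioners would keep compiling; they would simply
see a deprecation warning if they override the callback.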

Thanks,

Jun

On Mon, Mar 14, 2022 at 1:36 PM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hi Jun,
>
> 33.  That's an interesting point.  Technically, onNewBatch is just a way to
> pass some signal to the partitioner; the sticky partitioner uses this
> signal in a way that is suboptimal, but in theory someone could use it for
> something else.
>
> -Artem
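
To make that subtlety concrete, the sticky behavior driven by this signal
works roughly like the simplified sketch below (an illustration only, not
the actual DefaultPartitioner/StickyPartitionCache code; the keyed path and
error handling are omitted):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.kafka.common.Cluster;

    public class StickyPartitionSketch {
        private final Map<String, Integer> stickyPartition = new ConcurrentHashMap<>();

        // partition() keeps returning the same partition for a topic...
        public int partition(String topic, Cluster cluster) {
            return stickyPartition.computeIfAbsent(topic, t -> choose(t, cluster, -1));
        }

        // ...until the producer signals that a new batch was created, which is the
        // only place where the sticky partition changes.  This coupling between the
        // partitioner and batch creation is what KIP-794 moves away from.
        public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
            stickyPartition.put(topic, choose(topic, cluster, prevPartition));
        }

        private int choose(String topic, Cluster cluster, int prevPartition) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (numPartitions <= 1)
                return 0;
            int next;
            do {
                next = ThreadLocalRandom.current().nextInt(numPartitions);
            } while (next == prevPartition);
            return next;
        }
    }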
>
> On Mon, Mar 14, 2022 at 9:11 AM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the reply. A couple of more comments.
> >
> > 32. We could defer the metrics until we know what to add.
> >
> > 33. Since we are deprecating DefaultPartitioner and
> > UniformStickyPartitioner, should we deprecate Partitioner.onNewBatch()
> too
> > given its unexpected side effect?
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hi Jun,
> > >
> > > 32. Good point.  Do you think it's ok to defer the metrics until we run
> > > some benchmarks so that we get a better idea of what metrics we need?
> > >
> > > -Artem
> > >
> > > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Artem.
> > > >
> > > > Thanks for the reply. One more comment.
> > > >
> > > > 32. Do we need to add any new metric on the producer? For example, if
> > > > partitioner.availability.timeout.ms is > 0, it might be useful to know
> > > > the number of unavailable partitions.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > 30.  Clarified.
> > > > >
> > > > > 31. I plan to do some benchmarking once implementation is finished, I'll
> > > > > update the KIP with the results once I have them.  The reason to make it
> > > > > default is that it won't be used otherwise and we won't know if it's good
> > > > > or not in practical workloads.
> > > > >
> > > > > -Artem
> > > > >
> > > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao <jun@confluent.io.invalid
> >
> > > > wrote:
> > > > >
> > > > > > Hi, Artem,
> > > > > >
> > > > > > Thanks for the updated KIP. A couple of more comments.
> > > > > >
> > > > > > 30. For the 3 new configs, it would be useful to make it clear that they
> > > > > > are only relevant when the partitioner class is null.
> > > > > >
> > > > > > 31. partitioner.adaptive.partitioning.enable : I am wondering whether it
> > > > > > should default to true. This is a more complex behavior than "uniform
> > > > > > sticky" and may take some time to get right. If we do want to enable it
> > > > > > by default, it would be useful to validate it with some test results.
> > > > > >
> > > > > > Jun
> > > > > >
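
As a concrete illustration of comment 30: the new settings only take effect
when partitioner.class is left unset.  The sketch below is illustrative, not
normative; the config names follow the updated KIP page, and the third name
(partitioner.ignore.keys) is an assumption here since it is not spelled out
in this part of the thread.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class BuiltInPartitionerConfigExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            // partitioner.class is intentionally NOT set, so the built-in logic is
            // used and the new settings below take effect (values are examples).
            props.put("partitioner.adaptive.partitioning.enable", "true");
            props.put("partitioner.availability.timeout.ms", "0");
            props.put("partitioner.ignore.keys", "false");

            // If partitioner.class pointed at a custom implementation, the three
            // settings above would be ignored.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // ... send records ...
            }
        }
    }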
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > > > > <al...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Thank you for feedback, I've discussed this offline with some of the folks
> > > > > > > and updated the KIP.  The main change is that now instead of using
> > > > > > > DefaultPartitioner and UniformStickyPartitioners as flags, in the new
> > > > > > > proposal the default partitioner is null, so if no custom partitioner is
> > > > > > > specified then the partitioning logic is implemented in KafkaProducer.
> > > > > > > Compatibility section is updated as well.  Also the configuration options
> > > > > > > are renamed to be more consistent.
> > > > > > >
> > > > > > > -Artem
> > > > > > >
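
A rough sketch of the dispatch this implies is below.  The names, and the
use of -1 as a "no partition chosen yet" marker, are illustrative rather
than the actual KafkaProducer internals:

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    final class PartitionDispatchSketch {
        // Null unless the user explicitly set partitioner.class.
        private final Partitioner customPartitioner;

        PartitionDispatchSketch(Partitioner customPartitioner) {
            this.customPartitioner = customPartitioner;
        }

        // Returns a concrete partition, or -1 to let the built-in uniform sticky
        // logic pick one later based on its byte accounting.
        <K, V> int partition(ProducerRecord<K, V> record, byte[] keyBytes,
                             byte[] valueBytes, Cluster cluster) {
            if (record.partition() != null)
                return record.partition();                 // explicit partition always wins
            if (customPartitioner != null)
                return customPartitioner.partition(record.topic(), record.key(), keyBytes,
                        record.value(), valueBytes, cluster);
            if (keyBytes != null) {
                // Keyed records keep the usual murmur2 hashing.
                int numPartitions = cluster.partitionsForTopic(record.topic()).size();
                return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            }
            return -1;                                      // unkeyed: built-in sticky choice
        }
    }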
> > > > > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Hi Artem,
> > > > > > > >
> > > > > > > > Thanks for your explanation and update to the KIP.
> > > > > > > > Some comments:
> > > > > > > >
> > > > > > > > 5. In the description for `enable.adaptive.partitioning`, the `false` case,
> > > > > > > > you said:
> > > > > > > > > the producer will try to distribute messages uniformly.
> > > > > > > > I think we should describe the possible skewing distribution. Otherwise,
> > > > > > > > user might be confused about why adaptive partitioning is important.
> > > > > > > >
> > > > > > > > 6. In the description for `partition.availability.timeout.ms`, I think we
> > > > > > > > should mention in the last sentence about if `enable.adaptive.partitioning`
> > > > > > > > is disabled this logic is also disabled.
> > > > > > > >
> > > > > > > > 7. Similar thoughts as Ismael, I think we should have a POC and test to
> > > > > > > > prove that this adaptive partitioning algorithm can have better uniform
> > > > > > > > partitioning, compared with original sticky one.
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > > Luke
> > > > > > > >
> > > > > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <
> ismael@juma.me.uk>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Regarding `3`, we should only deprecate it if we're sure the new approach
> > > > > > > > > handles all cases better. Are we confident about that for both of the
> > > > > > > > > previous partitioners?
> > > > > > > > >
> > > > > > > > > Ismael
> > > > > > > > >
> > > > > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> > > > > > <djacot@confluent.io.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Artem,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP! I have a few comments:
> > > > > > > > > >
> > > > > > > > > > 1. In the preamble of the proposed change section, there is still a
> > > > > > > > > > mention of the -1 approach. My understanding is that we have moved away
> > > > > > > > > > from it now.
> > > > > > > > > >
> > > > > > > > > > 2. I am a bit concerned by the trick suggested about the DefaultPartitioner
> > > > > > > > > > and the UniformStickyPartitioner. I do agree that implementing the logic
> > > > > > > > > > in the producer itself is a good thing. However, it is weird from a user
> > > > > > > > > > perspective that he can set a class as partitioner that is not used in the
> > > > > > > > > > end. I think that this will be confusing for our users. Have we considered
> > > > > > > > > > changing the default value of partitioner.class to null to indicate that
> > > > > > > > > > the new built-in partitioner must be used? By default, the built-in
> > > > > > > > > > partitioner would be used unless the user explicitly specify one. The
> > > > > > > > > > downside is that the new default behavior would not work if the user
> > > > > > > > > > explicitly specify the partitioner but we could mitigate this with my next
> > > > > > > > > > point.
> > > > > > > > > >
> > > > > > > > > > 3. Related to the previous point, I think that we could deprecate both the
> > > > > > > > > > DefaultPartitioner and the UniformStickyPartitioner. I would also add a
> > > > > > > > > > warning if one of them is explicitly provided by the user to inform them
> > > > > > > > > > that they should switch to the new built-in one. I am pretty sure that most
> > > > > > > > > > of the folks use the default configuration anyway.
> > > > > > > > > >
> > > > > > > > > > 4. It would be great if we could explain why the -1 way was rejected. At
> > > > > > > > > > the moment, the rejected alternative only explain the idea but does not
> > > > > > > > > > say why we rejected it.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > David
> > > > > > > > > >
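
For David's point 3, the warning could be as simple as the sketch below
(class, method, and logger names are made up for illustration; the real
check would live wherever the producer parses its configuration):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    final class PartitionerConfigWarning {
        private static final Logger log = LoggerFactory.getLogger(PartitionerConfigWarning.class);

        private PartitionerConfigWarning() {}

        // Warn when a now-redundant partitioner is configured explicitly.
        static void warnIfDeprecated(String partitionerClassName) {
            if ("org.apache.kafka.clients.producer.internals.DefaultPartitioner".equals(partitionerClassName)
                    || "org.apache.kafka.clients.producer.UniformStickyPartitioner".equals(partitionerClassName)) {
                log.warn("Partitioner {} is deprecated; remove partitioner.class from the "
                        + "configuration to use the built-in partitioning logic instead.",
                        partitionerClassName);
            }
        }
    }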
> > > > > > > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Jun,
> > > > > > > > > > >
> > > > > > > > > > > 2. Removed the option from the KIP.  Now the sticky partitioning threshold
> > > > > > > > > > > is hardcoded to batch.size.
> > > > > > > > > > >
> > > > > > > > > > > 20. Added the corresponding wording to the KIP.
> > > > > > > > > > >
> > > > > > > > > > > -Artem
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Sounds good.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. If we don't expect users to change it, we probably could just leave out
> > > > > > > > > > > > the new config. In general, it's easy to add a new config, but hard to
> > > > > > > > > > > > remove an existing config.
> > > > > > > > > > > >
> > > > > > > > > > > > 20. The two new configs enable.adaptive.partitioning and
> > > > > > > > > > > > partition.availability.timeout.ms only apply to the two built-in
> > > > > > > > > > > > partitioners DefaultPartitioner and UniformStickyPartitioner, right? It
> > > > > > > > > > > > would be useful to document that in the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for the suggestions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. As we discussed offline, we can hardcode the logic for
> > > > > > > > > > > > > DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer (i.e.
> > > > > > > > > > > > > the DefaultPartitioner.partition won't get called, instead KafkaProducer
> > > > > > > > > > > > > would check if the partitioner is an instance of DefaultPartitioner and
> > > > > > > > > > > > > then run the actual partitioning logic itself).  Then the change to the
> > > > > > > > > > > > > Partitioner wouldn't be required.  I'll update the KIP to reflect that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. I don't expect users to change this too often, as changing it would
> > > > > > > > > > > > > require a bit of studying of the production patterns.  As a general
> > > > > > > > > > > > > principle, if I can think of a model that requires a deviation from
> > > > > > > > > > > > > default, I tend to add a configuration option.  It could be that it'll
> > > > > > > > > > > > > never get used in practice, but I cannot prove that.  I'm ok with removing
> > > > > > > > > > > > > the option, let me know what you think.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Artem
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> > > > > > > <jun@confluent.io.invalid
> > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Since we control the implementation and the usage of DefaultPartitioner,
> > > > > > > > > > > > > > another way is to instantiate the DefaultPartitioner with a special
> > > > > > > > > > > > > > constructor, which allows it to have more access to internal information.
> > > > > > > > > > > > > > Then we could just change the behavior of DefaultPartitioner such that it
> > > > > > > > > > > > > > can use the internal infoamtion when choosing the partition. This seems
> > > > > > > > > > > > > > more intuitive than having DefaultPartitioner return -1 partition.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. I guess partitioner.sticky.batch.size is introduced because the
> > > > > > > > > > > > > > effective batch size could be less than batch.size and we want to align
> > > > > > > > > > > > > > partition switching with the effective batch size.  How would a user know
> > > > > > > > > > > > > > the effective batch size to set partitioner.sticky.batch.size properly?  If
> > > > > > > > > > > > > > the user somehow knows the effective batch size, does setting batch.size to
> > > > > > > > > > > > > > the effective batch size achieve the same result?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Updated the KIP to add a couple paragraphs about implementation
> > > > > > > > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. Sorry if my reply was confusing, what I meant to say (and I elaborated
> > > > > > > > > > > > > > > on that in point #3) is that there could be patterns for which 16KB
> > > > > > > > > > > > > > > wouldn't be the most effective setting, thus it would be good to make it
> > > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. We could use broker readiness timeout.  But I'm not sure it would
> > > > > > > > > > > > > > > correctly model the broker load.  The problem is that latency is not an
> > > > > > > > > > > > > > > accurate measure of throughput, we could have 2 brokers that have equal
> > > > > > > > > > > > > > > throughput but one has higher latency (so it takes larger batches less
> > > > > > > > > > > > > > > frequently, but still takes the same load).  Latency-based logic is likely
> > > > > > > > > > > > > > > to send less data to the broker with higher latency.  Using the queue size
> > > > > > > > > > > > > > > would adapt to throughput, regardless of latency (which could be just a
> > > > > > > > > > > > > > > result of deployment topology), so that's the model chosen in the
> > > > > > > > > > > > > > > proposal.  The partition.availability.timeout.ms logic approaches the
> > > > > > > > > > > > > > > model from a slightly different angle, say we have a requirement to
> > > > > > > > > > > > > > > deliver messages via brokers that have a certain latency, then
> > > > > > > > > > > > > > > partition.availability.timeout.ms could be used to tune that.  Latency is
> > > > > > > > > > > > > > > a much more volatile metric than throughput (latency depends on external
> > > > > > > > > > > > > > > load, on capacity, on deployment topology, on jitter in network, on jitter
> > > > > > > > > > > > > > > in disk, etc.) and I think it would be best to leave latency-based
> > > > > > > > > > > > > > > thresholds configurable to tune for the environment.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > > >
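
As an illustration of the queue-size-based weighting described above
(hypothetical names, not the actual accumulator code): the probability of
picking a partition is proportional to the inverse of its broker's current
queue size, so equally loaded brokers stay equally likely and no broker is
ever cut off entirely.

    import java.util.concurrent.ThreadLocalRandom;

    final class AdaptivePartitionChoiceSketch {

        // queueSizes[p] = number of batches currently queued for the broker leading
        // partition p; entries are assumed to be >= 1 to avoid division by zero.
        static int choosePartition(int[] queueSizes) {
            double[] cumulative = new double[queueSizes.length];
            double total = 0.0;
            for (int p = 0; p < queueSizes.length; p++) {
                total += 1.0 / queueSizes[p];   // smaller queue => larger weight
                cumulative[p] = total;
            }
            double r = ThreadLocalRandom.current().nextDouble(total);
            for (int p = 0; p < cumulative.length; p++) {
                if (r < cumulative[p])
                    return p;
            }
            return cumulative.length - 1;       // guard against floating-point edge cases
        }
    }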
> > > > > > > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > > > > > > > <ju...@confluent.io.invalid>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Perhaps you could elaborate a bit more on how the producer determines
> > > > > > > > > > > > > > > > the partition if the partitioner returns -1. This will help understand why
> > > > > > > > > > > > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. I am not sure that I understand this part. If 15.5KB is more efficient,
> > > > > > > > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > > > > > > > > > > > > > > > enabling this behavior. Also, choosing partitions based on broker
> > > > > > > > > > > > > > > > readiness can be made in a smoother way. For example, we could track the
> > > > > > > > > > > > > > > > last time a broker has drained any batches from the accumulator. We can
> > > > > > > > > > > > > > > > then select partitions from brokers proportionally to the inverse of that
> > > > > > > > > > > > > > > > time. This seems smoother than a cutoff based on a
> > > > > > > > > > > > > > > > partition.availability.timeout.ms threshold.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem
> Livshits
> > > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you for your feedback.  I've added the Rejected Alternative section
> > > > > > > > > > > > > > > > > that may clarify some of the questions w.r.t. returning -1.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem is that a significant
> > > > > > > > > > > > > > > > > part of the logic needs to be in the producer (because it now uses
> > > > > > > > > > > > > > > > > information about brokers that only the producer knows), so encapsulation
> > > > > > > > > > > > > > > > > of the logic within the default partitioner isn't as clean.  I've added
> > > > > > > > > > > > > > > > > the Rejected Alternative section that documents an attempt to keep the
> > > > > > > > > > > > > > > > > encapsulation by providing new callbacks to the partitioner.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. The meaning of the partitioner.sticky.batch.size is explained in the
> > > > > > > > > > > > > > > > > Uniform Sticky Batch Size section.  Basically, we track the amount of
> > > > > > > > > > > > > > > > > bytes produced to the partition and if it exceeds
> > > > > > > > > > > > > > > > > partitioner.sticky.batch.size then we switch to the next partition.  As
> > > > > > > > > > > > > > > > > far as the reason to make it different from batch.size, I think Luke
> > > > > > > > > > > > > > > > > answered this with the question #3 -- what if the load pattern is such
> > > > > > > > > > > > > > > > > that 15.5KB would be more efficient than 16KB?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 3. I think it's hard to have one size that would fit all patterns.  E.g.
> > > > > > > > > > > > > > > > > if the load pattern is such that there is linger and the app fills the
> > > > > > > > > > > > > > > > > batch before linger expires, then having 16KB would most likely
> > > > > > > > > > > > > > > > > synchronize batching and partition switching, so each partition would get
> > > > > > > > > > > > > > > > > a full batch.  If load pattern is such that there are a few non-complete
> > > > > > > > > > > > > > > > > batches go out before a larger batch starts to fill, then it may actually
> > > > > > > > > > > > > > > > > be beneficial to make slightly larger (e.g. linger=0, first few records go
> > > > > > > > > > > > > > > > > in the first batch, then next few records go to second batch, and so on,
> > > > > > > > > > > > > > > > > until 5 in-flight, then larger batch would form while waiting for broker
> > > > > > > > > > > > > > > > > to respond, but the partition switch would happen before the larger batch
> > > > > > > > > > > > > > > > > is full).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > > > > > > > > > > partition.availability.timeout.ms.  Luke's an Jun's questions are slightly
> > > > > > > > > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > > > > > > > > (Luke) Is the queue size a good enough signal?  I think it's a good default
> > > > > > > > > > > > > > > > > signal as it tries to preserve general fairness and not overreact on the
> > > > > > > > > > > > > > > > > broker's state at each moment in time.  But because it's smooth, it may not
> > > > > > > > > > > > > > > > > be reactive enough to instantaneous latency jumps.  For latency-sensitive
> > > > > > > > > > > > > > > > > workloads, it may be desirable to react faster when a broker becomes
> > > > > > > > > > > > > > > > > unresponsive (but that may make the distribution really choppy), so
> > > > > > > > > > > > > > > > > partition.availability.timeout.ms provides an opportunity to tune
> > > > > > > > > > > > > > > > > adaptiveness.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > (Jun) Can we just not assign partitions to brokers that are not ready?
> > > > > > > > > > > > > > > > > Switching partitions purely based on current broker readiness information
> > > > > > > > > > > > > > > > > can really skew workload I think (or at least I couldn't build a model that
> > > > > > > > > > > > > > > > > proves that over time it's going to be generally fair), I feel that the
> > > > > > > > > > > > > > > > > algorithm should try to be fair in general and use smoother signals by
> > > > > > > > > > > > > > > > > default (e.g. a broker with choppier latency may get much less load even
> > > > > > > > > > > > > > > > > though it can handle throughput, it then may potentially skew consumption),
> > > > > > > > > > > > > > > > > note that the queue-size-based logic uses probabilities (so we don't fully
> > > > > > > > > > > > > > > > > remove brokers, just make it less likely) and relative info rather than a
> > > > > > > > > > > > > > > > > threshold (so if all brokers are heavily, but equally loaded, they will get
> > > > > > > > > > > > > > > > > equal distribution, rather than get removed because they exceed some
> > > > > > > > > > > > > > > > > threshold).  So at the very least, I would like this logic to be turned off
> > > > > > > > > > > > > > > > > by default as it's hard to predict what it could do with different patterns
> > > > > > > > > > > > > > > > > (which means that there would need to be some configuration).  We could
> > > > > > > > > > > > > > > > > just not use brokers that are not ready, but again, I think that it's good
> > > > > > > > > > > > > > > > > to try to be fair under normal circumstances, so if normally brokers can
> > > > > > > > > > > > > > > > > respond under some partition.availability.timeout.ms threshold and the
> > > > > > > > > > > > > > > > > application works well with those latencies, then we could distribute data
> > > > > > > > > > > > > > > > > equally between brokers that don't exceed the latencies.  The value, of
> > > > > > > > > > > > > > > > > course, would depend on the environment and app requirements, hence it's
> > > > > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 10. Added a statement at the beginning of the proposed changes.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -Artem
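
Point 2 above describes the byte accounting behind the uniform sticky
behavior.  A minimal sketch of that switch condition (hypothetical names,
and a simple round-robin switch in place of the adaptive partition choice)
could look like:

    final class UniformStickyAccountingSketch {
        private final int stickyBatchSize;   // e.g. partitioner.sticky.batch.size / batch.size
        private int currentPartition = 0;
        private int bytesInCurrentPartition = 0;

        UniformStickyAccountingSketch(int stickyBatchSize) {
            this.stickyBatchSize = stickyBatchSize;
        }

        // Called for every unkeyed record with its estimated size in bytes:
        // stick to one partition until the produced bytes exceed the configured
        // size, then move on to the next one.
        int partitionFor(int recordSizeBytes, int numPartitions) {
            if (bytesInCurrentPartition >= stickyBatchSize) {
                currentPartition = (currentPartition + 1) % numPartitions;
                bytesInCurrentPartition = 0;
            }
            bytesInCurrentPartition += recordSizeBytes;
            return currentPartition;
        }
    }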
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. I agree with Luke that having the
> > > > partitioner
> > > > > > > > > returning
> > > > > > > > > > -1
> > > > > > > > > > > > is
> > > > > > > > > > > > > > kind
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > weird. Could we just change the
> > > implementation
> > > > of
> > > > > > > > > > > > > > DefaultPartitioner
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > new behavior?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar
> > > > > question
> > > > > > to
> > > > > > > > > > Luke. I
> > > > > > > > > > > > am
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > > > > why we want to introduce this new
> > > > configuration.
> > > > > > > Could
> > > > > > > > we
> > > > > > > > > > just
> > > > > > > > > > > > > use
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > existing batch.size?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 4. I also agree with Luke that it's not
> > clear
> > > > why
> > > > > > we
> > > > > > > > need
> > > > > > > > > > > > > > > > > > partition.availability.timeout.ms. The
> KIP
> > > > says
> > > > > > the
> > > > > > > > > broker
> > > > > > > > > > > > > "would
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > chosen until the broker is able to accept
> > the
> > > > > next
> > > > > > > > ready
> > > > > > > > > > batch
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > partition". If we are keeping track of
> > this,
> > > > > could
> > > > > > we
> > > > > > > > > just
> > > > > > > > > > > > avoid
> > > > > > > > > > > > > > > > > assigning
> > > > > > > > > > > > > > > > > > records to partitions whose leader is not
> > > able
> > > > to
> > > > > > > > accept
> > > > > > > > > > the
> > > > > > > > > > > > next
> > > > > > > > > > > > > > > > batch?
> > > > > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > > > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 10. Currently, partitioner.class defaults
> > to
> > > > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > uses
> > > > > > > > > > > > > > > > > > StickyPartitioner when the key is
> > specified.
> > > > > Since
> > > > > > > this
> > > > > > > > > KIP
> > > > > > > > > > > > > > improves
> > > > > > > > > > > > > > > > upon
> > > > > > > > > > > > > > > > > > StickyPartitioner, it would be useful to
> > make
> > > > the
> > > > > > new
> > > > > > > > > > behavior
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > default
> > > > > > > > > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Jun

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

33.  That's an interesting point.  Technically, onNewBatch is just a way to
pass a signal to the partitioner; it's the way the sticky partitioner uses
this signal that is suboptimal, and in theory someone could use it for
something else.

-Artem
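
A rough, illustrative sketch of the pattern discussed here: a sticky-style
partitioner that treats onNewBatch() as its only switch signal.  This is not
the actual DefaultPartitioner or UniformStickyPartitioner code, just an
approximation of how the callback tends to be consumed.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

// Illustrative only; partition availability is ignored for brevity.
public class SignalDrivenStickyPartitioner implements Partitioner {

    // Current sticky partition per topic.
    private final Map<String, Integer> stickyPartition = new ConcurrentHashMap<>();

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Keep returning the same partition until the producer signals that a
        // new batch was created for it (see onNewBatch below).
        return stickyPartition.computeIfAbsent(topic, t -> choosePartition(t, cluster));
    }

    @Override
    public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
        // The callback only says "a batch for prevPartition was closed"; this
        // partitioner interprets it as "switch now", which is the coupling
        // between batch creation and partition switching discussed in the KIP.
        stickyPartition.put(topic, choosePartition(topic, cluster));
    }

    @Override
    public void close() {}

    private int choosePartition(String topic, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        return partitions.get(ThreadLocalRandom.current().nextInt(partitions.size())).partition();
    }
}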

On Mon, Mar 14, 2022 at 9:11 AM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> Thanks for the reply. A couple more comments.
>
> 32. We could defer the metrics until we know what to add.
>
> 33. Since we are deprecating DefaultPartitioner and
> UniformStickyPartitioner, should we deprecate Partitioner.onNewBatch() too
> given its unexpected side effect?
>
> Thanks,
>
> Jun
>
> On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hi Jun,
> >
> > 32. Good point.  Do you think it's ok to defer the metrics until we run
> > some benchmarks so that we get a better idea of what metrics we need?
> >
> > -Artem
> >
> > On Thu, Mar 10, 2022 at 3:12 PM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem.
> > >
> > > Thanks for the reply. One more comment.
> > >
> > > 32. Do we need to add any new metric on the producer? For example, if
> > > partitioner.availability.timeout.ms is > 0, it might be useful to know
> > the
> > > number of unavailable partitions.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 30.  Clarified.
> > > >
> > > > 31. I plan to do some benchmarking once implementation is finished,
> > I'll
> > > > update the KIP with the results once I have them.  The reason to make
> > it
> > > > default is that it won't be used otherwise and we won't know if it's
> > good
> > > > or not in practical workloads.
> > > >
> > > > -Artem
> > > >
> > > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the updated KIP. A couple of more comments.
> > > > >
> > > > > 30. For the 3 new configs, it would be useful to make it clear that
> > > they
> > > > > are only relevant when the partitioner class is null.
> > > > >
> > > > > 31. partitioner.adaptive.partitioning.enable : I am wondering
> whether
> > > it
> > > > > should default to true. This is a more complex behavior than
> "uniform
> > > > > sticky" and may take some time to get right. If we do want to
> enable
> > it
> > > > by
> > > > > default, it would be useful to validate it with some test results.
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Thank you for feedback, I've discussed this offline with some of
> > the
> > > > > folks
> > > > > > and updated the KIP.  The main change is that now instead of
> using
> > > > > > DefaultPartitioner and UniformStickyPartitioners as flags, in the
> > new
> > > > > > proposal the default partitioner is null, so if no custom
> > partitioner
> > > > is
> > > > > > specified then the partitioning logic is implemented in
> > > KafkaProducer.
> > > > > > Compatibility section is updated as well.  Also the configuration
> > > > options
> > > > > > are renamed to be more consistent.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Thanks for your explanation and update to the KIP.
> > > > > > > Some comments:
> > > > > > >
> > > > > > > 5. In the description for `enable.adaptive.partitioning`, the
> > > `false`
> > > > > > case,
> > > > > > > you said:
> > > > > > > > the producer will try to distribute messages uniformly.
> > > > > > > I think we should describe the possible skew in the distribution.
> > > > > > > Otherwise, users might be confused about why adaptive partitioning is
> > > > > > > important.
> > > > > > >
> > > > > > > 6. In the description for `partition.availability.timeout.ms`,
> I
> > > > think
> > > > > > we
> > > > > > > should mention in the last sentence about if
> > > > > > `enable.adaptive.partitioning`
> > > > > > > is disabled this logic is also disabled.
> > > > > > >
> > > > > > > 7. Similar thoughts as Ismael, I think we should have a POC and
> > > test
> > > > to
> > > > > > > prove that this adaptive partitioning algorithm can have better
> > > > uniform
> > > > > > > partitioning, compared with original sticky one.
> > > > > > >
> > > > > > > Thank you.
> > > > > > > Luke
> > > > > > >
> > > > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk>
> > > > wrote:
> > > > > > >
> > > > > > > > Regarding `3`, we should only deprecate it if we're sure the
> > new
> > > > > > approach
> > > > > > > > handles all cases better. Are we confident about that for
> both
> > of
> > > > the
> > > > > > > > previous partitioners?
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> > > > > <djacot@confluent.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Artem,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP! I have a few comments:
> > > > > > > > >
> > > > > > > > > 1. In the preamble of the proposed change section, there is
> > > > still a
> > > > > > > > > mention of the
> > > > > > > > > -1 approach. My understanding is that we have moved away
> from
> > > it
> > > > > now.
> > > > > > > > >
> > > > > > > > > 2. I am a bit concerned by the trick suggested about the
> > > > > > > > > DefaultPartitioner and
> > > > > > > > > the UniformStickyPartitioner. I do agree that implementing
> > the
> > > > > logic
> > > > > > in
> > > > > > > > the
> > > > > > > > > producer itself is a good thing. However, it is weird from
> a
> > > user
> > > > > > > > > perspective
> > > > > > > > > that he can set a class as partitioner that is not used in
> > the
> > > > > end. I
> > > > > > > > > think that
> > > > > > > > > this will be confusing for our users. Have we considered
> > > changing
> > > > > the
> > > > > > > > > default
> > > > > > > > > value of partitioner.class to null to indicate that the new
> > > > > built-in
> > > > > > > > > partitioner
> > > > > > > > > must be used? By default, the built-in partitioner would be
> > > used
> > > > > > unless
> > > > > > > > the
> > > > > > > > > user explicitly specify one. The downside is that the new
> > > default
> > > > > > > > behavior
> > > > > > > > > would not work if the user explicitly specify the
> partitioner
> > > but
> > > > > we
> > > > > > > > could
> > > > > > > > > mitigate this with my next point.
> > > > > > > > >
> > > > > > > > > 3. Related to the previous point, I think that we could
> > > deprecate
> > > > > > both
> > > > > > > > the
> > > > > > > > > DefaultPartitioner and the UniformStickyPartitioner. I
> would
> > > also
> > > > > > add a
> > > > > > > > > warning if one of them is explicitly provided by the user
> to
> > > > inform
> > > > > > > them
> > > > > > > > > that they should switch to the new built-in one. I am
> pretty
> > > sure
> > > > > > that
> > > > > > > > most
> > > > > > > > > of the folks use the default configuration anyway.
> > > > > > > > >
> > > > > > > > > 4. It would be great if we could explain why the -1 way was
> > > > > rejected.
> > > > > > > At
> > > > > > > > > the moment, the rejected alternative only explain the idea
> > but
> > > > does
> > > > > > not
> > > > > > > > > say why we rejected it.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Jun,
> > > > > > > > > >
> > > > > > > > > > 2. Removed the option from the KIP.  Now the sticky
> > > > partitioning
> > > > > > > > > threshold
> > > > > > > > > > is hardcoded to batch.size.
> > > > > > > > > >
> > > > > > > > > > 20. Added the corresponding wording to the KIP.
> > > > > > > > > >
> > > > > > > > > > -Artem
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Artem,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the reply.
> > > > > > > > > > >
> > > > > > > > > > > 1. Sounds good.
> > > > > > > > > > >
> > > > > > > > > > > 2. If we don't expect users to change it, we probably
> > could
> > > > > just
> > > > > > > > leave
> > > > > > > > > out
> > > > > > > > > > > the new config. In general, it's easy to add a new
> > config,
> > > > but
> > > > > > hard
> > > > > > > > to
> > > > > > > > > > > remove an existing config.
> > > > > > > > > > >
> > > > > > > > > > > 20. The two new configs enable.adaptive.partitioning
> and
> > > > > > > > > > > partition.availability.timeout.ms only apply to the
> two
> > > > > built-in
> > > > > > > > > > > partitioners DefaultPartitioner and
> > > UniformStickyPartitioner,
> > > > > > > right?
> > > > > > > > It
> > > > > > > > > > > would be useful to document that in the KIP.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for the suggestions.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. As we discussed offline, we can hardcode the logic
> > for
> > > > > > > > > > > > DefaultPartitioner and UniformStickyPartitioner in
> the
> > > > > > > > KafkaProducer
> > > > > > > > > > > (i.e.
> > > > > > > > > > > > the DefaultPartitioner.partition won't get called,
> > > instead
> > > > > > > > > KafkaProducer
> > > > > > > > > > > > would check if the partitioner is an instance of
> > > > > > > DefaultPartitioner
> > > > > > > > > and
> > > > > > > > > > > > then run the actual partitioning logic itself).  Then
> > the
> > > > > > change
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > > Partitioner wouldn't be required.  I'll update the
> KIP
> > to
> > > > > > reflect
> > > > > > > > > that.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. I don't expect users to change this too often, as
> > > > changing
> > > > > > it
> > > > > > > > > would
> > > > > > > > > > > > require a bit of studying of the production patterns.
> > > As a
> > > > > > > general
> > > > > > > > > > > > principle, if I can think of a model that requires a
> > > > > deviation
> > > > > > > from
> > > > > > > > > > > > default, I tend to add a configuration option.  It
> > could
> > > be
> > > > > > that
> > > > > > > > > it'll
> > > > > > > > > > > > never get used in practice, but I cannot prove that.
> > I'm
> > > > ok
> > > > > > with
> > > > > > > > > > > removing
> > > > > > > > > > > > the option, let me know what you think.
> > > > > > > > > > > >
> > > > > > > > > > > > -Artem
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Since we control the implementation and the
> usage
> > of
> > > > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > > another way is to instantiate the
> DefaultPartitioner
> > > > with a
> > > > > > > > special
> > > > > > > > > > > > > constructor, which allows it to have more access to
> > > > > internal
> > > > > > > > > > > information.
> > > > > > > > > > > > > Then we could just change the behavior of
> > > > > DefaultPartitioner
> > > > > > > > such
> > > > > > > > > that
> > > > > > > > > > > > it
> > > > > > > > > > > > > can use the internal information when choosing the
> > > > > partition.
> > > > > > > This
> > > > > > > > > seems
> > > > > > > > > > > > > more intuitive than having DefaultPartitioner
> return
> > -1
> > > > > > > > partition.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. I guess partitioner.sticky.batch.size is
> > introduced
> > > > > > because
> > > > > > > > the
> > > > > > > > > > > > > effective batch size could be less than batch.size
> > and
> > > we
> > > > > > want
> > > > > > > to
> > > > > > > > > align
> > > > > > > > > > > > > partition switching with the effective batch size.
> > How
> > > > > would
> > > > > > a
> > > > > > > > user
> > > > > > > > > > > know
> > > > > > > > > > > > > the effective batch size to set
> > > > > partitioner.sticky.batch.size
> > > > > > > > > properly?
> > > > > > > > > > > > If
> > > > > > > > > > > > > the user somehow knows the effective batch size,
> does
> > > > > setting
> > > > > > > > > > > batch.size
> > > > > > > > > > > > to
> > > > > > > > > > > > > the effective batch size achieve the same result?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jun
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jun
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Updated the KIP to add a couple paragraphs
> about
> > > > > > > > > implementation
> > > > > > > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. Sorry if my reply was confusing, what I meant
> to
> > > say
> > > > > > (and
> > > > > > > I
> > > > > > > > > > > > elaborated
> > > > > > > > > > > > > > on that in point #3) is that there could be
> > patterns
> > > > for
> > > > > > > which
> > > > > > > > > 16KB
> > > > > > > > > > > > > > wouldn't be the most effective setting, thus it
> > would
> > > > be
> > > > > > good
> > > > > > > > to
> > > > > > > > > make
> > > > > > > > > > > > it
> > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. We could use broker readiness timeout.  But
> I'm
> > > not
> > > > > sure
> > > > > > > it
> > > > > > > > > would
> > > > > > > > > > > > > > correctly model the broker load.  The problem is
> > that
> > > > > > latency
> > > > > > > > is
> > > > > > > > > not
> > > > > > > > > > > an
> > > > > > > > > > > > > > accurate measure of throughput, we could have 2
> > > brokers
> > > > > > that
> > > > > > > > have
> > > > > > > > > > > equal
> > > > > > > > > > > > > > throughput but one has higher latency (so it
> takes
> > > > larger
> > > > > > > > batches
> > > > > > > > > > > less
> > > > > > > > > > > > > > frequently, but still takes the same load).
> > > > > Latency-based
> > > > > > > > logic
> > > > > > > > > is
> > > > > > > > > > > > > likely
> > > > > > > > > > > > > > to send less data to the broker with higher
> > latency.
> > > > > Using
> > > > > > > the
> > > > > > > > > queue
> > > > > > > > > > > > > size
> > > > > > > > > > > > > > would adapt to throughput, regardless of latency
> > > (which
> > > > > > could
> > > > > > > > be
> > > > > > > > > > > just a
> > > > > > > > > > > > > > result of deployment topology), so that's the
> model
> > > > > chosen
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > > > proposal.  The partition.availability.timeout.ms
> > > logic
> > > > > > > > > approaches
> > > > > > > > > > > the
> > > > > > > > > > > > > > model
> > > > > > > > > > > > > > from a slightly different angle, say we have a
> > > > > requirement
> > > > > > to
> > > > > > > > > deliver
> > > > > > > > > > > > > > messages via brokers that have a certain latency,
> > > then
> > > > > > > > > > > > > > partition.availability.timeout.ms could be used
> to
> > > > tune
> > > > > > > that.
> > > > > > > > > > > Latency
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > much more volatile metric than throughput
> (latency
> > > > > depends
> > > > > > on
> > > > > > > > > > > external
> > > > > > > > > > > > > > load, on capacity, on deployment topology, on
> > jitter
> > > in
> > > > > > > > network,
> > > > > > > > > on
> > > > > > > > > > > > > jitter
> > > > > > > > > > > > > > in disk, etc.) and I think it would be best to
> > leave
> > > > > > > > > latency-based
> > > > > > > > > > > > > > thresholds configurable to tune for the
> > environment.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > > > > > > <ju...@confluent.io.invalid>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Perhaps you could elaborate a bit more on
> how
> > > the
> > > > > > > producer
> > > > > > > > > > > > > determines
> > > > > > > > > > > > > > > the partition if the partitioner returns -1.
> This
> > > > will
> > > > > > help
> > > > > > > > > > > > understand
> > > > > > > > > > > > > > why
> > > > > > > > > > > > > > > encapsulating that logic as a partitioner is
> not
> > > > clean.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. I am not sure that I understand this part.
> If
> > > > 15.5KB
> > > > > > is
> > > > > > > > more
> > > > > > > > > > > > > > efficient,
> > > > > > > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. Yes, we could add a switch (or a variant of
> > the
> > > > > > > > > partitioner) for
> > > > > > > > > > > > > > > enabling this behavior. Also, choosing
> partitions
> > > > based
> > > > > > on
> > > > > > > > > broker
> > > > > > > > > > > > > > readiness
> > > > > > > > > > > > > > > can be made in a smoother way. For example, we
> > > could
> > > > > > track
> > > > > > > > the
> > > > > > > > > last
> > > > > > > > > > > > > time
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > broker has drained any batches from the
> > > accumulator.
> > > > We
> > > > > > can
> > > > > > > > > then
> > > > > > > > > > > > select
> > > > > > > > > > > > > > > partitions from brokers proportionally to the
> > > inverse
> > > > > of
> > > > > > > that
> > > > > > > > > time.
> > > > > > > > > > > > > This
> > > > > > > > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > > > > > > > partition.availability.timeout.ms
> > > > > > > > > > > > > > >  threshold.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > >
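
A rough sketch of the smoother selection suggested above: weight each
partition by the inverse of the time since its leader last drained a batch
from the accumulator.  Purely illustrative; the names are hypothetical and
this is not code from the producer or the KIP.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: pick a partition with probability proportional to
// 1 / (time since the partition's leader last drained a batch).
public class DrainTimeWeightedChooser {

    // partitionLeaders.get(p) = leader broker id of partition p
    // leaderLastDrainMs.get(b) = last time (ms) broker b drained a batch
    public int choosePartition(List<Integer> partitionLeaders,
                               Map<Integer, Long> leaderLastDrainMs,
                               long nowMs) {
        double[] weights = new double[partitionLeaders.size()];
        double total = 0.0;
        for (int p = 0; p < weights.length; p++) {
            // Leaders we have not yet observed draining are treated as if
            // they drained just now (maximum weight).
            long lastDrain = leaderLastDrainMs.getOrDefault(partitionLeaders.get(p), nowMs);
            long sinceDrain = Math.max(1, nowMs - lastDrain);
            // Recently drained leaders get proportionally more records.
            weights[p] = 1.0 / sinceDrain;
            total += weights[p];
        }
        // Weighted random pick; no partition is excluded outright, the choice
        // just becomes smoothly less likely for stale leaders.
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int p = 0; p < weights.length; p++) {
            r -= weights[p];
            if (r <= 0) {
                return p;
            }
        }
        return weights.length - 1;
    }
}
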
> > > > > > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you for your feedback.  I've added the
> > > > Rejected
> > > > > > > > > Alternative
> > > > > > > > > > > > > > section
> > > > > > > > > > > > > > > > that may clarify some of the questions w.r.t.
> > > > > returning
> > > > > > > -1.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The
> > > > problem
> > > > > > is
> > > > > > > > > that a
> > > > > > > > > > > > > > > significant
> > > > > > > > > > > > > > > > part of the logic needs to be in the producer
> > > > > (because
> > > > > > it
> > > > > > > > now
> > > > > > > > > > > uses
> > > > > > > > > > > > > > > > information about brokers that only the
> > producer
> > > > > > knows),
> > > > > > > so
> > > > > > > > > > > > > > encapsulation
> > > > > > > > > > > > > > > > of the logic within the default partitioner
> > isn't
> > > > as
> > > > > > > clean.
> > > > > > > > > > >  I've
> > > > > > > > > > > > > > added
> > > > > > > > > > > > > > > > the Rejected Alternative section that
> documents
> > > an
> > > > > > > attempt
> > > > > > > > to
> > > > > > > > > > > keep
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > encapsulation by providing new callbacks to
> the
> > > > > > > > partitioner.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. The meaning of the
> > > partitioner.sticky.batch.size
> > > > > is
> > > > > > > > > explained
> > > > > > > > > > > in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > Uniform Sticky Batch Size section.
> Basically,
> > we
> > > > > track
> > > > > > > the
> > > > > > > > > > > amount
> > > > > > > > > > > > of
> > > > > > > > > > > > > > > bytes
> > > > > > > > > > > > > > > > produced to the partition and if it exceeds
> > > > > > > > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > > > > > > then we switch to the next partition.  As far
> > as
> > > > the
> > > > > > > reason
> > > > > > > > > to
> > > > > > > > > > > make
> > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > different from batch.size, I think Luke
> > answered
> > > > this
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > > question
> > > > > > > > > > > > > > > #3
> > > > > > > > > > > > > > > > -- what if the load pattern is such that
> 15.5KB
> > > > would
> > > > > > be
> > > > > > > > more
> > > > > > > > > > > > > efficient
> > > > > > > > > > > > > > > > than 16KB?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. I think it's hard to have one size that
> > would
> > > > fit
> > > > > > all
> > > > > > > > > > > patterns.
> > > > > > > > > > > > > > E.g.
> > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > the load pattern is such that there is linger
> > and
> > > > the
> > > > > > app
> > > > > > > > > fills
> > > > > > > > > > > the
> > > > > > > > > > > > > > batch
> > > > > > > > > > > > > > > > before linger expires, then having 16KB would
> > > most
> > > > > > likely
> > > > > > > > > > > > synchronize
> > > > > > > > > > > > > > > > batching and partition switching, so each
> > > partition
> > > > > > would
> > > > > > > > > get a
> > > > > > > > > > > > full
> > > > > > > > > > > > > > > > batch.  If load pattern is such that there
> are
> > a
> > > > few
> > > > > > > > > non-complete
> > > > > > > > > > > > > > batches
> > > > > > > > > > > > > > > > go out before a larger batch starts to fill,
> > then
> > > > it
> > > > > > may
> > > > > > > > > actually
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > beneficial to make slightly larger (e.g.
> > > linger=0,
> > > > > > first
> > > > > > > > few
> > > > > > > > > > > > records
> > > > > > > > > > > > > go
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > the first batch, then next few records go to
> > > second
> > > > > > > batch,
> > > > > > > > > and so
> > > > > > > > > > > > on,
> > > > > > > > > > > > > > > until
> > > > > > > > > > > > > > > > 5 in-flight, then larger batch would form
> while
> > > > > waiting
> > > > > > > for
> > > > > > > > > > > broker
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > respond, but the partition switch would
> happen
> > > > before
> > > > > > the
> > > > > > > > > larger
> > > > > > > > > > > > > batch
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > full).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. There are a couple of reasons for
> > introducing
> > > > > > > > > > > > > > > > partition.availability.timeout.ms.  Luke's
> and
> > > > Jun's
> > > > > > > > > questions
> > > > > > > > > > > are
> > > > > > > > > > > > > > > slightly
> > > > > > > > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > > > > > > > (Luke) Is the queue size a good enough
> > signal?  I
> > > > > think
> > > > > > > > it's
> > > > > > > > > a
> > > > > > > > > > > good
> > > > > > > > > > > > > > > default
> > > > > > > > > > > > > > > > signal as it tries to preserve general
> fairness
> > > and
> > > > > not
> > > > > > > > > overreact
> > > > > > > > > > > > on
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > broker's state at each moment in time.  But
> > > because
> > > > > > it's
> > > > > > > > > smooth,
> > > > > > > > > > > it
> > > > > > > > > > > > > may
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > be reactive enough to instantaneous latency
> > > jumps.
> > > > > For
> > > > > > > > > > > > > > latency-sensitive
> > > > > > > > > > > > > > > > workloads, it may be desirable to react
> faster
> > > > when a
> > > > > > > > broker
> > > > > > > > > > > > becomes
> > > > > > > > > > > > > > > > unresponsive (but that may make the
> > distribution
> > > > > really
> > > > > > > > > choppy),
> > > > > > > > > > > so
> > > > > > > > > > > > > > > > partition.availability.timeout.ms provides
> an
> > > > > > > opportunity
> > > > > > > > to
> > > > > > > > > > > tune
> > > > > > > > > > > > > > > > adaptiveness.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > (Jun) Can we just not assign partitions to
> > > brokers
> > > > > that
> > > > > > > are
> > > > > > > > > not
> > > > > > > > > > > > > ready?
> > > > > > > > > > > > > > > > Switching partitions purely based on current
> > > broker
> > > > > > > > readiness
> > > > > > > > > > > > > > information
> > > > > > > > > > > > > > > > can really skew workload I think (or at
> least I
> > > > > > couldn't
> > > > > > > > > build a
> > > > > > > > > > > > > model
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > proves that over time it's going to be
> > generally
> > > > > > fair), I
> > > > > > > > > feel
> > > > > > > > > > > that
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > algorithm should try to be fair in general
> and
> > > use
> > > > > > > smoother
> > > > > > > > > > > signals
> > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > default (e.g. a broker with choppier latency
> > may
> > > > get
> > > > > > much
> > > > > > > > > less
> > > > > > > > > > > load
> > > > > > > > > > > > > > even
> > > > > > > > > > > > > > > > though it can handle throughput, it then may
> > > > > > potentially
> > > > > > > > skew
> > > > > > > > > > > > > > > consumption),
> > > > > > > > > > > > > > > > note that the queue-size-based logic uses
> > > > > probabilities
> > > > > > > (so
> > > > > > > > > we
> > > > > > > > > > > > don't
> > > > > > > > > > > > > > > fully
> > > > > > > > > > > > > > > > remove brokers, just make it less likely) and
> > > > > relative
> > > > > > > info
> > > > > > > > > > > rather
> > > > > > > > > > > > > > than a
> > > > > > > > > > > > > > > > threshold (so if all brokers are heavily, but
> > > > equally
> > > > > > > > loaded,
> > > > > > > > > > > they
> > > > > > > > > > > > > will
> > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > equal distribution, rather than get removed
> > > because
> > > > > > they
> > > > > > > > > exceed
> > > > > > > > > > > > some
> > > > > > > > > > > > > > > > threshold).  So at the very least, I would
> like
> > > > this
> > > > > > > logic
> > > > > > > > > to be
> > > > > > > > > > > > > turned
> > > > > > > > > > > > > > > off
> > > > > > > > > > > > > > > > by default as it's hard to predict what it
> > could
> > > do
> > > > > > with
> > > > > > > > > > > different
> > > > > > > > > > > > > > > patterns
> > > > > > > > > > > > > > > > (which means that there would need to be some
> > > > > > > > > configuration).  We
> > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > just not use brokers that are not ready, but
> > > > again, I
> > > > > > > think
> > > > > > > > > that
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > > good
> > > > > > > > > > > > > > > > to try to be fair under normal circumstances,
> > so
> > > if
> > > > > > > > normally
> > > > > > > > > > > > brokers
> > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > respond under some
> > > > partition.availability.timeout.ms
> > > > > > > > > threshold
> > > > > > > > > > > and
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > application works well with those latencies,
> > then
> > > > we
> > > > > > > could
> > > > > > > > > > > > distribute
> > > > > > > > > > > > > > > data
> > > > > > > > > > > > > > > > equally between brokers that don't exceed the
> > > > > > latencies.
> > > > > > > > The
> > > > > > > > > > > > value,
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > course, would depend on the environment and
> app
> > > > > > > > requirements,
> > > > > > > > > > > hence
> > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 10. Added a statement at the beginning of the
> > > > > proposed
> > > > > > > > > changes.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
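
A compact, illustrative sketch of the two ideas in this reply: stick to one
partition until roughly a batch worth of bytes has been produced to it, then
choose the next partition with a probability weighted by queue size instead
of excluding loaded brokers outright.  The names are hypothetical; this is
not the actual KafkaProducer logic.

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical, simplified sketch of "uniform sticky" switching plus
// queue-size-aware partition selection.
public class UniformStickyChooser {

    private final int stickyBatchSizeBytes; // e.g. batch.size (16KB default)
    private int currentPartition = -1;
    private int bytesInCurrentPartition = 0;

    public UniformStickyChooser(int stickyBatchSizeBytes) {
        this.stickyBatchSizeBytes = stickyBatchSizeBytes;
    }

    // queuedBytes[p] = bytes currently queued in the accumulator for
    // partition p (a stand-in for the per-broker load signal).
    public int partitionFor(int recordSizeBytes, int[] queuedBytes) {
        if (currentPartition < 0 || bytesInCurrentPartition >= stickyBatchSizeBytes) {
            // Switch once a batch worth of bytes has been produced, not when
            // a batch object happens to be created.
            currentPartition = pickNext(queuedBytes);
            bytesInCurrentPartition = 0;
        }
        bytesInCurrentPartition += recordSizeBytes;
        return currentPartition;
    }

    // Weighted random choice: shorter queues are more likely to be picked,
    // but no partition is removed entirely, so equally loaded brokers still
    // get an equal share.
    private int pickNext(int[] queuedBytes) {
        double[] weights = new double[queuedBytes.length];
        double total = 0.0;
        for (int p = 0; p < queuedBytes.length; p++) {
            weights[p] = 1.0 / (1 + queuedBytes[p]);
            total += weights[p];
        }
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int p = 0; p < weights.length; p++) {
            r -= weights[p];
            if (r <= 0) {
                return p;
            }
        }
        return weights.length - 1;
    }
}
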
> > > > > > > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. I agree with Luke that having the
> > > partitioner
> > > > > > > > returning
> > > > > > > > > -1
> > > > > > > > > > > is
> > > > > > > > > > > > > kind
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > weird. Could we just change the
> > implementation
> > > of
> > > > > > > > > > > > > DefaultPartitioner
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > new behavior?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar
> > > > question
> > > > > to
> > > > > > > > > Luke. I
> > > > > > > > > > > am
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > > > why we want to introduce this new
> > > configuration.
> > > > > > Could
> > > > > > > we
> > > > > > > > > just
> > > > > > > > > > > > use
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > existing batch.size?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 4. I also agree with Luke that it's not
> clear
> > > why
> > > > > we
> > > > > > > need
> > > > > > > > > > > > > > > > > partition.availability.timeout.ms. The KIP
> > > says
> > > > > the
> > > > > > > > broker
> > > > > > > > > > > > "would
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > chosen until the broker is able to accept
> the
> > > > next
> > > > > > > ready
> > > > > > > > > batch
> > > > > > > > > > > > from
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > partition". If we are keeping track of
> this,
> > > > could
> > > > > we
> > > > > > > > just
> > > > > > > > > > > avoid
> > > > > > > > > > > > > > > > assigning
> > > > > > > > > > > > > > > > > records to partitions whose leader is not
> > able
> > > to
> > > > > > > accept
> > > > > > > > > the
> > > > > > > > > > > next
> > > > > > > > > > > > > > > batch?
> > > > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 10. Currently, partitioner.class defaults
> to
> > > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > uses
> > > > > > > > > > > > > > > > > StickyPartitioner when the key is
> specified.
> > > > Since
> > > > > > this
> > > > > > > > KIP
> > > > > > > > > > > > > improves
> > > > > > > > > > > > > > > upon
> > > > > > > > > > > > > > > > > StickyPartitioner, it would be useful to
> make
> > > the
> > > > > new
> > > > > > > > > behavior
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > default
> > > > > > > > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > > > > > > > > showuon@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Also, one more thing I think you need to
> > > know.
> > > > > > > > > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > > > > > > > >
> > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > mentioned, sometimes the custom
> partitioner
> > > > would
> > > > > > > > return
> > > > > > > > > > > > negative
> > > > > > > > > > > > > > > > > partition
> > > > > > > > > > > > > > > > > > id accidentally.
> > > > > > > > > > > > > > > > > > If it returned -1, how could you know if
> it
> > > is
> > > > > > > expected
> > > > > > > > > or
> > > > > > > > > > > not
> > > > > > > > > > > > > > > > expected?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke
> Chen <
> > > > > > > > > showuon@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the update. I have some
> > > questions
> > > > > > about
> > > > > > > > it:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > 1. Could you explain why you need the
> > > > > > `partitioner`
> > > > > > > > > return
> > > > > > > > > > > > -1?
> > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > > > case we need it? And how it is used in
> > your
> > > > > KIP?
> > > > > > > > > > > > > > > > > > > 2. What does the
> > > > > "partitioner.sticky.batch.size"
> > > > > > > > mean?
> > > > > > > > > In
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > "Configuration" part, you didn't
> explain
> > > it.
> > > > > And
> > > > > > > > > default to
> > > > > > > > > > > > 0,
> > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > guess
> > > > > > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > > > > > the same as current behavior for
> backward
> > > > > > > > > compatibility,
> > > > > > > > > > > > right?
> > > > > > > > > > > > > > You
> > > > > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > > > > mention it.
> > > > > > > > > > > > > > > > > > > 3. I'm thinking we can have a threshold
> > to
> > > > the
> > > > > > > > > > > > > > > > > > > "partitioner.sticky.batch.size". Let's
> > say,
> > > > we
> > > > > > > > already
> > > > > > > > > > > > > accumulate
> > > > > > > > > > > > > > > > > 15.5KB
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > partition1, and sent. So when next
> batch
> > > > > created,
> > > > > > > in
> > > > > > > > > your
> > > > > > > > > > > > > current
> > > > > > > > > > > > > > > > > design,
> > > > > > > > > > > > > > > > > > > we still stick to partition1, until
> 16KB
> > > > > reached,
> > > > > > > and
> > > > > > > > > then
> > > > > > > > > > > we
> > > > > > > > > > > > > > > create
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > > > batch to change to next partition, ex:
> > > > > > partition2.
> > > > > > > > But
> > > > > > > > > I
> > > > > > > > > > > > think
> > > > > > > > > > > > > if
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > > threshold to 95% (for example), we can
> > know
> > > > > > > previous
> > > > > > > > > 15.5KB
> > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > exceeds
> > > > > > > > > > > > > > > > > > > the threshold so that we can directly
> > > create
> > > > > new
> > > > > > > > batch
> > > > > > > > > for
> > > > > > > > > > > > next
> > > > > > > > > > > > > > > > > records.
> > > > > > > > > > > > > > > > > > > This way should be able to make it more
> > > > > > efficient.
> > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > 4. I think the improved queuing logic
> > > should
> > > > be
> > > > > > > good
> > > > > > > > > > > enough.
> > > > > > > > > > > > I
> > > > > > > > > > > > > > > can't
> > > > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > > > > the benefit of having `
> > > > > > > > > partition.availability.timeout.ms`
> > > > > > > > > > > > > > config.
> > > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > > > > short, you want to make the partitioner
> > > take
> > > > > the
> > > > > > > > broker
> > > > > > > > > > > load
> > > > > > > > > > > > > into
> > > > > > > > > > > > > > > > > > > consideration. We can just improve that
> > in
> > > > the
> > > > > > > > queuing
> > > > > > > > > > > logic
> > > > > > > > > > > > > (and
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > > already did it). Why should we add the
> > > > config?
> > > > > > > Could
> > > > > > > > > you
> > > > > > > > > > > use
> > > > > > > > > > > > > some
> > > > > > > > > > > > > > > > > > examples
> > > > > > > > > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem
> > > > Livshits
> > > > > > > > > > > > > > > > > > > <al...@confluent.io.invalid>
> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >> Hello,
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> Please add your comments about the
> KIP.
> > > If
> > > > > > there
> > > > > > > > are
> > > > > > > > > no
> > > > > > > > > > > > > > > > > considerations,
> > > > > > > > > > > > > > > > > > >> I'll put it up for vote in the next
> few
> > > > days.
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> -Artem
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem
> > > > Livshits
> > > > > <
> > > > > > > > > > > > > > > > alivshits@confluent.io
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> > Hello,
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > After trying a few prototypes, I've
> > made
> > > > > some
> > > > > > > > > changes to
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > public
> > > > > > > > > > > > > > > > > > >> > interface.  Please see the updated
> > > > document
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > > > > > >> > .
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > -Artem

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

Thanks for the reply. A couple more comments.

32. We could defer the metrics until we know what to add.

33. Since we are deprecating DefaultPartitioner and
UniformStickyPartitioner, should we deprecate Partitioner.onNewBatch() too
given its unexpected side effect?

Thanks,

Jun

On Thu, Mar 10, 2022 at 5:20 PM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hi Jun,
>
> 32. Good point.  Do you think it's ok to defer the metrics until we run
> some benchmarks so that we get a better idea of what metrics we need?
>
> -Artem
>
> On Thu, Mar 10, 2022 at 3:12 PM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem.
> >
> > Thanks for the reply. One more comment.
> >
> > 32. Do we need to add any new metric on the producer? For example, if
> > partitioner.availability.timeout.ms is > 0, it might be useful to know
> the
> > number of unavailable partitions.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hi Jun,
> > >
> > > 30.  Clarified.
> > >
> > > 31. I plan to do some benchmarking once implementation is finished,
> I'll
> > > update the KIP with the results once I have them.  The reason to make
> it
> > > default is that it won't be used otherwise and we won't know if it's
> good
> > > or not in practical workloads.
> > >
> > > -Artem
> > >
> > > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Artem,
> > > >
> > > > Thanks for the updated KIP. A couple of more comments.
> > > >
> > > > 30. For the 3 new configs, it would be useful to make it clear that
> > they
> > > > are only relevant when the partitioner class is null.
> > > >
> > > > 31. partitioner.adaptive.partitioning.enable : I am wondering whether
> > it
> > > > should default to true. This is a more complex behavior than "uniform
> > > > sticky" and may take some time to get right. If we do want to enable
> it
> > > by
> > > > default, it would be useful to validate it with some test results.
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Thank you for feedback, I've discussed this offline with some of
> the
> > > > folks
> > > > > and updated the KIP.  The main change is that now instead of using
> > > > > DefaultPartitioner and UniformStickyPartitioners as flags, in the
> new
> > > > > proposal the default partitioner is null, so if no custom
> partitioner
> > > is
> > > > > specified then the partitioning logic is implemented in
> > KafkaProducer.
> > > > > Compatibility section is updated as well.  Also the configuration
> > > options
> > > > > are renamed to be more consistent.
> > > > >
> > > > > -Artem
> > > > >
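
A minimal configuration sketch of what the updated proposal means in
practice: leave partitioner.class unset to get the partitioning logic built
into KafkaProducer, or set it explicitly to opt out.  The property names
below are taken from the discussion above and may differ from the final KIP;
check the KIP page for the authoritative names and defaults.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BuiltInPartitioningExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // partitioner.class is deliberately not set (null by default in the
        // updated proposal), so the built-in partitioning is used.

        // Names as discussed in this thread; verify against the KIP page.
        props.put("partitioner.adaptive.partitioning.enable", "true");
        props.put("partitioner.availability.timeout.ms", "0");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // To opt out of the built-in logic, set a custom partitioner:
            // props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
            //           MyPartitioner.class.getName());
        }
    }
}
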
> > > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com>
> wrote:
> > > > >
> > > > > > Hi Artem,
> > > > > >
> > > > > > Thanks for your explanation and update to the KIP.
> > > > > > Some comments:
> > > > > >
> > > > > > 5. In the description for `enable.adaptive.partitioning`, the
> > `false`
> > > > > case,
> > > > > > you said:
> > > > > > > the producer will try to distribute messages uniformly.
> > > > > > I think we should describe the possible skew in the distribution.
> > > > > > Otherwise, users might be confused about why adaptive partitioning is
> > > > > > important.
> > > > > >
> > > > > > 6. In the description for `partition.availability.timeout.ms`, I
> > > think
> > > > > we
> > > > > > should mention in the last sentence about if
> > > > > `enable.adaptive.partitioning`
> > > > > > is disabled this logic is also disabled.
> > > > > >
> > > > > > 7. Similar thoughts as Ismael, I think we should have a POC and
> > test
> > > to
> > > > > > prove that this adaptive partitioning algorithm can have better
> > > uniform
> > > > > > partitioning, compared with original sticky one.
> > > > > >
> > > > > > Thank you.
> > > > > > Luke
> > > > > >
> > > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk>
> > > wrote:
> > > > > >
> > > > > > > Regarding `3`, we should only deprecate it if we're sure the
> new
> > > > > approach
> > > > > > > handles all cases better. Are we confident about that for both
> of
> > > the
> > > > > > > previous partitioners?
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> > > > <djacot@confluent.io.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Artem,
> > > > > > > >
> > > > > > > > Thanks for the KIP! I have a few comments:
> > > > > > > >
> > > > > > > > 1. In the preamble of the proposed change section, there is
> > > still a
> > > > > > > > mention of the
> > > > > > > > -1 approach. My understanding is that we have moved away from
> > it
> > > > now.
> > > > > > > >
> > > > > > > > 2. I am a bit concerned by the trick suggested about the
> > > > > > > > DefaultPartitioner and
> > > > > > > > the UniformStickyPartitioner. I do agree that implementing
> the
> > > > logic
> > > > > in
> > > > > > > the
> > > > > > > > producer itself is a good thing. However, it is weird from a
> > user
> > > > > > > > perspective
> > > > > > > > that he can set a class as partitioner that is not used in
> the
> > > > end. I
> > > > > > > > think that
> > > > > > > > this will be confusing for our users. Have we considered
> > changing
> > > > the
> > > > > > > > default
> > > > > > > > value of partitioner.class to null to indicate that the new
> > > > built-in
> > > > > > > > partitioner
> > > > > > > > must be used? By default, the built-in partitioner would be
> > used
> > > > > unless
> > > > > > > the
> > > > > > > > user explicitly specify one. The downside is that the new
> > default
> > > > > > > behavior
> > > > > > > > would not work if the user explicitly specify the partitioner
> > but
> > > > we
> > > > > > > could
> > > > > > > > mitigate this with my next point.
> > > > > > > >
> > > > > > > > 3. Related to the previous point, I think that we could
> > deprecate
> > > > > both
> > > > > > > the
> > > > > > > > DefaultPartitioner and the UniformStickyPartitioner. I would
> > also
> > > > > add a
> > > > > > > > warning if one of them is explicitly provided by the user to
> > > inform
> > > > > > them
> > > > > > > > that they should switch to the new built-in one. I am pretty
> > sure
> > > > > that
> > > > > > > most
> > > > > > > > of the folks use the default configuration anyway.
> > > > > > > >
> > > > > > > > 4. It would be great if we could explain why the -1 way was
> > > > rejected.
> > > > > > At
> > > > > > > > the moment, the rejected alternative only explain the idea
> but
> > > does
> > > > > not
> > > > > > > > say why we rejected it.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > David
> > > > > > > >
> > > > > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > > Hi Jun,
> > > > > > > > >
> > > > > > > > > 2. Removed the option from the KIP.  Now the sticky
> > > partitioning
> > > > > > > > threshold
> > > > > > > > > is hardcoded to batch.size.
> > > > > > > > >
> > > > > > > > > 20. Added the corresponding wording to the KIP.
> > > > > > > > >
> > > > > > > > > -Artem
> > > > > > > > >
> > > > > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao
> > > <jun@confluent.io.invalid
> > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Artem,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply.
> > > > > > > > > >
> > > > > > > > > > 1. Sounds good.
> > > > > > > > > >
> > > > > > > > > > 2. If we don't expect users to change it, we probably
> could
> > > > just
> > > > > > > leave
> > > > > > > > out
> > > > > > > > > > the new config. In general, it's easy to add a new
> config,
> > > but
> > > > > hard
> > > > > > > to
> > > > > > > > > > remove an existing config.
> > > > > > > > > >
> > > > > > > > > > 20. The two new configs enable.adaptive.partitioning and
> > > > > > > > > > partition.availability.timeout.ms only apply to the two
> > > > built-in
> > > > > > > > > > partitioners DefaultPartitioner and
> > UniformStickyPartitioner,
> > > > > > right?
> > > > > > > It
> > > > > > > > > > would be useful to document that in the KIP.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Jun,
> > > > > > > > > > >
> > > > > > > > > > > Thank you for the suggestions.
> > > > > > > > > > >
> > > > > > > > > > > 1. As we discussed offline, we can hardcode the logic
> for
> > > > > > > > > > > DefaultPartitioner and UniformStickyPartitioner in the
> > > > > > > KafkaProducer
> > > > > > > > > > (i.e.
> > > > > > > > > > > the DefaultPartitioner.partition won't get called,
> > instead
> > > > > > > > KafkaProducer
> > > > > > > > > > > would check if the partitioner is an instance of
> > > > > > DefaultPartitioner
> > > > > > > > and
> > > > > > > > > > > then run the actual partitioning logic itself).  Then
> the
> > > > > change
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > Partitioner wouldn't be required.  I'll update the KIP
> to
> > > > > reflect
> > > > > > > > that.
> > > > > > > > > > >
> > > > > > > > > > > 2. I don't expect users to change this too often, as
> > > changing
> > > > > it
> > > > > > > > would
> > > > > > > > > > > require a bit of studying of the production patterns.
> > As a
> > > > > > general
> > > > > > > > > > > principle, if I can think of a model that requires a
> > > > deviation
> > > > > > from
> > > > > > > > > > > default, I tend to add a configuration option.  It
> could
> > be
> > > > > that
> > > > > > > > it'll
> > > > > > > > > > > never get used in practice, but I cannot prove that.
> I'm
> > > ok
> > > > > with
> > > > > > > > > > removing
> > > > > > > > > > > the option, let me know what you think.
> > > > > > > > > > >
> > > > > > > > > > > -Artem
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Since we control the implementation and the usage
> of
> > > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > another way is to instantiate the DefaultPartitioner
> > > with a
> > > > > > > special
> > > > > > > > > > > > constructor, which allows it to have more access to
> > > > internal
> > > > > > > > > > information.
> > > > > > > > > > > > Then we could just change the behavior of
> > > > DefaultPartitioner
> > > > > > > such
> > > > > > > > that
> > > > > > > > > > > it
> > > > > > > > > > > > can use the internal information when choosing the
> > > > partition.
> > > > > > This
> > > > > > > > seems
> > > > > > > > > > > > more intuitive than having DefaultPartitioner return
> -1
> > > > > > > partition.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. I guess partitioner.sticky.batch.size is
> introduced
> > > > > because
> > > > > > > the
> > > > > > > > > > > > effective batch size could be less than batch.size
> and
> > we
> > > > > want
> > > > > > to
> > > > > > > > align
> > > > > > > > > > > > partition switching with the effective batch size.
> How
> > > > would
> > > > > a
> > > > > > > user
> > > > > > > > > > know
> > > > > > > > > > > > the effective batch size to set
> > > > partitioner.sticky.batch.size
> > > > > > > > properly?
> > > > > > > > > > > If
> > > > > > > > > > > > the user somehow knows the effective batch size, does
> > > > setting
> > > > > > > > > > batch.size
> > > > > > > > > > > to
> > > > > > > > > > > > the effective batch size achieve the same result?
> > > > > > > > > > > >
> > > > > > > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Updated the KIP to add a couple paragraphs about
> > > > > > > > implementation
> > > > > > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. Sorry if my reply was confusing, what I meant to
> > say
> > > > > (and
> > > > > > I
> > > > > > > > > > > elaborated
> > > > > > > > > > > > > on that in point #3) is that there could be
> patterns
> > > for
> > > > > > which
> > > > > > > > 16KB
> > > > > > > > > > > > > wouldn't be the most effective setting, thus it
> would
> > > be
> > > > > good
> > > > > > > to
> > > > > > > > make
> > > > > > > > > > > it
> > > > > > > > > > > > > configurable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4. We could use broker readiness timeout.  But I'm
> > not
> > > > sure
> > > > > > it
> > > > > > > > would
> > > > > > > > > > > > > correctly model the broker load.  The problem is
> that
> > > > > latency
> > > > > > > is
> > > > > > > > not
> > > > > > > > > > an
> > > > > > > > > > > > > accurate measure of throughput, we could have 2
> > brokers
> > > > > that
> > > > > > > have
> > > > > > > > > > equal
> > > > > > > > > > > > > throughput but one has higher latency (so it takes
> > > larger
> > > > > > > batches
> > > > > > > > > > less
> > > > > > > > > > > > > frequently, but still takes the same load).
> > > > Latency-based
> > > > > > > logic
> > > > > > > > is
> > > > > > > > > > > > likely
> > > > > > > > > > > > > to send less data to the broker with higher
> latency.
> > > > Using
> > > > > > the
> > > > > > > > queue
> > > > > > > > > > > > size
> > > > > > > > > > > > > would adapt to throughput, regardless of latency
> > (which
> > > > > could
> > > > > > > be
> > > > > > > > > > just a
> > > > > > > > > > > > > result of deployment topology), so that's the model
> > > > chosen
> > > > > in
> > > > > > > the
> > > > > > > > > > > > > proposal.  The partition.availability.timeout.ms
> > logic
> > > > > > > > approaches
> > > > > > > > > > the
> > > > > > > > > > > > > model
> > > > > > > > > > > > > from a slightly different angle, say we have a
> > > > requirement
> > > > > to
> > > > > > > > deliver
> > > > > > > > > > > > > messages via brokers that have a certain latency,
> > then
> > > > > > > > > > > > > partition.availability.timeout.ms could be used to
> > > tune
> > > > > > that.
> > > > > > > > > > Latency
> > > > > > > > > > > > is
> > > > > > > > > > > > > a
> > > > > > > > > > > > > much more volatile metric than throughput (latency
> > > > depends
> > > > > on
> > > > > > > > > > external
> > > > > > > > > > > > > load, on capacity, on deployment topology, on
> jitter
> > in
> > > > > > > network,
> > > > > > > > on
> > > > > > > > > > > > jitter
> > > > > > > > > > > > > in disk, etc.) and I think it would be best to
> leave
> > > > > > > > latency-based
> > > > > > > > > > > > > thresholds configurable to tune for the
> environment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Artem
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > > > > > <ju...@confluent.io.invalid>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Perhaps you could elaborate a bit more on how
> > the
> > > > > > producer
> > > > > > > > > > > > determines
> > > > > > > > > > > > > > the partition if the partitioner returns -1. This
> > > will
> > > > > help
> > > > > > > > > > > understand
> > > > > > > > > > > > > why
> > > > > > > > > > > > > > encapsulating that logic as a partitioner is not
> > > clean.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. I am not sure that I understand this part. If
> > > 15.5KB
> > > > > is
> > > > > > > more
> > > > > > > > > > > > > efficient,
> > > > > > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. Yes, we could add a switch (or a variant of
> the
> > > > > > > > partitioner) for
> > > > > > > > > > > > > > enabling this behavior. Also, choosing partitions
> > > based
> > > > > on
> > > > > > > > broker
> > > > > > > > > > > > > readiness
> > > > > > > > > > > > > > can be made in a smoother way. For example, we
> > could
> > > > > track
> > > > > > > the
> > > > > > > > last
> > > > > > > > > > > > time
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > broker has drained any batches from the
> > accumulator.
> > > We
> > > > > can
> > > > > > > > then
> > > > > > > > > > > select
> > > > > > > > > > > > > > partitions from brokers proportionally to the
> > inverse
> > > > of
> > > > > > that
> > > > > > > > time.
> > > > > > > > > > > > This
> > > > > > > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > > > > > > partition.availability.timeout.ms
> > > > > > > > > > > > > >  threshold.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you for your feedback.  I've added the
> > > Rejected
> > > > > > > > Alternative
> > > > > > > > > > > > > section
> > > > > > > > > > > > > > > that may clarify some of the questions w.r.t.
> > > > returning
> > > > > > -1.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The
> > > problem
> > > > > is
> > > > > > > > that a
> > > > > > > > > > > > > > significant
> > > > > > > > > > > > > > > part of the logic needs to be in the producer
> > > > (because
> > > > > it
> > > > > > > now
> > > > > > > > > > uses
> > > > > > > > > > > > > > > information about brokers that only the
> producer
> > > > > knows),
> > > > > > so
> > > > > > > > > > > > > encapsulation
> > > > > > > > > > > > > > > of the logic within the default partitioner
> isn't
> > > as
> > > > > > clean.
> > > > > > > > > >  I've
> > > > > > > > > > > > > added
> > > > > > > > > > > > > > > the Rejected Alternative section that documents
> > an
> > > > > > attempt
> > > > > > > to
> > > > > > > > > > keep
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > encapsulation by providing new callbacks to the
> > > > > > > partitioner.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. The meaning of the
> > partitioner.sticky.batch.size
> > > > is
> > > > > > > > explained
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > Uniform Sticky Batch Size section.  Basically,
> we
> > > > track
> > > > > > the
> > > > > > > > > > amount
> > > > > > > > > > > of
> > > > > > > > > > > > > > bytes
> > > > > > > > > > > > > > > produced to the partition and if it exceeds
> > > > > > > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > > > > > then we switch to the next partition.  As far
> as
> > > the
> > > > > > reason
> > > > > > > > to
> > > > > > > > > > make
> > > > > > > > > > > > it
> > > > > > > > > > > > > > > different from batch.size, I think Luke
> answered
> > > this
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > > question
> > > > > > > > > > > > > > #3
> > > > > > > > > > > > > > > -- what if the load pattern is such that 15.5KB
> > > would
> > > > > be
> > > > > > > more
> > > > > > > > > > > > efficient
> > > > > > > > > > > > > > > than 16KB?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. I think it's hard to have one size that
> would
> > > fit
> > > > > all
> > > > > > > > > > patterns.
> > > > > > > > > > > > > E.g.
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > the load pattern is such that there is linger
> and
> > > the
> > > > > app
> > > > > > > > fills
> > > > > > > > > > the
> > > > > > > > > > > > > batch
> > > > > > > > > > > > > > > before linger expires, then having 16KB would
> > most
> > > > > likely
> > > > > > > > > > > synchronize
> > > > > > > > > > > > > > > batching and partition switching, so each
> > partition
> > > > > would
> > > > > > > > get a
> > > > > > > > > > > full
> > > > > > > > > > > > > > > batch.  If load pattern is such that there are
> a
> > > few
> > > > > > > > non-complete
> > > > > > > > > > > > > batches
> > > > > > > > > > > > > > > go out before a larger batch starts to fill,
> then
> > > it
> > > > > may
> > > > > > > > actually
> > > > > > > > > > > be
> > > > > > > > > > > > > > > beneficial to make slightly larger (e.g.
> > linger=0,
> > > > > first
> > > > > > > few
> > > > > > > > > > > records
> > > > > > > > > > > > go
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the first batch, then next few records go to
> > second
> > > > > > batch,
> > > > > > > > and so
> > > > > > > > > > > on,
> > > > > > > > > > > > > > until
> > > > > > > > > > > > > > > 5 in-flight, then larger batch would form while
> > > > waiting
> > > > > > for
> > > > > > > > > > broker
> > > > > > > > > > > to
> > > > > > > > > > > > > > > respond, but the partition switch would happen
> > > before
> > > > > the
> > > > > > > > larger
> > > > > > > > > > > > batch
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > full).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. There are a couple of reasons for
> introducing
> > > > > > > > > > > > > > > partition.availability.timeout.ms.  Luke's an
> > > Jun's
> > > > > > > > questions
> > > > > > > > > > are
> > > > > > > > > > > > > > slightly
> > > > > > > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > > > > > > (Luke) Is the queue size a good enough
> signal?  I
> > > > think
> > > > > > > it's
> > > > > > > > a
> > > > > > > > > > good
> > > > > > > > > > > > > > default
> > > > > > > > > > > > > > > signal as it tries to preserve general fairness
> > and
> > > > not
> > > > > > > > overreact
> > > > > > > > > > > on
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > broker's state at each moment in time.  But
> > because
> > > > > it's
> > > > > > > > smooth,
> > > > > > > > > > it
> > > > > > > > > > > > may
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > be reactive enough to instantaneous latency
> > jumps.
> > > > For
> > > > > > > > > > > > > latency-sensitive
> > > > > > > > > > > > > > > workloads, it may be desirable to react faster
> > > when a
> > > > > > > broker
> > > > > > > > > > > becomes
> > > > > > > > > > > > > > > unresponsive (but that may make the
> distribution
> > > > really
> > > > > > > > choppy),
> > > > > > > > > > so
> > > > > > > > > > > > > > > partition.availability.timeout.ms provides an
> > > > > > opportunity
> > > > > > > to
> > > > > > > > > > tune
> > > > > > > > > > > > > > > adaptiveness.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (Jun) Can we just not assign partitions to
> > brokers
> > > > that
> > > > > > are
> > > > > > > > not
> > > > > > > > > > > > ready?
> > > > > > > > > > > > > > > Switching partitions purely based on current
> > broker
> > > > > > > readiness
> > > > > > > > > > > > > information
> > > > > > > > > > > > > > > can really skew workload I think (or at least I
> > > > > couldn't
> > > > > > > > build a
> > > > > > > > > > > > model
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > proves that over time it's going to be
> generally
> > > > > fair), I
> > > > > > > > feel
> > > > > > > > > > that
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > algorithm should try to be fair in general and
> > use
> > > > > > smoother
> > > > > > > > > > signals
> > > > > > > > > > > > by
> > > > > > > > > > > > > > > default (e.g. a broker with choppier latency
> may
> > > get
> > > > > much
> > > > > > > > less
> > > > > > > > > > load
> > > > > > > > > > > > > even
> > > > > > > > > > > > > > > though it can handle throughput, it then may
> > > > > potentially
> > > > > > > skew
> > > > > > > > > > > > > > consumption),
> > > > > > > > > > > > > > > note that the queue-size-based logic uses
> > > > probabilities
> > > > > > (so
> > > > > > > > we
> > > > > > > > > > > don't
> > > > > > > > > > > > > > fully
> > > > > > > > > > > > > > > remove brokers, just make it less likely) and
> > > > relative
> > > > > > info
> > > > > > > > > > rather
> > > > > > > > > > > > > than a
> > > > > > > > > > > > > > > threshold (so if all brokers are heavily, but
> > > equally
> > > > > > > loaded,
> > > > > > > > > > they
> > > > > > > > > > > > will
> > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > equal distribution, rather than get removed
> > because
> > > > > they
> > > > > > > > exceed
> > > > > > > > > > > some
> > > > > > > > > > > > > > > threshold).  So at the very least, I would like
> > > this
> > > > > > logic
> > > > > > > > to be
> > > > > > > > > > > > turned
> > > > > > > > > > > > > > off
> > > > > > > > > > > > > > > by default as it's hard to predict what it
> could
> > do
> > > > > with
> > > > > > > > > > different
> > > > > > > > > > > > > > patterns
> > > > > > > > > > > > > > > (which means that there would need to be some
> > > > > > > > configuration).  We
> > > > > > > > > > > > could
> > > > > > > > > > > > > > > just not use brokers that are not ready, but
> > > again, I
> > > > > > think
> > > > > > > > that
> > > > > > > > > > > it's
> > > > > > > > > > > > > > good
> > > > > > > > > > > > > > > to try to be fair under normal circumstances,
> so
> > if
> > > > > > > normally
> > > > > > > > > > > brokers
> > > > > > > > > > > > > can
> > > > > > > > > > > > > > > respond under some
> > > partition.availability.timeout.ms
> > > > > > > > threshold
> > > > > > > > > > and
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > application works well with those latencies,
> then
> > > we
> > > > > > could
> > > > > > > > > > > distribute
> > > > > > > > > > > > > > data
> > > > > > > > > > > > > > > equally between brokers that don't exceed the
> > > > > latencies.
> > > > > > > The
> > > > > > > > > > > value,
> > > > > > > > > > > > of
> > > > > > > > > > > > > > > course, would depend on the environment and app
> > > > > > > requirements,
> > > > > > > > > > hence
> > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 10. Added a statement at the beginning of the
> > > > proposed
> > > > > > > > changes.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > > > > > > <jun@confluent.io.invalid
> > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. I agree with Luke that having the
> > partitioner
> > > > > > > returning
> > > > > > > > -1
> > > > > > > > > > is
> > > > > > > > > > > > kind
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > weird. Could we just change the
> implementation
> > of
> > > > > > > > > > > > DefaultPartitioner
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > new behavior?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar
> > > question
> > > > to
> > > > > > > > Luke. I
> > > > > > > > > > am
> > > > > > > > > > > > not
> > > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > > why we want to introduce this new
> > configuration.
> > > > > Could
> > > > > > we
> > > > > > > > just
> > > > > > > > > > > use
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > existing batch.size?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. I also agree with Luke that it's not clear
> > why
> > > > we
> > > > > > need
> > > > > > > > > > > > > > > > partition.availability.timeout.ms. The KIP
> > says
> > > > the
> > > > > > > broker
> > > > > > > > > > > "would
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > chosen until the broker is able to accept the
> > > next
> > > > > > ready
> > > > > > > > batch
> > > > > > > > > > > from
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > partition". If we are keeping track of this,
> > > could
> > > > we
> > > > > > > just
> > > > > > > > > > avoid
> > > > > > > > > > > > > > > assigning
> > > > > > > > > > > > > > > > records to partitions whose leader is not
> able
> > to
> > > > > > accept
> > > > > > > > the
> > > > > > > > > > next
> > > > > > > > > > > > > > batch?
> > > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 10. Currently, partitioner.class defaults to
> > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > > which
> > > > > > > > > > > > > > > uses
> > > > > > > > > > > > > > > > StickyPartitioner when the key is specified.
> > > Since
> > > > > this
> > > > > > > KIP
> > > > > > > > > > > > improves
> > > > > > > > > > > > > > upon
> > > > > > > > > > > > > > > > StickyPartitioner, it would be useful to make
> > the
> > > > new
> > > > > > > > behavior
> > > > > > > > > > > the
> > > > > > > > > > > > > > > default
> > > > > > > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > > > > > > > showuon@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Also, one more thing I think you need to
> > know.
> > > > > > > > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > mentioned, sometimes the custom partitioner
> > > would
> > > > > > > return
> > > > > > > > > > > negative
> > > > > > > > > > > > > > > > partition
> > > > > > > > > > > > > > > > > id accidentally.
> > > > > > > > > > > > > > > > > If it returned -1, how could you know if it
> > is
> > > > > > expected
> > > > > > > > or
> > > > > > > > > > not
> > > > > > > > > > > > > > > expected?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> > > > > > > > showuon@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the update. I have some
> > questions
> > > > > about
> > > > > > > it:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. Could you explain why you need the
> > > > > `partitioner`
> > > > > > > > return
> > > > > > > > > > > -1?
> > > > > > > > > > > > In
> > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > > case we need it? And how it is used in
> your
> > > > KIP?
> > > > > > > > > > > > > > > > > > 2. What does the
> > > > "partitioner.sticky.batch.size"
> > > > > > > mean?
> > > > > > > > In
> > > > > > > > > > the
> > > > > > > > > > > > > > > > > > "Configuration" part, you didn't explain
> > it.
> > > > And
> > > > > > > > default to
> > > > > > > > > > > 0,
> > > > > > > > > > > > I
> > > > > > > > > > > > > > > guess
> > > > > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > > > > the same as current behavior for backward
> > > > > > > > compatibility,
> > > > > > > > > > > right?
> > > > > > > > > > > > > You
> > > > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > > > mention it.
> > > > > > > > > > > > > > > > > > 3. I'm thinking we can have a threshold
> to
> > > the
> > > > > > > > > > > > > > > > > > "partitioner.sticky.batch.size". Let's
> say,
> > > we
> > > > > > > already
> > > > > > > > > > > > accumulate
> > > > > > > > > > > > > > > > 15.5KB
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > partition1, and sent. So when next batch
> > > > created,
> > > > > > in
> > > > > > > > your
> > > > > > > > > > > > current
> > > > > > > > > > > > > > > > design,
> > > > > > > > > > > > > > > > > > we still stick to partition1, until 16KB
> > > > reached,
> > > > > > and
> > > > > > > > then
> > > > > > > > > > we
> > > > > > > > > > > > > > create
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > > batch to change to next partition, ex:
> > > > > partition2.
> > > > > > > But
> > > > > > > > I
> > > > > > > > > > > think
> > > > > > > > > > > > if
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > threshold to 95% (for example), we can
> know
> > > > > > previous
> > > > > > > > 15.5KB
> > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > exceeds
> > > > > > > > > > > > > > > > > > the threshold so that we can directly
> > create
> > > > new
> > > > > > > batch
> > > > > > > > for
> > > > > > > > > > > next
> > > > > > > > > > > > > > > > records.
> > > > > > > > > > > > > > > > > > This way should be able to make it more
> > > > > efficient.
> > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > 4. I think the improved queuing logic
> > should
> > > be
> > > > > > good
> > > > > > > > > > enough.
> > > > > > > > > > > I
> > > > > > > > > > > > > > can't
> > > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > > > the benefit of having `
> > > > > > > > partition.availability.timeout.ms`
> > > > > > > > > > > > > config.
> > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > > > short, you want to make the partitioner
> > take
> > > > the
> > > > > > > broker
> > > > > > > > > > load
> > > > > > > > > > > > into
> > > > > > > > > > > > > > > > > > consideration. We can just improve that
> in
> > > the
> > > > > > > queuing
> > > > > > > > > > logic
> > > > > > > > > > > > (and
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > already did it). Why should we add the
> > > config?
> > > > > > Could
> > > > > > > > you
> > > > > > > > > > use
> > > > > > > > > > > > some
> > > > > > > > > > > > > > > > > examples
> > > > > > > > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem
> > > Livshits
> > > > > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >> Hello,
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Please add your comments about the KIP.
> > If
> > > > > there
> > > > > > > are
> > > > > > > > no
> > > > > > > > > > > > > > > > considerations,
> > > > > > > > > > > > > > > > > >> I'll put it up for vote in the next few
> > > days.
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> -Artem
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem
> > > Livshits
> > > > <
> > > > > > > > > > > > > > > alivshits@confluent.io
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> > Hello,
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > After trying a few prototypes, I've
> made
> > > > some
> > > > > > > > changes to
> > > > > > > > > > > the
> > > > > > > > > > > > > > > public
> > > > > > > > > > > > > > > > > >> > interface.  Please see the updated
> > > document
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > > > > >> > .
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > -Artem
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem
> > > > > Livshits <
> > > > > > > > > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> >> Hello,
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > > > > >> >> .
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > https://issues.apache.org/jira/browse/KAFKA-10888,
> > > > > > > > but
> > > > > > > > > > > it
> > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > >> include a
> > > > > > > > > > > > > > > > > >> >> client config change, therefore we
> > have a
> > > > KIP
> > > > > > to
> > > > > > > > > > discuss.
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> -Artem
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

32. Good point.  Do you think it's ok to defer the metrics until we run
some benchmarks so that we get a better idea of what metrics we need?

-Artem

On Thu, Mar 10, 2022 at 3:12 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem.
>
> Thanks for the reply. One more comment.
>
> 32. Do we need to add any new metric on the producer? For example, if
> partitioner.availability.timeout.ms is > 0, it might be useful to know the
> number of unavailable partitions.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hi Jun,
> >
> > 30.  Clarified.
> >
> > 31. I plan to do some benchmarking once implementation is finished, I'll
> > update the KIP with the results once I have them.  The reason to make it
> > default is that it won't be used otherwise and we won't know if it's good
> > or not in practical workloads.
> >
> > -Artem
> >
> > On Thu, Mar 10, 2022 at 11:42 AM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the updated KIP. A couple of more comments.
> > >
> > > 30. For the 3 new configs, it would be useful to make it clear that
> they
> > > are only relevant when the partitioner class is null.
> > >
> > > 31. partitioner.adaptive.partitioning.enable : I am wondering whether
> it
> > > should default to true. This is a more complex behavior than "uniform
> > > sticky" and may take some time to get right. If we do want to enable it
> > by
> > > default, it would be useful to validate it with some test results.
> > >
> > > Jun
> > >
> > >
> > >
> > >
> > > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Thank you for feedback, I've discussed this offline with some of the
> > > folks
> > > > and updated the KIP.  The main change is that now instead of using
> > > > DefaultPartitioner and UniformStickyPartitioner as flags, in the new
> > > > proposal the default partitioner is null, so if no custom partitioner
> > is
> > > > specified then the partitioning logic is implemented in
> KafkaProducer.
> > > > Compatibility section is updated as well.  Also the configuration
> > options
> > > > are renamed to be more consistent.
> > > >
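For illustration, a producer that relies on the new built-in partitioner would
simply leave partitioner.class unset. A minimal sketch, assuming the
configuration names discussed in this thread
(partitioner.adaptive.partitioning.enable and
partitioner.availability.timeout.ms), which may still change in the final KIP:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class BuiltInPartitionerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
            // No partitioner.class: the producer falls back to the built-in
            // partitioning logic described in the KIP.
            // The two keys below use the names discussed in this thread
            // (assumed here, not confirmed final API).
            props.put("partitioner.adaptive.partitioning.enable", "true");
            props.put("partitioner.availability.timeout.ms", "0");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Send records as usual; keyless records are spread by the
                // built-in logic instead of a user-supplied partitioner.
            }
        }
    }
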
> > > > -Artem
> > > >
> > > > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com> wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Thanks for your explanation and update to the KIP.
> > > > > Some comments:
> > > > >
> > > > > 5. In the description for `enable.adaptive.partitioning`, the
> `false`
> > > > case,
> > > > > you said:
> > > > > > the producer will try to distribute messages uniformly.
> > > > > I think we should describe the possible skewing distribution.
> > > Otherwise,
> > > > > user might be confused about why adaptive partitioning is
> important.
> > > > >
> > > > > 6. In the description for `partition.availability.timeout.ms`, I
> > think
> > > > we
> > > > > should mention in the last sentence about if
> > > > `enable.adaptive.partitioning`
> > > > > is disabled this logic is also disabled.
> > > > >
> > > > > 7. Similar thoughts as Ismael, I think we should have a POC and
> test
> > to
> > > > > prove that this adaptive partitioning algorithm can have better
> > uniform
> > > > > partitioning, compared with original sticky one.
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk>
> > wrote:
> > > > >
> > > > > > Regarding `3`, we should only deprecate it if we're sure the new
> > > > approach
> > > > > > handles all cases better. Are we confident about that for both of
> > the
> > > > > > previous partitioners?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> > > <djacot@confluent.io.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Thanks for the KIP! I have a few comments:
> > > > > > >
> > > > > > > 1. In the preamble of the proposed change section, there is
> > still a
> > > > > > > mention of the
> > > > > > > -1 approach. My understanding is that we have moved away from
> it
> > > now.
> > > > > > >
> > > > > > > 2. I am a bit concerned by the trick suggested about the
> > > > > > > DefaultPartitioner and
> > > > > > > the UniformStickyPartitioner. I do agree that implementing the
> > > logic
> > > > in
> > > > > > the
> > > > > > > producer itself is a good thing. However, it is weird from a
> user
> > > > > > > perspective
> > > > > > > that he can set a class as partitioner that is not used in the
> > > end. I
> > > > > > > think that
> > > > > > > this will be confusing for our users. Have we considered
> changing
> > > the
> > > > > > > default
> > > > > > > value of partitioner.class to null to indicate that the new
> > > built-in
> > > > > > > partitioner
> > > > > > > must be used? By default, the built-in partitioner would be
> used
> > > > unless
> > > > > > the
> > > > > > > user explicitly specify one. The downside is that the new
> default
> > > > > > behavior
> > > > > > > would not work if the user explicitly specify the partitioner
> but
> > > we
> > > > > > could
> > > > > > > mitigate this with my next point.
> > > > > > >
> > > > > > > 3. Related to the previous point, I think that we could
> deprecate
> > > > both
> > > > > > the
> > > > > > > DefaultPartitioner and the UniformStickyPartitioner. I would
> also
> > > > add a
> > > > > > > warning if one of them is explicitly provided by the user to
> > inform
> > > > > them
> > > > > > > that they should switch to the new built-in one. I am pretty
> sure
> > > > that
> > > > > > most
> > > > > > > of the folks use the default configuration anyway.
> > > > > > >
> > > > > > > 4. It would be great if we could explain why the -1 way was
> > > rejected.
> > > > > At
> > > > > > > the moment, the rejected alternative only explain the idea but
> > does
> > > > not
> > > > > > > say why we rejected it.
> > > > > > >
> > > > > > > Best,
> > > > > > > David
> > > > > > >
> > > > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > Hi Jun,
> > > > > > > >
> > > > > > > > 2. Removed the option from the KIP.  Now the sticky
> > partitioning
> > > > > > > threshold
> > > > > > > > is hardcoded to batch.size.
> > > > > > > >
> > > > > > > > 20. Added the corresponding wording to the KIP.
> > > > > > > >
> > > > > > > > -Artem
> > > > > > > >
> > > > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao
> > <jun@confluent.io.invalid
> > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Artem,
> > > > > > > > >
> > > > > > > > > Thanks for the reply.
> > > > > > > > >
> > > > > > > > > 1. Sounds good.
> > > > > > > > >
> > > > > > > > > 2. If we don't expect users to change it, we probably could
> > > just
> > > > > > leave
> > > > > > > out
> > > > > > > > > the new config. In general, it's easy to add a new config,
> > but
> > > > hard
> > > > > > to
> > > > > > > > > remove an existing config.
> > > > > > > > >
> > > > > > > > > 20. The two new configs enable.adaptive.partitioning and
> > > > > > > > > partition.availability.timeout.ms only apply to the two
> > > built-in
> > > > > > > > > partitioners DefaultPartitioner and
> UniformStickyPartitioner,
> > > > > right?
> > > > > > It
> > > > > > > > > would be useful to document that in the KIP.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jun,
> > > > > > > > > >
> > > > > > > > > > Thank you for the suggestions.
> > > > > > > > > >
> > > > > > > > > > 1. As we discussed offline, we can hardcode the logic for
> > > > > > > > > > DefaultPartitioner and UniformStickyPartitioner in the
> > > > > > KafkaProducer
> > > > > > > > > (i.e.
> > > > > > > > > > the DefaultPartitioner.partition won't get called,
> instead
> > > > > > > KafkaProducer
> > > > > > > > > > would check if the partitioner is an instance of
> > > > > DefaultPartitioner
> > > > > > > and
> > > > > > > > > > then run the actual partitioning logic itself).  Then the
> > > > change
> > > > > to
> > > > > > > the
> > > > > > > > > > Partitioner wouldn't be required.  I'll update the KIP to
> > > > reflect
> > > > > > > that.
> > > > > > > > > >
> > > > > > > > > > 2. I don't expect users to change this too often, as
> > changing
> > > > it
> > > > > > > would
> > > > > > > > > > require a bit of studying of the production patterns.
> As a
> > > > > general
> > > > > > > > > > principle, if I can think of a model that requires a
> > > deviation
> > > > > from
> > > > > > > > > > default, I tend to add a configuration option.  It could
> be
> > > > that
> > > > > > > it'll
> > > > > > > > > > never get used in practice, but I cannot prove that.  I'm
> > ok
> > > > with
> > > > > > > > > removing
> > > > > > > > > > the option, let me know what you think.
> > > > > > > > > >
> > > > > > > > > > -Artem
> > > > > > > > > >
> > > > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Artem,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > >
> > > > > > > > > > > 1. Since we control the implementation and the usage of
> > > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > another way is to instantiate the DefaultPartitioner
> > with a
> > > > > > special
> > > > > > > > > > > constructor, which allows it to have more access to
> > > internal
> > > > > > > > > information.
> > > > > > > > > > > Then we could just change the behavior of
> > > DefaultPartitioner
> > > > > > such
> > > > > > > that
> > > > > > > > > > it
> > > > > > > > > > > can use the internal information when choosing the
> > > partition.
> > > > > This
> > > > > > > seems
> > > > > > > > > > > more intuitive than having DefaultPartitioner return -1
> > > > > > partition.
> > > > > > > > > > >
> > > > > > > > > > > 2. I guess partitioner.sticky.batch.size is introduced
> > > > because
> > > > > > the
> > > > > > > > > > > effective batch size could be less than batch.size and
> we
> > > > want
> > > > > to
> > > > > > > align
> > > > > > > > > > > partition switching with the effective batch size. How
> > > would
> > > > a
> > > > > > user
> > > > > > > > > know
> > > > > > > > > > > the effective batch size to set
> > > partitioner.sticky.batch.size
> > > > > > > properly?
> > > > > > > > > > If
> > > > > > > > > > > the user somehow knows the effective batch size, does
> > > setting
> > > > > > > > > batch.size
> > > > > > > > > > to
> > > > > > > > > > > the effective batch size achieve the same result?
> > > > > > > > > > >
> > > > > > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Updated the KIP to add a couple paragraphs about
> > > > > > > implementation
> > > > > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. Sorry if my reply was confusing, what I meant to
> say
> > > > (and
> > > > > I
> > > > > > > > > > elaborated
> > > > > > > > > > > > on that in point #3) is that there could be patterns
> > for
> > > > > which
> > > > > > > 16KB
> > > > > > > > > > > > wouldn't be the most effective setting, thus it would
> > be
> > > > good
> > > > > > to
> > > > > > > make
> > > > > > > > > > it
> > > > > > > > > > > > configurable.
> > > > > > > > > > > >
> > > > > > > > > > > > 4. We could use broker readiness timeout.  But I'm
> not
> > > sure
> > > > > it
> > > > > > > would
> > > > > > > > > > > > correctly model the broker load.  The problem is that
> > > > latency
> > > > > > is
> > > > > > > not
> > > > > > > > > an
> > > > > > > > > > > > accurate measure of throughput, we could have 2
> brokers
> > > > that
> > > > > > have
> > > > > > > > > equal
> > > > > > > > > > > > throughput but one has higher latency (so it takes
> > larger
> > > > > > batches
> > > > > > > > > less
> > > > > > > > > > > > frequently, but still takes the same load).
> > > Latency-based
> > > > > > logic
> > > > > > > is
> > > > > > > > > > > likely
> > > > > > > > > > > > to send less data to the broker with higher latency.
> > > Using
> > > > > the
> > > > > > > queue
> > > > > > > > > > > size
> > > > > > > > > > > > would adapt to throughput, regardless of latency
> (which
> > > > could
> > > > > > be
> > > > > > > > > just a
> > > > > > > > > > > > result of deployment topology), so that's the model
> > > chosen
> > > > in
> > > > > > the
> > > > > > > > > > > > proposal.  The partition.availability.timeout.ms
> logic
> > > > > > > approaches
> > > > > > > > > the
> > > > > > > > > > > > model
> > > > > > > > > > > > from a slightly different angle, say we have a
> > > requirement
> > > > to
> > > > > > > deliver
> > > > > > > > > > > > messages via brokers that have a certain latency,
> then
> > > > > > > > > > > > partition.availability.timeout.ms could be used to
> > tune
> > > > > that.
> > > > > > > > > Latency
> > > > > > > > > > > is
> > > > > > > > > > > > a
> > > > > > > > > > > > much more volatile metric than throughput (latency
> > > depends
> > > > on
> > > > > > > > > external
> > > > > > > > > > > > load, on capacity, on deployment topology, on jitter
> in
> > > > > > network,
> > > > > > > on
> > > > > > > > > > > jitter
> > > > > > > > > > > > in disk, etc.) and I think it would be best to leave
> > > > > > > latency-based
> > > > > > > > > > > > thresholds configurable to tune for the environment.
> > > > > > > > > > > >
> > > > > > > > > > > > -Artem
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > > > > <ju...@confluent.io.invalid>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Perhaps you could elaborate a bit more on how
> the
> > > > > producer
> > > > > > > > > > > determines
> > > > > > > > > > > > > the partition if the partitioner returns -1. This
> > will
> > > > help
> > > > > > > > > > understand
> > > > > > > > > > > > why
> > > > > > > > > > > > > encapsulating that logic as a partitioner is not
> > clean.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. I am not sure that I understand this part. If
> > 15.5KB
> > > > is
> > > > > > more
> > > > > > > > > > > > efficient,
> > > > > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4. Yes, we could add a switch (or a variant of the
> > > > > > > partitioner) for
> > > > > > > > > > > > > enabling this behavior. Also, choosing partitions
> > based
> > > > on
> > > > > > > broker
> > > > > > > > > > > > readiness
> > > > > > > > > > > > > can be made in a smoother way. For example, we
> could
> > > > track
> > > > > > the
> > > > > > > last
> > > > > > > > > > > time
> > > > > > > > > > > > a
> > > > > > > > > > > > > broker has drained any batches from the
> accumulator.
> > We
> > > > can
> > > > > > > then
> > > > > > > > > > select
> > > > > > > > > > > > > partitions from brokers proportionally to the
> inverse
> > > of
> > > > > that
> > > > > > > time.
> > > > > > > > > > > This
> > > > > > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > > > > > partition.availability.timeout.ms
> > > > > > > > > > > > >  threshold.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jun
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you for your feedback.  I've added the
> > Rejected
> > > > > > > Alternative
> > > > > > > > > > > > section
> > > > > > > > > > > > > > that may clarify some of the questions w.r.t.
> > > returning
> > > > > -1.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The
> > problem
> > > > is
> > > > > > > that a
> > > > > > > > > > > > > significant
> > > > > > > > > > > > > > part of the logic needs to be in the producer
> > > (because
> > > > it
> > > > > > now
> > > > > > > > > uses
> > > > > > > > > > > > > > information about brokers that only the producer
> > > > knows),
> > > > > so
> > > > > > > > > > > > encapsulation
> > > > > > > > > > > > > > of the logic within the default partitioner isn't
> > as
> > > > > clean.
> > > > > > > > >  I've
> > > > > > > > > > > > added
> > > > > > > > > > > > > > the Rejected Alternative section that documents
> an
> > > > > attempt
> > > > > > to
> > > > > > > > > keep
> > > > > > > > > > > the
> > > > > > > > > > > > > > encapsulation by providing new callbacks to the
> > > > > > partitioner.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. The meaning of the
> partitioner.sticky.batch.size
> > > is
> > > > > > > explained
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > > > Uniform Sticky Batch Size section.  Basically, we
> > > track
> > > > > the
> > > > > > > > > amount
> > > > > > > > > > of
> > > > > > > > > > > > > bytes
> > > > > > > > > > > > > > produced to the partition and if it exceeds
> > > > > > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > > > > then we switch to the next partition.  As far as
> > the
> > > > > reason
> > > > > > > to
> > > > > > > > > make
> > > > > > > > > > > it
> > > > > > > > > > > > > > different from batch.size, I think Luke answered
> > this
> > > > > with
> > > > > > > the
> > > > > > > > > > > question
> > > > > > > > > > > > > #3
> > > > > > > > > > > > > > -- what if the load pattern is such that 15.5KB
> > would
> > > > be
> > > > > > more
> > > > > > > > > > > efficient
> > > > > > > > > > > > > > than 16KB?
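
To make the switching rule concrete, here is a rough sketch of the
byte-tracking idea described above (illustrative only, with made-up names; the
real logic lives inside the producer's accumulator, and in the current
proposal the threshold is hardcoded to batch.size):

    import java.util.concurrent.ThreadLocalRandom;

    // Stick to one partition until roughly stickyBatchSize bytes have been
    // produced to it, then switch to another randomly chosen partition.
    class StickyPartitionSwitch {
        private final int numPartitions;
        private final int stickyBatchSize;   // e.g. batch.size (16384 by default)
        private int currentPartition;
        private int bytesInCurrentPartition;

        StickyPartitionSwitch(int numPartitions, int stickyBatchSize) {
            this.numPartitions = numPartitions;
            this.stickyBatchSize = stickyBatchSize;
            this.currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
        }

        int partitionFor(int estimatedRecordSizeBytes) {
            if (bytesInCurrentPartition >= stickyBatchSize) {
                // Threshold crossed: pick a new partition and reset the counter.
                currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
                bytesInCurrentPartition = 0;
            }
            bytesInCurrentPartition += estimatedRecordSizeBytes;
            return currentPartition;
        }
    }
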
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. I think it's hard to have one size that would
> > fit
> > > > all
> > > > > > > > > patterns.
> > > > > > > > > > > > E.g.
> > > > > > > > > > > > > if
> > > > > > > > > > > > > > the load pattern is such that there is linger and
> > the
> > > > app
> > > > > > > fills
> > > > > > > > > the
> > > > > > > > > > > > batch
> > > > > > > > > > > > > > before linger expires, then having 16KB would
> most
> > > > likely
> > > > > > > > > > synchronize
> > > > > > > > > > > > > > batching and partition switching, so each
> partition
> > > > would
> > > > > > > get a
> > > > > > > > > > full
> > > > > > > > > > > > > > batch.  If load pattern is such that there are a
> > few
> > > > > > > non-complete
> > > > > > > > > > > > batches
> > > > > > > > > > > > > > go out before a larger batch starts to fill, then
> > it
> > > > may
> > > > > > > actually
> > > > > > > > > > be
> > > > > > > > > > > > > > beneficial to make slightly larger (e.g.
> linger=0,
> > > > first
> > > > > > few
> > > > > > > > > > records
> > > > > > > > > > > go
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > the first batch, then next few records go to
> second
> > > > > batch,
> > > > > > > and so
> > > > > > > > > > on,
> > > > > > > > > > > > > until
> > > > > > > > > > > > > > 5 in-flight, then larger batch would form while
> > > waiting
> > > > > for
> > > > > > > > > broker
> > > > > > > > > > to
> > > > > > > > > > > > > > respond, but the partition switch would happen
> > before
> > > > the
> > > > > > > larger
> > > > > > > > > > > batch
> > > > > > > > > > > > is
> > > > > > > > > > > > > > full).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > > > > > > > partition.availability.timeout.ms.  Luke's an
> > Jun's
> > > > > > > questions
> > > > > > > > > are
> > > > > > > > > > > > > slightly
> > > > > > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > > > > > (Luke) Is the queue size a good enough signal?  I
> > > think
> > > > > > it's
> > > > > > > a
> > > > > > > > > good
> > > > > > > > > > > > > default
> > > > > > > > > > > > > > signal as it tries to preserve general fairness
> and
> > > not
> > > > > > > overreact
> > > > > > > > > > on
> > > > > > > > > > > > the
> > > > > > > > > > > > > > broker's state at each moment in time.  But
> because
> > > > it's
> > > > > > > smooth,
> > > > > > > > > it
> > > > > > > > > > > may
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > be reactive enough to instantaneous latency
> jumps.
> > > For
> > > > > > > > > > > > latency-sensitive
> > > > > > > > > > > > > > workloads, it may be desirable to react faster
> > when a
> > > > > > broker
> > > > > > > > > > becomes
> > > > > > > > > > > > > > unresponsive (but that may make the distribution
> > > really
> > > > > > > choppy),
> > > > > > > > > so
> > > > > > > > > > > > > > partition.availability.timeout.ms provides an
> > > > > opportunity
> > > > > > to
> > > > > > > > > tune
> > > > > > > > > > > > > > adaptiveness.
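
As a rough sketch of how such a latency cutoff could work (illustrative only;
the KIP's wording is that a partition "would not be chosen until the broker is
able to accept the next ready batch", and the names below are made up for the
sketch):

    import java.util.HashMap;
    import java.util.Map;

    // A broker that has not accepted (drained) a ready batch within the
    // configured timeout is skipped when choosing the next partition.
    class BrokerAvailabilityTracker {
        private final long availabilityTimeoutMs;  // partition.availability.timeout.ms
        private final Map<Integer, Long> lastDrainTimeMs = new HashMap<>();

        BrokerAvailabilityTracker(long availabilityTimeoutMs) {
            this.availabilityTimeoutMs = availabilityTimeoutMs;
        }

        void recordDrain(int brokerId, long nowMs) {
            lastDrainTimeMs.put(brokerId, nowMs);
        }

        boolean isAvailable(int brokerId, long nowMs) {
            if (availabilityTimeoutMs <= 0)
                return true;                        // a value of 0 disables the check
            Long last = lastDrainTimeMs.get(brokerId);
            return last == null || nowMs - last <= availabilityTimeoutMs;
        }
    }
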
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (Jun) Can we just not assign partitions to
> brokers
> > > that
> > > > > are
> > > > > > > not
> > > > > > > > > > > ready?
> > > > > > > > > > > > > > Switching partitions purely based on current
> broker
> > > > > > readiness
> > > > > > > > > > > > information
> > > > > > > > > > > > > > can really skew workload I think (or at least I
> > > > couldn't
> > > > > > > build a
> > > > > > > > > > > model
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > proves that over time it's going to be generally
> > > > fair), I
> > > > > > > feel
> > > > > > > > > that
> > > > > > > > > > > the
> > > > > > > > > > > > > > algorithm should try to be fair in general and
> use
> > > > > smoother
> > > > > > > > > signals
> > > > > > > > > > > by
> > > > > > > > > > > > > > default (e.g. a broker with choppier latency may
> > get
> > > > much
> > > > > > > less
> > > > > > > > > load
> > > > > > > > > > > > even
> > > > > > > > > > > > > > though it can handle throughput, it then may
> > > > potentially
> > > > > > skew
> > > > > > > > > > > > > consumption),
> > > > > > > > > > > > > > note that the queue-size-based logic uses
> > > probabilities
> > > > > (so
> > > > > > > we
> > > > > > > > > > don't
> > > > > > > > > > > > > fully
> > > > > > > > > > > > > > remove brokers, just make it less likely) and
> > > relative
> > > > > info
> > > > > > > > > rather
> > > > > > > > > > > > than a
> > > > > > > > > > > > > > threshold (so if all brokers are heavily, but
> > equally
> > > > > > loaded,
> > > > > > > > > they
> > > > > > > > > > > will
> > > > > > > > > > > > > get
> > > > > > > > > > > > > > equal distribution, rather than get removed
> because
> > > > they
> > > > > > > exceed
> > > > > > > > > > some
> > > > > > > > > > > > > > threshold).  So at the very least, I would like
> > this
> > > > > logic
> > > > > > > to be
> > > > > > > > > > > turned
> > > > > > > > > > > > > off
> > > > > > > > > > > > > > by default as it's hard to predict what it could
> do
> > > > with
> > > > > > > > > different
> > > > > > > > > > > > > patterns
> > > > > > > > > > > > > > (which means that there would need to be some
> > > > > > > configuration).  We
> > > > > > > > > > > could
> > > > > > > > > > > > > > just not use brokers that are not ready, but
> > again, I
> > > > > think
> > > > > > > that
> > > > > > > > > > it's
> > > > > > > > > > > > > good
> > > > > > > > > > > > > > to try to be fair under normal circumstances, so
> if
> > > > > > normally
> > > > > > > > > > brokers
> > > > > > > > > > > > can
> > > > > > > > > > > > > > respond under some
> > partition.availability.timeout.ms
> > > > > > > threshold
> > > > > > > > > and
> > > > > > > > > > > the
> > > > > > > > > > > > > > application works well with those latencies, then
> > we
> > > > > could
> > > > > > > > > > distribute
> > > > > > > > > > > > > data
> > > > > > > > > > > > > > equally between brokers that don't exceed the
> > > > latencies.
> > > > > > The
> > > > > > > > > > value,
> > > > > > > > > > > of
> > > > > > > > > > > > > > course, would depend on the environment and app
> > > > > > requirements,
> > > > > > > > > hence
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > configurable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 10. Added a statement at the beginning of the
> > > proposed
> > > > > > > changes.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Artem
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > > > > > <jun@confluent.io.invalid
> > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. I agree with Luke that having the
> partitioner
> > > > > > returning
> > > > > > > -1
> > > > > > > > > is
> > > > > > > > > > > kind
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > weird. Could we just change the implementation
> of
> > > > > > > > > > > DefaultPartitioner
> > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > new behavior?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar
> > question
> > > to
> > > > > > > Luke. I
> > > > > > > > > am
> > > > > > > > > > > not
> > > > > > > > > > > > > sure
> > > > > > > > > > > > > > > why we want to introduce this new
> configuration.
> > > > Could
> > > > > we
> > > > > > > just
> > > > > > > > > > use
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > existing batch.size?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. I also agree with Luke that it's not clear
> why
> > > we
> > > > > need
> > > > > > > > > > > > > > > partition.availability.timeout.ms. The KIP
> says
> > > the
> > > > > > broker
> > > > > > > > > > "would
> > > > > > > > > > > > not
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > chosen until the broker is able to accept the
> > next
> > > > > ready
> > > > > > > batch
> > > > > > > > > > from
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > partition". If we are keeping track of this,
> > could
> > > we
> > > > > > just
> > > > > > > > > avoid
> > > > > > > > > > > > > > assigning
> > > > > > > > > > > > > > > records to partitions whose leader is not able
> to
> > > > > accept
> > > > > > > the
> > > > > > > > > next
> > > > > > > > > > > > > batch?
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 10. Currently, partitioner.class defaults to
> > > > > > > > > DefaultPartitioner,
> > > > > > > > > > > > which
> > > > > > > > > > > > > > uses
> > > > > > > > > > > > > > > StickyPartitioner when the key is specified.
> > Since
> > > > this
> > > > > > KIP
> > > > > > > > > > > improves
> > > > > > > > > > > > > upon
> > > > > > > > > > > > > > > StickyPartitioner, it would be useful to make
> the
> > > new
> > > > > > > behavior
> > > > > > > > > > the
> > > > > > > > > > > > > > default
> > > > > > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > > > > > > showuon@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Also, one more thing I think you need to
> know.
> > > > > > > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > mentioned, sometimes the custom partitioner
> > would
> > > > > > return
> > > > > > > > > > negative
> > > > > > > > > > > > > > > partition
> > > > > > > > > > > > > > > > id accidentally.
> > > > > > > > > > > > > > > > If it returned -1, how could you know if it
> is
> > > > > expected
> > > > > > > or
> > > > > > > > > not
> > > > > > > > > > > > > > expected?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> > > > > > > showuon@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the update. I have some
> questions
> > > > about
> > > > > > it:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. Could you explain why you need the
> > > > `partitioner`
> > > > > > > return
> > > > > > > > > > -1?
> > > > > > > > > > > In
> > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > case we need it? And how it is used in your
> > > KIP?
> > > > > > > > > > > > > > > > > 2. What does the
> > > "partitioner.sticky.batch.size"
> > > > > > mean?
> > > > > > > In
> > > > > > > > > the
> > > > > > > > > > > > > > > > > "Configuration" part, you didn't explain
> it.
> > > And
> > > > > > > default to
> > > > > > > > > > 0,
> > > > > > > > > > > I
> > > > > > > > > > > > > > guess
> > > > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > > > the same as current behavior for backward
> > > > > > > compatibility,
> > > > > > > > > > right?
> > > > > > > > > > > > You
> > > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > > mention it.
> > > > > > > > > > > > > > > > > 3. I'm thinking we can have a threshold to
> > the
> > > > > > > > > > > > > > > > > "partitioner.sticky.batch.size". Let's say,
> > we
> > > > > > already
> > > > > > > > > > > accumulate
> > > > > > > > > > > > > > > 15.5KB
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > partition1, and sent. So when next batch
> > > created,
> > > > > in
> > > > > > > your
> > > > > > > > > > > current
> > > > > > > > > > > > > > > design,
> > > > > > > > > > > > > > > > > we still stick to partition1, until 16KB
> > > reached,
> > > > > and
> > > > > > > then
> > > > > > > > > we
> > > > > > > > > > > > > create
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > > batch to change to next partition, ex:
> > > > partition2.
> > > > > > But
> > > > > > > I
> > > > > > > > > > think
> > > > > > > > > > > if
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > threshold to 95% (for example), we can know
> > > > > previous
> > > > > > > 15.5KB
> > > > > > > > > > > > already
> > > > > > > > > > > > > > > > exceeds
> > > > > > > > > > > > > > > > > the threshold so that we can directly
> create
> > > new
> > > > > > batch
> > > > > > > for
> > > > > > > > > > next
> > > > > > > > > > > > > > > records.
> > > > > > > > > > > > > > > > > This way should be able to make it more
> > > > efficient.
> > > > > > > WDYT?
> > > > > > > > > > > > > > > > > 4. I think the improved queuing logic
> should
> > be
> > > > > good
> > > > > > > > > enough.
> > > > > > > > > > I
> > > > > > > > > > > > > can't
> > > > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > > the benefit of having `
> > > > > > > partition.availability.timeout.ms`
> > > > > > > > > > > > config.
> > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > > short, you want to make the partitioner
> take
> > > the
> > > > > > broker
> > > > > > > > > load
> > > > > > > > > > > into
> > > > > > > > > > > > > > > > > consideration. We can just improve that in
> > the
> > > > > > queuing
> > > > > > > > > logic
> > > > > > > > > > > (and
> > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > already did it). Why should we add the
> > config?
> > > > > Could
> > > > > > > you
> > > > > > > > > use
> > > > > > > > > > > some
> > > > > > > > > > > > > > > > examples
> > > > > > > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem
> > Livshits
> > > > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >> Hello,
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Please add your comments about the KIP.
> If
> > > > there
> > > > > > are
> > > > > > > no
> > > > > > > > > > > > > > > considerations,
> > > > > > > > > > > > > > > > >> I'll put it up for vote in the next few
> > days.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> -Artem
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem
> > Livshits
> > > <
> > > > > > > > > > > > > > alivshits@confluent.io
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> > Hello,
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > After trying a few prototypes, I've made
> > > some
> > > > > > > changes to
> > > > > > > > > > the
> > > > > > > > > > > > > > public
> > > > > > > > > > > > > > > > >> > interface.  Please see the updated
> > document
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > > > >> > .
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > -Artem
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem
> > > > Livshits <
> > > > > > > > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> >> Hello,
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > > > >> >> .
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > > > > > > > >> >>
> > > > > > https://issues.apache.org/jira/browse/KAFKA-10888,
> > > > > > > but
> > > > > > > > > > it
> > > > > > > > > > > > does
> > > > > > > > > > > > > > > > >> include a
> > > > > > > > > > > > > > > > >> >> client config change, therefore we
> have a
> > > KIP
> > > > > to
> > > > > > > > > discuss.
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> -Artem
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem.

Thanks for the reply. One more comment.

32. Do we need to add any new metric on the producer? For example, if
partitioner.availability.timeout.ms is > 0, it might be useful to know the
number of unavailable partitions.
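
For illustration, a gauge along these lines could be registered through the
client metrics machinery roughly as follows (a sketch only: the metric name,
group, and the unavailablePartitions set are made-up placeholders, not part of
the KIP):

    // Assumes org.apache.kafka.common.MetricName, org.apache.kafka.common.TopicPartition,
    // org.apache.kafka.common.metrics.{Metrics, Measurable}, java.util.Set and
    // java.util.concurrent.ConcurrentHashMap; "metrics" is the producer's metrics registry.
    // Hypothetical set maintained by the partitioner logic for partitions that are
    // currently past the availability timeout.
    Set<TopicPartition> unavailablePartitions = ConcurrentHashMap.newKeySet();

    MetricName name = metrics.metricName(
        "unavailable-partitions",            // illustrative name
        "producer-metrics",
        "Number of partitions the built-in partitioner currently considers unavailable");
    metrics.addMetric(name, (Measurable) (config, now) -> unavailablePartitions.size());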

Thanks,

Jun

On Thu, Mar 10, 2022 at 12:46 PM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hi Jun,
>
> 30.  Clarified.
>
> 31. I plan to do some benchmarking once the implementation is finished; I'll
> update the KIP with the results once I have them.  The reason to make it the
> default is that otherwise it won't be used, and we won't know whether it works
> well in practical workloads.
>
> -Artem
>
> On Thu, Mar 10, 2022 at 11:42 AM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the updated KIP. A couple more comments.
> >
> > 30. For the 3 new configs, it would be useful to make it clear that they
> > are only relevant when the partitioner class is null.
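
For readers following along, a producer relying on the built-in logic would be
configured roughly like the snippet below; the property names are the ones used
in this thread and may still change as the KIP is finalized:

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // partitioner.class is deliberately left unset (null) so the built-in logic applies;
    // the settings below are only relevant in that case.
    props.put("partitioner.adaptive.partitioning.enable", "true");
    props.put("partitioner.availability.timeout.ms", "0");   // 0 = availability cutoff disabled
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);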
> >
> > 31. partitioner.adaptive.partitioning.enable: I am wondering whether it
> > should default to true. This is a more complex behavior than "uniform
> > sticky" and may take some time to get right. If we do want to enable it by
> > default, it would be useful to validate it with some test results.
> >
> > Jun
> >
> >
> >
> >
> > On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Thank you for the feedback. I've discussed this offline with some of the
> > > folks and updated the KIP.  The main change is that now, instead of using
> > > DefaultPartitioner and UniformStickyPartitioner as flags, in the new
> > > proposal the default partitioner is null, so if no custom partitioner is
> > > specified then the partitioning logic is implemented in KafkaProducer.
> > > The Compatibility section is updated as well, and the configuration
> > > options are renamed to be more consistent.
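
Conceptually, the dispatch inside the producer would look roughly like this
(a sketch of the idea only, not the actual KafkaProducer code; builtInPartitioner
is a made-up name):

    private int partition(ProducerRecord<K, V> record, byte[] keyBytes,
                          byte[] valueBytes, Cluster cluster) {
        if (record.partition() != null)
            return record.partition();                    // partition pinned by the caller
        if (partitioner != null)                          // user supplied partitioner.class
            return partitioner.partition(record.topic(), record.key(), keyBytes,
                                         record.value(), valueBytes, cluster);
        // partitioner.class is null: built-in uniform sticky logic, which switches
        // partitions after roughly batch.size bytes and can weight the choice by
        // broker load when adaptive partitioning is enabled.
        return builtInPartitioner.partition(record.topic(), cluster);
    }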
> > >
> > > -Artem
> > >

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

30.  Clarified.

31. I plan to do some benchmarking once the implementation is finished; I'll
update the KIP with the results once I have them.  The reason to make it the
default is that otherwise it won't be used, and we won't know whether it works
well in practical workloads.

-Artem

On Thu, Mar 10, 2022 at 11:42 AM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> Thanks for the updated KIP. A couple more comments.
>
> 30. For the 3 new configs, it would be useful to make it clear that they
> are only relevant when the partitioner class is null.
>
> 31. partitioner.adaptive.partitioning.enable: I am wondering whether it
> should default to true. This is a more complex behavior than "uniform
> sticky" and may take some time to get right. If we do want to enable it by
> default, it would be useful to validate it with some test results.
>
> Jun
>
>
>
>
> On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Thank you for the feedback. I've discussed this offline with some of the
> > folks and updated the KIP.  The main change is that now, instead of using
> > DefaultPartitioner and UniformStickyPartitioner as flags, in the new
> > proposal the default partitioner is null, so if no custom partitioner is
> > specified then the partitioning logic is implemented in KafkaProducer.
> > The Compatibility section is updated as well, and the configuration
> > options are renamed to be more consistent.
> >
> > -Artem
> >
> > On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com> wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for your explanation and update to the KIP.
> > > Some comments:
> > >
> > > 5. In the description for `enable.adaptive.partitioning`, the `false` case,
> > > you said:
> > > > the producer will try to distribute messages uniformly.
> > > I think we should describe the possible skew in the distribution. Otherwise,
> > > users might be confused about why adaptive partitioning is important.
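
For intuition, the queue-size-based weighting that adaptive partitioning refers
to could look roughly like the toy sketch below (illustrative only, not the
KIP's code; loaded brokers become less likely to be picked rather than being
excluded):

    // Assumes org.apache.kafka.common.{PartitionInfo, Node} and java.util.*.
    static int pickPartition(List<PartitionInfo> partitions,
                             Map<Integer, Integer> brokerQueueSizes,
                             Random random) {
        double[] weights = new double[partitions.size()];
        double total = 0.0;
        for (int i = 0; i < partitions.size(); i++) {
            Node leader = partitions.get(i).leader();
            // no leader -> weight 0; otherwise weight is the inverse of the queued amount
            weights[i] = leader == null ? 0.0
                : 1.0 / (brokerQueueSizes.getOrDefault(leader.id(), 0) + 1);
            total += weights[i];
        }
        if (total == 0.0)
            return partitions.get(0).partition();         // degenerate case
        double r = random.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0)
                return partitions.get(i).partition();
        }
        return partitions.get(partitions.size() - 1).partition();
    }

With adaptive partitioning disabled the weights are effectively all equal, so a
slow broker keeps receiving its full share of the data.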
> > >
> > > 6. In the description for `partition.availability.timeout.ms`, I think the
> > > last sentence should mention that this logic is also disabled when
> > > `enable.adaptive.partitioning` is disabled.
> > >
> > > 7. Similar thoughts to Ismael's: I think we should have a POC and tests to
> > > prove that this adaptive partitioning algorithm gives more uniform
> > > partitioning than the original sticky one.
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk> wrote:
> > >
> > > > Regarding `3`, we should only deprecate it if we're sure the new approach
> > > > handles all cases better. Are we confident about that for both of the
> > > > previous partitioners?
> > > >
> > > > Ismael
> > > >
> > > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot
> <djacot@confluent.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Thanks for the KIP! I have a few comments:
> > > > >
> > > > > 1. In the preamble of the proposed change section, there is still a
> > > > > mention of the -1 approach. My understanding is that we have moved away
> > > > > from it now.
> > > > >
> > > > > 2. I am a bit concerned by the trick suggested for the DefaultPartitioner
> > > > > and the UniformStickyPartitioner. I do agree that implementing the logic in
> > > > > the producer itself is a good thing. However, it is weird from a user
> > > > > perspective that one can set a partitioner class that is not used in the
> > > > > end. I think this will be confusing for our users. Have we considered
> > > > > changing the default value of partitioner.class to null to indicate that
> > > > > the new built-in partitioner must be used? By default, the built-in
> > > > > partitioner would be used unless the user explicitly specifies one. The
> > > > > downside is that the new default behavior would not work if the user
> > > > > explicitly specifies a partitioner, but we could mitigate this with my
> > > > > next point.
> > > > >
> > > > > 3. Related to the previous point, I think that we could deprecate both the
> > > > > DefaultPartitioner and the UniformStickyPartitioner. I would also add a
> > > > > warning if one of them is explicitly provided by the user, to inform them
> > > > > that they should switch to the new built-in one. I am pretty sure that most
> > > > > of the folks use the default configuration anyway.
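
A sketch of what such a warning could look like at producer construction time
(hypothetical code, illustrative only):

    // Assumes an SLF4J logger and the producer's parsed config.
    Class<?> cls = config.getClass(ProducerConfig.PARTITIONER_CLASS_CONFIG);
    if (cls == DefaultPartitioner.class || cls == UniformStickyPartitioner.class) {
        log.warn("{} is deprecated; remove partitioner.class so that the producer's "
                + "built-in partitioning logic is used instead", cls.getName());
    }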
> > > > >
> > > > > 4. It would be great if we could explain why the -1 way was rejected. At
> > > > > the moment, the rejected alternative only explains the idea but does not
> > > > > say why we rejected it.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > 2. Removed the option from the KIP.  Now the sticky partitioning
> > > > > > threshold is hardcoded to batch.size.
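
In other words, the switching rule is roughly the following (made-up field and
method names, shown only to illustrate the hardcoded batch.size threshold):

    // Stick to one partition until about batch.size bytes have been produced to it,
    // then move on, so each partition tends to receive roughly one full batch in turn.
    void onAppend(int appendedBytes) {
        bytesProducedToCurrentPartition += appendedBytes;
        if (bytesProducedToCurrentPartition >= batchSize) {     // batch.size, e.g. 16384
            bytesProducedToCurrentPartition = 0;
            currentPartition = chooseNextPartition();           // uniform or load-weighted
        }
    }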
> > > > > >
> > > > > > 20. Added the corresponding wording to the KIP.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <jun@confluent.io.invalid
> >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Artem,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 1. Sounds good.
> > > > > > >
> > > > > > > 2. If we don't expect users to change it, we probably could
> just
> > > > leave
> > > > > out
> > > > > > > the new config. In general, it's easy to add a new config, but
> > hard
> > > > to
> > > > > > > remove an existing config.
> > > > > > >
> > > > > > > 20. The two new configs enable.adaptive.partitioning and
> > > > > > > partition.availability.timeout.ms only apply to the two
> built-in
> > > > > > > partitioners DefaultPartitioner and UniformStickyPartitioner,
> > > right?
> > > > It
> > > > > > > would be useful to document that in the KIP.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hi Jun,
> > > > > > > >
> > > > > > > > Thank you for the suggestions.
> > > > > > > >
> > > > > > > > 1. As we discussed offline, we can hardcode the logic for
> > > > > > > > DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer
> > > > > > > > (i.e. the DefaultPartitioner.partition won't get called; instead
> > > > > > > > KafkaProducer would check if the partitioner is an instance of
> > > > > > > > DefaultPartitioner and then run the actual partitioning logic itself).
> > > > > > > > Then the change to the Partitioner wouldn't be required.  I'll update
> > > > > > > > the KIP to reflect that.
> > > > > > > >
> > > > > > > > 2. I don't expect users to change this too often, as changing
> > it
> > > > > would
> > > > > > > > require a bit of studying of the production patterns.  As a
> > > general
> > > > > > > > principle, if I can think of a model that requires a
> deviation
> > > from
> > > > > > > > default, I tend to add a configuration option.  It could be
> > that
> > > > > it'll
> > > > > > > > never get used in practice, but I cannot prove that.  I'm ok
> > with
> > > > > > > removing
> > > > > > > > the option, let me know what you think.
> > > > > > > >
> > > > > > > > -Artem
> > > > > > > >
> > > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> > <jun@confluent.io.invalid
> > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Artem,
> > > > > > > > >
> > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > >
> > > > > > > > > 1. Since we control the implementation and the usage of
> > > > > > > > DefaultPartitioner,
> > > > > > > > > another way is to instantiate the DefaultPartitioner with a
> > > > special
> > > > > > > > > constructor, which allows it to have more access to
> internal
> > > > > > > information.
> > > > > > > > > Then we could just change the behavior of
> DefaultPartitioner
> > > > such
> > > > > that
> > > > > > > > it
> > > > > > > > > can use the internal infoamtion when choosing the
> partition.
> > > This
> > > > > seems
> > > > > > > > > more intuitive than having DefaultPartitioner return -1
> > > > partition.
> > > > > > > > >
> > > > > > > > > 2. I guess partitioner.sticky.batch.size is introduced
> > because
> > > > the
> > > > > > > > > effective batch size could be less than batch.size and we
> > want
> > > to
> > > > > align
> > > > > > > > > partition switching with the effective batch size. How
> would
> > a
> > > > user
> > > > > > > know
> > > > > > > > > the effective batch size to set
> partitioner.sticky.batch.size
> > > > > properly?
> > > > > > > > If
> > > > > > > > > the user somehow knows the effective batch size, does
> setting
> > > > > > > batch.size
> > > > > > > > to
> > > > > > > > > the effective batch size achieve the same result?
> > > > > > > > >
> > > > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jun,
> > > > > > > > > >
> > > > > > > > > > 1. Updated the KIP to add a couple paragraphs about
> > > > > implementation
> > > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > > >
> > > > > > > > > > 2. Sorry if my reply was confusing, what I meant to say
> > (and
> > > I
> > > > > > > > elaborated
> > > > > > > > > > on that in point #3) is that there could be patterns for
> > > which
> > > > > 16KB
> > > > > > > > > > wouldn't be the most effective setting, thus it would be
> > good
> > > > to
> > > > > make
> > > > > > > > it
> > > > > > > > > > configurable.
> > > > > > > > > >
> > > > > > > > > > 4. We could use broker readiness timeout.  But I'm not
> sure
> > > it
> > > > > would
> > > > > > > > > > correctly model the broker load.  The problem is that
> > latency
> > > > is
> > > > > not
> > > > > > > an
> > > > > > > > > > accurate measure of throughput, we could have 2 brokers
> > that
> > > > have
> > > > > > > equal
> > > > > > > > > > throughput but one has higher latency (so it takes larger
> > > > batches
> > > > > > > less
> > > > > > > > > > frequently, but still takes the same load).
> Latency-based
> > > > logic
> > > > > is
> > > > > > > > > likely
> > > > > > > > > > to send less data to the broker with higher latency.
> Using
> > > the
> > > > > queue
> > > > > > > > > size
> > > > > > > > > > would adapt to throughput, regardless of latency (which
> > could
> > > > be
> > > > > > > just a
> > > > > > > > > > result of deployment topology), so that's the model
> chosen
> > in
> > > > the
> > > > > > > > > > proposal.  The partition.availability.timeout.ms logic
> > > > > approaches
> > > > > > > the
> > > > > > > > > > model
> > > > > > > > > > from a slightly different angle, say we have a
> requirement
> > to
> > > > > deliver
> > > > > > > > > > messages via brokers that have a certain latency, then
> > > > > > > > > > partition.availability.timeout.ms could be used to tune
> > > that.
> > > > > > > Latency
> > > > > > > > > is
> > > > > > > > > > a
> > > > > > > > > > much more volatile metric than throughput (latency
> depends
> > on
> > > > > > > external
> > > > > > > > > > load, on capacity, on deployment topology, on jitter in
> > > > network,
> > > > > on
> > > > > > > > > jitter
> > > > > > > > > > in disk, etc.) and I think it would be best to leave
> > > > > latency-based
> > > > > > > > > > thresholds configurable to tune for the environment.
> > > > > > > > > >
> > > > > > > > > > -Artem
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > > <ju...@confluent.io.invalid>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Artem,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > > >
> > > > > > > > > > > 1. Perhaps you could elaborate a bit more on how the
> > > producer
> > > > > > > > > determines
> > > > > > > > > > > the partition if the partitioner returns -1. This will
> > help
> > > > > > > > understand
> > > > > > > > > > why
> > > > > > > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > > > > > > >
> > > > > > > > > > > 2. I am not sure that I understand this part. If 15.5KB
> > is
> > > > more
> > > > > > > > > > efficient,
> > > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > > >
> > > > > > > > > > > 4. Yes, we could add a switch (or a variant of the
> > > > > partitioner) for
> > > > > > > > > > > enabling this behavior. Also, choosing partitions based
> > on
> > > > > broker
> > > > > > > > > > readiness
> > > > > > > > > > > can be made in a smoother way. For example, we could
> > track
> > > > the
> > > > > last
> > > > > > > > > time
> > > > > > > > > > a
> > > > > > > > > > > broker has drained any batches from the accumulator. We
> > can
> > > > > then
> > > > > > > > select
> > > > > > > > > > > partitions from brokers proportionally to the inverse
> of
> > > that
> > > > > time.
> > > > > > > > > This
> > > > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > > > partition.availability.timeout.ms
> > > > > > > > > > >  threshold.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for your feedback.  I've added the Rejected
> > > > > Alternative
> > > > > > > > > > section
> > > > > > > > > > > > that may clarify some of the questions w.r.t.
> returning
> > > -1.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem
> > is
> > > > > that a
> > > > > > > > > > > significant
> > > > > > > > > > > > part of the logic needs to be in the producer
> (because
> > it
> > > > now
> > > > > > > uses
> > > > > > > > > > > > information about brokers that only the producer
> > knows),
> > > so
> > > > > > > > > > encapsulation
> > > > > > > > > > > > of the logic within the default partitioner isn't as
> > > clean.
> > > > > > >  I've
> > > > > > > > > > added
> > > > > > > > > > > > the Rejected Alternative section that documents an
> > > attempt
> > > > to
> > > > > > > keep
> > > > > > > > > the
> > > > > > > > > > > > encapsulation by providing new callbacks to the
> > > > partitioner.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. The meaning of the partitioner.sticky.batch.size
> is
> > > > > explained
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > Uniform Sticky Batch Size section.  Basically, we
> track
> > > the
> > > > > > > amount
> > > > > > > > of
> > > > > > > > > > > bytes
> > > > > > > > > > > > produced to the partition and if it exceeds
> > > > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > > then we switch to the next partition.  As far as the
> > > reason
> > > > > to
> > > > > > > make
> > > > > > > > > it
> > > > > > > > > > > > different from batch.size, I think Luke answered this
> > > with
> > > > > the
> > > > > > > > > question
> > > > > > > > > > > #3
> > > > > > > > > > > > -- what if the load pattern is such that 15.5KB would
> > be
> > > > more
> > > > > > > > > efficient
> > > > > > > > > > > > than 16KB?
> > > > > > > > > > > >
> > > > > > > > > > > > 3. I think it's hard to have one size that would fit
> > all
> > > > > > > patterns.
> > > > > > > > > > E.g.
> > > > > > > > > > > if
> > > > > > > > > > > > the load pattern is such that there is linger and the
> > app
> > > > > fills
> > > > > > > the
> > > > > > > > > > batch
> > > > > > > > > > > > before linger expires, then having 16KB would most
> > likely
> > > > > > > > synchronize
> > > > > > > > > > > > batching and partition switching, so each partition
> > would
> > > > > get a
> > > > > > > > full
> > > > > > > > > > > > batch.  If load pattern is such that there are a few
> > > > > non-complete
> > > > > > > > > > batches
> > > > > > > > > > > > go out before a larger batch starts to fill, then it
> > may
> > > > > actually
> > > > > > > > be
> > > > > > > > > > > > beneficial to make slightly larger (e.g. linger=0,
> > first
> > > > few
> > > > > > > > records
> > > > > > > > > go
> > > > > > > > > > > in
> > > > > > > > > > > > the first batch, then next few records go to second
> > > batch,
> > > > > and so
> > > > > > > > on,
> > > > > > > > > > > until
> > > > > > > > > > > > 5 in-flight, then larger batch would form while
> waiting
> > > for
> > > > > > > broker
> > > > > > > > to
> > > > > > > > > > > > respond, but the partition switch would happen before
> > the
> > > > > larger
> > > > > > > > > batch
> > > > > > > > > > is
> > > > > > > > > > > > full).
> > > > > > > > > > > >
> > > > > > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > > > > > partition.availability.timeout.ms.  Luke's and Jun's
> > > > > questions
> > > > > > > are
> > > > > > > > > > > slightly
> > > > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > > > (Luke) Is the queue size a good enough signal?  I
> think
> > > > it's
> > > > > a
> > > > > > > good
> > > > > > > > > > > default
> > > > > > > > > > > > signal as it tries to preserve general fairness and
> not
> > > > > overreact
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > broker's state at each moment in time.  But because
> > it's
> > > > > smooth,
> > > > > > > it
> > > > > > > > > may
> > > > > > > > > > > not
> > > > > > > > > > > > be reactive enough to instantaneous latency jumps.
> For
> > > > > > > > > > latency-sensitive
> > > > > > > > > > > > workloads, it may be desirable to react faster when a
> > > > broker
> > > > > > > > becomes
> > > > > > > > > > > > unresponsive (but that may make the distribution
> really
> > > > > choppy),
> > > > > > > so
> > > > > > > > > > > > partition.availability.timeout.ms provides an
> > > opportunity
> > > > to
> > > > > > > tune
> > > > > > > > > > > > adaptiveness.
> > > > > > > > > > > >
> > > > > > > > > > > > (Jun) Can we just not assign partitions to brokers
> that
> > > are
> > > > > not
> > > > > > > > > ready?
> > > > > > > > > > > > Switching partitions purely based on current broker
> > > > readiness
> > > > > > > > > > information
> > > > > > > > > > > > can really skew workload I think (or at least I
> > couldn't
> > > > > build a
> > > > > > > > > model
> > > > > > > > > > > that
> > > > > > > > > > > > proves that over time it's going to be generally
> > fair), I
> > > > > feel
> > > > > > > that
> > > > > > > > > the
> > > > > > > > > > > > algorithm should try to be fair in general and use
> > > smoother
> > > > > > > signals
> > > > > > > > > by
> > > > > > > > > > > > default (e.g. a broker with choppier latency may get
> > much
> > > > > less
> > > > > > > load
> > > > > > > > > > even
> > > > > > > > > > > > though it can handle throughput, it then may
> > potentially
> > > > skew
> > > > > > > > > > > consumption),
> > > > > > > > > > > > note that the queue-size-based logic uses
> probabilities
> > > (so
> > > > > we
> > > > > > > > don't
> > > > > > > > > > > fully
> > > > > > > > > > > > remove brokers, just make it less likely) and
> relative
> > > info
> > > > > > > rather
> > > > > > > > > > than a
> > > > > > > > > > > > threshold (so if all brokers are heavily, but equally
> > > > loaded,
> > > > > > > they
> > > > > > > > > will
> > > > > > > > > > > get
> > > > > > > > > > > > equal distribution, rather than get removed because
> > they
> > > > > exceed
> > > > > > > > some
> > > > > > > > > > > > threshold).  So at the very least, I would like this
> > > logic
> > > > > to be
> > > > > > > > > turned
> > > > > > > > > > > off
> > > > > > > > > > > > by default as it's hard to predict what it could do
> > with
> > > > > > > different
> > > > > > > > > > > patterns
> > > > > > > > > > > > (which means that there would need to be some
> > > > > configuration).  We
> > > > > > > > > could
> > > > > > > > > > > > just not use brokers that are not ready, but again, I
> > > think
> > > > > that
> > > > > > > > it's
> > > > > > > > > > > good
> > > > > > > > > > > > to try to be fair under normal circumstances, so if
> > > > normally
> > > > > > > > brokers
> > > > > > > > > > can
> > > > > > > > > > > > respond under some partition.availability.timeout.ms
> > > > > threshold
> > > > > > > and
> > > > > > > > > the
> > > > > > > > > > > > application works well with those latencies, then we
> > > could
> > > > > > > > distribute
> > > > > > > > > > > data
> > > > > > > > > > > > equally between brokers that don't exceed the
> > latencies.
> > > > The
> > > > > > > > value,
> > > > > > > > > of
> > > > > > > > > > > > course, would depend on the environment and app
> > > > requirements,
> > > > > > > hence
> > > > > > > > > > it's
> > > > > > > > > > > > configurable.
> > > > > > > > > > > >
> > > > > > > > > > > > 10. Added a statement at the beginning of the
> proposed
> > > > > changes.
> > > > > > > > > > > >
> > > > > > > > > > > > -Artem
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. I agree with Luke that having the partitioner
> > > > returning
> > > > > -1
> > > > > > > is
> > > > > > > > > kind
> > > > > > > > > > > of
> > > > > > > > > > > > > weird. Could we just change the implementation of
> > > > > > > > > DefaultPartitioner
> > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > > new behavior?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar question
> to
> > > > > Luke. I
> > > > > > > am
> > > > > > > > > not
> > > > > > > > > > > sure
> > > > > > > > > > > > > why we want to introduce this new configuration.
> > Could
> > > we
> > > > > just
> > > > > > > > use
> > > > > > > > > > the
> > > > > > > > > > > > > existing batch.size?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4. I also agree with Luke that it's not clear why
> we
> > > need
> > > > > > > > > > > > > partition.availability.timeout.ms. The KIP says
> the
> > > > broker
> > > > > > > > "would
> > > > > > > > > > not
> > > > > > > > > > > be
> > > > > > > > > > > > > chosen until the broker is able to accept the next
> > > ready
> > > > > batch
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > > > partition". If we are keeping track of this, could
> we
> > > > just
> > > > > > > avoid
> > > > > > > > > > > > assigning
> > > > > > > > > > > > > records to partitions whose leader is not able to
> > > accept
> > > > > the
> > > > > > > next
> > > > > > > > > > > batch?
> > > > > > > > > > > > If
> > > > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 10. Currently, partitioner.class defaults to
> > > > > > > DefaultPartitioner,
> > > > > > > > > > which
> > > > > > > > > > > > uses
> > > > > > > > > > > > > StickyPartitioner when the key is specified. Since
> > this
> > > > KIP
> > > > > > > > > improves
> > > > > > > > > > > upon
> > > > > > > > > > > > > StickyPartitioner, it would be useful to make the
> new
> > > > > behavior
> > > > > > > > the
> > > > > > > > > > > > default
> > > > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jun
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > > > > showuon@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > mentioned, sometimes the custom partitioner would
> > > > return
> > > > > > > > negative
> > > > > > > > > > > > > partition
> > > > > > > > > > > > > > id accidentally.
> > > > > > > > > > > > > > If it returned -1, how could you know if it is
> > > expected
> > > > > or
> > > > > > > not
> > > > > > > > > > > > expected?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> > > > > showuon@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the update. I have some questions
> > about
> > > > it:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Could you explain why you need the
> > `partitioner`
> > > > > return
> > > > > > > > -1?
> > > > > > > > > In
> > > > > > > > > > > > which
> > > > > > > > > > > > > > > case we need it? And how it is used in your
> KIP?
> > > > > > > > > > > > > > > 2. What does the
> "partitioner.sticky.batch.size"
> > > > mean?
> > > > > In
> > > > > > > the
> > > > > > > > > > > > > > > "Configuration" part, you didn't explain it.
> And
> > > > > default to
> > > > > > > > 0,
> > > > > > > > > I
> > > > > > > > > > > > guess
> > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > the same as current behavior for backward
> > > > > compatibility,
> > > > > > > > right?
> > > > > > > > > > You
> > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > mention it.
> > > > > > > > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > > > > > > > "partitioner.sticky.batch.size". Let's say, we
> > > > already
> > > > > > > > > accumulate
> > > > > > > > > > > > > 15.5KB
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > partition1, and sent. So when next batch
> created,
> > > in
> > > > > your
> > > > > > > > > current
> > > > > > > > > > > > > design,
> > > > > > > > > > > > > > > we still stick to partition1, until 16KB
> reached,
> > > and
> > > > > then
> > > > > > > we
> > > > > > > > > > > create
> > > > > > > > > > > > a
> > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > batch to change to next partition, ex:
> > partition2.
> > > > But
> > > > > I
> > > > > > > > think
> > > > > > > > > if
> > > > > > > > > > > we
> > > > > > > > > > > > > set
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > threshold to 95% (for example), we can know
> > > previous
> > > > > 15.5KB
> > > > > > > > > > already
> > > > > > > > > > > > > > exceeds
> > > > > > > > > > > > > > > the threshold so that we can directly create
> new
> > > > batch
> > > > > for
> > > > > > > > next
> > > > > > > > > > > > > records.
> > > > > > > > > > > > > > > This way should be able to make it more
> > efficient.
> > > > > WDYT?
> > > > > > > > > > > > > > > 4. I think the improved queuing logic should be
> > > good
> > > > > > > enough.
> > > > > > > > I
> > > > > > > > > > > can't
> > > > > > > > > > > > > get
> > > > > > > > > > > > > > > the benefit of having `
> > > > > partition.availability.timeout.ms`
> > > > > > > > > > config.
> > > > > > > > > > > In
> > > > > > > > > > > > > > > short, you want to make the partitioner take
> the
> > > > broker
> > > > > > > load
> > > > > > > > > into
> > > > > > > > > > > > > > > consideration. We can just improve that in the
> > > > queuing
> > > > > > > logic
> > > > > > > > > (and
> > > > > > > > > > > you
> > > > > > > > > > > > > > > already did it). Why should we add the config?
> > > Could
> > > > > you
> > > > > > > use
> > > > > > > > > some
> > > > > > > > > > > > > > examples
> > > > > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >> Hello,
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Please add your comments about the KIP.  If
> > there
> > > > are
> > > > > no
> > > > > > > > > > > > > considerations,
> > > > > > > > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> -Artem
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits
> <
> > > > > > > > > > > > alivshits@confluent.io
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> > Hello,
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > After trying a few prototypes, I've made
> some
> > > > > changes to
> > > > > > > > the
> > > > > > > > > > > > public
> > > > > > > > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > >> > .
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > -Artem
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem
> > Livshits <
> > > > > > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> >> Hello,
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > > >> >> .
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > > > > > >> >>
> > > > https://issues.apache.org/jira/browse/KAFKA-10888,
> > > > > but
> > > > > > > > it
> > > > > > > > > > does
> > > > > > > > > > > > > > >> include a
> > > > > > > > > > > > > > >> >> client config change, therefore we have a
> KIP
> > > to
> > > > > > > discuss.
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> -Artem
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

Thanks for the updated KIP. A couple more comments.

30. For the 3 new configs, it would be useful to make it clear that they
are only relevant when the partitioner class is null.

31. partitioner.adaptive.partitioning.enable: I am wondering whether it
should default to true. This is a more complex behavior than "uniform
sticky" and may take some time to get right. If we do want to enable it by
default, it would be useful to validate it with some test results.
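
To illustrate comment 30, here is a minimal sketch of a producer configuration
that relies on the new built-in partitioner, i.e. partitioner.class is simply
left unset. The two option names shown are the ones that have come up in this
thread; they, the topic name, and the bootstrap address are assumptions for
illustration, not final KIP wording.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BuiltInPartitionerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // partitioner.class is deliberately left unset (null): the producer then
        // runs the built-in partitioning logic and the options below take effect.
        props.put("partitioner.adaptive.partitioning.enable", "true");
        props.put("partition.availability.timeout.ms", "0"); // 0 = no latency cutoff

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "some value"));
        }
    }
}

If a custom partitioner.class were set, the adaptive options above would have
no effect, which is exactly the dependency that is worth spelling out in the
config documentation.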

Jun




On Wed, Mar 9, 2022 at 10:05 PM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Thank you for the feedback. I've discussed this offline with some of the folks
> and updated the KIP.  The main change is that instead of using
> DefaultPartitioner and UniformStickyPartitioner as flags, in the new
> proposal the default partitioner is null, so if no custom partitioner is
> specified then the partitioning logic is implemented in KafkaProducer.  The
> Compatibility section is updated as well, and the configuration options
> are renamed to be more consistent.
>
> -Artem
>
> On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com> wrote:
>
> > Hi Artem,
> >
> > Thanks for your explanation and update to the KIP.
> > Some comments:
> >
> > 5. In the description for `enable.adaptive.partitioning`, the `false`
> case,
> > you said:
> > > the producer will try to distribute messages uniformly.
> > I think we should describe the possible skewing distribution. Otherwise,
> > user might be confused about why adaptive partitioning is important.
> >
> > 6. In the description for `partition.availability.timeout.ms`, I think
> we
> > should mention in the last sentence about if
> `enable.adaptive.partitioning`
> > is disabled this logic is also disabled.
> >
> > 7. Similar thoughts as Ismael, I think we should have a POC and test to
> > prove that this adaptive partitioning algorithm can have better uniform
> > partitioning, compared with original sticky one.
> >
> > Thank you.
> > Luke
> >
> > On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Regarding `3`, we should only deprecate it if we're sure the new
> approach
> > > handles all cases better. Are we confident about that for both of the
> > > previous partitioners?
> > >
> > > Ismael
> > >
> > > On Fri, Mar 4, 2022 at 1:37 AM David Jacot <djacot@confluent.io.invalid
> >
> > > wrote:
> > >
> > > > Hi Artem,
> > > >
> > > > Thanks for the KIP! I have a few comments:
> > > >
> > > > 1. In the preamble of the proposed change section, there is still a
> > > > mention of the
> > > > -1 approach. My understanding is that we have moved away from it now.
> > > >
> > > > 2. I am a bit concerned by the trick suggested about the
> > > > DefaultPartitioner and
> > > > the UniformStickyPartitioner. I do agree that implementing the logic
> in
> > > the
> > > > producer itself is a good thing. However, it is weird from a user
> > > > perspective
> > > > that he can set a class as partitioner that is not used in the end. I
> > > > think that
> > > > this will be confusing for our users. Have we considered changing the
> > > > default
> > > > value of partitioner.class to null to indicate that the new built-in
> > > > partitioner
> > > > must be used? By default, the built-in partitioner would be used
> unless
> > > the
> > > > user explicitly specify one. The downside is that the new default
> > > behavior
> > > > would not work if the user explicitly specify the partitioner but we
> > > could
> > > > mitigate this with my next point.
> > > >
> > > > 3. Related to the previous point, I think that we could deprecate
> both
> > > the
> > > > DefaultPartitioner and the UniformStickyPartitioner. I would also
> add a
> > > > warning if one of them is explicitly provided by the user to inform
> > them
> > > > that they should switch to the new built-in one. I am pretty sure
> that
> > > most
> > > > of the folks use the default configuration anyway.
> > > >
> > > > 4. It would be great if we could explain why the -1 way was rejected.
> > At
> > > > the moment, the rejected alternative only explain the idea but does
> not
> > > > say why we rejected it.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > > Hi Jun,
> > > > >
> > > > > 2. Removed the option from the KIP.  Now the sticky partitioning
> > > > threshold
> > > > > is hardcoded to batch.size.
> > > > >
> > > > > 20. Added the corresponding wording to the KIP.
> > > > >
> > > > > -Artem
> > > > >
> > > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <ju...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, Artem,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1. Sounds good.
> > > > > >
> > > > > > 2. If we don't expect users to change it, we probably could just
> > > leave
> > > > out
> > > > > > the new config. In general, it's easy to add a new config, but
> hard
> > > to
> > > > > > remove an existing config.
> > > > > >
> > > > > > 20. The two new configs enable.adaptive.partitioning and
> > > > > > partition.availability.timeout.ms only apply to the two built-in
> > > > > > partitioners DefaultPartitioner and UniformStickyPartitioner,
> > right?
> > > It
> > > > > > would be useful to document that in the KIP.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > > <al...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > Thank you for the suggestions.
> > > > > > >
> > > > > > > 1. As we discussed offline, we can hardcode the logic for
> > > > > > > DefaultPartitioner and UniformStickyPartitioner in the
> > > KafkaProducer
> > > > > > (i.e.
> > > > > > > the DefaultPartitioner.partition won't get called, instead
> > > > KafkaProducer
> > > > > > > would check if the partitioner is an instance of
> > DefaultPartitioner
> > > > and
> > > > > > > then run the actual partitioning logic itself).  Then the
> change
> > to
> > > > the
> > > > > > > Partitioner wouldn't be required.  I'll update the KIP to
> reflect
> > > > that.
> > > > > > >
> > > > > > > 2. I don't expect users to change this too often, as changing
> it
> > > > would
> > > > > > > require a bit of studying of the production patterns.  As a
> > general
> > > > > > > principle, if I can think of a model that requires a deviation
> > from
> > > > > > > default, I tend to add a configuration option.  It could be
> that
> > > > it'll
> > > > > > > never get used in practice, but I cannot prove that.  I'm ok
> with
> > > > > > removing
> > > > > > > the option, let me know what you think.
> > > > > > >
> > > > > > > -Artem
> > > > > > >
> > > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao
> <jun@confluent.io.invalid
> > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, Artem,
> > > > > > > >
> > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > >
> > > > > > > > 1. Since we control the implementation and the usage of
> > > > > > > DefaultPartitioner,
> > > > > > > > another way is to instantiate the DefaultPartitioner with a
> > > special
> > > > > > > > constructor, which allows it to have more access to internal
> > > > > > information.
> > > > > > > > Then we could just change the behavior of  DefaultPartitioner
> > > such
> > > > that
> > > > > > > it
> > > > > > > > can use the internal information when choosing the partition.
> > This
> > > > seems
> > > > > > > > more intuitive than having DefaultPartitioner return -1
> > > partition.
> > > > > > > >
> > > > > > > > 2. I guess partitioner.sticky.batch.size is introduced
> because
> > > the
> > > > > > > > effective batch size could be less than batch.size and we
> want
> > to
> > > > align
> > > > > > > > partition switching with the effective batch size. How would
> a
> > > user
> > > > > > know
> > > > > > > > the effective batch size to set partitioner.sticky.batch.size
> > > > properly?
> > > > > > > If
> > > > > > > > the user somehow knows the effective batch size, does setting
> > > > > > batch.size
> > > > > > > to
> > > > > > > > the effective batch size achieve the same result?
> > > > > > > >
> > > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hi Jun,
> > > > > > > > >
> > > > > > > > > 1. Updated the KIP to add a couple paragraphs about
> > > > implementation
> > > > > > > > > necessities in the Proposed Changes section.
> > > > > > > > >
> > > > > > > > > 2. Sorry if my reply was confusing, what I meant to say
> (and
> > I
> > > > > > > elaborated
> > > > > > > > > on that in point #3) is that there could be patterns for
> > which
> > > > 16KB
> > > > > > > > > wouldn't be the most effective setting, thus it would be
> good
> > > to
> > > > make
> > > > > > > it
> > > > > > > > > configurable.
> > > > > > > > >
> > > > > > > > > 4. We could use broker readiness timeout.  But I'm not sure
> > it
> > > > would
> > > > > > > > > correctly model the broker load.  The problem is that
> latency
> > > is
> > > > not
> > > > > > an
> > > > > > > > > accurate measure of throughput, we could have 2 brokers
> that
> > > have
> > > > > > equal
> > > > > > > > > throughput but one has higher latency (so it takes larger
> > > batches
> > > > > > less
> > > > > > > > > frequently, but still takes the same load).  Latency-based
> > > logic
> > > > is
> > > > > > > > likely
> > > > > > > > > to send less data to the broker with higher latency.  Using
> > the
> > > > queue
> > > > > > > > size
> > > > > > > > > would adapt to throughput, regardless of latency (which
> could
> > > be
> > > > > > just a
> > > > > > > > > result of deployment topology), so that's the model chosen
> in
> > > the
> > > > > > > > > proposal.  The partition.availability.timeout.ms logic
> > > > approaches
> > > > > > the
> > > > > > > > > model
> > > > > > > > > from a slightly different angle, say we have a requirement
> to
> > > > deliver
> > > > > > > > > messages via brokers that have a certain latency, then
> > > > > > > > > partition.availability.timeout.ms could be used to tune
> > that.
> > > > > > Latency
> > > > > > > > is
> > > > > > > > > a
> > > > > > > > > much more volatile metric than throughput (latency depends
> on
> > > > > > external
> > > > > > > > > load, on capacity, on deployment topology, on jitter in
> > > network,
> > > > on
> > > > > > > > jitter
> > > > > > > > > in disk, etc.) and I think it would be best to leave
> > > > latency-based
> > > > > > > > > thresholds configurable to tune for the environment.
> > > > > > > > >
> > > > > > > > > -Artem
> > > > > > > > >
> > > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > > <ju...@confluent.io.invalid>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Artem,
> > > > > > > > > >
> > > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > > >
> > > > > > > > > > 1. Perhaps you could elaborate a bit more on how the
> > producer
> > > > > > > > determines
> > > > > > > > > > the partition if the partitioner returns -1. This will
> help
> > > > > > > understand
> > > > > > > > > why
> > > > > > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > > > > > >
> > > > > > > > > > 2. I am not sure that I understand this part. If 15.5KB
> is
> > > more
> > > > > > > > > efficient,
> > > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > > >
> > > > > > > > > > 4. Yes, we could add a switch (or a variant of the
> > > > partitioner) for
> > > > > > > > > > enabling this behavior. Also, choosing partitions based
> on
> > > > broker
> > > > > > > > > readiness
> > > > > > > > > > can be made in a smoother way. For example, we could
> track
> > > the
> > > > last
> > > > > > > > time
> > > > > > > > > a
> > > > > > > > > > broker has drained any batches from the accumulator. We
> can
> > > > then
> > > > > > > select
> > > > > > > > > > partitions from brokers proportionally to the inverse of
> > that
> > > > time.
> > > > > > > > This
> > > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > > partition.availability.timeout.ms
> > > > > > > > > >  threshold.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > > >
> > > > > > > > > > > Thank you for your feedback.  I've added the Rejected
> > > > Alternative
> > > > > > > > > section
> > > > > > > > > > > that may clarify some of the questions w.r.t. returning
> > -1.
> > > > > > > > > > >
> > > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem
> is
> > > > that a
> > > > > > > > > > significant
> > > > > > > > > > > part of the logic needs to be in the producer (because
> it
> > > now
> > > > > > uses
> > > > > > > > > > > information about brokers that only the producer
> knows),
> > so
> > > > > > > > > encapsulation
> > > > > > > > > > > of the logic within the default partitioner isn't as
> > clean.
> > > > > >  I've
> > > > > > > > > added
> > > > > > > > > > > the Rejected Alternative section that documents an
> > attempt
> > > to
> > > > > > keep
> > > > > > > > the
> > > > > > > > > > > encapsulation by providing new callbacks to the
> > > partitioner.
> > > > > > > > > > >
> > > > > > > > > > > 2. The meaning of the partitioner.sticky.batch.size is
> > > > explained
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > Uniform Sticky Batch Size section.  Basically, we track
> > the
> > > > > > amount
> > > > > > > of
> > > > > > > > > > bytes
> > > > > > > > > > > produced to the partition and if it exceeds
> > > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > > then we switch to the next partition.  As far as the
> > reason
> > > > to
> > > > > > make
> > > > > > > > it
> > > > > > > > > > > different from batch.size, I think Luke answered this
> > with
> > > > the
> > > > > > > > question
> > > > > > > > > > #3
> > > > > > > > > > > -- what if the load pattern is such that 15.5KB would
> be
> > > more
> > > > > > > > efficient
> > > > > > > > > > > than 16KB?
> > > > > > > > > > >
> > > > > > > > > > > 3. I think it's hard to have one size that would fit
> all
> > > > > > patterns.
> > > > > > > > > E.g.
> > > > > > > > > > if
> > > > > > > > > > > the load pattern is such that there is linger and the
> app
> > > > fills
> > > > > > the
> > > > > > > > > batch
> > > > > > > > > > > before linger expires, then having 16KB would most
> likely
> > > > > > > synchronize
> > > > > > > > > > > batching and partition switching, so each partition
> would
> > > > get a
> > > > > > > full
> > > > > > > > > > > batch.  If load pattern is such that there are a few
> > > > non-complete
> > > > > > > > > batches
> > > > > > > > > > > go out before a larger batch starts to fill, then it
> may
> > > > actually
> > > > > > > be
> > > > > > > > > > > beneficial to make slightly larger (e.g. linger=0,
> first
> > > few
> > > > > > > records
> > > > > > > > go
> > > > > > > > > > in
> > > > > > > > > > > the first batch, then next few records go to second
> > batch,
> > > > and so
> > > > > > > on,
> > > > > > > > > > until
> > > > > > > > > > > 5 in-flight, then larger batch would form while waiting
> > for
> > > > > > broker
> > > > > > > to
> > > > > > > > > > > respond, but the partition switch would happen before
> the
> > > > larger
> > > > > > > > batch
> > > > > > > > > is
> > > > > > > > > > > full).
> > > > > > > > > > >
> > > > > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > > > > partition.availability.timeout.ms.  Luke's and Jun's
> > > > questions
> > > > > > are
> > > > > > > > > > slightly
> > > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > > (Luke) Is the queue size a good enough signal?  I think
> > > it's
> > > > a
> > > > > > good
> > > > > > > > > > default
> > > > > > > > > > > signal as it tries to preserve general fairness and not
> > > > overreact
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > broker's state at each moment in time.  But because
> it's
> > > > smooth,
> > > > > > it
> > > > > > > > may
> > > > > > > > > > not
> > > > > > > > > > > be reactive enough to instantaneous latency jumps.  For
> > > > > > > > > latency-sensitive
> > > > > > > > > > > workloads, it may be desirable to react faster when a
> > > broker
> > > > > > > becomes
> > > > > > > > > > > unresponsive (but that may make the distribution really
> > > > choppy),
> > > > > > so
> > > > > > > > > > > partition.availability.timeout.ms provides an
> > opportunity
> > > to
> > > > > > tune
> > > > > > > > > > > adaptiveness.
> > > > > > > > > > >
> > > > > > > > > > > (Jun) Can we just not assign partitions to brokers that
> > are
> > > > not
> > > > > > > > ready?
> > > > > > > > > > > Switching partitions purely based on current broker
> > > readiness
> > > > > > > > > information
> > > > > > > > > > > can really skew workload I think (or at least I
> couldn't
> > > > build a
> > > > > > > > model
> > > > > > > > > > that
> > > > > > > > > > > proves that over time it's going to be generally
> fair), I
> > > > feel
> > > > > > that
> > > > > > > > the
> > > > > > > > > > > algorithm should try to be fair in general and use
> > smoother
> > > > > > signals
> > > > > > > > by
> > > > > > > > > > > default (e.g. a broker with choppier latency may get
> much
> > > > less
> > > > > > load
> > > > > > > > > even
> > > > > > > > > > > though it can handle throughput, it then may
> potentially
> > > skew
> > > > > > > > > > consumption),
> > > > > > > > > > > note that the queue-size-based logic uses probabilities
> > (so
> > > > we
> > > > > > > don't
> > > > > > > > > > fully
> > > > > > > > > > > remove brokers, just make it less likely) and relative
> > info
> > > > > > rather
> > > > > > > > > than a
> > > > > > > > > > > threshold (so if all brokers are heavily, but equally
> > > loaded,
> > > > > > they
> > > > > > > > will
> > > > > > > > > > get
> > > > > > > > > > > equal distribution, rather than get removed because
> they
> > > > exceed
> > > > > > > some
> > > > > > > > > > > threshold).  So at the very least, I would like this
> > logic
> > > > to be
> > > > > > > > turned
> > > > > > > > > > off
> > > > > > > > > > > by default as it's hard to predict what it could do
> with
> > > > > > different
> > > > > > > > > > patterns
> > > > > > > > > > > (which means that there would need to be some
> > > > configuration).  We
> > > > > > > > could
> > > > > > > > > > > just not use brokers that are not ready, but again, I
> > think
> > > > that
> > > > > > > it's
> > > > > > > > > > good
> > > > > > > > > > > to try to be fair under normal circumstances, so if
> > > normally
> > > > > > > brokers
> > > > > > > > > can
> > > > > > > > > > > respond under some partition.availability.timeout.ms
> > > > threshold
> > > > > > and
> > > > > > > > the
> > > > > > > > > > > application works well with those latencies, then we
> > could
> > > > > > > distribute
> > > > > > > > > > data
> > > > > > > > > > > equally between brokers that don't exceed the
> latencies.
> > > The
> > > > > > > value,
> > > > > > > > of
> > > > > > > > > > > course, would depend on the environment and app
> > > requirements,
> > > > > > hence
> > > > > > > > > it's
> > > > > > > > > > > configurable.
> > > > > > > > > > >
> > > > > > > > > > > 10. Added a statement at the beginning of the proposed
> > > > changes.
> > > > > > > > > > >
> > > > > > > > > > > -Artem
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi, Artem,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. I agree with Luke that having the partitioner
> > > returning
> > > > -1
> > > > > > is
> > > > > > > > kind
> > > > > > > > > > of
> > > > > > > > > > > > weird. Could we just change the implementation of
> > > > > > > > DefaultPartitioner
> > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > > new behavior?
> > > > > > > > > > > >
> > > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar question to
> > > > Luke. I
> > > > > > am
> > > > > > > > not
> > > > > > > > > > sure
> > > > > > > > > > > > why we want to introduce this new configuration.
> Could
> > we
> > > > just
> > > > > > > use
> > > > > > > > > the
> > > > > > > > > > > > existing batch.size?
> > > > > > > > > > > >
> > > > > > > > > > > > 4. I also agree with Luke that it's not clear why we
> > need
> > > > > > > > > > > > partition.availability.timeout.ms. The KIP says the
> > > broker
> > > > > > > "would
> > > > > > > > > not
> > > > > > > > > > be
> > > > > > > > > > > > chosen until the broker is able to accept the next
> > ready
> > > > batch
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > > partition". If we are keeping track of this, could we
> > > just
> > > > > > avoid
> > > > > > > > > > > assigning
> > > > > > > > > > > > records to partitions whose leader is not able to
> > accept
> > > > the
> > > > > > next
> > > > > > > > > > batch?
> > > > > > > > > > > If
> > > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > > >
> > > > > > > > > > > > 10. Currently, partitioner.class defaults to
> > > > > > DefaultPartitioner,
> > > > > > > > > which
> > > > > > > > > > > uses
> > > > > > > > > > > > StickyPartitioner when the key is specified. Since
> this
> > > KIP
> > > > > > > > improves
> > > > > > > > > > upon
> > > > > > > > > > > > StickyPartitioner, it would be useful to make the new
> > > > behavior
> > > > > > > the
> > > > > > > > > > > default
> > > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jun
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > > > showuon@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > > > >
> > > > > > > > > > > > > mentioned, sometimes the custom partitioner would
> > > return
> > > > > > > negative
> > > > > > > > > > > > partition
> > > > > > > > > > > > > id accidentally.
> > > > > > > > > > > > > If it returned -1, how could you know if it is
> > expected
> > > > or
> > > > > > not
> > > > > > > > > > > expected?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > Luke
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> > > > showuon@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the update. I have some questions
> about
> > > it:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Could you explain why you need the
> `partitioner`
> > > > return
> > > > > > > -1?
> > > > > > > > In
> > > > > > > > > > > which
> > > > > > > > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > > > > > > > 2. What does the "partitioner.sticky.batch.size"
> > > mean?
> > > > In
> > > > > > the
> > > > > > > > > > > > > > "Configuration" part, you didn't explain it. And
> > > > default to
> > > > > > > 0,
> > > > > > > > I
> > > > > > > > > > > guess
> > > > > > > > > > > > > it's
> > > > > > > > > > > > > > the same as current behavior for backward
> > > > compatibility,
> > > > > > > right?
> > > > > > > > > You
> > > > > > > > > > > > > should
> > > > > > > > > > > > > > mention it.
> > > > > > > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > > > > > > "partitioner.sticky.batch.size". Let's say, we
> > > already
> > > > > > > > accumulate
> > > > > > > > > > > > 15.5KB
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > partition1, and sent. So when next batch created,
> > in
> > > > your
> > > > > > > > current
> > > > > > > > > > > > design,
> > > > > > > > > > > > > > we still stick to partition1, until 16KB reached,
> > and
> > > > then
> > > > > > we
> > > > > > > > > > create
> > > > > > > > > > > a
> > > > > > > > > > > > > new
> > > > > > > > > > > > > > batch to change to next partition, ex:
> partition2.
> > > But
> > > > I
> > > > > > > think
> > > > > > > > if
> > > > > > > > > > we
> > > > > > > > > > > > set
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > threshold to 95% (for example), we can know
> > previous
> > > > 15.5KB
> > > > > > > > > already
> > > > > > > > > > > > > exceeds
> > > > > > > > > > > > > > the threshold so that we can directly create new
> > > batch
> > > > for
> > > > > > > next
> > > > > > > > > > > > records.
> > > > > > > > > > > > > > This way should be able to make it more
> efficient.
> > > > WDYT?
> > > > > > > > > > > > > > 4. I think the improved queuing logic should be
> > good
> > > > > > enough.
> > > > > > > I
> > > > > > > > > > can't
> > > > > > > > > > > > get
> > > > > > > > > > > > > > the benefit of having `
> > > > partition.availability.timeout.ms`
> > > > > > > > > config.
> > > > > > > > > > In
> > > > > > > > > > > > > > short, you want to make the partitioner take the
> > > broker
> > > > > > load
> > > > > > > > into
> > > > > > > > > > > > > > consideration. We can just improve that in the
> > > queuing
> > > > > > logic
> > > > > > > > (and
> > > > > > > > > > you
> > > > > > > > > > > > > > already did it). Why should we add the config?
> > Could
> > > > you
> > > > > > use
> > > > > > > > some
> > > > > > > > > > > > > examples
> > > > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > > Luke
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> Hello,
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Please add your comments about the KIP.  If
> there
> > > are
> > > > no
> > > > > > > > > > > > considerations,
> > > > > > > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> -Artem
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > > > > > > > alivshits@confluent.io
> > > > > > > > > > > > >
> > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > Hello,
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > After trying a few prototypes, I've made some
> > > > changes to
> > > > > > > the
> > > > > > > > > > > public
> > > > > > > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > >> > .
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > -Artem
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem
> Livshits <
> > > > > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >> Hello,
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > > >> >> .
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > > > > >> >>
> > > https://issues.apache.org/jira/browse/KAFKA-10888,
> > > > but
> > > > > > > it
> > > > > > > > > does
> > > > > > > > > > > > > >> include a
> > > > > > > > > > > > > >> >> client config change, therefore we have a KIP
> > to
> > > > > > discuss.
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> -Artem
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Thank you for the feedback. I've discussed this offline with some of the folks
and updated the KIP.  The main change is that instead of using
DefaultPartitioner and UniformStickyPartitioner as flags, in the new
proposal the default partitioner is null, so if no custom partitioner is
specified then the partitioning logic is implemented in KafkaProducer.  The
Compatibility section is updated as well, and the configuration options
are renamed to be more consistent.
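
To make the shape of that change concrete, below is a rough sketch of the
dispatch the producer could perform when resolving a record's partition. This
is not the actual KafkaProducer code; the class and helper names are invented
for illustration, and only the Partitioner.partition signature is the real one.

import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.Cluster;

// Sketch only: shows how a null partitioner.class falls through to built-in logic.
class PartitionResolverSketch {
    private final Partitioner userPartitioner; // null when partitioner.class is unset

    PartitionResolverSketch(Partitioner userPartitioner) {
        this.userPartitioner = userPartitioner;
    }

    int resolve(ProducerRecord<byte[], byte[]> record, Cluster cluster) {
        if (record.partition() != null) {
            return record.partition();          // an explicit partition always wins
        }
        if (userPartitioner != null) {
            // a user-supplied partitioner.class keeps full control, as before
            return userPartitioner.partition(record.topic(), record.key(), record.key(),
                    record.value(), record.value(), cluster);
        }
        // partitioner.class == null: the producer's own logic runs here, where it can
        // see batch fill and per-broker queue sizes for adaptive partitioning
        return chooseBuiltInPartition(record.topic(), cluster);
    }

    private int chooseBuiltInPartition(String topic, Cluster cluster) {
        // placeholder for the uniform sticky / adaptive logic described in the KIP
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return Math.floorMod(ThreadLocalRandom.current().nextInt(), numPartitions);
    }
}

The point of the null default is visible in the last branch: the decision stays
inside the producer, which is what lets the built-in logic look at batch sizes
and per-broker queue depth instead of the limited Partitioner interface.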

-Artem

On Fri, Mar 4, 2022 at 10:38 PM Luke Chen <sh...@gmail.com> wrote:

> Hi Artem,
>
> Thanks for your explanation and update to the KIP.
> Some comments:
>
> 5. In the description for `enable.adaptive.partitioning`, the `false` case,
> you said:
> > the producer will try to distribute messages uniformly.
> I think we should describe the possible skewing distribution. Otherwise,
> user might be confused about why adaptive partitioning is important.
>
> 6. In the description for `partition.availability.timeout.ms`, I think we
> should mention in the last sentence about if `enable.adaptive.partitioning`
> is disabled this logic is also disabled.
>
> 7. Similar thoughts as Ismael, I think we should have a POC and test to
> prove that this adaptive partitioning algorithm can have better uniform
> partitioning, compared with original sticky one.
>
> Thank you.
> Luke
>
> On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk> wrote:
>
> > Regarding `3`, we should only deprecate it if we're sure the new approach
> > handles all cases better. Are we confident about that for both of the
> > previous partitioners?
> >
> > Ismael
> >
> > On Fri, Mar 4, 2022 at 1:37 AM David Jacot <dj...@confluent.io.invalid>
> > wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the KIP! I have a few comments:
> > >
> > > 1. In the preamble of the proposed change section, there is still a
> > > mention of the
> > > -1 approach. My understanding is that we have moved away from it now.
> > >
> > > 2. I am a bit concerned by the trick suggested about the
> > > DefaultPartitioner and
> > > the UniformStickyPartitioner. I do agree that implementing the logic in
> > the
> > > producer itself is a good thing. However, it is weird from a user
> > > perspective
> > > that he can set a class as partitioner that is not used in the end. I
> > > think that
> > > this will be confusing for our users. Have we considered changing the
> > > default
> > > value of partitioner.class to null to indicate that the new built-in
> > > partitioner
> > > must be used? By default, the built-in partitioner would be used unless
> > the
> > > user explicitly specify one. The downside is that the new default
> > behavior
> > > would not work if the user explicitly specify the partitioner but we
> > could
> > > mitigate this with my next point.
> > >
> > > 3. Related to the previous point, I think that we could deprecate both
> > the
> > > DefaultPartitioner and the UniformStickyPartitioner. I would also add a
> > > warning if one of them is explicitly provided by the user to inform
> them
> > > that they should switch to the new built-in one. I am pretty sure that
> > most
> > > of the folks use the default configuration anyway.
> > >
> > > 4. It would be great if we could explain why the -1 way was rejected.
> At
> > > the moment, the rejected alternative only explain the idea but does not
> > > say why we rejected it.
> > >
> > > Best,
> > > David
> > >
> > > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > Hi Jun,
> > > >
> > > > 2. Removed the option from the KIP.  Now the sticky partitioning
> > > threshold
> > > > is hardcoded to batch.size.
> > > >
> > > > 20. Added the corresponding wording to the KIP.
> > > >
> > > > -Artem
> > > >
> > > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 1. Sounds good.
> > > > >
> > > > > 2. If we don't expect users to change it, we probably could just
> > leave
> > > out
> > > > > the new config. In general, it's easy to add a new config, but hard
> > to
> > > > > remove an existing config.
> > > > >
> > > > > 20. The two new configs enable.adaptive.partitioning and
> > > > > partition.availability.timeout.ms only apply to the two built-in
> > > > > partitioners DefaultPartitioner and UniformStickyPartitioner,
> right?
> > It
> > > > > would be useful to document that in the KIP.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thank you for the suggestions.
> > > > > >
> > > > > > 1. As we discussed offline, we can hardcode the logic for
> > > > > > DefaultPartitioner and UniformStickyPartitioner in the
> > KafkaProducer
> > > > > (i.e.
> > > > > > the DefaultPartitioner.partition won't get called, instead
> > > KafkaProducer
> > > > > > would check if the partitioner is an instance of
> DefaultPartitioner
> > > and
> > > > > > then run the actual partitioning logic itself).  Then the change
> to
> > > the
> > > > > > Partitioner wouldn't be required.  I'll update the KIP to reflect
> > > that.
> > > > > >
> > > > > > 2. I don't expect users to change this too often, as changing it
> > > would
> > > > > > require a bit of studying of the production patterns.  As a
> general
> > > > > > principle, if I can think of a model that requires a deviation
> from
> > > > > > default, I tend to add a configuration option.  It could be that
> > > it'll
> > > > > > never get used in practice, but I cannot prove that.  I'm ok with
> > > > > removing
> > > > > > the option, let me know what you think.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <jun@confluent.io.invalid
> >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Artem,
> > > > > > >
> > > > > > > Thanks for the reply. A few more comments.
> > > > > > >
> > > > > > > 1. Since we control the implementation and the usage of
> > > > > > DefaultPartitioner,
> > > > > > > another way is to instantiate the DefaultPartitioner with a
> > special
> > > > > > > constructor, which allows it to have more access to internal
> > > > > information.
> > > > > > > Then we could just change the behavior of  DefaultPartitioner
> > such
> > > that
> > > > > > it
> > > > > > > can use the internal infoamtion when choosing the partition.
> This
> > > seems
> > > > > > > more intuitive than having DefaultPartitioner return -1
> > partition.
> > > > > > >
> > > > > > > 2. I guess partitioner.sticky.batch.size is introduced because
> > the
> > > > > > > effective batch size could be less than batch.size and we want
> to
> > > align
> > > > > > > partition switching with the effective batch size. How would a
> > user
> > > > > know
> > > > > > > the effective batch size to set partitioner.sticky.batch.size
> > > properly?
> > > > > > If
> > > > > > > the user somehow knows the effective batch size, does setting
> > > > > batch.size
> > > > > > to
> > > > > > > the effective batch size achieve the same result?
> > > > > > >
> > > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hi Jun,
> > > > > > > >
> > > > > > > > 1. Updated the KIP to add a couple paragraphs about
> > > implementation
> > > > > > > > necessities in the Proposed Changes section.
> > > > > > > >
> > > > > > > > 2. Sorry if my reply was confusing, what I meant to say (and
> I
> > > > > > elaborated
> > > > > > > > on that in point #3) is that there could be patterns for
> which
> > > 16KB
> > > > > > > > wouldn't be the most effective setting, thus it would be good
> > to
> > > make
> > > > > > it
> > > > > > > > configurable.
> > > > > > > >
> > > > > > > > 4. We could use broker readiness timeout.  But I'm not sure
> it
> > > would
> > > > > > > > correctly model the broker load.  The problem is that latency
> > is
> > > not
> > > > > an
> > > > > > > > accurate measure of throughput, we could have 2 brokers that
> > have
> > > > > equal
> > > > > > > > throughput but one has higher latency (so it takes larger
> > batches
> > > > > less
> > > > > > > > frequently, but still takes the same load).  Latency-based
> > logic
> > > is
> > > > > > > likely
> > > > > > > > to send less data to the broker with higher latency.  Using
> the
> > > queue
> > > > > > > size
> > > > > > > > would adapt to throughput, regardless of latency (which could
> > be
> > > > > just a
> > > > > > > > result of deployment topology), so that's the model chosen in
> > the
> > > > > > > > proposal.  The partition.availability.timeout.ms logic
> > > approaches
> > > > > the
> > > > > > > > model
> > > > > > > > from a slightly different angle, say we have a requirement to
> > > deliver
> > > > > > > > messages via brokers that have a certain latency, then
> > > > > > > > partition.availability.timeout.ms could be used to tune
> that.
> > > > > Latency
> > > > > > > is
> > > > > > > > a
> > > > > > > > much more volatile metric than throughput (latency depends on
> > > > > external
> > > > > > > > load, on capacity, on deployment topology, on jitter in
> > network,
> > > on
> > > > > > > jitter
> > > > > > > > in disk, etc.) and I think it would be best to leave
> > > latency-based
> > > > > > > > thresholds configurable to tune for the environment.
> > > > > > > >
> > > > > > > > -Artem
> > > > > > > >
> > > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > > <ju...@confluent.io.invalid>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Artem,
> > > > > > > > >
> > > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > > >
> > > > > > > > > 1. Perhaps you could elaborate a bit more on how the
> producer
> > > > > > > determines
> > > > > > > > > the partition if the partitioner returns -1. This will help
> > > > > > understand
> > > > > > > > why
> > > > > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > > > > >
> > > > > > > > > 2. I am not sure that I understand this part. If 15.5KB is
> > more
> > > > > > > > efficient,
> > > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > > >
> > > > > > > > > 4. Yes, we could add a switch (or a variant of the
> > > partitioner) for
> > > > > > > > > enabling this behavior. Also, choosing partitions based on
> > > broker
> > > > > > > > readiness
> > > > > > > > > can be made in a smoother way. For example, we could track
> > the
> > > last
> > > > > > > time
> > > > > > > > a
> > > > > > > > > broker has drained any batches from the accumulator. We can
> > > then
> > > > > > select
> > > > > > > > > partitions from brokers proportionally to the inverse of
> that
> > > time.
> > > > > > > This
> > > > > > > > > seems smoother than a cutoff based on a
> > > > > > > > partition.availability.timeout.ms
> > > > > > > > >  threshold.
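
A standalone sketch of what tracking that signal could look like (the class
and method names are made up, and the 1/elapsed weighting is just one possible
choice, not something specified in the KIP):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: remember when each broker last drained a batch from the
    // accumulator and derive a selection weight that decays as that time grows.
    final class DrainTimeWeightSketch {
        private final Map<Integer, Long> lastDrainMs = new ConcurrentHashMap<>();

        void onBatchDrained(int brokerId) {
            lastDrainMs.put(brokerId, System.currentTimeMillis());
        }

        double weight(int brokerId) {
            Long last = lastDrainMs.get(brokerId);
            if (last == null)
                return 1.0;                                 // no data yet, treat as ready
            long elapsedMs = Math.max(1, System.currentTimeMillis() - last);
            return 1.0 / elapsedMs;                         // recently drained => higher weight
        }
    }
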
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > > > Hello Luke, Jun,
> > > > > > > > > >
> > > > > > > > > > Thank you for your feedback.  I've added the Rejected
> > > Alternative
> > > > > > > > section
> > > > > > > > > > that may clarify some of the questions w.r.t. returning
> -1.
> > > > > > > > > >
> > > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem is
> > > that a
> > > > > > > > > significant
> > > > > > > > > > part of the logic needs to be in the producer (because it
> > now
> > > > > uses
> > > > > > > > > > information about brokers that only the producer knows),
> so
> > > > > > > > encapsulation
> > > > > > > > > > of the logic within the default partitioner isn't as
> clean.
> > > > >  I've
> > > > > > > > added
> > > > > > > > > > the Rejected Alternative section that documents an
> attempt
> > to
> > > > > keep
> > > > > > > the
> > > > > > > > > > encapsulation by providing new callbacks to the
> > partitioner.
> > > > > > > > > >
> > > > > > > > > > 2. The meaning of the partitioner.sticky.batch.size is
> > > explained
> > > > > in
> > > > > > > the
> > > > > > > > > > Uniform Sticky Batch Size section.  Basically, we track
> the
> > > > > amount
> > > > > > of
> > > > > > > > > bytes
> > > > > > > > > > produced to the partition and if it exceeds
> > > > > > > > partitioner.sticky.batch.size
> > > > > > > > > > then we switch to the next partition.  As far as the
> reason
> > > to
> > > > > make
> > > > > > > it
> > > > > > > > > > different from batch.size, I think Luke answered this
> with
> > > the
> > > > > > > question
> > > > > > > > > #3
> > > > > > > > > > -- what if the load pattern is such that 15.5KB would be
> > more
> > > > > > > efficient
> > > > > > > > > > than 16KB?
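
As a rough illustration of the byte-tracking idea described above, a
standalone Java sketch follows. The class is made up, the next partition is
chosen uniformly at random (ignoring the adaptive weighting discussed
elsewhere in the thread), and it is not the producer's actual accumulator
code.

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch: count bytes produced to the current sticky partition and switch
    // to a new partition once the configured threshold is exceeded.
    final class StickySwitchSketch {
        private final int stickyBatchSizeBytes;   // e.g. batch.size (16384 by default)
        private final int numPartitions;
        private int currentPartition = -1;
        private int producedBytes = 0;

        StickySwitchSketch(int stickyBatchSizeBytes, int numPartitions) {
            this.stickyBatchSizeBytes = stickyBatchSizeBytes;
            this.numPartitions = numPartitions;
        }

        int partitionFor(int recordSizeBytes) {
            if (currentPartition < 0 || producedBytes >= stickyBatchSizeBytes) {
                currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
                producedBytes = 0;                // start counting for the new partition
            }
            producedBytes += recordSizeBytes;
            return currentPartition;
        }
    }
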
> > > > > > > > > >
> > > > > > > > > > 3. I think it's hard to have one size that would fit all
> > > > > patterns.
> > > > > > > > E.g.
> > > > > > > > > if
> > > > > > > > > > the load pattern is such that there is linger and the app
> > > fills
> > > > > the
> > > > > > > > batch
> > > > > > > > > > before linger expires, then having 16KB would most likely
> > > > > > synchronize
> > > > > > > > > > batching and partition switching, so each partition would
> > > get a
> > > > > > full
> > > > > > > > > > batch.  If load pattern is such that there are a few
> > > non-complete
> > > > > > > > batches
> > > > > > > > > > go out before a larger batch starts to fill, then it may
> > > actually
> > > > > > be
> > > > > > > > > > beneficial to make slightly larger (e.g. linger=0, first
> > few
> > > > > > records
> > > > > > > go
> > > > > > > > > in
> > > > > > > > > > the first batch, then next few records go to second
> batch,
> > > and so
> > > > > > on,
> > > > > > > > > until
> > > > > > > > > > 5 in-flight, then larger batch would form while waiting
> for
> > > > > broker
> > > > > > to
> > > > > > > > > > respond, but the partition switch would happen before the
> > > larger
> > > > > > > batch
> > > > > > > > is
> > > > > > > > > > full).
> > > > > > > > > >
> > > > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > > > partition.availability.timeout.ms.  Luke's an Jun's
> > > questions
> > > > > are
> > > > > > > > > slightly
> > > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > > (Luke) Is the queue size a good enough signal?  I think
> > it's
> > > a
> > > > > good
> > > > > > > > > default
> > > > > > > > > > signal as it tries to preserve general fairness and not
> > > overreact
> > > > > > on
> > > > > > > > the
> > > > > > > > > > broker's state at each moment in time.  But because it's
> > > smooth,
> > > > > it
> > > > > > > may
> > > > > > > > > not
> > > > > > > > > > be reactive enough to instantaneous latency jumps.  For
> > > > > > > > latency-sensitive
> > > > > > > > > > workloads, it may be desirable to react faster when a
> > broker
> > > > > > becomes
> > > > > > > > > > unresponsive (but that may make the distribution really
> > > choppy),
> > > > > so
> > > > > > > > > > partition.availability.timeout.ms provides an
> opportunity
> > to
> > > > > tune
> > > > > > > > > > adaptiveness.
> > > > > > > > > >
> > > > > > > > > > (Jun) Can we just not assign partitions to brokers that
> are
> > > not
> > > > > > > ready?
> > > > > > > > > > Switching partitions purely based on current broker
> > readiness
> > > > > > > > information
> > > > > > > > > > can really skew workload I think (or at least I couldn't
> > > build a
> > > > > > > model
> > > > > > > > > that
> > > > > > > > > > proves that over time it's going to be generally fair), I
> > > feel
> > > > > that
> > > > > > > the
> > > > > > > > > > algorithm should try to be fair in general and use
> smoother
> > > > > signals
> > > > > > > by
> > > > > > > > > > default (e.g. a broker with choppier latency may get much
> > > less
> > > > > load
> > > > > > > > even
> > > > > > > > > > though it can handle throughput, it then may potentially
> > skew
> > > > > > > > > consumption),
> > > > > > > > > > note that the queue-size-based logic uses probabilities
> (so
> > > we
> > > > > > don't
> > > > > > > > > fully
> > > > > > > > > > remove brokers, just make it less likely) and relative
> info
> > > > > rather
> > > > > > > > than a
> > > > > > > > > > threshold (so if all brokers are heavily, but equally
> > loaded,
> > > > > they
> > > > > > > will
> > > > > > > > > get
> > > > > > > > > > equal distribution, rather than get removed because they
> > > exceed
> > > > > > some
> > > > > > > > > > threshold).  So at the very least, I would like this
> logic
> > > to be
> > > > > > > turned
> > > > > > > > > off
> > > > > > > > > > by default as it's hard to predict what it could do with
> > > > > different
> > > > > > > > > patterns
> > > > > > > > > > (which means that there would need to be some
> > > configuration).  We
> > > > > > > could
> > > > > > > > > > just not use brokers that are not ready, but again, I
> think
> > > that
> > > > > > it's
> > > > > > > > > good
> > > > > > > > > > to try to be fair under normal circumstances, so if
> > normally
> > > > > > brokers
> > > > > > > > can
> > > > > > > > > > respond under some partition.availability.timeout.ms
> > > threshold
> > > > > and
> > > > > > > the
> > > > > > > > > > application works well with those latencies, then we
> could
> > > > > > distribute
> > > > > > > > > data
> > > > > > > > > > equally between brokers that don't exceed the latencies.
> > The
> > > > > > value,
> > > > > > > of
> > > > > > > > > > course, would depend on the environment and app
> > requirements,
> > > > > hence
> > > > > > > > it's
> > > > > > > > > > configurable.
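
As a toy sketch of what "probabilities based on relative queue sizes" could
mean in practice (the 1/(queueSize+1) weighting is illustrative only, not the
weighting in the KIP; no partition is ever fully excluded):

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch: partitions whose leaders have shorter produce queues are chosen
    // with higher probability, but every partition keeps a nonzero chance.
    final class QueueWeightedChoiceSketch {
        // queueSizes[i] = current number of batches queued for partition i's leader
        static int choosePartition(int[] queueSizes) {
            if (queueSizes.length == 0)
                throw new IllegalArgumentException("no partitions");
            double[] weights = new double[queueSizes.length];
            double total = 0.0;
            for (int i = 0; i < queueSizes.length; i++) {
                weights[i] = 1.0 / (queueSizes[i] + 1);
                total += weights[i];
            }
            double r = ThreadLocalRandom.current().nextDouble(total);
            for (int i = 0; i < weights.length; i++) {
                r -= weights[i];
                if (r <= 0)
                    return i;
            }
            return weights.length - 1;            // numerical fallback
        }
    }
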
> > > > > > > > > >
> > > > > > > > > > 10. Added a statement at the beginning of the proposed
> > > changes.
> > > > > > > > > >
> > > > > > > > > > -Artem
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > > <jun@confluent.io.invalid
> > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Artem,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > > >
> > > > > > > > > > > 1. I agree with Luke that having the partitioner
> > returning
> > > -1
> > > > > is
> > > > > > > kind
> > > > > > > > > of
> > > > > > > > > > > weird. Could we just change the implementation of
> > > > > > > DefaultPartitioner
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > new behavior?
> > > > > > > > > > >
> > > > > > > > > > > 2. partitioner.sticky.batch.size: Similar question to
> > > Luke. I
> > > > > am
> > > > > > > not
> > > > > > > > > sure
> > > > > > > > > > > why we want to introduce this new configuration. Could
> we
> > > just
> > > > > > use
> > > > > > > > the
> > > > > > > > > > > existing batch.size?
> > > > > > > > > > >
> > > > > > > > > > > 4. I also agree with Luke that it's not clear why we
> need
> > > > > > > > > > > partition.availability.timeout.ms. The KIP says the
> > broker
> > > > > > "would
> > > > > > > > not
> > > > > > > > > be
> > > > > > > > > > > chosen until the broker is able to accept the next
> ready
> > > batch
> > > > > > from
> > > > > > > > the
> > > > > > > > > > > partition". If we are keeping track of this, could we
> > just
> > > > > avoid
> > > > > > > > > > assigning
> > > > > > > > > > > records to partitions whose leader is not able to
> accept
> > > the
> > > > > next
> > > > > > > > > batch?
> > > > > > > > > > If
> > > > > > > > > > > we do that, perhaps we don't need
> > > > > > > partition.availability.timeout.ms.
> > > > > > > > > > >
> > > > > > > > > > > 10. Currently, partitioner.class defaults to
> > > > > DefaultPartitioner,
> > > > > > > > which
> > > > > > > > > > uses
> > > > > > > > > > > StickyPartitioner when the key is specified. Since this
> > KIP
> > > > > > > improves
> > > > > > > > > upon
> > > > > > > > > > > StickyPartitioner, it would be useful to make the new
> > > behavior
> > > > > > the
> > > > > > > > > > default
> > > > > > > > > > > and document that in the KIP.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > > showuon@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > >
> > > > > > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > > >
> > > > > > > > > > > > mentioned, sometimes the custom partitioner would
> > return
> > > > > > negative
> > > > > > > > > > > partition
> > > > > > > > > > > > id accidentally.
> > > > > > > > > > > > If it returned -1, how could you know if it is
> expected
> > > or
> > > > > not
> > > > > > > > > > expected?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks.
> > > > > > > > > > > > Luke
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> > > showuon@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the update. I have some questions about
> > it:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Could you explain why you need the `partitioner`
> > > return
> > > > > > -1?
> > > > > > > In
> > > > > > > > > > which
> > > > > > > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > > > > > > 2. What does the "partitioner.sticky.batch.size"
> > mean?
> > > In
> > > > > the
> > > > > > > > > > > > > "Configuration" part, you didn't explain it. And
> > > default to
> > > > > > 0,
> > > > > > > I
> > > > > > > > > > guess
> > > > > > > > > > > > it's
> > > > > > > > > > > > > the same as current behavior for backward
> > > compatibility,
> > > > > > right?
> > > > > > > > You
> > > > > > > > > > > > should
> > > > > > > > > > > > > mention it.
> > > > > > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > > > > > "partitioner.sticky.batch.size". Let's say, we
> > already
> > > > > > > accumulate
> > > > > > > > > > > 15.5KB
> > > > > > > > > > > > in
> > > > > > > > > > > > > partition1, and sent. So when next batch created,
> in
> > > your
> > > > > > > current
> > > > > > > > > > > design,
> > > > > > > > > > > > > we still stick to partition1, until 16KB reached,
> and
> > > then
> > > > > we
> > > > > > > > > create
> > > > > > > > > > a
> > > > > > > > > > > > new
> > > > > > > > > > > > > batch to change to next partition, ex: partition2.
> > But
> > > I
> > > > > > think
> > > > > > > if
> > > > > > > > > we
> > > > > > > > > > > set
> > > > > > > > > > > > a
> > > > > > > > > > > > > threshold to 95% (for example), we can know
> previous
> > > 15.5KB
> > > > > > > > already
> > > > > > > > > > > > exceeds
> > > > > > > > > > > > > the threshold so that we can directly create new
> > batch
> > > for
> > > > > > next
> > > > > > > > > > > records.
> > > > > > > > > > > > > This way should be able to make it more efficient.
> > > WDYT?
> > > > > > > > > > > > > 4. I think the improved queuing logic should be
> good
> > > > > enough.
> > > > > > I
> > > > > > > > > can't
> > > > > > > > > > > get
> > > > > > > > > > > > > the benefit of having `
> > > partition.availability.timeout.ms`
> > > > > > > > config.
> > > > > > > > > In
> > > > > > > > > > > > > short, you want to make the partitioner take the
> > broker
> > > > > load
> > > > > > > into
> > > > > > > > > > > > > consideration. We can just improve that in the
> > queuing
> > > > > logic
> > > > > > > (and
> > > > > > > > > you
> > > > > > > > > > > > > already did it). Why should we add the config?
> Could
> > > you
> > > > > use
> > > > > > > some
> > > > > > > > > > > > examples
> > > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you.
> > > > > > > > > > > > > Luke
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Hello,
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Please add your comments about the KIP.  If there
> > are
> > > no
> > > > > > > > > > > considerations,
> > > > > > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> -Artem
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > > > > > > alivshits@confluent.io
> > > > > > > > > > > >
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > Hello,
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > After trying a few prototypes, I've made some
> > > changes to
> > > > > > the
> > > > > > > > > > public
> > > > > > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > >> > .
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > -Artem
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >> Hello,
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > > >> >> .
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > > > >> >>
> > https://issues.apache.org/jira/browse/KAFKA-10888,
> > > but
> > > > > > it
> > > > > > > > does
> > > > > > > > > > > > >> include a
> > > > > > > > > > > > >> >> client config change, therefore we have a KIP
> to
> > > > > discuss.
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> -Artem
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Luke Chen <sh...@gmail.com>.
Hi Artem,

Thanks for your explanation and update to the KIP.
Some comments:

5. In the description for `enable.adaptive.partitioning`, the `false` case,
you said:
> the producer will try to distribute messages uniformly.
I think we should describe the possible skew in the distribution. Otherwise,
users might be confused about why adaptive partitioning is important.

6. In the description for `partition.availability.timeout.ms`, I think the
last sentence should mention that this logic is also disabled when
`enable.adaptive.partitioning` is disabled.

7. Similar thoughts to Ismael's: I think we should have a POC and tests to
prove that this adaptive partitioning algorithm achieves more uniform
partitioning compared with the original sticky one.

Thank you.
Luke
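
A small example of how a producer could set the two options discussed above,
assuming their names stay as currently written in the KIP (they could still
change before release); the values below are arbitrary:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class AdaptiveProducerConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.ByteArraySerializer");
            // Proposed in the KIP: when true, the built-in partitioner adapts
            // partition switching to broker load (queue sizes).
            props.put("enable.adaptive.partitioning", "true");
            // Proposed in the KIP: avoid partitions whose leader has not been
            // ready for this long; 0 disables the logic.
            props.put("partition.availability.timeout.ms", "200");
            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                // send records as usual; no partitioner.class is set, so the
                // built-in partitioning logic would apply
            }
        }
    }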

On Fri, Mar 4, 2022 at 9:22 PM Ismael Juma <is...@juma.me.uk> wrote:

> Regarding `3`, we should only deprecate it if we're sure the new approach
> handles all cases better. Are we confident about that for both of the
> previous partitioners?
>
> Ismael
>
> On Fri, Mar 4, 2022 at 1:37 AM David Jacot <dj...@confluent.io.invalid>
> wrote:
>
> > Hi Artem,
> >
> > Thanks for the KIP! I have a few comments:
> >
> > 1. In the preamble of the proposed change section, there is still a
> > mention of the
> > -1 approach. My understanding is that we have moved away from it now.
> >
> > 2. I am a bit concerned by the trick suggested about the
> > DefaultPartitioner and
> > the UniformStickyPartitioner. I do agree that implementing the logic in
> the
> > producer itself is a good thing. However, it is weird from a user
> > perspective
> > that he can set a class as partitioner that is not used in the end. I
> > think that
> > this will be confusing for our users. Have we considered changing the
> > default
> > value of partitioner.class to null to indicate that the new built-in
> > partitioner
> > must be used? By default, the built-in partitioner would be used unless
> the
> > user explicitly specify one. The downside is that the new default
> behavior
> > would not work if the user explicitly specify the partitioner but we
> could
> > mitigate this with my next point.
> >
> > 3. Related to the previous point, I think that we could deprecate both
> the
> > DefaultPartitioner and the UniformStickyPartitioner. I would also add a
> > warning if one of them is explicitly provided by the user to inform them
> > that they should switch to the new built-in one. I am pretty sure that
> most
> > of the folks use the default configuration anyway.
> >
> > 4. It would be great if we could explain why the -1 way was rejected. At
> > the moment, the rejected alternative only explain the idea but does not
> > say why we rejected it.
> >
> > Best,
> > David
> >
> > On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> > >
> > > Hi Jun,
> > >
> > > 2. Removed the option from the KIP.  Now the sticky partitioning
> > threshold
> > > is hardcoded to batch.size.
> > >
> > > 20. Added the corresponding wording to the KIP.
> > >
> > > -Artem
> > >
> > > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Artem,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 1. Sounds good.
> > > >
> > > > 2. If we don't expect users to change it, we probably could just
> leave
> > out
> > > > the new config. In general, it's easy to add a new config, but hard
> to
> > > > remove an existing config.
> > > >
> > > > 20. The two new configs enable.adaptive.partitioning and
> > > > partition.availability.timeout.ms only apply to the two built-in
> > > > partitioners DefaultPartitioner and UniformStickyPartitioner, right?
> It
> > > > would be useful to document that in the KIP.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thank you for the suggestions.
> > > > >
> > > > > 1. As we discussed offline, we can hardcode the logic for
> > > > > DefaultPartitioner and UniformStickyPartitioner in the
> KafkaProducer
> > > > (i.e.
> > > > > the DefaultPartitioner.partition won't get called, instead
> > KafkaProducer
> > > > > would check if the partitioner is an instance of DefaultPartitioner
> > and
> > > > > then run the actual partitioning logic itself).  Then the change to
> > the
> > > > > Partitioner wouldn't be required.  I'll update the KIP to reflect
> > that.
> > > > >
> > > > > 2. I don't expect users to change this too often, as changing it
> > would
> > > > > require a bit of studying of the production patterns.  As a general
> > > > > principle, if I can think of a model that requires a deviation from
> > > > > default, I tend to add a configuration option.  It could be that
> > it'll
> > > > > never get used in practice, but I cannot prove that.  I'm ok with
> > > > removing
> > > > > the option, let me know what you think.
> > > > >
> > > > > -Artem
> > > > >
> > > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <ju...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, Artem,
> > > > > >
> > > > > > Thanks for the reply. A few more comments.
> > > > > >
> > > > > > 1. Since we control the implementation and the usage of
> > > > > DefaultPartitioner,
> > > > > > another way is to instantiate the DefaultPartitioner with a
> special
> > > > > > constructor, which allows it to have more access to internal
> > > > information.
> > > > > > Then we could just change the behavior of  DefaultPartitioner
> such
> > that
> > > > > it
> > > > > > can use the internal infoamtion when choosing the partition. This
> > seems
> > > > > > more intuitive than having DefaultPartitioner return -1
> partition.
> > > > > >
> > > > > > 2. I guess partitioner.sticky.batch.size is introduced because
> the
> > > > > > effective batch size could be less than batch.size and we want to
> > align
> > > > > > partition switching with the effective batch size. How would a
> user
> > > > know
> > > > > > the effective batch size to set partitioner.sticky.batch.size
> > properly?
> > > > > If
> > > > > > the user somehow knows the effective batch size, does setting
> > > > batch.size
> > > > > to
> > > > > > the effective batch size achieve the same result?
> > > > > >
> > > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > > <al...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > 1. Updated the KIP to add a couple paragraphs about
> > implementation
> > > > > > > necessities in the Proposed Changes section.
> > > > > > >
> > > > > > > 2. Sorry if my reply was confusing, what I meant to say (and I
> > > > > elaborated
> > > > > > > on that in point #3) is that there could be patterns for which
> > 16KB
> > > > > > > wouldn't be the most effective setting, thus it would be good
> to
> > make
> > > > > it
> > > > > > > configurable.
> > > > > > >
> > > > > > > 4. We could use broker readiness timeout.  But I'm not sure it
> > would
> > > > > > > correctly model the broker load.  The problem is that latency
> is
> > not
> > > > an
> > > > > > > accurate measure of throughput, we could have 2 brokers that
> have
> > > > equal
> > > > > > > throughput but one has higher latency (so it takes larger
> batches
> > > > less
> > > > > > > frequently, but still takes the same load).  Latency-based
> logic
> > is
> > > > > > likely
> > > > > > > to send less data to the broker with higher latency.  Using the
> > queue
> > > > > > size
> > > > > > > would adapt to throughput, regardless of latency (which could
> be
> > > > just a
> > > > > > > result of deployment topology), so that's the model chosen in
> the
> > > > > > > proposal.  The partition.availability.timeout.ms logic
> > approaches
> > > > the
> > > > > > > model
> > > > > > > from a slightly different angle, say we have a requirement to
> > deliver
> > > > > > > messages via brokers that have a certain latency, then
> > > > > > > partition.availability.timeout.ms could be used to tune that.
> > > > Latency
> > > > > > is
> > > > > > > a
> > > > > > > much more volatile metric than throughput (latency depends on
> > > > external
> > > > > > > load, on capacity, on deployment topology, on jitter in
> network,
> > on
> > > > > > jitter
> > > > > > > in disk, etc.) and I think it would be best to leave
> > latency-based
> > > > > > > thresholds configurable to tune for the environment.
> > > > > > >
> > > > > > > -Artem
> > > > > > >
> > > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> > <ju...@confluent.io.invalid>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, Artem,
> > > > > > > >
> > > > > > > > Thanks for the reply. A few more comments.
> > > > > > > >
> > > > > > > > 1. Perhaps you could elaborate a bit more on how the producer
> > > > > > determines
> > > > > > > > the partition if the partitioner returns -1. This will help
> > > > > understand
> > > > > > > why
> > > > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > > > >
> > > > > > > > 2. I am not sure that I understand this part. If 15.5KB is
> more
> > > > > > > efficient,
> > > > > > > > could we just set batch.size to 15.5KB?
> > > > > > > >
> > > > > > > > 4. Yes, we could add a switch (or a variant of the
> > partitioner) for
> > > > > > > > enabling this behavior. Also, choosing partitions based on
> > broker
> > > > > > > readiness
> > > > > > > > can be made in a smoother way. For example, we could track
> the
> > last
> > > > > > time
> > > > > > > a
> > > > > > > > broker has drained any batches from the accumulator. We can
> > then
> > > > > select
> > > > > > > > partitions from brokers proportionally to the inverse of that
> > time.
> > > > > > This
> > > > > > > > seems smoother than a cutoff based on a
> > > > > > > partition.availability.timeout.ms
> > > > > > > >  threshold.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hello Luke, Jun,
> > > > > > > > >
> > > > > > > > > Thank you for your feedback.  I've added the Rejected
> > Alternative
> > > > > > > section
> > > > > > > > > that may clarify some of the questions w.r.t. returning -1.
> > > > > > > > >
> > > > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem is
> > that a
> > > > > > > > significant
> > > > > > > > > part of the logic needs to be in the producer (because it
> now
> > > > uses
> > > > > > > > > information about brokers that only the producer knows), so
> > > > > > > encapsulation
> > > > > > > > > of the logic within the default partitioner isn't as clean.
> > > >  I've
> > > > > > > added
> > > > > > > > > the Rejected Alternative section that documents an attempt
> to
> > > > keep
> > > > > > the
> > > > > > > > > encapsulation by providing new callbacks to the
> partitioner.
> > > > > > > > >
> > > > > > > > > 2. The meaning of the partitioner.sticky.batch.size is
> > explained
> > > > in
> > > > > > the
> > > > > > > > > Uniform Sticky Batch Size section.  Basically, we track the
> > > > amount
> > > > > of
> > > > > > > > bytes
> > > > > > > > > produced to the partition and if it exceeds
> > > > > > > partitioner.sticky.batch.size
> > > > > > > > > then we switch to the next partition.  As far as the reason
> > to
> > > > make
> > > > > > it
> > > > > > > > > different from batch.size, I think Luke answered this with
> > the
> > > > > > question
> > > > > > > > #3
> > > > > > > > > -- what if the load pattern is such that 15.5KB would be
> more
> > > > > > efficient
> > > > > > > > > than 16KB?
> > > > > > > > >
> > > > > > > > > 3. I think it's hard to have one size that would fit all
> > > > patterns.
> > > > > > > E.g.
> > > > > > > > if
> > > > > > > > > the load pattern is such that there is linger and the app
> > fills
> > > > the
> > > > > > > batch
> > > > > > > > > before linger expires, then having 16KB would most likely
> > > > > synchronize
> > > > > > > > > batching and partition switching, so each partition would
> > get a
> > > > > full
> > > > > > > > > batch.  If load pattern is such that there are a few
> > non-complete
> > > > > > > batches
> > > > > > > > > go out before a larger batch starts to fill, then it may
> > actually
> > > > > be
> > > > > > > > > beneficial to make slightly larger (e.g. linger=0, first
> few
> > > > > records
> > > > > > go
> > > > > > > > in
> > > > > > > > > the first batch, then next few records go to second batch,
> > and so
> > > > > on,
> > > > > > > > until
> > > > > > > > > 5 in-flight, then larger batch would form while waiting for
> > > > broker
> > > > > to
> > > > > > > > > respond, but the partition switch would happen before the
> > larger
> > > > > > batch
> > > > > > > is
> > > > > > > > > full).
> > > > > > > > >
> > > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > > partition.availability.timeout.ms.  Luke's an Jun's
> > questions
> > > > are
> > > > > > > > slightly
> > > > > > > > > different, so I'm going to separate replies.
> > > > > > > > > (Luke) Is the queue size a good enough signal?  I think
> it's
> > a
> > > > good
> > > > > > > > default
> > > > > > > > > signal as it tries to preserve general fairness and not
> > overreact
> > > > > on
> > > > > > > the
> > > > > > > > > broker's state at each moment in time.  But because it's
> > smooth,
> > > > it
> > > > > > may
> > > > > > > > not
> > > > > > > > > be reactive enough to instantaneous latency jumps.  For
> > > > > > > latency-sensitive
> > > > > > > > > workloads, it may be desirable to react faster when a
> broker
> > > > > becomes
> > > > > > > > > unresponsive (but that may make the distribution really
> > choppy),
> > > > so
> > > > > > > > > partition.availability.timeout.ms provides an opportunity
> to
> > > > tune
> > > > > > > > > adaptiveness.
> > > > > > > > >
> > > > > > > > > (Jun) Can we just not assign partitions to brokers that are
> > not
> > > > > > ready?
> > > > > > > > > Switching partitions purely based on current broker
> readiness
> > > > > > > information
> > > > > > > > > can really skew workload I think (or at least I couldn't
> > build a
> > > > > > model
> > > > > > > > that
> > > > > > > > > proves that over time it's going to be generally fair), I
> > feel
> > > > that
> > > > > > the
> > > > > > > > > algorithm should try to be fair in general and use smoother
> > > > signals
> > > > > > by
> > > > > > > > > default (e.g. a broker with choppier latency may get much
> > less
> > > > load
> > > > > > > even
> > > > > > > > > though it can handle throughput, it then may potentially
> skew
> > > > > > > > consumption),
> > > > > > > > > note that the queue-size-based logic uses probabilities (so
> > we
> > > > > don't
> > > > > > > > fully
> > > > > > > > > remove brokers, just make it less likely) and relative info
> > > > rather
> > > > > > > than a
> > > > > > > > > threshold (so if all brokers are heavily, but equally
> loaded,
> > > > they
> > > > > > will
> > > > > > > > get
> > > > > > > > > equal distribution, rather than get removed because they
> > exceed
> > > > > some
> > > > > > > > > threshold).  So at the very least, I would like this logic
> > to be
> > > > > > turned
> > > > > > > > off
> > > > > > > > > by default as it's hard to predict what it could do with
> > > > different
> > > > > > > > patterns
> > > > > > > > > (which means that there would need to be some
> > configuration).  We
> > > > > > could
> > > > > > > > > just not use brokers that are not ready, but again, I think
> > that
> > > > > it's
> > > > > > > > good
> > > > > > > > > to try to be fair under normal circumstances, so if
> normally
> > > > > brokers
> > > > > > > can
> > > > > > > > > respond under some partition.availability.timeout.ms
> > threshold
> > > > and
> > > > > > the
> > > > > > > > > application works well with those latencies, then we could
> > > > > distribute
> > > > > > > > data
> > > > > > > > > equally between brokers that don't exceed the latencies.
> The
> > > > > value,
> > > > > > of
> > > > > > > > > course, would depend on the environment and app
> requirements,
> > > > hence
> > > > > > > it's
> > > > > > > > > configurable.
> > > > > > > > >
> > > > > > > > > 10. Added a statement at the beginning of the proposed
> > changes.
> > > > > > > > >
> > > > > > > > > -Artem
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> > <jun@confluent.io.invalid
> > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Artem,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > > >
> > > > > > > > > > 1. I agree with Luke that having the partitioner
> returning
> > -1
> > > > is
> > > > > > kind
> > > > > > > > of
> > > > > > > > > > weird. Could we just change the implementation of
> > > > > > DefaultPartitioner
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > new behavior?
> > > > > > > > > >
> > > > > > > > > > 2. partitioner.sticky.batch.size: Similar question to
> > Luke. I
> > > > am
> > > > > > not
> > > > > > > > sure
> > > > > > > > > > why we want to introduce this new configuration. Could we
> > just
> > > > > use
> > > > > > > the
> > > > > > > > > > existing batch.size?
> > > > > > > > > >
> > > > > > > > > > 4. I also agree with Luke that it's not clear why we need
> > > > > > > > > > partition.availability.timeout.ms. The KIP says the
> broker
> > > > > "would
> > > > > > > not
> > > > > > > > be
> > > > > > > > > > chosen until the broker is able to accept the next ready
> > batch
> > > > > from
> > > > > > > the
> > > > > > > > > > partition". If we are keeping track of this, could we
> just
> > > > avoid
> > > > > > > > > assigning
> > > > > > > > > > records to partitions whose leader is not able to accept
> > the
> > > > next
> > > > > > > > batch?
> > > > > > > > > If
> > > > > > > > > > we do that, perhaps we don't need
> > > > > > partition.availability.timeout.ms.
> > > > > > > > > >
> > > > > > > > > > 10. Currently, partitioner.class defaults to
> > > > DefaultPartitioner,
> > > > > > > which
> > > > > > > > > uses
> > > > > > > > > > StickyPartitioner when the key is specified. Since this
> KIP
> > > > > > improves
> > > > > > > > upon
> > > > > > > > > > StickyPartitioner, it would be useful to make the new
> > behavior
> > > > > the
> > > > > > > > > default
> > > > > > > > > > and document that in the KIP.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> > showuon@gmail.com>
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Artem,
> > > > > > > > > > >
> > > > > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > > >
> > > > > > > > > > > mentioned, sometimes the custom partitioner would
> return
> > > > > negative
> > > > > > > > > > partition
> > > > > > > > > > > id accidentally.
> > > > > > > > > > > If it returned -1, how could you know if it is expected
> > or
> > > > not
> > > > > > > > > expected?
> > > > > > > > > > >
> > > > > > > > > > > Thanks.
> > > > > > > > > > > Luke
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> > showuon@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Artem,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the update. I have some questions about
> it:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Could you explain why you need the `partitioner`
> > return
> > > > > -1?
> > > > > > In
> > > > > > > > > which
> > > > > > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > > > > > 2. What does the "partitioner.sticky.batch.size"
> mean?
> > In
> > > > the
> > > > > > > > > > > > "Configuration" part, you didn't explain it. And
> > default to
> > > > > 0,
> > > > > > I
> > > > > > > > > guess
> > > > > > > > > > > it's
> > > > > > > > > > > > the same as current behavior for backward
> > compatibility,
> > > > > right?
> > > > > > > You
> > > > > > > > > > > should
> > > > > > > > > > > > mention it.
> > > > > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > > > > "partitioner.sticky.batch.size". Let's say, we
> already
> > > > > > accumulate
> > > > > > > > > > 15.5KB
> > > > > > > > > > > in
> > > > > > > > > > > > partition1, and sent. So when next batch created, in
> > your
> > > > > > current
> > > > > > > > > > design,
> > > > > > > > > > > > we still stick to partition1, until 16KB reached, and
> > then
> > > > we
> > > > > > > > create
> > > > > > > > > a
> > > > > > > > > > > new
> > > > > > > > > > > > batch to change to next partition, ex: partition2.
> But
> > I
> > > > > think
> > > > > > if
> > > > > > > > we
> > > > > > > > > > set
> > > > > > > > > > > a
> > > > > > > > > > > > threshold to 95% (for example), we can know previous
> > 15.5KB
> > > > > > > already
> > > > > > > > > > > exceeds
> > > > > > > > > > > > the threshold so that we can directly create new
> batch
> > for
> > > > > next
> > > > > > > > > > records.
> > > > > > > > > > > > This way should be able to make it more efficient.
> > WDYT?
> > > > > > > > > > > > 4. I think the improved queuing logic should be good
> > > > enough.
> > > > > I
> > > > > > > > can't
> > > > > > > > > > get
> > > > > > > > > > > > the benefit of having `
> > partition.availability.timeout.ms`
> > > > > > > config.
> > > > > > > > In
> > > > > > > > > > > > short, you want to make the partitioner take the
> broker
> > > > load
> > > > > > into
> > > > > > > > > > > > consideration. We can just improve that in the
> queuing
> > > > logic
> > > > > > (and
> > > > > > > > you
> > > > > > > > > > > > already did it). Why should we add the config? Could
> > you
> > > > use
> > > > > > some
> > > > > > > > > > > examples
> > > > > > > > > > > > to explain why we need it.
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you.
> > > > > > > > > > > > Luke
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Hello,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Please add your comments about the KIP.  If there
> are
> > no
> > > > > > > > > > considerations,
> > > > > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > > > > >>
> > > > > > > > > > > >> -Artem
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > > > > > alivshits@confluent.io
> > > > > > > > > > >
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Hello,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > After trying a few prototypes, I've made some
> > changes to
> > > > > the
> > > > > > > > > public
> > > > > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > >> > .
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > -Artem
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > > >> > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> Hello,
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > > >> >>
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > > >> >> .
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > > >> >>
> https://issues.apache.org/jira/browse/KAFKA-10888,
> > but
> > > > > it
> > > > > > > does
> > > > > > > > > > > >> include a
> > > > > > > > > > > >> >> client config change, therefore we have a KIP to
> > > > discuss.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> -Artem
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Ismael Juma <is...@juma.me.uk>.
Regarding `3`, we should only deprecate it if we're sure the new approach
handles all cases better. Are we confident about that for both of the
previous partitioners?

Ismael

On Fri, Mar 4, 2022 at 1:37 AM David Jacot <dj...@confluent.io.invalid>
wrote:

> Hi Artem,
>
> Thanks for the KIP! I have a few comments:
>
> 1. In the preamble of the proposed change section, there is still a
> mention of the
> -1 approach. My understanding is that we have moved away from it now.
>
> 2. I am a bit concerned by the trick suggested about the
> DefaultPartitioner and
> the UniformStickyPartitioner. I do agree that implementing the logic in the
> producer itself is a good thing. However, it is weird from a user
> perspective
> that he can set a class as partitioner that is not used in the end. I
> think that
> this will be confusing for our users. Have we considered changing the
> default
> value of partitioner.class to null to indicate that the new built-in
> partitioner
> must be used? By default, the built-in partitioner would be used unless the
> user explicitly specify one. The downside is that the new default behavior
> would not work if the user explicitly specify the partitioner but we could
> mitigate this with my next point.
>
> 3. Related to the previous point, I think that we could deprecate both the
> DefaultPartitioner and the UniformStickyPartitioner. I would also add a
> warning if one of them is explicitly provided by the user to inform them
> that they should switch to the new built-in one. I am pretty sure that most
> of the folks use the default configuration anyway.
>
> 4. It would be great if we could explain why the -1 way was rejected. At
> the moment, the rejected alternative only explain the idea but does not
> say why we rejected it.
>
> Best,
> David
>
> On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
> <al...@confluent.io.invalid> wrote:
> >
> > Hi Jun,
> >
> > 2. Removed the option from the KIP.  Now the sticky partitioning
> threshold
> > is hardcoded to batch.size.
> >
> > 20. Added the corresponding wording to the KIP.
> >
> > -Artem
> >
> > On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply.
> > >
> > > 1. Sounds good.
> > >
> > > 2. If we don't expect users to change it, we probably could just leave
> out
> > > the new config. In general, it's easy to add a new config, but hard to
> > > remove an existing config.
> > >
> > > 20. The two new configs enable.adaptive.partitioning and
> > > partition.availability.timeout.ms only apply to the two built-in
> > > partitioners DefaultPartitioner and UniformStickyPartitioner, right? It
> > > would be useful to document that in the KIP.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thank you for the suggestions.
> > > >
> > > > 1. As we discussed offline, we can hardcode the logic for
> > > > DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer
> > > (i.e.
> > > > the DefaultPartitioner.partition won't get called, instead
> KafkaProducer
> > > > would check if the partitioner is an instance of DefaultPartitioner
> and
> > > > then run the actual partitioning logic itself).  Then the change to
> the
> > > > Partitioner wouldn't be required.  I'll update the KIP to reflect
> that.
> > > >
> > > > 2. I don't expect users to change this too often, as changing it
> would
> > > > require a bit of studying of the production patterns.  As a general
> > > > principle, if I can think of a model that requires a deviation from
> > > > default, I tend to add a configuration option.  It could be that
> it'll
> > > > never get used in practice, but I cannot prove that.  I'm ok with
> > > removing
> > > > the option, let me know what you think.
> > > >
> > > > -Artem
> > > >
> > > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the reply. A few more comments.
> > > > >
> > > > > 1. Since we control the implementation and the usage of
> > > > DefaultPartitioner,
> > > > > another way is to instantiate the DefaultPartitioner with a special
> > > > > constructor, which allows it to have more access to internal
> > > information.
> > > > > Then we could just change the behavior of  DefaultPartitioner such
> that
> > > > it
> > > > > can use the internal infoamtion when choosing the partition. This
> seems
> > > > > more intuitive than having DefaultPartitioner return -1 partition.
> > > > >
> > > > > 2. I guess partitioner.sticky.batch.size is introduced because the
> > > > > effective batch size could be less than batch.size and we want to
> align
> > > > > partition switching with the effective batch size. How would a user
> > > know
> > > > > the effective batch size to set partitioner.sticky.batch.size
> properly?
> > > > If
> > > > > the user somehow knows the effective batch size, does setting
> > > batch.size
> > > > to
> > > > > the effective batch size achieve the same result?
> > > > >
> > > > > 4. Thanks for the explanation. Makes sense to me.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > 1. Updated the KIP to add a couple paragraphs about
> implementation
> > > > > > necessities in the Proposed Changes section.
> > > > > >
> > > > > > 2. Sorry if my reply was confusing, what I meant to say (and I
> > > > elaborated
> > > > > > on that in point #3) is that there could be patterns for which
> 16KB
> > > > > > wouldn't be the most effective setting, thus it would be good to
> make
> > > > it
> > > > > > configurable.
> > > > > >
> > > > > > 4. We could use broker readiness timeout.  But I'm not sure it
> would
> > > > > > correctly model the broker load.  The problem is that latency is
> not
> > > an
> > > > > > accurate measure of throughput, we could have 2 brokers that have
> > > equal
> > > > > > throughput but one has higher latency (so it takes larger batches
> > > less
> > > > > > frequently, but still takes the same load).  Latency-based logic
> is
> > > > > likely
> > > > > > to send less data to the broker with higher latency.  Using the
> queue
> > > > > size
> > > > > > would adapt to throughput, regardless of latency (which could be
> > > just a
> > > > > > result of deployment topology), so that's the model chosen in the
> > > > > > proposal.  The partition.availability.timeout.ms logic
> approaches
> > > the
> > > > > > model
> > > > > > from a slightly different angle, say we have a requirement to
> deliver
> > > > > > messages via brokers that have a certain latency, then
> > > > > > partition.availability.timeout.ms could be used to tune that.
> > > Latency
> > > > > is
> > > > > > a
> > > > > > much more volatile metric than throughput (latency depends on
> > > external
> > > > > > load, on capacity, on deployment topology, on jitter in network,
> on
> > > > > jitter
> > > > > > in disk, etc.) and I think it would be best to leave
> latency-based
> > > > > > thresholds configurable to tune for the environment.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao
> <ju...@confluent.io.invalid>
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Artem,
> > > > > > >
> > > > > > > Thanks for the reply. A few more comments.
> > > > > > >
> > > > > > > 1. Perhaps you could elaborate a bit more on how the producer
> > > > > determines
> > > > > > > the partition if the partitioner returns -1. This will help
> > > > understand
> > > > > > why
> > > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > > >
> > > > > > > 2. I am not sure that I understand this part. If 15.5KB is more
> > > > > > efficient,
> > > > > > > could we just set batch.size to 15.5KB?
> > > > > > >
> > > > > > > 4. Yes, we could add a switch (or a variant of the
> partitioner) for
> > > > > > > enabling this behavior. Also, choosing partitions based on
> broker
> > > > > > readiness
> > > > > > > can be made in a smoother way. For example, we could track the
> last
> > > > > time
> > > > > > a
> > > > > > > broker has drained any batches from the accumulator. We can
> then
> > > > select
> > > > > > > partitions from brokers proportionally to the inverse of that
> time.
> > > > > This
> > > > > > > seems smoother than a cutoff based on a
> > > > > > partition.availability.timeout.ms
> > > > > > >  threshold.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hello Luke, Jun,
> > > > > > > >
> > > > > > > > Thank you for your feedback.  I've added the Rejected
> Alternative
> > > > > > section
> > > > > > > > that may clarify some of the questions w.r.t. returning -1.
> > > > > > > >
> > > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem is
> that a
> > > > > > > significant
> > > > > > > > part of the logic needs to be in the producer (because it now
> > > uses
> > > > > > > > information about brokers that only the producer knows), so
> > > > > > encapsulation
> > > > > > > > of the logic within the default partitioner isn't as clean.
> > >  I've
> > > > > > added
> > > > > > > > the Rejected Alternative section that documents an attempt to
> > > keep
> > > > > the
> > > > > > > > encapsulation by providing new callbacks to the partitioner.
> > > > > > > >
> > > > > > > > 2. The meaning of the partitioner.sticky.batch.size is
> explained
> > > in
> > > > > the
> > > > > > > > Uniform Sticky Batch Size section.  Basically, we track the
> > > amount
> > > > of
> > > > > > > bytes
> > > > > > > > produced to the partition and if it exceeds
> > > > > > partitioner.sticky.batch.size
> > > > > > > > then we switch to the next partition.  As far as the reason
> to
> > > make
> > > > > it
> > > > > > > > different from batch.size, I think Luke answered this with
> the
> > > > > question
> > > > > > > #3
> > > > > > > > -- what if the load pattern is such that 15.5KB would be more
> > > > > efficient
> > > > > > > > than 16KB?
> > > > > > > >
> > > > > > > > 3. I think it's hard to have one size that would fit all
> > > patterns.
> > > > > > E.g.
> > > > > > > if
> > > > > > > > the load pattern is such that there is linger and the app
> fills
> > > the
> > > > > > batch
> > > > > > > > before linger expires, then having 16KB would most likely
> > > > synchronize
> > > > > > > > batching and partition switching, so each partition would
> get a
> > > > full
> > > > > > > > batch.  If load pattern is such that there are a few
> non-complete
> > > > > > batches
> > > > > > > > go out before a larger batch starts to fill, then it may
> actually
> > > > be
> > > > > > > > beneficial to make slightly larger (e.g. linger=0, first few
> > > > records
> > > > > go
> > > > > > > in
> > > > > > > > the first batch, then next few records go to second batch,
> and so
> > > > on,
> > > > > > > until
> > > > > > > > 5 in-flight, then larger batch would form while waiting for
> > > broker
> > > > to
> > > > > > > > respond, but the partition switch would happen before the
> larger
> > > > > batch
> > > > > > is
> > > > > > > > full).
> > > > > > > >
> > > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > > partition.availability.timeout.ms.  Luke's an Jun's
> questions
> > > are
> > > > > > > slightly
> > > > > > > > different, so I'm going to separate replies.
> > > > > > > > (Luke) Is the queue size a good enough signal?  I think it's
> a
> > > good
> > > > > > > default
> > > > > > > > signal as it tries to preserve general fairness and not
> overreact
> > > > on
> > > > > > the
> > > > > > > > broker's state at each moment in time.  But because it's
> smooth,
> > > it
> > > > > may
> > > > > > > not
> > > > > > > > be reactive enough to instantaneous latency jumps.  For
> > > > > > latency-sensitive
> > > > > > > > workloads, it may be desirable to react faster when a broker
> > > > becomes
> > > > > > > > unresponsive (but that may make the distribution really
> choppy),
> > > so
> > > > > > > > partition.availability.timeout.ms provides an opportunity to
> > > tune
> > > > > > > > adaptiveness.
> > > > > > > >
> > > > > > > > (Jun) Can we just not assign partitions to brokers that are
> not
> > > > > ready?
> > > > > > > > Switching partitions purely based on current broker readiness
> > > > > > information
> > > > > > > > can really skew workload I think (or at least I couldn't
> build a
> > > > > model
> > > > > > > that
> > > > > > > > proves that over time it's going to be generally fair), I
> feel
> > > that
> > > > > the
> > > > > > > > algorithm should try to be fair in general and use smoother
> > > signals
> > > > > by
> > > > > > > > default (e.g. a broker with choppier latency may get much
> less
> > > load
> > > > > > even
> > > > > > > > though it can handle throughput, it then may potentially skew
> > > > > > > consumption),
> > > > > > > > note that the queue-size-based logic uses probabilities (so
> we
> > > > don't
> > > > > > > fully
> > > > > > > > remove brokers, just make it less likely) and relative info
> > > rather
> > > > > > than a
> > > > > > > > threshold (so if all brokers are heavily, but equally loaded,
> > > they
> > > > > will
> > > > > > > get
> > > > > > > > equal distribution, rather than get removed because they
> exceed
> > > > some
> > > > > > > > threshold).  So at the very least, I would like this logic
> to be
> > > > > turned
> > > > > > > off
> > > > > > > > by default as it's hard to predict what it could do with
> > > different
> > > > > > > patterns
> > > > > > > > (which means that there would need to be some
> configuration).  We
> > > > > could
> > > > > > > > just not use brokers that are not ready, but again, I think
> that
> > > > it's
> > > > > > > good
> > > > > > > > to try to be fair under normal circumstances, so if normally
> > > > brokers
> > > > > > can
> > > > > > > > respond under some partition.availability.timeout.ms
> threshold
> > > and
> > > > > the
> > > > > > > > application works well with those latencies, then we could
> > > > distribute
> > > > > > > data
> > > > > > > > equally between brokers that don't exceed the latencies.  The
> > > > value,
> > > > > of
> > > > > > > > course, would depend on the environment and app requirements,
> > > hence
> > > > > > it's
> > > > > > > > configurable.
> > > > > > > >
> > > > > > > > 10. Added a statement at the beginning of the proposed
> changes.
> > > > > > > >
> > > > > > > > -Artem
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao
> <jun@confluent.io.invalid
> > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi, Artem,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > > >
> > > > > > > > > 1. I agree with Luke that having the partitioner returning
> -1
> > > is
> > > > > kind
> > > > > > > of
> > > > > > > > > weird. Could we just change the implementation of
> > > > > DefaultPartitioner
> > > > > > to
> > > > > > > > the
> > > > > > > > > new behavior?
> > > > > > > > >
> > > > > > > > > 2. partitioner.sticky.batch.size: Similar question to
> Luke. I
> > > am
> > > > > not
> > > > > > > sure
> > > > > > > > > why we want to introduce this new configuration. Could we
> just
> > > > use
> > > > > > the
> > > > > > > > > existing batch.size?
> > > > > > > > >
> > > > > > > > > 4. I also agree with Luke that it's not clear why we need
> > > > > > > > > partition.availability.timeout.ms. The KIP says the broker
> > > > "would
> > > > > > not
> > > > > > > be
> > > > > > > > > chosen until the broker is able to accept the next ready
> batch
> > > > from
> > > > > > the
> > > > > > > > > partition". If we are keeping track of this, could we just
> > > avoid
> > > > > > > > assigning
> > > > > > > > > records to partitions whose leader is not able to accept
> the
> > > next
> > > > > > > batch?
> > > > > > > > If
> > > > > > > > > we do that, perhaps we don't need
> > > > > partition.availability.timeout.ms.
> > > > > > > > >
> > > > > > > > > 10. Currently, partitioner.class defaults to
> > > DefaultPartitioner,
> > > > > > which
> > > > > > > > uses
> > > > > > > > > StickyPartitioner when the key is specified. Since this KIP
> > > > > improves
> > > > > > > upon
> > > > > > > > > StickyPartitioner, it would be useful to make the new
> behavior
> > > > the
> > > > > > > > default
> > > > > > > > > and document that in the KIP.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Jun
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <
> showuon@gmail.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Artem,
> > > > > > > > > >
> > > > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > > >
> > > > > > > > > > mentioned, sometimes the custom partitioner would return
> > > > negative
> > > > > > > > > partition
> > > > > > > > > > id accidentally.
> > > > > > > > > > If it returned -1, how could you know if it is expected
> or
> > > not
> > > > > > > > expected?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > > Luke
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <
> showuon@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Artem,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the update. I have some questions about it:
> > > > > > > > > > >
> > > > > > > > > > > 1. Could you explain why you need the `partitioner`
> return
> > > > -1?
> > > > > In
> > > > > > > > which
> > > > > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > > > > 2. What does the "partitioner.sticky.batch.size" mean?
> In
> > > the
> > > > > > > > > > > "Configuration" part, you didn't explain it. And
> default to
> > > > 0,
> > > > > I
> > > > > > > > guess
> > > > > > > > > > it's
> > > > > > > > > > > the same as current behavior for backward
> compatibility,
> > > > right?
> > > > > > You
> > > > > > > > > > should
> > > > > > > > > > > mention it.
> > > > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > > > "partitioner.sticky.batch.size". Let's say, we already
> > > > > accumulate
> > > > > > > > > 15.5KB
> > > > > > > > > > in
> > > > > > > > > > > partition1, and sent. So when next batch created, in
> your
> > > > > current
> > > > > > > > > design,
> > > > > > > > > > > we still stick to partition1, until 16KB reached, and
> then
> > > we
> > > > > > > create
> > > > > > > > a
> > > > > > > > > > new
> > > > > > > > > > > batch to change to next partition, ex: partition2. But
> I
> > > > think
> > > > > if
> > > > > > > we
> > > > > > > > > set
> > > > > > > > > > a
> > > > > > > > > > > threshold to 95% (for example), we can know previous
> 15.5KB
> > > > > > already
> > > > > > > > > > exceeds
> > > > > > > > > > > the threshold so that we can directly create new batch
> for
> > > > next
> > > > > > > > > records.
> > > > > > > > > > > This way should be able to make it more efficient.
> WDYT?
> > > > > > > > > > > 4. I think the improved queuing logic should be good
> > > enough.
> > > > I
> > > > > > > can't
> > > > > > > > > get
> > > > > > > > > > > the benefit of having `
> partition.availability.timeout.ms`
> > > > > > config.
> > > > > > > In
> > > > > > > > > > > short, you want to make the partitioner take the broker
> > > load
> > > > > into
> > > > > > > > > > > consideration. We can just improve that in the queuing
> > > logic
> > > > > (and
> > > > > > > you
> > > > > > > > > > > already did it). Why should we add the config? Could
> you
> > > use
> > > > > some
> > > > > > > > > > examples
> > > > > > > > > > > to explain why we need it.
> > > > > > > > > > >
> > > > > > > > > > > Thank you.
> > > > > > > > > > > Luke
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Hello,
> > > > > > > > > > >>
> > > > > > > > > > >> Please add your comments about the KIP.  If there are
> no
> > > > > > > > > considerations,
> > > > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > > > >>
> > > > > > > > > > >> -Artem
> > > > > > > > > > >>
> > > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > > > > alivshits@confluent.io
> > > > > > > > > >
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Hello,
> > > > > > > > > > >> >
> > > > > > > > > > >> > After trying a few prototypes, I've made some
> changes to
> > > > the
> > > > > > > > public
> > > > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > >> > .
> > > > > > > > > > >> >
> > > > > > > > > > >> > -Artem
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > > > > > alivshits@confluent.io>
> > > > > > > > > > >> > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> >> Hello,
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > > >> >>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > > >> >> .
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888,
> but
> > > > it
> > > > > > does
> > > > > > > > > > >> include a
> > > > > > > > > > >> >> client config change, therefore we have a KIP to
> > > discuss.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> -Artem
> > > > > > > > > > >> >>
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by David Jacot <dj...@confluent.io.INVALID>.
Hi Artem,

Thanks for the KIP! I have a few comments:

1. In the preamble of the Proposed Changes section, there is still a mention
of the -1 approach. My understanding is that we have moved away from it now.

2. I am a bit concerned by the trick suggested about the DefaultPartitioner and
the UniformStickyPartitioner. I do agree that implementing the logic in the
producer itself is a good thing. However, it is weird from a user perspective
that one can set a partitioner class that is not used in the end. I think that
this will be confusing for our users. Have we considered changing the default
value of partitioner.class to null to indicate that the new built-in partitioner
must be used? By default, the built-in partitioner would be used unless the
user explicitly specifies one. The downside is that the new default behavior
would not work if the user explicitly specifies a partitioner, but we could
mitigate this with my next point (see the sketch after point 4).

3. Related to the previous point, I think that we could deprecate both the
DefaultPartitioner and the UniformStickyPartitioner. I would also add a
warning if one of them is explicitly provided by the user to inform them
that they should switch to the new built-in one. I am pretty sure that most
of the folks use the default configuration anyway.

4. It would be great if we could explain why the -1 way was rejected. At
the moment, the rejected alternative only explains the idea but does not
say why we rejected it.
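
To make points 2 and 3 a bit more concrete, here is a rough sketch (the helper
class and its wiring are my own illustration, not the KIP's actual code) of how
the producer could fall back to the built-in logic when partitioner.class is
unset and warn when one of the deprecated partitioners is configured explicitly:

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.clients.producer.UniformStickyPartitioner;
    import org.apache.kafka.clients.producer.internals.DefaultPartitioner;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Illustration only: decide whether the producer runs its built-in logic.
    class PartitionerSelection {
        private static final Logger log = LoggerFactory.getLogger(PartitionerSelection.class);

        static boolean useBuiltInPartitioner(Partitioner configured) {
            if (configured == null)
                return true;   // partitioner.class unset: new built-in behavior
            if (configured instanceof DefaultPartitioner
                    || configured instanceof UniformStickyPartitioner) {
                // Point 3: deprecation warning nudging users to drop the setting.
                log.warn("{} is deprecated; remove partitioner.class to use the built-in partitioner",
                        configured.getClass().getSimpleName());
            }
            return false;      // an explicitly configured partitioner keeps its behavior
        }
    }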

Best,
David

On Fri, Mar 4, 2022 at 6:03 AM Artem Livshits
<al...@confluent.io.invalid> wrote:
>
> Hi Jun,
>
> 2. Removed the option from the KIP.  Now the sticky partitioning threshold
> is hardcoded to batch.size.
>
> 20. Added the corresponding wording to the KIP.
>
> -Artem
>
> On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the reply.
> >
> > 1. Sounds good.
> >
> > 2. If we don't expect users to change it, we probably could just leave out
> > the new config. In general, it's easy to add a new config, but hard to
> > remove an existing config.
> >
> > 20. The two new configs enable.adaptive.partitioning and
> > partition.availability.timeout.ms only apply to the two built-in
> > partitioners DefaultPartitioner and UniformStickyPartitioner, right? It
> > would be useful to document that in the KIP.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hi Jun,
> > >
> > > Thank you for the suggestions.
> > >
> > > 1. As we discussed offline, we can hardcode the logic for
> > > DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer
> > (i.e.
> > > the DefaultPartitioner.partition won't get called, instead KafkaProducer
> > > would check if the partitioner is an instance of DefaultPartitioner and
> > > then run the actual partitioning logic itself).  Then the change to the
> > > Partitioner wouldn't be required.  I'll update the KIP to reflect that.
> > >
> > > 2. I don't expect users to change this too often, as changing it would
> > > require a bit of studying of the production patterns.  As a general
> > > principle, if I can think of a model that requires a deviation from
> > > default, I tend to add a configuration option.  It could be that it'll
> > > never get used in practice, but I cannot prove that.  I'm ok with
> > removing
> > > the option, let me know what you think.
> > >
> > > -Artem
> > >
> > > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Artem,
> > > >
> > > > Thanks for the reply. A few more comments.
> > > >
> > > > 1. Since we control the implementation and the usage of
> > > DefaultPartitioner,
> > > > another way is to instantiate the DefaultPartitioner with a special
> > > > constructor, which allows it to have more access to internal
> > information.
> > > > Then we could just change the behavior of  DefaultPartitioner such that
> > > it
> > > > can use the internal information when choosing the partition. This seems
> > > > more intuitive than having DefaultPartitioner return -1 partition.
> > > >
> > > > 2. I guess partitioner.sticky.batch.size is introduced because the
> > > > effective batch size could be less than batch.size and we want to align
> > > > partition switching with the effective batch size. How would a user
> > know
> > > > the effective batch size to set partitioner.sticky.batch.size properly?
> > > If
> > > > the user somehow knows the effective batch size, does setting
> > batch.size
> > > to
> > > > the effective batch size achieve the same result?
> > > >
> > > > 4. Thanks for the explanation. Makes sense to me.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > 1. Updated the KIP to add a couple paragraphs about implementation
> > > > > necessities in the Proposed Changes section.
> > > > >
> > > > > 2. Sorry if my reply was confusing, what I meant to say (and I
> > > elaborated
> > > > > on that in point #3) is that there could be patterns for which 16KB
> > > > > wouldn't be the most effective setting, thus it would be good to make
> > > it
> > > > > configurable.
> > > > >
> > > > > 4. We could use broker readiness timeout.  But I'm not sure it would
> > > > > correctly model the broker load.  The problem is that latency is not
> > an
> > > > > accurate measure of throughput, we could have 2 brokers that have
> > equal
> > > > > throughput but one has higher latency (so it takes larger batches
> > less
> > > > > frequently, but still takes the same load).  Latency-based logic is
> > > > likely
> > > > > to send less data to the broker with higher latency.  Using the queue
> > > > size
> > > > > would adapt to throughput, regardless of latency (which could be
> > just a
> > > > > result of deployment topology), so that's the model chosen in the
> > > > > proposal.  The partition.availability.timeout.ms logic approaches
> > the
> > > > > model
> > > > > from a slightly different angle, say we have a requirement to deliver
> > > > > messages via brokers that have a certain latency, then
> > > > > partition.availability.timeout.ms could be used to tune that.
> > Latency
> > > > is
> > > > > a
> > > > > much more volatile metric than throughput (latency depends on
> > external
> > > > > load, on capacity, on deployment topology, on jitter in network, on
> > > > jitter
> > > > > in disk, etc.) and I think it would be best to leave latency-based
> > > > > thresholds configurable to tune for the environment.
> > > > >
> > > > > -Artem
> > > > >
> > > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao <ju...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, Artem,
> > > > > >
> > > > > > Thanks for the reply. A few more comments.
> > > > > >
> > > > > > 1. Perhaps you could elaborate a bit more on how the producer
> > > > determines
> > > > > > the partition if the partitioner returns -1. This will help
> > > understand
> > > > > why
> > > > > > encapsulating that logic as a partitioner is not clean.
> > > > > >
> > > > > > 2. I am not sure that I understand this part. If 15.5KB is more
> > > > > efficient,
> > > > > > could we just set batch.size to 15.5KB?
> > > > > >
> > > > > > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > > > > > enabling this behavior. Also, choosing partitions based on broker
> > > > > readiness
> > > > > > can be made in a smoother way. For example, we could track the last
> > > > time
> > > > > a
> > > > > > broker has drained any batches from the accumulator. We can then
> > > select
> > > > > > partitions from brokers proportionally to the inverse of that time.
> > > > This
> > > > > > seems smoother than a cutoff based on a
> > > > > partition.availability.timeout.ms
> > > > > >  threshold.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > > <al...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hello Luke, Jun,
> > > > > > >
> > > > > > > Thank you for your feedback.  I've added the Rejected Alternative
> > > > > section
> > > > > > > that may clarify some of the questions w.r.t. returning -1.
> > > > > > >
> > > > > > > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> > > > > > significant
> > > > > > > part of the logic needs to be in the producer (because it now
> > uses
> > > > > > > information about brokers that only the producer knows), so
> > > > > encapsulation
> > > > > > > of the logic within the default partitioner isn't as clean.
> >  I've
> > > > > added
> > > > > > > the Rejected Alternative section that documents an attempt to
> > keep
> > > > the
> > > > > > > encapsulation by providing new callbacks to the partitioner.
> > > > > > >
> > > > > > > 2. The meaning of the partitioner.sticky.batch.size is explained
> > in
> > > > the
> > > > > > > Uniform Sticky Batch Size section.  Basically, we track the
> > amount
> > > of
> > > > > > bytes
> > > > > > > produced to the partition and if it exceeds
> > > > > partitioner.sticky.batch.size
> > > > > > > then we switch to the next partition.  As far as the reason to
> > make
> > > > it
> > > > > > > different from batch.size, I think Luke answered this with the
> > > > question
> > > > > > #3
> > > > > > > -- what if the load pattern is such that 15.5KB would be more
> > > > efficient
> > > > > > > than 16KB?
> > > > > > >
> > > > > > > 3. I think it's hard to have one size that would fit all
> > patterns.
> > > > > E.g.
> > > > > > if
> > > > > > > the load pattern is such that there is linger and the app fills
> > the
> > > > > batch
> > > > > > > before linger expires, then having 16KB would most likely
> > > synchronize
> > > > > > > batching and partition switching, so each partition would get a
> > > full
> > > > > > > batch.  If load pattern is such that there are a few non-complete
> > > > > batches
> > > > > > > go out before a larger batch starts to fill, then it may actually
> > > be
> > > > > > > beneficial to make slightly larger (e.g. linger=0, first few
> > > records
> > > > go
> > > > > > in
> > > > > > > the first batch, then next few records go to second batch, and so
> > > on,
> > > > > > until
> > > > > > > 5 in-flight, then larger batch would form while waiting for
> > broker
> > > to
> > > > > > > respond, but the partition switch would happen before the larger
> > > > batch
> > > > > is
> > > > > > > full).
> > > > > > >
> > > > > > > 4. There are a couple of reasons for introducing
> > > > > > > partition.availability.timeout.ms.  Luke's and Jun's questions
> > are
> > > > > > slightly
> > > > > > > different, so I'm going to separate replies.
> > > > > > > (Luke) Is the queue size a good enough signal?  I think it's a
> > good
> > > > > > default
> > > > > > > signal as it tries to preserve general fairness and not overreact
> > > on
> > > > > the
> > > > > > > broker's state at each moment in time.  But because it's smooth,
> > it
> > > > may
> > > > > > not
> > > > > > > be reactive enough to instantaneous latency jumps.  For
> > > > > latency-sensitive
> > > > > > > workloads, it may be desirable to react faster when a broker
> > > becomes
> > > > > > > unresponsive (but that may make the distribution really choppy),
> > so
> > > > > > > partition.availability.timeout.ms provides an opportunity to
> > tune
> > > > > > > adaptiveness.
> > > > > > >
> > > > > > > (Jun) Can we just not assign partitions to brokers that are not
> > > > ready?
> > > > > > > Switching partitions purely based on current broker readiness
> > > > > information
> > > > > > > can really skew workload I think (or at least I couldn't build a
> > > > model
> > > > > > that
> > > > > > > proves that over time it's going to be generally fair), I feel
> > that
> > > > the
> > > > > > > algorithm should try to be fair in general and use smoother
> > signals
> > > > by
> > > > > > > default (e.g. a broker with choppier latency may get much less
> > load
> > > > > even
> > > > > > > though it can handle throughput, it then may potentially skew
> > > > > > consumption),
> > > > > > > note that the queue-size-based logic uses probabilities (so we
> > > don't
> > > > > > fully
> > > > > > > remove brokers, just make it less likely) and relative info
> > rather
> > > > > than a
> > > > > > > threshold (so if all brokers are heavily, but equally loaded,
> > they
> > > > will
> > > > > > get
> > > > > > > equal distribution, rather than get removed because they exceed
> > > some
> > > > > > > threshold).  So at the very least, I would like this logic to be
> > > > turned
> > > > > > off
> > > > > > > by default as it's hard to predict what it could do with
> > different
> > > > > > patterns
> > > > > > > (which means that there would need to be some configuration).  We
> > > > could
> > > > > > > just not use brokers that are not ready, but again, I think that
> > > it's
> > > > > > good
> > > > > > > to try to be fair under normal circumstances, so if normally
> > > brokers
> > > > > can
> > > > > > > respond under some partition.availability.timeout.ms threshold
> > and
> > > > the
> > > > > > > application works well with those latencies, then we could
> > > distribute
> > > > > > data
> > > > > > > equally between brokers that don't exceed the latencies.  The
> > > value,
> > > > of
> > > > > > > course, would depend on the environment and app requirements,
> > hence
> > > > > it's
> > > > > > > configurable.
> > > > > > >
> > > > > > > 10. Added a statement at the beginning of the proposed changes.
> > > > > > >
> > > > > > > -Artem
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <jun@confluent.io.invalid
> > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, Artem,
> > > > > > > >
> > > > > > > > Thanks for the KIP. A few comments below.
> > > > > > > >
> > > > > > > > 1. I agree with Luke that having the partitioner returning -1
> > is
> > > > kind
> > > > > > of
> > > > > > > > weird. Could we just change the implementation of
> > > > DefaultPartitioner
> > > > > to
> > > > > > > the
> > > > > > > > new behavior?
> > > > > > > >
> > > > > > > > 2. partitioner.sticky.batch.size: Similar question to Luke. I
> > am
> > > > not
> > > > > > sure
> > > > > > > > why we want to introduce this new configuration. Could we just
> > > use
> > > > > the
> > > > > > > > existing batch.size?
> > > > > > > >
> > > > > > > > 4. I also agree with Luke that it's not clear why we need
> > > > > > > > partition.availability.timeout.ms. The KIP says the broker
> > > "would
> > > > > not
> > > > > > be
> > > > > > > > chosen until the broker is able to accept the next ready batch
> > > from
> > > > > the
> > > > > > > > partition". If we are keeping track of this, could we just
> > avoid
> > > > > > > assigning
> > > > > > > > records to partitions whose leader is not able to accept the
> > next
> > > > > > batch?
> > > > > > > If
> > > > > > > > we do that, perhaps we don't need
> > > > partition.availability.timeout.ms.
> > > > > > > >
> > > > > > > > 10. Currently, partitioner.class defaults to
> > DefaultPartitioner,
> > > > > which
> > > > > > > uses
> > > > > > > > StickyPartitioner when the key is specified. Since this KIP
> > > > improves
> > > > > > upon
> > > > > > > > StickyPartitioner, it would be useful to make the new behavior
> > > the
> > > > > > > default
> > > > > > > > and document that in the KIP.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com>
> > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Artem,
> > > > > > > > >
> > > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > > As this bug KAFKA-7572 <
> > > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > > >
> > > > > > > > > mentioned, sometimes the custom partitioner would return
> > > negative
> > > > > > > > partition
> > > > > > > > > id accidentally.
> > > > > > > > > If it returned -1, how could you know if it is expected or
> > not
> > > > > > > expected?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > > Luke
> > > > > > > > >
> > > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <showuon@gmail.com
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Artem,
> > > > > > > > > >
> > > > > > > > > > Thanks for the update. I have some questions about it:
> > > > > > > > > >
> > > > > > > > > > 1. Could you explain why you need the `partitioner` return
> > > -1?
> > > > In
> > > > > > > which
> > > > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > > > 2. What does the "partitioner.sticky.batch.size" mean? In
> > the
> > > > > > > > > > "Configuration" part, you didn't explain it. And default to
> > > 0,
> > > > I
> > > > > > > guess
> > > > > > > > > it's
> > > > > > > > > > the same as current behavior for backward compatibility,
> > > right?
> > > > > You
> > > > > > > > > should
> > > > > > > > > > mention it.
> > > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > > "partitioner.sticky.batch.size". Let's say, we already
> > > > accumulate
> > > > > > > > 15.5KB
> > > > > > > > > in
> > > > > > > > > > partition1, and sent. So when next batch created, in your
> > > > current
> > > > > > > > design,
> > > > > > > > > > we still stick to partition1, until 16KB reached, and then
> > we
> > > > > > create
> > > > > > > a
> > > > > > > > > new
> > > > > > > > > > batch to change to next partition, ex: partition2. But I
> > > think
> > > > if
> > > > > > we
> > > > > > > > set
> > > > > > > > > a
> > > > > > > > > > threshold to 95% (for example), we can know previous 15.5KB
> > > > > already
> > > > > > > > > exceeds
> > > > > > > > > > the threshold so that we can directly create new batch for
> > > next
> > > > > > > > records.
> > > > > > > > > > This way should be able to make it more efficient. WDYT?
> > > > > > > > > > 4. I think the improved queuing logic should be good
> > enough.
> > > I
> > > > > > can't
> > > > > > > > get
> > > > > > > > > > the benefit of having `partition.availability.timeout.ms`
> > > > > config.
> > > > > > In
> > > > > > > > > > short, you want to make the partitioner take the broker
> > load
> > > > into
> > > > > > > > > > consideration. We can just improve that in the queuing
> > logic
> > > > (and
> > > > > > you
> > > > > > > > > > already did it). Why should we add the config? Could you
> > use
> > > > some
> > > > > > > > > examples
> > > > > > > > > > to explain why we need it.
> > > > > > > > > >
> > > > > > > > > > Thank you.
> > > > > > > > > > Luke
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > >> Hello,
> > > > > > > > > >>
> > > > > > > > > >> Please add your comments about the KIP.  If there are no
> > > > > > > > considerations,
> > > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > > >>
> > > > > > > > > >> -Artem
> > > > > > > > > >>
> > > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > > > alivshits@confluent.io
> > > > > > > > >
> > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Hello,
> > > > > > > > > >> >
> > > > > > > > > >> > After trying a few prototypes, I've made some changes to
> > > the
> > > > > > > public
> > > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > >> > .
> > > > > > > > > >> >
> > > > > > > > > >> > -Artem
> > > > > > > > > >> >
> > > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > > > > alivshits@confluent.io>
> > > > > > > > > >> > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> >> Hello,
> > > > > > > > > >> >>
> > > > > > > > > >> >> This is the discussion thread for
> > > > > > > > > >> >>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > > >> >> .
> > > > > > > > > >> >>
> > > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but
> > > it
> > > > > does
> > > > > > > > > >> include a
> > > > > > > > > >> >> client config change, therefore we have a KIP to
> > discuss.
> > > > > > > > > >> >>
> > > > > > > > > >> >> -Artem
> > > > > > > > > >> >>
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

2. Removed the option from the KIP.  Now the sticky partitioning threshold
is hardcoded to batch.size (see the sketch below).

20. Added the corresponding wording to the KIP.
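
For illustration, here is a minimal sketch of that switching rule (the class and
field names are made up for the example; the actual producer also weights the
choice of the next partition by queue size when adaptive partitioning is
enabled):

    import java.util.concurrent.ThreadLocalRandom;

    // Toy model of the built-in sticky logic: stay on one partition until roughly
    // batch.size bytes have been produced to it, then pick another partition.
    class StickyPartitionState {
        private final int batchSize;      // producer batch.size, e.g. 16384
        private final int numPartitions;
        private int currentPartition;
        private int producedBytes;        // bytes produced to currentPartition so far

        StickyPartitionState(int batchSize, int numPartitions) {
            this.batchSize = batchSize;
            this.numPartitions = numPartitions;
            this.currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
        }

        int partitionFor(int recordSizeBytes) {
            if (producedBytes >= batchSize) {
                // switch after ~batch.size bytes went to the current partition
                currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
                producedBytes = 0;
            }
            producedBytes += recordSizeBytes;
            return currentPartition;
        }
    }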

-Artem

On Thu, Mar 3, 2022 at 10:52 AM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> Thanks for the reply.
>
> 1. Sounds good.
>
> 2. If we don't expect users to change it, we probably could just leave out
> the new config. In general, it's easy to add a new config, but hard to
> remove an existing config.
>
> 20. The two new configs enable.adaptive.partitioning and
> partition.availability.timeout.ms only apply to the two built-in
> partitioners DefaultPartitioner and UniformStickyPartitioner, right? It
> would be useful to document that in the KIP.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hi Jun,
> >
> > Thank you for the suggestions.
> >
> > 1. As we discussed offline, we can hardcode the logic for
> > DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer
> (i.e.
> > the DefaultPartitioner.partition won't get called, instead KafkaProducer
> > would check if the partitioner is an instance of DefaultPartitioner and
> > then run the actual partitioning logic itself).  Then the change to the
> > Partitioner wouldn't be required.  I'll update the KIP to reflect that.
> >
> > 2. I don't expect users to change this too often, as changing it would
> > require a bit of studying of the production patterns.  As a general
> > principle, if I can think of a model that requires a deviation from
> > default, I tend to add a configuration option.  It could be that it'll
> > never get used in practice, but I cannot prove that.  I'm ok with
> removing
> > the option, let me know what you think.
> >
> > -Artem
> >
> > On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply. A few more comments.
> > >
> > > 1. Since we control the implementation and the usage of
> > DefaultPartitioner,
> > > another way is to instantiate the DefaultPartitioner with a special
> > > constructor, which allows it to have more access to internal
> information.
> > > Then we could just change the behavior of  DefaultPartitioner such that
> > it
> > > can use the internal information when choosing the partition. This seems
> > > more intuitive than having DefaultPartitioner return -1 partition.
> > >
> > > 2. I guess partitioner.sticky.batch.size is introduced because the
> > > effective batch size could be less than batch.size and we want to align
> > > partition switching with the effective batch size. How would a user
> know
> > > the effective batch size to set partitioner.sticky.batch.size properly?
> > If
> > > the user somehow knows the effective batch size, does setting
> batch.size
> > to
> > > the effective batch size achieve the same result?
> > >
> > > 4. Thanks for the explanation. Makes sense to me.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > 1. Updated the KIP to add a couple paragraphs about implementation
> > > > necessities in the Proposed Changes section.
> > > >
> > > > 2. Sorry if my reply was confusing, what I meant to say (and I
> > elaborated
> > > > on that in point #3) is that there could be patterns for which 16KB
> > > > wouldn't be the most effective setting, thus it would be good to make
> > it
> > > > configurable.
> > > >
> > > > 4. We could use broker readiness timeout.  But I'm not sure it would
> > > > correctly model the broker load.  The problem is that latency is not
> an
> > > > accurate measure of throughput, we could have 2 brokers that have
> equal
> > > > throughput but one has higher latency (so it takes larger batches
> less
> > > > frequently, but still takes the same load).  Latency-based logic is
> > > likely
> > > > to send less data to the broker with higher latency.  Using the queue
> > > size
> > > > would adapt to throughput, regardless of latency (which could be
> just a
> > > > result of deployment topology), so that's the model chosen in the
> > > > proposal.  The partition.availability.timeout.ms logic approaches
> the
> > > > model
> > > > from a slightly different angle, say we have a requirement to deliver
> > > > messages via brokers that have a certain latency, then
> > > > partition.availability.timeout.ms could be used to tune that.
> Latency
> > > is
> > > > a
> > > > much more volatile metric than throughput (latency depends on
> external
> > > > load, on capacity, on deployment topology, on jitter in network, on
> > > jitter
> > > > in disk, etc.) and I think it would be best to leave latency-based
> > > > thresholds configurable to tune for the environment.
> > > >
> > > > -Artem
> > > >
> > > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the reply. A few more comments.
> > > > >
> > > > > 1. Perhaps you could elaborate a bit more on how the producer
> > > determines
> > > > > the partition if the partitioner returns -1. This will help
> > understand
> > > > why
> > > > > encapsulating that logic as a partitioner is not clean.
> > > > >
> > > > > 2. I am not sure that I understand this part. If 15.5KB is more
> > > > efficient,
> > > > > could we just set batch.size to 15.5KB?
> > > > >
> > > > > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > > > > enabling this behavior. Also, choosing partitions based on broker
> > > > readiness
> > > > > can be made in a smoother way. For example, we could track the last
> > > time
> > > > a
> > > > > broker has drained any batches from the accumulator. We can then
> > select
> > > > > partitions from brokers proportionally to the inverse of that time.
> > > This
> > > > > seems smoother than a cutoff based on a
> > > > partition.availability.timeout.ms
> > > > >  threshold.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hello Luke, Jun,
> > > > > >
> > > > > > Thank you for your feedback.  I've added the Rejected Alternative
> > > > section
> > > > > > that may clarify some of the questions w.r.t. returning -1.
> > > > > >
> > > > > > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> > > > > significant
> > > > > > part of the logic needs to be in the producer (because it now
> uses
> > > > > > information about brokers that only the producer knows), so
> > > > encapsulation
> > > > > > of the logic within the default partitioner isn't as clean.
>  I've
> > > > added
> > > > > > the Rejected Alternative section that documents an attempt to
> keep
> > > the
> > > > > > encapsulation by providing new callbacks to the partitioner.
> > > > > >
> > > > > > 2. The meaning of the partitioner.sticky.batch.size is explained
> in
> > > the
> > > > > > Uniform Sticky Batch Size section.  Basically, we track the
> amount
> > of
> > > > > bytes
> > > > > > produced to the partition and if it exceeds
> > > > partitioner.sticky.batch.size
> > > > > > then we switch to the next partition.  As far as the reason to
> make
> > > it
> > > > > > different from batch.size, I think Luke answered this with the
> > > question
> > > > > #3
> > > > > > -- what if the load pattern is such that 15.5KB would be more
> > > efficient
> > > > > > than 16KB?
> > > > > >
> > > > > > 3. I think it's hard to have one size that would fit all
> patterns.
> > > > E.g.
> > > > > if
> > > > > > the load pattern is such that there is linger and the app fills
> the
> > > > batch
> > > > > > before linger expires, then having 16KB would most likely
> > synchronize
> > > > > > batching and partition switching, so each partition would get a
> > full
> > > > > > batch.  If load pattern is such that there are a few non-complete
> > > > batches
> > > > > > go out before a larger batch starts to fill, then it may actually
> > be
> > > > > > beneficial to make slightly larger (e.g. linger=0, first few
> > records
> > > go
> > > > > in
> > > > > > the first batch, then next few records go to second batch, and so
> > on,
> > > > > until
> > > > > > 5 in-flight, then larger batch would form while waiting for
> broker
> > to
> > > > > > respond, but the partition switch would happen before the larger
> > > batch
> > > > is
> > > > > > full).
> > > > > >
> > > > > > 4. There are a couple of reasons for introducing
> > > > > > partition.availability.timeout.ms.  Luke's and Jun's questions
> are
> > > > > slightly
> > > > > > different, so I'm going to separate replies.
> > > > > > (Luke) Is the queue size a good enough signal?  I think it's a
> good
> > > > > default
> > > > > > signal as it tries to preserve general fairness and not overreact
> > on
> > > > the
> > > > > > broker's state at each moment in time.  But because it's smooth,
> it
> > > may
> > > > > not
> > > > > > be reactive enough to instantaneous latency jumps.  For
> > > > latency-sensitive
> > > > > > workloads, it may be desirable to react faster when a broker
> > becomes
> > > > > > unresponsive (but that may make the distribution really choppy),
> so
> > > > > > partition.availability.timeout.ms provides an opportunity to
> tune
> > > > > > adaptiveness.
> > > > > >
> > > > > > (Jun) Can we just not assign partitions to brokers that are not
> > > ready?
> > > > > > Switching partitions purely based on current broker readiness
> > > > information
> > > > > > can really skew workload I think (or at least I couldn't build a
> > > model
> > > > > that
> > > > > > proves that over time it's going to be generally fair), I feel
> that
> > > the
> > > > > > algorithm should try to be fair in general and use smoother
> signals
> > > by
> > > > > > default (e.g. a broker with choppier latency may get much less
> load
> > > > even
> > > > > > though it can handle throughput, it then may potentially skew
> > > > > consumption),
> > > > > > note that the queue-size-based logic uses probabilities (so we
> > don't
> > > > > fully
> > > > > > remove brokers, just make it less likely) and relative info
> rather
> > > > than a
> > > > > > threshold (so if all brokers are heavily, but equally loaded,
> they
> > > will
> > > > > get
> > > > > > equal distribution, rather than get removed because they exceed
> > some
> > > > > > threshold).  So at the very least, I would like this logic to be
> > > turned
> > > > > off
> > > > > > by default as it's hard to predict what it could do with
> different
> > > > > patterns
> > > > > > (which means that there would need to be some configuration).  We
> > > could
> > > > > > just not use brokers that are not ready, but again, I think that
> > it's
> > > > > good
> > > > > > to try to be fair under normal circumstances, so if normally
> > brokers
> > > > can
> > > > > > respond under some partition.availability.timeout.ms threshold
> and
> > > the
> > > > > > application works well with those latencies, then we could
> > distribute
> > > > > data
> > > > > > equally between brokers that don't exceed the latencies.  The
> > value,
> > > of
> > > > > > course, would depend on the environment and app requirements,
> hence
> > > > it's
> > > > > > configurable.
> > > > > >
> > > > > > 10. Added a statement at the beginning of the proposed changes.
> > > > > >
> > > > > > -Artem
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <jun@confluent.io.invalid
> >
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Artem,
> > > > > > >
> > > > > > > Thanks for the KIP. A few comments below.
> > > > > > >
> > > > > > > 1. I agree with Luke that having the partitioner returning -1
> is
> > > kind
> > > > > of
> > > > > > > weird. Could we just change the implementation of
> > > DefaultPartitioner
> > > > to
> > > > > > the
> > > > > > > new behavior?
> > > > > > >
> > > > > > > 2. partitioner.sticky.batch.size: Similar question to Luke. I
> am
> > > not
> > > > > sure
> > > > > > > why we want to introduce this new configuration. Could we just
> > use
> > > > the
> > > > > > > existing batch.size?
> > > > > > >
> > > > > > > 4. I also agree with Luke that it's not clear why we need
> > > > > > > partition.availability.timeout.ms. The KIP says the broker
> > "would
> > > > not
> > > > > be
> > > > > > > chosen until the broker is able to accept the next ready batch
> > from
> > > > the
> > > > > > > partition". If we are keeping track of this, could we just
> avoid
> > > > > > assigning
> > > > > > > records to partitions whose leader is not able to accept the
> next
> > > > > batch?
> > > > > > If
> > > > > > > we do that, perhaps we don't need
> > > partition.availability.timeout.ms.
> > > > > > >
> > > > > > > 10. Currently, partitioner.class defaults to
> DefaultPartitioner,
> > > > which
> > > > > > uses
> > > > > > > StickyPartitioner when the key is specified. Since this KIP
> > > improves
> > > > > upon
> > > > > > > StickyPartitioner, it would be useful to make the new behavior
> > the
> > > > > > default
> > > > > > > and document that in the KIP.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Hi Artem,
> > > > > > > >
> > > > > > > > Also, one more thing I think you need to know.
> > > > > > > > As this bug KAFKA-7572 <
> > > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > > >
> > > > > > > > mentioned, sometimes the custom partitioner would return
> > negative
> > > > > > > partition
> > > > > > > > id accidentally.
> > > > > > > > If it returned -1, how could you know if it is expected or
> not
> > > > > > expected?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > Luke
> > > > > > > >
> > > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <showuon@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Artem,
> > > > > > > > >
> > > > > > > > > Thanks for the update. I have some questions about it:
> > > > > > > > >
> > > > > > > > > 1. Could you explain why you need the `partitioner` return
> > -1?
> > > In
> > > > > > which
> > > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > > 2. What does the "partitioner.sticky.batch.size" mean? In
> the
> > > > > > > > > "Configuration" part, you didn't explain it. And default to
> > 0,
> > > I
> > > > > > guess
> > > > > > > > it's
> > > > > > > > > the same as current behavior for backward compatibility,
> > right?
> > > > You
> > > > > > > > should
> > > > > > > > > mention it.
> > > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > > "partitioner.sticky.batch.size". Let's say, we already
> > > accumulate
> > > > > > > 15.5KB
> > > > > > > > in
> > > > > > > > > partition1, and sent. So when next batch created, in your
> > > current
> > > > > > > design,
> > > > > > > > > we still stick to partition1, until 16KB reached, and then
> we
> > > > > create
> > > > > > a
> > > > > > > > new
> > > > > > > > > batch to change to next partition, ex: partition2. But I
> > think
> > > if
> > > > > we
> > > > > > > set
> > > > > > > > a
> > > > > > > > > threshold to 95% (for example), we can know previous 15.5KB
> > > > already
> > > > > > > > exceeds
> > > > > > > > > the threshold so that we can directly create new batch for
> > next
> > > > > > > records.
> > > > > > > > > This way should be able to make it more efficient. WDYT?
> > > > > > > > > 4. I think the improved queuing logic should be good
> enough.
> > I
> > > > > can't
> > > > > > > get
> > > > > > > > > the benefit of having `partition.availability.timeout.ms`
> > > > config.
> > > > > In
> > > > > > > > > short, you want to make the partitioner take the broker
> load
> > > into
> > > > > > > > > consideration. We can just improve that in the queuing
> logic
> > > (and
> > > > > you
> > > > > > > > > already did it). Why should we add the config? Could you
> use
> > > some
> > > > > > > > examples
> > > > > > > > > to explain why we need it.
> > > > > > > > >
> > > > > > > > > Thank you.
> > > > > > > > > Luke
> > > > > > > > >
> > > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > >> Hello,
> > > > > > > > >>
> > > > > > > > >> Please add your comments about the KIP.  If there are no
> > > > > > > considerations,
> > > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > > >>
> > > > > > > > >> -Artem
> > > > > > > > >>
> > > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > > alivshits@confluent.io
> > > > > > > >
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hello,
> > > > > > > > >> >
> > > > > > > > >> > After trying a few prototypes, I've made some changes to
> > the
> > > > > > public
> > > > > > > > >> > interface.  Please see the updated document
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > >> > .
> > > > > > > > >> >
> > > > > > > > >> > -Artem
> > > > > > > > >> >
> > > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > > > alivshits@confluent.io>
> > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > >> >> Hello,
> > > > > > > > >> >>
> > > > > > > > >> >> This is the discussion thread for
> > > > > > > > >> >>
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > > >> >> .
> > > > > > > > >> >>
> > > > > > > > >> >> The proposal is a bug fix for
> > > > > > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but
> > it
> > > > does
> > > > > > > > >> include a
> > > > > > > > >> >> client config change, therefore we have a KIP to
> discuss.
> > > > > > > > >> >>
> > > > > > > > >> >> -Artem
> > > > > > > > >> >>
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

Thanks for the reply.

1. Sounds good.

2. If we don't expect users to change it, we probably could just leave out
the new config. In general, it's easy to add a new config, but hard to
remove an existing config.

20. The two new configs enable.adaptive.partitioning and
partition.availability.timeout.ms only apply to the two built-in
partitioners DefaultPartitioner and UniformStickyPartitioner, right? It
would be useful to document that in the KIP.
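
For example (config keys spelled as in the KIP discussion; the timeout value and
the surrounding class are only illustrative), a producer that relies on the
built-in partitioner might set them like this:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    class AdaptivePartitioningConfigExample {
        static KafkaProducer<byte[], byte[]> newProducer() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            // No partitioner.class set: the built-in partitioner is used, so the
            // two new configs below take effect.
            props.put("enable.adaptive.partitioning", "true");
            props.put("partition.availability.timeout.ms", "200");
            return new KafkaProducer<>(props);
        }
    }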

Thanks,

Jun

On Thu, Mar 3, 2022 at 9:47 AM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hi Jun,
>
> Thank you for the suggestions.
>
> 1. As we discussed offline, we can hardcode the logic for
> DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer (i.e.
> the DefaultPartitioner.partition won't get called, instead KafkaProducer
> would check if the partitioner is an instance of DefaultPartitioner and
> then run the actual partitioning logic itself).  Then the change to the
> Partitioner wouldn't be required.  I'll update the KIP to reflect that.
>
> 2. I don't expect users to change this too often, as changing it would
> require a bit of studying of the production patterns.  As a general
> principle, if I can think of a model that requires a deviation from
> default, I tend to add a configuration option.  It could be that it'll
> never get used in practice, but I cannot prove that.  I'm ok with removing
> the option, let me know what you think.
>
> -Artem
>
> On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the reply. A few more comments.
> >
> > 1. Since we control the implementation and the usage of
> DefaultPartitioner,
> > another way is to instantiate the DefaultPartitioner with a special
> > constructor, which allows it to have more access to internal information.
> > Then we could just change the behavior of  DefaultPartitioner such that
> it
> > can use the internal infoamtion when choosing the partition. This seems
> > more intuitive than having DefaultPartitioner return -1 partition.
> >
> > 2. I guess partitioner.sticky.batch.size is introduced because the
> > effective batch size could be less than batch.size and we want to align
> > partition switching with the effective batch size. How would a user know
> > the effective batch size to set partitioner.sticky.batch.size properly?
> If
> > the user somehow knows the effective batch size, does setting batch.size
> to
> > the effective batch size achieve the same result?
> >
> > 4. Thanks for the explanation. Makes sense to me.
> >
> > Thanks,
> >
> > Jun
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hi Jun,
> > >
> > > 1. Updated the KIP to add a couple paragraphs about implementation
> > > necessities in the Proposed Changes section.
> > >
> > > 2. Sorry if my reply was confusing, what I meant to say (and I
> elaborated
> > > on that in point #3) is that there could be patterns for which 16KB
> > > wouldn't be the most effective setting, thus it would be good to make
> it
> > > configurable.
> > >
> > > 4. We could use broker readiness timeout.  But I'm not sure it would
> > > correctly model the broker load.  The problem is that latency is not an
> > > accurate measure of throughput, we could have 2 brokers that have equal
> > > throughput but one has higher latency (so it takes larger batches less
> > > frequently, but still takes the same load).  Latency-based logic is
> > likely
> > > to send less data to the broker with higher latency.  Using the queue
> > size
> > > would adapt to throughput, regardless of latency (which could be just a
> > > result of deployment topology), so that's the model chosen in the
> > > proposal.  The partition.availability.timeout.ms logic approaches the
> > > model
> > > from a slightly different angle, say we have a requirement to deliver
> > > messages via brokers that have a certain latency, then
> > > partition.availability.timeout.ms could be used to tune that.  Latency
> > is
> > > a
> > > much more volatile metric than throughput (latency depends on external
> > > load, on capacity, on deployment topology, on jitter in network, on
> > jitter
> > > in disk, etc.) and I think it would be best to leave latency-based
> > > thresholds configurable to tune for the environment.
> > >
> > > -Artem
> > >
> > > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Artem,
> > > >
> > > > Thanks for the reply. A few more comments.
> > > >
> > > > 1. Perhaps you could elaborate a bit more on how the producer
> > determines
> > > > the partition if the partitioner returns -1. This will help
> understand
> > > why
> > > > encapsulating that logic as a partitioner is not clean.
> > > >
> > > > 2. I am not sure that I understand this part. If 15.5KB is more
> > > efficient,
> > > > could we just set batch.size to 15.5KB?
> > > >
> > > > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > > > enabling this behavior. Also, choosing partitions based on broker
> > > readiness
> > > > can be made in a smoother way. For example, we could track the last
> > time
> > > a
> > > > broker has drained any batches from the accumulator. We can then
> select
> > > > partitions from brokers proportionally to the inverse of that time.
> > This
> > > > seems smoother than a cutoff based on a
> > > partition.availability.timeout.ms
> > > >  threshold.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > > > Hello Luke, Jun,
> > > > >
> > > > > Thank you for your feedback.  I've added the Rejected Alternative
> > > section
> > > > > that may clarify some of the questions w.r.t. returning -1.
> > > > >
> > > > > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> > > > significant
> > > > > part of the logic needs to be in the producer (because it now uses
> > > > > information about brokers that only the producer knows), so
> > > encapsulation
> > > > > of the logic within the default partitioner isn't as clean.   I've
> > > added
> > > > > the Rejected Alternative section that documents an attempt to keep
> > the
> > > > > encapsulation by providing new callbacks to the partitioner.
> > > > >
> > > > > 2. The meaning of the partitioner.sticky.batch.size is explained in
> > the
> > > > > Uniform Sticky Batch Size section.  Basically, we track the amount
> of
> > > > bytes
> > > > > produced to the partition and if it exceeds
> > > partitioner.sticky.batch.size
> > > > > then we switch to the next partition.  As far as the reason to make
> > it
> > > > > different from batch.size, I think Luke answered this with the
> > question
> > > > #3
> > > > > -- what if the load pattern is such that 15.5KB would be more
> > efficient
> > > > > than 16KB?
> > > > >
> > > > > 3. I think it's hard to have one size that would fit all patterns.
> > > E.g.
> > > > if
> > > > > the load pattern is such that there is linger and the app fills the
> > > batch
> > > > > before linger expires, then having 16KB would most likely
> synchronize
> > > > > batching and partition switching, so each partition would get a
> full
> > > > > batch.  If load pattern is such that there are a few non-complete
> > > batches
> > > > > go out before a larger batch starts to fill, then it may actually
> be
> > > > > beneficial to make slightly larger (e.g. linger=0, first few
> records
> > go
> > > > in
> > > > > the first batch, then next few records go to second batch, and so
> on,
> > > > until
> > > > > 5 in-flight, then larger batch would form while waiting for broker
> to
> > > > > respond, but the partition switch would happen before the larger
> > batch
> > > is
> > > > > full).
> > > > >
> > > > > 4. There are a couple of reasons for introducing
> > > > > partition.availability.timeout.ms.  Luke's an Jun's questions are
> > > > slightly
> > > > > different, so I'm going to separate replies.
> > > > > (Luke) Is the queue size a good enough signal?  I think it's a good
> > > > default
> > > > > signal as it tries to preserve general fairness and not overreact
> on
> > > the
> > > > > broker's state at each moment in time.  But because it's smooth, it
> > may
> > > > not
> > > > > be reactive enough to instantaneous latency jumps.  For
> > > latency-sensitive
> > > > > workloads, it may be desirable to react faster when a broker
> becomes
> > > > > unresponsive (but that may make the distribution really choppy), so
> > > > > partition.availability.timeout.ms provides an opportunity to tune
> > > > > adaptiveness.
> > > > >
> > > > > (Jun) Can we just not assign partitions to brokers that are not
> > ready?
> > > > > Switching partitions purely based on current broker readiness
> > > information
> > > > > can really skew workload I think (or at least I couldn't build a
> > model
> > > > that
> > > > > proves that over time it's going to be generally fair), I feel that
> > the
> > > > > algorithm should try to be fair in general and use smoother signals
> > by
> > > > > default (e.g. a broker with choppier latency may get much less load
> > > even
> > > > > though it can handle throughput, it then may potentially skew
> > > > consumption),
> > > > > note that the queue-size-based logic uses probabilities (so we
> don't
> > > > fully
> > > > > remove brokers, just make it less likely) and relative info rather
> > > than a
> > > > > threshold (so if all brokers are heavily, but equally loaded, they
> > will
> > > > get
> > > > > equal distribution, rather than get removed because they exceed
> some
> > > > > threshold).  So at the very least, I would like this logic to be
> > turned
> > > > off
> > > > > by default as it's hard to predict what it could do with different
> > > > patterns
> > > > > (which means that there would need to be some configuration).  We
> > could
> > > > > just not use brokers that are not ready, but again, I think that
> it's
> > > > good
> > > > > to try to be fair under normal circumstances, so if normally
> brokers
> > > can
> > > > > respond under some partition.availability.timeout.ms threshold and
> > the
> > > > > application works well with those latencies, then we could
> distribute
> > > > data
> > > > > equally between brokers that don't exceed the latencies.  The
> value,
> > of
> > > > > course, would depend on the environment and app requirements, hence
> > > it's
> > > > > configurable.
> > > > >
> > > > > 10. Added a statement at the beginning of the proposed changes.
> > > > >
> > > > > -Artem
> > > > >
> > > > >
> > > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <ju...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, Artem,
> > > > > >
> > > > > > Thanks for the KIP. A few comments below.
> > > > > >
> > > > > > 1. I agree with Luke that having the partitioner returning -1 is
> > kind
> > > > of
> > > > > > weird. Could we just change the implementation of
> > DefaultPartitioner
> > > to
> > > > > the
> > > > > > new behavior?
> > > > > >
> > > > > > 2. partitioner.sticky.batch.size: Similar question to Luke. I am
> > not
> > > > sure
> > > > > > why we want to introduce this new configuration. Could we just
> use
> > > the
> > > > > > existing batch.size?
> > > > > >
> > > > > > 4. I also agree with Luke that it's not clear why we need
> > > > > > partition.availability.timeout.ms. The KIP says the broker
> "would
> > > not
> > > > be
> > > > > > chosen until the broker is able to accept the next ready batch
> from
> > > the
> > > > > > partition". If we are keeping track of this, could we just avoid
> > > > > assigning
> > > > > > records to partitions whose leader is not able to accept the next
> > > > batch?
> > > > > If
> > > > > > we do that, perhaps we don't need
> > partition.availability.timeout.ms.
> > > > > >
> > > > > > 10. Currently, partitioner.class defaults to DefaultPartitioner,
> > > which
> > > > > uses
> > > > > > StickyPartitioner when the key is specified. Since this KIP
> > improves
> > > > upon
> > > > > > StickyPartitioner, it would be useful to make the new behavior
> the
> > > > > default
> > > > > > and document that in the KIP.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > >
> > > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Also, one more thing I think you need to know.
> > > > > > > As this bug KAFKA-7572 <
> > > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > > >
> > > > > > > mentioned, sometimes the custom partitioner would return
> negative
> > > > > > partition
> > > > > > > id accidentally.
> > > > > > > If it returned -1, how could you know if it is expected or not
> > > > > expected?
> > > > > > >
> > > > > > > Thanks.
> > > > > > > Luke
> > > > > > >
> > > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Hi Artem,
> > > > > > > >
> > > > > > > > Thanks for the update. I have some questions about it:
> > > > > > > >
> > > > > > > > 1. Could you explain why you need the `partitioner` return
> -1?
> > In
> > > > > which
> > > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > > > > > > "Configuration" part, you didn't explain it. And default to
> 0,
> > I
> > > > > guess
> > > > > > > it's
> > > > > > > > the same as current behavior for backward compatibility,
> right?
> > > You
> > > > > > > should
> > > > > > > > mention it.
> > > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > > "partitioner.sticky.batch.size". Let's say, we already
> > accumulate
> > > > > > 15.5KB
> > > > > > > in
> > > > > > > > partition1, and sent. So when next batch created, in your
> > current
> > > > > > design,
> > > > > > > > we still stick to partition1, until 16KB reached, and then we
> > > > create
> > > > > a
> > > > > > > new
> > > > > > > > batch to change to next partition, ex: partition2. But I
> think
> > if
> > > > we
> > > > > > set
> > > > > > > a
> > > > > > > > threshold to 95% (for example), we can know previous 15.5KB
> > > already
> > > > > > > exceeds
> > > > > > > > the threshold so that we can directly create new batch for
> next
> > > > > > records.
> > > > > > > > This way should be able to make it more efficient. WDYT?
> > > > > > > > 4. I think the improved queuing logic should be good enough.
> I
> > > > can't
> > > > > > get
> > > > > > > > the benefit of having `partition.availability.timeout.ms`
> > > config.
> > > > In
> > > > > > > > short, you want to make the partitioner take the broker load
> > into
> > > > > > > > consideration. We can just improve that in the queuing logic
> > (and
> > > > you
> > > > > > > > already did it). Why should we add the config? Could you use
> > some
> > > > > > > examples
> > > > > > > > to explain why we need it.
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > > Luke
> > > > > > > >
> > > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > >> Hello,
> > > > > > > >>
> > > > > > > >> Please add your comments about the KIP.  If there are no
> > > > > > considerations,
> > > > > > > >> I'll put it up for vote in the next few days.
> > > > > > > >>
> > > > > > > >> -Artem
> > > > > > > >>
> > > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > > alivshits@confluent.io
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hello,
> > > > > > > >> >
> > > > > > > >> > After trying a few prototypes, I've made some changes to
> the
> > > > > public
> > > > > > > >> > interface.  Please see the updated document
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > >> > .
> > > > > > > >> >
> > > > > > > >> > -Artem
> > > > > > > >> >
> > > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > > alivshits@confluent.io>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> >> Hello,
> > > > > > > >> >>
> > > > > > > >> >> This is the discussion thread for
> > > > > > > >> >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > > >> >> .
> > > > > > > >> >>
> > > > > > > >> >> The proposal is a bug fix for
> > > > > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but
> it
> > > does
> > > > > > > >> include a
> > > > > > > >> >> client config change, therefore we have a KIP to discuss.
> > > > > > > >> >>
> > > > > > > >> >> -Artem
> > > > > > > >> >>
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

Thank you for the suggestions.

1. As we discussed offline, we can hardcode the logic for
DefaultPartitioner and UniformStickyPartitioner in the KafkaProducer (i.e.
the DefaultPartitioner.partition won't get called; instead, KafkaProducer
would check if the partitioner is an instance of DefaultPartitioner and
then run the actual partitioning logic itself).  Then the change to the
Partitioner wouldn't be required.  I'll update the KIP to reflect that.
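
To make that concrete, here is a rough sketch of the kind of
special-casing the producer could do.  The resolvePartition method and
the BuiltInPartitioner helper are made-up names for illustration, not the
actual implementation; in the real producer the serialized key/value
bytes would be passed to the plug-in path.

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.UniformStickyPartitioner;
import org.apache.kafka.clients.producer.internals.DefaultPartitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical helper that would hold the new producer-internal partitioning logic.
interface BuiltInPartitioner {
    int choosePartition(String topic, Cluster cluster);
}

// Fragment of what KafkaProducer could do before appending to the accumulator.
static int resolvePartition(ProducerRecord<byte[], byte[]> record,
                            Partitioner partitioner,
                            BuiltInPartitioner builtIn,
                            Cluster cluster) {
    if (record.partition() != null)
        return record.partition();                  // explicit partition always wins
    if (partitioner instanceof DefaultPartitioner
            || partitioner instanceof UniformStickyPartitioner) {
        // Skip Partitioner.partition(): the built-in logic can see producer-internal
        // state (per-partition queue sizes, broker readiness) that a plug-in cannot.
        return builtIn.choosePartition(record.topic(), cluster);
    }
    // Custom partitioners keep the existing code path.
    return partitioner.partition(record.topic(), record.key(), record.key(),
                                 record.value(), record.value(), cluster);
}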

2. I don't expect users to change this often, as changing it would
require some study of the production patterns.  As a general
principle, if I can think of a model that requires a deviation from
the default, I tend to add a configuration option.  It could be that it'll
never get used in practice, but I cannot prove that.  I'm ok with removing
the option; let me know what you think.

-Artem

On Mon, Feb 28, 2022 at 2:06 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> Thanks for the reply. A few more comments.
>
> 1. Since we control the implementation and the usage of DefaultPartitioner,
> another way is to instantiate the DefaultPartitioner with a special
> constructor, which allows it to have more access to internal information.
> Then we could just change the behavior of  DefaultPartitioner such that it
> can use the internal infoamtion when choosing the partition. This seems
> more intuitive than having DefaultPartitioner return -1 partition.
>
> 2. I guess partitioner.sticky.batch.size is introduced because the
> effective batch size could be less than batch.size and we want to align
> partition switching with the effective batch size. How would a user know
> the effective batch size to set partitioner.sticky.batch.size properly? If
> the user somehow knows the effective batch size, does setting batch.size to
> the effective batch size achieve the same result?
>
> 4. Thanks for the explanation. Makes sense to me.
>
> Thanks,
>
> Jun
>
> Thanks,
>
> Jun
>
> On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hi Jun,
> >
> > 1. Updated the KIP to add a couple paragraphs about implementation
> > necessities in the Proposed Changes section.
> >
> > 2. Sorry if my reply was confusing, what I meant to say (and I elaborated
> > on that in point #3) is that there could be patterns for which 16KB
> > wouldn't be the most effective setting, thus it would be good to make it
> > configurable.
> >
> > 4. We could use broker readiness timeout.  But I'm not sure it would
> > correctly model the broker load.  The problem is that latency is not an
> > accurate measure of throughput, we could have 2 brokers that have equal
> > throughput but one has higher latency (so it takes larger batches less
> > frequently, but still takes the same load).  Latency-based logic is
> likely
> > to send less data to the broker with higher latency.  Using the queue
> size
> > would adapt to throughput, regardless of latency (which could be just a
> > result of deployment topology), so that's the model chosen in the
> > proposal.  The partition.availability.timeout.ms logic approaches the
> > model
> > from a slightly different angle, say we have a requirement to deliver
> > messages via brokers that have a certain latency, then
> > partition.availability.timeout.ms could be used to tune that.  Latency
> is
> > a
> > much more volatile metric than throughput (latency depends on external
> > load, on capacity, on deployment topology, on jitter in network, on
> jitter
> > in disk, etc.) and I think it would be best to leave latency-based
> > thresholds configurable to tune for the environment.
> >
> > -Artem
> >
> > On Wed, Feb 23, 2022 at 11:14 AM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the reply. A few more comments.
> > >
> > > 1. Perhaps you could elaborate a bit more on how the producer
> determines
> > > the partition if the partitioner returns -1. This will help understand
> > why
> > > encapsulating that logic as a partitioner is not clean.
> > >
> > > 2. I am not sure that I understand this part. If 15.5KB is more
> > efficient,
> > > could we just set batch.size to 15.5KB?
> > >
> > > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > > enabling this behavior. Also, choosing partitions based on broker
> > readiness
> > > can be made in a smoother way. For example, we could track the last
> time
> > a
> > > broker has drained any batches from the accumulator. We can then select
> > > partitions from brokers proportionally to the inverse of that time.
> This
> > > seems smoother than a cutoff based on a
> > partition.availability.timeout.ms
> > >  threshold.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > > > Hello Luke, Jun,
> > > >
> > > > Thank you for your feedback.  I've added the Rejected Alternative
> > section
> > > > that may clarify some of the questions w.r.t. returning -1.
> > > >
> > > > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> > > significant
> > > > part of the logic needs to be in the producer (because it now uses
> > > > information about brokers that only the producer knows), so
> > encapsulation
> > > > of the logic within the default partitioner isn't as clean.   I've
> > added
> > > > the Rejected Alternative section that documents an attempt to keep
> the
> > > > encapsulation by providing new callbacks to the partitioner.
> > > >
> > > > 2. The meaning of the partitioner.sticky.batch.size is explained in
> the
> > > > Uniform Sticky Batch Size section.  Basically, we track the amount of
> > > bytes
> > > > produced to the partition and if it exceeds
> > partitioner.sticky.batch.size
> > > > then we switch to the next partition.  As far as the reason to make
> it
> > > > different from batch.size, I think Luke answered this with the
> question
> > > #3
> > > > -- what if the load pattern is such that 15.5KB would be more
> efficient
> > > > than 16KB?
> > > >
> > > > 3. I think it's hard to have one size that would fit all patterns.
> > E.g.
> > > if
> > > > the load pattern is such that there is linger and the app fills the
> > batch
> > > > before linger expires, then having 16KB would most likely synchronize
> > > > batching and partition switching, so each partition would get a full
> > > > batch.  If load pattern is such that there are a few non-complete
> > batches
> > > > go out before a larger batch starts to fill, then it may actually be
> > > > beneficial to make slightly larger (e.g. linger=0, first few records
> go
> > > in
> > > > the first batch, then next few records go to second batch, and so on,
> > > until
> > > > 5 in-flight, then larger batch would form while waiting for broker to
> > > > respond, but the partition switch would happen before the larger
> batch
> > is
> > > > full).
> > > >
> > > > 4. There are a couple of reasons for introducing
> > > > partition.availability.timeout.ms.  Luke's an Jun's questions are
> > > slightly
> > > > different, so I'm going to separate replies.
> > > > (Luke) Is the queue size a good enough signal?  I think it's a good
> > > default
> > > > signal as it tries to preserve general fairness and not overreact on
> > the
> > > > broker's state at each moment in time.  But because it's smooth, it
> may
> > > not
> > > > be reactive enough to instantaneous latency jumps.  For
> > latency-sensitive
> > > > workloads, it may be desirable to react faster when a broker becomes
> > > > unresponsive (but that may make the distribution really choppy), so
> > > > partition.availability.timeout.ms provides an opportunity to tune
> > > > adaptiveness.
> > > >
> > > > (Jun) Can we just not assign partitions to brokers that are not
> ready?
> > > > Switching partitions purely based on current broker readiness
> > information
> > > > can really skew workload I think (or at least I couldn't build a
> model
> > > that
> > > > proves that over time it's going to be generally fair), I feel that
> the
> > > > algorithm should try to be fair in general and use smoother signals
> by
> > > > default (e.g. a broker with choppier latency may get much less load
> > even
> > > > though it can handle throughput, it then may potentially skew
> > > consumption),
> > > > note that the queue-size-based logic uses probabilities (so we don't
> > > fully
> > > > remove brokers, just make it less likely) and relative info rather
> > than a
> > > > threshold (so if all brokers are heavily, but equally loaded, they
> will
> > > get
> > > > equal distribution, rather than get removed because they exceed some
> > > > threshold).  So at the very least, I would like this logic to be
> turned
> > > off
> > > > by default as it's hard to predict what it could do with different
> > > patterns
> > > > (which means that there would need to be some configuration).  We
> could
> > > > just not use brokers that are not ready, but again, I think that it's
> > > good
> > > > to try to be fair under normal circumstances, so if normally brokers
> > can
> > > > respond under some partition.availability.timeout.ms threshold and
> the
> > > > application works well with those latencies, then we could distribute
> > > data
> > > > equally between brokers that don't exceed the latencies.  The value,
> of
> > > > course, would depend on the environment and app requirements, hence
> > it's
> > > > configurable.
> > > >
> > > > 10. Added a statement at the beginning of the proposed changes.
> > > >
> > > > -Artem
> > > >
> > > >
> > > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <ju...@confluent.io.invalid>
> > > wrote:
> > > >
> > > > > Hi, Artem,
> > > > >
> > > > > Thanks for the KIP. A few comments below.
> > > > >
> > > > > 1. I agree with Luke that having the partitioner returning -1 is
> kind
> > > of
> > > > > weird. Could we just change the implementation of
> DefaultPartitioner
> > to
> > > > the
> > > > > new behavior?
> > > > >
> > > > > 2. partitioner.sticky.batch.size: Similar question to Luke. I am
> not
> > > sure
> > > > > why we want to introduce this new configuration. Could we just use
> > the
> > > > > existing batch.size?
> > > > >
> > > > > 4. I also agree with Luke that it's not clear why we need
> > > > > partition.availability.timeout.ms. The KIP says the broker "would
> > not
> > > be
> > > > > chosen until the broker is able to accept the next ready batch from
> > the
> > > > > partition". If we are keeping track of this, could we just avoid
> > > > assigning
> > > > > records to partitions whose leader is not able to accept the next
> > > batch?
> > > > If
> > > > > we do that, perhaps we don't need
> partition.availability.timeout.ms.
> > > > >
> > > > > 10. Currently, partitioner.class defaults to DefaultPartitioner,
> > which
> > > > uses
> > > > > StickyPartitioner when the key is specified. Since this KIP
> improves
> > > upon
> > > > > StickyPartitioner, it would be useful to make the new behavior the
> > > > default
> > > > > and document that in the KIP.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com>
> wrote:
> > > > >
> > > > > > Hi Artem,
> > > > > >
> > > > > > Also, one more thing I think you need to know.
> > > > > > As this bug KAFKA-7572 <
> > > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > > >
> > > > > > mentioned, sometimes the custom partitioner would return negative
> > > > > partition
> > > > > > id accidentally.
> > > > > > If it returned -1, how could you know if it is expected or not
> > > > expected?
> > > > > >
> > > > > > Thanks.
> > > > > > Luke
> > > > > >
> > > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Hi Artem,
> > > > > > >
> > > > > > > Thanks for the update. I have some questions about it:
> > > > > > >
> > > > > > > 1. Could you explain why you need the `partitioner` return -1?
> In
> > > > which
> > > > > > > case we need it? And how it is used in your KIP?
> > > > > > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > > > > > "Configuration" part, you didn't explain it. And default to 0,
> I
> > > > guess
> > > > > > it's
> > > > > > > the same as current behavior for backward compatibility, right?
> > You
> > > > > > should
> > > > > > > mention it.
> > > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > > "partitioner.sticky.batch.size". Let's say, we already
> accumulate
> > > > > 15.5KB
> > > > > > in
> > > > > > > partition1, and sent. So when next batch created, in your
> current
> > > > > design,
> > > > > > > we still stick to partition1, until 16KB reached, and then we
> > > create
> > > > a
> > > > > > new
> > > > > > > batch to change to next partition, ex: partition2. But I think
> if
> > > we
> > > > > set
> > > > > > a
> > > > > > > threshold to 95% (for example), we can know previous 15.5KB
> > already
> > > > > > exceeds
> > > > > > > the threshold so that we can directly create new batch for next
> > > > > records.
> > > > > > > This way should be able to make it more efficient. WDYT?
> > > > > > > 4. I think the improved queuing logic should be good enough. I
> > > can't
> > > > > get
> > > > > > > the benefit of having `partition.availability.timeout.ms`
> > config.
> > > In
> > > > > > > short, you want to make the partitioner take the broker load
> into
> > > > > > > consideration. We can just improve that in the queuing logic
> (and
> > > you
> > > > > > > already did it). Why should we add the config? Could you use
> some
> > > > > > examples
> > > > > > > to explain why we need it.
> > > > > > >
> > > > > > > Thank you.
> > > > > > > Luke
> > > > > > >
> > > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > > <al...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > >> Hello,
> > > > > > >>
> > > > > > >> Please add your comments about the KIP.  If there are no
> > > > > considerations,
> > > > > > >> I'll put it up for vote in the next few days.
> > > > > > >>
> > > > > > >> -Artem
> > > > > > >>
> > > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > > alivshits@confluent.io
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hello,
> > > > > > >> >
> > > > > > >> > After trying a few prototypes, I've made some changes to the
> > > > public
> > > > > > >> > interface.  Please see the updated document
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > >> > .
> > > > > > >> >
> > > > > > >> > -Artem
> > > > > > >> >
> > > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > > alivshits@confluent.io>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> >> Hello,
> > > > > > >> >>
> > > > > > >> >> This is the discussion thread for
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > > >> >> .
> > > > > > >> >>
> > > > > > >> >> The proposal is a bug fix for
> > > > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it
> > does
> > > > > > >> include a
> > > > > > >> >> client config change, therefore we have a KIP to discuss.
> > > > > > >> >>
> > > > > > >> >> -Artem
> > > > > > >> >>
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

Thanks for the reply. A few more comments.

1. Since we control the implementation and the usage of DefaultPartitioner,
another way is to instantiate the DefaultPartitioner with a special
constructor, which allows it to have more access to internal information.
Then we could just change the behavior of DefaultPartitioner such that it
can use the internal information when choosing the partition. This seems
more intuitive than having DefaultPartitioner return -1 as the partition.
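
For illustration, the construction side of that idea could look roughly
like the sketch below.  The extra DefaultPartitioner constructor and the
ProducerInternalState type are hypothetical and do not exist today; the
accumulator and metadata references stand in for producer-internal
fields.

// Hypothetical sketch inside KafkaProducer initialization; names are illustrative only.
Partitioner partitioner;
if (config.getClass(ProducerConfig.PARTITIONER_CLASS_CONFIG) == DefaultPartitioner.class) {
    // The producer constructs the default partitioner itself, so it can pass a view
    // of its internal state (accumulator queue sizes, broker readiness, ...).
    partitioner = new DefaultPartitioner(new ProducerInternalState(accumulator, metadata));
} else {
    // User-supplied partitioners are created reflectively and see only the public API.
    partitioner = config.getConfiguredInstance(ProducerConfig.PARTITIONER_CLASS_CONFIG,
                                               Partitioner.class);
}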

2. I guess partitioner.sticky.batch.size is introduced because the
effective batch size could be less than batch.size and we want to align
partition switching with the effective batch size. How would a user know
the effective batch size to set partitioner.sticky.batch.size properly? If
the user somehow knows the effective batch size, does setting batch.size to
the effective batch size achieve the same result?

4. Thanks for the explanation. Makes sense to me.

Thanks,

Jun

On Fri, Feb 25, 2022 at 8:26 PM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hi Jun,
>
> 1. Updated the KIP to add a couple paragraphs about implementation
> necessities in the Proposed Changes section.
>
> 2. Sorry if my reply was confusing, what I meant to say (and I elaborated
> on that in point #3) is that there could be patterns for which 16KB
> wouldn't be the most effective setting, thus it would be good to make it
> configurable.
>
> 4. We could use broker readiness timeout.  But I'm not sure it would
> correctly model the broker load.  The problem is that latency is not an
> accurate measure of throughput, we could have 2 brokers that have equal
> throughput but one has higher latency (so it takes larger batches less
> frequently, but still takes the same load).  Latency-based logic is likely
> to send less data to the broker with higher latency.  Using the queue size
> would adapt to throughput, regardless of latency (which could be just a
> result of deployment topology), so that's the model chosen in the
> proposal.  The partition.availability.timeout.ms logic approaches the
> model
> from a slightly different angle, say we have a requirement to deliver
> messages via brokers that have a certain latency, then
> partition.availability.timeout.ms could be used to tune that.  Latency is
> a
> much more volatile metric than throughput (latency depends on external
> load, on capacity, on deployment topology, on jitter in network, on jitter
> in disk, etc.) and I think it would be best to leave latency-based
> thresholds configurable to tune for the environment.
>
> -Artem
>
> On Wed, Feb 23, 2022 at 11:14 AM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the reply. A few more comments.
> >
> > 1. Perhaps you could elaborate a bit more on how the producer determines
> > the partition if the partitioner returns -1. This will help understand
> why
> > encapsulating that logic as a partitioner is not clean.
> >
> > 2. I am not sure that I understand this part. If 15.5KB is more
> efficient,
> > could we just set batch.size to 15.5KB?
> >
> > 4. Yes, we could add a switch (or a variant of the partitioner) for
> > enabling this behavior. Also, choosing partitions based on broker
> readiness
> > can be made in a smoother way. For example, we could track the last time
> a
> > broker has drained any batches from the accumulator. We can then select
> > partitions from brokers proportionally to the inverse of that time. This
> > seems smoother than a cutoff based on a
> partition.availability.timeout.ms
> >  threshold.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> > > Hello Luke, Jun,
> > >
> > > Thank you for your feedback.  I've added the Rejected Alternative
> section
> > > that may clarify some of the questions w.r.t. returning -1.
> > >
> > > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> > significant
> > > part of the logic needs to be in the producer (because it now uses
> > > information about brokers that only the producer knows), so
> encapsulation
> > > of the logic within the default partitioner isn't as clean.   I've
> added
> > > the Rejected Alternative section that documents an attempt to keep the
> > > encapsulation by providing new callbacks to the partitioner.
> > >
> > > 2. The meaning of the partitioner.sticky.batch.size is explained in the
> > > Uniform Sticky Batch Size section.  Basically, we track the amount of
> > bytes
> > > produced to the partition and if it exceeds
> partitioner.sticky.batch.size
> > > then we switch to the next partition.  As far as the reason to make it
> > > different from batch.size, I think Luke answered this with the question
> > #3
> > > -- what if the load pattern is such that 15.5KB would be more efficient
> > > than 16KB?
> > >
> > > 3. I think it's hard to have one size that would fit all patterns.
> E.g.
> > if
> > > the load pattern is such that there is linger and the app fills the
> batch
> > > before linger expires, then having 16KB would most likely synchronize
> > > batching and partition switching, so each partition would get a full
> > > batch.  If load pattern is such that there are a few non-complete
> batches
> > > go out before a larger batch starts to fill, then it may actually be
> > > beneficial to make slightly larger (e.g. linger=0, first few records go
> > in
> > > the first batch, then next few records go to second batch, and so on,
> > until
> > > 5 in-flight, then larger batch would form while waiting for broker to
> > > respond, but the partition switch would happen before the larger batch
> is
> > > full).
> > >
> > > 4. There are a couple of reasons for introducing
> > > partition.availability.timeout.ms.  Luke's an Jun's questions are
> > slightly
> > > different, so I'm going to separate replies.
> > > (Luke) Is the queue size a good enough signal?  I think it's a good
> > default
> > > signal as it tries to preserve general fairness and not overreact on
> the
> > > broker's state at each moment in time.  But because it's smooth, it may
> > not
> > > be reactive enough to instantaneous latency jumps.  For
> latency-sensitive
> > > workloads, it may be desirable to react faster when a broker becomes
> > > unresponsive (but that may make the distribution really choppy), so
> > > partition.availability.timeout.ms provides an opportunity to tune
> > > adaptiveness.
> > >
> > > (Jun) Can we just not assign partitions to brokers that are not ready?
> > > Switching partitions purely based on current broker readiness
> information
> > > can really skew workload I think (or at least I couldn't build a model
> > that
> > > proves that over time it's going to be generally fair), I feel that the
> > > algorithm should try to be fair in general and use smoother signals by
> > > default (e.g. a broker with choppier latency may get much less load
> even
> > > though it can handle throughput, it then may potentially skew
> > consumption),
> > > note that the queue-size-based logic uses probabilities (so we don't
> > fully
> > > remove brokers, just make it less likely) and relative info rather
> than a
> > > threshold (so if all brokers are heavily, but equally loaded, they will
> > get
> > > equal distribution, rather than get removed because they exceed some
> > > threshold).  So at the very least, I would like this logic to be turned
> > off
> > > by default as it's hard to predict what it could do with different
> > patterns
> > > (which means that there would need to be some configuration).  We could
> > > just not use brokers that are not ready, but again, I think that it's
> > good
> > > to try to be fair under normal circumstances, so if normally brokers
> can
> > > respond under some partition.availability.timeout.ms threshold and the
> > > application works well with those latencies, then we could distribute
> > data
> > > equally between brokers that don't exceed the latencies.  The value, of
> > > course, would depend on the environment and app requirements, hence
> it's
> > > configurable.
> > >
> > > 10. Added a statement at the beginning of the proposed changes.
> > >
> > > -Artem
> > >
> > >
> > > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <ju...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Artem,
> > > >
> > > > Thanks for the KIP. A few comments below.
> > > >
> > > > 1. I agree with Luke that having the partitioner returning -1 is kind
> > of
> > > > weird. Could we just change the implementation of DefaultPartitioner
> to
> > > the
> > > > new behavior?
> > > >
> > > > 2. partitioner.sticky.batch.size: Similar question to Luke. I am not
> > sure
> > > > why we want to introduce this new configuration. Could we just use
> the
> > > > existing batch.size?
> > > >
> > > > 4. I also agree with Luke that it's not clear why we need
> > > > partition.availability.timeout.ms. The KIP says the broker "would
> not
> > be
> > > > chosen until the broker is able to accept the next ready batch from
> the
> > > > partition". If we are keeping track of this, could we just avoid
> > > assigning
> > > > records to partitions whose leader is not able to accept the next
> > batch?
> > > If
> > > > we do that, perhaps we don't need partition.availability.timeout.ms.
> > > >
> > > > 10. Currently, partitioner.class defaults to DefaultPartitioner,
> which
> > > uses
> > > > StickyPartitioner when the key is specified. Since this KIP improves
> > upon
> > > > StickyPartitioner, it would be useful to make the new behavior the
> > > default
> > > > and document that in the KIP.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com> wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Also, one more thing I think you need to know.
> > > > > As this bug KAFKA-7572 <
> > > https://issues.apache.org/jira/browse/KAFKA-7572
> > > > >
> > > > > mentioned, sometimes the custom partitioner would return negative
> > > > partition
> > > > > id accidentally.
> > > > > If it returned -1, how could you know if it is expected or not
> > > expected?
> > > > >
> > > > > Thanks.
> > > > > Luke
> > > > >
> > > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com>
> wrote:
> > > > >
> > > > > > Hi Artem,
> > > > > >
> > > > > > Thanks for the update. I have some questions about it:
> > > > > >
> > > > > > 1. Could you explain why you need the `partitioner` return -1? In
> > > which
> > > > > > case we need it? And how it is used in your KIP?
> > > > > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > > > > "Configuration" part, you didn't explain it. And default to 0, I
> > > guess
> > > > > it's
> > > > > > the same as current behavior for backward compatibility, right?
> You
> > > > > should
> > > > > > mention it.
> > > > > > 3. I'm thinking we can have a threshold to the
> > > > > > "partitioner.sticky.batch.size". Let's say, we already accumulate
> > > > 15.5KB
> > > > > in
> > > > > > partition1, and sent. So when next batch created, in your current
> > > > design,
> > > > > > we still stick to partition1, until 16KB reached, and then we
> > create
> > > a
> > > > > new
> > > > > > batch to change to next partition, ex: partition2. But I think if
> > we
> > > > set
> > > > > a
> > > > > > threshold to 95% (for example), we can know previous 15.5KB
> already
> > > > > exceeds
> > > > > > the threshold so that we can directly create new batch for next
> > > > records.
> > > > > > This way should be able to make it more efficient. WDYT?
> > > > > > 4. I think the improved queuing logic should be good enough. I
> > can't
> > > > get
> > > > > > the benefit of having `partition.availability.timeout.ms`
> config.
> > In
> > > > > > short, you want to make the partitioner take the broker load into
> > > > > > consideration. We can just improve that in the queuing logic (and
> > you
> > > > > > already did it). Why should we add the config? Could you use some
> > > > > examples
> > > > > > to explain why we need it.
> > > > > >
> > > > > > Thank you.
> > > > > > Luke
> > > > > >
> > > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > > <al...@confluent.io.invalid> wrote:
> > > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > >> Please add your comments about the KIP.  If there are no
> > > > considerations,
> > > > > >> I'll put it up for vote in the next few days.
> > > > > >>
> > > > > >> -Artem
> > > > > >>
> > > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > > alivshits@confluent.io
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hello,
> > > > > >> >
> > > > > >> > After trying a few prototypes, I've made some changes to the
> > > public
> > > > > >> > interface.  Please see the updated document
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > >> > .
> > > > > >> >
> > > > > >> > -Artem
> > > > > >> >
> > > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > > alivshits@confluent.io>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> Hello,
> > > > > >> >>
> > > > > >> >> This is the discussion thread for
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > > >> >> .
> > > > > >> >>
> > > > > >> >> The proposal is a bug fix for
> > > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it
> does
> > > > > >> include a
> > > > > >> >> client config change, therefore we have a KIP to discuss.
> > > > > >> >>
> > > > > >> >> -Artem
> > > > > >> >>
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hi Jun,

1. Updated the KIP to add a couple of paragraphs about implementation
necessities in the Proposed Changes section.

2. Sorry if my reply was confusing; what I meant to say (and I elaborated
on that in point #3) is that there could be patterns for which 16KB
wouldn't be the most effective setting, so it would be good to make it
configurable.

4. We could use a broker readiness timeout, but I'm not sure it would
correctly model the broker load.  The problem is that latency is not an
accurate measure of throughput: we could have 2 brokers with equal
throughput where one has higher latency (so it takes larger batches less
frequently, but still carries the same load).  Latency-based logic is
likely to send less data to the broker with higher latency.  Using the
queue size would adapt to throughput, regardless of latency (which could
be just a result of deployment topology), so that's the model chosen in
the proposal.  The partition.availability.timeout.ms logic approaches the
model from a slightly different angle: say we have a requirement to
deliver messages via brokers that respond within a certain latency; then
partition.availability.timeout.ms could be used to tune for that.
Latency is a much more volatile metric than throughput (it depends on
external load, on capacity, on deployment topology, on network jitter, on
disk jitter, etc.), so I think it would be best to leave latency-based
thresholds configurable to tune for the environment.
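
To picture the queue-size model, here is a small sketch of the idea (not
the actual producer code): the probability of picking a partition is made
inversely proportional to how much data is already queued for it, so a
slow broker is throttled down gradually rather than excluded outright.

import java.util.Random;

// Sketch: pick a partition with probability inversely proportional to its queue size.
// Partitions on slower brokers (longer queues) are chosen less often but never excluded.
static int choosePartition(int[] queueSizes, Random random) {
    double[] weights = new double[queueSizes.length];
    double total = 0;
    for (int i = 0; i < queueSizes.length; i++) {
        weights[i] = 1.0 / (queueSizes[i] + 1);   // +1 avoids division by zero on empty queues
        total += weights[i];
    }
    double r = random.nextDouble() * total;
    for (int i = 0; i < weights.length; i++) {
        r -= weights[i];
        if (r <= 0)
            return i;
    }
    return weights.length - 1;                    // guard against rounding
}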

-Artem

On Wed, Feb 23, 2022 at 11:14 AM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> Thanks for the reply. A few more comments.
>
> 1. Perhaps you could elaborate a bit more on how the producer determines
> the partition if the partitioner returns -1. This will help understand why
> encapsulating that logic as a partitioner is not clean.
>
> 2. I am not sure that I understand this part. If 15.5KB is more efficient,
> could we just set batch.size to 15.5KB?
>
> 4. Yes, we could add a switch (or a variant of the partitioner) for
> enabling this behavior. Also, choosing partitions based on broker readiness
> can be made in a smoother way. For example, we could track the last time a
> broker has drained any batches from the accumulator. We can then select
> partitions from brokers proportionally to the inverse of that time. This
> seems smoother than a cutoff based on a partition.availability.timeout.ms
>  threshold.
>
> Thanks,
>
> Jun
>
> On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
> > Hello Luke, Jun,
> >
> > Thank you for your feedback.  I've added the Rejected Alternative section
> > that may clarify some of the questions w.r.t. returning -1.
> >
> > 1. I've elaborated on the -1 in the KIP.  The problem is that a
> significant
> > part of the logic needs to be in the producer (because it now uses
> > information about brokers that only the producer knows), so encapsulation
> > of the logic within the default partitioner isn't as clean.   I've added
> > the Rejected Alternative section that documents an attempt to keep the
> > encapsulation by providing new callbacks to the partitioner.
> >
> > 2. The meaning of the partitioner.sticky.batch.size is explained in the
> > Uniform Sticky Batch Size section.  Basically, we track the amount of
> bytes
> > produced to the partition and if it exceeds partitioner.sticky.batch.size
> > then we switch to the next partition.  As far as the reason to make it
> > different from batch.size, I think Luke answered this with the question
> #3
> > -- what if the load pattern is such that 15.5KB would be more efficient
> > than 16KB?
> >
> > 3. I think it's hard to have one size that would fit all patterns.  E.g.
> if
> > the load pattern is such that there is linger and the app fills the batch
> > before linger expires, then having 16KB would most likely synchronize
> > batching and partition switching, so each partition would get a full
> > batch.  If load pattern is such that there are a few non-complete batches
> > go out before a larger batch starts to fill, then it may actually be
> > beneficial to make slightly larger (e.g. linger=0, first few records go
> in
> > the first batch, then next few records go to second batch, and so on,
> until
> > 5 in-flight, then larger batch would form while waiting for broker to
> > respond, but the partition switch would happen before the larger batch is
> > full).
> >
> > 4. There are a couple of reasons for introducing
> > partition.availability.timeout.ms.  Luke's an Jun's questions are
> slightly
> > different, so I'm going to separate replies.
> > (Luke) Is the queue size a good enough signal?  I think it's a good
> default
> > signal as it tries to preserve general fairness and not overreact on the
> > broker's state at each moment in time.  But because it's smooth, it may
> not
> > be reactive enough to instantaneous latency jumps.  For latency-sensitive
> > workloads, it may be desirable to react faster when a broker becomes
> > unresponsive (but that may make the distribution really choppy), so
> > partition.availability.timeout.ms provides an opportunity to tune
> > adaptiveness.
> >
> > (Jun) Can we just not assign partitions to brokers that are not ready?
> > Switching partitions purely based on current broker readiness information
> > can really skew workload I think (or at least I couldn't build a model
> that
> > proves that over time it's going to be generally fair), I feel that the
> > algorithm should try to be fair in general and use smoother signals by
> > default (e.g. a broker with choppier latency may get much less load even
> > though it can handle throughput, it then may potentially skew
> consumption),
> > note that the queue-size-based logic uses probabilities (so we don't
> fully
> > remove brokers, just make it less likely) and relative info rather than a
> > threshold (so if all brokers are heavily, but equally loaded, they will
> get
> > equal distribution, rather than get removed because they exceed some
> > threshold).  So at the very least, I would like this logic to be turned
> off
> > by default as it's hard to predict what it could do with different
> patterns
> > (which means that there would need to be some configuration).  We could
> > just not use brokers that are not ready, but again, I think that it's
> good
> > to try to be fair under normal circumstances, so if normally brokers can
> > respond under some partition.availability.timeout.ms threshold and the
> > application works well with those latencies, then we could distribute
> data
> > equally between brokers that don't exceed the latencies.  The value, of
> > course, would depend on the environment and app requirements, hence it's
> > configurable.
> >
> > 10. Added a statement at the beginning of the proposed changes.
> >
> > -Artem
> >
> >
> > On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <ju...@confluent.io.invalid>
> wrote:
> >
> > > Hi, Artem,
> > >
> > > Thanks for the KIP. A few comments below.
> > >
> > > 1. I agree with Luke that having the partitioner returning -1 is kind
> of
> > > weird. Could we just change the implementation of DefaultPartitioner to
> > the
> > > new behavior?
> > >
> > > 2. partitioner.sticky.batch.size: Similar question to Luke. I am not
> sure
> > > why we want to introduce this new configuration. Could we just use the
> > > existing batch.size?
> > >
> > > 4. I also agree with Luke that it's not clear why we need
> > > partition.availability.timeout.ms. The KIP says the broker "would not
> be
> > > chosen until the broker is able to accept the next ready batch from the
> > > partition". If we are keeping track of this, could we just avoid
> > assigning
> > > records to partitions whose leader is not able to accept the next
> batch?
> > If
> > > we do that, perhaps we don't need partition.availability.timeout.ms.
> > >
> > > 10. Currently, partitioner.class defaults to DefaultPartitioner, which
> > uses
> > > StickyPartitioner when the key is specified. Since this KIP improves
> upon
> > > StickyPartitioner, it would be useful to make the new behavior the
> > default
> > > and document that in the KIP.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com> wrote:
> > >
> > > > Hi Artem,
> > > >
> > > > Also, one more thing I think you need to know.
> > > > As this bug KAFKA-7572 <
> > https://issues.apache.org/jira/browse/KAFKA-7572
> > > >
> > > > mentioned, sometimes the custom partitioner would return negative
> > > partition
> > > > id accidentally.
> > > > If it returned -1, how could you know if it is expected or not
> > expected?
> > > >
> > > > Thanks.
> > > > Luke
> > > >
> > > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com> wrote:
> > > >
> > > > > Hi Artem,
> > > > >
> > > > > Thanks for the update. I have some questions about it:
> > > > >
> > > > > 1. Could you explain why you need the `partitioner` return -1? In
> > which
> > > > > case we need it? And how it is used in your KIP?
> > > > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > > > "Configuration" part, you didn't explain it. And default to 0, I
> > guess
> > > > it's
> > > > > the same as current behavior for backward compatibility, right? You
> > > > should
> > > > > mention it.
> > > > > 3. I'm thinking we can have a threshold to the
> > > > > "partitioner.sticky.batch.size". Let's say, we already accumulate
> > > 15.5KB
> > > > in
> > > > > partition1, and sent. So when next batch created, in your current
> > > design,
> > > > > we still stick to partition1, until 16KB reached, and then we
> create
> > a
> > > > new
> > > > > batch to change to next partition, ex: partition2. But I think if
> we
> > > set
> > > > a
> > > > > threshold to 95% (for example), we can know previous 15.5KB already
> > > > exceeds
> > > > > the threshold so that we can directly create new batch for next
> > > records.
> > > > > This way should be able to make it more efficient. WDYT?
> > > > > 4. I think the improved queuing logic should be good enough. I
> can't
> > > get
> > > > > the benefit of having `partition.availability.timeout.ms` config.
> In
> > > > > short, you want to make the partitioner take the broker load into
> > > > > consideration. We can just improve that in the queuing logic (and
> you
> > > > > already did it). Why should we add the config? Could you use some
> > > > examples
> > > > > to explain why we need it.
> > > > >
> > > > > Thank you.
> > > > > Luke
> > > > >
> > > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > > <al...@confluent.io.invalid> wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> Please add your comments about the KIP.  If there are no
> > > considerations,
> > > > >> I'll put it up for vote in the next few days.
> > > > >>
> > > > >> -Artem
> > > > >>
> > > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> > alivshits@confluent.io
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hello,
> > > > >> >
> > > > >> > After trying a few prototypes, I've made some changes to the
> > public
> > > > >> > interface.  Please see the updated document
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > >> > .
> > > > >> >
> > > > >> > -Artem
> > > > >> >
> > > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > > alivshits@confluent.io>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Hello,
> > > > >> >>
> > > > >> >> This is the discussion thread for
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > > >> >> .
> > > > >> >>
> > > > >> >> The proposal is a bug fix for
> > > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > > > >> include a
> > > > >> >> client config change, therefore we have a KIP to discuss.
> > > > >> >>
> > > > >> >> -Artem
> > > > >> >>
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

Thanks for the reply. A few more comments.

1. Perhaps you could elaborate a bit more on how the producer determines
the partition if the partitioner returns -1. This will help understand why
encapsulating that logic as a partitioner is not clean.

2. I am not sure that I understand this part. If 15.5KB is more efficient,
could we just set batch.size to 15.5KB?

4. Yes, we could add a switch (or a variant of the partitioner) for
enabling this behavior. Also, choosing partitions based on broker readiness
can be made in a smoother way. For example, we could track the last time a
broker has drained any batches from the accumulator. We can then select
partitions from brokers proportionally to the inverse of that time. This
seems smoother than a cutoff based on a partition.availability.timeout.ms
 threshold.
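
To illustrate the kind of selection I have in mind (just a sketch with
made-up names such as lastDrainTimeMs, not actual producer internals), the
weighting could look roughly like this:

    // Sketch: weight each candidate partition by the inverse of how long ago
    // its leader last drained a batch from the accumulator (a more recent
    // drain means a higher weight), then pick one at random by weight.
    static int choosePartition(List<Integer> partitionIds,
                               Map<Integer, Integer> partitionToLeader,
                               Map<Integer, Long> lastDrainTimeMs,
                               long nowMs) {
        double[] weights = new double[partitionIds.size()];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            int leader = partitionToLeader.get(partitionIds.get(i));
            long sinceDrainMs = Math.max(1L, nowMs - lastDrainTimeMs.getOrDefault(leader, nowMs));
            weights[i] = 1.0 / sinceDrainMs;
            total += weights[i];
        }
        double r = ThreadLocalRandom.current().nextDouble() * total;
        double acc = 0.0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            if (r <= acc)
                return partitionIds.get(i);
        }
        return partitionIds.get(weights.length - 1);
    }

This assumes the usual java.util and java.util.concurrent imports, and that
the sender updates lastDrainTimeMs whenever it drains batches for a broker.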

Thanks,

Jun

On Fri, Feb 18, 2022 at 5:14 PM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hello Luke, Jun,
>
> Thank you for your feedback.  I've added the Rejected Alternative section
> that may clarify some of the questions w.r.t. returning -1.
>
> 1. I've elaborated on the -1 in the KIP.  The problem is that a significant
> part of the logic needs to be in the producer (because it now uses
> information about brokers that only the producer knows), so encapsulation
> of the logic within the default partitioner isn't as clean.   I've added
> the Rejected Alternative section that documents an attempt to keep the
> encapsulation by providing new callbacks to the partitioner.
>
> 2. The meaning of the partitioner.sticky.batch.size is explained in the
> Uniform Sticky Batch Size section.  Basically, we track the amount of bytes
> produced to the partition and if it exceeds partitioner.sticky.batch.size
> then we switch to the next partition.  As far as the reason to make it
> different from batch.size, I think Luke answered this with the question #3
> -- what if the load pattern is such that 15.5KB would be more efficient
> than 16KB?
>
> 3. I think it's hard to have one size that would fit all patterns.  E.g. if
> the load pattern is such that there is linger and the app fills the batch
> before linger expires, then having 16KB would most likely synchronize
> batching and partition switching, so each partition would get a full
> batch.  If the load pattern is such that a few non-complete batches go out
> before a larger batch starts to fill, then it may actually be beneficial to
> make it slightly larger (e.g. with linger=0, the first few records go in the
> first batch, then the next few records go to the second batch, and so on,
> until 5 batches are in flight; then a larger batch would form while waiting
> for the broker to respond, but the partition switch would happen before the
> larger batch is full).
>
> 4. There are a couple of reasons for introducing
> partition.availability.timeout.ms.  Luke's and Jun's questions are slightly
> different, so I'm going to separate replies.
> (Luke) Is the queue size a good enough signal?  I think it's a good default
> signal as it tries to preserve general fairness and not overreact to the
> broker's state at each moment in time.  But because it's smooth, it may not
> be reactive enough to instantaneous latency jumps.  For latency-sensitive
> workloads, it may be desirable to react faster when a broker becomes
> unresponsive (but that may make the distribution really choppy), so
> partition.availability.timeout.ms provides an opportunity to tune
> adaptiveness.
>
> (Jun) Can we just not assign partitions to brokers that are not ready?
> Switching partitions purely based on current broker readiness information
> can really skew the workload, I think (or at least I couldn't build a model
> that proves it would be generally fair over time).  I feel that the
> algorithm should try to be fair in general and use smoother signals by
> default (e.g. a broker with choppier latency may get much less load even
> though it can handle the throughput, which may in turn skew consumption).
> Note that the queue-size-based logic uses probabilities (so we don't fully
> remove brokers, just make them less likely) and relative info rather than a
> threshold (so if all brokers are heavily, but equally loaded, they will get
> equal distribution rather than get removed because they exceed some
> threshold).  So at the very least, I would like this logic to be turned off
> by default, as it's hard to predict what it could do with different patterns
> (which means that there would need to be some configuration).  We could
> just not use brokers that are not ready, but again, I think that it's good
> to try to be fair under normal circumstances, so if brokers normally
> respond within some partition.availability.timeout.ms threshold and the
> application works well with those latencies, then we could distribute data
> equally between the brokers that don't exceed those latencies.  The value,
> of course, would depend on the environment and app requirements, hence it's
> configurable.
>
> 10. Added a statement at the beginning of the proposed changes.
>
> -Artem
>
>
> On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Artem,
> >
> > Thanks for the KIP. A few comments below.
> >
> > 1. I agree with Luke that having the partitioner returning -1 is kind of
> > weird. Could we just change the implementation of DefaultPartitioner to
> the
> > new behavior?
> >
> > 2. partitioner.sticky.batch.size: Similar question to Luke. I am not sure
> > why we want to introduce this new configuration. Could we just use the
> > existing batch.size?
> >
> > 4. I also agree with Luke that it's not clear why we need
> > partition.availability.timeout.ms. The KIP says the broker "would not be
> > chosen until the broker is able to accept the next ready batch from the
> > partition". If we are keeping track of this, could we just avoid
> assigning
> > records to partitions whose leader is not able to accept the next batch?
> If
> > we do that, perhaps we don't need partition.availability.timeout.ms.
> >
> > 10. Currently, partitioner.class defaults to DefaultPartitioner, which
> uses
> > StickyPartitioner when the key is specified. Since this KIP improves upon
> > StickyPartitioner, it would be useful to make the new behavior the
> default
> > and document that in the KIP.
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com> wrote:
> >
> > > Hi Artem,
> > >
> > > Also, one more thing I think you need to know.
> > > As this bug KAFKA-7572 <
> https://issues.apache.org/jira/browse/KAFKA-7572
> > >
> > > mentioned, sometimes the custom partitioner would return negative
> > partition
> > > id accidentally.
> > > If it returned -1, how could you know if it is expected or not
> expected?
> > >
> > > Thanks.
> > > Luke
> > >
> > > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com> wrote:
> > >
> > > > Hi Artem,
> > > >
> > > > Thanks for the update. I have some questions about it:
> > > >
> > > > 1. Could you explain why you need the `partitioner` return -1? In
> which
> > > > case we need it? And how it is used in your KIP?
> > > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > > "Configuration" part, you didn't explain it. And default to 0, I
> guess
> > > it's
> > > > the same as current behavior for backward compatibility, right? You
> > > should
> > > > mention it.
> > > > 3. I'm thinking we can have a threshold to the
> > > > "partitioner.sticky.batch.size". Let's say, we already accumulate
> > 15.5KB
> > > in
> > > > partition1, and sent. So when next batch created, in your current
> > design,
> > > > we still stick to partition1, until 16KB reached, and then we create
> a
> > > new
> > > > batch to change to next partition, ex: partition2. But I think if we
> > set
> > > a
> > > > threshold to 95% (for example), we can know previous 15.5KB already
> > > exceeds
> > > > the threshold so that we can directly create new batch for next
> > records.
> > > > This way should be able to make it more efficient. WDYT?
> > > > 4. I think the improved queuing logic should be good enough. I can't
> > get
> > > > the benefit of having `partition.availability.timeout.ms` config. In
> > > > short, you want to make the partitioner take the broker load into
> > > > consideration. We can just improve that in the queuing logic (and you
> > > > already did it). Why should we add the config? Could you use some
> > > examples
> > > > to explain why we need it.
> > > >
> > > > Thank you.
> > > > Luke
> > > >
> > > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > > <al...@confluent.io.invalid> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> Please add your comments about the KIP.  If there are no
> > considerations,
> > > >> I'll put it up for vote in the next few days.
> > > >>
> > > >> -Artem
> > > >>
> > > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <
> alivshits@confluent.io
> > >
> > > >> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > After trying a few prototypes, I've made some changes to the
> public
> > > >> > interface.  Please see the updated document
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > >> > .
> > > >> >
> > > >> > -Artem
> > > >> >
> > > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > > alivshits@confluent.io>
> > > >> > wrote:
> > > >> >
> > > >> >> Hello,
> > > >> >>
> > > >> >> This is the discussion thread for
> > > >> >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > > >> >> .
> > > >> >>
> > > >> >> The proposal is a bug fix for
> > > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > > >> include a
> > > >> >> client config change, therefore we have a KIP to discuss.
> > > >> >>
> > > >> >> -Artem
> > > >> >>
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hello Luke, Jun,

Thank you for your feedback.  I've added the Rejected Alternative section
that may clarify some of the questions w.r.t. returning -1.

1. I've elaborated on the -1 in the KIP.  The problem is that a significant
part of the logic needs to be in the producer (because it now uses
information about brokers that only the producer knows), so encapsulation
of the logic within the default partitioner isn't as clean.   I've added
the Rejected Alternative section that documents an attempt to keep the
encapsulation by providing new callbacks to the partitioner.
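
To make the flow concrete, the producer-side hook would look roughly like
this (a simplified sketch, not the actual patch; builtInPartitioner and its
nextPartition method are illustrative names):

    // Sketch of the producer send path: a partitioner that returns -1 is
    // saying "no preference", and the producer falls back to its built-in
    // uniform sticky logic, which can see broker/queue state that a
    // partitioner cannot.
    private int resolvePartition(ProducerRecord<byte[], byte[]> record,
                                 byte[] keyBytes, byte[] valueBytes, Cluster cluster) {
        int partition = partitioner.partition(record.topic(), record.key(), keyBytes,
                                              record.value(), valueBytes, cluster);
        if (partition == -1)
            partition = builtInPartitioner.nextPartition(record.topic(), cluster);
        return partition;
    }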

2. The meaning of the partitioner.sticky.batch.size is explained in the
Uniform Sticky Batch Size section.  Basically, we track the amount of bytes
produced to the partition and if it exceeds partitioner.sticky.batch.size
then we switch to the next partition.  As far as the reason to make it
different from batch.size, I think Luke answered this with the question #3
-- what if the load pattern is such that 15.5KB would be more efficient
than 16KB?
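
In pseudocode the switching rule is essentially the following (a sketch, not
the exact implementation; field names are illustrative):

    // Stick to the current partition until roughly partitioner.sticky.batch.size
    // bytes have been produced to it, then move on to the next partition.
    void maybeSwitchPartition(int appendedBytes) {
        producedBytes += appendedBytes;
        if (producedBytes >= stickyBatchSize) {      // partitioner.sticky.batch.size
            currentPartition = nextPartition();      // choose the next sticky partition
            producedBytes = 0;
        }
    }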

3. I think it's hard to have one size that would fit all patterns.  E.g. if
the load pattern is such that there is linger and the app fills the batch
before linger expires, then having 16KB would most likely synchronize
batching and partition switching, so each partition would get a full
batch.  If the load pattern is such that a few non-complete batches go out
before a larger batch starts to fill, then it may actually be beneficial to
make it slightly larger (e.g. with linger=0, the first few records go in the
first batch, then the next few records go to the second batch, and so on,
until 5 batches are in flight; then a larger batch would form while waiting
for the broker to respond, but the partition switch would happen before the
larger batch is full).

4. There are a couple of reasons for introducing
partition.availability.timeout.ms.  Luke's and Jun's questions are slightly
different, so I'm going to separate replies.
(Luke) Is the queue size a good enough signal?  I think it's a good default
signal as it tries to preserve general fairness and not overreact to the
broker's state at each moment in time.  But because it's smooth, it may not
be reactive enough to instantaneous latency jumps.  For latency-sensitive
workloads, it may be desirable to react faster when a broker becomes
unresponsive (but that may make the distribution really choppy), so
partition.availability.timeout.ms provides an opportunity to tune
adaptiveness.

(Jun) Can we just not assign partitions to brokers that are not ready?
Switching partitions purely based on current broker readiness information
can really skew the workload, I think (or at least I couldn't build a model
that proves it would be generally fair over time).  I feel that the
algorithm should try to be fair in general and use smoother signals by
default (e.g. a broker with choppier latency may get much less load even
though it can handle the throughput, which may in turn skew consumption).
Note that the queue-size-based logic uses probabilities (so we don't fully
remove brokers, just make them less likely) and relative info rather than a
threshold (so if all brokers are heavily, but equally loaded, they will get
equal distribution rather than get removed because they exceed some
threshold).  So at the very least, I would like this logic to be turned off
by default, as it's hard to predict what it could do with different patterns
(which means that there would need to be some configuration).  We could
just not use brokers that are not ready, but again, I think that it's good
to try to be fair under normal circumstances, so if brokers normally
respond within some partition.availability.timeout.ms threshold and the
application works well with those latencies, then we could distribute data
equally between the brokers that don't exceed those latencies.  The value,
of course, would depend on the environment and app requirements, hence it's
configurable.
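
To make the partition.availability.timeout.ms part concrete (again a sketch
with illustrative names, not the actual code):

    // With partition.availability.timeout.ms > 0, a partition whose oldest
    // ready batch has been waiting for the broker longer than the timeout is
    // skipped; the usual queue-size-weighted choice applies to the rest.
    boolean isCandidate(int partition, long nowMs) {
        Long readySinceMs = firstReadyBatchTimeMs.get(partition);  // null if nothing is waiting
        return readySinceMs == null
            || nowMs - readySinceMs <= partitionAvailabilityTimeoutMs;
    }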

10. Added a statement at the beginning of the proposed changes.

-Artem


On Thu, Feb 17, 2022 at 3:46 PM Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Artem,
>
> Thanks for the KIP. A few comments below.
>
> 1. I agree with Luke that having the partitioner returning -1 is kind of
> weird. Could we just change the implementation of DefaultPartitioner to the
> new behavior?
>
> 2. partitioner.sticky.batch.size: Similar question to Luke. I am not sure
> why we want to introduce this new configuration. Could we just use the
> existing batch.size?
>
> 4. I also agree with Luke that it's not clear why we need
> partition.availability.timeout.ms. The KIP says the broker "would not be
> chosen until the broker is able to accept the next ready batch from the
> partition". If we are keeping track of this, could we just avoid assigning
> records to partitions whose leader is not able to accept the next batch? If
> we do that, perhaps we don't need partition.availability.timeout.ms.
>
> 10. Currently, partitioner.class defaults to DefaultPartitioner, which uses
> StickyPartitioner when the key is specified. Since this KIP improves upon
> StickyPartitioner, it would be useful to make the new behavior the default
> and document that in the KIP.
>
> Thanks,
>
> Jun
>
>
> On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com> wrote:
>
> > Hi Artem,
> >
> > Also, one more thing I think you need to know.
> > As this bug KAFKA-7572 <https://issues.apache.org/jira/browse/KAFKA-7572
> >
> > mentioned, sometimes the custom partitioner would return negative
> partition
> > id accidentally.
> > If it returned -1, how could you know if it is expected or not expected?
> >
> > Thanks.
> > Luke
> >
> > On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com> wrote:
> >
> > > Hi Artem,
> > >
> > > Thanks for the update. I have some questions about it:
> > >
> > > 1. Could you explain why you need the `partitioner` return -1? In which
> > > case we need it? And how it is used in your KIP?
> > > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > > "Configuration" part, you didn't explain it. And default to 0, I guess
> > it's
> > > the same as current behavior for backward compatibility, right? You
> > should
> > > mention it.
> > > 3. I'm thinking we can have a threshold to the
> > > "partitioner.sticky.batch.size". Let's say, we already accumulate
> 15.5KB
> > in
> > > partition1, and sent. So when next batch created, in your current
> design,
> > > we still stick to partition1, until 16KB reached, and then we create a
> > new
> > > batch to change to next partition, ex: partition2. But I think if we
> set
> > a
> > > threshold to 95% (for example), we can know previous 15.5KB already
> > exceeds
> > > the threshold so that we can directly create new batch for next
> records.
> > > This way should be able to make it more efficient. WDYT?
> > > 4. I think the improved queuing logic should be good enough. I can't
> get
> > > the benefit of having `partition.availability.timeout.ms` config. In
> > > short, you want to make the partitioner take the broker load into
> > > consideration. We can just improve that in the queuing logic (and you
> > > already did it). Why should we add the config? Could you use some
> > examples
> > > to explain why we need it.
> > >
> > > Thank you.
> > > Luke
> > >
> > > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > > <al...@confluent.io.invalid> wrote:
> > >
> > >> Hello,
> > >>
> > >> Please add your comments about the KIP.  If there are no
> considerations,
> > >> I'll put it up for vote in the next few days.
> > >>
> > >> -Artem
> > >>
> > >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <alivshits@confluent.io
> >
> > >> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > After trying a few prototypes, I've made some changes to the public
> > >> > interface.  Please see the updated document
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > >> > .
> > >> >
> > >> > -Artem
> > >> >
> > >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> > alivshits@confluent.io>
> > >> > wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> This is the discussion thread for
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > >> >> .
> > >> >>
> > >> >> The proposal is a bug fix for
> > >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> > >> include a
> > >> >> client config change, therefore we have a KIP to discuss.
> > >> >>
> > >> >> -Artem
> > >> >>
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Artem,

Thanks for the KIP. A few comments below.

1. I agree with Luke that having the partitioner returning -1 is kind of
weird. Could we just change the implementation of DefaultPartitioner to the
new behavior?

2. partitioner.sticky.batch.size: Similar question to Luke. I am not sure
why we want to introduce this new configuration. Could we just use the
existing batch.size?

4. I also agree with Luke that it's not clear why we need
partition.availability.timeout.ms. The KIP says the broker "would not be
chosen until the broker is able to accept the next ready batch from the
partition". If we are keeping track of this, could we just avoid assigning
records to partitions whose leader is not able to accept the next batch? If
we do that, perhaps we don't need partition.availability.timeout.ms.

10. Currently, partitioner.class defaults to DefaultPartitioner, which uses
StickyPartitioner when the key is specified. Since this KIP improves upon
StickyPartitioner, it would be useful to make the new behavior the default
and document that in the KIP.

Thanks,

Jun


On Wed, Feb 16, 2022 at 7:30 PM Luke Chen <sh...@gmail.com> wrote:

> Hi Artem,
>
> Also, one more thing I think you need to know.
> As this bug KAFKA-7572 <https://issues.apache.org/jira/browse/KAFKA-7572>
> mentioned, sometimes the custom partitioner would return negative partition
> id accidentally.
> If it returned -1, how could you know if it is expected or not expected?
>
> Thanks.
> Luke
>
> On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com> wrote:
>
> > Hi Artem,
> >
> > Thanks for the update. I have some questions about it:
> >
> > 1. Could you explain why you need the `partitioner` return -1? In which
> > case we need it? And how it is used in your KIP?
> > 2. What does the "partitioner.sticky.batch.size" mean? In the
> > "Configuration" part, you didn't explain it. And default to 0, I guess
> it's
> > the same as current behavior for backward compatibility, right? You
> should
> > mention it.
> > 3. I'm thinking we can have a threshold to the
> > "partitioner.sticky.batch.size". Let's say, we already accumulate 15.5KB
> in
> > partition1, and sent. So when next batch created, in your current design,
> > we still stick to partition1, until 16KB reached, and then we create a
> new
> > batch to change to next partition, ex: partition2. But I think if we set
> a
> > threshold to 95% (for example), we can know previous 15.5KB already
> exceeds
> > the threshold so that we can directly create new batch for next records.
> > This way should be able to make it more efficient. WDYT?
> > 4. I think the improved queuing logic should be good enough. I can't get
> > the benefit of having `partition.availability.timeout.ms` config. In
> > short, you want to make the partitioner take the broker load into
> > consideration. We can just improve that in the queuing logic (and you
> > already did it). Why should we add the config? Could you use some
> examples
> > to explain why we need it.
> >
> > Thank you.
> > Luke
> >
> > On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> > <al...@confluent.io.invalid> wrote:
> >
> >> Hello,
> >>
> >> Please add your comments about the KIP.  If there are no considerations,
> >> I'll put it up for vote in the next few days.
> >>
> >> -Artem
> >>
> >> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <al...@confluent.io>
> >> wrote:
> >>
> >> > Hello,
> >> >
> >> > After trying a few prototypes, I've made some changes to the public
> >> > interface.  Please see the updated document
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> >> > .
> >> >
> >> > -Artem
> >> >
> >> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <
> alivshits@confluent.io>
> >> > wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> This is the discussion thread for
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> >> >> .
> >> >>
> >> >> The proposal is a bug fix for
> >> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does
> >> include a
> >> >> client config change, therefore we have a KIP to discuss.
> >> >>
> >> >> -Artem
> >> >>
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Luke Chen <sh...@gmail.com>.
Hi Artem,

Also, one more thing I think you need to know.
As this bug KAFKA-7572 <https://issues.apache.org/jira/browse/KAFKA-7572>
mentioned, sometimes the custom partitioner would return negative partition
id accidentally.
If it returned -1, how could you know if it is expected or not expected?
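
For example (just an illustration of the kind of bug I mean, not necessarily
the exact case from the JIRA), a custom partitioner like this returns a
negative id for some keys without ever intending it as a signal:

    // Buggy custom partitioner: hashCode() can be negative, so the modulo
    // can produce a negative "partition" that was never meant to mean -1.
    public class MyPartitioner implements org.apache.kafka.clients.producer.Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes,
                             org.apache.kafka.common.Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            return key.hashCode() % numPartitions;  // negative when hashCode() is negative
        }

        @Override
        public void close() {}

        @Override
        public void configure(java.util.Map<String, ?> configs) {}
    }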

Thanks.
Luke

On Wed, Feb 16, 2022 at 3:28 PM Luke Chen <sh...@gmail.com> wrote:

> Hi Artem,
>
> Thanks for the update. I have some questions about it:
>
> 1. Could you explain why you need the `partitioner` return -1? In which
> case we need it? And how it is used in your KIP?
> 2. What does the "partitioner.sticky.batch.size" mean? In the
> "Configuration" part, you didn't explain it. And default to 0, I guess it's
> the same as current behavior for backward compatibility, right? You should
> mention it.
> 3. I'm thinking we can have a threshold to the
> "partitioner.sticky.batch.size". Let's say, we already accumulate 15.5KB in
> partition1, and sent. So when next batch created, in your current design,
> we still stick to partition1, until 16KB reached, and then we create a new
> batch to change to next partition, ex: partition2. But I think if we set a
> threshold to 95% (for example), we can know previous 15.5KB already exceeds
> the threshold so that we can directly create new batch for next records.
> This way should be able to make it more efficient. WDYT?
> 4. I think the improved queuing logic should be good enough. I can't get
> the benefit of having `partition.availability.timeout.ms` config. In
> short, you want to make the partitioner take the broker load into
> consideration. We can just improve that in the queuing logic (and you
> already did it). Why should we add the config? Could you use some examples
> to explain why we need it.
>
> Thank you.
> Luke
>
> On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
> <al...@confluent.io.invalid> wrote:
>
>> Hello,
>>
>> Please add your comments about the KIP.  If there are no considerations,
>> I'll put it up for vote in the next few days.
>>
>> -Artem
>>
>> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <al...@confluent.io>
>> wrote:
>>
>> > Hello,
>> >
>> > After trying a few prototypes, I've made some changes to the public
>> > interface.  Please see the updated document
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> > .
>> >
>> > -Artem
>> >
>> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <al...@confluent.io>
>> > wrote:
>> >
>> >> Hello,
>> >>
>> >> This is the discussion thread for
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> >> .
>> >>
>> >> The proposal is a bug fix for
>> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does
>> include a
>> >> client config change, therefore we have a KIP to discuss.
>> >>
>> >> -Artem
>> >>
>> >
>>
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Luke Chen <sh...@gmail.com>.
Hi Artem,

Thanks for the update. I have some questions about it:

1. Could you explain why you need the `partitioner` return -1? In which
case we need it? And how it is used in your KIP?
2. What does the "partitioner.sticky.batch.size" mean? In the
"Configuration" part, you didn't explain it. And default to 0, I guess it's
the same as current behavior for backward compatibility, right? You should
mention it.
3. I'm thinking we can have a threshold to the
"partitioner.sticky.batch.size". Let's say, we already accumulate 15.5KB in
partition1, and sent. So when next batch created, in your current design,
we still stick to partition1, until 16KB reached, and then we create a new
batch to change to next partition, ex: partition2. But I think if we set a
threshold to 95% (for example), we can know previous 15.5KB already exceeds
the threshold so that we can directly create new batch for next records.
This way should be able to make it more efficient. WDYT?
4. I think the improved queuing logic should be good enough. I can't get
the benefit of having `partition.availability.timeout.ms` config. In short,
you want to make the partitioner take the broker load into consideration.
We can just improve that in the queuing logic (and you already did it). Why
should we add the config? Could you use some examples to explain why we
need it.

Thank you.
Luke

On Wed, Feb 16, 2022 at 8:57 AM Artem Livshits
<al...@confluent.io.invalid> wrote:

> Hello,
>
> Please add your comments about the KIP.  If there are no considerations,
> I'll put it up for vote in the next few days.
>
> -Artem
>
> On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <al...@confluent.io>
> wrote:
>
> > Hello,
> >
> > After trying a few prototypes, I've made some changes to the public
> > interface.  Please see the updated document
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> > .
> >
> > -Artem
> >
> > On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <al...@confluent.io>
> > wrote:
> >
> >> Hello,
> >>
> >> This is the discussion thread for
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> >> .
> >>
> >> The proposal is a bug fix for
> >> https://issues.apache.org/jira/browse/KAFKA-10888, but it does include
> a
> >> client config change, therefore we have a KIP to discuss.
> >>
> >> -Artem
> >>
> >
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hello,

Please add your comments about the KIP.  If there are no considerations,
I'll put it up for vote in the next few days.

-Artem

On Mon, Feb 7, 2022 at 6:01 PM Artem Livshits <al...@confluent.io>
wrote:

> Hello,
>
> After trying a few prototypes, I've made some changes to the public
> interface.  Please see the updated document
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> .
>
> -Artem
>
> On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <al...@confluent.io>
> wrote:
>
>> Hello,
>>
>> This is the discussion thread for
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
>> .
>>
>> The proposal is a bug fix for
>> https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
>> client config change, therefore we have a KIP to discuss.
>>
>> -Artem
>>
>

Re: [DISCUSS] KIP-794: Strictly Uniform Sticky Partitioner

Posted by Artem Livshits <al...@confluent.io.INVALID>.
Hello,

After trying a few prototypes, I've made some changes to the public
interface.  Please see the updated document
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
.

-Artem

On Thu, Nov 4, 2021 at 10:37 AM Artem Livshits <al...@confluent.io>
wrote:

> Hello,
>
> This is the discussion thread for
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
> .
>
> The proposal is a bug fix for
> https://issues.apache.org/jira/browse/KAFKA-10888, but it does include a
> client config change, therefore we have a KIP to discuss.
>
> -Artem
>