You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pulsar.apache.org by Heesung Sohn <he...@streamnative.io.INVALID> on 2022/11/15 03:56:37 UTC

[Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Dear Pulsar Community,

We recently found a bug in `pulsar-admin topics partitioned-stats api` that
could incur a memory burst, high GC time, or OOM.

For this issue, I proposed a fix
<https://github.com/apache/pulsar/pull/18254> by deprecating the
aggregatePublisherStatsByProducerName
config and always aggregating the publishers' stats by publisherName,
instead of the list index(aggregatePublisherStatsByProducerName=false,
default).


   -  The index-based aggregation is inherently wrong in a highly
   concurrent producer environment(where the order and size of the publisher
   stat list are not guaranteed to be the same). The publisher stats need to
   be aggregated by a unique key, preferably the producer
   name(aggregatePublisherStatsByProducerName=true).


However, this fix will break some of the old client's compatibility since
the way Pulsar generates the producer name has changed over time, as
described here
<https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.

As I replied here
<https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>, although
it is not desirable, I think we could be lenient on this change in the stat
API response(assuming thispublishers'stat struct is used for human admins
only for ad-hoc checks).

Are we OK with this non-backward-compatible fix for some of the old
clients? Or, do you have any other suggestions?

One idea for a long-term fix could be:
When there are thousands of producers(consumers) for a (partitioned-)topic,
it is expensive to aggregate each publisher(subscriptions)'s stats
on-the-fly across the brokers. Alternatively, for the next major version, I
think we could further define producers(subscriptions)' API like the below
and drop the publishers and subscriptions structs from topics
(partitioned-)stats returns.

pulsar-admin publishers list my-topic --last-pagination-key xyz
pulsar-admin publishers stats my-producer

# similarly for subscriptions

Regards,
Heesung

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi dev,

If there are no particular concerns, I will send out a vote email.

Thanks,
Heesung

On Tue, Jan 3, 2023 at 10:28 AM Heesung Sohn <he...@streamnative.io>
wrote:

> Hi Pulsar dev,
>
> I raised a PIP for the default config change(change the
> aggregatePublisherStatsByProducerName config default to true) as
> requested.
> PIP: https://github.com/apache/pulsar/issues/19125
>
> I guess we can continue to use this discussion thread for this PIP as well.
>
> Thank you,
> Heesung
>
>
> On Fri, Dec 9, 2022 at 2:12 PM Heesung Sohn <he...@streamnative.io>
> wrote:
>
>> Hi,
>>
>> I raised a PR to the master branch here,
>> https://github.com/apache/pulsar/pull/18807.
>>
>> PTAL.
>>
>> Thank you,
>> Heesung
>>
>> On Wed, Nov 30, 2022 at 4:06 AM Enrico Olivelli <eo...@gmail.com>
>> wrote:
>>
>>> Heesung,
>>> I also agree that we must preserve compatibility, if we want to pick
>>> this change to released versions.
>>> For a new major version we COULD break compatibility only as a last
>>> resort.
>>>
>>> If we have a way to improve the code and use less resources, I believe
>>> that it is the best way.
>>>
>>> Enrico
>>>
>>> Il giorno mer 30 nov 2022 alle ore 11:56 Heesung Sohn
>>> <he...@streamnative.io.invalid> ha scritto:
>>> >
>>> > Hi,
>>> >
>>> > I raised this PR to my local fork to sync our direction.
>>> >
>>> > https://github.com/heesung-sn/pulsar/pull/15/files
>>> >
>>> > Please let me know if this direction looks good. Then, I will proceed
>>> to
>>> > raise an official PR.
>>> >
>>> >
>>> > Thanks,
>>> > Heesung
>>>
>>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi Pulsar dev,

I raised a PIP for the default config change(change the
aggregatePublisherStatsByProducerName config default to true) as requested.
PIP: https://github.com/apache/pulsar/issues/19125

I guess we can continue to use this discussion thread for this PIP as well.

Thank you,
Heesung


On Fri, Dec 9, 2022 at 2:12 PM Heesung Sohn <he...@streamnative.io>
wrote:

> Hi,
>
> I raised a PR to the master branch here,
> https://github.com/apache/pulsar/pull/18807.
>
> PTAL.
>
> Thank you,
> Heesung
>
> On Wed, Nov 30, 2022 at 4:06 AM Enrico Olivelli <eo...@gmail.com>
> wrote:
>
>> Heesung,
>> I also agree that we must preserve compatibility, if we want to pick
>> this change to released versions.
>> For a new major version we COULD break compatibility only as a last
>> resort.
>>
>> If we have a way to improve the code and use less resources, I believe
>> that it is the best way.
>>
>> Enrico
>>
>> Il giorno mer 30 nov 2022 alle ore 11:56 Heesung Sohn
>> <he...@streamnative.io.invalid> ha scritto:
>> >
>> > Hi,
>> >
>> > I raised this PR to my local fork to sync our direction.
>> >
>> > https://github.com/heesung-sn/pulsar/pull/15/files
>> >
>> > Please let me know if this direction looks good. Then, I will proceed to
>> > raise an official PR.
>> >
>> >
>> > Thanks,
>> > Heesung
>>
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi,

I raised a PR to the master branch here,
https://github.com/apache/pulsar/pull/18807.

PTAL.

Thank you,
Heesung

On Wed, Nov 30, 2022 at 4:06 AM Enrico Olivelli <eo...@gmail.com> wrote:

> Heesung,
> I also agree that we must preserve compatibility, if we want to pick
> this change to released versions.
> For a new major version we COULD break compatibility only as a last resort.
>
> If we have a way to improve the code and use less resources, I believe
> that it is the best way.
>
> Enrico
>
> Il giorno mer 30 nov 2022 alle ore 11:56 Heesung Sohn
> <he...@streamnative.io.invalid> ha scritto:
> >
> > Hi,
> >
> > I raised this PR to my local fork to sync our direction.
> >
> > https://github.com/heesung-sn/pulsar/pull/15/files
> >
> > Please let me know if this direction looks good. Then, I will proceed to
> > raise an official PR.
> >
> >
> > Thanks,
> > Heesung
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Enrico Olivelli <eo...@gmail.com>.
Heesung,
I also agree that we must preserve compatibility, if we want to pick
this change to released versions.
For a new major version we COULD break compatibility only as a last resort.

If we have a way to improve the code and use less resources, I believe
that it is the best way.

Enrico

Il giorno mer 30 nov 2022 alle ore 11:56 Heesung Sohn
<he...@streamnative.io.invalid> ha scritto:
>
> Hi,
>
> I raised this PR to my local fork to sync our direction.
>
> https://github.com/heesung-sn/pulsar/pull/15/files
>
> Please let me know if this direction looks good. Then, I will proceed to
> raise an official PR.
>
>
> Thanks,
> Heesung

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi,

I raised this PR to my local fork to sync our direction.

https://github.com/heesung-sn/pulsar/pull/15/files

Please let me know if this direction looks good. Then, I will proceed to
raise an official PR.


Thanks,
Heesung

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Nozomi Kurihara <nk...@apache.org>.
Hi Heesung,

> update the PR and only fix the bug, the nested for-loop producer stat obj
generation without changing any default configs.
+1
Moreover, I think you can change the default value of
aggregatePublisherStatsByProducerName to true since your final plan still
keeps the index-based aggregation.

>  adding the config will further complicate the system— this is anti to
the “Invent and Simplify” principle.
I agree with you, the simpler configuration, the better.

Best Regards,
Nozomi

2022年11月30日(水) 17:52 Heesung Sohn <he...@streamnative.io.invalid>:

> Hi,
>
> I dont think adding another config around this is a good idea. Pulsar
> already has so many configs, and to me adding the config will further
> complicate the system— this is anti to the “Invent and Simplify” principle.
>
> Based on what we discussed, Pulsar cannot give up the index-based
> aggregation due to the backward compatibility. Hence, I think we can update
> the PR and only fix the bug, the nested for-loop producer stat obj
> generation without changing any default configs. In this way, the memory
> burst issue can be minimized at least. Plus, we can update the doc to warn
> users.
>
> Please let me know what you think.
>
> Thanks,
> Heesung
>
> On Tue, Nov 29, 2022 at 10:02 PM Lin Zhao <lin.zhao@streamnative.io
> .invalid>
> wrote:
>
> > Hi Nozomi,
> >
> > Thank you for the explanation. It's clearer to me now. Maybe instead of
> > aggregatePublisherStatsByProducerName, name it
> allowStatsAggregationByIndex
> > to be more descriptive. Defaults to false which makes this fix take
> effect.
> >
> > On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nk...@apache.org>
> > wrote:
> >
> > > Hi Lin,
> > >
> > > > What if we change aggregatePublisherStatsByProducerName to default to
> > > true, and logs a warning if it's set to false?
> > >
> > > Currently "aggregatePublisherStatsByProducerName=true" means that stats
> > is
> > > aggregated by NOT ONLY producer name BUT index when clients don't
> support
> > > partial producer.
> > >
> > >
> >
> https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478
> > >
> > > Thus
> > > > change aggregatePublisherStatsByProducerName to default to true
> > > is not enough to avoid the performance issue since the index-based
> > > aggregation can still work.
> > >
> > > To address this issue, https://github.com/apache/pulsar/pull/18254
> > > completely removes the index-based aggregation logic.
> > > However this change will cause another problem, breaking the backward
> > > compatibility.
> > > Though I agree that the impact is less than the performance issue, I
> > think
> > > we should keep the existing behavior too.
> > >
> > > That's why I proposed "force" setting by which users have 3 options:
> > >
> > > (1) force=true
> > > * stats is aggregated by only producer name
> > > * for those who want to avoid the performance issue
> > >
> > > (2) force=false & aggregatePublisherStatsByProducerName=true
> > > * stats is aggregated by producer name for clients which supports
> partial
> > > producer, otherwise by index
> > > * for those who want to keep the backward-compatibility
> > >
> > > (3) force=false & aggregatePublisherStatsByProducerName=false
> > > * stats is aggregated by only index
> > > * (I'm not sure whether there are users who need this option)
> > >
> > > Best Regards,
> > > Nozomi
> > >
> > > 2022年11月29日(火) 8:55 Lin Zhao <li...@streamnative.io.invalid>:
> > >
> > > > In my opinion, the performance issue reported is to critical that it
> > > should
> > > > be considered a regression. The only option is to either revert the
> > > initial
> > > > change or fix it as soon as possible.
> > > >
> > > > The backward compatibility issue introduced to this fix, on the other
> > > hand,
> > > > doesn't impact the behavior of the broker other than the reported
> json
> > > from
> > > > the stats call. It's a much lesser evil than broker running out of
> > memory
> > > > with tens of millions of ProducerStatsImpl as I have seen.
> > > >
> > > > I would prefer deprecating aggregatePublisherStatsByProducerName, but
> > at
> > > > the same time I see the reason behind keeping the feature intact in a
> > > minor
> > > > version change. Nozimo's proposal of
> > > > forceAggregatePublisherStatsByProducerName is basically keeping*
> > > > aggregatePublisherStatsByProducerName=false
> > > > *an option right? What if we change
> > aggregatePublisherStatsByProducerName
> > > > to default to true, and logs a warning if it's set to false?
> > > >
> > > > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <
> nkurihar@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Heesung,
> > > > >
> > > > > Thank you for opening this discussion.
> > > > >
> > > > > IMO backward-compatibility should be kept at least across minor
> > > releases.
> > > > >
> > > > > Although the performance issue you mentioned (e.g. memory burst,
> high
> > > GC
> > > > > and OOM) looks problematic, backward-compatibility is also
> important.
> > > > > I think which has higher priority depends on the case.
> > > > >
> > > > > Your current change seems to remove the index-based aggregation
> > > > completely.
> > > > > However I think we should keep room for choice.
> > > > >
> > > > > In order to allow users (here Pulsar server-side admins in
> > particular)
> > > to
> > > > > choose the performance or backward-compatibility, how about
> > > introducing a
> > > > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> > > > >
> > > > > Those who place more importance on the performance than
> > > > > backward-compatibility can set this flag to true.
> > > > > Others, those who want to keep backward-compatibility, set this
> flag
> > to
> > > > > false.
> > > > >
> > > > > By the way, I'm not sure, is the producer name generation logic
> > already
> > > > > implemented in C++, Go and other clients?
> > > > > If not so, first we should implement it before switching the
> > > > > producer-name-based aggregation.
> > > > >
> > > > > Best Regards,
> > > > > Nozomi
> > > > >
> > > > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.sohn@streamnative.io
> > > .invalid>:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > To add more about the backward incompatibility issue
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > >,
> > > > > >
> > > > > > Before fix:
> > > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > > persistent://public/default/pt
> > > > > > ...
> > > > > >   "publishers" : [ {
> > > > > >     "msgRateIn" : 0.0,
> > > > > >     "msgThroughputIn" : 0.0,
> > > > > >     "averageMsgSize" : 0.0,
> > > > > >     "chunkedMessageRate" : 0.0,
> > > > > >     "producerId" : 0,
> > > > > >     "supportsPartialProducer" : false
> > > > > >   } ],
> > > > > >
> > > > > > After fix:
> > > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > > persistent://public/default/pt
> > > > > > ...
> > > > > >   "publishers" : [ {
> > > > > >     "msgRateIn" : 0.0,
> > > > > >     "msgThroughputIn" : 0.0,
> > > > > >     "averageMsgSize" : 0.0,
> > > > > >     "chunkedMessageRate" : 0.0,
> > > > > >     "producerId" : 0,
> > > > > >     "supportsPartialProducer" : true,
> > > > > >     "producerName" : "standalone-0-1"
> > > > > >   }, {
> > > > > >     "msgRateIn" : 0.0,
> > > > > >     "msgThroughputIn" : 0.0,
> > > > > >     "averageMsgSize" : 0.0,
> > > > > >     "chunkedMessageRate" : 0.0,
> > > > > >     "producerId" : 0,
> > > > > >     "supportsPartialProducer" : true,
> > > > > >     "producerName" : "standalone-0-0"
> > > > > >   } ],
> > > > > > ...
> > > > > >
> > > > > >
> > > > > > The broker side's producer name generation has been there since
> > this
> > > PR
> > > > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > > > > ProducerName was automatically generated in this format
> > > > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter}
> > > until
> > > > > > 2.10.
> > > > > > So, a producer to a partitioned topic(2) results in the two
> > producer
> > > > > names,
> > > > > > like the following.
> > > > > > standalone-0-0
> > > > > > standalone-0-1
> > > > > >
> > > > > > And since 2.10, by default, partitioned producers have the same
> > > > producer
> > > > > > name between partitions by this PR
> > > > > > <https://github.com/apache/pulsar/pull/10279>(generated on the
> > > client
> > > > > > side).
> > > > > > standalone-0-0
> > > > > >
> > > > > > Hence, the impacted versions(backward incompatibility issue
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > >)
> > > > > by
> > > > > > the proposed fix(Option 1 below) are < 2.10.
> > > > > >
> > > > > > When we aggregate stats between partitions, the default(
> > > > > > aggregatePublisherStatsByProducerName=false) aggregates the
> > producer
> > > > > stats
> > > > > > by the index of the producer stat list from each partition. So,
> > when
> > > > > lucky,
> > > > > > it could output a single producer stat. However, this method can
> be
> > > > > buggy,
> > > > > > as each partition could return a different size and index of the
> > > > > > producer stat list.
> > > > > >
> > > > > >
> > > > > > To fix the original issue(described here
> > > > > > <https://github.com/apache/pulsar/pull/18254>), I think we have
> > the
> > > > > > following options.
> > > > > >
> > > > > > Option 1(proposed): Deprecate
> aggregatePublisherStatsByProducerName
> > > > (the
> > > > > > current PR here <https://github.com/apache/pulsar/pull/18254>)
> in
> > > the
> > > > > next
> > > > > > release and live with behavior change where we get `topics
> > > > > > partitioned-stats` per-producer-and-partition from old
> clients(Ver.
> > > > > <2.10),
> > > > > > instead of stats per-producer.
> > > > > >
> > > > > > Option 2: Defer the Option 1 fix and push it to the next major
> > > > > > version(3.0.0), as this is a breaking change.
> > > > > >
> > > > > > Option 3: Keep aggregatePublisherStatsByProducerName config but
> > > change
> > > > > the
> > > > > > default to aggregatePublisherStatsByProducerName=true.
> > > > > >
> > > > > > Option 4: As a long-term fix, create separate Admin-APIs for
> > > publisher
> > > > > and
> > > > > > subscription stats and drop their stats from `topics
> > > partitioned-stats`
> > > > > as
> > > > > > it is expensive to aggregate them on the fly. (for thousands of
> > > > > publishers
> > > > > > and subscriptions). Push this change to the next major version.
> > > > > >
> > > > > > Or other suggestions?
> > > > > >
> > > > > > Regards,
> > > > > > Heesung
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > > > > heesung.sohn@streamnative.io
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Dear Pulsar Community,
> > > > > > >
> > > > > > > We recently found a bug in `pulsar-admin topics
> partitioned-stats
> > > > api`
> > > > > > that
> > > > > > > could incur a memory burst, high GC time, or OOM.
> > > > > > >
> > > > > > > For this issue, I proposed a fix
> > > > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating
> the
> > > > > > aggregatePublisherStatsByProducerName
> > > > > > > config and always aggregating the publishers' stats by
> > > publisherName,
> > > > > > > instead of the list
> > > > index(aggregatePublisherStatsByProducerName=false,
> > > > > > > default).
> > > > > > >
> > > > > > >
> > > > > > >    -  The index-based aggregation is inherently wrong in a
> highly
> > > > > > >    concurrent producer environment(where the order and size of
> > the
> > > > > > publisher
> > > > > > >    stat list are not guaranteed to be the same). The publisher
> > > stats
> > > > > > need to
> > > > > > >    be aggregated by a unique key, preferably the producer
> > > > > > >    name(aggregatePublisherStatsByProducerName=true).
> > > > > > >
> > > > > > >
> > > > > > > However, this fix will break some of the old client's
> > compatibility
> > > > > since
> > > > > > > the way Pulsar generates the producer name has changed over
> time,
> > > as
> > > > > > > described here
> > > > > > > <
> > > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > > >.
> > > > > > >
> > > > > > > As I replied here
> > > > > > > <
> > > https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> > > > >,
> > > > > > although
> > > > > > > it is not desirable, I think we could be lenient on this change
> > in
> > > > the
> > > > > > stat
> > > > > > > API response(assuming thispublishers'stat struct is used for
> > human
> > > > > admins
> > > > > > > only for ad-hoc checks).
> > > > > > >
> > > > > > > Are we OK with this non-backward-compatible fix for some of the
> > old
> > > > > > > clients? Or, do you have any other suggestions?
> > > > > > >
> > > > > > > One idea for a long-term fix could be:
> > > > > > > When there are thousands of producers(consumers) for a
> > > > > > > (partitioned-)topic, it is expensive to aggregate each
> > > > > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > > > > Alternatively,
> > > > > > > for the next major version, I think we could further define
> > > > > > > producers(subscriptions)' API like the below and drop the
> > > publishers
> > > > > and
> > > > > > > subscriptions structs from topics (partitioned-)stats returns.
> > > > > > >
> > > > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > > > > pulsar-admin publishers stats my-producer
> > > > > > >
> > > > > > > # similarly for subscriptions
> > > > > > >
> > > > > > > Regards,
> > > > > > > Heesung
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi,

I dont think adding another config around this is a good idea. Pulsar
already has so many configs, and to me adding the config will further
complicate the system— this is anti to the “Invent and Simplify” principle.

Based on what we discussed, Pulsar cannot give up the index-based
aggregation due to the backward compatibility. Hence, I think we can update
the PR and only fix the bug, the nested for-loop producer stat obj
generation without changing any default configs. In this way, the memory
burst issue can be minimized at least. Plus, we can update the doc to warn
users.

Please let me know what you think.

Thanks,
Heesung

On Tue, Nov 29, 2022 at 10:02 PM Lin Zhao <li...@streamnative.io.invalid>
wrote:

> Hi Nozomi,
>
> Thank you for the explanation. It's clearer to me now. Maybe instead of
> aggregatePublisherStatsByProducerName, name it allowStatsAggregationByIndex
> to be more descriptive. Defaults to false which makes this fix take effect.
>
> On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nk...@apache.org>
> wrote:
>
> > Hi Lin,
> >
> > > What if we change aggregatePublisherStatsByProducerName to default to
> > true, and logs a warning if it's set to false?
> >
> > Currently "aggregatePublisherStatsByProducerName=true" means that stats
> is
> > aggregated by NOT ONLY producer name BUT index when clients don't support
> > partial producer.
> >
> >
> https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478
> >
> > Thus
> > > change aggregatePublisherStatsByProducerName to default to true
> > is not enough to avoid the performance issue since the index-based
> > aggregation can still work.
> >
> > To address this issue, https://github.com/apache/pulsar/pull/18254
> > completely removes the index-based aggregation logic.
> > However this change will cause another problem, breaking the backward
> > compatibility.
> > Though I agree that the impact is less than the performance issue, I
> think
> > we should keep the existing behavior too.
> >
> > That's why I proposed "force" setting by which users have 3 options:
> >
> > (1) force=true
> > * stats is aggregated by only producer name
> > * for those who want to avoid the performance issue
> >
> > (2) force=false & aggregatePublisherStatsByProducerName=true
> > * stats is aggregated by producer name for clients which supports partial
> > producer, otherwise by index
> > * for those who want to keep the backward-compatibility
> >
> > (3) force=false & aggregatePublisherStatsByProducerName=false
> > * stats is aggregated by only index
> > * (I'm not sure whether there are users who need this option)
> >
> > Best Regards,
> > Nozomi
> >
> > 2022年11月29日(火) 8:55 Lin Zhao <li...@streamnative.io.invalid>:
> >
> > > In my opinion, the performance issue reported is to critical that it
> > should
> > > be considered a regression. The only option is to either revert the
> > initial
> > > change or fix it as soon as possible.
> > >
> > > The backward compatibility issue introduced to this fix, on the other
> > hand,
> > > doesn't impact the behavior of the broker other than the reported json
> > from
> > > the stats call. It's a much lesser evil than broker running out of
> memory
> > > with tens of millions of ProducerStatsImpl as I have seen.
> > >
> > > I would prefer deprecating aggregatePublisherStatsByProducerName, but
> at
> > > the same time I see the reason behind keeping the feature intact in a
> > minor
> > > version change. Nozimo's proposal of
> > > forceAggregatePublisherStatsByProducerName is basically keeping*
> > > aggregatePublisherStatsByProducerName=false
> > > *an option right? What if we change
> aggregatePublisherStatsByProducerName
> > > to default to true, and logs a warning if it's set to false?
> > >
> > > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nk...@apache.org>
> > > wrote:
> > >
> > > > Hi Heesung,
> > > >
> > > > Thank you for opening this discussion.
> > > >
> > > > IMO backward-compatibility should be kept at least across minor
> > releases.
> > > >
> > > > Although the performance issue you mentioned (e.g. memory burst, high
> > GC
> > > > and OOM) looks problematic, backward-compatibility is also important.
> > > > I think which has higher priority depends on the case.
> > > >
> > > > Your current change seems to remove the index-based aggregation
> > > completely.
> > > > However I think we should keep room for choice.
> > > >
> > > > In order to allow users (here Pulsar server-side admins in
> particular)
> > to
> > > > choose the performance or backward-compatibility, how about
> > introducing a
> > > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> > > >
> > > > Those who place more importance on the performance than
> > > > backward-compatibility can set this flag to true.
> > > > Others, those who want to keep backward-compatibility, set this flag
> to
> > > > false.
> > > >
> > > > By the way, I'm not sure, is the producer name generation logic
> already
> > > > implemented in C++, Go and other clients?
> > > > If not so, first we should implement it before switching the
> > > > producer-name-based aggregation.
> > > >
> > > > Best Regards,
> > > > Nozomi
> > > >
> > > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.sohn@streamnative.io
> > .invalid>:
> > > >
> > > > > Hi,
> > > > >
> > > > > To add more about the backward incompatibility issue
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > >,
> > > > >
> > > > > Before fix:
> > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > persistent://public/default/pt
> > > > > ...
> > > > >   "publishers" : [ {
> > > > >     "msgRateIn" : 0.0,
> > > > >     "msgThroughputIn" : 0.0,
> > > > >     "averageMsgSize" : 0.0,
> > > > >     "chunkedMessageRate" : 0.0,
> > > > >     "producerId" : 0,
> > > > >     "supportsPartialProducer" : false
> > > > >   } ],
> > > > >
> > > > > After fix:
> > > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > > persistent://public/default/pt
> > > > > ...
> > > > >   "publishers" : [ {
> > > > >     "msgRateIn" : 0.0,
> > > > >     "msgThroughputIn" : 0.0,
> > > > >     "averageMsgSize" : 0.0,
> > > > >     "chunkedMessageRate" : 0.0,
> > > > >     "producerId" : 0,
> > > > >     "supportsPartialProducer" : true,
> > > > >     "producerName" : "standalone-0-1"
> > > > >   }, {
> > > > >     "msgRateIn" : 0.0,
> > > > >     "msgThroughputIn" : 0.0,
> > > > >     "averageMsgSize" : 0.0,
> > > > >     "chunkedMessageRate" : 0.0,
> > > > >     "producerId" : 0,
> > > > >     "supportsPartialProducer" : true,
> > > > >     "producerName" : "standalone-0-0"
> > > > >   } ],
> > > > > ...
> > > > >
> > > > >
> > > > > The broker side's producer name generation has been there since
> this
> > PR
> > > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > > > ProducerName was automatically generated in this format
> > > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter}
> > until
> > > > > 2.10.
> > > > > So, a producer to a partitioned topic(2) results in the two
> producer
> > > > names,
> > > > > like the following.
> > > > > standalone-0-0
> > > > > standalone-0-1
> > > > >
> > > > > And since 2.10, by default, partitioned producers have the same
> > > producer
> > > > > name between partitions by this PR
> > > > > <https://github.com/apache/pulsar/pull/10279>(generated on the
> > client
> > > > > side).
> > > > > standalone-0-0
> > > > >
> > > > > Hence, the impacted versions(backward incompatibility issue
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > >)
> > > > by
> > > > > the proposed fix(Option 1 below) are < 2.10.
> > > > >
> > > > > When we aggregate stats between partitions, the default(
> > > > > aggregatePublisherStatsByProducerName=false) aggregates the
> producer
> > > > stats
> > > > > by the index of the producer stat list from each partition. So,
> when
> > > > lucky,
> > > > > it could output a single producer stat. However, this method can be
> > > > buggy,
> > > > > as each partition could return a different size and index of the
> > > > > producer stat list.
> > > > >
> > > > >
> > > > > To fix the original issue(described here
> > > > > <https://github.com/apache/pulsar/pull/18254>), I think we have
> the
> > > > > following options.
> > > > >
> > > > > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName
> > > (the
> > > > > current PR here <https://github.com/apache/pulsar/pull/18254>) in
> > the
> > > > next
> > > > > release and live with behavior change where we get `topics
> > > > > partitioned-stats` per-producer-and-partition from old clients(Ver.
> > > > <2.10),
> > > > > instead of stats per-producer.
> > > > >
> > > > > Option 2: Defer the Option 1 fix and push it to the next major
> > > > > version(3.0.0), as this is a breaking change.
> > > > >
> > > > > Option 3: Keep aggregatePublisherStatsByProducerName config but
> > change
> > > > the
> > > > > default to aggregatePublisherStatsByProducerName=true.
> > > > >
> > > > > Option 4: As a long-term fix, create separate Admin-APIs for
> > publisher
> > > > and
> > > > > subscription stats and drop their stats from `topics
> > partitioned-stats`
> > > > as
> > > > > it is expensive to aggregate them on the fly. (for thousands of
> > > > publishers
> > > > > and subscriptions). Push this change to the next major version.
> > > > >
> > > > > Or other suggestions?
> > > > >
> > > > > Regards,
> > > > > Heesung
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > > > heesung.sohn@streamnative.io
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Dear Pulsar Community,
> > > > > >
> > > > > > We recently found a bug in `pulsar-admin topics partitioned-stats
> > > api`
> > > > > that
> > > > > > could incur a memory burst, high GC time, or OOM.
> > > > > >
> > > > > > For this issue, I proposed a fix
> > > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > > > > aggregatePublisherStatsByProducerName
> > > > > > config and always aggregating the publishers' stats by
> > publisherName,
> > > > > > instead of the list
> > > index(aggregatePublisherStatsByProducerName=false,
> > > > > > default).
> > > > > >
> > > > > >
> > > > > >    -  The index-based aggregation is inherently wrong in a highly
> > > > > >    concurrent producer environment(where the order and size of
> the
> > > > > publisher
> > > > > >    stat list are not guaranteed to be the same). The publisher
> > stats
> > > > > need to
> > > > > >    be aggregated by a unique key, preferably the producer
> > > > > >    name(aggregatePublisherStatsByProducerName=true).
> > > > > >
> > > > > >
> > > > > > However, this fix will break some of the old client's
> compatibility
> > > > since
> > > > > > the way Pulsar generates the producer name has changed over time,
> > as
> > > > > > described here
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > > >.
> > > > > >
> > > > > > As I replied here
> > > > > > <
> > https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> > > >,
> > > > > although
> > > > > > it is not desirable, I think we could be lenient on this change
> in
> > > the
> > > > > stat
> > > > > > API response(assuming thispublishers'stat struct is used for
> human
> > > > admins
> > > > > > only for ad-hoc checks).
> > > > > >
> > > > > > Are we OK with this non-backward-compatible fix for some of the
> old
> > > > > > clients? Or, do you have any other suggestions?
> > > > > >
> > > > > > One idea for a long-term fix could be:
> > > > > > When there are thousands of producers(consumers) for a
> > > > > > (partitioned-)topic, it is expensive to aggregate each
> > > > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > > > Alternatively,
> > > > > > for the next major version, I think we could further define
> > > > > > producers(subscriptions)' API like the below and drop the
> > publishers
> > > > and
> > > > > > subscriptions structs from topics (partitioned-)stats returns.
> > > > > >
> > > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > > > pulsar-admin publishers stats my-producer
> > > > > >
> > > > > > # similarly for subscriptions
> > > > > >
> > > > > > Regards,
> > > > > > Heesung
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Lin Zhao <li...@streamnative.io.INVALID>.
Hi Nozomi,

Thank you for the explanation. It's clearer to me now. Maybe instead of
aggregatePublisherStatsByProducerName, name it allowStatsAggregationByIndex
to be more descriptive. Defaults to false which makes this fix take effect.

On Mon, Nov 28, 2022 at 9:36 PM Nozomi Kurihara <nk...@apache.org> wrote:

> Hi Lin,
>
> > What if we change aggregatePublisherStatsByProducerName to default to
> true, and logs a warning if it's set to false?
>
> Currently "aggregatePublisherStatsByProducerName=true" means that stats is
> aggregated by NOT ONLY producer name BUT index when clients don't support
> partial producer.
>
> https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478
>
> Thus
> > change aggregatePublisherStatsByProducerName to default to true
> is not enough to avoid the performance issue since the index-based
> aggregation can still work.
>
> To address this issue, https://github.com/apache/pulsar/pull/18254
> completely removes the index-based aggregation logic.
> However this change will cause another problem, breaking the backward
> compatibility.
> Though I agree that the impact is less than the performance issue, I think
> we should keep the existing behavior too.
>
> That's why I proposed "force" setting by which users have 3 options:
>
> (1) force=true
> * stats is aggregated by only producer name
> * for those who want to avoid the performance issue
>
> (2) force=false & aggregatePublisherStatsByProducerName=true
> * stats is aggregated by producer name for clients which supports partial
> producer, otherwise by index
> * for those who want to keep the backward-compatibility
>
> (3) force=false & aggregatePublisherStatsByProducerName=false
> * stats is aggregated by only index
> * (I'm not sure whether there are users who need this option)
>
> Best Regards,
> Nozomi
>
> 2022年11月29日(火) 8:55 Lin Zhao <li...@streamnative.io.invalid>:
>
> > In my opinion, the performance issue reported is to critical that it
> should
> > be considered a regression. The only option is to either revert the
> initial
> > change or fix it as soon as possible.
> >
> > The backward compatibility issue introduced to this fix, on the other
> hand,
> > doesn't impact the behavior of the broker other than the reported json
> from
> > the stats call. It's a much lesser evil than broker running out of memory
> > with tens of millions of ProducerStatsImpl as I have seen.
> >
> > I would prefer deprecating aggregatePublisherStatsByProducerName, but at
> > the same time I see the reason behind keeping the feature intact in a
> minor
> > version change. Nozimo's proposal of
> > forceAggregatePublisherStatsByProducerName is basically keeping*
> > aggregatePublisherStatsByProducerName=false
> > *an option right? What if we change aggregatePublisherStatsByProducerName
> > to default to true, and logs a warning if it's set to false?
> >
> > On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nk...@apache.org>
> > wrote:
> >
> > > Hi Heesung,
> > >
> > > Thank you for opening this discussion.
> > >
> > > IMO backward-compatibility should be kept at least across minor
> releases.
> > >
> > > Although the performance issue you mentioned (e.g. memory burst, high
> GC
> > > and OOM) looks problematic, backward-compatibility is also important.
> > > I think which has higher priority depends on the case.
> > >
> > > Your current change seems to remove the index-based aggregation
> > completely.
> > > However I think we should keep room for choice.
> > >
> > > In order to allow users (here Pulsar server-side admins in particular)
> to
> > > choose the performance or backward-compatibility, how about
> introducing a
> > > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> > >
> > > Those who place more importance on the performance than
> > > backward-compatibility can set this flag to true.
> > > Others, those who want to keep backward-compatibility, set this flag to
> > > false.
> > >
> > > By the way, I'm not sure, is the producer name generation logic already
> > > implemented in C++, Go and other clients?
> > > If not so, first we should implement it before switching the
> > > producer-name-based aggregation.
> > >
> > > Best Regards,
> > > Nozomi
> > >
> > > 2022年11月17日(木) 8:51 Heesung Sohn <heesung.sohn@streamnative.io
> .invalid>:
> > >
> > > > Hi,
> > > >
> > > > To add more about the backward incompatibility issue
> > > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> >,
> > > >
> > > > Before fix:
> > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > persistent://public/default/pt
> > > > ...
> > > >   "publishers" : [ {
> > > >     "msgRateIn" : 0.0,
> > > >     "msgThroughputIn" : 0.0,
> > > >     "averageMsgSize" : 0.0,
> > > >     "chunkedMessageRate" : 0.0,
> > > >     "producerId" : 0,
> > > >     "supportsPartialProducer" : false
> > > >   } ],
> > > >
> > > > After fix:
> > > > % ./bin/pulsar-admin topics partitioned-stats
> > > > persistent://public/default/pt
> > > > ...
> > > >   "publishers" : [ {
> > > >     "msgRateIn" : 0.0,
> > > >     "msgThroughputIn" : 0.0,
> > > >     "averageMsgSize" : 0.0,
> > > >     "chunkedMessageRate" : 0.0,
> > > >     "producerId" : 0,
> > > >     "supportsPartialProducer" : true,
> > > >     "producerName" : "standalone-0-1"
> > > >   }, {
> > > >     "msgRateIn" : 0.0,
> > > >     "msgThroughputIn" : 0.0,
> > > >     "averageMsgSize" : 0.0,
> > > >     "chunkedMessageRate" : 0.0,
> > > >     "producerId" : 0,
> > > >     "supportsPartialProducer" : true,
> > > >     "producerName" : "standalone-0-0"
> > > >   } ],
> > > > ...
> > > >
> > > >
> > > > The broker side's producer name generation has been there since this
> PR
> > > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > > ProducerName was automatically generated in this format
> > > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter}
> until
> > > > 2.10.
> > > > So, a producer to a partitioned topic(2) results in the two producer
> > > names,
> > > > like the following.
> > > > standalone-0-0
> > > > standalone-0-1
> > > >
> > > > And since 2.10, by default, partitioned producers have the same
> > producer
> > > > name between partitions by this PR
> > > > <https://github.com/apache/pulsar/pull/10279>(generated on the
> client
> > > > side).
> > > > standalone-0-0
> > > >
> > > > Hence, the impacted versions(backward incompatibility issue
> > > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> >)
> > > by
> > > > the proposed fix(Option 1 below) are < 2.10.
> > > >
> > > > When we aggregate stats between partitions, the default(
> > > > aggregatePublisherStatsByProducerName=false) aggregates the producer
> > > stats
> > > > by the index of the producer stat list from each partition. So, when
> > > lucky,
> > > > it could output a single producer stat. However, this method can be
> > > buggy,
> > > > as each partition could return a different size and index of the
> > > > producer stat list.
> > > >
> > > >
> > > > To fix the original issue(described here
> > > > <https://github.com/apache/pulsar/pull/18254>), I think we have the
> > > > following options.
> > > >
> > > > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName
> > (the
> > > > current PR here <https://github.com/apache/pulsar/pull/18254>) in
> the
> > > next
> > > > release and live with behavior change where we get `topics
> > > > partitioned-stats` per-producer-and-partition from old clients(Ver.
> > > <2.10),
> > > > instead of stats per-producer.
> > > >
> > > > Option 2: Defer the Option 1 fix and push it to the next major
> > > > version(3.0.0), as this is a breaking change.
> > > >
> > > > Option 3: Keep aggregatePublisherStatsByProducerName config but
> change
> > > the
> > > > default to aggregatePublisherStatsByProducerName=true.
> > > >
> > > > Option 4: As a long-term fix, create separate Admin-APIs for
> publisher
> > > and
> > > > subscription stats and drop their stats from `topics
> partitioned-stats`
> > > as
> > > > it is expensive to aggregate them on the fly. (for thousands of
> > > publishers
> > > > and subscriptions). Push this change to the next major version.
> > > >
> > > > Or other suggestions?
> > > >
> > > > Regards,
> > > > Heesung
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > > heesung.sohn@streamnative.io
> > > > >
> > > > wrote:
> > > >
> > > > > Dear Pulsar Community,
> > > > >
> > > > > We recently found a bug in `pulsar-admin topics partitioned-stats
> > api`
> > > > that
> > > > > could incur a memory burst, high GC time, or OOM.
> > > > >
> > > > > For this issue, I proposed a fix
> > > > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > > > aggregatePublisherStatsByProducerName
> > > > > config and always aggregating the publishers' stats by
> publisherName,
> > > > > instead of the list
> > index(aggregatePublisherStatsByProducerName=false,
> > > > > default).
> > > > >
> > > > >
> > > > >    -  The index-based aggregation is inherently wrong in a highly
> > > > >    concurrent producer environment(where the order and size of the
> > > > publisher
> > > > >    stat list are not guaranteed to be the same). The publisher
> stats
> > > > need to
> > > > >    be aggregated by a unique key, preferably the producer
> > > > >    name(aggregatePublisherStatsByProducerName=true).
> > > > >
> > > > >
> > > > > However, this fix will break some of the old client's compatibility
> > > since
> > > > > the way Pulsar generates the producer name has changed over time,
> as
> > > > > described here
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> > >.
> > > > >
> > > > > As I replied here
> > > > > <
> https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> > >,
> > > > although
> > > > > it is not desirable, I think we could be lenient on this change in
> > the
> > > > stat
> > > > > API response(assuming thispublishers'stat struct is used for human
> > > admins
> > > > > only for ad-hoc checks).
> > > > >
> > > > > Are we OK with this non-backward-compatible fix for some of the old
> > > > > clients? Or, do you have any other suggestions?
> > > > >
> > > > > One idea for a long-term fix could be:
> > > > > When there are thousands of producers(consumers) for a
> > > > > (partitioned-)topic, it is expensive to aggregate each
> > > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > > Alternatively,
> > > > > for the next major version, I think we could further define
> > > > > producers(subscriptions)' API like the below and drop the
> publishers
> > > and
> > > > > subscriptions structs from topics (partitioned-)stats returns.
> > > > >
> > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > > pulsar-admin publishers stats my-producer
> > > > >
> > > > > # similarly for subscriptions
> > > > >
> > > > > Regards,
> > > > > Heesung
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Nozomi Kurihara <nk...@apache.org>.
Hi Lin,

> What if we change aggregatePublisherStatsByProducerName to default to
true, and logs a warning if it's set to false?

Currently "aggregatePublisherStatsByProducerName=true" means that stats is
aggregated by NOT ONLY producer name BUT index when clients don't support
partial producer.
https://github.com/apache/pulsar/blob/6a719480b2afc84f5acb85fedf3accf8861eb97a/conf/broker.conf#L1476-L1478

Thus
> change aggregatePublisherStatsByProducerName to default to true
is not enough to avoid the performance issue since the index-based
aggregation can still work.

To address this issue, https://github.com/apache/pulsar/pull/18254
completely removes the index-based aggregation logic.
However this change will cause another problem, breaking the backward
compatibility.
Though I agree that the impact is less than the performance issue, I think
we should keep the existing behavior too.

That's why I proposed "force" setting by which users have 3 options:

(1) force=true
* stats is aggregated by only producer name
* for those who want to avoid the performance issue

(2) force=false & aggregatePublisherStatsByProducerName=true
* stats is aggregated by producer name for clients which supports partial
producer, otherwise by index
* for those who want to keep the backward-compatibility

(3) force=false & aggregatePublisherStatsByProducerName=false
* stats is aggregated by only index
* (I'm not sure whether there are users who need this option)

Best Regards,
Nozomi

2022年11月29日(火) 8:55 Lin Zhao <li...@streamnative.io.invalid>:

> In my opinion, the performance issue reported is to critical that it should
> be considered a regression. The only option is to either revert the initial
> change or fix it as soon as possible.
>
> The backward compatibility issue introduced to this fix, on the other hand,
> doesn't impact the behavior of the broker other than the reported json from
> the stats call. It's a much lesser evil than broker running out of memory
> with tens of millions of ProducerStatsImpl as I have seen.
>
> I would prefer deprecating aggregatePublisherStatsByProducerName, but at
> the same time I see the reason behind keeping the feature intact in a minor
> version change. Nozimo's proposal of
> forceAggregatePublisherStatsByProducerName is basically keeping*
> aggregatePublisherStatsByProducerName=false
> *an option right? What if we change aggregatePublisherStatsByProducerName
> to default to true, and logs a warning if it's set to false?
>
> On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nk...@apache.org>
> wrote:
>
> > Hi Heesung,
> >
> > Thank you for opening this discussion.
> >
> > IMO backward-compatibility should be kept at least across minor releases.
> >
> > Although the performance issue you mentioned (e.g. memory burst, high GC
> > and OOM) looks problematic, backward-compatibility is also important.
> > I think which has higher priority depends on the case.
> >
> > Your current change seems to remove the index-based aggregation
> completely.
> > However I think we should keep room for choice.
> >
> > In order to allow users (here Pulsar server-side admins in particular) to
> > choose the performance or backward-compatibility, how about introducing a
> > "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
> >
> > Those who place more importance on the performance than
> > backward-compatibility can set this flag to true.
> > Others, those who want to keep backward-compatibility, set this flag to
> > false.
> >
> > By the way, I'm not sure, is the producer name generation logic already
> > implemented in C++, Go and other clients?
> > If not so, first we should implement it before switching the
> > producer-name-based aggregation.
> >
> > Best Regards,
> > Nozomi
> >
> > 2022年11月17日(木) 8:51 Heesung Sohn <he...@streamnative.io.invalid>:
> >
> > > Hi,
> > >
> > > To add more about the backward incompatibility issue
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,
> > >
> > > Before fix:
> > > % ./bin/pulsar-admin topics partitioned-stats
> > > persistent://public/default/pt
> > > ...
> > >   "publishers" : [ {
> > >     "msgRateIn" : 0.0,
> > >     "msgThroughputIn" : 0.0,
> > >     "averageMsgSize" : 0.0,
> > >     "chunkedMessageRate" : 0.0,
> > >     "producerId" : 0,
> > >     "supportsPartialProducer" : false
> > >   } ],
> > >
> > > After fix:
> > > % ./bin/pulsar-admin topics partitioned-stats
> > > persistent://public/default/pt
> > > ...
> > >   "publishers" : [ {
> > >     "msgRateIn" : 0.0,
> > >     "msgThroughputIn" : 0.0,
> > >     "averageMsgSize" : 0.0,
> > >     "chunkedMessageRate" : 0.0,
> > >     "producerId" : 0,
> > >     "supportsPartialProducer" : true,
> > >     "producerName" : "standalone-0-1"
> > >   }, {
> > >     "msgRateIn" : 0.0,
> > >     "msgThroughputIn" : 0.0,
> > >     "averageMsgSize" : 0.0,
> > >     "chunkedMessageRate" : 0.0,
> > >     "producerId" : 0,
> > >     "supportsPartialProducer" : true,
> > >     "producerName" : "standalone-0-0"
> > >   } ],
> > > ...
> > >
> > >
> > > The broker side's producer name generation has been there since this PR
> > > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > > ProducerName was automatically generated in this format
> > > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until
> > > 2.10.
> > > So, a producer to a partitioned topic(2) results in the two producer
> > names,
> > > like the following.
> > > standalone-0-0
> > > standalone-0-1
> > >
> > > And since 2.10, by default, partitioned producers have the same
> producer
> > > name between partitions by this PR
> > > <https://github.com/apache/pulsar/pull/10279>(generated on the client
> > > side).
> > > standalone-0-0
> > >
> > > Hence, the impacted versions(backward incompatibility issue
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>)
> > by
> > > the proposed fix(Option 1 below) are < 2.10.
> > >
> > > When we aggregate stats between partitions, the default(
> > > aggregatePublisherStatsByProducerName=false) aggregates the producer
> > stats
> > > by the index of the producer stat list from each partition. So, when
> > lucky,
> > > it could output a single producer stat. However, this method can be
> > buggy,
> > > as each partition could return a different size and index of the
> > > producer stat list.
> > >
> > >
> > > To fix the original issue(described here
> > > <https://github.com/apache/pulsar/pull/18254>), I think we have the
> > > following options.
> > >
> > > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName
> (the
> > > current PR here <https://github.com/apache/pulsar/pull/18254>) in the
> > next
> > > release and live with behavior change where we get `topics
> > > partitioned-stats` per-producer-and-partition from old clients(Ver.
> > <2.10),
> > > instead of stats per-producer.
> > >
> > > Option 2: Defer the Option 1 fix and push it to the next major
> > > version(3.0.0), as this is a breaking change.
> > >
> > > Option 3: Keep aggregatePublisherStatsByProducerName config but change
> > the
> > > default to aggregatePublisherStatsByProducerName=true.
> > >
> > > Option 4: As a long-term fix, create separate Admin-APIs for publisher
> > and
> > > subscription stats and drop their stats from `topics partitioned-stats`
> > as
> > > it is expensive to aggregate them on the fly. (for thousands of
> > publishers
> > > and subscriptions). Push this change to the next major version.
> > >
> > > Or other suggestions?
> > >
> > > Regards,
> > > Heesung
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> > heesung.sohn@streamnative.io
> > > >
> > > wrote:
> > >
> > > > Dear Pulsar Community,
> > > >
> > > > We recently found a bug in `pulsar-admin topics partitioned-stats
> api`
> > > that
> > > > could incur a memory burst, high GC time, or OOM.
> > > >
> > > > For this issue, I proposed a fix
> > > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > > aggregatePublisherStatsByProducerName
> > > > config and always aggregating the publishers' stats by publisherName,
> > > > instead of the list
> index(aggregatePublisherStatsByProducerName=false,
> > > > default).
> > > >
> > > >
> > > >    -  The index-based aggregation is inherently wrong in a highly
> > > >    concurrent producer environment(where the order and size of the
> > > publisher
> > > >    stat list are not guaranteed to be the same). The publisher stats
> > > need to
> > > >    be aggregated by a unique key, preferably the producer
> > > >    name(aggregatePublisherStatsByProducerName=true).
> > > >
> > > >
> > > > However, this fix will break some of the old client's compatibility
> > since
> > > > the way Pulsar generates the producer name has changed over time, as
> > > > described here
> > > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182
> >.
> > > >
> > > > As I replied here
> > > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363
> >,
> > > although
> > > > it is not desirable, I think we could be lenient on this change in
> the
> > > stat
> > > > API response(assuming thispublishers'stat struct is used for human
> > admins
> > > > only for ad-hoc checks).
> > > >
> > > > Are we OK with this non-backward-compatible fix for some of the old
> > > > clients? Or, do you have any other suggestions?
> > > >
> > > > One idea for a long-term fix could be:
> > > > When there are thousands of producers(consumers) for a
> > > > (partitioned-)topic, it is expensive to aggregate each
> > > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > > Alternatively,
> > > > for the next major version, I think we could further define
> > > > producers(subscriptions)' API like the below and drop the publishers
> > and
> > > > subscriptions structs from topics (partitioned-)stats returns.
> > > >
> > > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > > pulsar-admin publishers stats my-producer
> > > >
> > > > # similarly for subscriptions
> > > >
> > > > Regards,
> > > > Heesung
> > > >
> > >
> >
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Lin Zhao <li...@streamnative.io.INVALID>.
In my opinion, the performance issue reported is to critical that it should
be considered a regression. The only option is to either revert the initial
change or fix it as soon as possible.

The backward compatibility issue introduced to this fix, on the other hand,
doesn't impact the behavior of the broker other than the reported json from
the stats call. It's a much lesser evil than broker running out of memory
with tens of millions of ProducerStatsImpl as I have seen.

I would prefer deprecating aggregatePublisherStatsByProducerName, but at
the same time I see the reason behind keeping the feature intact in a minor
version change. Nozimo's proposal of
forceAggregatePublisherStatsByProducerName is basically keeping*
aggregatePublisherStatsByProducerName=false
*an option right? What if we change aggregatePublisherStatsByProducerName
to default to true, and logs a warning if it's set to false?

On Thu, Nov 24, 2022 at 10:34 PM Nozomi Kurihara <nk...@apache.org>
wrote:

> Hi Heesung,
>
> Thank you for opening this discussion.
>
> IMO backward-compatibility should be kept at least across minor releases.
>
> Although the performance issue you mentioned (e.g. memory burst, high GC
> and OOM) looks problematic, backward-compatibility is also important.
> I think which has higher priority depends on the case.
>
> Your current change seems to remove the index-based aggregation completely.
> However I think we should keep room for choice.
>
> In order to allow users (here Pulsar server-side admins in particular) to
> choose the performance or backward-compatibility, how about introducing a
> "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
>
> Those who place more importance on the performance than
> backward-compatibility can set this flag to true.
> Others, those who want to keep backward-compatibility, set this flag to
> false.
>
> By the way, I'm not sure, is the producer name generation logic already
> implemented in C++, Go and other clients?
> If not so, first we should implement it before switching the
> producer-name-based aggregation.
>
> Best Regards,
> Nozomi
>
> 2022年11月17日(木) 8:51 Heesung Sohn <he...@streamnative.io.invalid>:
>
> > Hi,
> >
> > To add more about the backward incompatibility issue
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,
> >
> > Before fix:
> > % ./bin/pulsar-admin topics partitioned-stats
> > persistent://public/default/pt
> > ...
> >   "publishers" : [ {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : false
> >   } ],
> >
> > After fix:
> > % ./bin/pulsar-admin topics partitioned-stats
> > persistent://public/default/pt
> > ...
> >   "publishers" : [ {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : true,
> >     "producerName" : "standalone-0-1"
> >   }, {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : true,
> >     "producerName" : "standalone-0-0"
> >   } ],
> > ...
> >
> >
> > The broker side's producer name generation has been there since this PR
> > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > ProducerName was automatically generated in this format
> > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until
> > 2.10.
> > So, a producer to a partitioned topic(2) results in the two producer
> names,
> > like the following.
> > standalone-0-0
> > standalone-0-1
> >
> > And since 2.10, by default, partitioned producers have the same producer
> > name between partitions by this PR
> > <https://github.com/apache/pulsar/pull/10279>(generated on the client
> > side).
> > standalone-0-0
> >
> > Hence, the impacted versions(backward incompatibility issue
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>)
> by
> > the proposed fix(Option 1 below) are < 2.10.
> >
> > When we aggregate stats between partitions, the default(
> > aggregatePublisherStatsByProducerName=false) aggregates the producer
> stats
> > by the index of the producer stat list from each partition. So, when
> lucky,
> > it could output a single producer stat. However, this method can be
> buggy,
> > as each partition could return a different size and index of the
> > producer stat list.
> >
> >
> > To fix the original issue(described here
> > <https://github.com/apache/pulsar/pull/18254>), I think we have the
> > following options.
> >
> > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the
> > current PR here <https://github.com/apache/pulsar/pull/18254>) in the
> next
> > release and live with behavior change where we get `topics
> > partitioned-stats` per-producer-and-partition from old clients(Ver.
> <2.10),
> > instead of stats per-producer.
> >
> > Option 2: Defer the Option 1 fix and push it to the next major
> > version(3.0.0), as this is a breaking change.
> >
> > Option 3: Keep aggregatePublisherStatsByProducerName config but change
> the
> > default to aggregatePublisherStatsByProducerName=true.
> >
> > Option 4: As a long-term fix, create separate Admin-APIs for publisher
> and
> > subscription stats and drop their stats from `topics partitioned-stats`
> as
> > it is expensive to aggregate them on the fly. (for thousands of
> publishers
> > and subscriptions). Push this change to the next major version.
> >
> > Or other suggestions?
> >
> > Regards,
> > Heesung
> >
> >
> >
> >
> >
> >
> > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <
> heesung.sohn@streamnative.io
> > >
> > wrote:
> >
> > > Dear Pulsar Community,
> > >
> > > We recently found a bug in `pulsar-admin topics partitioned-stats api`
> > that
> > > could incur a memory burst, high GC time, or OOM.
> > >
> > > For this issue, I proposed a fix
> > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > aggregatePublisherStatsByProducerName
> > > config and always aggregating the publishers' stats by publisherName,
> > > instead of the list index(aggregatePublisherStatsByProducerName=false,
> > > default).
> > >
> > >
> > >    -  The index-based aggregation is inherently wrong in a highly
> > >    concurrent producer environment(where the order and size of the
> > publisher
> > >    stat list are not guaranteed to be the same). The publisher stats
> > need to
> > >    be aggregated by a unique key, preferably the producer
> > >    name(aggregatePublisherStatsByProducerName=true).
> > >
> > >
> > > However, this fix will break some of the old client's compatibility
> since
> > > the way Pulsar generates the producer name has changed over time, as
> > > described here
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.
> > >
> > > As I replied here
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>,
> > although
> > > it is not desirable, I think we could be lenient on this change in the
> > stat
> > > API response(assuming thispublishers'stat struct is used for human
> admins
> > > only for ad-hoc checks).
> > >
> > > Are we OK with this non-backward-compatible fix for some of the old
> > > clients? Or, do you have any other suggestions?
> > >
> > > One idea for a long-term fix could be:
> > > When there are thousands of producers(consumers) for a
> > > (partitioned-)topic, it is expensive to aggregate each
> > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > Alternatively,
> > > for the next major version, I think we could further define
> > > producers(subscriptions)' API like the below and drop the publishers
> and
> > > subscriptions structs from topics (partitioned-)stats returns.
> > >
> > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > pulsar-admin publishers stats my-producer
> > >
> > > # similarly for subscriptions
> > >
> > > Regards,
> > > Heesung
> > >
> >
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
Hi Nozomi,

I didn't look into the proposal at the moment. But I noticed the
producer name generation logic you mentioned here. At least in C++
clients, the producer name can only be set from the CommandProducer
response. i.e. PIP-79 is not catched up in C++ clients.

Thanks,
Yunze

On Fri, Nov 25, 2022 at 2:34 PM Nozomi Kurihara <nk...@apache.org> wrote:
>
> Hi Heesung,
>
> Thank you for opening this discussion.
>
> IMO backward-compatibility should be kept at least across minor releases.
>
> Although the performance issue you mentioned (e.g. memory burst, high GC
> and OOM) looks problematic, backward-compatibility is also important.
> I think which has higher priority depends on the case.
>
> Your current change seems to remove the index-based aggregation completely.
> However I think we should keep room for choice.
>
> In order to allow users (here Pulsar server-side admins in particular) to
> choose the performance or backward-compatibility, how about introducing a
> "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?
>
> Those who place more importance on the performance than
> backward-compatibility can set this flag to true.
> Others, those who want to keep backward-compatibility, set this flag to
> false.
>
> By the way, I'm not sure, is the producer name generation logic already
> implemented in C++, Go and other clients?
> If not so, first we should implement it before switching the
> producer-name-based aggregation.
>
> Best Regards,
> Nozomi
>
> 2022年11月17日(木) 8:51 Heesung Sohn <he...@streamnative.io.invalid>:
>
> > Hi,
> >
> > To add more about the backward incompatibility issue
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,
> >
> > Before fix:
> > % ./bin/pulsar-admin topics partitioned-stats
> > persistent://public/default/pt
> > ...
> >   "publishers" : [ {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : false
> >   } ],
> >
> > After fix:
> > % ./bin/pulsar-admin topics partitioned-stats
> > persistent://public/default/pt
> > ...
> >   "publishers" : [ {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : true,
> >     "producerName" : "standalone-0-1"
> >   }, {
> >     "msgRateIn" : 0.0,
> >     "msgThroughputIn" : 0.0,
> >     "averageMsgSize" : 0.0,
> >     "chunkedMessageRate" : 0.0,
> >     "producerId" : 0,
> >     "supportsPartialProducer" : true,
> >     "producerName" : "standalone-0-0"
> >   } ],
> > ...
> >
> >
> > The broker side's producer name generation has been there since this PR
> > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> > ProducerName was automatically generated in this format
> > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until
> > 2.10.
> > So, a producer to a partitioned topic(2) results in the two producer names,
> > like the following.
> > standalone-0-0
> > standalone-0-1
> >
> > And since 2.10, by default, partitioned producers have the same producer
> > name between partitions by this PR
> > <https://github.com/apache/pulsar/pull/10279>(generated on the client
> > side).
> > standalone-0-0
> >
> > Hence, the impacted versions(backward incompatibility issue
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>) by
> > the proposed fix(Option 1 below) are < 2.10.
> >
> > When we aggregate stats between partitions, the default(
> > aggregatePublisherStatsByProducerName=false) aggregates the producer stats
> > by the index of the producer stat list from each partition. So, when lucky,
> > it could output a single producer stat. However, this method can be buggy,
> > as each partition could return a different size and index of the
> > producer stat list.
> >
> >
> > To fix the original issue(described here
> > <https://github.com/apache/pulsar/pull/18254>), I think we have the
> > following options.
> >
> > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the
> > current PR here <https://github.com/apache/pulsar/pull/18254>) in the next
> > release and live with behavior change where we get `topics
> > partitioned-stats` per-producer-and-partition from old clients(Ver. <2.10),
> > instead of stats per-producer.
> >
> > Option 2: Defer the Option 1 fix and push it to the next major
> > version(3.0.0), as this is a breaking change.
> >
> > Option 3: Keep aggregatePublisherStatsByProducerName config but change the
> > default to aggregatePublisherStatsByProducerName=true.
> >
> > Option 4: As a long-term fix, create separate Admin-APIs for publisher and
> > subscription stats and drop their stats from `topics partitioned-stats` as
> > it is expensive to aggregate them on the fly. (for thousands of publishers
> > and subscriptions). Push this change to the next major version.
> >
> > Or other suggestions?
> >
> > Regards,
> > Heesung
> >
> >
> >
> >
> >
> >
> > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <heesung.sohn@streamnative.io
> > >
> > wrote:
> >
> > > Dear Pulsar Community,
> > >
> > > We recently found a bug in `pulsar-admin topics partitioned-stats api`
> > that
> > > could incur a memory burst, high GC time, or OOM.
> > >
> > > For this issue, I proposed a fix
> > > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> > aggregatePublisherStatsByProducerName
> > > config and always aggregating the publishers' stats by publisherName,
> > > instead of the list index(aggregatePublisherStatsByProducerName=false,
> > > default).
> > >
> > >
> > >    -  The index-based aggregation is inherently wrong in a highly
> > >    concurrent producer environment(where the order and size of the
> > publisher
> > >    stat list are not guaranteed to be the same). The publisher stats
> > need to
> > >    be aggregated by a unique key, preferably the producer
> > >    name(aggregatePublisherStatsByProducerName=true).
> > >
> > >
> > > However, this fix will break some of the old client's compatibility since
> > > the way Pulsar generates the producer name has changed over time, as
> > > described here
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.
> > >
> > > As I replied here
> > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>,
> > although
> > > it is not desirable, I think we could be lenient on this change in the
> > stat
> > > API response(assuming thispublishers'stat struct is used for human admins
> > > only for ad-hoc checks).
> > >
> > > Are we OK with this non-backward-compatible fix for some of the old
> > > clients? Or, do you have any other suggestions?
> > >
> > > One idea for a long-term fix could be:
> > > When there are thousands of producers(consumers) for a
> > > (partitioned-)topic, it is expensive to aggregate each
> > > publisher(subscriptions)'s stats on-the-fly across the brokers.
> > Alternatively,
> > > for the next major version, I think we could further define
> > > producers(subscriptions)' API like the below and drop the publishers and
> > > subscriptions structs from topics (partitioned-)stats returns.
> > >
> > > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > > pulsar-admin publishers stats my-producer
> > >
> > > # similarly for subscriptions
> > >
> > > Regards,
> > > Heesung
> > >
> >

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Nozomi Kurihara <nk...@apache.org>.
Hi Heesung,

Thank you for opening this discussion.

IMO backward-compatibility should be kept at least across minor releases.

Although the performance issue you mentioned (e.g. memory burst, high GC
and OOM) looks problematic, backward-compatibility is also important.
I think which has higher priority depends on the case.

Your current change seems to remove the index-based aggregation completely.
However I think we should keep room for choice.

In order to allow users (here Pulsar server-side admins in particular) to
choose the performance or backward-compatibility, how about introducing a
"force" setting, e.g. "forceAggregatePublisherStatsByProducerName"?

Those who place more importance on the performance than
backward-compatibility can set this flag to true.
Others, those who want to keep backward-compatibility, set this flag to
false.

By the way, I'm not sure, is the producer name generation logic already
implemented in C++, Go and other clients?
If not so, first we should implement it before switching the
producer-name-based aggregation.

Best Regards,
Nozomi

2022年11月17日(木) 8:51 Heesung Sohn <he...@streamnative.io.invalid>:

> Hi,
>
> To add more about the backward incompatibility issue
> <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,
>
> Before fix:
> % ./bin/pulsar-admin topics partitioned-stats
> persistent://public/default/pt
> ...
>   "publishers" : [ {
>     "msgRateIn" : 0.0,
>     "msgThroughputIn" : 0.0,
>     "averageMsgSize" : 0.0,
>     "chunkedMessageRate" : 0.0,
>     "producerId" : 0,
>     "supportsPartialProducer" : false
>   } ],
>
> After fix:
> % ./bin/pulsar-admin topics partitioned-stats
> persistent://public/default/pt
> ...
>   "publishers" : [ {
>     "msgRateIn" : 0.0,
>     "msgThroughputIn" : 0.0,
>     "averageMsgSize" : 0.0,
>     "chunkedMessageRate" : 0.0,
>     "producerId" : 0,
>     "supportsPartialProducer" : true,
>     "producerName" : "standalone-0-1"
>   }, {
>     "msgRateIn" : 0.0,
>     "msgThroughputIn" : 0.0,
>     "averageMsgSize" : 0.0,
>     "chunkedMessageRate" : 0.0,
>     "producerId" : 0,
>     "supportsPartialProducer" : true,
>     "producerName" : "standalone-0-0"
>   } ],
> ...
>
>
> The broker side's producer name generation has been there since this PR
> <https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
> ProducerName was automatically generated in this format
> {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until
> 2.10.
> So, a producer to a partitioned topic(2) results in the two producer names,
> like the following.
> standalone-0-0
> standalone-0-1
>
> And since 2.10, by default, partitioned producers have the same producer
> name between partitions by this PR
> <https://github.com/apache/pulsar/pull/10279>(generated on the client
> side).
> standalone-0-0
>
> Hence, the impacted versions(backward incompatibility issue
> <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>) by
> the proposed fix(Option 1 below) are < 2.10.
>
> When we aggregate stats between partitions, the default(
> aggregatePublisherStatsByProducerName=false) aggregates the producer stats
> by the index of the producer stat list from each partition. So, when lucky,
> it could output a single producer stat. However, this method can be buggy,
> as each partition could return a different size and index of the
> producer stat list.
>
>
> To fix the original issue(described here
> <https://github.com/apache/pulsar/pull/18254>), I think we have the
> following options.
>
> Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the
> current PR here <https://github.com/apache/pulsar/pull/18254>) in the next
> release and live with behavior change where we get `topics
> partitioned-stats` per-producer-and-partition from old clients(Ver. <2.10),
> instead of stats per-producer.
>
> Option 2: Defer the Option 1 fix and push it to the next major
> version(3.0.0), as this is a breaking change.
>
> Option 3: Keep aggregatePublisherStatsByProducerName config but change the
> default to aggregatePublisherStatsByProducerName=true.
>
> Option 4: As a long-term fix, create separate Admin-APIs for publisher and
> subscription stats and drop their stats from `topics partitioned-stats` as
> it is expensive to aggregate them on the fly. (for thousands of publishers
> and subscriptions). Push this change to the next major version.
>
> Or other suggestions?
>
> Regards,
> Heesung
>
>
>
>
>
>
> On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <heesung.sohn@streamnative.io
> >
> wrote:
>
> > Dear Pulsar Community,
> >
> > We recently found a bug in `pulsar-admin topics partitioned-stats api`
> that
> > could incur a memory burst, high GC time, or OOM.
> >
> > For this issue, I proposed a fix
> > <https://github.com/apache/pulsar/pull/18254> by deprecating the
> aggregatePublisherStatsByProducerName
> > config and always aggregating the publishers' stats by publisherName,
> > instead of the list index(aggregatePublisherStatsByProducerName=false,
> > default).
> >
> >
> >    -  The index-based aggregation is inherently wrong in a highly
> >    concurrent producer environment(where the order and size of the
> publisher
> >    stat list are not guaranteed to be the same). The publisher stats
> need to
> >    be aggregated by a unique key, preferably the producer
> >    name(aggregatePublisherStatsByProducerName=true).
> >
> >
> > However, this fix will break some of the old client's compatibility since
> > the way Pulsar generates the producer name has changed over time, as
> > described here
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.
> >
> > As I replied here
> > <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>,
> although
> > it is not desirable, I think we could be lenient on this change in the
> stat
> > API response(assuming thispublishers'stat struct is used for human admins
> > only for ad-hoc checks).
> >
> > Are we OK with this non-backward-compatible fix for some of the old
> > clients? Or, do you have any other suggestions?
> >
> > One idea for a long-term fix could be:
> > When there are thousands of producers(consumers) for a
> > (partitioned-)topic, it is expensive to aggregate each
> > publisher(subscriptions)'s stats on-the-fly across the brokers.
> Alternatively,
> > for the next major version, I think we could further define
> > producers(subscriptions)' API like the below and drop the publishers and
> > subscriptions structs from topics (partitioned-)stats returns.
> >
> > pulsar-admin publishers list my-topic --last-pagination-key xyz
> > pulsar-admin publishers stats my-producer
> >
> > # similarly for subscriptions
> >
> > Regards,
> > Heesung
> >
>

Re: [Discuss] Deprecate Index-based Publisher Stat Aggregation in Topics Partitioned-Stats

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi,

To add more about the backward incompatibility issue
<https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>,

Before fix:
% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : false
  } ],

After fix:
% ./bin/pulsar-admin topics partitioned-stats persistent://public/default/pt
...
  "publishers" : [ {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-1"
  }, {
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : true,
    "producerName" : "standalone-0-0"
  } ],
...


The broker side's producer name generation has been there since this PR
<https://github.com/apache/pulsar/pull/1178>(1.22-incubating).
ProducerName was automatically generated in this format
{clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until 2.10.
So, a producer to a partitioned topic(2) results in the two producer names,
like the following.
standalone-0-0
standalone-0-1

And since 2.10, by default, partitioned producers have the same producer
name between partitions by this PR
<https://github.com/apache/pulsar/pull/10279>(generated on the client side).
standalone-0-0

Hence, the impacted versions(backward incompatibility issue
<https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>) by
the proposed fix(Option 1 below) are < 2.10.

When we aggregate stats between partitions, the default(
aggregatePublisherStatsByProducerName=false) aggregates the producer stats
by the index of the producer stat list from each partition. So, when lucky,
it could output a single producer stat. However, this method can be buggy,
as each partition could return a different size and index of the
producer stat list.


To fix the original issue(described here
<https://github.com/apache/pulsar/pull/18254>), I think we have the
following options.

Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the
current PR here <https://github.com/apache/pulsar/pull/18254>) in the next
release and live with behavior change where we get `topics
partitioned-stats` per-producer-and-partition from old clients(Ver. <2.10),
instead of stats per-producer.

Option 2: Defer the Option 1 fix and push it to the next major
version(3.0.0), as this is a breaking change.

Option 3: Keep aggregatePublisherStatsByProducerName config but change the
default to aggregatePublisherStatsByProducerName=true.

Option 4: As a long-term fix, create separate Admin-APIs for publisher and
subscription stats and drop their stats from `topics partitioned-stats` as
it is expensive to aggregate them on the fly. (for thousands of publishers
and subscriptions). Push this change to the next major version.

Or other suggestions?

Regards,
Heesung






On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <he...@streamnative.io>
wrote:

> Dear Pulsar Community,
>
> We recently found a bug in `pulsar-admin topics partitioned-stats api` that
> could incur a memory burst, high GC time, or OOM.
>
> For this issue, I proposed a fix
> <https://github.com/apache/pulsar/pull/18254> by deprecating the aggregatePublisherStatsByProducerName
> config and always aggregating the publishers' stats by publisherName,
> instead of the list index(aggregatePublisherStatsByProducerName=false,
> default).
>
>
>    -  The index-based aggregation is inherently wrong in a highly
>    concurrent producer environment(where the order and size of the publisher
>    stat list are not guaranteed to be the same). The publisher stats need to
>    be aggregated by a unique key, preferably the producer
>    name(aggregatePublisherStatsByProducerName=true).
>
>
> However, this fix will break some of the old client's compatibility since
> the way Pulsar generates the producer name has changed over time, as
> described here
> <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>.
>
> As I replied here
> <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>, although
> it is not desirable, I think we could be lenient on this change in the stat
> API response(assuming thispublishers'stat struct is used for human admins
> only for ad-hoc checks).
>
> Are we OK with this non-backward-compatible fix for some of the old
> clients? Or, do you have any other suggestions?
>
> One idea for a long-term fix could be:
> When there are thousands of producers(consumers) for a
> (partitioned-)topic, it is expensive to aggregate each
> publisher(subscriptions)'s stats on-the-fly across the brokers. Alternatively,
> for the next major version, I think we could further define
> producers(subscriptions)' API like the below and drop the publishers and
> subscriptions structs from topics (partitioned-)stats returns.
>
> pulsar-admin publishers list my-topic --last-pagination-key xyz
> pulsar-admin publishers stats my-producer
>
> # similarly for subscriptions
>
> Regards,
> Heesung
>