Posted to dev@zookeeper.apache.org by Li Wang <li...@gmail.com> on 2021/04/06 19:33:44 UTC

Performance impact of enabling Prometheus Metrics

Hi,

I would like to reach out to the community to see if anyone has some
insights or experience with the performance impact of enabling Prometheus
metrics.

I have done load comparison tests with Prometheus enabled vs disabled and
found that performance is reduced by about 40%-60% for both read and write
operations (i.e. getData, getChildren and createNode).

The load test was done with ZooKeeper 3.7 and a cluster of 5 participants
and 5 observers; each ZK server had a 10 GB heap and 4 CPUs, with 500
concurrent users sending requests.

The performance impact is quite significant.  I wonder if this is expected
and what we can do to keep ZK performing the same while leveraging the new
Prometheus metrics feature.

Best,

Li

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Thanks for the input. We can discuss and look more into different ways of
changing ZooKeeper code to reduce the impact. It would be even better if
the issue could be addressed by some tuning knobs or customizations in the
Prometheus client library, as Enrico mentioned.

Li


On Mon, Apr 26, 2021 at 9:35 PM Ted Dunning <te...@gmail.com> wrote:

> Batching metrics reporting is very similar to option (c) but with locking
> like option (a). That can usually be made faster by passing a reference to
> the metrics accumulator to the reporting thread which can do the batch
> update without locks. Usually requires ping-pong metrics accumulators so
> that a thread can accumulate in one accumulator for a bit, pass that to the
> merge thread and switch to using the alternate accumulator. Since all
> threads typically report at the same time, this avoids a stampede on the
> global accumulator.
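To make the ping-pong idea concrete, here is a minimal, illustrative Java
sketch of one worker's accumulator pair (none of these names exist in
ZooKeeper, and it only tracks a count and a sum rather than full quantiles):
the worker records into its active accumulator lock-free, and a single
reporter thread swaps and drains the retired one at the end of each period.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.LongAdder;

    // Illustrative only: each worker thread owns one of these. The worker
    // writes to the "active" accumulator without taking a lock; the reporter
    // thread periodically swaps the pair and reads the retired accumulator.
    final class PingPongAccumulator {
        static final class Acc {
            final LongAdder count = new LongAdder();
            final LongAdder sum = new LongAdder();
        }

        private final AtomicReference<Acc> active = new AtomicReference<>(new Acc());
        private Acc spare = new Acc();

        // Hot path, called only by the owning worker thread.
        void record(long value) {
            Acc a = active.get();
            a.count.increment();
            a.sum.add(value);
        }

        // Called only by the reporter thread, once per reporting period.
        // Returns {count, sum} accumulated since the previous swap.
        long[] swapAndDrain() {
            Acc retired = active.getAndSet(spare);
            // A real implementation would wait for a short grace period (or a
            // per-worker acknowledgement) here, because the worker may still be
            // finishing a record() it started against the retired accumulator.
            long[] snapshot = {retired.count.sum(), retired.sum.sum()};
            retired.count.reset();
            retired.sum.reset();
            spare = retired;   // reuse the drained accumulator next period
            return snapshot;
        }
    }

The reporter would merge the per-worker snapshots and push one batched update
into the shared metric, which is also how the earlier per-thread suggestion
avoids a stampede on a global accumulator.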
>
>
> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
>
> > batching metrics reporting can help. For example, in the CommitProcessor,
> > increasing the maxCommitBatchSize helps improve the performance of
> > write operations.
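For readers unfamiliar with this knob: maxCommitBatchSize is a
CommitProcessor setting which, as far as I know, is set as a Java system
property on the server, along the lines of the following (the exact property
name should be verified against the ZooKeeper admin guide):

    -Dzookeeper.commitProcessor.maxCommitBatchSize=8

Larger values let the CommitProcessor apply more committed requests per pass,
which is what favors write throughput here.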
> >
> >
> > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> >
> > > Yes, I am thinking that handling metrics reporting in a separate
> thread,
> > > so it doesn't impact the "main" thread.
> > >
> > > Not sure about the idea of merging at the end of a reporting period.
> Can
> > > you elaborate a bit on it?
> > >
> > > Thanks,
> > >
> > > Li
> > >
> > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > >> Would it help to keep per thread metrics that are either reported
> > >> independently or are merged at the end of a reporting period?
> > >>
> > >>
> > >>
> > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> > >>
> > >> > Hi Community,
> > >> >
> > >> > I've done further investigation on the issue and found the following
> > >> >
> > >> > 1. The perf of the read operation was decreased due to the lock
> > >> > contention in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > >> > CommitProcWorker threads were blocked on the TimeWindowQuantiles.insert()
> > >> > API when the test was running.
> > >> >
> > >> > 2. The perf of the write operation was decreased because of the high CPU
> > >> > usage from the Prometheus Summary type of metrics. The CPU usage of
> > >> > CommitProcessor increased by about 50% with Prometheus enabled compared
> > >> > to disabled (46% vs 80% with 4 CPUs, 63% vs 99% with 12 CPUs).
> > >> >
> > >> >
> > >> > Prometheus integration is a great feature, however the negative
> > >> performance
> > >> > impact is very significant.  I wonder if anyone has any thoughts on
> > how
> > >> to
> > >> > reduce the perf impact.
> > >> >
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> >
> > >> > Li
> > >> >
> > >> >
> > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I would like to reach out to the community to see if anyone has
> some
> > >> > > insights or experience with the performance impact of enabling
> > >> prometheus
> > >> > > metrics.
> > >> > >
> > >> > > I have done load comparison tests for Prometheus enabled vs
> disabled
> > >> and
> > >> > > found the performance is reduced about 40%-60% for both read and
> > write
> > >> > > oeprations (i.e. getData, getChildren and createNode).
> > >> > >
> > >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> > >> participants
> > >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > >> > concurrent
> > >> > > users sending requests.
> > >> > >
> > >> > > The performance impact is quite significant.  I wonder if this is
> > >> > expected
> > >> > > and what are things we can do to have ZK performing the same while
> > >> > > leveraging the new feature of Prometheus metric.
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Li
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
What I observed is that the impact varies depending on which types of
metrics (count/gauge vs summary) are reported and how many metrics the
application reports. Quite a lot of metrics have been added to the
CommitProcessor, and they are pretty much all of the Summary type.

For the read operation, the number of metrics involved is not as high as for
the write operation. With a single CommitProcWorker thread, the performance
is pretty much the same with Prometheus enabled and disabled. However, with
multiple CommitProcWorker threads, the performance impact is significant, as
essentially only one worker is running while the others are blocked on
Prometheus metrics reporting. The following is from the thread profiling.

[image: image.png (thread profiling screenshot; not available in the plain-text archive)]
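For context on why only the Summary-type metrics hurt: counters and gauges
are simple atomic updates, but a Summary configured with quantile objectives
has to insert every observation into a shared, time-windowed quantile
estimator, and that is where the contention reported above shows up. A
minimal sketch of such a metric using the Prometheus Java simpleclient (the
metric name and registration are illustrative, not ZooKeeper's actual code):

    import io.prometheus.client.Summary;

    // Illustrative only: a quantile-tracking Summary like the ones behind
    // ZooKeeper's latency metrics. Every observe() call feeds a shared
    // time-window quantile estimator, which is where the CommitProcWorker
    // threads were seen blocking.
    public class ReadLatencyMetric {
        private static final Summary READ_LATENCY = Summary.build()
                .name("example_read_latency_ms")   // hypothetical metric name
                .help("Read request latency in milliseconds")
                .quantile(0.5, 0.05)               // p50 with 5% allowed error
                .quantile(0.99, 0.001)             // p99 with 0.1% allowed error
                .register();

        public static void record(long latencyMs) {
            READ_LATENCY.observe(latencyMs);
        }
    }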

It's great that we have the option of tuning some knobs in the Prometheus
client and also of porting some customizations to the metrics collector.

Please let me know if you have any questions or need any additional info.

Thanks,

Li


On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eo...@gmail.com>
wrote:

> Sorry for top posting.
>
> I have never seen in other applications that Prometheus has such a
> significant impact.
>
> The first things that come into my mind:
> - collect a couple of dumps with some perf tool and dig into the problem
> -  verify that we have the latest version of Prometheus client
> - tune the few knobs we have in the Prometheus client
>
> In Apache Pulsar and in Apache Bookkeeper we have some customizations in
> the Prometheus metrics collectors; we could take a look and port those to
> Zookeeper (initially I worked on the Prometheus integration based on my
> usecases I have with Pulsar, Bookkeeper and other systems that already use
> Prometheus, but here in Zookeeper we are using the basic Prometheus client)
>
> Enrico
>
> Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha scritto:
>
> > Batching metrics reporting is very similar to option (c) but with locking
> > like option (a). That can usually be made faster by passing a reference
> to
> > the metrics accumulator to the reporting thread which can do the batch
> > update without locks. Usually requires ping-pong metrics accumulators so
> > that a thread can accumulate in one accumulator for a bit, pass that to
> the
> > merge thread and switch to using the alternate accumulator. Since all
> > threads typically report at the same time, this avoids a stampede on the
> > global accumulator.
> >
> >
> > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
> >
> > > batching metrics reporting can help. For example, in the
> CommitProcessor,
> > > increasing the maxCommitBatchSize helps improving the the performance
> of
> > > write operation.
> > >
> > >
> > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> > >
> > > > Yes, I am thinking that handling metrics reporting in a separate
> > thread,
> > > > so it doesn't impact the "main" thread.
> > > >
> > > > Not sure about the idea of merging at the end of a reporting period.
> > Can
> > > > you elaborate a bit on it?
> > > >
> > > > Thanks,
> > > >
> > > > Li
> > > >
> > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >
> > > >> Would it help to keep per thread metrics that are either reported
> > > >> independently or are merged at the end of a reporting period?
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> > > >>
> > > >> > Hi Community,
> > > >> >
> > > >> > I've done further investigation on the issue and found the
> following
> > > >> >
> > > >> > 1. The perf of the read operation was decreased due to the lock
> > > >> contention
> > > >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > > CommitProcWorker
> > > >> > threads were blocked on the TimeWindowQuantiles.insert() API when
> > the
> > > >> test
> > > >> > was.
> > > >> >
> > > >> > 2. The perf of the write operation was decreased because of the
> high
> > > CPU
> > > >> > usage from Prometheus Summary type of metrics. The CPU usage of
> > > >> > CommitProcessor increased about 50% when Prometheus was disabled
> > > >> compared
> > > >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> > > >> >
> > > >> >
> > > >> > Prometheus integration is a great feature, however the negative
> > > >> performance
> > > >> > impact is very significant.  I wonder if anyone has any thoughts
> on
> > > how
> > > >> to
> > > >> > reduce the perf impact.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> >
> > > >> > Li
> > > >> >
> > > >> >
> > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
> wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I would like to reach out to the community to see if anyone has
> > some
> > > >> > > insights or experience with the performance impact of enabling
> > > >> prometheus
> > > >> > > metrics.
> > > >> > >
> > > >> > > I have done load comparison tests for Prometheus enabled vs
> > disabled
> > > >> and
> > > >> > > found the performance is reduced about 40%-60% for both read and
> > > write
> > > >> > > oeprations (i.e. getData, getChildren and createNode).
> > > >> > >
> > > >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> > > >> participants
> > > >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > > >> > concurrent
> > > >> > > users sending requests.
> > > >> > >
> > > >> > > The performance impact is quite significant.  I wonder if this
> is
> > > >> > expected
> > > >> > > and what are things we can do to have ZK performing the same
> while
> > > >> > > leveraging the new feature of Prometheus metric.
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Li
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Thanks Ted. I will share the work and submit a PR for it.

Cheers,

Li

On Mon, May 3, 2021 at 10:12 PM Ted Dunning <te...@gmail.com> wrote:

> On Mon, May 3, 2021 at 6:50 PM Li Wang <li...@gmail.com> wrote:
>
> > ...
> >
> > Please let me know if you have any thoughts or questions.
> >
>
> My thought is that you have done some nice work.
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Ted Dunning <te...@gmail.com>.
On Mon, May 3, 2021 at 6:50 PM Li Wang <li...@gmail.com> wrote:

> ...
>
> Please let me know if you have any thoughts or questions.
>

My thought is that you have done some nice work.

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Hi Enrico, Andor,

The PR has been updated: https://github.com/apache/zookeeper/pull/1698.
Could you help review it so that it can be merged?

Thanks,

Li

On Tue, May 18, 2021 at 5:01 PM Li Wang <li...@gmail.com> wrote:

> Hi Andor, Ted, Enrico,
>
> Thanks again for your discussions and inputs. The changes have been
> submitted. https://github.com/apache/zookeeper/pull/1698
>
> Would you mind reviewing the PR?
>
> Best,
>
> Li
>
> On Wed, May 5, 2021 at 10:00 AM Li Wang <li...@gmail.com> wrote:
>
>> Thanks Andor!
>>
>> Li
>>
>> On Tue, May 4, 2021 at 2:00 PM Andor Molnar <an...@apache.org> wrote:
>>
>>> Awesome work Li, indeed.
>>> I’m happy to review the PR. Thanks for the contribution.
>>>
>>> Andor
>>>
>>>
>>>
>>> > On 2021. May 4., at 3:49, Li Wang <li...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Thanks Ted and Enrico again for the inputs and discussions. I would
>>> like to
>>> > share some updates from my side.
>>> >
>>> > 1. The main issue is that Prometheus summary computation is expensive
>>> and
>>> > locked/synchronized. To reduce the perf impact, I changed the
>>> > PrometheusLabelledSummary.add() operation to be async, so it is
>>> handled by
>>> > separate threads from the main thread.
>>> >
>>> > 2. Ran benchmark against the changes. The perf of read and write
>>> operations
>>> > are significantly improved.  .
>>> >
>>> > a) For read operation, the avg latency is reduced about 50% and avg
>>> > throughout is increased about 95%.
>>> > b) For write operation, the avg latency is reduced about 28% and avg
>>> > throughput is increased about 20%.
>>> >
>>> > 3. I opened a JIRA for the issue and can submit a PR for the change.
>>> >
>>> > https://issues.apache.org/jira/browse/ZOOKEEPER-4289
>>> >
>>> > Please let me know if you have any thoughts or questions.
>>> >
>>> > Thanks,
>>> >
>>> > Li
>>> >
>>> >
>>> >
>>> > On Wed, Apr 28, 2021 at 11:32 PM Enrico Olivelli <eo...@gmail.com>
>>> > wrote:
>>> >
>>> >> Il giorno mer 28 apr 2021 alle ore 21:48 Li Wang <li...@gmail.com>
>>> ha
>>> >> scritto:
>>> >>>
>>> >>> Hi Enrico,
>>> >>>
>>> >>> Please see my comments inline.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Li.
>>> >>>
>>> >>> On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <
>>> eolivelli@gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>>> Sorry for top posting.
>>> >>>>
>>> >>>> I have never seen in other applications that Prometheus has such a
>>> >>>> significant impact.
>>> >>>
>>> >>>
>>> >>> Please see my reply in the previous response.
>>> >>>
>>> >>>
>>> >>>>
>>> >>>> The first things that come into my mind:
>>> >>>> - collect a couple of dumps with some perf tool and dig into the
>>> >> problem
>>> >>>
>>> >>>
>>> >>> Heap dump, thread dump and CPU profiling with Prometheus enabled vs
>>> >>> disabled have been collected. That's how we found the issue.
>>> >>>
>>> >>>
>>> >>>> -  verify that we have the latest version of Prometheus client
>>> >>>>
>>> >>>
>>> >>> ZK uses 0.9.0 and the latest version of Prometheus client is 0.10.0.
>>> >>> Checked in the change logs, it doesn't look like there are major
>>> changes
>>> >>> between these two.
>>> https://github.com/prometheus/client_java/releases We
>>> >>> can upgrade the version and see if it makes any difference.
>>> >>>
>>> >>> - tune the few knobs we have in the Prometheus client
>>> >>>>
>>> >>>
>>> >>> What are the knobs we can tune? Can you provide more info on this?
>>> >>
>>> >> Unfortunately there is nothing we can tune directly with the zookeeper
>>> >> configuration file
>>> >>
>>> >> In ZK code we currently have only some basic configuration of
>>> >> summaries, and we are using mostly the default values
>>> >>
>>> >>
>>> https://github.com/apache/zookeeper/blob/11c07921c15e2fb7692375327b53f26a583b77ca/zookeeper-metrics-providers/zookeeper-prometheus-metrics/src/main/java/org/apache/zookeeper/metrics/prometheus/PrometheusMetricsProvider.java#L340
>>> >> there is no way to fine tune the buckets and all of the structures.
>>> >>
>>> >> we should work on looking for optimal values
>>> >>
>>> >
>>> >
>>> >
>>> >
>>> >>
>>> >>
>>> >>>
>>> >>>>
>>> >>>> In Apache Pulsar and in Apache Bookkeeper we have some
>>> customizations
>>> >> in
>>> >>>> the Prometheus metrics collectors, we could the a look and port
>>> those
>>> >> to
>>> >>>> Zookeeper (initially I worked on the Prometheus integration based
>>> on my
>>> >>>> usecases I have with Pulsar, Bookkeeper and other systems that
>>> already
>>> >> use
>>> >>>> Prometheus, but here in Zookeeper we are using the basic Prometheus
>>> >> client)
>>> >>>>
>>> >>>
>>> >>> Are there any customizations for minimizing the performance impact?
>>> Is
>>> >>> there anything that we can port to ZK to help the issue?
>>> >>
>>> >> Yes, there is much we can port, this is the Prometheus Metrics
>>> >> provider in BookKeeper
>>> >>
>>> >>
>>> https://github.com/apache/bookkeeper/tree/master/bookkeeper-stats-providers/prometheus-metrics-provider/src/main/java/org/apache/bookkeeper/stats/prometheus
>>> >>
>>> >> Enrico
>>> >>
>>> >>>
>>> >>>
>>> >>>> Enrico
>>> >>>>
>>> >>>> Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha
>>> >> scritto:
>>> >>>>
>>> >>>>> Batching metrics reporting is very similar to option (c) but with
>>> >> locking
>>> >>>>> like option (a). That can usually be made faster by passing a
>>> >> reference
>>> >>>> to
>>> >>>>> the metrics accumulator to the reporting thread which can do the
>>> >> batch
>>> >>>>> update without locks. Usually requires ping-pong metrics
>>> >> accumulators so
>>> >>>>> that a thread can accumulate in one accumulator for a bit, pass
>>> that
>>> >> to
>>> >>>> the
>>> >>>>> merge thread and switch to using the alternate accumulator. Since
>>> all
>>> >>>>> threads typically report at the same time, this avoids a stampede
>>> on
>>> >> the
>>> >>>>> global accumulator.
>>> >>>>>
>>> >>>>>
>>> >>>>> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> batching metrics reporting can help. For example, in the
>>> >>>> CommitProcessor,
>>> >>>>>> increasing the maxCommitBatchSize helps improving the the
>>> >> performance
>>> >>>> of
>>> >>>>>> write operation.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>>> Yes, I am thinking that handling metrics reporting in a separate
>>> >>>>> thread,
>>> >>>>>>> so it doesn't impact the "main" thread.
>>> >>>>>>>
>>> >>>>>>> Not sure about the idea of merging at the end of a reporting
>>> >> period.
>>> >>>>> Can
>>> >>>>>>> you elaborate a bit on it?
>>> >>>>>>>
>>> >>>>>>> Thanks,
>>> >>>>>>>
>>> >>>>>>> Li
>>> >>>>>>>
>>> >>>>>>> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <
>>> >> ted.dunning@gmail.com>
>>> >>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>>> Would it help to keep per thread metrics that are either
>>> >> reported
>>> >>>>>>>> independently or are merged at the end of a reporting period?
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com>
>>> >> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Hi Community,
>>> >>>>>>>>>
>>> >>>>>>>>> I've done further investigation on the issue and found the
>>> >>>> following
>>> >>>>>>>>>
>>> >>>>>>>>> 1. The perf of the read operation was decreased due to the
>>> >> lock
>>> >>>>>>>> contention
>>> >>>>>>>>> in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
>>> >>>>>> CommitProcWorker
>>> >>>>>>>>> threads were blocked on the TimeWindowQuantiles.insert() API
>>> >> when
>>> >>>>> the
>>> >>>>>>>> test
>>> >>>>>>>>> was.
>>> >>>>>>>>>
>>> >>>>>>>>> 2. The perf of the write operation was decreased because of
>>> >> the
>>> >>>> high
>>> >>>>>> CPU
>>> >>>>>>>>> usage from Prometheus Summary type of metrics. The CPU usage
>>> >> of
>>> >>>>>>>>> CommitProcessor increased about 50% when Prometheus was
>>> >> disabled
>>> >>>>>>>> compared
>>> >>>>>>>>> to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Prometheus integration is a great feature, however the
>>> >> negative
>>> >>>>>>>> performance
>>> >>>>>>>>> impact is very significant.  I wonder if anyone has any
>>> >> thoughts
>>> >>>> on
>>> >>>>>> how
>>> >>>>>>>> to
>>> >>>>>>>>> reduce the perf impact.
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Li
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
>>> >>>> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> Hi,
>>> >>>>>>>>>>
>>> >>>>>>>>>> I would like to reach out to the community to see if anyone
>>> >> has
>>> >>>>> some
>>> >>>>>>>>>> insights or experience with the performance impact of
>>> >> enabling
>>> >>>>>>>> prometheus
>>> >>>>>>>>>> metrics.
>>> >>>>>>>>>>
>>> >>>>>>>>>> I have done load comparison tests for Prometheus enabled vs
>>> >>>>> disabled
>>> >>>>>>>> and
>>> >>>>>>>>>> found the performance is reduced about 40%-60% for both
>>> >> read and
>>> >>>>>> write
>>> >>>>>>>>>> oeprations (i.e. getData, getChildren and createNode).
>>> >>>>>>>>>>
>>> >>>>>>>>>> The load test was done with Zookeeper 3.7, cluster size of 5
>>> >>>>>>>> participants
>>> >>>>>>>>>> and 5 observers, each ZK server has 10G heap size and 4
>>> >> cpu, 500
>>> >>>>>>>>> concurrent
>>> >>>>>>>>>> users sending requests.
>>> >>>>>>>>>>
>>> >>>>>>>>>> The performance impact is quite significant.  I wonder if
>>> >> this
>>> >>>> is
>>> >>>>>>>>> expected
>>> >>>>>>>>>> and what are things we can do to have ZK performing the same
>>> >>>> while
>>> >>>>>>>>>> leveraging the new feature of Prometheus metric.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Best,
>>> >>>>>>>>>>
>>> >>>>>>>>>> Li
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>
>>>
>>>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Hi Andor, Ted, Enrico,

Thanks again for your discussions and inputs. The changes have been
submitted. https://github.com/apache/zookeeper/pull/1698

Would you mind reviewing the PR?

Best,

Li

On Wed, May 5, 2021 at 10:00 AM Li Wang <li...@gmail.com> wrote:

> Thanks Andor!
>
> Li
>
> On Tue, May 4, 2021 at 2:00 PM Andor Molnar <an...@apache.org> wrote:
>
>> Awesome work Li, indeed.
>> I’m happy to review the PR. Thanks for the contribution.
>>
>> Andor
>>
>>
>>
>> > On 2021. May 4., at 3:49, Li Wang <li...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Thanks Ted and Enrico again for the inputs and discussions. I would
>> like to
>> > share some updates from my side.
>> >
>> > 1. The main issue is that Prometheus summary computation is expensive
>> and
>> > locked/synchronized. To reduce the perf impact, I changed the
>> > PrometheusLabelledSummary.add() operation to be async, so it is handled
>> by
>> > separate threads from the main thread.
>> >
>> > 2. Ran benchmark against the changes. The perf of read and write
>> operations
>> > are significantly improved.  .
>> >
>> > a) For read operation, the avg latency is reduced about 50% and avg
>> > throughout is increased about 95%.
>> > b) For write operation, the avg latency is reduced about 28% and avg
>> > throughput is increased about 20%.
>> >
>> > 3. I opened a JIRA for the issue and can submit a PR for the change.
>> >
>> > https://issues.apache.org/jira/browse/ZOOKEEPER-4289
>> >
>> > Please let me know if you have any thoughts or questions.
>> >
>> > Thanks,
>> >
>> > Li
>> >
>> >
>> >
>> > On Wed, Apr 28, 2021 at 11:32 PM Enrico Olivelli <eo...@gmail.com>
>> > wrote:
>> >
>> >> Il giorno mer 28 apr 2021 alle ore 21:48 Li Wang <li...@gmail.com>
>> ha
>> >> scritto:
>> >>>
>> >>> Hi Enrico,
>> >>>
>> >>> Please see my comments inline.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Li.
>> >>>
>> >>> On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eolivelli@gmail.com
>> >
>> >>> wrote:
>> >>>
>> >>>> Sorry for top posting.
>> >>>>
>> >>>> I have never seen in other applications that Prometheus has such a
>> >>>> significant impact.
>> >>>
>> >>>
>> >>> Please see my reply in the previous response.
>> >>>
>> >>>
>> >>>>
>> >>>> The first things that come into my mind:
>> >>>> - collect a couple of dumps with some perf tool and dig into the
>> >> problem
>> >>>
>> >>>
>> >>> Heap dump, thread dump and CPU profiling with Prometheus enabled vs
>> >>> disabled have been collected. That's how we found the issue.
>> >>>
>> >>>
>> >>>> -  verify that we have the latest version of Prometheus client
>> >>>>
>> >>>
>> >>> ZK uses 0.9.0 and the latest version of Prometheus client is 0.10.0.
>> >>> Checked in the change logs, it doesn't look like there are major
>> changes
>> >>> between these two. https://github.com/prometheus/client_java/releases
>> We
>> >>> can upgrade the version and see if it makes any difference.
>> >>>
>> >>> - tune the few knobs we have in the Prometheus client
>> >>>>
>> >>>
>> >>> What are the knobs we can tune? Can you provide more info on this?
>> >>
>> >> Unfortunately there is nothing we can tune directly with the zookeeper
>> >> configuration file
>> >>
>> >> In ZK code we currently have only some basic configuration of
>> >> summaries, and we are using mostly the default values
>> >>
>> >>
>> https://github.com/apache/zookeeper/blob/11c07921c15e2fb7692375327b53f26a583b77ca/zookeeper-metrics-providers/zookeeper-prometheus-metrics/src/main/java/org/apache/zookeeper/metrics/prometheus/PrometheusMetricsProvider.java#L340
>> >> there is no way to fine tune the buckets and all of the structures.
>> >>
>> >> we should work on looking for optimal values
>> >>
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >>>
>> >>>>
>> >>>> In Apache Pulsar and in Apache Bookkeeper we have some customizations
>> >> in
>> >>>> the Prometheus metrics collectors, we could the a look and port those
>> >> to
>> >>>> Zookeeper (initially I worked on the Prometheus integration based on
>> my
>> >>>> usecases I have with Pulsar, Bookkeeper and other systems that
>> already
>> >> use
>> >>>> Prometheus, but here in Zookeeper we are using the basic Prometheus
>> >> client)
>> >>>>
>> >>>
>> >>> Are there any customizations for minimizing the performance impact?
>> Is
>> >>> there anything that we can port to ZK to help the issue?
>> >>
>> >> Yes, there is much we can port, this is the Prometheus Metrics
>> >> provider in BookKeeper
>> >>
>> >>
>> https://github.com/apache/bookkeeper/tree/master/bookkeeper-stats-providers/prometheus-metrics-provider/src/main/java/org/apache/bookkeeper/stats/prometheus
>> >>
>> >> Enrico
>> >>
>> >>>
>> >>>
>> >>>> Enrico
>> >>>>
>> >>>> Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha
>> >> scritto:
>> >>>>
>> >>>>> Batching metrics reporting is very similar to option (c) but with
>> >> locking
>> >>>>> like option (a). That can usually be made faster by passing a
>> >> reference
>> >>>> to
>> >>>>> the metrics accumulator to the reporting thread which can do the
>> >> batch
>> >>>>> update without locks. Usually requires ping-pong metrics
>> >> accumulators so
>> >>>>> that a thread can accumulate in one accumulator for a bit, pass that
>> >> to
>> >>>> the
>> >>>>> merge thread and switch to using the alternate accumulator. Since
>> all
>> >>>>> threads typically report at the same time, this avoids a stampede on
>> >> the
>> >>>>> global accumulator.
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
>> >>>>>
>> >>>>>> batching metrics reporting can help. For example, in the
>> >>>> CommitProcessor,
>> >>>>>> increasing the maxCommitBatchSize helps improving the the
>> >> performance
>> >>>> of
>> >>>>>> write operation.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
>> >>>>>>
>> >>>>>>> Yes, I am thinking that handling metrics reporting in a separate
>> >>>>> thread,
>> >>>>>>> so it doesn't impact the "main" thread.
>> >>>>>>>
>> >>>>>>> Not sure about the idea of merging at the end of a reporting
>> >> period.
>> >>>>> Can
>> >>>>>>> you elaborate a bit on it?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>>
>> >>>>>>> Li
>> >>>>>>>
>> >>>>>>> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <
>> >> ted.dunning@gmail.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Would it help to keep per thread metrics that are either
>> >> reported
>> >>>>>>>> independently or are merged at the end of a reporting period?
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com>
>> >> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Community,
>> >>>>>>>>>
>> >>>>>>>>> I've done further investigation on the issue and found the
>> >>>> following
>> >>>>>>>>>
>> >>>>>>>>> 1. The perf of the read operation was decreased due to the
>> >> lock
>> >>>>>>>> contention
>> >>>>>>>>> in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
>> >>>>>> CommitProcWorker
>> >>>>>>>>> threads were blocked on the TimeWindowQuantiles.insert() API
>> >> when
>> >>>>> the
>> >>>>>>>> test
>> >>>>>>>>> was.
>> >>>>>>>>>
>> >>>>>>>>> 2. The perf of the write operation was decreased because of
>> >> the
>> >>>> high
>> >>>>>> CPU
>> >>>>>>>>> usage from Prometheus Summary type of metrics. The CPU usage
>> >> of
>> >>>>>>>>> CommitProcessor increased about 50% when Prometheus was
>> >> disabled
>> >>>>>>>> compared
>> >>>>>>>>> to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Prometheus integration is a great feature, however the
>> >> negative
>> >>>>>>>> performance
>> >>>>>>>>> impact is very significant.  I wonder if anyone has any
>> >> thoughts
>> >>>> on
>> >>>>>> how
>> >>>>>>>> to
>> >>>>>>>>> reduce the perf impact.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Li
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
>> >>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> I would like to reach out to the community to see if anyone
>> >> has
>> >>>>> some
>> >>>>>>>>>> insights or experience with the performance impact of
>> >> enabling
>> >>>>>>>> prometheus
>> >>>>>>>>>> metrics.
>> >>>>>>>>>>
>> >>>>>>>>>> I have done load comparison tests for Prometheus enabled vs
>> >>>>> disabled
>> >>>>>>>> and
>> >>>>>>>>>> found the performance is reduced about 40%-60% for both
>> >> read and
>> >>>>>> write
>> >>>>>>>>>> oeprations (i.e. getData, getChildren and createNode).
>> >>>>>>>>>>
>> >>>>>>>>>> The load test was done with Zookeeper 3.7, cluster size of 5
>> >>>>>>>> participants
>> >>>>>>>>>> and 5 observers, each ZK server has 10G heap size and 4
>> >> cpu, 500
>> >>>>>>>>> concurrent
>> >>>>>>>>>> users sending requests.
>> >>>>>>>>>>
>> >>>>>>>>>> The performance impact is quite significant.  I wonder if
>> >> this
>> >>>> is
>> >>>>>>>>> expected
>> >>>>>>>>>> and what are things we can do to have ZK performing the same
>> >>>> while
>> >>>>>>>>>> leveraging the new feature of Prometheus metric.
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>>
>> >>>>>>>>>> Li
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>
>>
>>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Thanks Andor!

Li

On Tue, May 4, 2021 at 2:00 PM Andor Molnar <an...@apache.org> wrote:

> Awesome work Li, indeed.
> I’m happy to review the PR. Thanks for the contribution.
>
> Andor
>
>
>
> > On 2021. May 4., at 3:49, Li Wang <li...@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks Ted and Enrico again for the inputs and discussions. I would like
> to
> > share some updates from my side.
> >
> > 1. The main issue is that Prometheus summary computation is expensive and
> > locked/synchronized. To reduce the perf impact, I changed the
> > PrometheusLabelledSummary.add() operation to be async, so it is handled
> by
> > separate threads from the main thread.
> >
> > 2. Ran benchmark against the changes. The perf of read and write
> operations
> > are significantly improved.  .
> >
> > a) For read operation, the avg latency is reduced about 50% and avg
> > throughout is increased about 95%.
> > b) For write operation, the avg latency is reduced about 28% and avg
> > throughput is increased about 20%.
> >
> > 3. I opened a JIRA for the issue and can submit a PR for the change.
> >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-4289
> >
> > Please let me know if you have any thoughts or questions.
> >
> > Thanks,
> >
> > Li
> >
> >
> >
> > On Wed, Apr 28, 2021 at 11:32 PM Enrico Olivelli <eo...@gmail.com>
> > wrote:
> >
> >> Il giorno mer 28 apr 2021 alle ore 21:48 Li Wang <li...@gmail.com> ha
> >> scritto:
> >>>
> >>> Hi Enrico,
> >>>
> >>> Please see my comments inline.
> >>>
> >>> Thanks,
> >>>
> >>> Li.
> >>>
> >>> On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eo...@gmail.com>
> >>> wrote:
> >>>
> >>>> Sorry for top posting.
> >>>>
> >>>> I have never seen in other applications that Prometheus has such a
> >>>> significant impact.
> >>>
> >>>
> >>> Please see my reply in the previous response.
> >>>
> >>>
> >>>>
> >>>> The first things that come into my mind:
> >>>> - collect a couple of dumps with some perf tool and dig into the
> >> problem
> >>>
> >>>
> >>> Heap dump, thread dump and CPU profiling with Prometheus enabled vs
> >>> disabled have been collected. That's how we found the issue.
> >>>
> >>>
> >>>> -  verify that we have the latest version of Prometheus client
> >>>>
> >>>
> >>> ZK uses 0.9.0 and the latest version of Prometheus client is 0.10.0.
> >>> Checked in the change logs, it doesn't look like there are major
> changes
> >>> between these two. https://github.com/prometheus/client_java/releases
> We
> >>> can upgrade the version and see if it makes any difference.
> >>>
> >>> - tune the few knobs we have in the Prometheus client
> >>>>
> >>>
> >>> What are the knobs we can tune? Can you provide more info on this?
> >>
> >> Unfortunately there is nothing we can tune directly with the zookeeper
> >> configuration file
> >>
> >> In ZK code we currently have only some basic configuration of
> >> summaries, and we are using mostly the default values
> >>
> >>
> https://github.com/apache/zookeeper/blob/11c07921c15e2fb7692375327b53f26a583b77ca/zookeeper-metrics-providers/zookeeper-prometheus-metrics/src/main/java/org/apache/zookeeper/metrics/prometheus/PrometheusMetricsProvider.java#L340
> >> there is no way to fine tune the buckets and all of the structures.
> >>
> >> we should work on looking for optimal values
> >>
> >
> >
> >
> >
> >>
> >>
> >>>
> >>>>
> >>>> In Apache Pulsar and in Apache Bookkeeper we have some customizations
> >> in
> >>>> the Prometheus metrics collectors, we could the a look and port those
> >> to
> >>>> Zookeeper (initially I worked on the Prometheus integration based on
> my
> >>>> usecases I have with Pulsar, Bookkeeper and other systems that already
> >> use
> >>>> Prometheus, but here in Zookeeper we are using the basic Prometheus
> >> client)
> >>>>
> >>>
> >>> Are there any customizations for minimizing the performance impact?  Is
> >>> there anything that we can port to ZK to help the issue?
> >>
> >> Yes, there is much we can port, this is the Prometheus Metrics
> >> provider in BookKeeper
> >>
> >>
> https://github.com/apache/bookkeeper/tree/master/bookkeeper-stats-providers/prometheus-metrics-provider/src/main/java/org/apache/bookkeeper/stats/prometheus
> >>
> >> Enrico
> >>
> >>>
> >>>
> >>>> Enrico
> >>>>
> >>>> Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha
> >> scritto:
> >>>>
> >>>>> Batching metrics reporting is very similar to option (c) but with
> >> locking
> >>>>> like option (a). That can usually be made faster by passing a
> >> reference
> >>>> to
> >>>>> the metrics accumulator to the reporting thread which can do the
> >> batch
> >>>>> update without locks. Usually requires ping-pong metrics
> >> accumulators so
> >>>>> that a thread can accumulate in one accumulator for a bit, pass that
> >> to
> >>>> the
> >>>>> merge thread and switch to using the alternate accumulator. Since all
> >>>>> threads typically report at the same time, this avoids a stampede on
> >> the
> >>>>> global accumulator.
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
> >>>>>
> >>>>>> batching metrics reporting can help. For example, in the
> >>>> CommitProcessor,
> >>>>>> increasing the maxCommitBatchSize helps improving the the
> >> performance
> >>>> of
> >>>>>> write operation.
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Yes, I am thinking that handling metrics reporting in a separate
> >>>>> thread,
> >>>>>>> so it doesn't impact the "main" thread.
> >>>>>>>
> >>>>>>> Not sure about the idea of merging at the end of a reporting
> >> period.
> >>>>> Can
> >>>>>>> you elaborate a bit on it?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Li
> >>>>>>>
> >>>>>>> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <
> >> ted.dunning@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Would it help to keep per thread metrics that are either
> >> reported
> >>>>>>>> independently or are merged at the end of a reporting period?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com>
> >> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Community,
> >>>>>>>>>
> >>>>>>>>> I've done further investigation on the issue and found the
> >>>> following
> >>>>>>>>>
> >>>>>>>>> 1. The perf of the read operation was decreased due to the
> >> lock
> >>>>>>>> contention
> >>>>>>>>> in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> >>>>>> CommitProcWorker
> >>>>>>>>> threads were blocked on the TimeWindowQuantiles.insert() API
> >> when
> >>>>> the
> >>>>>>>> test
> >>>>>>>>> was.
> >>>>>>>>>
> >>>>>>>>> 2. The perf of the write operation was decreased because of
> >> the
> >>>> high
> >>>>>> CPU
> >>>>>>>>> usage from Prometheus Summary type of metrics. The CPU usage
> >> of
> >>>>>>>>> CommitProcessor increased about 50% when Prometheus was
> >> disabled
> >>>>>>>> compared
> >>>>>>>>> to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Prometheus integration is a great feature, however the
> >> negative
> >>>>>>>> performance
> >>>>>>>>> impact is very significant.  I wonder if anyone has any
> >> thoughts
> >>>> on
> >>>>>> how
> >>>>>>>> to
> >>>>>>>>> reduce the perf impact.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Li
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I would like to reach out to the community to see if anyone
> >> has
> >>>>> some
> >>>>>>>>>> insights or experience with the performance impact of
> >> enabling
> >>>>>>>> prometheus
> >>>>>>>>>> metrics.
> >>>>>>>>>>
> >>>>>>>>>> I have done load comparison tests for Prometheus enabled vs
> >>>>> disabled
> >>>>>>>> and
> >>>>>>>>>> found the performance is reduced about 40%-60% for both
> >> read and
> >>>>>> write
> >>>>>>>>>> oeprations (i.e. getData, getChildren and createNode).
> >>>>>>>>>>
> >>>>>>>>>> The load test was done with Zookeeper 3.7, cluster size of 5
> >>>>>>>> participants
> >>>>>>>>>> and 5 observers, each ZK server has 10G heap size and 4
> >> cpu, 500
> >>>>>>>>> concurrent
> >>>>>>>>>> users sending requests.
> >>>>>>>>>>
> >>>>>>>>>> The performance impact is quite significant.  I wonder if
> >> this
> >>>> is
> >>>>>>>>> expected
> >>>>>>>>>> and what are things we can do to have ZK performing the same
> >>>> while
> >>>>>>>>>> leveraging the new feature of Prometheus metric.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Li
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
>
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Andor Molnar <an...@apache.org>.
Awesome work Li, indeed.
I’m happy to review the PR. Thanks for the contribution.

Andor



> On 2021. May 4., at 3:49, Li Wang <li...@gmail.com> wrote:
> 
> Hi,
> 
> Thanks Ted and Enrico again for the inputs and discussions. I would like to
> share some updates from my side.
> 
> 1. The main issue is that Prometheus summary computation is expensive and
> locked/synchronized. To reduce the perf impact, I changed the
> PrometheusLabelledSummary.add() operation to be async, so it is handled by
> separate threads from the main thread.
> 
> 2. Ran benchmark against the changes. The perf of read and write operations
> are significantly improved.  .
> 
> a) For read operation, the avg latency is reduced about 50% and avg
> throughout is increased about 95%.
> b) For write operation, the avg latency is reduced about 28% and avg
> throughput is increased about 20%.
> 
> 3. I opened a JIRA for the issue and can submit a PR for the change.
> 
> https://issues.apache.org/jira/browse/ZOOKEEPER-4289
> 
> Please let me know if you have any thoughts or questions.
> 
> Thanks,
> 
> Li
> 
> 
> 
> On Wed, Apr 28, 2021 at 11:32 PM Enrico Olivelli <eo...@gmail.com>
> wrote:
> 
>> Il giorno mer 28 apr 2021 alle ore 21:48 Li Wang <li...@gmail.com> ha
>> scritto:
>>> 
>>> Hi Enrico,
>>> 
>>> Please see my comments inline.
>>> 
>>> Thanks,
>>> 
>>> Li.
>>> 
>>> On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eo...@gmail.com>
>>> wrote:
>>> 
>>>> Sorry for top posting.
>>>> 
>>>> I have never seen in other applications that Prometheus has such a
>>>> significant impact.
>>> 
>>> 
>>> Please see my reply in the previous response.
>>> 
>>> 
>>>> 
>>>> The first things that come into my mind:
>>>> - collect a couple of dumps with some perf tool and dig into the
>> problem
>>> 
>>> 
>>> Heap dump, thread dump and CPU profiling with Prometheus enabled vs
>>> disabled have been collected. That's how we found the issue.
>>> 
>>> 
>>>> -  verify that we have the latest version of Prometheus client
>>>> 
>>> 
>>> ZK uses 0.9.0 and the latest version of Prometheus client is 0.10.0.
>>> Checked in the change logs, it doesn't look like there are major changes
>>> between these two. https://github.com/prometheus/client_java/releases We
>>> can upgrade the version and see if it makes any difference.
>>> 
>>> - tune the few knobs we have in the Prometheus client
>>>> 
>>> 
>>> What are the knobs we can tune? Can you provide more info on this?
>> 
>> Unfortunately there is nothing we can tune directly with the zookeeper
>> configuration file
>> 
>> In ZK code we currently have only some basic configuration of
>> summaries, and we are using mostly the default values
>> 
>> https://github.com/apache/zookeeper/blob/11c07921c15e2fb7692375327b53f26a583b77ca/zookeeper-metrics-providers/zookeeper-prometheus-metrics/src/main/java/org/apache/zookeeper/metrics/prometheus/PrometheusMetricsProvider.java#L340
>> there is no way to fine tune the buckets and all of the structures.
>> 
>> we should work on looking for optimal values
>> 
> 
> 
> 
> 
>> 
>> 
>>> 
>>>> 
>>>> In Apache Pulsar and in Apache Bookkeeper we have some customizations
>> in
>>>> the Prometheus metrics collectors, we could the a look and port those
>> to
>>>> Zookeeper (initially I worked on the Prometheus integration based on my
>>>> usecases I have with Pulsar, Bookkeeper and other systems that already
>> use
>>>> Prometheus, but here in Zookeeper we are using the basic Prometheus
>> client)
>>>> 
>>> 
>>> Are there any customizations for minimizing the performance impact?  Is
>>> there anything that we can port to ZK to help the issue?
>> 
>> Yes, there is much we can port, this is the Prometheus Metrics
>> provider in BookKeeper
>> 
>> https://github.com/apache/bookkeeper/tree/master/bookkeeper-stats-providers/prometheus-metrics-provider/src/main/java/org/apache/bookkeeper/stats/prometheus
>> 
>> Enrico
>> 
>>> 
>>> 
>>>> Enrico
>>>> 
>>>> Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha
>> scritto:
>>>> 
>>>>> Batching metrics reporting is very similar to option (c) but with
>> locking
>>>>> like option (a). That can usually be made faster by passing a
>> reference
>>>> to
>>>>> the metrics accumulator to the reporting thread which can do the
>> batch
>>>>> update without locks. Usually requires ping-pong metrics
>> accumulators so
>>>>> that a thread can accumulate in one accumulator for a bit, pass that
>> to
>>>> the
>>>>> merge thread and switch to using the alternate accumulator. Since all
>>>>> threads typically report at the same time, this avoids a stampede on
>> the
>>>>> global accumulator.
>>>>> 
>>>>> 
>>>>> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
>>>>> 
>>>>>> batching metrics reporting can help. For example, in the
>>>> CommitProcessor,
>>>>>> increasing the maxCommitBatchSize helps improving the the
>> performance
>>>> of
>>>>>> write operation.
>>>>>> 
>>>>>> 
>>>>>> On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
>>>>>> 
>>>>>>> Yes, I am thinking that handling metrics reporting in a separate
>>>>> thread,
>>>>>>> so it doesn't impact the "main" thread.
>>>>>>> 
>>>>>>> Not sure about the idea of merging at the end of a reporting
>> period.
>>>>> Can
>>>>>>> you elaborate a bit on it?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Li
>>>>>>> 
>>>>>>> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <
>> ted.dunning@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Would it help to keep per thread metrics that are either
>> reported
>>>>>>>> independently or are merged at the end of a reporting period?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com>
>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Community,
>>>>>>>>> 
>>>>>>>>> I've done further investigation on the issue and found the
>>>> following
>>>>>>>>> 
>>>>>>>>> 1. The perf of the read operation was decreased due to the
>> lock
>>>>>>>> contention
>>>>>>>>> in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
>>>>>> CommitProcWorker
>>>>>>>>> threads were blocked on the TimeWindowQuantiles.insert() API
>> when
>>>>> the
>>>>>>>> test
>>>>>>>>> was.
>>>>>>>>> 
>>>>>>>>> 2. The perf of the write operation was decreased because of
>> the
>>>> high
>>>>>> CPU
>>>>>>>>> usage from Prometheus Summary type of metrics. The CPU usage
>> of
>>>>>>>>> CommitProcessor increased about 50% when Prometheus was
>> disabled
>>>>>>>> compared
>>>>>>>>> to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Prometheus integration is a great feature, however the
>> negative
>>>>>>>> performance
>>>>>>>>> impact is very significant.  I wonder if anyone has any
>> thoughts
>>>> on
>>>>>> how
>>>>>>>> to
>>>>>>>>> reduce the perf impact.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Li
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I would like to reach out to the community to see if anyone
>> has
>>>>> some
>>>>>>>>>> insights or experience with the performance impact of
>> enabling
>>>>>>>> prometheus
>>>>>>>>>> metrics.
>>>>>>>>>> 
>>>>>>>>>> I have done load comparison tests for Prometheus enabled vs
>>>>> disabled
>>>>>>>> and
>>>>>>>>>> found the performance is reduced about 40%-60% for both
>> read and
>>>>>> write
>>>>>>>>>> oeprations (i.e. getData, getChildren and createNode).
>>>>>>>>>> 
>>>>>>>>>> The load test was done with Zookeeper 3.7, cluster size of 5
>>>>>>>> participants
>>>>>>>>>> and 5 observers, each ZK server has 10G heap size and 4
>> cpu, 500
>>>>>>>>> concurrent
>>>>>>>>>> users sending requests.
>>>>>>>>>> 
>>>>>>>>>> The performance impact is quite significant.  I wonder if
>> this
>>>> is
>>>>>>>>> expected
>>>>>>>>>> and what are things we can do to have ZK performing the same
>>>> while
>>>>>>>>>> leveraging the new feature of Prometheus metric.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> 
>>>>>>>>>> Li
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 


Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Hi,

Thanks Ted and Enrico again for the inputs and discussions. I would like to
share some updates from my side.

1. The main issue is that Prometheus summary computation is expensive and
locked/synchronized. To reduce the perf impact, I changed the
PrometheusLabelledSummary.add() operation to be async, so it is handled by
separate threads from the main thread (a rough sketch of the idea follows
this list).

2. Ran benchmarks against the changes. The perf of read and write operations
is significantly improved.

a) For the read operation, the avg latency is reduced by about 50% and avg
throughput is increased by about 95%.
b) For the write operation, the avg latency is reduced by about 28% and avg
throughput is increased by about 20%.

3. I opened a JIRA for the issue and can submit a PR for the change.

https://issues.apache.org/jira/browse/ZOOKEEPER-4289
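Along the lines of point 1 above, here is a minimal, illustrative sketch of
making the summary update asynchronous. This is not the actual patch in the
PR; the class and method names are examples only.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import io.prometheus.client.Summary;

    // Illustrative only: offload the expensive, synchronized Summary.observe()
    // from the request-processing thread to a dedicated metrics thread, so the
    // hot path only pays the cost of enqueueing a task.
    public class AsyncSummary {
        private final Summary summary;
        private final ExecutorService metricsExecutor =
                Executors.newSingleThreadExecutor(r -> {
                    Thread t = new Thread(r, "metrics-updater");
                    t.setDaemon(true);
                    return t;
                });

        public AsyncSummary(Summary summary) {
            this.summary = summary;
        }

        public void add(long value) {
            // Hot path: hand the observation off; the quantile insertion
            // (and its lock) happens on the metrics thread.
            metricsExecutor.execute(() -> summary.observe(value));
        }
    }

An unbounded queue can itself become a problem under sustained load, so a
real implementation would bound the queue and drop or batch observations when
it fills up.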

Please let me know if you have any thoughts or questions.

Thanks,

Li



On Wed, Apr 28, 2021 at 11:32 PM Enrico Olivelli <eo...@gmail.com>
wrote:

> Il giorno mer 28 apr 2021 alle ore 21:48 Li Wang <li...@gmail.com> ha
> scritto:
> >
> > Hi Enrico,
> >
> > Please see my comments inline.
> >
> > Thanks,
> >
> >  Li.
> >
> > On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eo...@gmail.com>
> > wrote:
> >
> > > Sorry for top posting.
> > >
> > > I have never seen in other applications that Prometheus has such a
> > > significant impact.
> >
> >
> > Please see my reply in the previous response.
> >
> >
> > >
> > > The first things that come into my mind:
> > > - collect a couple of dumps with some perf tool and dig into the
> problem
> >
> >
> > Heap dump, thread dump and CPU profiling with Prometheus enabled vs
> > disabled have been collected. That's how we found the issue.
> >
> >
> > > -  verify that we have the latest version of Prometheus client
> > >
> >
> > ZK uses 0.9.0 and the latest version of Prometheus client is 0.10.0.
> > Checked in the change logs, it doesn't look like there are major changes
> > between these two. https://github.com/prometheus/client_java/releases We
> > can upgrade the version and see if it makes any difference.
> >
> > - tune the few knobs we have in the Prometheus client
> > >
> >
> > What are the knobs we can tune? Can you provide more info on this?
>
> Unfortunately there is nothing we can tune directly with the zookeeper
> configuration file
>
> In ZK code we currently have only some basic configuration of
> summaries, and we are using mostly the default values
>
> https://github.com/apache/zookeeper/blob/11c07921c15e2fb7692375327b53f26a583b77ca/zookeeper-metrics-providers/zookeeper-prometheus-metrics/src/main/java/org/apache/zookeeper/metrics/prometheus/PrometheusMetricsProvider.java#L340
> there is no way to fine tune the buckets and all of the structures.
>
> we should work on looking for optimal values
>




>
>
> >
> > >
> > > In Apache Pulsar and in Apache Bookkeeper we have some customizations
> in
> > > the Prometheus metrics collectors, we could the a look and port those
> to
> > > Zookeeper (initially I worked on the Prometheus integration based on my
> > > usecases I have with Pulsar, Bookkeeper and other systems that already
> use
> > > Prometheus, but here in Zookeeper we are using the basic Prometheus
> client)
> > >
> >
> > Are there any customizations for minimizing the performance impact?  Is
> > there anything that we can port to ZK to help the issue?
>
> Yes, there is much we can port, this is the Prometheus Metrics
> provider in BookKeeper
>
> https://github.com/apache/bookkeeper/tree/master/bookkeeper-stats-providers/prometheus-metrics-provider/src/main/java/org/apache/bookkeeper/stats/prometheus
>
> Enrico
>
> >
> >
> > > Enrico
> > >
> > > Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha
> scritto:
> > >
> > > > Batching metrics reporting is very similar to option (c) but with
> locking
> > > > like option (a). That can usually be made faster by passing a
> reference
> > > to
> > > > the metrics accumulator to the reporting thread which can do the
> batch
> > > > update without locks. Usually requires ping-pong metrics
> accumulators so
> > > > that a thread can accumulate in one accumulator for a bit, pass that
> to
> > > the
> > > > merge thread and switch to using the alternate accumulator. Since all
> > > > threads typically report at the same time, this avoids a stampede on
> the
> > > > global accumulator.
> > > >
> > > >
> > > > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
> > > >
> > > > > batching metrics reporting can help. For example, in the
> > > CommitProcessor,
> > > > > increasing the maxCommitBatchSize helps improving the the
> performance
> > > of
> > > > > write operation.
> > > > >
> > > > >
> > > > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> > > > >
> > > > > > Yes, I am thinking that handling metrics reporting in a separate
> > > > thread,
> > > > > > so it doesn't impact the "main" thread.
> > > > > >
> > > > > > Not sure about the idea of merging at the end of a reporting
> period.
> > > > Can
> > > > > > you elaborate a bit on it?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Li
> > > > > >
> > > > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <
> ted.dunning@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > >> Would it help to keep per thread metrics that are either
> reported
> > > > > >> independently or are merged at the end of a reporting period?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com>
> wrote:
> > > > > >>
> > > > > >> > Hi Community,
> > > > > >> >
> > > > > >> > I've done further investigation on the issue and found the
> > > following
> > > > > >> >
> > > > > >> > 1. The perf of the read operation was decreased due to the
> lock
> > > > > >> contention
> > > > > >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > > > > CommitProcWorker
> > > > > >> > threads were blocked on the TimeWindowQuantiles.insert() API
> when
> > > > the
> > > > > >> test
> > > > > >> > was.
> > > > > >> >
> > > > > >> > 2. The perf of the write operation was decreased because of
> the
> > > high
> > > > > CPU
> > > > > >> > usage from Prometheus Summary type of metrics. The CPU usage
> of
> > > > > >> > CommitProcessor increased about 50% when Prometheus was
> disabled
> > > > > >> compared
> > > > > >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> > > > > >> >
> > > > > >> >
> > > > > >> > Prometheus integration is a great feature, however the
> negative
> > > > > >> performance
> > > > > >> > impact is very significant.  I wonder if anyone has any
> thoughts
> > > on
> > > > > how
> > > > > >> to
> > > > > >> > reduce the perf impact.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> >
> > > > > >> >
> > > > > >> > Li
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
> > > wrote:
> > > > > >> >
> > > > > >> > > Hi,
> > > > > >> > >
> > > > > >> > > I would like to reach out to the community to see if anyone
> has
> > > > some
> > > > > >> > > insights or experience with the performance impact of
> enabling
> > > > > >> prometheus
> > > > > >> > > metrics.
> > > > > >> > >
> > > > > >> > > I have done load comparison tests for Prometheus enabled vs
> > > > disabled
> > > > > >> and
> > > > > >> > > found the performance is reduced about 40%-60% for both
> read and
> > > > > write
> > > > > >> > > oeprations (i.e. getData, getChildren and createNode).
> > > > > >> > >
> > > > > >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> > > > > >> participants
> > > > > >> > > and 5 observers, each ZK server has 10G heap size and 4
> cpu, 500
> > > > > >> > concurrent
> > > > > >> > > users sending requests.
> > > > > >> > >
> > > > > >> > > The performance impact is quite significant.  I wonder if
> this
> > > is
> > > > > >> > expected
> > > > > >> > > and what are things we can do to have ZK performing the same
> > > while
> > > > > >> > > leveraging the new feature of Prometheus metric.
> > > > > >> > >
> > > > > >> > > Best,
> > > > > >> > >
> > > > > >> > > Li
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Enrico Olivelli <eo...@gmail.com>.
On Wed, Apr 28, 2021 at 21:48 Li Wang <li...@gmail.com> wrote:
>
> Hi Enrico,
>
> Please see my comments inline.
>
> Thanks,
>
>  Li.
>
> On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > Sorry for top posting.
> >
> > I have never seen in other applications that Prometheus has such a
> > significant impact.
>
>
> Please see my reply in the previous response.
>
>
> >
> > The first things that come into my mind:
> > - collect a couple of dumps with some perf tool and dig into the problem
>
>
> Heap dump, thread dump and CPU profiling with Prometheus enabled vs
> disabled have been collected. That's how we found the issue.
>
>
> > -  verify that we have the latest version of Prometheus client
> >
>
> ZK uses 0.9.0 and the latest version of Prometheus client is 0.10.0.
> Checked in the change logs, it doesn't look like there are major changes
> between these two. https://github.com/prometheus/client_java/releases We
> can upgrade the version and see if it makes any difference.
>
> - tune the few knobs we have in the Prometheus client
> >
>
> What are the knobs we can tune? Can you provide more info on this?

Unfortunately there is nothing we can tune directly through the ZooKeeper
configuration file.

In the ZK code we currently have only some basic configuration of the
summaries, and we mostly use the default values:
https://github.com/apache/zookeeper/blob/11c07921c15e2fb7692375327b53f26a583b77ca/zookeeper-metrics-providers/zookeeper-prometheus-metrics/src/main/java/org/apache/zookeeper/metrics/prometheus/PrometheusMetricsProvider.java#L340
There is no way to fine-tune the buckets and the other internal structures.

We should work on finding optimal values.
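
To make that concrete, here is a minimal sketch of the kind of tuning the
Prometheus simpleclient Summary builder would allow if the provider exposed
it; the metric name and the chosen values are made up for illustration, not
anything ZooKeeper configures today:

    import io.prometheus.client.CollectorRegistry;
    import io.prometheus.client.Summary;

    public class TunedSummarySketch {
        public static void main(String[] args) {
            // Hypothetical metric: fewer quantiles, a shorter window and fewer
            // age buckets all reduce the work done inside TimeWindowQuantiles.
            Summary latency = Summary.build()
                    .name("example_request_latency_ms")
                    .help("Request latency in milliseconds")
                    .quantile(0.5, 0.05)   // track only the median...
                    .quantile(0.99, 0.01)  // ...and the 99th percentile
                    .maxAgeSeconds(60)     // one-minute rolling window
                    .ageBuckets(2)         // two rotating buckets instead of five
                    .register(CollectorRegistry.defaultRegistry);

            latency.observe(12.3);         // record one sample
        }
    }

Fewer tracked quantiles, a shorter maxAgeSeconds and fewer ageBuckets shrink
the work done under the TimeWindowQuantiles lock, at the cost of coarser
percentile data.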


>
> >
> > In Apache Pulsar and in Apache Bookkeeper we have some customizations in
> > > the Prometheus metrics collectors, we could take a look and port those to
> > Zookeeper (initially I worked on the Prometheus integration based on my
> > usecases I have with Pulsar, Bookkeeper and other systems that already use
> > Prometheus, but here in Zookeeper we are using the basic Prometheus client)
> >
>
> Are there any customizations for minimizing the performance impact?  Is
> there anything that we can port to ZK to help the issue?

Yes, there is much we can port; this is the Prometheus Metrics
provider in BookKeeper:
https://github.com/apache/bookkeeper/tree/master/bookkeeper-stats-providers/prometheus-metrics-provider/src/main/java/org/apache/bookkeeper/stats/prometheus

Enrico

>
>
> > Enrico
> >
> > Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha scritto:
> >
> > > Batching metrics reporting is very similar to option (c) but with locking
> > > like option (a). That can usually be made faster by passing a reference
> > to
> > > the metrics accumulator to the reporting thread which can do the batch
> > > update without locks. Usually requires ping-pong metrics accumulators so
> > > that a thread can accumulate in one accumulator for a bit, pass that to
> > the
> > > merge thread and switch to using the alternate accumulator. Since all
> > > threads typically report at the same time, this avoids a stampede on the
> > > global accumulator.
> > >
> > >
> > > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
> > >
> > > > batching metrics reporting can help. For example, in the
> > CommitProcessor,
> > > > increasing the maxCommitBatchSize helps improve the performance
> > of
> > > > write operation.
> > > >
> > > >
> > > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> > > >
> > > > > Yes, I am thinking that handling metrics reporting in a separate
> > > thread,
> > > > > so it doesn't impact the "main" thread.
> > > > >
> > > > > Not sure about the idea of merging at the end of a reporting period.
> > > Can
> > > > > you elaborate a bit on it?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Li
> > > > >
> > > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> Would it help to keep per thread metrics that are either reported
> > > > >> independently or are merged at the end of a reporting period?
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> > > > >>
> > > > >> > Hi Community,
> > > > >> >
> > > > >> > I've done further investigation on the issue and found the
> > following
> > > > >> >
> > > > >> > 1. The perf of the read operation was decreased due to the lock
> > > > >> contention
> > > > >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > > > CommitProcWorker
> > > > >> > threads were blocked on the TimeWindowQuantiles.insert() API when
> > > the
> > > > >> test
> > > > >> > was.
> > > > >> >
> > > > >> > 2. The perf of the write operation was decreased because of the
> > high
> > > > CPU
> > > > >> > usage from Prometheus Summary type of metrics. The CPU usage of
> > > > >> > CommitProcessor increased about 50% when Prometheus was disabled
> > > > >> compared
> > > > >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> > > > >> >
> > > > >> >
> > > > >> > Prometheus integration is a great feature, however the negative
> > > > >> performance
> > > > >> > impact is very significant.  I wonder if anyone has any thoughts
> > on
> > > > how
> > > > >> to
> > > > >> > reduce the perf impact.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> >
> > > > >> > Li
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
> > wrote:
> > > > >> >
> > > > >> > > Hi,
> > > > >> > >
> > > > >> > > I would like to reach out to the community to see if anyone has
> > > some
> > > > >> > > insights or experience with the performance impact of enabling
> > > > >> prometheus
> > > > >> > > metrics.
> > > > >> > >
> > > > >> > > I have done load comparison tests for Prometheus enabled vs
> > > disabled
> > > > >> and
> > > > >> > > found the performance is reduced about 40%-60% for both read and
> > > > write
> > > > >> > > operations (i.e. getData, getChildren and createNode).
> > > > >> > >
> > > > >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> > > > >> participants
> > > > >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > > > >> > concurrent
> > > > >> > > users sending requests.
> > > > >> > >
> > > > >> > > The performance impact is quite significant.  I wonder if this
> > is
> > > > >> > expected
> > > > >> > > and what are things we can do to have ZK performing the same
> > while
> > > > >> > > leveraging the new feature of Prometheus metric.
> > > > >> > >
> > > > >> > > Best,
> > > > >> > >
> > > > >> > > Li
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Hi Enrico,

Please see my comments inline.

Thanks,

 Li.

On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eo...@gmail.com>
wrote:

> Sorry for top posting.
>
> I have never seen in other applications that Prometheus has such a
> significant impact.


Please see my reply in the previous response.


>
> The first things that come into my mind:
> - collect a couple of dumps with some perf tool and dig into the problem


Heap dump, thread dump and CPU profiling with Prometheus enabled vs
disabled have been collected. That's how we found the issue.


> -  verify that we have the latest version of Prometheus client
>

ZK uses 0.9.0, and the latest version of the Prometheus client is 0.10.0.
I checked the change logs, and it doesn't look like there are major changes
between the two: https://github.com/prometheus/client_java/releases
We can upgrade the version and see if it makes any difference.

- tune the few knobs we have in the Prometheus client
>

What are the knobs we can tune? Can you provide more info on this?

>
> In Apache Pulsar and in Apache Bookkeeper we have some customizations in
> the Prometheus metrics collectors, we could take a look and port those to
> Zookeeper (initially I worked on the Prometheus integration based on my
> usecases I have with Pulsar, Bookkeeper and other systems that already use
> Prometheus, but here in Zookeeper we are using the basic Prometheus client)
>

Are there any customizations for minimizing the performance impact? Is
there anything that we can port to ZK to help with the issue?


> Enrico
>
> Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha scritto:
>
> > Batching metrics reporting is very similar to option (c) but with locking
> > like option (a). That can usually be made faster by passing a reference
> to
> > the metrics accumulator to the reporting thread which can do the batch
> > update without locks. Usually requires ping-pong metrics accumulators so
> > that a thread can accumulate in one accumulator for a bit, pass that to
> the
> > merge thread and switch to using the alternate accumulator. Since all
> > threads typically report at the same time, this avoids a stampede on the
> > global accumulator.
> >
> >
> > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
> >
> > > batching metrics reporting can help. For example, in the
> CommitProcessor,
> > > increasing the maxCommitBatchSize helps improve the performance
> of
> > > write operation.
> > >
> > >
> > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> > >
> > > > Yes, I am thinking that handling metrics reporting in a separate
> > thread,
> > > > so it doesn't impact the "main" thread.
> > > >
> > > > Not sure about the idea of merging at the end of a reporting period.
> > Can
> > > > you elaborate a bit on it?
> > > >
> > > > Thanks,
> > > >
> > > > Li
> > > >
> > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com>
> > > wrote:
> > > >
> > > >> Would it help to keep per thread metrics that are either reported
> > > >> independently or are merged at the end of a reporting period?
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> > > >>
> > > >> > Hi Community,
> > > >> >
> > > >> > I've done further investigation on the issue and found the
> following
> > > >> >
> > > >> > 1. The perf of the read operation was decreased due to the lock
> > > >> contention
> > > >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > > CommitProcWorker
> > > >> > threads were blocked on the TimeWindowQuantiles.insert() API when
> > the
> > > >> test
> > > >> > was.
> > > >> >
> > > >> > 2. The perf of the write operation was decreased because of the
> high
> > > CPU
> > > >> > usage from Prometheus Summary type of metrics. The CPU usage of
> > > >> > CommitProcessor increased about 50% when Prometheus was disabled
> > > >> compared
> > > >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> > > >> >
> > > >> >
> > > >> > Prometheus integration is a great feature, however the negative
> > > >> performance
> > > >> > impact is very significant.  I wonder if anyone has any thoughts
> on
> > > how
> > > >> to
> > > >> > reduce the perf impact.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> >
> > > >> > Li
> > > >> >
> > > >> >
> > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com>
> wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I would like to reach out to the community to see if anyone has
> > some
> > > >> > > insights or experience with the performance impact of enabling
> > > >> prometheus
> > > >> > > metrics.
> > > >> > >
> > > >> > > I have done load comparison tests for Prometheus enabled vs
> > disabled
> > > >> and
> > > >> > > found the performance is reduced about 40%-60% for both read and
> > > write
> > > >> > > operations (i.e. getData, getChildren and createNode).
> > > >> > >
> > > >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> > > >> participants
> > > >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > > >> > concurrent
> > > >> > > users sending requests.
> > > >> > >
> > > >> > > The performance impact is quite significant.  I wonder if this
> is
> > > >> > expected
> > > >> > > and what are things we can do to have ZK performing the same
> while
> > > >> > > leveraging the new feature of Prometheus metric.
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Li
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Enrico Olivelli <eo...@gmail.com>.
Sorry for top posting.

I have never seen in other applications that Prometheus has such a
significant impact.

The first things that come into my mind:
- collect a couple of dumps with some perf tool and dig into the problem
-  verify that we have the latest version of Prometheus client
- tune the few knobs we have in the Prometheus client

In Apache Pulsar and in Apache Bookkeeper we have some customizations in
the Prometheus metrics collectors; we could take a look and port those to
Zookeeper (initially I worked on the Prometheus integration based on the
use cases I have with Pulsar, Bookkeeper and other systems that already use
Prometheus, but here in Zookeeper we are using the basic Prometheus client).

Enrico

Il Mar 27 Apr 2021, 06:35 Ted Dunning <te...@gmail.com> ha scritto:

> Batching metrics reporting is very similar to option (c) but with locking
> like option (a). That can usually be made faster by passing a reference to
> the metrics accumulator to the reporting thread which can do the batch
> update without locks. Usually requires ping-pong metrics accumulators so
> that a thread can accumulate in one accumulator for a bit, pass that to the
> merge thread and switch to using the alternate accumulator. Since all
> threads typically report at the same time, this avoids a stampede on the
> global accumulator.
>
>
> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:
>
> > batching metrics reporting can help. For example, in the CommitProcessor,
> > increasing the maxCommitBatchSize helps improve the performance of
> > write operation.
> >
> >
> > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
> >
> > > Yes, I am thinking that handling metrics reporting in a separate
> thread,
> > > so it doesn't impact the "main" thread.
> > >
> > > Not sure about the idea of merging at the end of a reporting period.
> Can
> > > you elaborate a bit on it?
> > >
> > > Thanks,
> > >
> > > Li
> > >
> > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > >> Would it help to keep per thread metrics that are either reported
> > >> independently or are merged at the end of a reporting period?
> > >>
> > >>
> > >>
> > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> > >>
> > >> > Hi Community,
> > >> >
> > >> > I've done further investigation on the issue and found the following
> > >> >
> > >> > 1. The perf of the read operation was decreased due to the lock
> > >> contention
> > >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > CommitProcWorker
> > >> > threads were blocked on the TimeWindowQuantiles.insert() API when
> the
> > >> test
> > >> > was.
> > >> >
> > >> > 2. The perf of the write operation was decreased because of the high
> > CPU
> > >> > usage from Prometheus Summary type of metrics. The CPU usage of
> > >> > CommitProcessor increased about 50% when Prometheus was disabled
> > >> compared
> > >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> > >> >
> > >> >
> > >> > Prometheus integration is a great feature, however the negative
> > >> performance
> > >> > impact is very significant.  I wonder if anyone has any thoughts on
> > how
> > >> to
> > >> > reduce the perf impact.
> > >> >
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> >
> > >> > Li
> > >> >
> > >> >
> > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I would like to reach out to the community to see if anyone has
> some
> > >> > > insights or experience with the performance impact of enabling
> > >> prometheus
> > >> > > metrics.
> > >> > >
> > >> > > I have done load comparison tests for Prometheus enabled vs
> disabled
> > >> and
> > >> > > found the performance is reduced about 40%-60% for both read and
> > write
> > >> > > operations (i.e. getData, getChildren and createNode).
> > >> > >
> > >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> > >> participants
> > >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > >> > concurrent
> > >> > > users sending requests.
> > >> > >
> > >> > > The performance impact is quite significant.  I wonder if this is
> > >> > expected
> > >> > > and what are things we can do to have ZK performing the same while
> > >> > > leveraging the new feature of Prometheus metric.
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Li
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Ted Dunning <te...@gmail.com>.
Batching metrics reporting is very similar to option (c) but with locking
like option (a). That can usually be made faster by passing a reference to
the metrics accumulator to the reporting thread which can do the batch
update without locks. Usually requires ping-pong metrics accumulators so
that a thread can accumulate in one accumulator for a bit, pass that to the
merge thread and switch to using the alternate accumulator. Since all
threads typically report at the same time, this avoids a stampede on the
global accumulator.
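
A rough sketch of that hand-off, with made-up class names (not ZooKeeper
code): the worker updates a lock-free local accumulator, and the merge thread
swaps in a fresh one whenever it wants to fold the numbers into the global
metrics:

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.LongAdder;

    // Per-worker accumulator, updated without locks on the request path.
    class LocalStats {
        final LongAdder count = new LongAdder();
        final LongAdder totalLatencyMs = new LongAdder();
    }

    class PingPongRecorder {
        private final AtomicReference<LocalStats> active =
                new AtomicReference<>(new LocalStats());

        // Called by the owning worker thread while serving requests.
        void record(long latencyMs) {
            LocalStats s = active.get();
            s.count.increment();
            s.totalLatencyMs.add(latencyMs);
        }

        // Called by the merge thread once per interval: swap in a fresh
        // accumulator and fold the drained one into the global metrics,
        // away from the request path.
        LocalStats drain() {
            return active.getAndSet(new LocalStats());
        }
    }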


On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li...@gmail.com> wrote:

> batching metrics reporting can help. For example, in the CommitProcessor,
> increasing the maxCommitBatchSize helps improve the performance of
> write operation.
>
>
> On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:
>
> > Yes, I am thinking that handling metrics reporting in a separate thread,
> > so it doesn't impact the "main" thread.
> >
> > Not sure about the idea of merging at the end of a reporting period. Can
> > you elaborate a bit on it?
> >
> > Thanks,
> >
> > Li
> >
> > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Would it help to keep per thread metrics that are either reported
> >> independently or are merged at the end of a reporting period?
> >>
> >>
> >>
> >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> >>
> >> > Hi Community,
> >> >
> >> > I've done further investigation on the issue and found the following
> >> >
> >> > 1. The perf of the read operation was decreased due to the lock
> >> contention
> >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> CommitProcWorker
> >> > threads were blocked on the TimeWindowQuantiles.insert() API when the
> >> test
> >> > was.
> >> >
> >> > 2. The perf of the write operation was decreased because of the high
> CPU
> >> > usage from Prometheus Summary type of metrics. The CPU usage of
> >> > CommitProcessor increased about 50% when Prometheus was disabled
> >> compared
> >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> >> >
> >> >
> >> > Prometheus integration is a great feature, however the negative
> >> performance
> >> > impact is very significant.  I wonder if anyone has any thoughts on
> how
> >> to
> >> > reduce the perf impact.
> >> >
> >> >
> >> >
> >> > Thanks,
> >> >
> >> >
> >> > Li
> >> >
> >> >
> >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I would like to reach out to the community to see if anyone has some
> >> > > insights or experience with the performance impact of enabling
> >> prometheus
> >> > > metrics.
> >> > >
> >> > > I have done load comparison tests for Prometheus enabled vs disabled
> >> and
> >> > > found the performance is reduced about 40%-60% for both read and
> write
> >> > > operations (i.e. getData, getChildren and createNode).
> >> > >
> >> > > The load test was done with Zookeeper 3.7, cluster size of 5
> >> participants
> >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> >> > concurrent
> >> > > users sending requests.
> >> > >
> >> > > The performance impact is quite significant.  I wonder if this is
> >> > expected
> >> > > and what are things we can do to have ZK performing the same while
> >> > > leveraging the new feature of Prometheus metric.
> >> > >
> >> > > Best,
> >> > >
> >> > > Li
> >> > >
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Batching metrics reporting can help. For example, in the CommitProcessor,
increasing the maxCommitBatchSize helps improve the performance of
write operations.


On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:

> Yes, I am thinking that handling metrics reporting in a separate thread,
> so it doesn't impact the "main" thread.
>
> Not sure about the idea of merging at the end of a reporting period. Can
> you elaborate a bit on it?
>
> Thanks,
>
> Li
>
> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com> wrote:
>
>> Would it help to keep per thread metrics that are either reported
>> independently or are merged at the end of a reporting period?
>>
>>
>>
>> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
>>
>> > Hi Community,
>> >
>> > I've done further investigation on the issue and found the following
>> >
>> > 1. The perf of the read operation was decreased due to the lock
>> contention
>> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
>> > threads were blocked on the TimeWindowQuantiles.insert() API when the
>> test
>> > was.
>> >
>> > 2. The perf of the write operation was decreased because of the high CPU
>> > usage from Prometheus Summary type of metrics. The CPU usage of
>> > CommitProcessor increased about 50% when Prometheus was disabled
>> compared
>> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
>> >
>> >
>> > Prometheus integration is a great feature, however the negative
>> performance
>> > impact is very significant.  I wonder if anyone has any thoughts on how
>> to
>> > reduce the perf impact.
>> >
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Li
>> >
>> >
>> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I would like to reach out to the community to see if anyone has some
>> > > insights or experience with the performance impact of enabling
>> prometheus
>> > > metrics.
>> > >
>> > > I have done load comparison tests for Prometheus enabled vs disabled
>> and
>> > > found the performance is reduced about 40%-60% for both read and write
>> > > operations (i.e. getData, getChildren and createNode).
>> > >
>> > > The load test was done with Zookeeper 3.7, cluster size of 5
>> participants
>> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
>> > concurrent
>> > > users sending requests.
>> > >
>> > > The performance impact is quite significant.  I wonder if this is
>> > expected
>> > > and what are things we can do to have ZK performing the same while
>> > > leveraging the new feature of Prometheus metric.
>> > >
>> > > Best,
>> > >
>> > > Li
>> > >
>> > >
>> > >
>> > >
>> >
>>
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Ted Dunning <te...@gmail.com>.
There are three patterns I have seen:

a) shared object that all threads update with locks (I think that this is
what is causing the slowdown)

b) message queue to separate thread with a metrics object. I think that
this is what you suggested last. This can be higher performance than (a)
because a queue can be constructed to be lockless, but it still takes a hit
because of NUMA locality and cache spills

c) each thread being measured has a separate metrics object. At the end of
some reporting interval, these objects are passed to a consolidation thread
which merges the metrics into a single report back to the metrics
framework. This is the alternative that I was suggesting. This is usually
faster than (b) because the cost of the queue is paid rarely.

In each case, I am assuming that you don't report metrics instantly, but
accumulate things like query count and total latency and size and so on
until you want to make a report (say, every minute or every 10 seconds).
This makes the metrics easier to accumulate, but still allows you to talk
about averages and totals. If you want a real distribution of some quantity
like latency or query size or queue size, you can keep an HdrHistogram or
LogHistogram (for latencies) or a t-digest (for measurements with arbitrary
distribution).

Does that help?
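
For anyone who wants to picture option (c), here is a minimal sketch with
assumed names (not existing ZooKeeper code): each thread owns its own
counters, and a single consolidation task produces one report per interval:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Option (c): every thread owns its own counters; one consolidation task
    // merges them into a single report per interval.
    class PerThreadMetrics {
        static class Slot {
            final LongAdder requests = new LongAdder();
            final LongAdder totalLatencyMs = new LongAdder();
        }

        private final Map<Thread, Slot> slots = new ConcurrentHashMap<>();
        private final ScheduledExecutorService reporter =
                Executors.newSingleThreadScheduledExecutor();

        // Hot path: no locks, no sharing with other request threads.
        Slot forCurrentThread() {
            return slots.computeIfAbsent(Thread.currentThread(), t -> new Slot());
        }

        void start() {
            reporter.scheduleAtFixedRate(() -> {
                long requests = 0;
                long latencyMs = 0;
                for (Slot s : slots.values()) {
                    requests += s.requests.sumThenReset();
                    latencyMs += s.totalLatencyMs.sumThenReset();
                }
                // Hand the consolidated numbers to the metrics framework here.
                System.out.printf("interval: %d requests, %d ms total latency%n",
                        requests, latencyMs);
            }, 10, 10, TimeUnit.SECONDS);
        }
    }

The request threads never contend on a shared lock; the only cross-thread
traffic is the periodic sumThenReset() done by the reporting task.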





On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li...@gmail.com> wrote:

> Yes, I am thinking that handling metrics reporting in a separate thread, so
> it doesn't impact the "main" thread.
>
> Not sure about the idea of merging at the end of a reporting period. Can
> you elaborate a bit on it?
>
> Thanks,
>
> Li
>
> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com> wrote:
>
> > Would it help to keep per thread metrics that are either reported
> > independently or are merged at the end of a reporting period?
> >
> >
> >
> > On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
> >
> > > Hi Community,
> > >
> > > I've done further investigation on the issue and found the following
> > >
> > > 1. The perf of the read operation was decreased due to the lock
> > contention
> > > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
> > > threads were blocked on the TimeWindowQuantiles.insert() API when the
> > test
> > > was.
> > >
> > > 2. The perf of the write operation was decreased because of the high
> CPU
> > > usage from Prometheus Summary type of metrics. The CPU usage of
> > > CommitProcessor increased about 50% when Prometheus was disabled
> compared
> > > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> > >
> > >
> > > Prometheus integration is a great feature, however the negative
> > performance
> > > impact is very significant.  I wonder if anyone has any thoughts on how
> > to
> > > reduce the perf impact.
> > >
> > >
> > >
> > > Thanks,
> > >
> > >
> > > Li
> > >
> > >
> > > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to reach out to the community to see if anyone has some
> > > > insights or experience with the performance impact of enabling
> > prometheus
> > > > metrics.
> > > >
> > > > I have done load comparison tests for Prometheus enabled vs disabled
> > and
> > > > found the performance is reduced about 40%-60% for both read and
> write
> > > > operations (i.e. getData, getChildren and createNode).
> > > >
> > > > The load test was done with Zookeeper 3.7, cluster size of 5
> > participants
> > > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > > concurrent
> > > > users sending requests.
> > > >
> > > > The performance impact is quite significant.  I wonder if this is
> > > expected
> > > > and what are things we can do to have ZK performing the same while
> > > > leveraging the new feature of Prometheus metric.
> > > >
> > > > Best,
> > > >
> > > > Li
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Yes, I am thinking of handling metrics reporting in a separate thread, so
it doesn't impact the "main" thread.
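
A rough sketch of that idea, with made-up names just to illustrate: workers
only enqueue raw samples, and a single background thread feeds the metrics
objects off the hot path:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Workers only enqueue raw samples; a single background thread updates the
    // metrics objects, so the request path never touches the collectors directly.
    class AsyncMetricsReporter implements Runnable {
        private final BlockingQueue<Long> latencies = new ArrayBlockingQueue<>(100_000);

        // Hot path: drop the sample rather than block if the queue is full.
        void record(long latencyMs) {
            latencies.offer(latencyMs);
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    long latencyMs = latencies.take();
                    // e.g. summary.observe(latencyMs) would happen here, off the hot path.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }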

Not sure about the idea of merging at the end of a reporting period. Can
you elaborate a bit on it?

Thanks,

Li

On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <te...@gmail.com> wrote:

> Would it help to keep per thread metrics that are either reported
> independently or are merged at the end of a reporting period?
>
>
>
> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:
>
> > Hi Community,
> >
> > I've done further investigation on the issue and found the following
> >
> > 1. The perf of the read operation was decreased due to the lock
> contention
> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
> > threads were blocked on the TimeWindowQuantiles.insert() API when the
> test
> > was.
> >
> > 2. The perf of the write operation was decreased because of the high CPU
> > usage from Prometheus Summary type of metrics. The CPU usage of
> > CommitProcessor increased about 50% when Prometheus was disabled compared
> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
> >
> >
> > Prometheus integration is a great feature, however the negative
> performance
> > impact is very significant.  I wonder if anyone has any thoughts on how
> to
> > reduce the perf impact.
> >
> >
> >
> > Thanks,
> >
> >
> > Li
> >
> >
> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I would like to reach out to the community to see if anyone has some
> > > insights or experience with the performance impact of enabling
> prometheus
> > > metrics.
> > >
> > > I have done load comparison tests for Prometheus enabled vs disabled
> and
> > > found the performance is reduced about 40%-60% for both read and write
> > > operations (i.e. getData, getChildren and createNode).
> > >
> > > The load test was done with Zookeeper 3.7, cluster size of 5
> participants
> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> > concurrent
> > > users sending requests.
> > >
> > > The performance impact is quite significant.  I wonder if this is
> > expected
> > > and what are things we can do to have ZK performing the same while
> > > leveraging the new feature of Prometheus metric.
> > >
> > > Best,
> > >
> > > Li
> > >
> > >
> > >
> > >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Ted Dunning <te...@gmail.com>.
Would it help to keep per thread metrics that are either reported
independently or are merged at the end of a reporting period?



On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li...@gmail.com> wrote:

> Hi Community,
>
> I've done further investigation on the issue and found the following
>
> 1. The perf of the read operation was decreased due to the lock contention
> in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
> threads were blocked on the TimeWindowQuantiles.insert() API when the test
> was.
>
> 2. The perf of the write operation was decreased because of the high CPU
> usage from Prometheus Summary type of metrics. The CPU usage of
> CommitProcessor increased about 50% when Prometheus was disabled compared
> to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU).
>
>
> Prometheus integration is a great feature, however the negative performance
> impact is very significant.  I wonder if anyone has any thoughts on how to
> reduce the perf impact.
>
>
>
> Thanks,
>
>
> Li
>
>
> On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to reach out to the community to see if anyone has some
> > insights or experience with the performance impact of enabling prometheus
> > metrics.
> >
> > I have done load comparison tests for Prometheus enabled vs disabled and
> > found the performance is reduced about 40%-60% for both read and write
> > operations (i.e. getData, getChildren and createNode).
> >
> > The load test was done with Zookeeper 3.7, cluster size of 5 participants
> > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500
> concurrent
> > users sending requests.
> >
> > The performance impact is quite significant.  I wonder if this is
> expected
> > and what are things we can do to have ZK performing the same while
> > leveraging the new feature of Prometheus metric.
> >
> > Best,
> >
> > Li
> >
> >
> >
> >
>

Re: Performance impact of enabling Prometheus Metrics

Posted by Li Wang <li...@gmail.com>.
Hi Community,

I've done further investigation on the issue and found the following

1. The perf of the read operation was decreased due to lock contention
in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
threads were blocked on the TimeWindowQuantiles.insert() API while the test
was running.

2. The perf of the write operation was decreased because of the high CPU
usage from the Prometheus Summary type of metrics. The CPU usage of the
CommitProcessor increased by about 50% when Prometheus was enabled
(disabled vs enabled: 46% vs 80% with 4 CPUs, 63% vs 99% with 12 CPUs).


Prometheus integration is a great feature; however, the negative performance
impact is very significant. I wonder if anyone has any thoughts on how to
reduce the perf impact.
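
One idea worth exploring, offered only as a sketch and not as anything
ZooKeeper does today, is to prefer Histogram over Summary for hot-path
latency metrics; as far as I can tell, the simpleclient Histogram's observe()
only increments per-bucket adders and never touches the TimeWindowQuantiles
lock:

    import io.prometheus.client.Histogram;

    public class HistogramInsteadOfSummary {
        // Bucket boundaries (milliseconds) chosen arbitrarily for illustration.
        private static final Histogram LATENCY = Histogram.build()
                .name("example_request_latency_ms")
                .help("Request latency in milliseconds")
                .buckets(1, 2, 5, 10, 25, 50, 100, 250, 500, 1000)
                .register();

        public static void main(String[] args) {
            // observe() only increments lock-free adders, so concurrent worker
            // threads do not block each other the way Summary quantile inserts can.
            LATENCY.observe(7.5);
        }
    }

The trade-off is that percentiles are then estimated from fixed buckets
(typically on the Prometheus server via histogram_quantile) instead of being
computed exactly in the JVM.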



Thanks,


Li


On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li...@gmail.com> wrote:

> Hi,
>
> I would like to reach out to the community to see if anyone has some
> insights or experience with the performance impact of enabling prometheus
> metrics.
>
> I have done load comparison tests for Prometheus enabled vs disabled and
> found the performance is reduced about 40%-60% for both read and write
> operations (i.e. getData, getChildren and createNode).
>
> The load test was done with Zookeeper 3.7, cluster size of 5 participants
> and 5 observers, each ZK server has 10G heap size and 4 cpu, 500 concurrent
> users sending requests.
>
> The performance impact is quite significant.  I wonder if this is expected
> and what are things we can do to have ZK performing the same while
> leveraging the new feature of Prometheus metric.
>
> Best,
>
> Li
>
>
>
>