Posted to dev@pulsar.apache.org by Asaf Mesika <as...@gmail.com> on 2022/10/03 08:35:36 UTC

[DISCUSS] Pulsar Metrics - Current State and Future Directions

Hi All,

I would like to share a document I wrote over the last few months,
titled Pulsar Metrics - Current State and Future Directions
<https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing>,
and, most importantly, *get your feedback.*

The initial motivation is to rethink/refactor the way metrics are used in
the Pulsar codebase to solve two large pain points:

1. *Metrics Cardinality:* Pulsar can support up to 1M topics
cluster-wide, which translates into ~100M unique time series. That is
prohibitively expensive and hurts both query performance and the general
usability of metrics. The issue starts surfacing even at 50k-100k topics.

Today, users work around it by disabling topic-level metrics and
scripting their own ETL (based on the admin stats API) to generate metrics
they can use, switching from granular topic-level metrics to a group-by
view of their choosing.

The document outlines a solution built upon the notion of Groups, in which
users can define a group of metrics and specify whether to apply a roll-up
to it (i.e., remove labels) and a filter (i.e., remove specific metrics).
The solution should be able to bring the granularity from topic level (1M)
to group level (1000).
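
As a rough illustration of the roll-up and filter idea above, here is a
minimal sketch in Python. It is not the design from the document: the metric
names, the "topic" and "group" labels, and the roll_up helper are all
hypothetical, and it assumes that summing is a valid aggregation for the
metrics involved (true for counters, not for every metric type).

```python
from collections import defaultdict

# Hypothetical samples: (metric name, labels, value). Names are illustrative only.
samples = [
    ("msg_in_total", {"topic": "orders-1", "group": "orders"}, 10),
    ("msg_in_total", {"topic": "orders-2", "group": "orders"}, 5),
    ("msg_in_total", {"topic": "billing-1", "group": "billing"}, 7),
    ("storage_size", {"topic": "orders-1", "group": "orders"}, 100),
]


def roll_up(samples, drop_labels=(), drop_metrics=()):
    """Sum samples after removing drop_labels; skip metrics named in drop_metrics."""
    totals = defaultdict(int)
    for name, labels, value in samples:
        if name in drop_metrics:
            continue  # filter: remove specific metrics entirely
        # roll-up: series that differ only in the dropped labels collapse into one
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[(name, kept)] += value
    return dict(totals)


# Drop the per-topic label and filter out one metric: 4 input series become 2.
rolled = roll_up(samples, drop_labels={"topic"}, drop_metrics={"storage_size"})
for (name, labels), value in sorted(rolled.items()):
    print(name, dict(labels), value)
```

The number of output series is bounded by the number of groups rather than
the number of topics, which is the reduction the paragraph above describes.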

2. *Consolidate into a single library:* Today there are 4 different metrics
libraries/systems in Pulsar. This creates a lot of confusion and a poor
developer experience, among other impacts. Also, achieving (1) requires
having (2).

The document outlines the different libraries, their functionality and the
problems they create. The doc also describes one idea for such a library,
but it still requires a POC.


The main goal of the document is to gather feedback: to see whether the
directions proposed there are agreed upon, and whether any problem or
existing functionality has been missed, since the document serves as the
basis for the requirements of the solution that will be chosen.

Thanks!

Asaf Mesika

Document link:
https://docs.google.com/document/d/1vke4w1nt7EEgOvEerPEUS-Al3aqLTm9cl2wTBkKNXUA/edit?usp=sharing

Re: [DISCUSS] Pulsar Metrics - Current State and Future Directions

Posted by Asaf Mesika <as...@gmail.com>.
Thanks, Michael, for taking the time to read it. I've added your suggestions
to the bottom of the document, and I will use them when the time comes to
create several PIPs for this.

Any other feedback on this from the community would be greatly appreciated!


On Fri, Oct 14, 2022 at 7:23 AM Michael Marshall <mm...@apache.org>
wrote:

> Hi Asaf,
>
> This is a great topic for discussion, and your document is extremely
> thorough! I agree with the general proposal to improve Pulsar's
> metrics.
>
> > *Metrics Cardinality: *
>
> +100. If we want to scale Pulsar (and I do!), we need to make this manageable.
>
> > *Consolidate into a single library:*
>
> This makes sense to me, and it ensures that new metrics will not be in
> one API but not another.
>
> I haven't read the whole doc, but I did read the suggested
> improvements. Here are some additional improvements that I've thought
> about before.
>
> Are there any metrics we can drop? This would definitely require a
> community effort to verify, but I think it could prove valuable.
>
> Can we make the number of histogram buckets configurable? I proposed
> this here [0].
>
> Would it be possible to produce a script to help users convert
> existing Grafana dashboards to work with the new metrics?
>
> Finally, it'd be great to create a metrics section in the contributors
> guide when you've completed your work. That will help existing and new
> contributors adjust to the new style.
>
> Thanks,
> Michael
>
> [0] https://github.com/apache/pulsar/issues/12069

Re: [DISCUSS] Pulsar Metrics - Current State and Future Directions

Posted by Michael Marshall <mm...@apache.org>.
Hi Asaf,

This is a great topic for discussion, and your document is extremely
thorough! I agree with the general proposal to improve Pulsar's
metrics.

> *Metrics Cardinality: *

+100. If we want to scale Pulsar (and I do!), we need to make this manageable.

> *Consolidate into a single library:*

This makes sense to me, and it ensures that new metrics will not be in
one API but not another.

I haven't read the whole doc, but I did read the suggested
improvements. Here are some additional improvements that I've thought
about before.

Are there any metrics we can drop? This would definitely require a
community effort to verify, but I think it could prove valuable.

Can we make the number of histogram buckets configurable? I proposed
this here [0].

Would it be possible to produce a script to help users convert
existing Grafana dashboards to work with the new metrics?

Finally, it'd be great to create a metrics section in the contributors
guide when you've completed your work. That will help existing and new
contributors adjust to the new style.

Thanks,
Michael

[0] https://github.com/apache/pulsar/issues/12069
