You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pulsar.apache.org by Matteo Merli <ma...@gmail.com> on 2024/03/03 03:09:06 UTC

[DISCUSS] PIP-342: Support OpenTelemetry metrics in Pulsar client

PIP PR: https://github.com/apache/pulsar/pull/22178

WIP of proposed implementation: https://github.com/apache/pulsar/pull/22179

--------------------

# PIP 342: Support OpenTelemetry metrics in Pulsar client

## Motivation

Current support for metric instrumentation in Pulsar client is very limited
and poses a lot of
issues for integrating the metrics into any telemetry system.

We have 2 ways that metrics are exposed today:

1. Printing logs every 1 minute: While this is ok as it comes out of the
box, it's very hard for
   any application to get the data or use it in any meaningful way.
2. `producer.getStats()` or `consumer.getStats()`: Calling these methods
will get access to
   the rate of events in the last 1-minute interval. This is problematic
because out of the
   box the metrics are not collected anywhere. One would have to start its
own thread to
   periodically check these values and export them to some other system.

Neither of these mechanism that we have today are sufficient to enable
application to easily
export the telemetry data of Pulsar client SDK.

## Goal

Provide a good way for applications to retrieve and analyze the usage of
Pulsar client operation,
in particular with respect to:

1. Maximizing compatibility with existing telemetry systems
2. Minimizing the effort required to export these metrics

## Why OpenTelemetry?

[OpenTelemetry](https://opentelemetry.io/) is quickly becoming the de-facto
standard API for metric and
tracing instrumentation. In fact, as part of [PIP-264](
https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
we are already migrating the Pulsar server side metrics to use
OpenTelemetry.

For Pulsar client SDK, we need to provide a similar way for application
builder to quickly integrate and
export Pulsar metrics.

### Why exposing OpenTelemetry directly in Pulsar API

When deciding how to expose the metrics exporter configuration there are
multiple options:

 1. Accept an `OpenTelemetry` object directly in Pulsar API
 2. Build a pluggable interface that describe all the Pulsar client SDK
events and allow application to
    provide an implementation, perhaps providing an OpenTelemetry included
option.

For this proposal, we are following the (1) option. Here are the reasons:

 1. In a way, OpenTelemetry can be compared to [SLF4J](
https://www.slf4j.org/), in the sense that it provides an API
    on top of which different vendor can build multiple implementations.
Therefore, there is no need to create a new
    Pulsar-specific interface
 2. OpenTelemetry has 2 main artifacts: API and SDK. For the context of
Pulsar client, we will only depend on its
    API. Applications that are going to use OpenTelemetry, will include the
OTel SDK
 3. Providing a custom interface has several drawbacks:
     1. Applications need to update their implementations every time a new
metric is added in Pulsar SDK
     2. The surface of this plugin API can become quite big when there are
several metrics
     3. If we imagine an application that uses multiple libraries, like
Pulsar SDK, and each of these has its own
        custom way to expose metrics, we can see the level of integration
burden that is pushed to application
        developers
 4. It will always be easy to use OpenTelemetry to collect the metrics and
export them using a custom metrics API. There
    are several examples of this in OpenTelemetry documentation.

## Public API changes

### Enabling OpenTelemetry

When building a `PulsarClient` instance, it will be possible to pass an
`OpenTelemetry` object:

```java
interface ClientBuilder {
    // ...
    ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry
openTelemetry);

    ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality
metricsCardinality);
}
```

The common usage for an application would be something like:

```java
// Creates a OpenTelemetry instance using environment variables to
configure it
OpenTelemetry otel=AutoConfiguredOpenTelemetrySdk.builder()
        .build().getOpenTelemetrySdk();

        PulsarClient client=PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build();

// ....
```

Cardinality enum will allow to select a default cardinality label to be
attached to the
metrics:

```java
public enum MetricsCardinality {
    /**
     * Do not add additional labels to metrics
     */
    None,

    /**
     * Label metrics by tenant
     */
    Tenant,

    /**
     * Label metrics by tenant and namespace
     */
    Namespace,

    /**
     * Label metrics by topic
     */
    Topic,

    /**
     * Label metrics by each partition
     */
    Partition,
}
```

The labels are addictive. For example, selecting `Topic` level would mean
that the metrics will be
labeled like:

```
pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"}
149.0
```

While selecting `Namespace` level:

```
pulsar_client_received_total{namespace="public/default",tenant="public"}
149.0
```

### Deprecating the old stats methods

The old way of collecting stats will be disabled by default, deprecated and
eventually removed
in Pulsar 4.0 release.

Methods to deprecate:

```java
interface ClientBuilder {
    // ...
    @Deprecated
    ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
}

interface Producer {
    @Deprecated
    ProducerStats getStats();
}

interface Consumer {
    @Deprecated
    ConsumerStats getStats();
}
```

## Initial set of metrics to include

Based on the experience of Pulsar Go client SDK metrics (
see:
https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go
),
this is the proposed initial set of metrics to export.

Additional metrics could be added later on, though it's better to start
with the set of most important metrics
and then evaluate any missing information.

| OTel metric name                                | Type      | Unit
 | Description
                       |
|-------------------------------------------------|-----------|-------------|------------------------------------------------------------------------------------------------|
| `pulsar.client.connections.opened`              | Counter   | connections
| Counter of connections opened
                     |
| `pulsar.client.connections.closed`              | Counter   | connections
| Counter of connections closed
                     |
| `pulsar.client.connections.failed`              | Counter   | connections
| Counter of connections establishment failures
                     |
| `pulsar.client.session.opened`                  | Counter   | sessions
 | Counter of sessions opened. `type="producer"` or `consumer`
                       |
| `pulsar.client.session.closed`                  | Counter   | sessions
 | Counter of sessions closed. `type="producer"` or `consumer`
                       |
| `pulsar.client.received`                        | Counter   | messages
 | Number of messages received
                       |
| `pulsar.client.received`                        | Counter   | bytes
| Number of bytes received
                      |
| `pulsar.client.consumer.preteched.messages`     | Gauge     | messages
 | Number of messages currently sitting in the consumer pre-fetch queue
                      |
| `pulsar.client.consumer.preteched`              | Gauge     | bytes
| Total number of bytes currently sitting in the consumer pre-fetch queue
                     |
| `pulsar.client.consumer.ack`                    | Counter   | messages
 | Number of ack operations
                      |
| `pulsar.client.consumer.nack`                   | Counter   | messages
 | Number of negative ack operations
                       |
| `pulsar.client.consumer.dlq`                    | Counter   | messages
 | Number of messages sent to DLQ
                      |
| `pulsar.client.consumer.ack.timeout`            | Counter   | messages
 | Number of ack timeouts events
                       |
| `pulsar.client.producer.latency`                | Histogram | seconds
| Publish latency experienced by the application, includes client batching
time                  |
| `pulsar.client.producer.rpc.latency`            | Histogram | seconds
| Publish RPC latency experienced internally by the client when sending
data to receiving an ack |
| `pulsar.client.producer.published`              | Counter   | bytes
| Bytes published
                     |
| `pulsar.client.producer.pending.messages.count` | Gauge     | messages
 | Pending messages for this producer
                      |
| `pulsar.client.producer.pending.count`          | Gauge     | bytes
| Pending bytes for this producer
                     |

Topic lookup metric will be differentiated by the lookup type label and by
the lookup transport
mechanism (`transport-type="binary|http"`):

| OTel metric name                           | Type      | Unit    |
Description                                      |
|--------------------------------------------|-----------|---------|--------------------------------------------------|
| `pulsar.client.lookup{type="topic"}`       | Histogram | seconds |
Counter of topic lookup operations               |
| `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds |
Counter of topic partitioned metadata operations |
| `pulsar.client.lookup{type="schema"}`      | Histogram | seconds |
Counter of schema retrieval operations           |
| `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds |
Counter of namespace list topics operations      |

Additionally, all the histograms will have a `success=true|false` label to
distinguish successful and failed
operations.




--
Matteo Merli
<ma...@gmail.com>

Re: [DISCUSS] PIP-342: Support OpenTelemetry metrics in Pulsar client

Posted by Matteo Merli <ma...@gmail.com>.
I have updated the proposal (and the WIP PR) based on the very helpful
feedback received. I will start a VOTE thread for this PIP.

Thanks.


--
Matteo Merli
<ma...@gmail.com>


On Sun, Mar 3, 2024 at 5:14 AM 太上玄元道君 <da...@apache.org> wrote:

> +1
>
> Enrico Olivelli <eo...@gmail.com>于2024年3月3日 周日16:58写道:
>
> > I support the initiative
> >
> > Lgtm
> >
> >
> > Thanks
> > Enrico
> >
> > Enrico
> >
> > Il Dom 3 Mar 2024, 04:09 Matteo Merli <ma...@gmail.com> ha
> scritto:
> >
> > > PIP PR: https://github.com/apache/pulsar/pull/22178
> > >
> > > WIP of proposed implementation:
> > > https://github.com/apache/pulsar/pull/22179
> > >
> > > --------------------
> > >
> > > # PIP 342: Support OpenTelemetry metrics in Pulsar client
> > >
> > > ## Motivation
> > >
> > > Current support for metric instrumentation in Pulsar client is very
> > limited
> > > and poses a lot of
> > > issues for integrating the metrics into any telemetry system.
> > >
> > > We have 2 ways that metrics are exposed today:
> > >
> > > 1. Printing logs every 1 minute: While this is ok as it comes out of
> the
> > > box, it's very hard for
> > >    any application to get the data or use it in any meaningful way.
> > > 2. `producer.getStats()` or `consumer.getStats()`: Calling these
> methods
> > > will get access to
> > >    the rate of events in the last 1-minute interval. This is
> problematic
> > > because out of the
> > >    box the metrics are not collected anywhere. One would have to start
> > its
> > > own thread to
> > >    periodically check these values and export them to some other
> system.
> > >
> > > Neither of these mechanism that we have today are sufficient to enable
> > > application to easily
> > > export the telemetry data of Pulsar client SDK.
> > >
> > > ## Goal
> > >
> > > Provide a good way for applications to retrieve and analyze the usage
> of
> > > Pulsar client operation,
> > > in particular with respect to:
> > >
> > > 1. Maximizing compatibility with existing telemetry systems
> > > 2. Minimizing the effort required to export these metrics
> > >
> > > ## Why OpenTelemetry?
> > >
> > > [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the
> > > de-facto
> > > standard API for metric and
> > > tracing instrumentation. In fact, as part of [PIP-264](
> > > https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> > > we are already migrating the Pulsar server side metrics to use
> > > OpenTelemetry.
> > >
> > > For Pulsar client SDK, we need to provide a similar way for application
> > > builder to quickly integrate and
> > > export Pulsar metrics.
> > >
> > > ### Why exposing OpenTelemetry directly in Pulsar API
> > >
> > > When deciding how to expose the metrics exporter configuration there
> are
> > > multiple options:
> > >
> > >  1. Accept an `OpenTelemetry` object directly in Pulsar API
> > >  2. Build a pluggable interface that describe all the Pulsar client SDK
> > > events and allow application to
> > >     provide an implementation, perhaps providing an OpenTelemetry
> > included
> > > option.
> > >
> > > For this proposal, we are following the (1) option. Here are the
> reasons:
> > >
> > >  1. In a way, OpenTelemetry can be compared to [SLF4J](
> > > https://www.slf4j.org/), in the sense that it provides an API
> > >     on top of which different vendor can build multiple
> implementations.
> > > Therefore, there is no need to create a new
> > >     Pulsar-specific interface
> > >  2. OpenTelemetry has 2 main artifacts: API and SDK. For the context of
> > > Pulsar client, we will only depend on its
> > >     API. Applications that are going to use OpenTelemetry, will include
> > the
> > > OTel SDK
> > >  3. Providing a custom interface has several drawbacks:
> > >      1. Applications need to update their implementations every time a
> > new
> > > metric is added in Pulsar SDK
> > >      2. The surface of this plugin API can become quite big when there
> > are
> > > several metrics
> > >      3. If we imagine an application that uses multiple libraries, like
> > > Pulsar SDK, and each of these has its own
> > >         custom way to expose metrics, we can see the level of
> integration
> > > burden that is pushed to application
> > >         developers
> > >  4. It will always be easy to use OpenTelemetry to collect the metrics
> > and
> > > export them using a custom metrics API. There
> > >     are several examples of this in OpenTelemetry documentation.
> > >
> > > ## Public API changes
> > >
> > > ### Enabling OpenTelemetry
> > >
> > > When building a `PulsarClient` instance, it will be possible to pass an
> > > `OpenTelemetry` object:
> > >
> > > ```java
> > > interface ClientBuilder {
> > >     // ...
> > >     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry
> > > openTelemetry);
> > >
> > >     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality
> > > metricsCardinality);
> > > }
> > > ```
> > >
> > > The common usage for an application would be something like:
> > >
> > > ```java
> > > // Creates a OpenTelemetry instance using environment variables to
> > > configure it
> > > OpenTelemetry otel=AutoConfiguredOpenTelemetrySdk.builder()
> > >         .build().getOpenTelemetrySdk();
> > >
> > >         PulsarClient client=PulsarClient.builder()
> > >         .serviceUrl("pulsar://localhost:6650")
> > >         .build();
> > >
> > > // ....
> > > ```
> > >
> > > Cardinality enum will allow to select a default cardinality label to be
> > > attached to the
> > > metrics:
> > >
> > > ```java
> > > public enum MetricsCardinality {
> > >     /**
> > >      * Do not add additional labels to metrics
> > >      */
> > >     None,
> > >
> > >     /**
> > >      * Label metrics by tenant
> > >      */
> > >     Tenant,
> > >
> > >     /**
> > >      * Label metrics by tenant and namespace
> > >      */
> > >     Namespace,
> > >
> > >     /**
> > >      * Label metrics by topic
> > >      */
> > >     Topic,
> > >
> > >     /**
> > >      * Label metrics by each partition
> > >      */
> > >     Partition,
> > > }
> > > ```
> > >
> > > The labels are addictive. For example, selecting `Topic` level would
> mean
> > > that the metrics will be
> > > labeled like:
> > >
> > > ```
> > >
> > >
> >
> pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"}
> > > 149.0
> > > ```
> > >
> > > While selecting `Namespace` level:
> > >
> > > ```
> > >
> pulsar_client_received_total{namespace="public/default",tenant="public"}
> > > 149.0
> > > ```
> > >
> > > ### Deprecating the old stats methods
> > >
> > > The old way of collecting stats will be disabled by default, deprecated
> > and
> > > eventually removed
> > > in Pulsar 4.0 release.
> > >
> > > Methods to deprecate:
> > >
> > > ```java
> > > interface ClientBuilder {
> > >     // ...
> > >     @Deprecated
> > >     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> > > }
> > >
> > > interface Producer {
> > >     @Deprecated
> > >     ProducerStats getStats();
> > > }
> > >
> > > interface Consumer {
> > >     @Deprecated
> > >     ConsumerStats getStats();
> > > }
> > > ```
> > >
> > > ## Initial set of metrics to include
> > >
> > > Based on the experience of Pulsar Go client SDK metrics (
> > > see:
> > >
> > >
> >
> https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go
> > > ),
> > > this is the proposed initial set of metrics to export.
> > >
> > > Additional metrics could be added later on, though it's better to start
> > > with the set of most important metrics
> > > and then evaluate any missing information.
> > >
> > > | OTel metric name                                | Type      | Unit
> > >  | Description
> > >                        |
> > >
> > >
> >
> |-------------------------------------------------|-----------|-------------|------------------------------------------------------------------------------------------------|
> > > | `pulsar.client.connections.opened`              | Counter   |
> > connections
> > > | Counter of connections opened
> > >                      |
> > > | `pulsar.client.connections.closed`              | Counter   |
> > connections
> > > | Counter of connections closed
> > >                      |
> > > | `pulsar.client.connections.failed`              | Counter   |
> > connections
> > > | Counter of connections establishment failures
> > >                      |
> > > | `pulsar.client.session.opened`                  | Counter   |
> sessions
> > >  | Counter of sessions opened. `type="producer"` or `consumer`
> > >                        |
> > > | `pulsar.client.session.closed`                  | Counter   |
> sessions
> > >  | Counter of sessions closed. `type="producer"` or `consumer`
> > >                        |
> > > | `pulsar.client.received`                        | Counter   |
> messages
> > >  | Number of messages received
> > >                        |
> > > | `pulsar.client.received`                        | Counter   | bytes
> > > | Number of bytes received
> > >                       |
> > > | `pulsar.client.consumer.preteched.messages`     | Gauge     |
> messages
> > >  | Number of messages currently sitting in the consumer pre-fetch queue
> > >                       |
> > > | `pulsar.client.consumer.preteched`              | Gauge     | bytes
> > > | Total number of bytes currently sitting in the consumer pre-fetch
> queue
> > >                      |
> > > | `pulsar.client.consumer.ack`                    | Counter   |
> messages
> > >  | Number of ack operations
> > >                       |
> > > | `pulsar.client.consumer.nack`                   | Counter   |
> messages
> > >  | Number of negative ack operations
> > >                        |
> > > | `pulsar.client.consumer.dlq`                    | Counter   |
> messages
> > >  | Number of messages sent to DLQ
> > >                       |
> > > | `pulsar.client.consumer.ack.timeout`            | Counter   |
> messages
> > >  | Number of ack timeouts events
> > >                        |
> > > | `pulsar.client.producer.latency`                | Histogram | seconds
> > > | Publish latency experienced by the application, includes client
> > batching
> > > time                  |
> > > | `pulsar.client.producer.rpc.latency`            | Histogram | seconds
> > > | Publish RPC latency experienced internally by the client when sending
> > > data to receiving an ack |
> > > | `pulsar.client.producer.published`              | Counter   | bytes
> > > | Bytes published
> > >                      |
> > > | `pulsar.client.producer.pending.messages.count` | Gauge     |
> messages
> > >  | Pending messages for this producer
> > >                       |
> > > | `pulsar.client.producer.pending.count`          | Gauge     | bytes
> > > | Pending bytes for this producer
> > >                      |
> > >
> > > Topic lookup metric will be differentiated by the lookup type label and
> > by
> > > the lookup transport
> > > mechanism (`transport-type="binary|http"`):
> > >
> > > | OTel metric name                           | Type      | Unit    |
> > > Description                                      |
> > >
> > >
> >
> |--------------------------------------------|-----------|---------|--------------------------------------------------|
> > > | `pulsar.client.lookup{type="topic"}`       | Histogram | seconds |
> > > Counter of topic lookup operations               |
> > > | `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds |
> > > Counter of topic partitioned metadata operations |
> > > | `pulsar.client.lookup{type="schema"}`      | Histogram | seconds |
> > > Counter of schema retrieval operations           |
> > > | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds |
> > > Counter of namespace list topics operations      |
> > >
> > > Additionally, all the histograms will have a `success=true|false` label
> > to
> > > distinguish successful and failed
> > > operations.
> > >
> > >
> > >
> > >
> > > --
> > > Matteo Merli
> > > <ma...@gmail.com>
> > >
> >
>

Re: [DISCUSS] PIP-342: Support OpenTelemetry metrics in Pulsar client

Posted by 太上玄元道君 <da...@apache.org>.
+1

Enrico Olivelli <eo...@gmail.com>于2024年3月3日 周日16:58写道:

> I support the initiative
>
> Lgtm
>
>
> Thanks
> Enrico
>
> Enrico
>
> Il Dom 3 Mar 2024, 04:09 Matteo Merli <ma...@gmail.com> ha scritto:
>
> > PIP PR: https://github.com/apache/pulsar/pull/22178
> >
> > WIP of proposed implementation:
> > https://github.com/apache/pulsar/pull/22179
> >
> > --------------------
> >
> > # PIP 342: Support OpenTelemetry metrics in Pulsar client
> >
> > ## Motivation
> >
> > Current support for metric instrumentation in Pulsar client is very
> limited
> > and poses a lot of
> > issues for integrating the metrics into any telemetry system.
> >
> > We have 2 ways that metrics are exposed today:
> >
> > 1. Printing logs every 1 minute: While this is ok as it comes out of the
> > box, it's very hard for
> >    any application to get the data or use it in any meaningful way.
> > 2. `producer.getStats()` or `consumer.getStats()`: Calling these methods
> > will get access to
> >    the rate of events in the last 1-minute interval. This is problematic
> > because out of the
> >    box the metrics are not collected anywhere. One would have to start
> its
> > own thread to
> >    periodically check these values and export them to some other system.
> >
> > Neither of these mechanism that we have today are sufficient to enable
> > application to easily
> > export the telemetry data of Pulsar client SDK.
> >
> > ## Goal
> >
> > Provide a good way for applications to retrieve and analyze the usage of
> > Pulsar client operation,
> > in particular with respect to:
> >
> > 1. Maximizing compatibility with existing telemetry systems
> > 2. Minimizing the effort required to export these metrics
> >
> > ## Why OpenTelemetry?
> >
> > [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the
> > de-facto
> > standard API for metric and
> > tracing instrumentation. In fact, as part of [PIP-264](
> > https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> > we are already migrating the Pulsar server side metrics to use
> > OpenTelemetry.
> >
> > For Pulsar client SDK, we need to provide a similar way for application
> > builder to quickly integrate and
> > export Pulsar metrics.
> >
> > ### Why exposing OpenTelemetry directly in Pulsar API
> >
> > When deciding how to expose the metrics exporter configuration there are
> > multiple options:
> >
> >  1. Accept an `OpenTelemetry` object directly in Pulsar API
> >  2. Build a pluggable interface that describe all the Pulsar client SDK
> > events and allow application to
> >     provide an implementation, perhaps providing an OpenTelemetry
> included
> > option.
> >
> > For this proposal, we are following the (1) option. Here are the reasons:
> >
> >  1. In a way, OpenTelemetry can be compared to [SLF4J](
> > https://www.slf4j.org/), in the sense that it provides an API
> >     on top of which different vendor can build multiple implementations.
> > Therefore, there is no need to create a new
> >     Pulsar-specific interface
> >  2. OpenTelemetry has 2 main artifacts: API and SDK. For the context of
> > Pulsar client, we will only depend on its
> >     API. Applications that are going to use OpenTelemetry, will include
> the
> > OTel SDK
> >  3. Providing a custom interface has several drawbacks:
> >      1. Applications need to update their implementations every time a
> new
> > metric is added in Pulsar SDK
> >      2. The surface of this plugin API can become quite big when there
> are
> > several metrics
> >      3. If we imagine an application that uses multiple libraries, like
> > Pulsar SDK, and each of these has its own
> >         custom way to expose metrics, we can see the level of integration
> > burden that is pushed to application
> >         developers
> >  4. It will always be easy to use OpenTelemetry to collect the metrics
> and
> > export them using a custom metrics API. There
> >     are several examples of this in OpenTelemetry documentation.
> >
> > ## Public API changes
> >
> > ### Enabling OpenTelemetry
> >
> > When building a `PulsarClient` instance, it will be possible to pass an
> > `OpenTelemetry` object:
> >
> > ```java
> > interface ClientBuilder {
> >     // ...
> >     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry
> > openTelemetry);
> >
> >     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality
> > metricsCardinality);
> > }
> > ```
> >
> > The common usage for an application would be something like:
> >
> > ```java
> > // Creates a OpenTelemetry instance using environment variables to
> > configure it
> > OpenTelemetry otel=AutoConfiguredOpenTelemetrySdk.builder()
> >         .build().getOpenTelemetrySdk();
> >
> >         PulsarClient client=PulsarClient.builder()
> >         .serviceUrl("pulsar://localhost:6650")
> >         .build();
> >
> > // ....
> > ```
> >
> > Cardinality enum will allow to select a default cardinality label to be
> > attached to the
> > metrics:
> >
> > ```java
> > public enum MetricsCardinality {
> >     /**
> >      * Do not add additional labels to metrics
> >      */
> >     None,
> >
> >     /**
> >      * Label metrics by tenant
> >      */
> >     Tenant,
> >
> >     /**
> >      * Label metrics by tenant and namespace
> >      */
> >     Namespace,
> >
> >     /**
> >      * Label metrics by topic
> >      */
> >     Topic,
> >
> >     /**
> >      * Label metrics by each partition
> >      */
> >     Partition,
> > }
> > ```
> >
> > The labels are addictive. For example, selecting `Topic` level would mean
> > that the metrics will be
> > labeled like:
> >
> > ```
> >
> >
> pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"}
> > 149.0
> > ```
> >
> > While selecting `Namespace` level:
> >
> > ```
> > pulsar_client_received_total{namespace="public/default",tenant="public"}
> > 149.0
> > ```
> >
> > ### Deprecating the old stats methods
> >
> > The old way of collecting stats will be disabled by default, deprecated
> and
> > eventually removed
> > in Pulsar 4.0 release.
> >
> > Methods to deprecate:
> >
> > ```java
> > interface ClientBuilder {
> >     // ...
> >     @Deprecated
> >     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> > }
> >
> > interface Producer {
> >     @Deprecated
> >     ProducerStats getStats();
> > }
> >
> > interface Consumer {
> >     @Deprecated
> >     ConsumerStats getStats();
> > }
> > ```
> >
> > ## Initial set of metrics to include
> >
> > Based on the experience of Pulsar Go client SDK metrics (
> > see:
> >
> >
> https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go
> > ),
> > this is the proposed initial set of metrics to export.
> >
> > Additional metrics could be added later on, though it's better to start
> > with the set of most important metrics
> > and then evaluate any missing information.
> >
> > | OTel metric name                                | Type      | Unit
> >  | Description
> >                        |
> >
> >
> |-------------------------------------------------|-----------|-------------|------------------------------------------------------------------------------------------------|
> > | `pulsar.client.connections.opened`              | Counter   |
> connections
> > | Counter of connections opened
> >                      |
> > | `pulsar.client.connections.closed`              | Counter   |
> connections
> > | Counter of connections closed
> >                      |
> > | `pulsar.client.connections.failed`              | Counter   |
> connections
> > | Counter of connections establishment failures
> >                      |
> > | `pulsar.client.session.opened`                  | Counter   | sessions
> >  | Counter of sessions opened. `type="producer"` or `consumer`
> >                        |
> > | `pulsar.client.session.closed`                  | Counter   | sessions
> >  | Counter of sessions closed. `type="producer"` or `consumer`
> >                        |
> > | `pulsar.client.received`                        | Counter   | messages
> >  | Number of messages received
> >                        |
> > | `pulsar.client.received`                        | Counter   | bytes
> > | Number of bytes received
> >                       |
> > | `pulsar.client.consumer.preteched.messages`     | Gauge     | messages
> >  | Number of messages currently sitting in the consumer pre-fetch queue
> >                       |
> > | `pulsar.client.consumer.preteched`              | Gauge     | bytes
> > | Total number of bytes currently sitting in the consumer pre-fetch queue
> >                      |
> > | `pulsar.client.consumer.ack`                    | Counter   | messages
> >  | Number of ack operations
> >                       |
> > | `pulsar.client.consumer.nack`                   | Counter   | messages
> >  | Number of negative ack operations
> >                        |
> > | `pulsar.client.consumer.dlq`                    | Counter   | messages
> >  | Number of messages sent to DLQ
> >                       |
> > | `pulsar.client.consumer.ack.timeout`            | Counter   | messages
> >  | Number of ack timeouts events
> >                        |
> > | `pulsar.client.producer.latency`                | Histogram | seconds
> > | Publish latency experienced by the application, includes client
> batching
> > time                  |
> > | `pulsar.client.producer.rpc.latency`            | Histogram | seconds
> > | Publish RPC latency experienced internally by the client when sending
> > data to receiving an ack |
> > | `pulsar.client.producer.published`              | Counter   | bytes
> > | Bytes published
> >                      |
> > | `pulsar.client.producer.pending.messages.count` | Gauge     | messages
> >  | Pending messages for this producer
> >                       |
> > | `pulsar.client.producer.pending.count`          | Gauge     | bytes
> > | Pending bytes for this producer
> >                      |
> >
> > Topic lookup metric will be differentiated by the lookup type label and
> by
> > the lookup transport
> > mechanism (`transport-type="binary|http"`):
> >
> > | OTel metric name                           | Type      | Unit    |
> > Description                                      |
> >
> >
> |--------------------------------------------|-----------|---------|--------------------------------------------------|
> > | `pulsar.client.lookup{type="topic"}`       | Histogram | seconds |
> > Counter of topic lookup operations               |
> > | `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds |
> > Counter of topic partitioned metadata operations |
> > | `pulsar.client.lookup{type="schema"}`      | Histogram | seconds |
> > Counter of schema retrieval operations           |
> > | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds |
> > Counter of namespace list topics operations      |
> >
> > Additionally, all the histograms will have a `success=true|false` label
> to
> > distinguish successful and failed
> > operations.
> >
> >
> >
> >
> > --
> > Matteo Merli
> > <ma...@gmail.com>
> >
>

Re: [DISCUSS] PIP-342: Support OpenTelemetry metrics in Pulsar client

Posted by Enrico Olivelli <eo...@gmail.com>.
I support the initiative

Lgtm


Thanks
Enrico

Enrico

Il Dom 3 Mar 2024, 04:09 Matteo Merli <ma...@gmail.com> ha scritto:

> PIP PR: https://github.com/apache/pulsar/pull/22178
>
> WIP of proposed implementation:
> https://github.com/apache/pulsar/pull/22179
>
> --------------------
>
> # PIP 342: Support OpenTelemetry metrics in Pulsar client
>
> ## Motivation
>
> Current support for metric instrumentation in Pulsar client is very limited
> and poses a lot of
> issues for integrating the metrics into any telemetry system.
>
> We have 2 ways that metrics are exposed today:
>
> 1. Printing logs every 1 minute: While this is ok as it comes out of the
> box, it's very hard for
>    any application to get the data or use it in any meaningful way.
> 2. `producer.getStats()` or `consumer.getStats()`: Calling these methods
> will get access to
>    the rate of events in the last 1-minute interval. This is problematic
> because out of the
>    box the metrics are not collected anywhere. One would have to start its
> own thread to
>    periodically check these values and export them to some other system.
>
> Neither of these mechanism that we have today are sufficient to enable
> application to easily
> export the telemetry data of Pulsar client SDK.
>
> ## Goal
>
> Provide a good way for applications to retrieve and analyze the usage of
> Pulsar client operation,
> in particular with respect to:
>
> 1. Maximizing compatibility with existing telemetry systems
> 2. Minimizing the effort required to export these metrics
>
> ## Why OpenTelemetry?
>
> [OpenTelemetry](https://opentelemetry.io/) is quickly becoming the
> de-facto
> standard API for metric and
> tracing instrumentation. In fact, as part of [PIP-264](
> https://github.com/apache/pulsar/blob/master/pip/pip-264.md),
> we are already migrating the Pulsar server side metrics to use
> OpenTelemetry.
>
> For Pulsar client SDK, we need to provide a similar way for application
> builder to quickly integrate and
> export Pulsar metrics.
>
> ### Why exposing OpenTelemetry directly in Pulsar API
>
> When deciding how to expose the metrics exporter configuration there are
> multiple options:
>
>  1. Accept an `OpenTelemetry` object directly in Pulsar API
>  2. Build a pluggable interface that describe all the Pulsar client SDK
> events and allow application to
>     provide an implementation, perhaps providing an OpenTelemetry included
> option.
>
> For this proposal, we are following the (1) option. Here are the reasons:
>
>  1. In a way, OpenTelemetry can be compared to [SLF4J](
> https://www.slf4j.org/), in the sense that it provides an API
>     on top of which different vendor can build multiple implementations.
> Therefore, there is no need to create a new
>     Pulsar-specific interface
>  2. OpenTelemetry has 2 main artifacts: API and SDK. For the context of
> Pulsar client, we will only depend on its
>     API. Applications that are going to use OpenTelemetry, will include the
> OTel SDK
>  3. Providing a custom interface has several drawbacks:
>      1. Applications need to update their implementations every time a new
> metric is added in Pulsar SDK
>      2. The surface of this plugin API can become quite big when there are
> several metrics
>      3. If we imagine an application that uses multiple libraries, like
> Pulsar SDK, and each of these has its own
>         custom way to expose metrics, we can see the level of integration
> burden that is pushed to application
>         developers
>  4. It will always be easy to use OpenTelemetry to collect the metrics and
> export them using a custom metrics API. There
>     are several examples of this in OpenTelemetry documentation.
>
> ## Public API changes
>
> ### Enabling OpenTelemetry
>
> When building a `PulsarClient` instance, it will be possible to pass an
> `OpenTelemetry` object:
>
> ```java
> interface ClientBuilder {
>     // ...
>     ClientBuilder openTelemetry(io.opentelemetry.api.OpenTelemetry
> openTelemetry);
>
>     ClientBuilder openTelemetryMetricsCardinality(MetricsCardinality
> metricsCardinality);
> }
> ```
>
> The common usage for an application would be something like:
>
> ```java
> // Creates a OpenTelemetry instance using environment variables to
> configure it
> OpenTelemetry otel=AutoConfiguredOpenTelemetrySdk.builder()
>         .build().getOpenTelemetrySdk();
>
>         PulsarClient client=PulsarClient.builder()
>         .serviceUrl("pulsar://localhost:6650")
>         .build();
>
> // ....
> ```
>
> Cardinality enum will allow to select a default cardinality label to be
> attached to the
> metrics:
>
> ```java
> public enum MetricsCardinality {
>     /**
>      * Do not add additional labels to metrics
>      */
>     None,
>
>     /**
>      * Label metrics by tenant
>      */
>     Tenant,
>
>     /**
>      * Label metrics by tenant and namespace
>      */
>     Namespace,
>
>     /**
>      * Label metrics by topic
>      */
>     Topic,
>
>     /**
>      * Label metrics by each partition
>      */
>     Partition,
> }
> ```
>
> The labels are addictive. For example, selecting `Topic` level would mean
> that the metrics will be
> labeled like:
>
> ```
>
> pulsar_client_received_total{namespace="public/default",tenant="public",topic="persistent://public/default/pt"}
> 149.0
> ```
>
> While selecting `Namespace` level:
>
> ```
> pulsar_client_received_total{namespace="public/default",tenant="public"}
> 149.0
> ```
>
> ### Deprecating the old stats methods
>
> The old way of collecting stats will be disabled by default, deprecated and
> eventually removed
> in Pulsar 4.0 release.
>
> Methods to deprecate:
>
> ```java
> interface ClientBuilder {
>     // ...
>     @Deprecated
>     ClientBuilder statsInterval(long statsInterval, TimeUnit unit);
> }
>
> interface Producer {
>     @Deprecated
>     ProducerStats getStats();
> }
>
> interface Consumer {
>     @Deprecated
>     ConsumerStats getStats();
> }
> ```
>
> ## Initial set of metrics to include
>
> Based on the experience of Pulsar Go client SDK metrics (
> see:
>
> https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/metrics.go
> ),
> this is the proposed initial set of metrics to export.
>
> Additional metrics could be added later on, though it's better to start
> with the set of most important metrics
> and then evaluate any missing information.
>
> | OTel metric name                                | Type      | Unit
>  | Description
>                        |
>
> |-------------------------------------------------|-----------|-------------|------------------------------------------------------------------------------------------------|
> | `pulsar.client.connections.opened`              | Counter   | connections
> | Counter of connections opened
>                      |
> | `pulsar.client.connections.closed`              | Counter   | connections
> | Counter of connections closed
>                      |
> | `pulsar.client.connections.failed`              | Counter   | connections
> | Counter of connections establishment failures
>                      |
> | `pulsar.client.session.opened`                  | Counter   | sessions
>  | Counter of sessions opened. `type="producer"` or `consumer`
>                        |
> | `pulsar.client.session.closed`                  | Counter   | sessions
>  | Counter of sessions closed. `type="producer"` or `consumer`
>                        |
> | `pulsar.client.received`                        | Counter   | messages
>  | Number of messages received
>                        |
> | `pulsar.client.received`                        | Counter   | bytes
> | Number of bytes received
>                       |
> | `pulsar.client.consumer.preteched.messages`     | Gauge     | messages
>  | Number of messages currently sitting in the consumer pre-fetch queue
>                       |
> | `pulsar.client.consumer.preteched`              | Gauge     | bytes
> | Total number of bytes currently sitting in the consumer pre-fetch queue
>                      |
> | `pulsar.client.consumer.ack`                    | Counter   | messages
>  | Number of ack operations
>                       |
> | `pulsar.client.consumer.nack`                   | Counter   | messages
>  | Number of negative ack operations
>                        |
> | `pulsar.client.consumer.dlq`                    | Counter   | messages
>  | Number of messages sent to DLQ
>                       |
> | `pulsar.client.consumer.ack.timeout`            | Counter   | messages
>  | Number of ack timeouts events
>                        |
> | `pulsar.client.producer.latency`                | Histogram | seconds
> | Publish latency experienced by the application, includes client batching
> time                  |
> | `pulsar.client.producer.rpc.latency`            | Histogram | seconds
> | Publish RPC latency experienced internally by the client when sending
> data to receiving an ack |
> | `pulsar.client.producer.published`              | Counter   | bytes
> | Bytes published
>                      |
> | `pulsar.client.producer.pending.messages.count` | Gauge     | messages
>  | Pending messages for this producer
>                       |
> | `pulsar.client.producer.pending.count`          | Gauge     | bytes
> | Pending bytes for this producer
>                      |
>
> Topic lookup metric will be differentiated by the lookup type label and by
> the lookup transport
> mechanism (`transport-type="binary|http"`):
>
> | OTel metric name                           | Type      | Unit    |
> Description                                      |
>
> |--------------------------------------------|-----------|---------|--------------------------------------------------|
> | `pulsar.client.lookup{type="topic"}`       | Histogram | seconds |
> Counter of topic lookup operations               |
> | `pulsar.client.lookup{type="metadata"}`    | Histogram | seconds |
> Counter of topic partitioned metadata operations |
> | `pulsar.client.lookup{type="schema"}`      | Histogram | seconds |
> Counter of schema retrieval operations           |
> | `pulsar.client.lookup{type="list-topics"}` | Histogram | seconds |
> Counter of namespace list topics operations      |
>
> Additionally, all the histograms will have a `success=true|false` label to
> distinguish successful and failed
> operations.
>
>
>
>
> --
> Matteo Merli
> <ma...@gmail.com>
>