Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/09/02 03:58:45 UTC

[GitHub] [pulsar] michaeljmarshall opened a new pull request, #17419: [fix][broker] Remove timestamp from Promtheus metrics

michaeljmarshall opened a new pull request, #17419:
URL: https://github.com/apache/pulsar/pull/17419

   ### Motivation
   
   When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.
   
   In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.
   
   ### Analysis
   
   Since we introduced Prometheus metrics in #294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) essentially inherited its most recent value from the last 5 minutes. Only when there was no sample in the past 5 minutes did the metric become "stale" and stop being reported. Starting in 2.0, new logic considers a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, since topics can move between brokers at any time.
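
The two staleness behaviors described above can be illustrated with a toy model. This is only a sketch, not Prometheus code: `visible_at`, its arguments, and the scrape data are hypothetical, and the model simplifies by treating the scrape time as the sample time.

```python
LOOKBACK_SECONDS = 300  # Prometheus's default 5-minute lookback window


def visible_at(query_time, scrapes, has_timestamps):
    """Return the value a toy instant query would see at query_time, or None.

    scrapes: list of (scrape_time, value) sorted by time; value is None
             when the series was absent from that scrape.
    has_timestamps: True if the target exports explicit timestamps.
    """
    last_value = None
    last_time = None
    for scrape_time, value in scrapes:
        if scrape_time > query_time:
            break
        if value is not None:
            last_value, last_time = value, scrape_time
        elif not has_timestamps and last_time is not None:
            # Without explicit timestamps, the series is marked stale at
            # the first scrape that no longer contains it.
            last_value, last_time = None, None
    # With explicit timestamps only the lookback window applies.
    if last_time is None or query_time - last_time > LOOKBACK_SECONDS:
        return None
    return last_value


# Hypothetical 30s scrapes: the topic is unloaded between t=30 and t=60.
scrapes = [(0, 10), (30, 12), (60, None), (90, None)]
print(visible_at(90, scrapes, has_timestamps=True))   # 12 (the "zombie" value)
print(visible_at(90, scrapes, has_timestamps=False))  # None (already stale)
```

In this model the series with timestamps only disappears once the last sample is more than 5 minutes old, which matches the behavior reported in this PR.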
   
   This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.
   
   Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:
   
   > Note, however, that adding timestamps is an extremely niche use
   > case. Most of the users who think they need it should actually not do
   > it.
   >
   > The main usecases within that tiny niche are federation and mirroring
   > the data from another monitoring system.
   
   Pulsar's broker metrics are neither federation nor mirrored data from another monitoring system, so I think we are not a niche use case, and we should not add timestamps to our metrics.
   
   ### Reproducing the problem
   
   1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
   ```yaml
     - job_name: broker
       honor_timestamps: true
       scrape_interval: 30s
       scrape_timeout: 10s
       metrics_path: /metrics
       scheme: http
       follow_redirects: true
       static_configs:
         - targets: ["localhost:8080"]
   ```
   2. Start Pulsar standalone on the same machine. I used a recently compiled version of master.
   3. Publish messages to a topic.
   4. Observe the `pulsar_in_messages_total` metric for the topic in the Prometheus UI (localhost:9090).
   5. Stop the producer.
   6. Unload the topic from the broker.
   7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported.
   8. Watch the metric continue to be reported in Prometheus for 5 additional minutes.
   
   When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.
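
The check in step 7 can also be scripted. The following is a minimal sketch, assuming the broker serves metrics at `http://localhost:8080/metrics` as in the scrape config above and that topic-level series carry a `topic` label. The helper names and the topic name are hypothetical.

```python
import urllib.request


def series_for_topic(exposition_text, metric_name, topic):
    """Return the sample lines of metric_name that carry the given topic label."""
    needle = f'topic="{topic}"'
    return [
        line
        for line in exposition_text.splitlines()
        if line.startswith(metric_name + "{") and needle in line
    ]


def topic_still_reported(metric_name, topic, url="http://localhost:8080/metrics"):
    """Fetch the metrics endpoint and report whether the topic's series is present."""
    with urllib.request.urlopen(url, timeout=10) as response:
        text = response.read().decode("utf-8")
    return bool(series_for_topic(text, metric_name, topic))
```

After unloading the topic, calling `topic_still_reported("pulsar_in_messages_total", "persistent://public/default/my-topic")` against a running standalone broker should return `False`, even while Prometheus keeps showing the series.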
   
   ### Modifications
   
   * Remove all timestamps from metrics
   * Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)
   
   ### Verifying this change
   
   This change is accompanied by updated tests.
   
   ### Does this pull request potentially affect one of the following parts:
   
   This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.
   
   Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry picked to older release branches. Technically, we can avoid cherry picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them.
   
   ### Documentation
   - [x] `doc-not-needed` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] congbobo184 commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
congbobo184 commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1314659671

   Could you please cherry-pick this PR to branch-2.9? Thanks.




[GitHub] [pulsar] mattisonchao commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
mattisonchao commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1244797892

   Hello @michaeljmarshall 
   It looks like we got many conflicts when cherry-picking it to branch-2.9.
   Would you mind helping cherry-pick it (to avoid introducing bugs)?




[GitHub] [pulsar] michaeljmarshall commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1235717681

   Done. https://github.com/apache/pulsar/pull/15558 reduced the number of places we add the timestamp, so the diff is slightly smaller now.




[GitHub] [pulsar] michaeljmarshall commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1244799580

   Hi @mattisonchao, it's because this PR relies on https://github.com/apache/pulsar/pull/15558. I have been trying to figure out if we can/should cherry-pick that PR. If we do not, we should cherry-pick this commit https://github.com/apache/pulsar/commit/b5cb02deb06760a2b6fe7b6c221e08acfabdf830 instead, which was my original work and should have fewer conflicts. Do you have an opinion on #15558? (I am happy to help cherry-pick the commit, I just need to figure out _what_ to cherry-pick first.)




[GitHub] [pulsar] lhotari commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
lhotari commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1235116484

   This change is also consistent with ZK and BK since the Prometheus metrics for ZK or BK don't have the timestamps. 
   
   I verified this on a local microk8s cluster by opening a shell to a ZK and a BK pod:
   
   no timestamps in ZK metrics
   ```
   I have no name!@pulsar-testenv-pulsar-zookeeper-0:/pulsar$ curl -s http://localhost:8000/metrics|tail -n 10
   process_open_fds 314.0
   # HELP process_max_fds Maximum number of open file descriptors.
   # TYPE process_max_fds gauge
   process_max_fds 65536.0
   # HELP process_virtual_memory_bytes Virtual memory size in bytes.
   # TYPE process_virtual_memory_bytes gauge
   process_virtual_memory_bytes 5.394948096E9
   # HELP process_resident_memory_bytes Resident memory size in bytes.
   # TYPE process_resident_memory_bytes gauge
   process_resident_memory_bytes 1.96804608E8
   ```
   
   no timestamps in BK metrics
   ```
   I have no name!@pulsar-testenv-pulsar-bookkeeper-0:/pulsar$ curl -s localhost:8000/metrics | tail -n 10
   bookie_bookie_zk_create_sum{success="false"} 0.0
   bookie_bookie_zk_create{success="true",quantile="0.5"} NaN
   bookie_bookie_zk_create{success="true",quantile="0.75"} NaN
   bookie_bookie_zk_create{success="true",quantile="0.95"} NaN
   bookie_bookie_zk_create{success="true",quantile="0.99"} NaN
   bookie_bookie_zk_create{success="true",quantile="0.999"} NaN
   bookie_bookie_zk_create{success="true",quantile="0.9999"} NaN
   bookie_bookie_zk_create{success="true",quantile="1.0"} NaN
   bookie_bookie_zk_create_count{success="true"} 3
   bookie_bookie_zk_create_sum{success="true"} 12.0
   ```
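
The same check can be automated instead of reading the `curl` output by eye. This is a minimal sketch, assuming the Prometheus text exposition format, in which a sample line is a metric name, an optional label set, a value, and an optional trailing timestamp. The function names are hypothetical.

```python
def has_timestamp(sample_line):
    """Return True if a text-format sample line ends with a timestamp."""
    close = sample_line.rfind("}")
    if close != -1:
        # Labels present: everything after the closing brace is value [timestamp].
        rest = sample_line[close + 1:]
    else:
        # No labels: drop the metric name and keep what follows it.
        parts = sample_line.split(None, 1)
        rest = parts[1] if len(parts) == 2 else ""
    return len(rest.split()) == 2


def lines_with_timestamps(exposition_text):
    """Return every sample line in the text that carries a timestamp."""
    result = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if has_timestamp(line):
            result.append(line)
    return result
```

Applied to the ZK and BK output above, `lines_with_timestamps` returns an empty list, since every sample line there ends with its value.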




[GitHub] [pulsar] eolivelli commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
eolivelli commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1235280379

   @michaeljmarshall can you please resolve the conflicts?




[GitHub] [pulsar] mattisonchao commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
mattisonchao commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1244816395

   I left a comment on #15558. Once it is cherry-picked, we can take the next step.
   Thanks very much for your help.




[GitHub] [pulsar] michaeljmarshall merged pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
michaeljmarshall merged PR #17419:
URL: https://github.com/apache/pulsar/pull/17419




[GitHub] [pulsar] congbobo184 commented on pull request #17419: [fix][broker] Remove timestamp from Promtheus metrics

Posted by GitBox <gi...@apache.org>.
congbobo184 commented on PR #17419:
URL: https://github.com/apache/pulsar/pull/17419#issuecomment-1318528369

   @michaeljmarshall Hi, I moved this PR to release/2.9.5. If you have any questions, please ping me. Thanks.

