You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/08 15:08:55 UTC

[GitHub] [arrow-ballista] thinkharderdev opened a new pull request, #504: Prometheus metrics

thinkharderdev opened a new pull request, #504:
URL: https://github.com/apache/arrow-ballista/pull/504

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #427
   
   Posting for feedback now but still need to do a couple more small things:
   [] Verify the REST API endpoint works
   [] Decide whether prometheus should be default feature or not
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   We should expose metrics from the scheduler in a standard format. This PR adds an initial integration with prometheus to expose some basic scheduler metrics
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   1. Add a new trait `SchedulerMetricsCollector` which is plugged into the core event loop to capture metrics
   2. Add some additional metadata to the job events so we can track durations of the different parts of the query lifecycle (queued, planning, execution) separately. 
   3. Add an implementation of `SchedulerMetricsCollector` for prometheus
   4. Add an API endpoint `/api/metrics` which will expose metrics. 
   5. Add some baseline metrics
   
   In addition, I did some refactoring to make the core event loop more testable. Currently you can't really test it rigorously at all which is not great since it is a rather critical piece of code. I added a new trait `TaskLauncher` which can control how the `TaskManager` which actually launch tasks and various crate-private methods to plug in a custom launcher. I don't imagine this will ever be something exposed through a public API but it is very useful for testing. 
   
   Using the `TaskLauncher` I added some additional utilities to write scheduler test more succinctly. 
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   Users can plug in their own metrics collector if they want. 
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on a diff in pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#discussion_r1016758023


##########
ballista/scheduler/src/metrics/prometheus.rs:
##########
@@ -0,0 +1,141 @@
+use crate::metrics::SchedulerMetricsCollector;
+use ballista_core::error::{BallistaError, Result};
+use hyper::header::CONTENT_TYPE;
+use prometheus::{
+    register_counter_with_registry, register_gauge_with_registry,
+    register_histogram_with_registry, Counter, Gauge, Histogram, Registry,
+};
+use prometheus::{Encoder, TextEncoder};
+
+use warp::Reply;
+
+pub struct PrometheusMetricsCollector {
+    execution_time: Histogram,
+    planning_time: Histogram,
+    failed: Counter,
+    cancelled: Counter,
+    completed: Counter,
+    submitted: Counter,
+    pending_queue_size: Gauge,
+}
+
+impl PrometheusMetricsCollector {
+    pub fn new(registry: &Registry) -> Result<Self> {
+        let execution_time = register_histogram_with_registry!(
+            "query_time_seconds",
+            "Histogram of query execution time in seconds",
+            vec![0.5_f64, 1_f64, 5_f64, 30_f64, 60_f64],
+            registry
+        )
+        .map_err(|e| {
+            BallistaError::Internal(format!("Error registering metric: {:?}", e))
+        })?;
+
+        let planning_time = register_histogram_with_registry!(
+            "planning_time_ms",
+            "Histogram of query planning time in milliseconds",
+            vec![1.0_f64, 5.0_f64, 25.0_f64, 100.0_f64, 500.0_f64],
+            registry
+        )
+        .map_err(|e| {
+            BallistaError::Internal(format!("Error registering metric: {:?}", e))
+        })?;

Review Comment:
   These histograms are hard to generalize since use cases could vary wildly. We can either:
   1. Make these configurable
   2. Try and choose sensible defaults and if someone needs to customize then they can implement a custom collector



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove closed pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #504: Prometheus metrics
URL: https://github.com/apache/arrow-ballista/pull/504


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1311922378

   @andygrove This PR is superseded by https://github.com/apache/arrow-ballista/pull/511 which has some tweaks to make the plug-ability more useful so if we're good with the other PR we should just merge that one and I can close this one. I'll leave it open for the moment in case it's easier to review them separately. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1310417132

   I filed https://github.com/apache/arrow-ballista/issues/507 for writing documentation in the user guide for this feature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on a diff in pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#discussion_r1016761799


##########
ballista/scheduler/src/state/task_manager.rs:
##########
@@ -405,19 +478,19 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> TaskManager<T, U>
             ]
         };
 
-        let _res = if let Some(graph) = self.remove_active_execution_graph(job_id).await {
+        if let Some(graph) = self.remove_active_execution_graph(job_id).await {
             let mut graph = graph.write().await;
             let previous_status = graph.status();
             graph.fail_job(failure_reason);
-            let value = self.encode_execution_graph(graph.clone())?;
+
+            let value = encode_protobuf(&graph.status())?;

Review Comment:
   First bug uncovered while testing the event loop proper :) When a job would fail we would save the entire graph into `FailedJobs` here but in all other scenarios we only save the status



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1310470569

   I pulled this branch and ran into a build issue:
   
   ```
     = note: /home/andy/miniconda3/envs/dask-sql/bin/../lib/gcc/x86_64-conda-linux-gnu/12.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: /home/andy/git/apache/arrow-ballista/target/release/deps/libprocfs-372835ba76b41348.rlib(procfs-372835ba76b41348.procfs.efebeec9-cgu.3.rcgu.o): in function `procfs::CpuTime::from_str':
             procfs.efebeec9-cgu.3:(.text._ZN6procfs7CpuTime8from_str17hf1b21a01a95d1a1cE+0x7b): undefined reference to `getauxval'
   ```
   
   It is building OK in CI though so I am confused. Any ideas @thinkharderdev?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1310571228

   > I filed #507 for writing documentation in the user guide for this feature.
   
   Awesome. I'll have another PR shortly that will add the pending tasks queue tracking. I can add some initial documentation as part of that PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1307692467

   I think this is ready for review. I had to tweak how we do the API routing slightly because the prometheus exporter probably won't send an `application/json` `Accept` header and I'm not sure it is guaranteed to send any particular `Accept` header so changed the routing to look at the path instead. But I was able to verify the metrics are exported. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1310959981

   > > 
   > 
   > Hmm... I haven't seen this locally but is looks like something when building the prometheus-rs crate. What OS are you using?
   
   Ubuntu 20.04.4 LTS


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] thinkharderdev commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1310569426

   > 
   
   Hmm... I haven't seen this locally but is looks like something when building the prometheus-rs crate. What OS are you using?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-ballista] andygrove commented on pull request #504: Prometheus metrics

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #504:
URL: https://github.com/apache/arrow-ballista/pull/504#issuecomment-1307642647

   This is awesome! Thanks @thinkharderdev. I will make time to try this out later this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org