Posted to commits@airflow.apache.org by "syun64 (via GitHub)" <gi...@apache.org> on 2023/02/21 16:13:00 UTC

[GitHub] [airflow] syun64 opened a new issue, #29663: Option to Disable High Cardinality Metrics on Statsd

syun64 opened a new issue, #29663:
URL: https://github.com/apache/airflow/issues/29663

   ### Description
   
   With recent PRs enabling tag support on Statsd metrics, we have gained a deeper understanding of the problem of publishing high-cardinality metrics. Through this issue, I hope to facilitate a discussion on categorizing the cardinality of Airflow-specific events and tags, and on agreeing on a way to disable high-cardinality metrics that can be included in the 2.6.0 release.
   
   In observability and metrics, cardinality is broadly defined as follows:
   
   `number of unique metric names * number of unique application tag pairs`
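As a rough illustration of this formula, the sketch below counts the unique metric names and the unique tag key/value pairs across a batch of emitted metrics and multiplies the two counts. The sample metrics are hypothetical, and the product is best read as an upper bound: a backend only stores the name/tag combinations that are actually emitted.

```python
from typing import Dict, Iterable, Tuple

# Each emitted metric is (metric_name, {tag_key: tag_value}).
Metric = Tuple[str, Dict[str, str]]


def estimate_cardinality(metrics: Iterable[Metric]) -> int:
    """Return (# unique metric names) * (# unique tag key/value pairs)."""
    names = set()
    tag_pairs = set()
    for name, tags in metrics:
        names.add(name)
        # A "tag pair" is one key/value combination, e.g. ("dag_id", "etl").
        tag_pairs.update(tags.items())
    return len(names) * len(tag_pairs)


# Hypothetical sample: 2 unique names and 4 unique tag pairs.
emitted = [
    ("local_task_job.task_exit", {"dag_id": "etl", "return_code": "0"}),
    ("local_task_job.task_exit", {"dag_id": "etl", "return_code": "1"}),
    ("dagrun.duration.success", {"dag_id": "reporting"}),
]
print(estimate_cardinality(emitted))  # 8
```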
   
   This means that events with an _unbounded_ number of tag pairs (key-value pairs of tags), as well as events with an _unbounded_ number of unique metric names, will incur expensive storage requirements on the metrics backend.
   
   Let's take a look at the following metric:
   
   `local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>`
   
   Here, we have 4 different variable, tag-like attributes embedded in the metric name, which I think can be categorized into 3 levels of cardinality:
   
   1. High cardinality / Unbounded metric
   2. Medium cardinality / semi-bounded metric
   3. Low cardinality / categorically-bounded metric
   
   ### High Cardinality / Unbounded Metric
   Example tag: <job_id>
   
   This category of metrics is strictly unbounded and incorporates a monotonically increasing attribute like <job_id> or <run_id>. To demonstrate just how explosive the growth of these metrics can be, consider an Airflow instance with 1000 daily jobs and a metric retention period of 10 days: adding this tag alone increases the cardinality of that single metric by 10,000 (1000 jobs per day × 10 days). Adding the same tag to a few other metrics could easily result in an explosion of metric cardinality. As a benchmark, [DataDog's Enterprise-level pricing plan only includes 200 custom metrics per host](https://www.datadoghq.com/pricing/), and anything beyond that is charged at a premium. These metrics should be avoided at all costs.
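The growth estimate above is a simple product. A minimal sketch of that arithmetic follows, assuming, as the example does, that every job run produces a new `job_id` value on each tagged metric and that each value is retained for the whole retention window. The function name `added_series` is hypothetical.

```python
def added_series(daily_jobs: int, retention_days: int, metrics_with_tag: int = 1) -> int:
    """Extra time series retained because of one unbounded tag such as job_id.

    Assumes every job run yields a brand-new tag value on each tagged metric
    and that every value is kept for the full retention period.
    """
    return daily_jobs * retention_days * metrics_with_tag


print(added_series(1000, 10))     # 10000: one metric, as in the example
print(added_series(1000, 10, 5))  # 50000: the same tag on five metrics
```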
   
   ### Medium Cardinality / semi-bounded metric
   Example tag: <dag_id>, <task_id>
   
   This category of metrics is semi-bounded. These metrics are not bounded by a predefined set of enum values, but they are bounded by the number of DAGs or tasks within an Airflow deployment. This means that although these metrics lead to higher cardinality as the number of DAGs in a cluster grows, cardinality still remains bounded over time; that is, a given cluster will maintain its level of cardinality over time.
   
   ### Low Cardinality / categorically-bounded metric
   Example tag: <return_code>
   
   This category of metrics is strictly bounded by a fixed set of enum values. <return_code> and <task_state> are good examples of low-cardinality attributes. Ideally, we would only publish metrics with this level of cardinality.
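To make the three categories concrete, the sketch below decomposes the `local_task_job.task_exit` metric from the example into its embedded attributes and labels each one with a category. The `TAG_CARDINALITY` mapping is hypothetical and simply mirrors the example tags named in this issue; it is not an existing Airflow API.

```python
from enum import Enum
from typing import Dict, Tuple


class Cardinality(Enum):
    HIGH = "unbounded"
    MEDIUM = "semi-bounded"
    LOW = "categorically-bounded"


# Hypothetical mapping that mirrors the example tags named in this issue.
TAG_CARDINALITY: Dict[str, Cardinality] = {
    "job_id": Cardinality.HIGH,
    "run_id": Cardinality.HIGH,
    "dag_id": Cardinality.MEDIUM,
    "task_id": Cardinality.MEDIUM,
    "return_code": Cardinality.LOW,
    "task_state": Cardinality.LOW,
}

# Variable parts of local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>, in order.
TASK_EXIT_PREFIX = "local_task_job.task_exit."
TASK_EXIT_FIELDS = ("job_id", "dag_id", "task_id", "return_code")


def classify_task_exit(metric_name: str) -> Dict[str, Tuple[str, Cardinality]]:
    """Map each embedded attribute to (value, cardinality level).

    Assumes none of the values contain a dot; task IDs inside task groups do,
    so a real parser would need a different strategy.
    """
    if not metric_name.startswith(TASK_EXIT_PREFIX):
        raise ValueError(f"not a task_exit metric: {metric_name}")
    values = metric_name[len(TASK_EXIT_PREFIX):].split(".")
    if len(values) != len(TASK_EXIT_FIELDS):
        raise ValueError(f"expected {len(TASK_EXIT_FIELDS)} attributes, got {values}")
    return {
        field: (value, TAG_CARDINALITY[field])
        for field, value in zip(TASK_EXIT_FIELDS, values)
    }


for field, (value, level) in classify_task_exit("local_task_job.task_exit.4711.etl.load.0").items():
    print(f"{field}={value}: {level.value}")
```

Only `job_id` comes out as unbounded here, which is the attribute the proposal below targets.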
   
   Using the above definition of high cardinality, I've identified the following metrics as examples that meet these criteria:
   
   https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L292
   https://github.com/apache/airflow/blob/main/airflow/dag_processing/processor.py#L444
   https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py#L691
   https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py#L1584
   https://github.com/apache/airflow/blob/main/airflow/models/dag.py#L1331
   https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1258
   https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1577
   https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L1847
   
   I would like to propose that we provide an option to disable 'unbounded metrics' in the 2.6.0 release. To ensure backward compatibility, we could keep publishing all metrics by default, but implement a single Boolean flag to disable these high-cardinality metrics.
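A minimal sketch of the proposed single Boolean flag follows, assuming tags are held in a dict before a metric is sent. The flag name `disable_high_cardinality_metrics`, the `UNBOUNDED_TAGS` set and the `should_emit` helper are all hypothetical, not existing Airflow configuration. The default of `False` keeps today's behavior of publishing every metric.

```python
from typing import Dict, FrozenSet

# Hypothetical set of tag keys treated as unbounded (see the High Cardinality section).
UNBOUNDED_TAGS: FrozenSet[str] = frozenset({"job_id", "run_id"})


def should_emit(tags: Dict[str, str], disable_high_cardinality_metrics: bool = False) -> bool:
    """Decide whether a metric is published.

    With the flag left at its default of False, every metric is emitted,
    preserving backward compatibility. When True, any metric carrying an
    unbounded tag key is dropped entirely.
    """
    if not disable_high_cardinality_metrics:
        return True
    # Iterating a dict yields its keys, so this checks tag keys only.
    return UNBOUNDED_TAGS.isdisjoint(tags)


print(should_emit({"job_id": "4711", "dag_id": "etl"}))        # True
print(should_emit({"job_id": "4711", "dag_id": "etl"}, True))  # False
print(should_emit({"dag_id": "etl"}, True))                    # True
```

A single flag is simple to document, but it is all-or-nothing: it cannot keep one unbounded tag while dropping another.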
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   https://github.com/apache/airflow/pull/28961
   https://github.com/apache/airflow/pull/29093
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] syun64 commented on issue #29663: Option to Disable High Cardinality Metrics on Statsd

Posted by "syun64 (via GitHub)" <gi...@apache.org>.
syun64 commented on issue #29663:
URL: https://github.com/apache/airflow/issues/29663#issuecomment-1446392871

   Since the consensus is to allow a configurable, list-driven solution, I will focus on the implementation details of that and open a PR 👍 


[GitHub] [airflow] potiuk closed issue #29663: Option to Disable High Cardinality Metrics on Statsd

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk closed issue #29663: Option to Disable High Cardinality Metrics on Statsd
URL: https://github.com/apache/airflow/issues/29663


[GitHub] [airflow] syun64 commented on issue #29663: Option to Disable High Cardinality Metrics on Statsd

Posted by "syun64 (via GitHub)" <gi...@apache.org>.
syun64 commented on issue #29663:
URL: https://github.com/apache/airflow/issues/29663#issuecomment-1438783502

   Would love to get your input on this issue @potiuk @hussein-awala


[GitHub] [airflow] hussein-awala commented on issue #29663: Option to Disable High Cardinality Metrics on Statsd

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala commented on issue #29663:
URL: https://github.com/apache/airflow/issues/29663#issuecomment-1442583976

   I have checked with DataDog, and as I mentioned in my [PR](https://github.com/apache/airflow/pull/28961#issue-1533973742), activating this option will increase the cost, but the only difference between the tagged metrics and the old prefixed metrics is the `dag_run_id` and `job_id` tags for callbacks.
   
   But as @potiuk has mentioned [before](https://github.com/apache/airflow/pull/28961#issuecomment-1435655836), this detailed information could be an important part of the OTEL specification, so I suggest either adding a new boolean config `detailed_metrics` with default value `False` to enable/disable the `dag_run_id` and `job_id` tags, or a new string config `disabled_tags` that lets users provide a comma-separated list of tags to disable, with `"dag_run_id,job_id"` as the default value.
   
   I prefer the second option because it's more dynamic: we can use it to disable either of the two tags, or a different tag altogether.
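The `disabled_tags` option could work roughly as sketched below, assuming tags are held in a dict before being sent to the StatsD client. The option name and its default value come from the suggestion in this comment; `parse_disabled_tags` and `strip_disabled_tags` are hypothetical helpers. Unlike the single-flag approach, this drops only the listed tags and still emits the metric.

```python
from typing import Dict, FrozenSet

# Default value suggested for the hypothetical `disabled_tags` config.
DEFAULT_DISABLED_TAGS = "dag_run_id,job_id"


def parse_disabled_tags(raw: str) -> FrozenSet[str]:
    """Turn a comma-separated config string into a set of tag keys, ignoring blanks."""
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def strip_disabled_tags(tags: Dict[str, str], disabled: FrozenSet[str]) -> Dict[str, str]:
    """Return a copy of the tags without the disabled keys; the metric itself is still sent."""
    return {key: value for key, value in tags.items() if key not in disabled}


disabled = parse_disabled_tags(DEFAULT_DISABLED_TAGS)
tags = {"dag_id": "etl", "dag_run_id": "manual__2023-02-21", "job_id": "42"}
print(strip_disabled_tags(tags, disabled))  # {'dag_id': 'etl'}
```

Setting the option to an empty string disables nothing, and listing only `job_id` keeps `dag_run_id`, which is the flexibility described above.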
   
   @potiuk what do you think?
   
   (@syun64 I have assigned it to you.)


[GitHub] [airflow] potiuk commented on issue #29663: Option to Disable High Cardinality Metrics on Statsd

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #29663:
URL: https://github.com/apache/airflow/issues/29663#issuecomment-1445154381

   > I prefer the second option because it's more dynamic, and we can use it to disable one of the two tags or another different tag.
   > 
   > @potiuk what do you think?
   
   Yep. Same thought exactly

