Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/06/29 13:53:34 UTC

[GitHub] [airflow] malthe opened a new issue #16706: Improve operational logging and telemetry reporting in hooks

malthe opened a new issue #16706:
URL: https://github.com/apache/airflow/issues/16706


   **Description**
   
   This is a proposal to standardize and improve operational logging and telemetry reporting across the existing hooks.
   
   **Use case / motivation**
   
   There is a need to log operational information such as bytes transferred, transfer rates, rows received, etc.
   
   This need arises partly from an investigatory perspective (i.e. understanding what happened during a task run), but there is also a need to instrument these hooks in order to report metrics in a more structured way, perhaps tying into #12771.
   
   The current state of affairs is that some hook implementations have some logging, but it leaves a lot to be desired:
   
   - Poor use of logging levels
   - Repetitive information is often emitted
   - Useful information about each operation is often left out
   
   Finally, there is no support for summary information. For example, when a task has finished using a hook, it would be useful to be able to emit a summary table.
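   
   For illustration only, a hypothetical sketch of what such per-task summary reporting could look like (the `OperationalStats` name and its methods are made up for this example, not an existing Airflow API):
   
   ```python
   # Hypothetical sketch only -- not an existing Airflow API.
   import logging
   from dataclasses import dataclass, field
   
   log = logging.getLogger(__name__)
   
   
   @dataclass
   class OperationalStats:
       """Counters a hook could accumulate while a task uses it."""
   
       bytes_transferred: int = 0
       rows_received: int = 0
       operations: dict = field(default_factory=dict)
   
       def record(self, operation: str, *, nbytes: int = 0, rows: int = 0) -> None:
           self.bytes_transferred += nbytes
           self.rows_received += rows
           self.operations[operation] = self.operations.get(operation, 0) + 1
   
       def report_summary(self) -> None:
           # Emitted once, when the task is done using the hook.
           log.info(
               "Hook summary: %d bytes transferred, %d rows received, operations=%s",
               self.bytes_transferred, self.rows_received, self.operations,
           )
   ```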
   
   **Are you willing to submit a PR?**
   
   Yes
   
   **Related Issues**
   
   - #12771
   


[GitHub] [airflow] potiuk edited a comment on issue #16706: Improve operational logging and telemetry reporting in hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #16706:
URL: https://github.com/apache/airflow/issues/16706#issuecomment-909732961


   Hmm. I spent a bit of time reading. The title suggested that you wanted to focus on "Hooks", and that made me think you wanted to instrument particular hooks (each hook separately), but I have a different understanding now - please correct me if I am wrong or if this needs some correction :).
   
   Let me rephrase it all and describe in my own words how I understand what should be done. I might be repeating what you expressed before; I just want to make sure that we understand it the same way.
   
   I think I misunderstood, when first reading your answer, the "contribute opentelemetry-instrumentation-X packages" part.
   
   Did you mean contribute an `opentelemetry-instrumentation-airflow`?
   Or did you mean just adding dependencies on the existing instrumentation packages in airflow's setup.py and providing some way of enabling them?
   
   If it's the former - Airflow does not feel like a good candidate for an `opentelemetry-instrumentation-airflow`. Hardly anyone uses airflow as a library in the sense that some other code would want to instrument airflow. So I assume you mean the latter, and I will work with that assumption.
   
   1) Integrating particular libraries
   
   I think it would be nice to add the existing instrumentations (psycopg2, mysql/pymysql, celery, boto, botocore, flask, sqlalchemy, httpx, urllib3, redis, requests, sqlite3, wsgi) - those could give low-level traceable data that might be really useful to start with. Necessarily with some way of enabling and configuring it - for example someone might want to disable tracing for postgres, because it might hurt performance if instrumentation is enabled for all calls, or maybe selectively enable cursor tracking. Basically, we need to figure out for each of the libraries what should be configurable and how - likely in airflow.cfg as a new [opentelemetry] section.
   
   We also need to figure out how to add those packages (ideally as extras: `pip install apache-airflow[opentelemetry-instrumentation-psycopg2]` should install the psycopg2 instrumentation). Possibly instrumentation of a particular library should even be enabled by default when the given package is installed (to make things easier), with the capability of disabling it selectively. I think those instrumentations should be added one-by-one; there is really no way to make a general, dynamic instrumentation (but it's rather easy to do it one-by-one). Extras would also be useful to generate a "golden" set of constraints for the versions of those telemetry libraries supported by Airflow (same as the providers' extras). One more useful thing here would be to be able to disable/enable such tracing selectively WHILE the application is running, but that might be a bit more complex (though long-term doable, I believe).
   
   [HERE: documentation update needed on how to install and enable telemetry for the libraries]
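   
   For illustration, a minimal sketch of what the selective enabling could look like (the `[opentelemetry]` section and option names below are hypothetical; the instrumentor classes come from the standard `opentelemetry-instrumentation-*` packages):
   
   ```python
   # Sketch only: the [opentelemetry] config options are hypothetical;
   # the instrumentors are the standard opentelemetry-instrumentation-* ones.
   from airflow.configuration import conf
   
   
   def enable_instrumentations() -> None:
       if conf.getboolean("opentelemetry", "instrument_psycopg2", fallback=False):
           try:
               from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
               Psycopg2Instrumentor().instrument()
           except ImportError:
               pass  # matching extra not installed; skip silently
   
       if conf.getboolean("opentelemetry", "instrument_requests", fallback=False):
           try:
               from opentelemetry.instrumentation.requests import RequestsInstrumentor
               RequestsInstrumentor().instrument()
           except ImportError:
               pass
   ```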
   
   So far, so good. This could give us the low-level data collection. Now, what to do with it?
   
   2) Exporting the data
   
   We need a way to define exporters extensibly. Similarly to how we define Secret Backends, we should be able to register/instantiate exporter(s) and configure them (again in airflow.cfg). Sounds fairly easily doable. Worst case, we can implement thin wrappers around the exporters to map configuration from some dictionary form to actually constructing and instantiating them. There are some things to solve for the PUSH vs. PULL approach (authentication? how to make sure the ports are exposed in a deployment? how to nicely embed those networking pieces in our Helm Chart?).
   
   [HERE: documentation will be needed on how to configure several popular exporters - Prometheus/DataDog/CloudWatch/CloudTrace etc.]
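   
   As a sketch of the "thin wrapper" idea - loading an exporter class from a dotted path defined in a (hypothetical) config option, similar to how Secret Backends are configured today:
   
   ```python
   # Sketch only: the "exporter_class" / "exporter_kwargs" options are hypothetical.
   import importlib
   import json
   
   from opentelemetry import trace
   from opentelemetry.sdk.trace import TracerProvider
   from opentelemetry.sdk.trace.export import BatchSpanProcessor
   
   from airflow.configuration import conf
   
   
   def configure_exporter() -> None:
       class_path = conf.get(
           "opentelemetry", "exporter_class",
           fallback="opentelemetry.sdk.trace.export.ConsoleSpanExporter",
       )
       kwargs = json.loads(conf.get("opentelemetry", "exporter_kwargs", fallback="{}"))
   
       # Resolve the dotted path, build the exporter and register it globally.
       module_name, _, class_name = class_path.rpartition(".")
       exporter_cls = getattr(importlib.import_module(module_name), class_name)
   
       provider = TracerProvider()
       provider.add_span_processor(BatchSpanProcessor(exporter_cls(**kwargs)))
       trace.set_tracer_provider(provider)
   ```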
   
   That sounds like we could already have usable data and the ability to view it at this point.
   
   3) Propagating context and correlation.
   
   From what I read, spans can allow us to correlate events generated by those different libraries. Seems like we could define them for different "parts" of the system: some spans can be defined just for "executing a task", some for "parsing", some for the "scheduling" parts. I think all of them need this treatment (not only hooks), as those parts are related and can interfere with each other.
   
   There will be some difficulties to solve - we are using multi-processing and forking heavily in parts of the system; celery workers work continuously across different tasks (so different spans) in CeleryExecutor, while the K8S executor always starts a new pod per task, etc. Implicit propagation will likely work in some cases, but it might require some effort to make it complete. That will likely be the most complex part (but it can also be done in stages). I think we need "Task"-level correlation at a minimum (all events belonging to a particular task). Not sure if it is possible to add a DAG span on top of the Task as a "hierarchy". Might be useful.
   
   [Here: we need to document the spans and the attributes they have]
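   
   For the multi-processing/forking case, a minimal sketch of explicit context propagation (the span names are illustrative; `inject`/`extract` are the standard OpenTelemetry propagation API):
   
   ```python
   # Sketch: carry the trace context across a process boundary in a plain dict.
   from multiprocessing import Process
   
   from opentelemetry import trace
   from opentelemetry.propagate import extract, inject
   
   tracer = trace.get_tracer(__name__)
   
   
   def run_task_in_child(carrier: dict) -> None:
       parent_ctx = extract(carrier)  # re-attach the parent's trace context
       with tracer.start_as_current_span("airflow.task", context=parent_ctx):
           ...  # actual task execution; instrumented library calls nest here
   
   
   def launch_task() -> None:
       with tracer.start_as_current_span("airflow.scheduling"):
           carrier: dict = {}
           inject(carrier)  # serialise the current span context
           Process(target=run_task_in_child, args=(carrier,)).start()
   ```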
   
   Now the really useful part:
   
   4) Metrics 
   
   I understand that each of the instrumented libraries will already provide its own metrics (so urllib3 will provide "bytes transferred", for one), and we do not have to do anything to get them once we have Spans - they will nicely correlate all the urllib3 calls (so we will immediately be able to see how much data was transferred via urllib3 in task X). That sounds cool, but then we can also look at the metrics we already have which are specific to Airflow, and add them in.
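   
   Adding an Airflow-specific metric next to the library-provided ones could then be as simple as the sketch below (the metric and attribute names are made up):
   
   ```python
   # Sketch: an Airflow-specific metric recorded via the OpenTelemetry metrics API.
   from opentelemetry import metrics
   
   meter = metrics.get_meter(__name__)
   
   task_delay = meter.create_histogram(
       "airflow.task.schedule_delay",
       unit="s",
       description="Delay between scheduled and actual task start",
   )
   
   
   def record_delay(dag_id: str, task_id: str, delay_seconds: float) -> None:
       task_delay.record(delay_seconds, {"dag_id": dag_id, "task_id": task_id})
   ```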
   
   Uff. Finally got my thoughts written down...
   
   
   Now, let me summarise what we can get compared to the current statsd/logging/metrics we have:
   
   * by using those instrumentation libraries, we will automatically get low-level tracing for them [we do not have that now]
   
   * by correlating the events from different libraries with the "higher-level events" of Airflow (for example task execution) and the "higher-level metrics" of Airflow (for example delay in task execution), we can have a consistent view of how "each task" behaves (so to speak) [we only have the high-level metrics now - without the low-level correlated data]
   
   * by plugging in the existing exporters, we have an easy way to open up to a number of different tracing solutions - not only Grafana, Prometheus etc., but also AWS CloudWatch, GCP Cloud Trace etc. - those basically all come for free.
   
   Do I get it right? Are we on the same page?
   


[GitHub] [airflow] malthe commented on issue #16706: Improve operational logging and telemetry reporting in hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on issue #16706:
URL: https://github.com/apache/airflow/issues/16706#issuecomment-908309524


   I think perhaps a starting point is to contribute "opentelemetry-instrumentation-X" packages (e.g. [opentelemetry-instrumentation-psycopg2](https://pypi.org/project/opentelemetry-instrumentation-psycopg2/)).
   
   It should be possible then to [add a span around task execution](https://opentelemetry.io/docs/java/manual_instrumentation/#create-nested-spans) to logically group an arbitrary and unknown set of events and/or metrics produced by would-be instrumented libraries used during the execution of a single task.
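   
   In Python (rather than the Java docs linked above), a minimal sketch of that wrapping span could look like this - the span and attribute names are illustrative, and `task_instance.run()` stands in for the actual task execution:
   
   ```python
   # Sketch: a span wrapped around one task execution; calls made by instrumented
   # libraries (e.g. psycopg2 after Psycopg2Instrumentor().instrument()) become
   # nested child spans and are thereby grouped under the task.
   from opentelemetry import trace
   
   tracer = trace.get_tracer("airflow.task")
   
   
   def execute_task(task_instance):
       with tracer.start_as_current_span("task.run") as span:
           span.set_attribute("dag_id", task_instance.dag_id)
           span.set_attribute("task_id", task_instance.task_id)
           task_instance.run()
   ```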
   
   The next step might then be to add a metrics/events tab to the task instance page and display events/metrics emitted during the execution – for example, if exporting telemetry to Prometheus, then we can use the [HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/#expression-queries) to pull this data dynamically.


[GitHub] [airflow] potiuk commented on issue #16706: Improve operational logging and telemetry reporting in hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #16706:
URL: https://github.com/apache/airflow/issues/16706#issuecomment-872599076


   How would you approach it @malthe? You wrote that you are willing to make a PR, but I am not sure you fully realise the scope of it (or maybe I do not understand something :).
   
   Are you thinking about proposing/enforcing some standard of telemetry and implementing it separately for each hook? All at once? Gradually? Or about implementing some automated telemetry logging via BaseHook / python magic / automation?
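   
   (Purely for illustration, the "python magic" variant could be something along these lines - a hypothetical metaclass that wraps every public hook method in a timing/logging decorator; this is a sketch, not a proposal of the actual mechanism:)
   
   ```python
   # Hypothetical sketch of auto-instrumenting hook methods at class-creation time.
   import functools
   import inspect
   import logging
   import time
   
   log = logging.getLogger(__name__)
   
   
   def _traced(method):
       @functools.wraps(method)
       def wrapper(self, *args, **kwargs):
           start = time.monotonic()
           try:
               return method(self, *args, **kwargs)
           finally:
               log.info("%s.%s took %.3fs", type(self).__name__,
                        method.__name__, time.monotonic() - start)
       return wrapper
   
   
   class InstrumentedHookMeta(type):
       """Wrap every public method of a hook subclass in the _traced decorator."""
   
       def __new__(mcls, name, bases, namespace):
           for attr, value in list(namespace.items()):
               if inspect.isfunction(value) and not attr.startswith("_"):
                   namespace[attr] = _traced(value)
           return super().__new__(mcls, name, bases, namespace)
   ```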
   
   I think it is a good idea to standardize, but I am simply afraid that this might be rather difficult, taking into account that each hook's implementation uses different libraries to communicate and each has a different set of methods - which are not (and cannot be) standardised. And taking into account the sheer number of those.
   
   Just a few numbers/back-of-the-envelope calculation: 
   
   We have 162 Hooks currently, and many of them have 5-10 (some ~20-30) methods that communicate externally.
   
   I did a very rough grep: `find . -name '*.py' | grep '/hooks/' | grep -v _init_ | grep -v test | xargs grep -e " def.*(" | grep -v "def _" | wc -l`
   Result: 2961
   
   You can remove the final 'wc' and see that the vast majority of those are 'legitimate' hook methods that should be instrumented. Roughly speaking, we have ~3000 methods (excluding constructors and internal methods) that we would need to instrument if we do it manually, one by one. If it takes 20 minutes per method (which is unrealistically low once you include testing/PR approval/review), we are talking about 1000 hours of work (about 6 man-months).
   
   I am not against it, I just wanted to think through how realistic it is to standardise something like that across the whole codebase (if we cannot automate it) - and whether it's worth it, if it is that much of an effort. Of course it might be distributed among the contributors, but then it has to be organized and tracked, people need to be found to own parts of the code and be convinced, testing needs to be done, etc. This is all doable, but it's quite an effort that will take literally a year or so to complete. We could of course just define the standard and hope for the best that people will implement it over time - but then it would take much longer to complete.
   
   I'd really love to hear what your thinking is here.


[GitHub] [airflow] boring-cyborg[bot] commented on issue #16706: Improve operational logging and telemetry reporting in hooks

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #16706:
URL: https://github.com/apache/airflow/issues/16706#issuecomment-870593857


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


