Posted to dev@airflow.apache.org by Jarek Potiuk <po...@apache.org> on 2022/07/11 16:44:33 UTC

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Sorry for the late reply - Ping.

TL;DR: I think the metrics might be useful, but I think using triggers
is asking for trouble.

While using triggers sounds like a common approach in a number of
installations, we do not use triggers so far.
Using triggers moves some logic to the database, and in our case we
have none there at all - all logic is in Airflow, and we keep it there;
the database for us is merely "state" storage and "locks". Adding
database triggers extends it to also keep some logic there. And
adding triggers has some worrying "implicitness" which goes against
the "Explicit is better than Implicit" Zen of Python.

One thing that makes me think "coldly" about this is that it might
have some undesired side effects - such as synchronization of changes
from multiple schedulers when trying to insert such an audit entry (you
need to acquire an index lock when you insert rows into a table which
has primary key/unique indexes).

And what's even more worrying is that we are using SQLAlchemy and
MySQL/MSSQL/Postgres, and we would have to make sure it works the same
in all of them. This is troublesome.

Even if we could solve and verify all those problems individually, the
effect is: once we open the "gate" of triggers, we will get more "OK,
we have a trigger here, so let's also use it for that and this", and it
will be hard to say "no" once we have a precedent. This might lead to
more and more logic and features being deferred to the database
(and my past experience is that it leads to more complexity and
implicit behaviours that are difficult to reason about).

But this is only about the technical details, not the metric
itself. I think the metric you proposed is very useful.

I think, however (correct me if I am wrong), that we do not need
database triggers for any of those. I have a feeling that this
proposal is trying to implement the (useful) metrics with very limited
modification to the Airflow code, so I can understand that you might
think about it this way when you have your own fork - then it makes
sense to piggyback on the existing database and use triggers, because
you do not want to modify Airflow code.

But here - we are in a completely different situation. We CAN modify
Airflow code and add missing features and functionality to capture the
necessary metric data in the code, rather than using triggers. We
could even define some kind of callbacks for the auditing events that
would allow us to gather those metrics in a way that does not even use
the database to store the information for the metrics.
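
To make that idea concrete, a rough sketch of such a callback hook (purely
hypothetical - Airflow does not define this interface today, and all names
here are made up) could look like this:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Callable, List

    @dataclass
    class TIStateChangeEvent:
        """One task-instance state transition, captured in code rather than by a DB trigger."""
        dag_id: str
        task_id: str
        run_id: str
        previous_state: str
        new_state: str
        timestamp: datetime

    _audit_callbacks: List[Callable[[TIStateChangeEvent], None]] = []

    def register_audit_callback(callback: Callable[[TIStateChangeEvent], None]) -> None:
        # The scheduler / task runner would invoke every registered callback on
        # each state change, so a metrics backend can consume the events without
        # touching the metadata database at all.
        _audit_callbacks.append(callback)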

In fact, this leads me to the conclusion that we should implement the
metrics you mention as part of our OpenTelemetry effort:
 https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow.
This is precisely what it was prepared for. Once we have
OpenTelemetry integrated, we could add more and more such useful
metrics more easily, and that could be way more useful, because
instead of running an external custom DB-reading process for that, we
could not only calculate such metrics using the right metrics tooling
(each company could use their preferred OpenTelemetry-compliant
tool), but that would also open up features like alerting,
connecting them with traces and other metrics, etc.
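
For illustration, emitting such a metric through the OpenTelemetry Python API
could look roughly like this (assuming the opentelemetry-api package; the
metric name and attributes are placeholders, not an agreed definition):

    from opentelemetry import metrics

    meter = metrics.get_meter("airflow.scheduler")
    scheduling_delay = meter.create_histogram(
        "airflow.ti.scheduling_delay",
        unit="s",
        description="Delay between a TI's dependencies being met and the TI starting",
    )

    def record_scheduling_delay(dag_id: str, task_id: str, delay_seconds: float) -> None:
        # The backend (Datadog, Prometheus, ...) is configured via OTel SDK
        # exporters, so each company can plug in their preferred tool.
        scheduling_delay.record(
            delay_seconds, attributes={"dag_id": dag_id, "task_id": task_id}
        )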

Howard - WDYT?

J.






On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka
<vi...@astronomer.io.invalid> wrote:
>
> Hi Ping,
>
> Apologies for the belated response.
>
> We have created a set of stress test DAGs where the tasks take almost no time to execute at all, so that the worker task execution time is small, and the stress is on the Scheduler and Executor.
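>
> (For illustration only - not the actual test DAGs - such a DAG can be as
> simple as a long chain of near-instant tasks, e.g. with Airflow 2.3+'s
> EmptyOperator.)
>
>     from datetime import datetime
>
>     from airflow import DAG
>     from airflow.operators.empty import EmptyOperator
>
>     with DAG(
>         dag_id="stress_test_chain",
>         start_date=datetime(2022, 1, 1),
>         schedule_interval="@hourly",
>         catchup=False,
>     ) as dag:
>         previous = EmptyOperator(task_id="task_0")
>         for i in range(1, 50):
>             # Each task finishes almost instantly, so what gets measured is
>             # scheduler/executor overhead rather than task work.
>             current = EmptyOperator(task_id=f"task_{i}")
>             previous >> current
>             previous = current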
>
> We then calculate "task latency" aka "task lag" as:
>  ti_lag = ti.start_date - max_upstream_ti_end_date
> This is effectively the time between "the downstream task starting" and "the last dependent upstream task completing".
>
> Tasks that don't have any upstream tasks are not used in this metric for measuring task lag.
> And for tasks that have multiple upstream tasks, we use the upstream task with the latest end_date, since the scheduler waits for completion of all parent tasks before scheduling any downstream task.
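>
> (A rough sketch of that calculation - the data structures are illustrative
> assumptions, not the actual harness code.)
>
>     from datetime import datetime
>     from typing import Dict, List, Optional
>
>     def task_lag(
>         ti_start: Dict[str, datetime],   # task_id -> ti.start_date
>         ti_end: Dict[str, datetime],     # task_id -> ti.end_date
>         upstream: Dict[str, List[str]],  # task_id -> upstream task_ids
>         task_id: str,
>     ) -> Optional[float]:
>         parents = upstream.get(task_id, [])
>         if not parents:
>             # Tasks without upstream tasks are excluded from the metric.
>             return None
>         max_upstream_ti_end_date = max(ti_end[p] for p in parents)
>         return (ti_start[task_id] - max_upstream_ti_end_date).total_seconds()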
>
> Vikram
>
>
> On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pi...@umich.edu> wrote:
>>
>> Hi Mehta,
>>
>> Good point. The primary goal of the metric is stress testing, to catch airflow scheduler performance regressions for 1) our internal scheduler improvement work and 2) airflow version upgrades.
>>
>> One of the key benefits of this metric definition is that it is independent of the scheduler implementation, and it can be computed/backfilled offline.
>>
>> Currently, we expose it to Datadog, and we (the airflow cluster maintainers) are its main users.
>>
>> Thanks,
>>
>> Ping
>>
>>
>> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham <sh...@amazon.com.invalid> wrote:
>>>
>>> Ping,
>>>
>>>
>>>
>>> I’m very interested in this as well. A good metric can help us benchmark and identify potential improvements in scheduler performance.
>>> In order to understand the proposal better, can you please share where and how you intend to use “Scheduling delay”? Is it meant for benchmarking or stress testing only? Do you plan to expose it to users in the Airflow UI?
>>>
>>>
>>>
>>> Thanks
>>> Shubham
>>>
>>>
>>>
>>>
>>>
>>> From: Ping Zhang <pi...@umich.edu>
>>> Reply-To: "dev@airflow.apache.org" <de...@airflow.apache.org>
>>> Date: Wednesday, June 8, 2022 at 11:58 AM
>>> To: "dev@airflow.apache.org" <de...@airflow.apache.org>, "vikram@astronomer.io" <vi...@astronomer.io>
>>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric Definition
>>>
>>>
>>>
>>>
>>>
>>>
>>> Hi Vikram,
>>>
>>>
>>>
>>> Thanks for pointing that out - 'task latency':
>>>
>>>
>>>
>>> "we define task latency as the time it takes for a task to begin executing once its dependencies have been met."
>>>
>>>
>>>
>>> It would be great if you could elaborate more on "begin executing" and how you calculate "its dependencies have been met".
>>>
>>>
>>>
>>> If 'begin executing' means that the state of the ti becomes running, then the 'Scheduling Delay' metric focuses on the overhead introduced by the scheduler.
>>>
>>>
>>>
>>> In our prod and stress test environments, we use the `task_instance_audit` table (a new row is created whenever there is a state change in the task_instance table) to compute the time at which a ti should be scheduled.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Ping
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka <vi...@astronomer.io.invalid> wrote:
>>>
>>> Ping,
>>>
>>>
>>>
>>> I am quite interested in this topic and trying to understand the difference between the "scheduling delay" metric as articulated and the "task latency" aka "task lag" metric which we have been using before.
>>>
>>>
>>>
>>> As you may recall, we have been using two specific metrics to benchmark Scheduler performance, specifically "task latency" and "task throughput" since Airflow 2.0.
>>>
>>> These were described in the 2.0 Scheduler blog post.
>>> Specifically, within that we defined task latency as the time it takes for the task to begin executing once its dependencies are all met.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vikram
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pi...@umich.edu> wrote:
>>>
>>> Hi Airflow Community,
>>>
>>>
>>>
>>> Airflow is a scheduling platform for data pipelines; however, there is no good metric to measure scheduling delay in production or in the stress test environment. This makes it hard to catch regressions in the scheduler during the stress test stage.
>>>
>>>
>>> I would like to propose an airflow scheduling delay metric definition. Here is the detailed design of the metric and its implementation:
>>>
>>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
>>>
>>> Please take a look and any feedback is welcome.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Ping
>>>
>>>

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Jarek Potiuk <ja...@potiuk.com>.
Sure

On Tue, Jul 26, 2022 at 7:41 PM Ping Zhang <pi...@umich.edu> wrote:

> Hi Jarek,
>
> Thanks a lot for the thorough response - those are all legitimate
> concerns.
>
> How about I prepare a more thorough doc to address your concerns, and
> we can continue the discussion from there?
>
> Thanks,
>
> Ping
>
>
>
> Thanks,
>
> Ping
>
>
> On Tue, Jul 26, 2022 at 6:53 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> If I understand correctly, the idea is to run an additional set of stress
>> tests before releasing a version - without impacting the production version
>> of Airflow.
>>
>> I think if this is something that we want to make part of our release
>> process, then submitting code somewhere is the last thing to do. Just
>> submitting the code does not mean that it will be executed.
>>
>> Note - this is my personal view on it, and I am not sure if this is others'
>> view as well, but it comes from years of being involved in the release
>> process and doing it myself - and volunteering to do part of the process
>> (and improve and perfect it).
>>
>> I think the first thing here is to have several answers:
>>
>> a) do we want to do it
>> b) who will do it
>> c) when this will be done in the release process
>> d) what infrastructure will be used to run the tests
>>
>> The actual "completion" of our release process is described in
>> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md
>> and there is of course a release manager running the release process. This has
>> usually been Jed, and recently Ephraim, but it is generally whoever from the PMC
>> members (or committers, if they have a PMC member ready to sign the
>> artifacts) raises their hand and says "Hey, I want to be a release
>> manager". Similarly, we have a release process for providers:
>> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md.
>> But for "MINOR releases" the community process starts when about 90% of
>> testing effort has already been spent.
>>
>> The important thing is that the release manager is NOT doing testing (and
>> the release process from ASF does not even touch on the subject). The release
>> manager has the role of executing the "mechanics" to produce
>> release artifacts and start voting, and has the power to decide
>> single-handedly whether the release should be cancelled (fully or in parts) if
>> there are some issues found (more info on the role and release process here:
>> https://www.apache.org/legal/release-policy.html#management). In case we
>> release Providers or Airflow (especially the PATCHLEVEL ones), we delegate
>> a big part of the testing to whoever was involved in preparing fixes - via
>> the "Status of testing" issue (quite successfully, I think). For MINOR
>> "airflow" releases it's much more complex; usually a lot of testing is done by
>> stakeholders - mostly Astronomer, who donates a HUGE amount of testing time to
>> Airflow MINOR releases (and this is one of the reasons why Astronomer is
>> able to release new Airflow versions much faster than anyone else: because
>> they run a lot of tests in their own infrastructure - and this is actually a
>> great contribution to the community :). This has a huge mutual benefit for
>> both - the community and Astronomer.
>>
>> Now - if we are going to do the stress testing before releasing Airflow -
>> the question is who will be doing that. This is quite an effort (I believe)
>> and it requires quite an infrastructure. And if we are to donate any kind
>> of testing harness to the community - it only makes sense if there is a way
>> the community can use it. For example, there are (from what I hear) many
>> test scenarios and scripts that Astronomer has and follows, but they are not
>> donated to the community. It simply makes no sense, because we have no
>> capacity to run those tests, nor a process for following those test
>> scenarios.
>>
>> So now - the question is - who will run such tests and with what
>> infrastructure? I am not sure what kind of infrastructure it might
>> require, but I think the only way to make it part of the community process
>> is to fully automate it in our CI.
>>
>> This is, for example, what happened with the Docker Image and especially with
>> the ARM version of it. Building and running it requires - generally
>> speaking - ARM hardware, and it is a heavy CPU-and-network process - so
>> until I automated it, it was my personal "commitment" that I would build the
>> image (with the goal that we would be able to fully automate it). Until then,
>> releasing the images was not a "community" duty, but "Jarek Potiuk"'s duty
>> (I did automate it from the very beginning, but it took some time and
>> effort to implement) - but we finally got this nice and simple CI workflow -
>> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#manually-prepare-production-docker-image
>> - that the release manager (whoever the release manager is) can trigger to
>> release the images.
>>
>> So I think if we want to add such a "test harness before release" step,
>> answering the questions above is important. If we are to get it in the
>> community, we need to know what kind of infrastructure it requires, whether
>> it is fully automatable (eventually) and ready-to-use by whoever can
>> trigger it before the release. And when it comes to such testing, there is
>> one more important question - what do we do with the results? Is there a
>> trigger that should make the release manager say "OK - those results are
>> bad enough to not release"? Do we know what that trigger is? Do we know how
>> to interpret the results? Is it documented, and have we run it already on a
>> few releases to get some baseline?
>>
>> I think there are two basic paths we can follow:
>>
>> 1) some stakeholder (ideally the one it came from - AirBnB in this case)
>> commits to the burden of running it and reporting the results before every
>> release (similar to Astronomer - if you noticed, in the few days/weeks before
>> a release there is a flurry of stability issues and fixes coming from
>> Astronomer, usually as a result of this testing). Similarly, for the Production
>> Image it could be just "Jarek Potiuk" as a stakeholder, because that was
>> small enough for me to handle.
>>
>> 2) if such a test harness is to be donated to Airflow, then it must be
>> preceded by a few releases where point 1) is done by someone who commits
>> to it and makes sure all the "wrinkles" are removed. The release process
>> should be smooth and tested. It should not introduce any more friction to
>> the process and delay it, so running it for a number of releases is a must
>> (this is what I did for the Production Image first and then for the ARM
>> Image version). That someone needs to volunteer and commit to it (same as I
>> did for Prod Image).
>>
>> This is how I see it. I think committing code to a repo is likely
>> somewhere around 50% of the project, where it is already run and at least
>> semi-automated for a few releases (if we are going to go route 2). Or
>> it is not needed at all (it can be kept in AirBnB, for example, or with whoever
>> wants to commit to doing it) if we are going route 1).
>>
>> But I am curious if my understanding of it is also what others understand.
>>
>> J.
>>
>>
>> On Tue, Jul 26, 2022 at 1:54 AM Ping Zhang <pi...@umich.edu> wrote:
>>
>>> Hi Jarek,
>>>
>>> A friendly bump on this thread. What are your thoughts on having a scheduler
>>> perf test before each release and incorporating this metric?
>>>
>>> Also, is there a devops git repo to put these files/logic in?
>>>
>>> Thanks,
>>>
>>> Ping
>>>
>>>
>>> On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <pi...@umich.edu> wrote:
>>>
>>>> Hi Jarek,
>>>>
>>>> Yep, it is more useful in the stress test stage before releasing a new
>>>> version, with some extra setup, to ensure there is no scheduler performance
>>>> degradation due to a release. This can also help to find the scaling limits
>>>> of the scheduler under a certain SLA, like the upper limit of the number of
>>>> tasks in a dag, the total number of dag files in a cluster, concurrent running
>>>> dag runs, etc.
>>>>
>>>> Very good point about synthetic dag files in the stress test. Our team
>>>> is working on a stress test framework that can directly use all production
>>>> dag files, to ensure the stress test has the same set of prod dags, but it
>>>> will skip the task execution. It can also generate different kinds of dags
>>>> (varying the number of tasks, levels, etc.).
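>>>>
>>>> (One hypothetical way to skip execution - not necessarily what our
>>>> framework does - is a cluster policy in airflow_local_settings.py that
>>>> turns every task into a no-op.)
>>>>
>>>>     from airflow.models.baseoperator import BaseOperator
>>>>
>>>>     def task_policy(task: BaseOperator) -> None:
>>>>         # Shadow execute() on each task object so workers return
>>>>         # immediately; DAG parsing and scheduling stay as in production.
>>>>         task.execute = lambda context: None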
>>>>
>>>> Monitoring production issues for particular DAGs or times of the day
>>>> is a different issue. I agree that in prod, we should not let the scheduler
>>>> calculate the `dependency met` time.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Ping
>>>>
>>>>
>>>> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> I think if we limit it to stress tests, this could be an "extra"
>>>>> addition - not even necessarily part of the Airflow codebase - adding
>>>>> triggers with a script, on a single database, as some kind of
>>>>> test harness that you always add after you have installed airflow in a test
>>>>> environment. For that I have far fewer reservations about using triggers.
>>>>>
>>>>> But if we want to measure the delays in production, that's quite a
>>>>> different story (and different purpose):
>>>>>
>>>>> * The stress tests are synthetic, and basically what you will get out
>>>>> of them is "is this version worse/better than the previous one?",
>>>>> "How much?", "Which synthetic scenarios are affected most?". Those will
>>>>> be done with a few synthetic kinds of traffic/load/shape.
>>>>> * The production is different - you really want to see if you have
>>>>> some problems with particular DAGs, times of the day, week, load etc.,
>>>>> and you should be able to take some corrective actions (for example
>>>>> increase the number of schedulers, or queues, split your dags etc.) - so
>>>>> even if the "scheduling delay" metric might sound familiar, you might
>>>>> want to use completely different dimensions to look at it (how about
>>>>> this DAG? this time of day? this group of dags? this type of workload?
>>>>> etc.).
>>>>>
>>>>> I think those two might even be separated and calculated differently
>>>>> (though having a single approach would likely be better). I am not
>>>>> entirely sure, but I have a feeling we do not need the scheduler to
>>>>> calculate the "dependency met" time while scheduling. I think for
>>>>> production purposes, it would be much better (less overhead) to simply
>>>>> emit "raw" metrics such as the start/end time of each task, plus
>>>>> possibly simple publishing of - mostly static - task dependency rules
>>>>> - then the "dependency met" time can be calculated offline based on joined
>>>>> data. That would be roughly equivalent to what you have in the
>>>>> trigger, but without the overhead of triggers - simply, instead of
>>>>> storing the events in the metadata db, we would emit them (for example
>>>>> using otel) and let the external system aggregate them and process them
>>>>> offline independently.
>>>>>
>>>>> The OTEL integration is rather lightweight - most implementations use
>>>>> in-memory buffers and efficiently push the data (and can even
>>>>> implement scalable forwarding of the data and pre-aggregation). The
>>>>> nice thing about it is that it can scale much more easily. I think that
>>>>> (apart from my earlier reservation) the database-trigger approach has the
>>>>> not-nice property that the more workers and schedulers you have, the
>>>>> more "centralized overhead" you have, whereas the distributed OTEL
>>>>> solution scales together with the system, adding more or less fixed
>>>>> overhead per component (provided that the remote telemetry service is
>>>>> also scalable). This makes the trigger approach far less suitable IMHO,
>>>>> as we are getting dangerously close to Heisen-monitoring, where the
>>>>> more we observe the system, the more we impact its performance.
>>>>>
>>>>> J.
>>>>>
>>>>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <pi...@umich.edu> wrote:
>>>>> >
>>>>> > Hi Jarek,
>>>>> >
>>>>> > Thanks for the insights and pointing out the potential issues with
>>>>> triggers in the prod with scheduler HA setup.
>>>>> >
>>>>> > The solution that I proposed is mainly for stress testing the scheduler
>>>>> > before each airflow release. We could make changes in the airflow codebase
>>>>> > to emit this metric; however:
>>>>> >
>>>>> > 1. It will incur additional overhead for the scheduler to compute the
>>>>> > metric, as the scheduler needs to compute the dependency-met time of a task.
>>>>> > 2. It couples with the implementation of the scheduler. For example,
>>>>> > from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric is
>>>>> > emitted from the scheduler, then when making changes in the scheduler, one
>>>>> > also needs to update how the metric is computed and emitted.
>>>>> >
>>>>> > Thus, I think having it outside the airflow core makes it easier to
>>>>> > compare the scheduling delay across different airflow versions.
>>>>> >
>>>>> > Thanks for pointing out the OpenTelemetry, let me check it out.
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Ping

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

Thanks a lot for the thorough response - those are all legitimate concerns.

How about I prepare a more thorough doc to address your concerns, and
we can continue the discussion from there?

Thanks,

Ping



Thanks,

Ping



Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Jarek Potiuk <ja...@potiuk.com>.
If I understand correctly, the idea is to run an additional set of stress
tests before releasing a version - without impacting the production version
of Airflow.

I think if this is something that we want to make part of our release
process, then submitting code somewhere is the last thing to do. Just
submitting the code does not mean that it will be executed.

Note - this is my personal view on it, and I am not sure if this is others'
view as well, but it comes from years of being involved in the release
process and doing it myself - and volunteering to do part of the process
(and improve and perfect it).

I think the first thing here is to have several answers:

a) do we want to do it
b) who will do it
c) when this will be done in the release process
d) what infrastructure will be used to run the tests

The actual "completion" of our release process is described in
https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md
and there is of course a release manager running the release process. This has
usually been Jed, and recently Ephraim, but it is generally whoever from the PMC
members (or committers, if they have a PMC member ready to sign the
artifacts) raises their hand and says "Hey, I want to be a release
manager". Similarly, we have a release process for providers:
https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md.
But for "MINOR releases" the community process starts when about 90% of
testing effort has already been spent.

The important thing is that the release manager is NOT doing the testing (and
the release process from the ASF does not even touch on the subject). The
release manager has the role of executing the "mechanics" to produce the
release artifacts and start voting, and has the power of deciding
single-handedly whether the release should be cancelled (fully or in parts)
if some issues are found (more info on the role and the release process here:
https://www.apache.org/legal/release-policy.html#management). In case we
release Providers or Airflow (especially the PATCHLEVEL ones) - we delegate
a big part of the testing to whoever was involved in preparing the fixes - via
the "Status of testing" issue (quite successfully, I think). For a MINOR
"airflow" release it's much more complex: usually a lot of testing is done by
stakeholders - mostly Astronomer, who donates a HUGE amount of testing time to
Airflow MINOR releases (and this is one of the reasons why Astronomer is
able to release new Airflow versions much faster than anyone else - because
they run a lot of tests in their own infrastructure - and this is actually a
great contribution to the community :). This has a huge mutual benefit for
both the community and Astronomer.

Now - if we are going to do the stress testing before releasing Airflow -
the question is who will be doing that. This is quite an effort (I believe)
and it requires quite an infrastructure. And if we are to donate any kind
of testing harness to the community - it only makes sense if there is a way
the community can use it. For example, there are (from what I hear) many
test scenarios and scripts that Astronomer has and follows, but they are not
donated to the community. That simply makes no sense, because we have no
capacity to run those tests, nor a process for following those test
scenarios.

So now - the question is - who will run such tests and with what
infrastructure?  I am not sure what kind of infrastructure it might
require, but I think the only way to make it part of the community process
is to fully automate it in our CI.

This is for example what happened with the Docker Image and especially with
the ARM version of it. Building and running it requires - generally speaking -
ARM hardware, and it is a heavy cpu-and-network process - so until I
automated it, it was my personal "commitment" that I would build the image
(with the goal that we would be able to fully automate it). Until then,
releasing the images was not a "community" duty, but "Jarek Potiuk"'s duty.
(I worked on automating it from the very beginning, but it took some time and
effort to implement - but we finally got this nice and simple CI workflow -
https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#manually-prepare-production-docker-image
- that the release manager (whoever the release manager is) can trigger to
release the images.)

So I think if we want to add such a "test harness before" release,
answering the questions above is important. If we are to get it in the
community, we need to know what kind of infrastructure it requires, whether
it is fully automatable (eventually) and ready-to-use by whoever can
trigger it before the release. And when it comes to such testing, there is
one more important question - what do we do with the results? Is there a
trigger that should make the release manager say "OK - those results are
bad enough to not release"? Do we know what the trigger is ? Do we know how
to interpret the results? Is it documented and have we run it already on a
few releases to get some baseline?
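
Just to make that last question concrete - a minimal sketch (the file
names, the percentile and the 10% threshold are all made up, this is not a
real implementation) of what such a go/no-go check on scheduling delay
results could look like:

    import json
    import sys


    def p95(delays):
        # 95th percentile of a list of per-task scheduling delays (seconds).
        ordered = sorted(delays)
        return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]


    def main(baseline_path, candidate_path, allowed_regression=0.10):
        # Each file is assumed to be a JSON list of floats: one scheduling
        # delay (in seconds) per task instance observed during a stress run.
        with open(baseline_path) as f:
            baseline = json.load(f)
        with open(candidate_path) as f:
            candidate = json.load(f)

        base_p95, cand_p95 = p95(baseline), p95(candidate)
        print(f"baseline p95={base_p95:.2f}s candidate p95={cand_p95:.2f}s")

        # Fail the check if the candidate regresses beyond the allowed margin.
        if cand_p95 > base_p95 * (1 + allowed_regression):
            print("scheduling delay regressed beyond threshold - do not release")
            return 1
        print("within threshold - OK from this check's point of view")
        return 0


    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))

Where the delay files come from, and what the right percentile and
threshold are, is exactly what would need to be agreed and baselined over a
few releases first.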

I think there are two basic paths we can follow:

1) some stakeholder (ideally the one it came from - AirBnB in this case)
commits to the burden of running it and reporting the results before every
release (similar to Astronomer - if you noticed, in the few days/weeks before
a release there is a flurry of stability issues and fixes coming from
Astronomer, usually as a result of this testing). Similarly, for the
Production Image it could be just "Jarek Potiuk" as a stakeholder, because
that was small enough for me to handle.

2) if such a test harness is to be donated to Airflow, then it must be
preceded by a few releases where point 1) is done by someone who commits
to it and makes sure all the "wrinkles" are removed. The release process
should be smooth and tested. It should not introduce any more friction to
the process or delay it, so running it for a number of releases is a must
(this is what I did for the Production Image first and then for the ARM
Image version). That someone needs to volunteer and commit to it (same as I
did for the Prod Image).

This is how I see it. I think committing the code to a repo is likely only
around 50% of a project in which the harness is already being run and at
least semi-automated for a few releases (if we are going to go route 2). Or
it is not needed at all (it can be kept in AirBnB, for example, or with
whoever wants to commit to doing it) if we are going route 1).

But I am curious whether my understanding of it matches what others understand.

J.


On Tue, Jul 26, 2022 at 1:54 AM Ping Zhang <pi...@umich.edu> wrote:

> Hi Jarek,
>
> Friendly bump this thread. What's your thoughts on having a scheduler perf
> test before each release and incorporating this metric?
>
> Also, is there a devops git repo to put these files/logics?
>
> Thanks,
>
> Ping
>

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

Friendly bump on this thread. What are your thoughts on having a scheduler
perf test before each release and incorporating this metric?

Also, is there a devops git repo to put these files/logic in?

Thanks,

Ping


On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <pi...@umich.edu> wrote:

> Hi Jarek,
>
> Yep, it is more useful in the stress test stage before releasing a new
> version with some extra set up to ensure no scheduler performance
> degradation due to a release. This can also help to find the scaling limit
> of the scheduler with a certain SLA, like upper limit of the number of
> tasks in a dag, total number of dag files in a cluster, concurrent running
> dag runs etc.
>
> Very good point about synthetic dag files in the stress test, our team is
> working on a stress test framework that can directly use all production dag
> files to ensure the stress test has the same set of prod dags, but it will
> skip the task execution. It can also generate different kinds of dag
> (including number of tasks, levels etc).
>
> Monitoring the production issues for particular DAGs, time of the day is a
> different issue. I agree that in prod, we should not let the scheduler
> calculate the `dependency met` time.
>
>
> Thanks,
>
> Ping
>

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

Yep, it is more useful in the stress test stage before releasing a new
version, with some extra setup, to ensure there is no scheduler performance
degradation due to a release. This can also help to find the scaling limit
of the scheduler under a certain SLA, like the upper limit on the number of
tasks in a dag, the total number of dag files in a cluster, the number of
concurrently running dag runs, etc.
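
For illustration only - a toy sketch of checking such an SLA, assuming we
already have, for each task instance, the time its dependencies were met and
the time it actually started (the timestamps and the 30s SLA below are made
up):

    from datetime import datetime, timedelta

    # Made-up observations from a stress run:
    # (dependency met time, actual start time) per task instance.
    observations = [
        (datetime(2022, 7, 13, 9, 0, 0), datetime(2022, 7, 13, 9, 0, 4)),
        (datetime(2022, 7, 13, 9, 0, 0), datetime(2022, 7, 13, 9, 0, 9)),
        (datetime(2022, 7, 13, 9, 1, 0), datetime(2022, 7, 13, 9, 1, 2)),
    ]

    SLA = timedelta(seconds=30)  # made-up SLA for the example

    delays = sorted(start - dep_met for dep_met, start in observations)
    p99 = delays[max(0, int(round(0.99 * len(delays))) - 1)]
    print(f"p99 scheduling delay: {p99}, SLA: {SLA}, within SLA: {p99 <= SLA}")

Rerunning the same check while scaling up the number of dags/tasks is how we
would look for the point where the SLA starts to break.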

Very good point about synthetic dag files in the stress test: our team is
working on a stress test framework that can directly use all production dag
files, to ensure the stress test runs against the same set of prod dags, but
it will skip the task execution. It can also generate different kinds of
dags (varying the number of tasks, levels, etc.).
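
A toy sketch of the DAG-generation part (this is not our internal framework,
just an illustration; it assumes Airflow 2.3+ for EmptyOperator):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator  # DummyOperator on older versions


    def build_synthetic_dag(dag_id, levels=3, tasks_per_level=5):
        # Build `levels` layers of no-op tasks; every task in a layer depends
        # on all tasks in the previous layer, so only the scheduler does real work.
        dag = DAG(dag_id=dag_id, start_date=datetime(2022, 7, 1), schedule_interval=None)
        previous_layer = []
        for level in range(levels):
            layer = [
                EmptyOperator(task_id=f"level_{level}_task_{i}", dag=dag)
                for i in range(tasks_per_level)
            ]
            for upstream in previous_layer:
                upstream >> layer  # fan out to every task in the next layer
            previous_layer = layer
        return dag


    # Register a few different shapes so the scheduler has something to chew on.
    for n, (levels, width) in enumerate([(3, 5), (5, 10), (10, 20)]):
        globals()[f"synthetic_dag_{n}"] = build_synthetic_dag(
            f"synthetic_dag_{n}", levels=levels, tasks_per_level=width
        )

In a real harness the shapes would of course come from configuration rather
than being hard-coded.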

Monitoring production issues for particular DAGs or times of the day is a
different problem. I agree that in prod we should not let the scheduler
calculate the `dependency met` time.
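
A minimal sketch of that offline calculation, assuming we only have raw
start/end timestamps per task instance plus the mostly static upstream lists
(the task names and timestamps below are made up):

    from datetime import datetime

    # Hypothetical raw events emitted per task instance (e.g. via statsd/OTel):
    # task_id -> (start time, end time) for one dag run.
    raw = {
        "extract": (datetime(2022, 7, 13, 9, 0, 0), datetime(2022, 7, 13, 9, 0, 30)),
        "transform": (datetime(2022, 7, 13, 9, 0, 40), datetime(2022, 7, 13, 9, 1, 20)),
        "load": (datetime(2022, 7, 13, 9, 1, 35), datetime(2022, 7, 13, 9, 2, 0)),
    }

    # Dependency rules published separately: task -> upstream tasks.
    upstreams = {"extract": [], "transform": ["extract"], "load": ["transform"]}

    # "Dependency met" time = latest end time among the upstream tasks; the
    # scheduling delay is then start - dependency_met. Root tasks (no
    # upstreams) are skipped here.
    for task, (start, _end) in raw.items():
        if not upstreams[task]:
            continue
        dependency_met = max(raw[u][1] for u in upstreams[task])
        print(f"{task}: scheduling delay = {start - dependency_met}")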


Thanks,

Ping


On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> I think if we limit it to stress tests, this could be an "extra"
> addition - not even necessarily part of Airflow codebase and adding
> triggers with a script, on a single database, some kind of
> test-harness that you always add after you installed airflow in test
> environment - for that I have far less reservations to use triggers.
>
> But if we want to measure the delays in production, that's quite a
> different story (and different purpose):
>
> * The stress tests are synthetic and basically what you will get out
> of it is "are worse/better in this version than in the previous one"?
> "How much", "Which synthetic scenarios are affected most" . Those will
> be done with a few synthetic kinds of traffic/load/shape.
> * The production is different - you really want to see if you have
> some problems with particular DAGs, times of the day, week, load etc
> and you should be able to take some corrective actions ( for example
> increase number of schedulers, or queues, split your dags etc.) - so
> even the "scheduling delay" metrics might sound familiar you might
> want to use completely different dimensions to look at it (how about
> this DAG? this time of day, this group of dags, this type of workloads
> etc).
>
> I think those two might even be separated and calculated differently
> (though having a single approach would be likely better). I am not
> entirely sure but I have a feeling we do not need the scheduler to
> calculate the "dependency met" while scheduling. I think for
> production purposes, it would be much better (less overhead) to simply
> emit "raw" mettrics such as task start/end time of each task plus
> possibly simple publishing of - mostly static - task dependency rules
> - then "dependency met" time can be calculated offline based on joined
> data. That would be roughly equivalent to what you have in the
> trigger, but without the overhead of triggers- simply instead of
> storing the events in metadata db we would emit them (for example
> using otel) and let the external system aggregate them and process it
> offline independently.
>
> The OTEL integration is rather lightweight - most of them use
> in-memory buffers and efficiently push the data (and even can
> implement scalable forwarding of the data and pre-aggregation). The
> nice thing about it is that it can scale much easier. I think that
> (apart of my earlier reservation) database-trigger approach has this
> not-nice property that the less workers and schedulers you have, the
> more "centralized overhead" you have, where the distributed OTEL
> solution scales together with the system adding more or less fixed
> overhead per component (providing that the remote telemetry service is
> also scalable). This makes the trigger approach far less suitable IMHO
> as we are getting dangerously close to Heisen-Monitoring where the
> more we observe the system the more we impact its performance.
>
> J.

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Jarek Potiuk <ja...@potiuk.com>.
Correction: <the more workers and schedulers you have, the more
"centralized overhead" you have>, of course.


On Tue, Jul 12, 2022 at 8:01 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I think if we limit it to stress tests, this could be an "extra"
> addition - not even necessarily part of Airflow codebase and adding
> triggers with a script, on a single database, some kind of
> test-harness that you always add after you installed airflow in test
> environment - for that I have far less reservations to use triggers.
>
> But if we want to measure the delays in production, that's quite a
> different story (and different purpose):
>
> * The stress tests are synthetic and basically what you will get out
> of it is "are worse/better in this version than in the previous one"?
> "How much", "Which synthetic scenarios are affected most" . Those will
> be done with a few synthetic kinds of traffic/load/shape.
> * The production is different - you really want to see if you have
> some problems with particular DAGs, times of the day, week, load etc
> and you should be able to take some corrective actions ( for example
> increase number of schedulers, or queues, split your dags etc.) - so
> even the "scheduling delay" metrics might sound familiar you might
> want to use completely different dimensions to look at it (how about
> this DAG? this time of day, this group of dags, this type of workloads
> etc).
>
> I think those two might even be separated and calculated differently
> (though having a single approach would be likely better). I am not
> entirely sure but I have a feeling we do not need the scheduler to
> calculate the "dependency met" while scheduling. I think for
> production purposes, it would be much better (less overhead) to simply
> emit "raw" mettrics such as task start/end time of each task plus
> possibly simple publishing of - mostly static - task dependency rules
> - then "dependency met" time can be calculated offline based on joined
> data. That would be roughly equivalent to what you have in the
> trigger, but without the overhead of triggers- simply instead of
> storing the events in metadata db we would emit them (for example
> using otel) and let the external system aggregate them and process it
> offline independently.
>
> The OTEL integration is rather lightweight - most of them use
> in-memory buffers and efficiently push the data (and even can
> implement scalable forwarding of the data and pre-aggregation). The
> nice thing about it is that it can scale much easier. I think that
> (apart of my earlier reservation) database-trigger approach has this
> not-nice property that the less workers and schedulers you have, the
> more "centralized overhead" you have, where the distributed OTEL
> solution scales together with the system adding more or less fixed
> overhead per component (providing that the remote telemetry service is
> also scalable). This makes the trigger approach far less suitable IMHO
> as we are getting dangerously close to Heisen-Monitoring where the
> more we observe the system the more we impact its performance.
>
> J.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks
> >> >>> Shubham
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> From: Ping Zhang <pi...@umich.edu>
> >> >>> Reply-To: "dev@airflow.apache.org" <de...@airflow.apache.org>
> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
> >> >>> To: "dev@airflow.apache.org" <de...@airflow.apache.org>, "vikram@astronomer.io" <vi...@astronomer.io>
> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric Definition
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> Hi Vikram,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks for pointing that out, 'task latency',
> >> >>>
> >> >>>
> >> >>>
> >> >>> "we define task latency as the time it takes for a task to begin executing once its dependencies have been met."
> >> >>>
> >> >>>
> >> >>>
> >> >>> It will be great if you can elaborate more about "begin executing" and how you calculate "its dependencies have been met.".
> >> >>>
> >> >>>
> >> >>>
> >> >>> If the 'begin executing' means the state of ti becomes running, then the 'Scheduling Delay' metric focuses on the overhead introduced by the scheduler.
> >> >>>
> >> >>>
> >> >>>
> >> >>> In our prod and stress test, we use the `task_instance_audit` table ( a new row is created whenever there is state change in task_instance table) to compute the time of a ti should be scheduled.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Ping
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka <vi...@astronomer.io.invalid> wrote:
> >> >>>
> >> >>> Ping,
> >> >>>
> >> >>>
> >> >>>
> >> >>> I am quite interested in this topic and trying to understand the difference between the "scheduling delay" metric articulated as compared to the "task latency" aka "task lag" metric which we have been using before.
> >> >>>
> >> >>>
> >> >>>
> >> >>> As you may recall, we have been using two specific metrics to benchmark Scheduler performance, specifically "task latency" and "task throughput" since Airflow 2.0.
> >> >>>
> >> >>> These were described in the 2.0 Scheduler blog post
> >> >>> Specifically, within that we defined task latency as the time it takes for the task to begin executing once its dependencies are all met.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>> Vikram
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pi...@umich.edu> wrote:
> >> >>>
> >> >>> Hi Airflow Community,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Airflow is a scheduling platform for data pipelines, however there is no good metric to measure the scheduling delay in the production and also the stress test environment. This makes it hard to catch regressions in the scheduler during the stress test stage.
> >> >>>
> >> >>>
> >> >>> I would like to propose an airflow scheduling delay metric definition. Here is the detailed design of the metric and its implementation:
> >> >>>
> >> >>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
> >> >>>
> >> >>> Please take a look and any feedback is welcome.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Ping
> >> >>>
> >> >>>

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Jarek Potiuk <ja...@potiuk.com>.
I think if we limit it to stress tests, this could be an "extra"
addition - not even necessarily part of the Airflow codebase: adding
the triggers with a script, on a single database, as a kind of
test harness that you always apply after installing Airflow in the
test environment. For that I have far fewer reservations about using
triggers.
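
Just to make it concrete what I mean by "adding triggers with a script" -
a rough sketch of such a test-harness step (assuming Postgres 11+; the
audit table, its columns and the connection string are made up for
illustration and would need adjusting to the task_instance schema of the
version under test):

    import sqlalchemy as sa

    # Everything below lives in the test harness, not in Airflow itself.
    DDL = """
    CREATE TABLE IF NOT EXISTS task_instance_audit (
        dag_id     TEXT,
        task_id    TEXT,
        run_id     TEXT,
        state      TEXT,
        changed_at TIMESTAMPTZ DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION record_ti_state_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO task_instance_audit (dag_id, task_id, run_id, state)
        VALUES (NEW.dag_id, NEW.task_id, NEW.run_id, NEW.state);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS ti_state_change_audit ON task_instance;
    CREATE TRIGGER ti_state_change_audit
        AFTER INSERT OR UPDATE OF state ON task_instance
        FOR EACH ROW EXECUTE FUNCTION record_ti_state_change();
    """

    # Connection string is a placeholder for the stress-test database.
    engine = sa.create_engine("postgresql://airflow:airflow@localhost/airflow")
    with engine.begin() as conn:
        conn.execute(sa.text(DDL))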

But if we want to measure the delays in production, that's quite a
different story (and different purpose):

* The stress tests are synthetic, and basically what you get out of
them is: "is this version worse/better than the previous one?", "by
how much?", "which synthetic scenarios are affected most?". Those will
be done with a few synthetic kinds of traffic/load/shape.
* Production is different - you really want to see whether you have
problems with particular DAGs, times of the day or week, loads etc.,
and you should be able to take corrective actions (for example
increase the number of schedulers or queues, split your dags etc.) - so
even if the "scheduling delay" metric might sound similar, you might
want to use completely different dimensions to look at it (what about
this DAG? this time of day? this group of dags? this type of workload?
etc.).

I think those two might even be separated and calculated differently
(though having a single approach would likely be better). I am not
entirely sure, but I have a feeling we do not need the scheduler to
calculate the "dependency met" time while scheduling. I think for
production purposes, it would be much better (less overhead) to simply
emit "raw" metrics such as the start/end time of each task, plus
possibly a simple publication of the - mostly static - task dependency
rules - then the "dependency met" time can be calculated offline based
on the joined data (see the sketch below). That would be roughly
equivalent to what you have in the trigger, but without the overhead
of triggers - simply, instead of storing the events in the metadata
db, we would emit them (for example using OTel) and let the external
system aggregate them and process them offline independently.
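
For example, the offline part could be as simple as something like this
(just a sketch - the event records and field names are made up, not an
existing schema):

    from datetime import datetime

    # Hypothetical raw records collected from the emitted start/end events.
    task_events = {
        # (dag_id, task_id, run_id) -> start/end timestamps
        ("dag_a", "extract", "run_1"): {"start": datetime(2022, 7, 1, 0, 0, 5),
                                        "end":   datetime(2022, 7, 1, 0, 0, 20)},
        ("dag_a", "load", "run_1"):    {"start": datetime(2022, 7, 1, 0, 0, 32),
                                        "end":   datetime(2022, 7, 1, 0, 0, 40)},
    }

    # Mostly static dependency rules, published once per DAG version.
    upstreams = {("dag_a", "load"): ["extract"]}

    def scheduling_delay(dag_id, task_id, run_id):
        """Seconds between 'all upstream tasks finished' and 'task started'."""
        ups = upstreams.get((dag_id, task_id), [])
        if not ups:
            return None  # tasks without upstreams are excluded from the metric
        dependency_met = max(task_events[(dag_id, u, run_id)]["end"] for u in ups)
        return (task_events[(dag_id, task_id, run_id)]["start"]
                - dependency_met).total_seconds()

    print(scheduling_delay("dag_a", "load", "run_1"))  # -> 12.0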

The OTEL integration is rather lightweight - most of the SDKs use
in-memory buffers and push the data efficiently (and can even
implement scalable forwarding of the data and pre-aggregation). The
nice thing about it is that it scales much more easily. I think that
(apart from my earlier reservations) the database-trigger approach has
the not-nice property that the more workers and schedulers you have,
the more "centralized overhead" you have, whereas the distributed OTEL
solution scales together with the system, adding a more or less fixed
overhead per component (provided that the remote telemetry service is
also scalable). This makes the trigger approach far less suitable IMHO,
as we are getting dangerously close to Heisen-Monitoring, where the
more we observe the system, the more we impact its performance.
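
To give an idea of how lightweight the emitting side can be - a sketch
using the opentelemetry-sdk Python package (the metric name, attributes
and console exporter are placeholders, not a proposed Airflow interface):

    from opentelemetry import metrics
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # Measurements are buffered in memory and pushed in the background, so
    # the emitting component pays a small, roughly fixed cost; in a real
    # deployment the console exporter would be replaced by an OTLP exporter.
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter(),
                                           export_interval_millis=10_000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    meter = metrics.get_meter("airflow.scheduling")
    delay_histogram = meter.create_histogram(
        "task.scheduling_delay",  # hypothetical metric name
        unit="s",
        description="Time between dependencies met and task start",
    )

    # Wherever the task start is observed:
    delay_histogram.record(12.0, attributes={"dag_id": "dag_a",
                                             "task_id": "load"})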

J.

On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <pi...@umich.edu> wrote:
>
> Hi Jarek,
>
> Thanks for the insights and pointing out the potential issues with triggers in the prod with scheduler HA setup.
>
> The solution that I proposed is mainly for the stress test scheduler before each airflow release. We can make changes in the airflow codebase to emit this metric however:
>
> 1. It will incur additional overhead for the scheduler to compute the metric as scheduler needs to compute the dependency met time of a task.
> 2. It couples with the implementation of the scheduler. For example, from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric is emitted from the scheduler, when making the changes in the scheduler, it also needs to update how the metric is computed and emitted.
>
> Thus, I think having it out of the airflow core makes it easier to compare the scheduling delay across different airflow versions.
>
> Thanks for pointing out the OpenTelemetry, let me check it out.
>
> Thanks,
>
> Ping
>
>
> On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <po...@apache.org> wrote:
>>
>> Sorry for the late reply - Ping.
>>
>> TL;DR; I think the metrics might be useful but I think using triggers
>> is asking for troubles.
>>
>> While using triggers sounds like a common approach in a number of
>> installations, we do not use triggers so far.
>> Using Triggers moves some logic to the database, and in our case we do
>> not have it at all - all logic is in Airflow, and we keep it there,
>> the database for us is merely "state" storage and "locks". Adding
>> database triggers, extends it to also keep some logic there. And
>> adding triggers has some worrying "implicitness" which goes against
>> the "Explicit is better than Implicit" Zen of Python.
>>
>> One thing that makes me think "coldly" about this is that it might
>> have some undesired side effects - such as synchronizing of changes
>> from multiple schedulers on trying to insert such audit entry (you
>> need to create an index lock when you insert rows to a table which has
>> a primary key/unique indexes).
>>
>> And what's even more worrying is that we are using SQLAlchemy and
>> MySQl/MsSQL/Postgres and we should make sure it works the same in all
>> of them. This is troublesome.
>>
>> Even if we could solve and verify all those problems individually the
>> effect is - Once we open the "gate" of triggers, we will get more "ok
>> we have trigger here so let's also use it for that and this" and this
>> will be hard to say "no" if we already have a precedent, and this
>> might lead to more and more logica and features deferred to a database
>> logic (and my past experience is that it leads to more complexity and
>> implicit behaviours that are difficult to reason about).
>>
>> But this is only about the technical details of this, not the metrics
>> itself. I think the metric you proposed is very useful.
>>
>> I think however (correct me if I am wrong) - that we do not need
>> database triggers for any of those. I have a feeling that this
>> proposal is trying to implement the (useful) metrics with very limited
>> modification to the Airflow code, so I can understand that you might
>> think about it this way when you have your own fork - then it makes
>> sense to piggyback on the existing database and use triggers, because
>> you do not want to modify Airflow code.
>>
>> But here - we are in a completely different situation. We CAN modify
>> Airflow code and add missing features and functionality to capture the
>> necessary metric data in the code,  rather than using triggers. We
>> could even define some kind of callbacks for the auditing events that
>> would allow us to gather those metrics in a way that does not even use
>> the database to store the information for the metrics.
>>
>> In fact - this leads me to conclusion that we should implement the
>> metrics you mention as part of our Open-Telemetry effort
>>  https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow.
>> This is precisely what it was prepared for, once we have
>> Open-Telemetry integrated we could add more and more such useful
>> metrics more easily, and that could be way more useful, because
>> instead of running external custom-db-reading process for that, we
>> could not only calculate such metrics using the right metrics tooling
>> (each company could use their preferred open-telemetry compliant
>> tool), but that would open up all the features like alerting,
>> connecting it with traces and other metrics etc. etc.
>>
>> Howard - WDYT?
>>
>> J.
>>
>>
>>
>>
>>
>>
>> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka
>> <vi...@astronomer.io.invalid> wrote:
>> >
>> > HI Ping,
>> >
>> > Apologies for the belated response.
>> >
>> > We have created a set of stress test DAGs where the tasks take almost no time to execute at all, so that the worker task execution time is small, and the stress is on the Scheduler and Executor.
>> >
>> > We then calculate "task latency" aka "task lag" as:
>> >  ti_lag = ti.start_date - max_upstream_ti_end_date
>> > This is effectively the time between "the downstream task starting" and "the last dependent upstream task complete"
>> >
>> > We don't use the tasks that don't have any upstream tasks in this metric for measuring task lag.
>> > And for tasks that have multiple upstream tasks, we use the upstream task for which the end_date took maximum time as the scheduler waits for completion of all parent tasks before scheduling any downstream task.
>> >
>> > Vikram
>> >
>> >
>> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pi...@umich.edu> wrote:
>> >>
>> >> Hi Mehta,
>> >>
>> >> Good point. The primary goal of the metric is for stress testing to catch airflow scheduler performance regression for 1) our internal scheduler improvement work and 2) airflow version upgrade.
>> >>
>> >> One of the key benefits of this metric definition is it is independent from the scheduler implementation and it can be computed/backfilled offline.
>> >>
>> >> Currently, we expose it to the datadog and we (the airflow cluster maintainers) are the main users for it.
>> >>
>> >> Thanks,
>> >>
>> >> Ping
>> >>
>> >>
>> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham <sh...@amazon.com.invalid> wrote:
>> >>>
>> >>> Ping,
>> >>>
>> >>>
>> >>>
>> >>> I’m very interested in this as well. A good metric can help us benchmark and identify potential improvements in the scheduler performance.
>> >>> In order to understand the proposal better, can you please share where and how do you intend to use “Scheduling delay”? Is it meant for benchmarking or stress testing only? Do you plan to expose it to the users in the Airflow UI?
>> >>>
>> >>>
>> >>>
>> >>> Thanks
>> >>> Shubham
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> From: Ping Zhang <pi...@umich.edu>
>> >>> Reply-To: "dev@airflow.apache.org" <de...@airflow.apache.org>
>> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
>> >>> To: "dev@airflow.apache.org" <de...@airflow.apache.org>, "vikram@astronomer.io" <vi...@astronomer.io>
>> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric Definition
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Hi Vikram,
>> >>>
>> >>>
>> >>>
>> >>> Thanks for pointing that out, 'task latency',
>> >>>
>> >>>
>> >>>
>> >>> "we define task latency as the time it takes for a task to begin executing once its dependencies have been met."
>> >>>
>> >>>
>> >>>
>> >>> It will be great if you can elaborate more about "begin executing" and how you calculate "its dependencies have been met.".
>> >>>
>> >>>
>> >>>
>> >>> If the 'begin executing' means the state of ti becomes running, then the 'Scheduling Delay' metric focuses on the overhead introduced by the scheduler.
>> >>>
>> >>>
>> >>>
>> >>> In our prod and stress test, we use the `task_instance_audit` table ( a new row is created whenever there is state change in task_instance table) to compute the time of a ti should be scheduled.
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>>
>> >>>
>> >>> Ping
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka <vi...@astronomer.io.invalid> wrote:
>> >>>
>> >>> Ping,
>> >>>
>> >>>
>> >>>
>> >>> I am quite interested in this topic and trying to understand the difference between the "scheduling delay" metric articulated as compared to the "task latency" aka "task lag" metric which we have been using before.
>> >>>
>> >>>
>> >>>
>> >>> As you may recall, we have been using two specific metrics to benchmark Scheduler performance, specifically "task latency" and "task throughput" since Airflow 2.0.
>> >>>
>> >>> These were described in the 2.0 Scheduler blog post
>> >>> Specifically, within that we defined task latency as the time it takes for the task to begin executing once its dependencies are all met.
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Vikram
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pi...@umich.edu> wrote:
>> >>>
>> >>> Hi Airflow Community,
>> >>>
>> >>>
>> >>>
>> >>> Airflow is a scheduling platform for data pipelines, however there is no good metric to measure the scheduling delay in the production and also the stress test environment. This makes it hard to catch regressions in the scheduler during the stress test stage.
>> >>>
>> >>>
>> >>> I would like to propose an airflow scheduling delay metric definition. Here is the detailed design of the metric and its implementation:
>> >>>
>> >>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
>> >>>
>> >>> Please take a look and any feedback is welcome.
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>>
>> >>>
>> >>> Ping
>> >>>
>> >>>

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Posted by Ping Zhang <pi...@umich.edu>.
Hi Jarek,

Thanks for the insights and for pointing out the potential issues with
triggers in prod with a scheduler HA setup.

The solution that I proposed is mainly for stress testing the scheduler
before each Airflow release. We could make changes in the Airflow codebase
to emit this metric, however:

1. It will incur additional overhead for the scheduler to compute the
metric, as the scheduler needs to compute the dependency-met time of a
task.
2. It couples the metric with the implementation of the scheduler. For
example, from 1.10.4 to Airflow 2, the scheduler has changed a lot. If the
metric is emitted from the scheduler, then any change to the scheduler also
requires updating how the metric is computed and emitted.

Thus, I think keeping it outside the Airflow core makes it easier to
compare the scheduling delay across different Airflow versions.

Thanks for pointing out OpenTelemetry, let me check it out.

Thanks,

Ping


On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <po...@apache.org> wrote:

> Sorry for the late reply - Ping.
>
> TL;DR; I think the metrics might be useful but I think using triggers
> is asking for troubles.
>
> While using triggers sounds like a common approach in a number of
> installations, we do not use triggers so far.
> Using Triggers moves some logic to the database, and in our case we do
> not have it at all - all logic is in Airflow, and we keep it there,
> the database for us is merely "state" storage and "locks". Adding
> database triggers, extends it to also keep some logic there. And
> adding triggers has some worrying "implicitness" which goes against
> the "Explicit is better than Implicit" Zen of Python.
>
> One thing that makes me think "coldly" about this is that it might
> have some undesired side effects - such as synchronizing of changes
> from multiple schedulers on trying to insert such audit entry (you
> need to create an index lock when you insert rows to a table which has
> a primary key/unique indexes).
>
> And what's even more worrying is that we are using SQLAlchemy and
> MySQl/MsSQL/Postgres and we should make sure it works the same in all
> of them. This is troublesome.
>
> Even if we could solve and verify all those problems individually the
> effect is - Once we open the "gate" of triggers, we will get more "ok
> we have trigger here so let's also use it for that and this" and this
> will be hard to say "no" if we already have a precedent, and this
> might lead to more and more logica and features deferred to a database
> logic (and my past experience is that it leads to more complexity and
> implicit behaviours that are difficult to reason about).
>
> But this is only about the technical details of this, not the metrics
> itself. I think the metric you proposed is very useful.
>
> I think however (correct me if I am wrong) - that we do not need
> database triggers for any of those. I have a feeling that this
> proposal is trying to implement the (useful) metrics with very limited
> modification to the Airflow code, so I can understand that you might
> think about it this way when you have your own fork - then it makes
> sense to piggyback on the existing database and use triggers, because
> you do not want to modify Airflow code.
>
> But here - we are in a completely different situation. We CAN modify
> Airflow code and add missing features and functionality to capture the
> necessary metric data in the code,  rather than using triggers. We
> could even define some kind of callbacks for the auditing events that
> would allow us to gather those metrics in a way that does not even use
> the database to store the information for the metrics.
>
> In fact - this leads me to conclusion that we should implement the
> metrics you mention as part of our Open-Telemetry effort
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow
> .
> This is precisely what it was prepared for, once we have
> Open-Telemetry integrated we could add more and more such useful
> metrics more easily, and that could be way more useful, because
> instead of running external custom-db-reading process for that, we
> could not only calculate such metrics using the right metrics tooling
> (each company could use their preferred open-telemetry compliant
> tool), but that would open up all the features like alerting,
> connecting it with traces and other metrics etc. etc.
>
> Howard - WDYT?
>
> J.
>
>
>
>
>
>
> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka
> <vi...@astronomer.io.invalid> wrote:
> >
> > HI Ping,
> >
> > Apologies for the belated response.
> >
> > We have created a set of stress test DAGs where the tasks take almost no
> time to execute at all, so that the worker task execution time is small,
> and the stress is on the Scheduler and Executor.
> >
> > We then calculate "task latency" aka "task lag" as:
> >  ti_lag = ti.start_date - max_upstream_ti_end_date
> > This is effectively the time between "the downstream task starting" and
> "the last dependent upstream task complete"
> >
> > We don't use the tasks that don't have any upstream tasks in this metric
> for measuring task lag.
> > And for tasks that have multiple upstream tasks, we use the upstream
> task for which the end_date took maximum time as the scheduler waits for
> completion of all parent tasks before scheduling any downstream task.
> >
> > Vikram
> >
> >
> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pi...@umich.edu> wrote:
> >>
> >> Hi Mehta,
> >>
> >> Good point. The primary goal of the metric is for stress testing to
> catch airflow scheduler performance regression for 1) our internal
> scheduler improvement work and 2) airflow version upgrade.
> >>
> >> One of the key benefits of this metric definition is it is independent
> from the scheduler implementation and it can be computed/backfilled offline.
> >>
> >> Currently, we expose it to the datadog and we (the airflow cluster
> maintainers) are the main users for it.
> >>
> >> Thanks,
> >>
> >> Ping
> >>
> >>
> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham <sh...@amazon.com.invalid>
> wrote:
> >>>
> >>> Ping,
> >>>
> >>>
> >>>
> >>> I’m very interested in this as well. A good metric can help us
> benchmark and identify potential improvements in the scheduler performance.
> >>> In order to understand the proposal better, can you please share where
> and how do you intend to use “Scheduling delay”? Is it meant for
> benchmarking or stress testing only? Do you plan to expose it to the users
> in the Airflow UI?
> >>>
> >>>
> >>>
> >>> Thanks
> >>> Shubham
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> From: Ping Zhang <pi...@umich.edu>
> >>> Reply-To: "dev@airflow.apache.org" <de...@airflow.apache.org>
> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
> >>> To: "dev@airflow.apache.org" <de...@airflow.apache.org>, "
> vikram@astronomer.io" <vi...@astronomer.io>
> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric
> Definition
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Hi Vikram,
> >>>
> >>>
> >>>
> >>> Thanks for pointing that out, 'task latency',
> >>>
> >>>
> >>>
> >>> "we define task latency as the time it takes for a task to begin
> executing once its dependencies have been met."
> >>>
> >>>
> >>>
> >>> It will be great if you can elaborate more about "begin executing" and
> how you calculate "its dependencies have been met.".
> >>>
> >>>
> >>>
> >>> If the 'begin executing' means the state of ti becomes running, then
> the 'Scheduling Delay' metric focuses on the overhead introduced by the
> scheduler.
> >>>
> >>>
> >>>
> >>> In our prod and stress test, we use the `task_instance_audit` table (
> a new row is created whenever there is state change in task_instance table)
> to compute the time of a ti should be scheduled.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>>
> >>> Ping
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka
> <vi...@astronomer.io.invalid> wrote:
> >>>
> >>> Ping,
> >>>
> >>>
> >>>
> >>> I am quite interested in this topic and trying to understand the
> difference between the "scheduling delay" metric articulated as compared to
> the "task latency" aka "task lag" metric which we have been using before.
> >>>
> >>>
> >>>
> >>> As you may recall, we have been using two specific metrics to
> benchmark Scheduler performance, specifically "task latency" and "task
> throughput" since Airflow 2.0.
> >>>
> >>> These were described in the 2.0 Scheduler blog post
> >>> Specifically, within that we defined task latency as the time it takes
> for the task to begin executing once its dependencies are all met.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Vikram
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pi...@umich.edu> wrote:
> >>>
> >>> Hi Airflow Community,
> >>>
> >>>
> >>>
> >>> Airflow is a scheduling platform for data pipelines, however there is
> no good metric to measure the scheduling delay in the production and also
> the stress test environment. This makes it hard to catch regressions in the
> scheduler during the stress test stage.
> >>>
> >>>
> >>> I would like to propose an airflow scheduling delay metric definition.
> Here is the detailed design of the metric and its implementation:
> >>>
> >>>
> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
> >>>
> >>> Please take a look and any feedback is welcome.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>>
> >>> Ping
> >>>
> >>>
>