Posted to dev@airflow.apache.org by Rafal Biegacz <ra...@google.com.INVALID> on 2022/05/19 17:31:01 UTC

question about bulk queries against Airflow DB...

Hi Airflow Community,

Some heavy users of Airflow rely on the possibility to run bulk sql queries
against Airflow DB.

My gut feeling is that one of the reasons they do that is to implement
additional visualizations and monitoring for Airflow tasks and DAGs, and to
sense the state of task scheduling.

For example, I see that users do the following:

   - retrieving all tasks that are in the queued state
   - deleting all the tasks that are in the queued state for too long
   - query DB to retrieve info about all DAGs that failed in the
   specified period of time.
   - etc.
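For illustration, queries of this kind might look like the sketch below. It
runs against a simplified, hypothetical stand-in for Airflow's metadata
tables (real column names and types differ across Airflow versions), using
SQLite only to keep the example self-contained:

```python
import sqlite3
from datetime import datetime, timedelta

# Simplified stand-in for Airflow's task_instance table; the real schema
# differs across Airflow versions.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE task_instance "
    "(task_id TEXT, dag_id TEXT, state TEXT, queued_dttm TEXT)")

now = datetime(2022, 5, 19, 12, 0, 0)
con.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("t1", "etl", "queued", (now - timedelta(hours=3)).isoformat()),
        ("t2", "etl", "queued", (now - timedelta(minutes=5)).isoformat()),
        ("t3", "etl", "running", None),
    ],
)

# Retrieve all tasks that are in the queued state.
queued = [r[0] for r in con.execute(
    "SELECT task_id FROM task_instance WHERE state = 'queued'")]

# Delete all tasks that have been queued for too long (here: over an hour).
cutoff = (now - timedelta(hours=1)).isoformat()
con.execute(
    "DELETE FROM task_instance WHERE state = 'queued' AND queued_dttm < ?",
    (cutoff,))
```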

Just wondering if we shouldn't implement support for some such queries in
the Airflow API to eliminate the need to query the Airflow database directly.
This request might be even more important in the context of AIP-44 Airflow
Internal API
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
which tries to limit/control direct access to the Airflow Database.
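Some of these lookups are already possible with the stable REST API in 2.x;
for example, failed DAG runs in a period can be listed via the dagRuns
endpoint with the `state` and `execution_date_gte` filters. The base URL
below is a placeholder:

```python
from urllib.parse import urlencode

# Placeholder webserver URL; /dags/~/dagRuns (where "~" means all DAGs)
# with the `state` and `execution_date_gte` filters is part of the
# Airflow 2.x stable REST API.
base = "http://localhost:8080/api/v1"
params = urlencode({
    "state": "failed",
    "execution_date_gte": "2022-05-01T00:00:00Z",
    "limit": 100,
})
url = f"{base}/dags/~/dagRuns?{params}"

# An actual call could then be made with e.g. urllib.request plus an
# auth header, reading the "dag_runs" key of the JSON response.
print(url)
```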

It would be great if you could share more info about such "bulk" queries
that you run or have seen in practice, and to hear what you think about
implementing support for some of them in the Airflow API (stable or
internal).

Regards, Rafal.

Re: question about bulk queries against Airflow DB...

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yeah. This is a very good and important point, and one that I have been
thinking about a lot recently as we are gearing up to finalize the AIP-44
POC work (still on me, and I dropped the ball a bit on it due to the Summit
and 'around 2.3.0' activities, but I am going to resume it very soon).


I agree with Ping that 'system admins' should have direct access to the db.
While this is an internal detail of Airflow, the db is not a black box: we
already expose a number of 'properties' to people who really 'know' what
they are doing (migrations, including automated generation of those), and
for bigger installations, especially when you need to optimize for your
specific case, improve an index or simply groom your data, there will
always be cases for that. But this should be done on the 'side', and it
should have no more support from the Airflow software and community than
documenting the schema and keeping track of the changes (and making those
admins aware that backwards compatibility is not really guaranteed and they
have to be prepared for changes).

On the other hand, for users and 'tenant' admins (when we speak about
future multi-tenancy), we should eventually not expose the db directly and
should provide a stable and reliable bulk API. But this is a process, and
I'd say a marathon rather than a sprint.


I think we definitely want to extend the API with bulk operations where we
miss them. As part of AIP-44 we might want to think about some mechanism
that will make gradual migration possible for people who use such db
queries but would like to start using db isolation mode.


This is, I think, not something that we will solve upfront, as there might
be many ways users would like to access the DB. I think of it more as a
process, i.e. finding a way where db isolation is on and the 'isolation'
benefit is achieved, but on the other hand we do not require our users to
take a 'big bang' approach where they have to migrate all of their workload
in one go.


I thought that we need some mechanism that will allow for exceptions once
db isolation mode is enabled, maybe even an allowlist of queries that you
can enable, and some way for the admins to see which queries were blocked
and an easy way to enable them, maybe even in the Airflow UI. I think in a
number of cases no one in the whole organisation running Airflow will
actually know what queries different stakeholders and users in their
organisation use. So a mechanism that will allow them to detect what
queries are run, build an inventory of those, and make exceptions for them
would be helpful in managing the transition. I thought about a mechanism
similar in a way to some local firewall approaches, where you get notified
that someone runs an unknown query and can add an exception for it (there
could likely be a gradation of 'allow but log', 'log and disable with
exceptions', and eventually 'lock and disable all, no exceptions'). This
will be easy for any code using Airflow's SQLAlchemy classes, and I think
that might actually be good enough (direct db queries would fall into
'super admins' mode).
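A minimal sketch of that 'firewall' idea, with hypothetical names and
modes. In Airflow this could plausibly hook into SQLAlchemy's
before_cursor_execute event; here it is a plain function to keep the sketch
self-contained:

```python
import re

# Hypothetical allowlist of known-good query shapes; admins would grow this
# from the logged inventory of blocked queries.
ALLOWLIST = [re.compile(r"^SELECT .* FROM task_instance\b", re.IGNORECASE)]
blocked_log = []  # inventory of unknown queries, for admins to review

def check_query(sql, mode="log_and_block"):
    """Return True if the query may run under the given enforcement mode.

    The modes mirror the gradation above: 'allow_and_log' only observes,
    'log_and_block' enforces the allowlist, 'block_all' allows nothing.
    """
    allowed = any(p.match(sql) for p in ALLOWLIST)
    if mode == "block_all":
        allowed = False
    if allowed:
        return True
    blocked_log.append(sql)  # record it so an exception can be added later
    return mode == "allow_and_log"
```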

Then we could find a way to gather (possibly even semi-automatically)
information from our users on what kind of Bulk API they need, and either
let them contribute it or have the team add it.

I think this should be part of designing the 'transition' process, i.e. we
have to think about the 'transition scenarios' we give to users of Airflow
who would like to eventually switch on DB isolation. Part of it (and this
is the difficult one I have no clear idea about yet) is how to make such
exceptions annoying enough, and basically engineer the behaviour of the
users who choose the transition path, so that they are incentivised to
report such missing APIs and switch to them when ready. It could be some
performance penalty, or maybe annoying notifications in the UI, or a
planned deprecation period, or something similar, but we need to have such
a mechanism, otherwise most users will be stuck in the 'exceptions' stage
forever.

I will share some thoughts on this in our multi-tenancy talk with Mateusz
at the Airflow Summit (next week, Wednesday, the last talk in the Bangalore
session). Everyone is welcome to attend and ask questions; I am also happy
to talk about it at the Warsaw and London events in person.

J.

On Fri, May 20, 2022 at 05:55 Ping Zhang <pi...@umich.edu> wrote:

> Hi Rafal,
>
> In our production, we have a few bulk queries:
>
> 1. We have a cron job that queries a replica of the Airflow db for the
> number of tis in different states, the number of tasks that are queued or
> running, and the number of tasks whose state changed to 'scheduled',
> 'running' or 'queued' in the last one-minute window (we have a
> task_instance trigger table to track all state changes of tis). It then
> emits metrics that we have alerts on.
>
> 2. We have another job that periodically trims historical data in the
> task_instance, log, dag_run, etc. tables, if their state is
> failed/success. This prevents those tables from growing so big that they
> have a performance impact.
>
> (when we do deletion, we carefully break down the query into smaller
> chunks to avoid locking the table for too long, which might cause issues)
>
> 3. We have a script to easily reset failed tis within a given period of
> time, so that when there is an infra failure we can re-run the tasks that
> failed because of it.
>
> I hope this helps. As for the Airflow API part, I think it depends on the
> use case. If the use cases are for infra teams who manage the Airflow
> clusters, I am leaning towards not having an API. But if it is for other
> users, it is better to have an API around it, as we don't want them to
> have access to the prod db.
>
> Thanks,
>
> Ping
>
>
> On Thu, May 19, 2022 at 10:31 AM Rafal Biegacz
> <ra...@google.com.invalid> wrote:
>
>> Hi Airflow Community,
>>
>> Some heavy users of Airflow rely on the possibility to run bulk sql
>> queries against Airflow DB.
>>
>> My gut feeling is that one of the reasons they do that is to implement
>> additional visualizations and monitoring for Airflow tasks and DAGs, and
>> to sense the state of task scheduling.
>>
>> For example, I see that users do the following:
>>
>>    - retrieving all tasks that are in the queued state
>>    - deleting all the tasks that are in the queued state for too long
>>    - query DB to retrieve info about all DAGs that failed in the
>>    specified period of time.
>>    - etc.
>>
>> Just wondering if we shouldn't implement support for some such queries
>> in the Airflow API to eliminate the need to query the Airflow database directly.
>> This request might be even more important in the context of AIP-44
>> Airflow Internal API
>> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API> which
>> tries to limit/control direct access to the Airflow Database.
>>
>> It would be great if you could share more info about such "bulk" queries
>> that you run or have seen in practice, and to hear what you think about
>> implementing support for some of them in the Airflow API (stable or
>> internal).
>>
>> Regards, Rafal.
>>
>

Re: question about bulk queries against Airflow DB...

Posted by Ping Zhang <pi...@umich.edu>.
Hi Rafal,

In our production, we have a few bulk queries:

1. We have a cron job that queries a replica of the Airflow db for the
number of tis in different states, the number of tasks that are queued or
running, and the number of tasks whose state changed to 'scheduled',
'running' or 'queued' in the last one-minute window (we have a
task_instance trigger table to track all state changes of tis). It then
emits metrics that we have alerts on.
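A sketch of that monitoring query, using SQLite and illustrative table and
column names (the trigger table in particular is site-specific, so its
shape here is an assumption):

```python
import sqlite3
from datetime import datetime, timedelta

# Stand-in tables: a trimmed task_instance, and a hypothetical trigger
# table recording every state change with a timestamp.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT)")
con.execute(
    "CREATE TABLE ti_state_change "
    "(task_id TEXT, new_state TEXT, changed_at TEXT)")

now = datetime(2022, 5, 20, 12, 0, 0)
con.executemany("INSERT INTO task_instance VALUES (?, ?)",
                [("a", "queued"), ("b", "queued"), ("c", "running")])
con.executemany("INSERT INTO ti_state_change VALUES (?, ?, ?)",
                [("a", "queued", (now - timedelta(seconds=30)).isoformat()),
                 ("c", "running", (now - timedelta(minutes=5)).isoformat())])

# Number of tis per state, e.g. to emit as gauges.
counts = dict(con.execute(
    "SELECT state, COUNT(*) FROM task_instance GROUP BY state"))

# State transitions within the last one-minute window.
window_start = (now - timedelta(minutes=1)).isoformat()
recent = dict(con.execute(
    "SELECT new_state, COUNT(*) FROM ti_state_change "
    "WHERE changed_at >= ? GROUP BY new_state", (window_start,)))
```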

2. We have another job that periodically trims historical data in the
task_instance, log, dag_run, etc. tables, if their state is failed/success.
This prevents those tables from growing so big that they have a performance
impact.

(when we do deletion, we carefully break down the query into smaller chunks
to avoid locking the table for too long, which might cause issues)
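The chunked deletion in the parenthesis above can be sketched roughly like
this (SQLite stand-in; the batch size and table are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
con.executemany("INSERT INTO task_instance (state) VALUES (?)",
                [("success",)] * 7 + [("running",)] * 2)

BATCH = 3  # small batches keep each transaction, and its locks, short
deleted = 0
while True:
    cur = con.execute(
        "DELETE FROM task_instance WHERE id IN ("
        " SELECT id FROM task_instance"
        " WHERE state IN ('success', 'failed') LIMIT ?)",
        (BATCH,))
    con.commit()  # commit per batch so locks are released between chunks
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount
```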

3. We have a script to easily reset failed tis within a given period of
time, so that when there is an infra failure we can re-run the tasks that
failed because of it.
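The reset can be sketched as below against a stand-in table; note that
`airflow tasks clear --only-failed` with `--start-date`/`--end-date` covers
a similar case via the CLI without touching the db directly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE task_instance (task_id TEXT, state TEXT, end_date TEXT)")
con.executemany("INSERT INTO task_instance VALUES (?, ?, ?)", [
    ("a", "failed", "2022-05-19T10:00:00"),   # inside the window
    ("b", "failed", "2022-05-01T10:00:00"),   # outside the window
    ("c", "success", "2022-05-19T11:00:00"),
])

# Setting state back to NULL lets the scheduler pick the ti up again.
cur = con.execute(
    "UPDATE task_instance SET state = NULL "
    "WHERE state = 'failed' AND end_date BETWEEN ? AND ?",
    ("2022-05-18T00:00:00", "2022-05-20T00:00:00"))
```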

I hope this helps. As for the Airflow API part, I think it depends on the
use case. If the use cases are for infra teams who manage the Airflow
clusters, I am leaning towards not having an API. But if it is for other
users, it is better to have an API around it, as we don't want them to have
access to the prod db.

Thanks,

Ping


On Thu, May 19, 2022 at 10:31 AM Rafal Biegacz
<ra...@google.com.invalid> wrote:

> Hi Airflow Community,
>
> Some heavy users of Airflow rely on the possibility to run bulk sql
> queries against Airflow DB.
>
> My gut feeling is that one of the reasons they do that is to implement
> additional visualizations and monitoring for Airflow tasks and DAGs, and
> to sense the state of task scheduling.
>
> For example, I see that users do the following:
>
>    - retrieving all tasks that are in the queued state
>    - deleting all the tasks that are in the queued state for too long
>    - query DB to retrieve info about all DAGs that failed in the
>    specified period of time.
>    - etc.
>
> Just wondering if we shouldn't implement support for some such queries in
> the Airflow API to eliminate the need to query the Airflow database directly.
> This request might be even more important in the context of AIP-44
> Airflow Internal API
> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API> which
> tries to limit/control direct access to the Airflow Database.
>
> It would be great if you could share more info about such "bulk" queries
> that you run or have seen in practice, and to hear what you think about
> implementing support for some of them in the Airflow API (stable or
> internal).
>
> Regards, Rafal.
>