Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/16 09:51:08 UTC

[GitHub] [airflow] dolfinus opened a new issue, #28402: Support of reading dags from database by task_runners

dolfinus opened a new issue, #28402:
URL: https://github.com/apache/airflow/issues/28402

   ### Description
   
   Currently I'm working on a managed Kubernetes cluster with the vSphere CSI driver, which supports PVCs with accessMode=ReadWriteOnce only. So I cannot mount the same volume with dags into all the pods (webserver, scheduler, etc.).
   
   I also cannot use gitSync, because there is no access from the k8s cluster to the git server (for security reasons). But there is access git -> CI runner -> k8s cluster.
   
   I wish I had some way to push dags from git to Airflow without implementing an overcomplicated dag deployment pipeline. For example: push them with my CI runner to the volume mounted into the dagProcessor, let the dagProcessor parse all the dags and save them into the database, and then have Airflow execute them from there.
   
   All the components (standalone dagProcessor, saving dags to the database, reading dags from the database) are already there; after the dags are parsed I can see their source code in the webserver. But when I try to run such a dag, I get an exception:
   ```
   
   [2022-12-16, 08:51:21 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: tutorial.print_date manual__2022-12-16T08:51:20.145317+00:00 [queued]>
   [2022-12-16, 08:51:21 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: tutorial.print_date manual__2022-12-16T08:51:20.145317+00:00 [queued]>
   [2022-12-16, 08:51:21 UTC] {taskinstance.py:1362} INFO - 
   --------------------------------------------------------------------------------
   [2022-12-16, 08:51:21 UTC] {taskinstance.py:1363} INFO - Starting attempt 1 of 2
   [2022-12-16, 08:51:21 UTC] {taskinstance.py:1364} INFO - 
   --------------------------------------------------------------------------------
   [2022-12-16, 08:51:21 UTC] {taskinstance.py:1383} INFO - Executing <Task(BashOperator): print_date> on 2022-12-16 08:51:20.145317+00:00
   [2022-12-16, 08:51:21 UTC] {standard_task_runner.py:54} INFO - Started process 11404 to run task
   [2022-12-16, 08:51:21 UTC] {standard_task_runner.py:82} INFO - Running: ['airflow', 'tasks', 'run', 'tutorial', 'print_date', 'manual__2022-12-16T08:51:20.145317+00:00', '--job-id', '1015', '--raw', '--subdir', 'DAGS_FOLDER/tutorial.py', '--cfg-path', '/tmp/tmp40xmsmk9']
   [2022-12-16, 08:51:21 UTC] {standard_task_runner.py:83} INFO - Job 1015: Subtask print_date
   [2022-12-16, 08:51:21 UTC] {dagbag.py:525} INFO - Filling up the DagBag from /opt/airflow/dags/tutorial.py
   [2022-12-16, 08:51:21 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 1015 for task print_date (Dag 'tutorial' could not be found; either it does not exist or it failed to parse.; 11404)
   [2022-12-16, 08:51:21 UTC] {local_task_job.py:164} INFO - Task exited with return code 1
   [2022-12-16, 08:51:21 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   This is caused by `airflow.cli.commands.task_command.task_run` loading the dag only from the file system of the worker/scheduler:
   https://github.com/apache/airflow/blob/3bee4818e5d8f3ad8c1792453efb7d0c93a0236f/airflow/cli/commands/task_command.py#L378
   https://github.com/apache/airflow/blob/3bee4818e5d8f3ad8c1792453efb7d0c93a0236f/airflow/utils/cli.py#L225-L226
   https://github.com/apache/airflow/blob/3bee4818e5d8f3ad8c1792453efb7d0c93a0236f/airflow/models/dagbag.py#L98
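   
   In essence, those linked lines always fill the DagBag from the local file system. A condensed sketch (not the upstream code verbatim) of that path:
   
   ```
   from airflow.exceptions import AirflowException
   from airflow.models.dagbag import DagBag
   
   def get_dag_from_files(subdir: str, dag_id: str):
       """Roughly what airflow.utils.cli.get_dag does today."""
       # DagBag parses every .py file under the given folder on local disk
       dagbag = DagBag(dag_folder=subdir)
       if dag_id not in dagbag.dags:
           raise AirflowException(
               f"Dag {dag_id!r} could not be found; either it does not exist "
               "or it failed to parse."
           )
       return dagbag.dags[dag_id]
   ```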
   
   My proposal: add an argument `--read-dags-from-db` to the `airflow tasks` cli commands (at least for the ones that require only read access to dags), plus a config option in `[scheduler]` to pass this argument to the task runner.
   This would allow fetching dags from the database instead of the file system, so the only pod that needs access to the dags PVC would be the dagProcessor.
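   
   A hypothetical sketch of that change (the flag plumbing is an assumption of this proposal, but `read_dags_from_db` is an existing `DagBag` parameter that loads serialized dags from the metadata DB):
   
   ```
   from airflow.models.dagbag import DagBag
   
   def get_dag(subdir, dag_id, read_dags_from_db=False):
       if read_dags_from_db:
           # existing parameter: return serialized dags from the metadata DB
           # instead of parsing files under the dags folder
           dagbag = DagBag(read_dags_from_db=True)
       else:
           dagbag = DagBag(dag_folder=subdir)
       dag = dagbag.get_dag(dag_id)
       if dag is None:
           raise ValueError(f"Dag {dag_id!r} not found")
       return dag
   ```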
   
   This could also eliminate the need to add a gitSync sidecar to every pod running an Airflow component. But not in all cases: for example, if someone places a python module in the dags folder and imports it in a dag, this will not work, because the module content is not saved in the database, so the dag import will fail if the module is not present in the worker file system.
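   
   For illustration (the file names are hypothetical), a dag like this parses fine on the dagProcessor but cannot be re-imported on a worker without the dags folder, because only the dag file's own source is stored in the database, not `common/helpers.py`:
   
   ```
   # dags/tutorial.py -- assumes a sibling module dags/common/helpers.py
   # that defines build_command() returning a shell command string
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.bash import BashOperator
   from common.helpers import build_command  # resolved against the dags folder on disk
   
   with DAG(dag_id="tutorial", start_date=datetime(2022, 1, 1), schedule=None) as dag:
       BashOperator(task_id="print_date", bash_command=build_command())
   ```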
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28402: Support of reading dags from database by task_runners

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28402:
URL: https://github.com/apache/airflow/issues/28402#issuecomment-1370542263

   > where there is no need to re-run the scripts to load these dags before running each task, we can directly use the serialized dags stored in the DB
   
   A serialised DAG does not contain the actual operator logic, only the DAG's _shape_. So it's not possible to run tasks against it.
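   
   A minimal sketch (assuming a local Airflow installation) for anyone who wants to verify what the serialized payload contains:
   
   ```
   import json
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   from airflow.serialization.serialized_objects import SerializedDAG
   
   with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
       PythonOperator(task_id="hello", python_callable=lambda: print("hello"))
   
   # The dict records task ids, operator class names, dependencies, retries,
   # etc., but not the lambda above. Callables are not part of the payload,
   # so the task body cannot be reconstructed from the DB row alone.
   print(json.dumps(SerializedDAG.to_dict(dag), indent=2, default=str))
   ```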




[GitHub] [airflow] boring-cyborg[bot] commented on issue #28402: Support of reading dags from database by task_runners

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28402:
URL: https://github.com/apache/airflow/issues/28402#issuecomment-1354480318

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] dolfinus commented on issue #28402: Support of reading dags from database by task_runners

Posted by GitBox <gi...@apache.org>.
dolfinus commented on issue #28402:
URL: https://github.com/apache/airflow/issues/28402#issuecomment-1370796670

   I thought a serialized dag is just the python source code, isn't it?




[GitHub] [airflow] potiuk closed issue #28402: Support of reading dags from database by task_runners

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #28402: Support of reading dags from database by task_runners
URL: https://github.com/apache/airflow/issues/28402




[GitHub] [airflow] potiuk commented on issue #28402: Support of reading dags from database by task_runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28402:
URL: https://github.com/apache/airflow/issues/28402#issuecomment-1370815981

   > I thought a serialized dag is just the python source code, isn't it?
   
   No, it's not. It's just the JSON-serialized DAG structure and metadata.
   
   What you see in the UI is **just** the source code of the DAG file in question, but you cannot see any of the code it imports there. And it is there just for "inspection": it's not possible to run that code, because the code it depends on is missing.
   
   In the current state, we cannot (and should not) serialize Python code into the database, simply because a python DAG can import an arbitrary number of libraries, common code, other dags etc. And if you have dynamic or local imports in the DAGs, it is extremely difficult (or actually impossible) to determine which files would have to be put in such a database. Effectively, what you are asking for is to store the whole DAG folder as a record in the database for every single DAG run.
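   
   For example, a dynamic import like this (a hypothetical illustration; the env var name is made up) cannot be resolved by any static scan of the DAG file:
   
   ```
   import importlib
   import os
   
   # which module gets imported is only known when the file is executed, so
   # there is no reliable way to enumerate the files a DAG depends on upfront
   helpers = importlib.import_module(os.environ.get("HELPERS_MODULE", "json"))
   ```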
   
   With the way Airflow currently works and how "flexible" Python is, that makes no sense - any kind of file sharing does the job much better than reading the whole DAG folder and converting it into a blob of all the DAG files stored in a relational database, and from a performance point of view it is a non-starter.
   
   Changing this would be quite a fundamental change to how Airflow works, so it definitely does not pass the bar of a "Feature" - it goes firmly into the "Airflow Improvement Proposal" camp (so no, @hussein-awala, I don't think we are going to assign it to anyone, as this is definitely not something that would ever get accepted before we have a proper proposal and discussion about it).
   
   There are open, never-completed Airflow Improvement Proposals (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher, https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-20+DAG+manifest) that aimed at solving this problem.
   
   If you would like to change the behaviour, the right approach is to pick one of them, complete it (they are in Draft status), be able to explain and defend all the different cases, and start a discussion on the Airflow dev list (see https://airflow.apache.org/community/ for details on how to join it). You will need to specify the proposal at a level of detail that allows assessing all the cases: small and big deployments, performance considerations, and the different usage scenarios.
   
   Converting it into discussion.




[GitHub] [airflow] hussein-awala commented on issue #28402: Support of reading dags from database by task_runners

Posted by GitBox <gi...@apache.org>.
hussein-awala commented on issue #28402:
URL: https://github.com/apache/airflow/issues/28402#issuecomment-1370366470

   I think this is a good feature, especially for static dag files, where there is no need to re-run the scripts to load these dags before running each task; we can directly use the serialized dags stored in the DB.
   
   To make it clean, I propose adding a DAG conf `load_from_db_to_run_tasks` to tell Airflow that this DAG doesn't need a fresh file-processing run to execute its tasks, along with a scheduler conf `default_load_from_db_to_run_tasks`, which is false by default.
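   
   Hypothetically, the scheduler-level default could look like this in `airflow.cfg` (neither option exists in Airflow today; both names come only from this proposal):
   
   ```
   [scheduler]
   # proposed: let task runners load the dag from the DB by default
   default_load_from_db_to_run_tasks = False
   ```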
   
   The scheduler's dag file processor agent would be deactivated, and the standalone dag processor would be executed by the CI on each release; we need #28711 to avoid the dags being considered stale and cleaned up.
   
   Can someone assign the ticket to me?

