You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/10/28 06:59:10 UTC

[GitHub] [airflow] hterik opened a new issue, #27341: Move db check command to run inside task process.

hterik opened a new issue, #27341:
URL: https://github.com/apache/airflow/issues/27341

   ### Description
   
   When using `KubernetesExecutor` and task workers based on the offical Docker image,  [there is an Entrypoint](https://github.com/apache/airflow/blob/main/Dockerfile#L1408) that runs a shell-script [where **airflow db check** is called](https://github.com/apache/airflow/blob/main/scripts/docker/entrypoint_prod.sh#L282).
   https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/scripts/docker/entrypoint_prod.sh#L282-L283
   
   As opposed to celery tasks, kubernetes tasks go through this whole process for every task instance started. With high number of task instances, this path needs a bit more consideration.
   
   Currently i see two issues
   * Python interpreter startup is famously slow, especially on large projects such as airflow. Even if i haven't measured this particular aspect to know if it has significant impact or not, starting airflow twice just to make one ping is redundant. 
   * New Postgresql connections are slow and expensive for the database. If the db check could be run from within the same process as the task, the connection pool from within sqlalchemy can reuse the same connection after it has been successful. This could cut the number of connections made almost in half. Correct me if i'm wrong about this, i'm not super deeply familiar with the sqlalchemy internals and how it's integrated in Airflow. (Pgbouncer supposedly helps with this but less connections are always better regardless).
   
   ----
   My suggestion is to do something equivalent of just adding db.check() inside task_command.task_run().
   This could eventually also allow more rich code doing the check-retries and logging, and simplify the entrypoint from any shell-script.
   There is a `check_db=False` decorator on the `task_run` today. Setting this to True however also includes db-migrations in the check. I'm not familiar enough with this to say if that's a good idea or not.
   
   https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/airflow/cli/commands/task_command.py#L312-L313
   https://github.com/apache/airflow/blob/550b49b418b0c364b6483cda07e5371b2d816261/airflow/utils/cli.py#L98-L102
   
   ----
   A workaround is to set the envvar `CONNECTION_CHECK_MAX_COUNT=0` in your `pod_template_file.yaml`. I don't know how that could impact stability of running tasks in case the DB is unreachable on task startup. In a way that scenario should ideally be handled anyway, as the check is just a sample and potential race condition for the DB to be unreachable on next real operation.
   
   ### Use case/motivation
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on issue #27341: Move db check command to run inside task process.

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #27341:
URL: https://github.com/apache/airflow/issues/27341#issuecomment-1368239643

   @hterik are you working on this task?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] amoghrajesh commented on issue #27341: Move db check command to run inside task process.

Posted by "amoghrajesh (via GitHub)" <gi...@apache.org>.
amoghrajesh commented on issue #27341:
URL: https://github.com/apache/airflow/issues/27341#issuecomment-1563843208

   @hterik @potiuk if nobody is working on this, I would like to try my hand at it. Looks like a nice improvement to have


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #27341: Move db check command to run inside task process.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27341:
URL: https://github.com/apache/airflow/issues/27341#issuecomment-1296543747

   Ah I see you proposed it. Would you like ot propose a PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] hterik commented on issue #27341: Move db check command to run inside task process.

Posted by GitBox <gi...@apache.org>.
hterik commented on issue #27341:
URL: https://github.com/apache/airflow/issues/27341#issuecomment-1375302546

   Sorry, haven't started at this at all. It's not completely off my radar but i have many other priorities atm. If someone else wants to take this, please go ahead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #27341: Move db check command to run inside task process.

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #27341:
URL: https://github.com/apache/airflow/issues/27341#issuecomment-1583284719

   Feel free (sorry I just picked this up).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org