Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/15 12:54:37 UTC

[GitHub] [airflow] victorjourne opened a new issue, #28380: Celery green threads incompatibility

victorjourne opened a new issue, #28380:
URL: https://github.com/apache/airflow/issues/28380

   ### Apache Airflow version
   
   2.5.0
   
   ### What happened
   
   Celery offers the capability to concurrently run *green threads* like `gevent` or `eventlet` for IO bound tasks.
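   The cooperative-IO idea behind green threads can be sketched with the standard library's `asyncio` (this is only an illustration of the pattern, not how Celery's gevent/eventlet pools are implemented — those monkey-patch blocking calls instead):
   
   ```python
   # Illustration of cooperative concurrency for IO-bound work: 25 "tasks"
   # that each wait on simulated IO overlap in time, so total wall time
   # stays close to one wait rather than 25 sequential waits.
   import asyncio
   import time
   
   async def fetch(i: int) -> int:
       await asyncio.sleep(0.05)  # stands in for a blocking IO call; yields control
       return i + 1
   
   async def main() -> tuple[list[int], float]:
       start = time.monotonic()
       results = await asyncio.gather(*(fetch(i) for i in range(25)))
       return results, time.monotonic() - start
   
   results, elapsed = asyncio.run(main())
   print(results[:3], round(elapsed, 2))
   ```
   
   With a blocking (prefork-style) pool, the same 25 waits would run at most pool-size at a time; with cooperative scheduling they all overlap on one thread.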
   
   I didn't manage to set up a simple Airflow project that follows this basic idea.
   
   1. Let us say _N_ concurrent tasks are scheduled via _dynamic task mapping_ onto a gevent Celery pool of size _C_.
   2. The first _C_ tasks execute correctly and the metadata database is updated, but their status in the result backend (here Postgres) is not updated (**why?**). In the Flower UI, the tasks are still active.
   3. After 10 minutes, an attempt is made to complete the tasks. As a result (and despite worrying logs), the task status turns into _success_.
   4. The same scenario repeats for the remaining _N_ - _C_ tasks.
   
   The log of a task: [dag_id=simple_mapping_run_id=scheduled__2022-12-14T12 10 17.314020+00 00_task_id=add_one_map_index=3_attempt=1.log](https://github.com/apache/airflow/files/10236974/dag_id.simple_mapping_run_id.scheduled__2022-12-14T12.10.17.314020%2B00.00_task_id.add_one_map_index.3_attempt.1.log)
   
   The log of the worker: [airflow-worker.log](https://github.com/apache/airflow/files/10236977/airflow-worker.log)
   
   
   ### What you think should happen instead
   
   Instead, the result backend database should be updated as soon as a task completes, so that other tasks can start running quickly.
   
   First, I suspected the `Postgres` result backend to be the culprit, since it is not clear that psycopg2 handles concurrent writes under `gevent`.
   
   But after seeing worker log warnings identical to those in #8164 about the _gevent monkey-patching_, I have doubts.
   
   ### How to reproduce
   
   - Strictly follow the [Docker Compose Airflow 2.5.0](https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#fetching-docker-compose-yaml) instructions.
   
   - Add to  `airflow-common-env` the env var `AIRFLOW__CELERY__POOL: 'gevent'`.
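   
   As a sketch, the override lands in the quick-start `docker-compose.yaml` like this (key names besides the added `AIRFLOW__CELERY__POOL` line follow the stock file; check your local copy):
   
   ```yaml
   # docker-compose.yaml (excerpt) — quick-start file with the gevent pool enabled
   x-airflow-common:
     &airflow-common
     environment:
       &airflow-common-env
       AIRFLOW__CORE__EXECUTOR: CeleryExecutor
       AIRFLOW__CELERY__POOL: 'gevent'   # added: switch the Celery worker pool to green threads
   ```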
   
   - Launch this simple DAG from the Airflow UI: 
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.decorators import task
   
   with DAG(dag_id="simple_mapping",
           catchup=False,
           start_date=datetime(2022, 3, 4),
           max_active_tasks=200) as dag:
   
       @task
       def add_one(x: int):
           return x + 1
   
       @task
       def sum_it(values):
           total = sum(values)
           print(f"Total was {total}")
   
       xlist = list(range(25))
       added_values = add_one.expand(x=xlist)
   
       sum_it(added_values)
   ```
   
   - Observe the airflow-worker logs, the task flow in the Airflow UI, and the active tasks in Flower.
   
   
   ### Operating System
   
   Ubuntu 22.04.1 LTS
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   Docker Engine - Community - Version: 20.10.18
   Docker Compose version v2.10.2
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] boring-cyborg[bot] commented on issue #28380: Celery green threads incompatibility

boring-cyborg[bot] commented on issue #28380:
URL: https://github.com/apache/airflow/issues/28380#issuecomment-1353020680

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] victorjourne commented on issue #28380: Celery green threads incompatibility

victorjourne commented on issue #28380:
URL: https://github.com/apache/airflow/issues/28380#issuecomment-1355456814

   @potiuk, I still have the issue. Nevertheless, the monkey-patching warnings have disappeared.
   Tested with Breeze: `breeze start-airflow --python 3.7 --backend postgres --postgres-version 13 --integration celery`, then launched the Airflow worker and Flower.
   ```shell
   #files/airflow-breeze-config/variables.env
   export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
   export AIRFLOW__CELERY__POOL=gevent
   export _AIRFLOW_PATCH_GEVENT=1
   ```
   
   I am now quite sure that the Celery broker or backend is responsible. The result backend database is not updated when some tasks finish, contrary to the metadata database... 
   
   What do you think?




[GitHub] [airflow] potiuk commented on issue #28380: Celery green threads incompatibility

potiuk commented on issue #28380:
URL: https://github.com/apache/airflow/issues/28380#issuecomment-1353620554

   As I explained on Slack, it is likely fixed by https://github.com/apache/airflow/pull/28283 . 
   
   Provisionally closing, until you test and confirm whether it is fixed.




[GitHub] [airflow] potiuk closed issue #28380: Celery green threads incompatibility

potiuk closed issue #28380: Celery green threads incompatibility
URL: https://github.com/apache/airflow/issues/28380




[GitHub] [airflow] potiuk commented on issue #28380: Celery green threads incompatibility

potiuk commented on issue #28380:
URL: https://github.com/apache/airflow/issues/28380#issuecomment-1355537151

   > @potiuk, I still have the issue. Nevertheless, the monkey-patching warnings have disappeared.
   
   This is good news. At least it allows us to merge this change, because the warnings are clearly removed.
   
   > I am now quite sure that the Celery broker or backend is responsible. The result backend database is not updated when some tasks finish, contrary to the metadata database...
   
   > What do you think?
   
   There are many things that could have caused this behaviour - but without any evidence I can only guess, and I have no prior similar experience to base it on. As discussed on Slack, the CeleryExecutor is not supported in Breeze as a "working" feature (issue opened: https://github.com/apache/airflow/issues/28412). It might well be some issue in how things are configured in Breeze. If you would like to explore it further and try to investigate how to implement it, that would be awesome. But this is really not a high priority, and I would prefer that someone like you, who is a Celery user, does some more investigation into how to make it work in Breeze. This has happened in the past, and I would like more of the contributors to spend their time improving our dev env - someone who understands what gevent is and wants to use it has enough incentive to do more investigation. Happy to help with it, but I need more evidence and some deeper debugging from your side if we are to progress there.
   
   I am happy to have ideas bounced off me and to help explain the ins and outs of Breeze to guide such a person. A discussion in the #breeze channel on Slack is likely the best place for that.
   
   
   




[GitHub] [airflow] victorjourne commented on issue #28380: Celery green threads incompatibility

victorjourne commented on issue #28380:
URL: https://github.com/apache/airflow/issues/28380#issuecomment-1356891596

   After testing many configurations of the Celery backend, the solution I found is the combination of:
   - Forcing Celery **not to store the task return value** by passing the env variable `CELERY_IGNORE_RESULT='True'` ([docs](https://docs.celeryq.dev/en/stable/userguide/configuration.html#std-setting-task_ignore_result)). 
   Fortunately, Airflow stores the task result through XCom to process it further.
   - Using this bug fix from @potiuk: #28283
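   
   For plain Celery, that env variable corresponds to the `task_ignore_result` setting; as a config-file sketch (how Airflow wires the env variable through to Celery is an assumption I have not verified):
   
   ```python
   # celeryconfig.py (sketch) — equivalent of CELERY_IGNORE_RESULT='True':
   # the worker still runs tasks, but writes nothing to the result backend,
   # so slow or blocked backend writes can no longer stall it.
   task_ignore_result = True
   ```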
   
   However, this seems inconsistent with this Airflow [diagram](https://airflow.apache.org/docs/apache-airflow/stable/executor/celery.html#task-execution-process): the task status in the metadata database should be updated from the result backend table, which is empty... I would dig deeper into the code if I had the time.
   
   In any case, the whole issue with Celery green threads is related to the way Airflow calls the **result backend**. Something is blocking the Celery workers from finishing. I should investigate it more, but I am quite astonished to be the first user to run into this, since concurrently calling **IO-bound tasks** with green threads is a quite common pattern. To achieve that, do you use the `CeleryExecutor`, or the `LocalExecutor`?
   
   

