Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/30 03:40:18 UTC

[GitHub] [airflow] yuqian90 opened a new issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

yuqian90 opened a new issue #16163:
URL: https://github.com/apache/airflow/issues/16163


   **Apache Airflow version**: 1.10.* / 2.0.* / 2.1.*
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`): Any
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Any
   - **OS** (e.g. from /etc/os-release): Any
   - **Kernel** (e.g. `uname -a`): Any
   - **Install tools**: Any
   - **Others**: N/A
   
   **What happened**:
   
   These lines in the TaskInstance log are very misleading. They appear for tasks that take longer than about an hour. Users waiting for a task to finish who see them often get confused and think something is wrong with their task or with Airflow. In fact, the lines are harmless: the scheduler is simply saying "the TaskInstance is already running, so it cannot be run again".
   
   ```
   {taskinstance.py:874} INFO - Dependencies not met for <TaskInstance: ... [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
   {taskinstance.py:874} INFO - Dependencies not met for <TaskInstance: ... [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
   ```
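   
   For illustration, here is a heavily simplified, hypothetical sketch of the kind of state check that produces such a line (the names and structure are invented for this example and are not Airflow's actual implementation):
   
   ```python
   # Hypothetical sketch: a dependency rule that deliberately fails when the
   # task instance is already running, so a duplicate run attempt is refused.
   def task_not_running_dep(ti_state):
       if ti_state == "running":
           return False, "Task is in the running state"
       return True, ""
   
   passed, reason = task_not_running_dep("running")
   if not passed:
       # Logged at INFO because refusing a duplicate attempt is expected,
       # not an error in the task itself.
       print("Dependencies not met for <TaskInstance: ... [running]>, "
             f"dependency 'Task Instance Not Running' FAILED: {reason}")
   ```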
   
   **What you expected to happen**:
   
   The confusion is unnecessary. This line should either be silenced or replaced with a clearer message.
   
   **How to reproduce it**:
   
   Any task that takes more than an hour to run produces these lines in its log.
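   
   For example, a minimal sketch of a DAG that reproduces this, assuming Airflow 2.x under CeleryExecutor with a Redis broker left at Celery's default 3600s `visibility_timeout` (the DAG and task names are made up for the example):
   
   ```python
   import time
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   
   
   def sleep_past_visibility_timeout():
       # Sleep longer than the default 3600s Redis visibility timeout.
       time.sleep(2 * 60 * 60)
   
   
   with DAG(
       dag_id="long_running_repro",  # hypothetical name
       start_date=datetime(2021, 1, 1),
       schedule_interval=None,
       catchup=False,
   ) as dag:
       PythonOperator(
           task_id="sleep_two_hours",  # hypothetical name
           python_callable=sleep_past_visibility_timeout,
       )
   ```
   
   Roughly an hour after the task starts, the two INFO lines quoted above should appear in its log.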





[GitHub] [airflow] potiuk commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

potiuk commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-872878372


   Not sure. Needs investigation.





[GitHub] [airflow] ashb commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

ashb commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-919081611


   @malthe That's not quite what is happening. My understanding is this:
   
   - The task message on celery is picked up and the TI starts running
   - After 1 hour (the celery visibility timeout), celery makes the message visible again
   - A second worker picks up the celery message
   - When the second one runs, it gets this message.
   
   
   So the first attempt is still running, and this message is the _second_ concurrent worker (but for the same try_number) saying "I can't run this task, it's already running."
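   
   For reference, `visibility_timeout` is a Celery broker transport option (in seconds) for the Redis and SQS transports. A minimal standalone sketch of raising it, assuming a plain Redis broker (the app name and broker URL here are made up for the example, not Airflow's actual wiring):
   
   ```python
   # Sketch: raise Celery's visibility_timeout so the broker does not re-deliver
   # a task message to a second worker while the first is still running it.
   from celery import Celery
   
   app = Celery("repro_app", broker="redis://localhost:6379/0")  # assumed broker URL
   
   # For the Redis transport the default is 3600s (1 hour): messages not
   # acknowledged within that window become visible to other workers again.
   app.conf.broker_transport_options = {"visibility_timeout": 6 * 60 * 60}
   ```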





[GitHub] [airflow] malthe commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

malthe commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-872868696


   In the case where the visibility timeout is reached, it's confusing that there is no clear log line saying the task has been killed for taking too long to complete.
   
   (If that's indeed what is happening.)
   
   @potiuk is it the case that the Celery task is killed, or is it simply no longer streaming logs into Airflow at that point?





[GitHub] [airflow] yuqian90 edited a comment on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

yuqian90 edited a comment on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-868977040


   After some more investigation, it's very likely we see this log appearing an hour after a long-running task started because of the default `visibility_timeout` setting in Celery. This code in default_celery.py sets `visibility_timeout` to 21600 only if the broker_url starts with `redis://` or `sqs://`. In our case we are using Redis Sentinel, so the broker is still redis although the URL starts with `sentinel`. Therefore `visibility_timeout` is left at 3600, which is the default according to the [celery documentation](https://docs.celeryproject.org/en/v4.4.7/getting-started/brokers/redis.html#visibility-timeout). The weird thing is that after I manually changed `visibility_timeout` to a very large integer in airflow.cfg, the same log still showed up exactly an hour after a task started, so it seems changing `visibility_timeout` in this case does not make any difference. Not sure if anyone has experienced the same.
   
   @david30907d maybe try changing `visibility_timeout` to a large number in your setup and see if the log still appears after an hour. If it stops for you, `visibility_timeout` is probably the cause; there may be something wrong in our own setup that prevents the `visibility_timeout` change from taking effect.
   
   ```python
   # Excerpt from airflow/config_templates/default_celery.py (imports added for context).
   import logging

   from airflow.configuration import conf


   def _broker_supports_visibility_timeout(url):
       return url.startswith("redis://") or url.startswith("sqs://")


   log = logging.getLogger(__name__)

   broker_url = conf.get('celery', 'BROKER_URL')

   broker_transport_options = conf.getsection('celery_broker_transport_options') or {}
   if 'visibility_timeout' not in broker_transport_options:
       if _broker_supports_visibility_timeout(broker_url):
           broker_transport_options['visibility_timeout'] = 21600
   ```
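   
   To see why the Sentinel URL misses the 21600 default, the prefix check quoted above can be exercised directly:
   
   ```python
   # The check from default_celery.py, run standalone against two broker URLs.
   def _broker_supports_visibility_timeout(url):
       return url.startswith("redis://") or url.startswith("sqs://")
   
   # A plain Redis URL gets the 21600s default applied by default_celery.py...
   print(_broker_supports_visibility_timeout("redis://redis-host:6379/0"))         # True
   # ...but a Redis Sentinel URL does not, so Celery's own 3600s default applies.
   print(_broker_supports_visibility_timeout("sentinel://sentinel-host:26379/0"))  # False
   ```
   
   Note also that, per the snippet, the 21600 default is only applied when `visibility_timeout` is absent from the `[celery_broker_transport_options]` section, so setting it explicitly there should bypass the prefix check regardless of the URL scheme.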





[GitHub] [airflow] potiuk commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

potiuk commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-869597629


   Moved it to 2.2 now - I guess we have not addressed it yet in 2.1.1.





[GitHub] [airflow] kaxil commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

kaxil commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-919080190


   @jedcunningham @ephraimbuddy Can one of you take a look at this too, please?





[GitHub] [airflow] ashb commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

ashb commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-919093741


   This is largely Airflow and Celery "fighting" over behaviour.





[GitHub] [airflow] malthe commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

malthe commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-919085793


   @ashb so regardless of whether the first worker is still running or defunct/dead, shouldn't the second worker be able to "take over" somehow?
   
   Otherwise, what's the point in trying?





[GitHub] [airflow] david30907d commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

david30907d commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-868977773


   @yuqian90 thanks, very useful information. I'll give `celery_broker_transport_options` a shot!





[GitHub] [airflow] david30907d commented on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

david30907d commented on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-866438191


   I bumped into this issue and misunderstood it until now 😅
   Thanks for pointing out that it's okay to wait.





[GitHub] [airflow] kaxil removed a comment on issue #16163: Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state"

kaxil removed a comment on issue #16163:
URL: https://github.com/apache/airflow/issues/16163#issuecomment-919080190


   @jedcunningham @ephraimbuddy Can one of you take a look at this too, please?

