Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/28 16:21:51 UTC

[GitHub] [airflow] schattian opened a new issue, #23497: Tasks stuck indefinitely when following container logs

schattian opened a new issue, #23497:
URL: https://github.com/apache/airflow/issues/23497

   ### Apache Airflow version
   
   2.2.4
   
   ### What happened
   
   I observed that some workers hung randomly while running, and their logs stopped being reported. After some time, the pod status was "Completed" when inspected through the k8s api, but not in Airflow, which still showed "status:running" for the pod.
   After some investigation: the issue is in the new kubernetes pod operator and depends on a known issue in the kubernetes api.
   
   When a log rotate event occurs in kubernetes, the stream we consume in fetch_container_logs(follow=True, ...) is no longer fed.
   
   Therefore, the k8s pod operator hangs indefinitely in the middle of the log. Only a SIGTERM can terminate it, since log consumption blocks execute() from finishing.
   
   Ref to the issue in kubernetes: https://github.com/kubernetes/kubernetes/issues/59902
   
   Linking to https://github.com/apache/airflow/issues/12103 for reference, as the result is more or less the same for the end user (although the root cause is different).
   
   ### What you think should happen instead
   
   The pod operator should not hang.
   The pod operator could also follow the new logs from the container - although that part is out of scope for airflow, as ideally the k8s api would do it automatically.
   
   ### Solution proposal
   
   I think there are several ways to work around this on airflow's side so it does not hang indefinitely (for example, making `fetch_container_logs` non-blocking for `execute` and instead always blocking until status.phase is completed, as is currently done when get_logs is not true).
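
   A rough sketch of what this workaround could look like (minimal and illustrative only - the helper names mirror the provider's pod manager API, but the wiring is not the actual operator code):
   ```python
   # Sketch: follow logs in a daemon thread so a hung stream cannot block
   # execute(); block on the authoritative pod phase instead.
   import threading

   def execute_with_nonblocking_logs(pod_manager, pod, container_name):
       log_thread = threading.Thread(
           target=pod_manager.fetch_container_logs,
           kwargs={"pod": pod, "container_name": container_name, "follow": True},
           daemon=True,  # a stuck stream no longer wedges the task
       )
       log_thread.start()

       # Wait on the pod phase, which kubelet keeps accurate even when the
       # log stream has gone quiet after a rotation.
       remote_pod = pod_manager.await_pod_completion(pod)

       # Give the log thread a moment to drain remaining output, but never
       # wait forever.
       log_thread.join(timeout=60)
       return remote_pod
   ```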
   
   ### How to reproduce
   
   Running many tasks will sooner or later trigger this. One can also configure more aggressive log rotation in k8s so that this race is triggered more often.
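
   For example, on a cluster where you control the kubelet configuration, the rotation thresholds can be lowered so logrotate fires quickly under heavy logging (a hedged sketch - field names are from KubeletConfiguration v1beta1, values are arbitrary):
   ```yaml
   # Rotate container logs aggressively so the race window opens often.
   apiVersion: kubelet.config.k8s.io/v1beta1
   kind: KubeletConfiguration
   containerLogMaxSize: 1Mi   # rotate after ~1 MiB (default is 10Mi)
   containerLogMaxFiles: 2    # keep only a couple of rotated files
   ```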
   
   #### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   #### Versions of Apache Airflow Providers
   ```
   apache-airflow==2.2.4
   apache-airflow-providers-google==6.4.0
   apache-airflow-providers-cncf-kubernetes==3.0.2
   ```
   
   However, this should be reproducible with master.
   
   #### Deployment
   
   Official Apache Airflow Helm Chart
   
   
   
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   




[GitHub] [airflow] schattian commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
schattian commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1121203954

   @potiuk sure, I will submit one, one of these days.




[GitHub] [airflow] moiseenkov commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
moiseenkov commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1346520028

   I've finally managed to reproduce this bug with the following DAG on composer-2.0.29-airflow-2.3.3:
   ```python
   import datetime
   
   from airflow import models
   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
   
   YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
   
   with models.DAG(
       dag_id="composer_sample_kubernetes_pod",
       schedule_interval=datetime.timedelta(days=1),
       start_date=YESTERDAY,
   ) as dag:
       timeout = 240
       iterations = 600 * 1000
       arguments = \
           'for i in {1..%(iterations)s}; do echo "$i of %(iterations)s"; done' % {'iterations': iterations}
   
       kubernetes_min_pod_0 = KubernetesPodOperator(
           task_id="pod-ex-minimum-0",
           name="pod-ex-minimum-0",
           cmds=["/bin/bash", "-c"],
           arguments=[arguments],
           namespace="default",
           image="gcr.io/gcp-runtimes/ubuntu_18_0_4",
           startup_timeout_seconds=timeout
       )
   ```
   With this example the container prints 600K log messages and terminates very quickly. Meanwhile the Kubernetes API is pulling chunks of container logs from a stream. The pulling is much slower, so eventually we reach a state where the container has terminated but we're still pulling logs; pulling continues for about 2-3 minutes after termination. It looks to me like the logs are cached somewhere at a lower level, and once that cache is exhausted, the stream hangs. It should perhaps check the socket or connection status, but in practice it just hangs.
   
   Here's the line of code that hangs on Airflow's side: https://github.com/apache/airflow/blob/395a34b960c73118a732d371e93aeab8dcd76275/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L232

   And here's the underlying line of code that hangs on urllib3's side: https://github.com/urllib3/urllib3/blob/d393b4a5091c27d2e158074f81feb264c5c175af/src/urllib3/response.py#L999

   If I'm right, then the source of the issue lies in third-party libraries (the Kubernetes API or urllib3). In that case the easiest solution would be to check the container status before pulling each chunk of logs from the `urllib3.response.HTTPResponse`. A more robust solution would be caching logs into temporary storage and fetching them from that new source independently of the container life-cycle, but I'm not sure that's possible.
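
   A rough sketch of the "check container status between chunks" idea, using only calls that exist in the kubernetes Python client (`read_namespaced_pod_log`, `read_namespaced_pod_status`); the read-timeout/reconnect wiring and the helper below are illustrative assumptions, not Airflow code:
   ```python
   import urllib3
   from kubernetes import client

   def container_terminated(v1: client.CoreV1Api, name, namespace, container):
       """Return True once the target container reports a 'terminated' state."""
       pod = v1.read_namespaced_pod_status(name=name, namespace=namespace)
       for status in pod.status.container_statuses or []:
           if status.name == container:
               return status.state.terminated is not None
       return False

   def follow_logs(v1: client.CoreV1Api, name, namespace, container, read_timeout=30):
       kwargs = {}
       while True:
           resp = v1.read_namespaced_pod_log(
               name=name, namespace=namespace, container=container,
               follow=True,
               _preload_content=False,               # raw urllib3.HTTPResponse
               _request_timeout=(10, read_timeout),  # (connect, read) seconds
               **kwargs,
           )
           try:
               for chunk in resp.stream(amt=4096, decode_content=True):
                   print(chunk.decode("utf-8", errors="replace"), end="")
               return  # server closed the stream normally
           except urllib3.exceptions.ReadTimeoutError:
               # The stream went quiet. If the container is done, the stream
               # is probably dead (e.g. after a log rotation): stop following.
               if container_terminated(v1, name, namespace, container):
                   return
               # Otherwise reconnect, resuming roughly where we left off
               # (may duplicate a few lines - illustrative only).
               kwargs = {"since_seconds": read_timeout}
   ```
   The read timeout matters: without it, the status check never runs, because the loop stays blocked inside the socket read that never returns.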




[GitHub] [airflow] schattian commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
schattian commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1121121785

   @potiuk hm, I don't see how that could help here.

   The code here continues to block execute: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L386-L396
   and the new python api (looking at kubernetes==23.0) still allows hanging (which I think is expected, as it's just an SDK, and preventing it would really imply a change in the kube API itself).

   The only related issue I see in the k8s python implementation is https://github.com/kubernetes-client/python/issues/199, but that was an implementation bug and affected all calls (not only the ones affected by logrotate).
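
   To make the SDK point concrete, this is roughly the call shape involved - a minimal illustration (pod/namespace names are placeholders), not a fix:
   ```python
   # With follow=True the client hands back a streaming urllib3 response;
   # iterating it blocks in a socket read with no timeout unless the caller
   # sets one, so the SDK by itself cannot prevent the hang.
   from kubernetes import client, config

   config.load_kube_config()
   v1 = client.CoreV1Api()

   resp = v1.read_namespaced_pod_log(
       name="my-pod", namespace="default",
       follow=True,
       _preload_content=False,  # stream instead of buffering everything
   )
   for chunk in resp.stream(amt=4096):
       # If kubelet rotates the log file mid-stream, no further data ever
       # arrives and this loop blocks here indefinitely.
       print(chunk.decode("utf-8", errors="replace"), end="")
   ```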




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1198369253

   But if you would like to fix it - feel absolutely free @swalkowski :)




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1121089809

   Can you please try it with Airflow 2.3.0, @schattian - one of the changes in 2.3.0 (and the corresponding cncf provider 4.0+) was migrating to newer kubernetes libraries that handle similar cases much better. Unfortunately you need to move to Airflow 2.3.0 for that, because of the old kubernetes library dependencies in the Airflow 2.2.* series.

   I think that was the root cause of the problem, so I will provisionally close this one. We can reopen it if this turns out not to be fixed.




[GitHub] [airflow] VOvchinnikov commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
VOvchinnikov commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1311654675

   As of now, 2.3.3 still suffers from the same fate. It is pretty much a constant annoyance for us TBH, and I know of at least one more group who suffer from a similar issue.
   But it's fairly hard to put a finger on _why_ exactly this happens - and googling "Airflow task stuck in running" returns pretty much anything _but_ this particular cause.




[GitHub] [airflow] potiuk closed issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #23497: Tasks stuck indefinitely when following container logs
URL: https://github.com/apache/airflow/issues/23497




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1121187437

   Ah ok, I see. Would you like to propose a PR to fix it? You seem to understand the root cause very well, you can apparently reproduce it, and you have a good grasp of how it could be circumvented and of what's going on under the hood. You could become one of the > 2000 contributors to airflow this way. A PR around that is most welcome @schattian




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1346347794

   Yeah. I think a reproduction would be helpful. For now we should close the issue if it cannot be reproduced on the latest Airflow. There have been many changes in the Kubernetes integration since 2.2, and the only way to get a fix will be to upgrade anyway. I will close it, and we can reopen if evidence for 2.5 is shown (or better, a new issue is opened for 2.5 if it still happens).




[GitHub] [airflow] potiuk closed issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #23497: Tasks stuck indefinitely when following container logs
URL: https://github.com/apache/airflow/issues/23497




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1346563700

   Yes. This code was heavily rewritten and uses completely different paradigms now. Upgrading Airflow to the latest possible version is advised. Ideally - if you use composer - upgrade to the latest version available in composer, and then follow along whenever composer releases new versions.

   The only way we release fixes is by releasing upgraded versions, and we only do that for the latest minor version (2.5.0 currently). So even if we find remaining bugs there, the only way to apply a fix is to upgrade - which is something you will have to do anyway.




[GitHub] [airflow] schattian commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
schattian commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1318967988

   fyi @potiuk @swalkowski after investigating this issue a bit more, it seems there's a workaround: using the kubernetes executor.
   With the celery executor, the job status is not updated until the pid that observes the execution finishes (and that execution hangs on the logs api).

   With the k8s executor, the status is updated based on the pod status instead, so it does not hang (the pod status was always updated correctly; it was the job status that wasn't).

   I'm trying it; so far it's working, will update.
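
   To make the workaround concrete: the executor is selected through Airflow's configuration. A minimal snippet (whether switching executors is viable depends on your deployment; shown only for illustration):
   ```
   # airflow.cfg - select the Kubernetes executor (equivalent env var:
   # AIRFLOW__CORE__EXECUTOR=KubernetesExecutor)
   [core]
   executor = KubernetesExecutor
   ```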




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1121219013

   Cool. Assigned you :) !




[GitHub] [airflow] potiuk closed issue #23497: Tasks stuck indefinitely when following container logs

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk closed issue #23497: Tasks stuck indefinitely when following container logs
URL: https://github.com/apache/airflow/issues/23497




[GitHub] [airflow] moiseenkov commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
moiseenkov commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1339551453

   Hi everyone,
   It seems that the problem is still relevant. Could somebody please provide me with a DAG that reproduces the issue? I ran `KubernetesPodOperator` in a Composer environment (composer-2.0.29-airflow-2.3.3) lots of times, and it executed with no problems.




[GitHub] [airflow] potiuk closed issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #23497: Tasks stuck indefinitely when following container logs
URL: https://github.com/apache/airflow/issues/23497




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1317273947

   > What was the reversion reason? I seem to fail to grasp that one - maybe then it could be addressed in the re-opened PR.
   
   Surely - feel free to create a new PR and continue that. Possibly you are the best person to see whether the original change fixes the problem and to lead it to completion. I'm not sure why the previous change was reverted (you can likely ask in the PR that reverted it), but for sure the way to fix this is for someone to take over the PR and lead it to completion.

   BTW, it would also be worthwhile (and you can do it while working on the PR) to check whether the problem still exists in the latest main/2.4 (and maybe you should start by migrating to 2.4 to check) - because maybe it's been fixed already; there were some fixes to the k8s library integration in 2.4, I believe.
   




[GitHub] [airflow] swalkowski commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
swalkowski commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1198347004

   Should we reopen this issue given that the fix has been reverted? Is there an alternative plan for resolving it?




[GitHub] [airflow] potiuk commented on issue #23497: Tasks stuck indefinitely when following container logs

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1198368669

   > Should we reopen this issue given that the fix has been reverted? Is there an alternative plan for resolving it?
   
   As usual here :) - if someone fixes it, it will get fixed. But I do not think it's a high priority for anyone.

