Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/15 04:52:45 UTC

[GitHub] [airflow] ayman-albaz opened a new issue, #28372: Airflow KubernetesPodOperator task running despite no resources being available

ayman-albaz opened a new issue, #28372:
URL: https://github.com/apache/airflow/issues/28372

   ### Apache Airflow version
   
   2.5.0
   
   ### What happened
   
   I have a dynamically mapped task that is supposed to launch over 100 KubernetesPodOperator tasks, with 2.0 CPUs assigned per task. When running the DAG, 16 tasks are in the 'running' state, but only 3 actually run; the remaining 13 fail with `Pod took longer than 120 seconds to start`. The rest of the tasks are either queued or scheduled, and whenever there are fewer than 16 active tasks, more of them start and mostly fail with the same error.
   
   Here is a snapshot of the cluster state:
   ```
   kubectl -n airflow get all
   NAME                                                            READY   STATUS              RESTARTS   AGE
   pod/airflow-postgresql-0                                        1/1     Running             5          2d
   pod/airflow-scheduler-6dd68b485c-w8bhp                          3/3     Running             19         2d
   pod/airflow-statsd-586dbdcc6b-h4mnr                             1/1     Running             5          2d
   pod/airflow-triggerer-95565b95d-phts7                           2/2     Running             14         2d
   pod/airflow-webserver-599bb95bcd-7dtpk                          1/1     Running             5          2d
   pod/my-task-17dd038ca4d04164ba90f9c7f9a7fbb6            0/2     Pending             0          49s
   pod/my-task-20aba86c65544ea384343f8fb4415d3a            0/2     Pending             0          53s
   pod/my-task-3c5b4444a7d242459907ff3be7b7d6f6            0/2     Pending             0          44s
   pod/my-task-5c8af5edb0904711b6a76a2edf1d1067            0/2     Pending             0          60s
   pod/my-task-6001d3567f96400bb0ae559f22d3d2db            0/2     Pending             0          43s
   pod/my-task-6dfb1945f3ff4ac4a06c7e6c6a85099c            0/2     Pending             0          81s
   pod/my-task-71ad2fb48fb64f449014bba45bee980f            0/2     ContainerCreating   0          52s
   pod/my-task-774216cb5f9344ffb35deac826d71639            0/2     Pending             0          68s
   pod/my-task-814266d425254130868c3a5ebc8dce49            0/2     Pending             0          67s
   pod/my-task-a11588d878b54944b4c069f49231ac36            0/2     Pending             0          77s
   pod/my-task-b16c843fa038441ea31b90363ed86aa0            0/2     Pending             0          49s
   pod/my-task-b85e2ed3417a4a62940661f418c900e5            0/2     Pending             0          60s
   pod/my-task-d1de2a771a104a2592956a713f785300            0/2     Pending             0          73s
   pod/my-task-dbeba55a80074c08bbdf023b3f0b885c            0/2     Completed           0          10m
   pod/my-task-f83ad2805d314be3a7307b7216a54e53            2/2     Running             0          10m
   pod/pipeline-my-task-0bc9e094afee4527b5b764e32f590282   0/1     Init:0/1            0          1s
   pod/pipeline-my-task-1d51c5d3776e4dd8a89461e8a76faba1   1/1     Running             0          62s
   pod/pipeline-my-task-24b1326a71d149fb9f62c101647468ee   1/1     Running             0          62s
   pod/pipeline-my-task-29b132b7b0ce4832a5e30a821c6405bf   1/1     Running             0          10m
   pod/pipeline-my-task-29fb55604eec457fa21d13d85c7889b5   1/1     Running             0          10m
   pod/pipeline-my-task-2a337f1cc28b4315945cec8a961b1111   1/1     Running             0          69s
   pod/pipeline-my-task-35d5c97570474082bc9b04189c433be7   1/1     Running             0          57s
   pod/pipeline-my-task-569de133975d4dbb96becb2a04c0dac3   1/1     Running             0          78s
   pod/pipeline-my-task-96a9681ace4441deba4faeef602f6e5b   1/1     Running             0          78s
   pod/pipeline-my-task-9dcb9578720643eca5fa918a0a295f87   1/1     Running             0          87s
   pod/pipeline-my-task-a643741d29ea4f4baa06e0ea20bc1a57   1/1     Running             0          10m
   pod/pipeline-my-task-b04532a9f35a48a09cb1d46c9d9470dd   1/1     Running             0          57s
   pod/pipeline-my-task-c9b7bb4ee07749be98083a11a512e1f4   1/1     Running             0          90s
   pod/pipeline-my-task-d9c5ce9bf5ce499583cdf0ea3f58b7f0   1/1     Running             0          82s
   pod/pipeline-my-task-dd5a5a45374f487fbc34c904e71b93b5   1/1     Running             0          59s
   pod/pipeline-my-task-ea8c39d657824a1db505b00e8673b06a   1/1     Running             0          69s
   pod/pipeline-my-task-fb5c71d274034f5392aebe0f4b395d98   1/1     Running             0          65s
   ```
   
   ### What you think should happen instead
   
   Only 3 tasks should be running.
   The remaining tasks should be scheduled or queued.
   
   ### How to reproduce
   
   ```python
   import pendulum

   from airflow.decorators import dag, task
   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
       KubernetesPodOperator,
   )
   from kubernetes.client import models as k8s


   @dag(
       schedule=None,
       start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
       catchup=False,
       tags=["example"],
   )
   def pipeline():

       container_resources = k8s.V1ResourceRequirements(
           limits={
               "memory": "512Mi",
               "cpu": 2.0,
           },
           requests={
               "memory": "512Mi",
               "cpu": 2.0,
           },
       )

       volumes = [
           k8s.V1Volume(
               name="pvc-airflow",
               persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                   claim_name="pvc-airflow"
               ),
           )
       ]

       volume_mounts = [
           k8s.V1VolumeMount(mount_path="/airflow", name="pvc-airflow", sub_path=None)
       ]

       @task
       def make_list():
           return [{"a": "a"}] * 100

       my_task = KubernetesPodOperator.partial(
           name="my_task",
           task_id="my_task",
           image="ubuntu:20.04",
           namespace="airflow",
           container_resources=container_resources,
           volumes=volumes,
           volume_mounts=volume_mounts,
           in_cluster=True,
           do_xcom_push=True,
           get_logs=True,
           cmds=["/bin/bash", "-c", "sleep 600"],
       ).expand(env_vars=make_list())


   pipeline()
   ```
   
   ### Operating System
   
   Ubuntu 20.04.5 LTS
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   I am running this locally using the helm chart on Kind.
   
   My machine has 4 physical cores (8 logical CPUs) and 16 GB of RAM.
   
   ### Anything else
   
   I have confirmed that the failing tasks never actually start: they time out because their pods wait too long for resources.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   




[GitHub] [airflow] potiuk closed issue #28372: Airflow KubernetesPodOperator task running despite no resources being available

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
URL: https://github.com/apache/airflow/issues/28372




[GitHub] [airflow] potiuk commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1355901746

   > > You can limit the number of concurrent tasks run by Airflow thanks to an Airflow pool
   > > or with concurrency settings on the DAG
   > 
   > Yeah, perhaps this is not necessarily a bug in Airflow then, but more of a feature request. What you suggested is a workaround for my problem, thanks for the suggestion.
   
   > I think it would be pretty cool if the Airflow scheduler were aware of a cluster's resources and autoscaling capability and scheduled accordingly (i.e. kept jobs running, and held back the remainder for which no resources can possibly be allocated).
   
   This is actually not a workaround. This is how you are supposed to limit resources in Airflow when you use Kubernetes Pod Operator. 
   
   Using the Kubernetes Pod Operator and expecting Airflow to understand resource limits coming from autoscaling of the cluster it runs on would basically mean that Airflow would have to copy the whole scheduling logic of Kubernetes to know what it can / cannot schedule. I am not sure if you are aware that there are plenty of things Kubernetes takes into account when scheduling pods - and many of them have super complex logic. It's not only memory, but also affinities, anti-affinities, labels that do or do not match the nodes the pod could run on, and plenty of others. For example, imagine you have 20 KPOs each requiring a GPU and only 2 GPUs are available. And this is only one of the cases. Duplicating the whole logic of K8S in Airflow is not only difficult but also error-prone, and it would mean that Airflow's KPO would be closely tied to a specific version of K8S, because new features of K8S are added with each release. What you ask for is not really feasible.
   
   You might think it is simple for your specific case because you just **know** you have 2 CPUs per node and you know you have 6 of them in total, so it must be simple for Airflow to know it ... But in fact Airflow would have to implement very complex logic to know it in the general case. By providing the Pool you ACTUALLY pass your knowledge to Airflow, and it then knows what the limits are without performing all the complex and brittle K8s logic.
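
   For illustration, a minimal sketch of what passing that knowledge through a pool could look like, reusing `container_resources` and `make_list` from the reproduction DAG above (the pool name `kpo_cpu_pool` and its 3 slots are example values, not something Airflow creates for you):

   ```python
   # Create a pool whose slot count encodes what *you* know about the cluster,
   # e.g. 3 slots because each pod requests 2 CPUs and only ~6 CPUs are free:
   #
   #   airflow pools set kpo_cpu_pool 3 "cap concurrent KubernetesPodOperator pods"
   #
   # Then point the mapped tasks at that pool; the scheduler will keep at most
   # 3 of them running at once instead of letting pods pile up in Pending.
   my_task = KubernetesPodOperator.partial(
       name="my_task",
       task_id="my_task",
       image="ubuntu:20.04",
       namespace="airflow",
       pool="kpo_cpu_pool",  # example pool created with the CLI command above
       container_resources=container_resources,
       cmds=["/bin/bash", "-c", "sleep 600"],
   ).expand(env_vars=make_list())
   ```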
   
   We do not really want to re-implement K8S in Airflow.
   
   But you can do better than manually allocating a fixed pool of resources for your workloads. And Airflow has you covered.
   
   If you really want to do scaling, then you can use the Celery Executor running on K8S. As surprising as it may sound, this is a pretty good way to implement K8S auto-scaling. This is precisely what the Celery Executor was designed for - especially if you have relatively short tasks which are similar to each other in terms of complexity, the CeleryExecutor is the way to go rather than running tasks through KPOs. We have KEDA-based auto-scaling implemented in our Helm Chart, and if you run it on top of an auto-scaling K8S cluster, it will actually be able to handle autoscaling well. You can even combine it with long-running Kubernetes tasks by running the CeleryKubernetesExecutor and choosing which tasks run where.
   
   Again - in this case you need to manage queues to direct your load, but then those queues can dynamically grow in size if you want them to.
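
   For illustration, a rough sketch of that queue-based routing (it assumes the default `[celery_kubernetes_executor] kubernetes_queue = kubernetes` setting; the task names are made up):

   ```python
   import pendulum

   from airflow.decorators import dag, task


   @dag(schedule=None, start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), catchup=False)
   def mixed_workload():

       @task
       def short_transform():
           # Default queue: picked up by a Celery worker, which KEDA can scale
           # out and in based on the queue depth.
           return sum(range(100))

       @task(queue="kubernetes")  # must match the configured kubernetes_queue
       def long_running_job(total):
           # Matching queue: handed to the Kubernetes executor instead and run
           # in its own pod, so it does not occupy a Celery worker slot.
           print(f"processing total={total}")

       long_running_job(short_transform())


   mixed_workload()
   ```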




[GitHub] [airflow] raphaelauv commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available

Posted by GitBox <gi...@apache.org>.
raphaelauv commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1354144857

   You can limit the number of concurrent tasks run by Airflow thanks to an Airflow pool
   
   or with concurrency settings on the DAG
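
   For illustration, a minimal sketch of the DAG-level setting (the limit of 3 is an example value):

   ```python
   import pendulum

   from airflow.decorators import dag


   @dag(
       schedule=None,
       start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
       catchup=False,
       max_active_tasks=3,  # example: at most 3 task instances of this DAG run at once
   )
   def pipeline():
       # the mapped KubernetesPodOperator tasks from the issue's DAG would go here
       ...


   pipeline()
   ```

   A per-task alternative is `max_active_tis_per_dag=3` on the operator itself, which caps how many of its mapped instances run concurrently.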
   




[GitHub] [airflow] boring-cyborg[bot] commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1352556806

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] ayman-albaz commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available

Posted by GitBox <gi...@apache.org>.
ayman-albaz commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1352559497

   Also some additional info
   ```
   
   kubectl -n airflow describe node
   Name:               kind-control-plane
   Roles:              master
   Labels:             beta.kubernetes.io/arch=amd64
                       beta.kubernetes.io/os=linux
                       kubernetes.io/arch=amd64
                       kubernetes.io/hostname=kind-control-plane
                       kubernetes.io/os=linux
                       node-role.kubernetes.io/master=
   Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                       node.alpha.kubernetes.io/ttl: 0
                       volumes.kubernetes.io/controller-managed-attach-detach: true
   CreationTimestamp:  Mon, 12 Dec 2022 21:54:55 -0500
   Taints:             <none>
   Unschedulable:      false
   Lease:
     HolderIdentity:  kind-control-plane
     AcquireTime:     <unset>
     RenewTime:       Wed, 14 Dec 2022 22:50:52 -0500
   Conditions:
     Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
     ----             ------  -----------------                 ------------------                ------                       -------
     MemoryPressure   False   Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
     DiskPressure     False   Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
     PIDPressure      False   Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
     Ready            True    Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletReady                 kubelet is posting ready status
   Addresses:
     Hostname:    kind-control-plane
   Capacity:
     cpu:                8
     ephemeral-storage:  382935608Ki
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             16336528Ki
     pods:               110
   Allocatable:
     cpu:                8
     ephemeral-storage:  382935608Ki
     hugepages-1Gi:      0
     hugepages-2Mi:      0
     memory:             16336528Ki
     pods:               110
   System Info:
     Kernel Version:             5.15.0-56-generic
     OS Image:                   Ubuntu 20.10
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  containerd://1.4.0-106-gce4439a8
     Kubelet Version:            v1.18.15
     Kube-Proxy Version:         v1.18.15
   PodCIDR:                      10.244.0.0/24
   PodCIDRs:                     10.244.0.0/24
   ProviderID:                   kind://docker/kind/kind-control-plane
   Non-terminated Pods:          (33 in total)
     Namespace                   Name                                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
     ---------                   ----                                                         ------------  ----------  ---------------  -------------  ---
     airflow                     airflow-postgresql-0                                         250m (3%)     0 (0%)      256Mi (1%)       0 (0%)         2d
     airflow                     airflow-scheduler-6dd68b485c-w8bhp                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
     airflow                     airflow-statsd-586dbdcc6b-h4mnr                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
     airflow                     airflow-triggerer-95565b95d-phts7                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
     airflow                     airflow-webserver-599bb95bcd-7dtpk                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
     airflow                     my-task-dbeba55a80074c08bbdf023b3f0b885c             2001m (25%)   2 (25%)     512Mi (3%)       512Mi (3%)     7m39s
     airflow                     my-task-e2d57a9de7934aaabd9b4c481c0b8fde             2001m (25%)   2 (25%)     512Mi (3%)       512Mi (3%)     7m39s
     airflow                     my-task-f83ad2805d314be3a7307b7216a54e53             2001m (25%)   2 (25%)     512Mi (3%)       512Mi (3%)     7m35s
     airflow                     pipeline-my-task-0c6bba5cd622466fa6e234d3dbd9151b    0 (0%)        0 (0%)      0 (0%)           0 (0%)         67s
     airflow                     pipeline-my-task-19efa573444842f0b259f72f01680994    0 (0%)        0 (0%)      0 (0%)           0 (0%)         59s
     airflow                     pipeline-my-task-22faa8c5d7ce4e439917be3e4cb9bbdf    0 (0%)        0 (0%)      0 (0%)           0 (0%)         51s
     airflow                     pipeline-my-task-29b132b7b0ce4832a5e30a821c6405bf    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m49s
     airflow                     pipeline-my-task-29fb55604eec457fa21d13d85c7889b5    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m48s
     airflow                     pipeline-my-task-2cafc04fcbd84f22b91d58b19b086b23    0 (0%)        0 (0%)      0 (0%)           0 (0%)         43s
     airflow                     pipeline-my-task-2e1cbe0e5e694954b218bf351fefeb56    0 (0%)        0 (0%)      0 (0%)           0 (0%)         40s
     airflow                     pipeline-my-task-6b7817b762404da99fbb2b8a7e8c4cd2    0 (0%)        0 (0%)      0 (0%)           0 (0%)         47s
     airflow                     pipeline-my-task-6f5e4439a9814b7f85af1988bde7cc10    0 (0%)        0 (0%)      0 (0%)           0 (0%)         64s
     airflow                     pipeline-my-task-7fb24e233fa5487f808154248eee51b3    0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
     airflow                     pipeline-my-task-a643741d29ea4f4baa06e0ea20bc1a57    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m49s
     airflow                     pipeline-my-task-b0ed061414f0476f82e1816186f08613    0 (0%)        0 (0%)      0 (0%)           0 (0%)         52s
     airflow                     pipeline-my-task-bcda6d7708844985871c761925d32b44    0 (0%)        0 (0%)      0 (0%)           0 (0%)         61s
     airflow                     pipeline-my-task-df9431a3eb4c43b482ff68fdc6297568    0 (0%)        0 (0%)      0 (0%)           0 (0%)         45s
     airflow                     pipeline-my-task-f339343358a2410e9e9ecca6fd01d230    0 (0%)        0 (0%)      0 (0%)           0 (0%)         70s
     airflow                     pipeline-my-task-ff6afc1414964cc5915f55616659a271    0 (0%)        0 (0%)      0 (0%)           0 (0%)         46s
     kube-system                 coredns-66bff467f8-w2gsw                                     100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     2d
     kube-system                 coredns-66bff467f8-x57x2                                     100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     2d
     kube-system                 etcd-kind-control-plane                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
     kube-system                 kindnet-hv2p2                                                100m (1%)     100m (1%)   50Mi (0%)        50Mi (0%)      2d
     kube-system                 kube-apiserver-kind-control-plane                            250m (3%)     0 (0%)      0 (0%)           0 (0%)         2d
     kube-system                 kube-controller-manager-kind-control-plane                   200m (2%)     0 (0%)      0 (0%)           0 (0%)         2d
     kube-system                 kube-proxy-qb5jn                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
     kube-system                 kube-scheduler-kind-control-plane                            100m (1%)     0 (0%)      0 (0%)           0 (0%)         2d
     local-path-storage          local-path-provisioner-5b4b545c55-nkz89                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
   Allocated resources:
     (Total limits may be over 100 percent, i.e., overcommitted.)
     Resource           Requests      Limits
     --------           --------      ------
     cpu                7103m (88%)   6100m (76%)
     memory             1982Mi (12%)  1926Mi (12%)
     ephemeral-storage  0 (0%)        0 (0%)
     hugepages-1Gi      0 (0%)        0 (0%)
     hugepages-2Mi      0 (0%)        0 (0%)
   Events:              <none>
   ```




[GitHub] [airflow] ayman-albaz commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available

Posted by GitBox <gi...@apache.org>.
ayman-albaz commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1355872345

   > You can limit the number of concurrent tasks run by Airflow thanks to an Airflow pool
   > 
   > or with concurrency settings on the DAG
   
   Yeah, perhaps this is not necessarily a bug in Airflow then, but more of a feature request. What you suggested is a workaround for my problem, thanks for the suggestion.
   
   I think it would be pretty cool if the Airflow scheduler were aware of a cluster's resources and autoscaling capability and scheduled accordingly (i.e. kept jobs running, and held back the remainder for which no resources can possibly be allocated).

