Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/15 04:52:45 UTC
[GitHub] [airflow] ayman-albaz opened a new issue, #28372: Airflow KubernetesPodOperator task running despite no resources being available
ayman-albaz opened a new issue, #28372:
URL: https://github.com/apache/airflow/issues/28372
### Apache Airflow version
2.5.0
### What happened
I have a dynamically mapped task that is supposed to launch over 100 KubernetesPodOperator tasks. I have assigned 2.0 CPUs per task. When running the DAG, 16 tasks are in the 'running' state, but only 3 truly run; the remaining 13 fail with `Pod took longer than 120 seconds to start`. The rest of the tasks are either queued or scheduled, and when there are fewer than 16 active tasks, they start running and mostly fail with the same error.
Here is a snapshot of the `airflow` namespace:
```
kubectl -n airflow get all
NAME                                                    READY   STATUS              RESTARTS   AGE
pod/airflow-postgresql-0                                1/1     Running             5          2d
pod/airflow-scheduler-6dd68b485c-w8bhp                  3/3     Running             19         2d
pod/airflow-statsd-586dbdcc6b-h4mnr                     1/1     Running             5          2d
pod/airflow-triggerer-95565b95d-phts7                   2/2     Running             14         2d
pod/airflow-webserver-599bb95bcd-7dtpk                  1/1     Running             5          2d
pod/my-task-17dd038ca4d04164ba90f9c7f9a7fbb6            0/2     Pending             0          49s
pod/my-task-20aba86c65544ea384343f8fb4415d3a            0/2     Pending             0          53s
pod/my-task-3c5b4444a7d242459907ff3be7b7d6f6            0/2     Pending             0          44s
pod/my-task-5c8af5edb0904711b6a76a2edf1d1067            0/2     Pending             0          60s
pod/my-task-6001d3567f96400bb0ae559f22d3d2db            0/2     Pending             0          43s
pod/my-task-6dfb1945f3ff4ac4a06c7e6c6a85099c            0/2     Pending             0          81s
pod/my-task-71ad2fb48fb64f449014bba45bee980f            0/2     ContainerCreating   0          52s
pod/my-task-774216cb5f9344ffb35deac826d71639            0/2     Pending             0          68s
pod/my-task-814266d425254130868c3a5ebc8dce49            0/2     Pending             0          67s
pod/my-task-a11588d878b54944b4c069f49231ac36            0/2     Pending             0          77s
pod/my-task-b16c843fa038441ea31b90363ed86aa0            0/2     Pending             0          49s
pod/my-task-b85e2ed3417a4a62940661f418c900e5            0/2     Pending             0          60s
pod/my-task-d1de2a771a104a2592956a713f785300            0/2     Pending             0          73s
pod/my-task-dbeba55a80074c08bbdf023b3f0b885c            0/2     Completed           0          10m
pod/my-task-f83ad2805d314be3a7307b7216a54e53            2/2     Running             0          10m
pod/pipeline-my-task-0bc9e094afee4527b5b764e32f590282   0/1     Init:0/1            0          1s
pod/pipeline-my-task-1d51c5d3776e4dd8a89461e8a76faba1   1/1     Running             0          62s
pod/pipeline-my-task-24b1326a71d149fb9f62c101647468ee   1/1     Running             0          62s
pod/pipeline-my-task-29b132b7b0ce4832a5e30a821c6405bf   1/1     Running             0          10m
pod/pipeline-my-task-29fb55604eec457fa21d13d85c7889b5   1/1     Running             0          10m
pod/pipeline-my-task-2a337f1cc28b4315945cec8a961b1111   1/1     Running             0          69s
pod/pipeline-my-task-35d5c97570474082bc9b04189c433be7   1/1     Running             0          57s
pod/pipeline-my-task-569de133975d4dbb96becb2a04c0dac3   1/1     Running             0          78s
pod/pipeline-my-task-96a9681ace4441deba4faeef602f6e5b   1/1     Running             0          78s
pod/pipeline-my-task-9dcb9578720643eca5fa918a0a295f87   1/1     Running             0          87s
pod/pipeline-my-task-a643741d29ea4f4baa06e0ea20bc1a57   1/1     Running             0          10m
pod/pipeline-my-task-b04532a9f35a48a09cb1d46c9d9470dd   1/1     Running             0          57s
pod/pipeline-my-task-c9b7bb4ee07749be98083a11a512e1f4   1/1     Running             0          90s
pod/pipeline-my-task-d9c5ce9bf5ce499583cdf0ea3f58b7f0   1/1     Running             0          82s
pod/pipeline-my-task-dd5a5a45374f487fbc34c904e71b93b5   1/1     Running             0          59s
pod/pipeline-my-task-ea8c39d657824a1db505b00e8673b06a   1/1     Running             0          69s
pod/pipeline-my-task-fb5c71d274034f5392aebe0f4b395d98   1/1     Running             0          65s
```
### What you think should happen instead
Only 3 tasks should be running.
The remaining tasks should be scheduled or queued.
### How to reproduce
```python
import json
import textwrap

import pendulum
from airflow.decorators import dag, task
from airflow.models.param import Param
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
    Secret,
)
from kubernetes.client import models as k8s


@dag(
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
)
def pipeline():
    # Every task pod requests (and is limited to) 2 CPUs and 512Mi of memory.
    container_resources = k8s.V1ResourceRequirements(
        limits={
            "memory": "512Mi",
            "cpu": 2.0,
        },
        requests={
            "memory": "512Mi",
            "cpu": 2.0,
        },
    )
    volumes = [
        k8s.V1Volume(
            name="pvc-airflow",
            persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                claim_name="pvc-airflow"
            ),
        )
    ]
    volume_mounts = [
        k8s.V1VolumeMount(mount_path="/airflow", name="pvc-airflow", sub_path=None)
    ]

    @task
    def make_list():
        # 100 identical entries, so the operator below expands into 100 mapped tasks.
        return [{"a": "a"}] * 100

    my_task = KubernetesPodOperator.partial(
        name="my_task",
        task_id="my_task",
        image="ubuntu:20.04",
        namespace="airflow",
        container_resources=container_resources,
        volumes=volumes,
        volume_mounts=volume_mounts,
        in_cluster=True,
        do_xcom_push=True,
        get_logs=True,
        cmds=[
            "/bin/bash",
            "-c",
            """
            sleep 600
            """,
        ],
    ).expand(env_vars=make_list())


# Call the decorated function so that Airflow registers the DAG.
pipeline()
```
### Operating System
Ubuntu 20.04.5 LTS
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
I am running this locally using the helm chart on Kind.
My machine is 4 CPU (x2), with 16 GB RAM.
### Anything else
I have confirmed that the failing tasks fail to start because they time out while waiting too long for resources.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [airflow] potiuk closed issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
Posted by GitBox <gi...@apache.org>.
potiuk closed issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
URL: https://github.com/apache/airflow/issues/28372
[GitHub] [airflow] potiuk commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1355901746
> > You can limit the number of concurrent task run by airflow thanks to airflow pool
> > or concurrency settings on the dag
>
> Ya perhaps this is not necessarily a bug from airflow then but more of a feature request. What you suggested is a workaround to my problem, thanks for the suggestion.
> I think it would be pretty cool if Airflow scheduler was aware of resources + auto scaling capability of a cluster, and then schedule accordingly (i.e. keep running jobs, and schedule the remainder that no resources can possibly be allocated for).
This is actually not a workaround. This is how you are supposed to limit resources in Airflow when you use Kubernetes Pod Operator.
Using Kubernetes Pod Operator and expecting Airflow to understand resource limits coming from autoscaling of the cluster it runs on would basically mean that Airflow would have to copy the whole logic of Kubernetes to know what it can and cannot schedule. I am not sure if you are aware that there are plenty of things Kubernetes takes into account when scheduling pods, and many of them have very complex logic. It's not only memory, but also affinities, anti-affinities, labels that do or do not match the nodes the pod could run on, and plenty of others. For example, imagine you have 20 KPOs each requiring a GPU and only 2 GPUs are available. And this is only one of the cases. Duplicating the whole logic of K8S in Airflow is not only difficult but also prone to errors, and it would mean that Airflow's KPO would be closely tied to a specific version of K8S, because new features are added to K8S with each release. What you ask for is not really feasible.
You might think it is simple for your specific case because you just **know** you have 2 CPUs per node and you know you have 6 of them in total, so it must be simple for Airflow to know it ... But in fact Airflow would have to implement very complex logic to know it in the general case. And by providing the Pool you ACTUALLY pass your knowledge to Airflow, and it then knows what the limits are without performing all the complex and brittle K8S logic.
We do not really want to re-implement K8S in Airflow.
But you can do better than manually allocating a fixed pool of resources for your workloads, and Airflow has you covered.
If you really want to do scaling, you can use Celery Executor running on K8S. Surprising as it may be, this is a pretty good way to implement K8S auto-scaling. This is precisely what Celery Executor was designed for - especially if you have relatively short tasks which are similar to each other in terms of complexity, CeleryExecutor is the way to go rather than running tasks through KPOs. We have KEDA-based auto-scaling implemented in our Helm Chart, and if you run it on top of an auto-scaling K8S cluster, it will actually be able to handle autoscaling well. You can even combine it with long-running Kubernetes tasks by running Celery Kubernetes Executor and choosing which tasks run where.
Again - in this case you need to manage queues to direct your load, but then those queues can dynamically grow in size if you want.
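As a rough illustration of "choose which tasks are run where": with Celery Kubernetes Executor, a task's `queue` argument decides which executor picks it up. The sketch below only computes the queue name. The value `"kubernetes"` is assumed to be the default of the `[celery_kubernetes_executor] kubernetes_queue` setting and `"default"` the default Celery queue; check both against your own configuration. `queue_for` and its `long_running` flag are hypothetical names used for illustration.

```python
# Assumed defaults; verify against your airflow.cfg before relying on them.
KUBERNETES_QUEUE = "kubernetes"  # tasks on this queue go to the Kubernetes side
CELERY_QUEUE = "default"         # everything else is handled by Celery workers


def queue_for(long_running: bool) -> str:
    """Pick the queue for a task (hypothetical helper, not part of Airflow)."""
    return KUBERNETES_QUEUE if long_running else CELERY_QUEUE


# Short, uniform tasks stay on the (KEDA-scaled) Celery workers ...
print(queue_for(long_running=False))
# ... while heavy, long-running ones get a pod of their own.
print(queue_for(long_running=True))
```

The returned string would then be passed as the `queue=` argument of the operator in the DAG.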
[GitHub] [airflow] raphaelauv commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
Posted by GitBox <gi...@apache.org>.
raphaelauv commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1354144857
You can limit the number of concurrent task run by airflow thanks to airflow pool
or concurrency settings on the dag
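A minimal sketch of what this could look like for the DAG in the report. `pool`, `pool_slots` and `max_active_tis_per_dag` are standard operator arguments; the pool name `kpo_cpu`, the helper `throttle_kwargs` and the limit of 3 are assumptions made for illustration. The pool itself has to exist with the desired number of slots, created for example in the UI or with `airflow pools set`.

```python
# Hypothetical pool name; the pool must be created separately with 3 slots.
KPO_POOL = "kpo_cpu"


def throttle_kwargs(max_parallel_pods: int) -> dict:
    """Build operator arguments that cap how many mapped task instances run at once."""
    return {
        "pool": KPO_POOL,                             # tasks wait for a free slot in this pool
        "pool_slots": 1,                              # each task instance occupies one slot
        "max_active_tis_per_dag": max_parallel_pods,  # per-task cap across DAG runs
    }


print(throttle_kwargs(3))
```

In the DAG these would be unpacked into the existing call, along the lines of `KubernetesPodOperator.partial(task_id="my_task", ..., **throttle_kwargs(3)).expand(env_vars=make_list())`. Either the pool or the per-task cap is enough on its own; the pool is the better fit when several DAGs share the same cluster capacity.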
[GitHub] [airflow] boring-cyborg[bot] commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1352556806
Thanks for opening your first issue here! Be sure to follow the issue template!
[GitHub] [airflow] ayman-albaz commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
Posted by GitBox <gi...@apache.org>.
ayman-albaz commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1352559497
Also some additional info
```
kubectl -n airflow describe node
Name:               kind-control-plane
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kind-control-plane
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 12 Dec 2022 21:54:55 -0500
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  kind-control-plane
  AcquireTime:     <unset>
  RenewTime:       Wed, 14 Dec 2022 22:50:52 -0500
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 14 Dec 2022 22:48:10 -0500   Wed, 14 Dec 2022 20:03:49 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  Hostname:  kind-control-plane
Capacity:
  cpu:                8
  ephemeral-storage:  382935608Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16336528Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  382935608Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16336528Ki
  pods:               110
System Info:
  Kernel Version:             5.15.0-56-generic
  OS Image:                   Ubuntu 20.10
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.0-106-gce4439a8
  Kubelet Version:            v1.18.15
  Kube-Proxy Version:         v1.18.15
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
ProviderID:                   kind://docker/kind/kind-control-plane
Non-terminated Pods:          (33 in total)
  Namespace           Name                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------           ----                                               ------------  ----------  ---------------  -------------  ---
  airflow             airflow-postgresql-0                               250m (3%)     0 (0%)      256Mi (1%)       0 (0%)         2d
  airflow             airflow-scheduler-6dd68b485c-w8bhp                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  airflow             airflow-statsd-586dbdcc6b-h4mnr                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  airflow             airflow-triggerer-95565b95d-phts7                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  airflow             airflow-webserver-599bb95bcd-7dtpk                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  airflow             my-task-dbeba55a80074c08bbdf023b3f0b885c           2001m (25%)   2 (25%)     512Mi (3%)       512Mi (3%)     7m39s
  airflow             my-task-e2d57a9de7934aaabd9b4c481c0b8fde           2001m (25%)   2 (25%)     512Mi (3%)       512Mi (3%)     7m39s
  airflow             my-task-f83ad2805d314be3a7307b7216a54e53           2001m (25%)   2 (25%)     512Mi (3%)       512Mi (3%)     7m35s
  airflow             pipeline-my-task-0c6bba5cd622466fa6e234d3dbd9151b  0 (0%)        0 (0%)      0 (0%)           0 (0%)         67s
  airflow             pipeline-my-task-19efa573444842f0b259f72f01680994  0 (0%)        0 (0%)      0 (0%)           0 (0%)         59s
  airflow             pipeline-my-task-22faa8c5d7ce4e439917be3e4cb9bbdf  0 (0%)        0 (0%)      0 (0%)           0 (0%)         51s
  airflow             pipeline-my-task-29b132b7b0ce4832a5e30a821c6405bf  0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m49s
  airflow             pipeline-my-task-29fb55604eec457fa21d13d85c7889b5  0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m48s
  airflow             pipeline-my-task-2cafc04fcbd84f22b91d58b19b086b23  0 (0%)        0 (0%)      0 (0%)           0 (0%)         43s
  airflow             pipeline-my-task-2e1cbe0e5e694954b218bf351fefeb56  0 (0%)        0 (0%)      0 (0%)           0 (0%)         40s
  airflow             pipeline-my-task-6b7817b762404da99fbb2b8a7e8c4cd2  0 (0%)        0 (0%)      0 (0%)           0 (0%)         47s
  airflow             pipeline-my-task-6f5e4439a9814b7f85af1988bde7cc10  0 (0%)        0 (0%)      0 (0%)           0 (0%)         64s
  airflow             pipeline-my-task-7fb24e233fa5487f808154248eee51b3  0 (0%)        0 (0%)      0 (0%)           0 (0%)         41s
  airflow             pipeline-my-task-a643741d29ea4f4baa06e0ea20bc1a57  0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m49s
  airflow             pipeline-my-task-b0ed061414f0476f82e1816186f08613  0 (0%)        0 (0%)      0 (0%)           0 (0%)         52s
  airflow             pipeline-my-task-bcda6d7708844985871c761925d32b44  0 (0%)        0 (0%)      0 (0%)           0 (0%)         61s
  airflow             pipeline-my-task-df9431a3eb4c43b482ff68fdc6297568  0 (0%)        0 (0%)      0 (0%)           0 (0%)         45s
  airflow             pipeline-my-task-f339343358a2410e9e9ecca6fd01d230  0 (0%)        0 (0%)      0 (0%)           0 (0%)         70s
  airflow             pipeline-my-task-ff6afc1414964cc5915f55616659a271  0 (0%)        0 (0%)      0 (0%)           0 (0%)         46s
  kube-system         coredns-66bff467f8-w2gsw                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     2d
  kube-system         coredns-66bff467f8-x57x2                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (1%)     2d
  kube-system         etcd-kind-control-plane                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kube-system         kindnet-hv2p2                                      100m (1%)     100m (1%)   50Mi (0%)        50Mi (0%)      2d
  kube-system         kube-apiserver-kind-control-plane                  250m (3%)     0 (0%)      0 (0%)           0 (0%)         2d
  kube-system         kube-controller-manager-kind-control-plane         200m (2%)     0 (0%)      0 (0%)           0 (0%)         2d
  kube-system         kube-proxy-qb5jn                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  kube-system         kube-scheduler-kind-control-plane                  100m (1%)     0 (0%)      0 (0%)           0 (0%)         2d
  local-path-storage  local-path-provisioner-5b4b545c55-nkz89            0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                7103m (88%)   6100m (76%)
  memory             1982Mi (12%)  1926Mi (12%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:              <none>
```
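The numbers in this output explain why exactly 3 task pods fit on the node. The following is a small back-of-the-envelope sketch, not anything Airflow or Kubernetes runs: it takes the allocatable CPU and the CPU requests of the long-lived pods from the listing above and divides what is left by the 2001m requested per task pod. `parse_cpu_millis` is a hypothetical helper that only handles the two quantity forms appearing here.

```python
def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '8' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


allocatable = parse_cpu_millis("8")

# Non-zero CPU requests of the pods that are always running: postgresql,
# 2x coredns, kindnet, kube-apiserver, kube-controller-manager, kube-scheduler.
baseline_requests = ["250m", "100m", "100m", "100m", "250m", "200m", "100m"]
baseline = sum(parse_cpu_millis(q) for q in baseline_requests)

per_task = parse_cpu_millis("2001m")  # CPU request shown for each my-task pod

slots = (allocatable - baseline) // per_task
print(baseline, slots)  # 1100 3
```

With 3 such pods scheduled, the total is 1100m + 3 x 2001m = 7103m, which matches the `cpu` line under `Allocated resources`. A fourth pod would need 9104m in total, more than the 8000m allocatable, so it stays `Pending`.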
[GitHub] [airflow] ayman-albaz commented on issue #28372: Airflow KubernetesPodOperator task running despite no resources being available
Posted by GitBox <gi...@apache.org>.
ayman-albaz commented on issue #28372:
URL: https://github.com/apache/airflow/issues/28372#issuecomment-1355872345
> You can limit the number of concurrent task run by airflow thanks to airflow pool
>
> or concurrency settings on the dag
Ya perhaps this is not necessarily a bug from airflow then but more of a feature request. What you suggested is a workaround to my problem, thanks for the suggestion.
I think it would be pretty cool if Airflow scheduler was aware of resources + auto scaling capability of a cluster, and then schedule accordingly (i.e. keep running jobs, and schedule the remainder that no resources can possibly be allocated for).