Posted to commits@airflow.apache.org by "Yoichi Iwaki (JIRA)" <ji...@apache.org> on 2019/06/04 08:36:00 UTC

[jira] [Commented] (AIRFLOW-4346) Kubernetes Executor Fails for Large Wide DAGs

    [ https://issues.apache.org/jira/browse/AIRFLOW-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855458#comment-16855458 ] 

Yoichi Iwaki commented on AIRFLOW-4346:
---------------------------------------

[~vcastane]

It looks like you're using a PVC (PersistentVolumeClaim) for the DAGs volume in your config. Does your underlying PV/PVC support ReadWriteMany or ReadOnlyMany? You can check the table at the following URL.
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
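
For reference, here is a minimal sketch of a claim that requests a multi-node access mode. The name, size, and storage class below are illustrative, not taken from your attached config; the backing PV type must actually support the requested mode:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: airflow-dags            # illustrative name
    spec:
      accessModes:
        - ReadOnlyMany              # or ReadWriteMany; ReadWriteOnce ties the volume to a single node
      resources:
        requests:
          storage: 1Gi              # illustrative size
      storageClassName: nfs-client  # illustrative; e.g. an NFS-backed class supports *Many modes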

If it doesn't (i.e. the volume only supports ReadWriteOnce, which can be mounted by a single node at a time), the pods created by the KubernetesExecutor can only be scheduled onto that one node. Since the maximum number of pods per node is limited to 100 in GKE, this may be causing the problem.
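
A quick way to confirm (assuming you have kubectl access to the cluster) is to run kubectl get pods -o wide, which lists the node each pod was scheduled on. If all of the task pods land on the same node, the access mode is the likely culprit.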

 

Note:
On my VM (4 vCPU / 24 GB RAM), wide_dag_bash_test.py ran successfully.

 

> Kubernetes Executor Fails for Large Wide DAGs
> ---------------------------------------------
>
>                 Key: AIRFLOW-4346
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4346
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG, executors
>    Affects Versions: 1.10.2, 1.10.3
>            Reporter: Vincent Castaneda
>            Priority: Blocker
>              Labels: kubernetes
>         Attachments: configmap-airflow-share.yaml, sched_logs.txt, wide_dag_bash_test.py, wide_dag_test_100_300.py, wide_dag_test_300_300.py
>
>
> When running large DAGs (those with parallelism of over 100 task instances running concurrently), several tasks fail on the executor and are reported to the database, but the scheduler is never aware of them failing.
> Attached are:
>  - A test DAG that we can use to replicate the issue.
>  - The configmap-airflow.yaml file
> I will be available to answer any other questions that are raised about our configuration. We are running this on GKE and giving the scheduler and web pods a base of 100m CPU for execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)