Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/08 00:05:57 UTC

[GitHub] [airflow] EliMor opened a new issue #17490: KubernetesJobOperator

EliMor opened a new issue #17490:
URL: https://github.com/apache/airflow/issues/17490


   
   **Description**
   
   Airflow has a KubernetesPodOperator. Why not a KubernetesJobOperator?
   
   **Use case / motivation**
   
   I'm curious to get the community's thoughts; maybe there are reasons why this hasn't been implemented in Airflow core yet.
   Basically, I would like an out-of-the-box KJO in Airflow that does similar things to the KPO but uses the Job type in Kube. I'd like to shift more work to Kube and keep the DAG code as clean as possible. Think of it as a 'poor man's' Helm chart, with Airflow as the renderer and executor.
   
   
   
   **Are you willing to submit a PR?**
   
   Of course. I've already mocked up an idea of how it could work here: https://github.com/EliMor/airflow-kube-job-operator
   But it needs improvements and suggestions from the community. I am but a humble noob. :)  
   
   





[GitHub] [airflow] potiuk edited a comment on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-894741110


   I believe (and please correct me if I am wrong) that the pod_template_file parameter of the KubernetesPodOperator already handles what you wanted to implement, including the possibility of generating the template file via Airflow's CLI and integration with Airflow's Jinja templating.
   
   This was added to the operator relatively recently (autumn last year), and the Python-object approach is recommended only if you have a relatively simple Pod to fire, so you might not have noticed it. For more complex or sophisticated uses of the operator, pod_template_file is recommended.
   
   People are already not very clear on when they should use the KubernetesExecutor vs. the KubernetesPodOperator, so adding yet another option is, I think, not a good idea.
   
   Are there any features you had in mind that are not possible with the KPO / pod_template_file? If so, I think it would be better to add those to the KPO rather than create a new entity.
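   
   For illustration, an untested sketch of what that might look like (the file path and image are made up; the Jinja expression assumes the templating integration described above):
   
   ```
   # /opt/airflow/templates/pod.yaml (hypothetical path)
   apiVersion: v1
   kind: Pod
   metadata:
     name: example-pod
   spec:
     containers:
       - name: main
         image: my-image:latest      # hypothetical image
         args: ["{{ ds }}"]          # assumes Airflow's Jinja templating is applied to the file
     restartPolicy: Never
   ```
   
   ```
   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
   
   # Hypothetical task: the whole pod spec lives in the YAML file above,
   # so no k8s Python objects are needed in the DAG.
   run_pod = KubernetesPodOperator(
       task_id="run_pod",
       pod_template_file="/opt/airflow/templates/pod.yaml",
   )
   ```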
   
   








[GitHub] [airflow] EliMor commented on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
EliMor commented on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-899296492


   Thanks for pointing out the TTL functionality, @lboudard. I was unaware of that ability in Kube Jobs and will test exposing it in my flavor of the KJO.
   One thing I've also been thinking about is how to map the DAG/task defaults onto Kube Jobs well.
   ```
   "retries": N  ->  "backoffLimit": N
   ```
   Seems natural.
   But to my knowledge ```"retry_delay"```, for example, has no good counterpart exposed in Kube Jobs. As far as I know the retry delay is handled entirely by Kube, so clients on the DAG side would not see their expected behavior. Not ideal. Please correct me if I'm wrong there.
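   
   To make the mapping concrete, a sketch of a Job manifest with both fields (the image is a placeholder; ttlSecondsAfterFinished requires the cluster's TTLAfterFinished feature, and Kube's own back-off between pod restarts is exponential and capped rather than configurable, which is why "retry_delay" has no clean equivalent):
   
   ```
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: example-job
   spec:
     backoffLimit: 3                # roughly what Airflow's "retries": 3 would map to
     ttlSecondsAfterFinished: 600   # delete the finished Job (and its pods) after 10 minutes
     template:
       spec:
         containers:
           - name: main
             image: my-image:latest   # placeholder image
         restartPolicy: Never
   ```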
   
   
   





[GitHub] [airflow] potiuk edited a comment on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-894785162


   > For example, in some cases PVCs are used in an 'assembly-line' where they are moved from one task to another and in others they are meant to be bound to the life-cycle of a kube Job and forgotten about -- implicitly created and destroyed together. An elegant way to surface these two kinds of relationships to PVCs is what I was puzzling over for a while.
   
   One thing to remember is that while Airflow loves K8S, K8S is not the only way Airflow is and will be deployed, and we do not want people to tie their task implementations to the fact that they run on K8S. A task, while executing, should be largely unaware of the deployment it runs on.
   
   I think for that a custom XCom backend with PVC support for K8S could be useful (and should be possible to write). The fact that a task runs on K8S should be a deployment detail, but if there is some inter-task communication, Airflow has its own deployment-independent mechanism, namely XCom. And as of recently we have the capability of implementing custom XCom backends, which serve precisely this purpose and should be used for any data sharing in Airflow.
   
   I think you could write an XCom backend implementation that uses a PVC under the hood for those who want to use it and run their tasks on K8S (and possibly contribute it back to Airflow).
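   
   An untested sketch of what such a backend could look like, assuming every worker pod mounts a shared PVC at a common path (the class name and mount point are made up):
   
   ```
   import json
   import os
   import uuid
   
   from airflow.models.xcom import BaseXCom
   
   
   class PVCXComBackend(BaseXCom):
       """Hypothetical XCom backend that stores payloads on a shared PVC mount."""
   
       PATH = "/mnt/airflow-xcom"  # assumed PVC mount point on every worker pod
   
       @staticmethod
       def serialize_value(value):
           # Write the payload to the PVC; keep only the file path in the metadata DB.
           os.makedirs(PVCXComBackend.PATH, exist_ok=True)
           filename = os.path.join(PVCXComBackend.PATH, f"{uuid.uuid4()}.json")
           with open(filename, "w") as f:
               json.dump(value, f)
           return BaseXCom.serialize_value(filename)
   
       @staticmethod
       def deserialize_value(result):
           # Resolve the stored path back into the actual payload.
           filename = BaseXCom.deserialize_value(result)
           with open(filename) as f:
               return json.load(f)
   ```
   
   It would then be enabled via the xcom_backend option in the [core] section of airflow.cfg, pointing at the dotted path of the class.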
   
   You can learn more about these concepts, Airflow's "love" for Kubernetes, and the custom XCom approach we are starting to make full use of, in some of the talks from the recent Airflow Summit:
   
   * Why and how Airflow Loves Kubernetes: https://airflowsummit.org/sessions/2021/airflow-loves-kubernetes/
   * How Ray will integrate with Airflow shortly (in short, via custom XCom backends and custom decorators): https://airflowsummit.org/sessions/2021/airflow-ray/
   * How custom XCom backends work for data sharing: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/
   
   





[GitHub] [airflow] EliMor commented on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
EliMor commented on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-894764947


   Yes, there's a lot of discussion that could be had here, I think.
   
   I like your impulse to make a higher-tier 'Base' kube operator. Something dumb that we could reuse even if a 'child' does not officially exist (yet!). ... But I also defer to the KPO folks here since I'm quite new and clueless!
   
   Another common kube object in my ETL/ML workflows worth considering is the PVC (another 'child' perhaps?), and having a standard way to specify the management of its lifecycle in Airflow, and its generation using YAML only, would be SO great!
   
   For example, in some cases PVCs are used in an 'assembly line', where they are moved from one task (pod) to another, and in others they are meant to be bound to the lifecycle of a kube Job and forgotten about -- implicitly created and destroyed together. An elegant way to surface these two kinds of relationships to PVCs is what I was puzzling over for a while; one possibility for the latter case is sketched below.
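   
   For the Job-bound case, an operator could create the PVC with an ownerReference pointing at the Job, so Kubernetes garbage-collects the claim when the Job is deleted. An untested sketch (the uid is a placeholder; the real one is only known after the Job has been created):
   
   ```
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: job-scratch
     ownerReferences:                # ties the PVC's lifetime to the Job's
       - apiVersion: batch/v1
         kind: Job
         name: example-job
         uid: 00000000-0000-0000-0000-000000000000   # placeholder UID
   spec:
     accessModes: ["ReadWriteOnce"]
     resources:
       requests:
         storage: 1Gi
   ```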
   
   Cheers :)





[GitHub] [airflow] EliMor commented on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
EliMor commented on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-894843706


   Thanks for your time and for teaching me some new things today!
   Ray+Airflow looks very neat.





[GitHub] [airflow] lboudard edited a comment on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
lboudard edited a comment on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-898448207


   I agree on this subject; currently the pod operator is missing some very handy features that the [kubernetes job controller](https://kubernetes.io/docs/concepts/workloads/controllers/job/) implements, such as time-to-live after success/failure (though there are a number of overlapping features, such as retries and parallelism control).
   I also agree that the usage of the kubernetes executor vs. the kubernetes pod operator is not very clear yet.
   In our use case, since we have very different DAG types living in the same airflow instance, we use multiple images that are scheduled through pod operators (which we adopted before the kubernetes executor and taskflow API appeared).
   Say, for instance, one image to parse new batches of data and another one to train models on it in another dag.
   That is not ideal, since the workflow dependencies are not properly bound in code but rather to expected data checkpoints. Instead of having
   ```
   read_file | parse | feature_engineering | train_model
   read_file | archive
   ```
   which describes direct data dependencies in code (say, the airflow taskflow way, or equivalently in spark or apache beam), we rather have
   ```
   schedule_parse_file_and_store(raw_data_batch_location, parsing_docker_image)
   schedule_feature_engineer(raw_data_batch_location, feature_engineering_docker_image)
   schedule_train_model(feature_engineered_batch_location, model_training_docker_image)
   ```
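   
   For reference, the first style corresponds roughly to a TaskFlow DAG like this untested sketch (task names and bodies are made up):
   
   ```
   import pendulum
   
   from airflow.decorators import dag, task
   
   
   @dag(schedule_interval="@daily",
        start_date=pendulum.datetime(2021, 8, 1, tz="UTC"),
        catchup=False)
   def etl_pipeline():
       @task
       def read_file():
           return "raw_data_batch_location"  # placeholder location
   
       @task
       def parse(location):
           return f"{location}/parsed"
   
       @task
       def feature_engineering(location):
           return f"{location}/features"
   
       @task
       def train_model(location):
           print(f"training on {location}")
   
       @task
       def archive(location):
           print(f"archiving {location}")
   
       # Data dependencies are bound directly in code via return values (XCom).
       raw = read_file()
       train_model(feature_engineering(parse(raw)))
       archive(raw)
   
   
   etl_dag = etl_pipeline()
   ```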





[GitHub] [airflow] EliMor commented on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
EliMor commented on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-894750630


   Hi there! 
   
   Thanks for your feedback. I admit I'd need to take a little bit more time to look at 'pod_template_file.' My memory is foggy but these trees look familiar to me. 
   
   To clarify, there are a few things I wanted to ensure our use of YAML + Jinja would accomplish for free with the KJO:
   
   1. We could pass in the location of the YAML template files just as we can for other templates (template_searchpath).
   2. We could move away entirely from using Python objects for kube-related things; I never want to import k8s in a DAG (as you noted!).
   3. We could pass in variables to the KJO to be rendered by Jinja in the YAML template.
   4. Also, with Jinja magic, we could reuse **_multiple_** YAML templates to render a **_single_** Job (Pod) YAML file, similar to how one would for web work.
   
   If 'pod_template_file' accomplishes this I'm a happy camper, albeit very confused.
   
   As far as why a Job and not a Pod: to my (limited) knowledge of kube, the extra abstraction of the 'Job' type also allows for parallelism out of the box (see [Kube Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/) and the sketch below). If I have a use case where I want 10X pods to run simultaneously, would I need to surface that at the task level in an airflow DAG? Does that not consume more resources at the Airflow level than letting Airflow manage a single Job abstraction as a single task and deferring to kube to handle the pods?
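   
   A sketch of the out-of-the-box parallelism the Job type provides (the image is a placeholder):
   
   ```
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: parallel-job
   spec:
     parallelism: 10    # run up to 10 pods at the same time
     completions: 10    # ...until 10 of them have succeeded
     template:
       spec:
         containers:
           - name: worker
             image: my-image:latest   # placeholder image
         restartPolicy: Never
   ```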
   
   Totally understand not wanting to add confusion. I'm confused more than half the time I try anything these days!
   Homework assignment for me: reinvestigate the limitations of pod_template_file.
   
   For one, I do recall also experiencing some bugs with how logs were forwarded from the pod to Airflow using the KPO. If the pod just slept for a minute or so and then completed, whatever was tracking the pod for log streaming seemed to drop, and the task would never complete and move on.
   Entirely different issue, possibly resolved by now!
   
   If there's anywhere else I could offer some clarity or otherwise be helpful please let me know! 








[GitHub] [airflow] potiuk commented on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-894757117


   Do take a look at the current implementation and some of the fixes over the last months. I think many of the log problems have been fixed already (and I'd say if there are others, fixing those should not be difficult).
   
   From what you explain about templates and Jinja, I think only the possibility of using multiple template files is not out of the box; however, this could likely be achieved easily (maybe it is even possible today) with the Jinja include mechanism, along the lines of the sketch below. I think you do not have to import or use any of the k8s imports if you use KPO in your DAG with pod_template_file. Those 'old ways' are still there as optional parameters, but from what I know you can skip them all.
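   
   Something along these lines, perhaps (untested; the file names are made up, and the indentation of the included partial would need care):
   
   ```
   # pod.yaml - a single pod template assembled from partials at render time
   apiVersion: v1
   kind: Pod
   metadata:
     name: "{{ params.pod_name }}"
   spec:
     containers:
   {% include "containers.yaml" %}
     restartPolicy: Never
   ```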
   
   Regarding Kube Job, I'd rather defer to others who were involved there - @dimberman? Job is a bit higher abstraction level than Pod, and in a way you could achieve what Job does with Airflow itself; however, yeah, I see why you might want to use parallelism in some cases and see all those parallel running pods as a single Airflow task.
   
   I am not sure about the way KPO handles the pod template, but I think it should be essentially possible to have one operator run either a Job or a Pod. Or maybe even extract some kind of common KTO (BaseKubernetesTemplateOperator) from KPO and implement KJO + KPO as children. Or maybe it can be handled easily with one operator that chooses whether to run a job or a pod.
   
   I would love to hear from those more involved in KPO :).
   
   








[GitHub] [airflow] EliMor edited a comment on issue #17490: KubernetesJobOperator

Posted by GitBox <gi...@apache.org>.
EliMor edited a comment on issue #17490:
URL: https://github.com/apache/airflow/issues/17490#issuecomment-896115681


   Looks like there's another project that does similar/more things! 
   https://github.com/LamaAni/KubernetesJobOperator
   
   Would be great to consolidate. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


