Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/12/01 15:03:00 UTC

[jira] [Commented] (SPARK-41342) Add support for distributed deep learning framework

    [ https://issues.apache.org/jira/browse/SPARK-41342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641989#comment-17641989 ] 

Sean R. Owen commented on SPARK-41342:
--------------------------------------

Why not Horovod? It works with Spark and PyTorch.

> Add support for distributed deep learning framework
> ---------------------------------------------------
>
>                 Key: SPARK-41342
>                 URL: https://issues.apache.org/jira/browse/SPARK-41342
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.2
>            Reporter: Lu Wang
>            Priority: Major
>
> There is a clear trend in deep learning of moving from single-machine to distributed training in order to scale and accelerate it. Adding support for a distributed DL solution on Spark will make Spark more powerful and greatly simplify distributed DL workloads for users.
> Currently, [spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor] provides a solution for running distributed TensorFlow on Spark clusters, but there is no such support for distributed PyTorch.
> We want to add a general framework that supports both DL frameworks, so that there is a unified interface for distributed DL workloads on Spark. It can also take advantage of GPU scheduling on Spark and provide better resource management.


