Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/12/01 15:03:00 UTC
[jira] [Commented] (SPARK-41342) Add support for distributed deep learning framework
[ https://issues.apache.org/jira/browse/SPARK-41342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641989#comment-17641989 ]
Sean R. Owen commented on SPARK-41342:
--------------------------------------
Why not Horovod? It works with Spark and PyTorch.
> Add support for distributed deep learning framework
> ---------------------------------------------------
>
> Key: SPARK-41342
> URL: https://issues.apache.org/jira/browse/SPARK-41342
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.2
> Reporter: Lu Wang
> Priority: Major
>
> There is a clear trend in deep learning of moving from single-machine to distributed training in order to scale and accelerate it. Adding support for a distributed DL solution on Spark would make Spark more powerful and greatly simplify distributed DL workloads for users.
> Currently, [spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor] provides a way to run distributed TensorFlow on Spark clusters, but there is no such support for distributed PyTorch.
> We want to add a general framework that supports both DL frameworks, so that there is a unified interface for distributed DL workloads on Spark. It could also take advantage of GPU scheduling on Spark and provide better resource management.