Posted to issues@spark.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2019/04/13 20:36:00 UTC

[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

    [ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817080#comment-16817080 ] 

Robert Joseph Evans commented on SPARK-24579:
---------------------------------------------

SPIP SPARK-27396 covers a superset of the functionality described here.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-24579
>                 URL: https://issues.apache.org/jira/browse/SPARK-24579
>             Project: Spark
>          Issue Type: Epic
>          Components: ML, PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>              Labels: Hydrogen
>         Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange between Apache Spark and DL/AI Frameworks.pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark as a unified analytics engine and the rise of AI frameworks like TensorFlow and Apache MXNet (incubating). Both big data and AI are indispensable components for driving business innovation, and there have been multiple attempts from both communities to bring them together.
> We saw efforts from the AI community to implement data solutions for AI frameworks, like tf.data and tf.Transform. However, with 50+ data sources and built-in SQL, DataFrames, and Streaming features, Spark remains the community choice for big data. This is why we saw many efforts to integrate DL/AI frameworks with Spark to leverage its power, for example, the TFRecords data source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project Hydrogen, this SPIP takes a different angle on Spark + AI unification.
> None of these integrations is possible without exchanging data between Spark and external DL/AI frameworks, and performance matters. However, there is no standard way to exchange data, so implementations and performance optimizations are fragmented. For example, TensorFlowOnSpark uses Hadoop InputFormat/OutputFormat for TensorFlow's TFRecords to load and save data, and passes the RDD records to TensorFlow in Python. TensorFrames converts Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow's Java API. How can we reduce the complexity?
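> As an illustration only (not TensorFlowOnSpark's actual code), a minimal PySpark sketch of the Hadoop InputFormat path might look like the following, assuming the TFRecord InputFormat from the tensorflow/ecosystem Hadoop module is on the classpath; the HDFS path is hypothetical:
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("tfrecord-sketch").getOrCreate()
>     sc = spark.sparkContext
>
>     # Each record arrives as opaque serialized tf.Example bytes; Spark sees
>     # only BytesWritable values, so every framework rebuilds this plumbing.
>     rdd = sc.newAPIHadoopFile(
>         "hdfs:///data/train.tfrecords",  # hypothetical path
>         "org.tensorflow.hadoop.io.TFRecordFileInputFormat",
>         keyClass="org.apache.hadoop.io.BytesWritable",
>         valueClass="org.apache.hadoop.io.NullWritable",
>     )
>     print(rdd.count())  # records still need TF-side parsing in Python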
> The proposal here is to standardize the data exchange interface (or format) between Spark and DL/AI frameworks and to optimize data conversion to and from this interface. DL/AI frameworks could then leverage Spark to load data from virtually anywhere without spending extra effort building complex data solutions, such as reading features from a production data warehouse or streaming model inference. Spark users could use DL/AI frameworks without learning the specific data APIs implemented there. And developers on both sides could work on performance optimizations independently, provided the interface itself doesn't introduce significant overhead.
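> As a hedged sketch of what a standardized, columnar exchange could look like (this assumes Spark's existing Arrow-backed pandas conversion; it is not the interface the SPIP mandates):
>
>     import tensorflow as tf
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("exchange-sketch").getOrCreate()
>     # Arrow-backed conversion replaces row-by-row Python serialization
>     # with a single columnar hop.
>     spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>
>     df = spark.range(1000).selectExpr("id", "rand() AS feature")
>     pdf = df.toPandas()  # one Arrow-based conversion
>
>     # Hand the columnar data to TensorFlow without a custom connector.
>     dataset = tf.data.Dataset.from_tensor_slices(
>         (pdf["feature"].values, pdf["id"].values)
>     ).batch(32)
>
> With a standard interface in the middle, each side can optimize its half of the conversion independently, which is the decoupling the proposal aims for.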



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org