Posted to issues@spark.apache.org by "Tyson Condie (JIRA)" <ji...@apache.org> on 2019/03/15 20:08:00 UTC

[jira] [Comment Edited] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

    [ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793842#comment-16793842 ] 

Tyson Condie edited comment on SPARK-26257 at 3/15/19 8:07 PM:
---------------------------------------------------------------

Thank you [~abellina] [~srowen] [~zjffdu] for your interest in this SPIP. Our Microsoft team has been working on the C# bindings for Spark, and while doing this work, we have been thinking hard about what a generic interop layer might look like. First, much of our C# bindings reuse existing Python interop support. This includes the Catalyst rules that plan UDF physical operators (copied), establishing workers (Runners) for UDF execution, piping data to/from an external worker process, and the control channels in the driver that forward operations on C# proxy objects, all of which mirror existing Python and R interop mechanisms. [~imback82] (who is doing the bulk of the work on the C# bindings [SPARK-27006|https://issues.apache.org/jira/browse/SPARK-27006]) may have other examples of overlapping logic. 
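
To make the overlap concrete, here is a minimal sketch of the worker-side pattern all of these bindings share: the JVM pipes serialized rows to an external worker process and reads UDF results back as length-prefixed frames. The class name InteropWorkerClient, the environment variable, and the framing are illustrative assumptions, not the actual Spark or C#-binding code.

{code:scala}
import java.io.{DataInputStream, DataOutputStream}
import java.net.{InetAddress, ServerSocket}

// Illustrative only: the JVM side listens on a loopback socket, an external
// worker process (Python, R, C#, ...) connects, and rows are exchanged as
// length-prefixed byte frames, with a negative length marking end-of-stream.
// A real implementation would interleave writes and reads to avoid blocking.
class InteropWorkerClient(workerCommand: Seq[String]) {
  private val server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress)

  def run(inputRows: Iterator[Array[Byte]]): Iterator[Array[Byte]] = {
    val pb = new ProcessBuilder(workerCommand: _*)
    pb.environment().put("INTEROP_PORT", server.getLocalPort.toString)
    pb.inheritIO()
    pb.start()

    val sock = server.accept()
    val out = new DataOutputStream(sock.getOutputStream)
    val in = new DataInputStream(sock.getInputStream)

    // Send every serialized input row, then signal end-of-stream.
    inputRows.foreach { row => out.writeInt(row.length); out.write(row) }
    out.writeInt(-1)
    out.flush()

    // Read results back until the worker signals end-of-stream.
    Iterator.continually(in.readInt()).takeWhile(_ >= 0).map { len =>
      val buf = new Array[Byte](len)
      in.readFully(buf)
      buf
    }
  }
}
{code}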

As such, I think a single extensible interop layer makes sense for Spark. It would centralize some of the unit tests that are presently spread across the Python and R extensions, and, more importantly, it would reduce the code footprint in Spark. Beyond that, it sends a good message to other communities (.NET, Go, Haskell, Julia) that Spark is a programming-language-friendly platform for data analytics. 

As I mentioned earlier, we have been learning as we go, and we are now starting to collect our findings into a concrete design doc / proposal that [~imback82] will be sharing with the Spark community soon.

Thank you again for your interest! 


> SPIP: Interop Support for Spark Language Extensions
> ---------------------------------------------------
>
>                 Key: SPARK-26257
>                 URL: https://issues.apache.org/jira/browse/SPARK-26257
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, R, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Tyson Condie
>            Priority: Major
>
> h2. Background and Motivation:
> There is a desire for third party language extensions for Apache Spark. Some notable examples include:
>  * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
>  * Haskell from project sparkle [https://github.com/tweag/sparkle]
>  * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]
> Presently, Apache Spark supports Python and R via a tightly integrated interop layer. It would seem that much of that existing interop layer could be refactored into a clean surface for general (third party) language bindings, such as those mentioned above. More specifically, could we generalize the following modules (a rough sketch of the deploy-runner case follows below):
>  * Deploy runners (e.g., PythonRunner and RRunner)
>  * DataFrame Executors
>  * RDD operations?
> The last is questionable: integrating third party language extensions at the RDD level may be too heavyweight and unnecessary, given the preference for the DataFrame abstraction.
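> The deploy-runner case is the most self-contained. As a purely illustrative sketch (the trait name InteropDriverRunner and everything in it are hypothetical, not an existing or proposed Spark API), the generalization could look roughly like this:
> {code:scala}
> // Hypothetical generalization of deploy.PythonRunner / RRunner: start a
> // JVM-side backend for the control channel, launch the guest-language
> // driver process, and propagate its exit code.
> trait InteropDriverRunner {
>   /** Start the JVM <-> guest control channel; returns the port the guest connects to. */
>   def startBackend(): Int
>
>   /** Command that launches the guest-language driver (python, Rscript, dotnet, ...). */
>   def driverCommand(userFile: String, args: Seq[String]): Seq[String]
>
>   def run(userFile: String, args: Seq[String]): Unit = {
>     val port = startBackend()
>     val pb = new ProcessBuilder(driverCommand(userFile, args): _*)
>     pb.environment().put("INTEROP_BACKEND_PORT", port.toString)
>     pb.inheritIO()
>     val exitCode = pb.start().waitFor()
>     if (exitCode != 0) sys.exit(exitCode)
>   }
> }
> {code}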
> The main goals of this effort would be:
>  * Provide a clean abstraction for third party language extensions, making each language extension easier to maintain as Apache Spark evolves
>  * Provide guidance to third party language authors on how a language extension should be implemented
>  * Provide general reusable libraries that are not specific to any language extension
>  * Open the door to developers that prefer alternative languages
>  * Identify and clean up common code shared between Python and R interops
> h2. Target Personas:
> Data Scientists, Data Engineers, Library Developers
> h2. Goals:
> Data scientists and engineers will have the opportunity to work with Spark in languages other than what’s natively supported. Library developers will be able to create language extensions for Spark in a clean way. The interop layer should also provide guidance for developing language extensions.
> h2. Non-Goals:
> The proposal does not aim to create an actual language extension. Rather, it aims to provide a stable interop layer for third party language extensions to dock.
> h2. Proposed API Changes:
> Much of the work will involve generalizing existing interop APIs for PySpark and R, specifically for the Dataframe API. For instance, it would be good to have a general deploy.Runner (similar to PythonRunner) for language extension efforts. In Spark SQL, it would be good to have a general InteropUDF and evaluator (similar to BatchEvalPythonExec).
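> As a purely illustrative sketch of those two pieces (the names InteropUDF and InteropUDFEvaluator are hypothetical, not a proposed API), the shape might be roughly:
> {code:scala}
> import org.apache.spark.sql.types.DataType
>
> // Hypothetical: a language-agnostic UDF description, analogous to PythonUDF
> // but carrying only what the interop layer needs.
> case class InteropUDF(
>     name: String,
>     language: String,            // e.g. "python", "r", "dotnet"
>     serializedFunc: Array[Byte], // opaque payload shipped to the worker
>     returnType: DataType)
>
> // Hypothetical: a generic evaluator in the spirit of BatchEvalPythonExec,
> // parameterized by the worker protocol rather than hard-coded to Python.
> trait InteropUDFEvaluator {
>   /** Ship a batch of serialized input rows to the worker and return results. */
>   def evaluate(udf: InteropUDF, batch: Iterator[Array[Byte]]): Iterator[Array[Byte]]
> }
> {code}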
> Low-level RDD operations should not be needed in this initial offering; depending on the success of the interop layer and with sufficient demand, RDD interop could be added later. However, one open question is whether to support a subset of low-level functions that are core to ETL, e.g., transform.
> h2. Optional Design Sketch:
> The work would be broken down into two top-level phases:
> Phase 1: Introduce a general interop API for deploying a driver/application and running an interop UDF, along with any other low-level transformations that aid with ETL.
> Phase 2: Port the existing Python and R language extensions to the new interop layer. This port should be contained solely to the Spark core side, and protocols specific to Python and R should not change, e.g., Python should continue to use py4j as the protocol between the Python process and core Spark. The port itself should be contained to a handful of files, e.g., for Python: PythonRunner, BatchEvalPythonExec, PythonUDFRunner, PythonRDD (possibly), and will mostly involve refactoring common logic into abstract implementations and utilities.
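> To illustrate the Phase 2 direction (hypothetical names, not actual Spark classes), the refactoring could follow a template-method shape in which the shared worker lifecycle lives in an abstract base and the Python/R runners become thin subclasses:
> {code:scala}
> import java.io.{DataInputStream, DataOutputStream}
>
> // Hypothetical shape of the Phase 2 refactoring: common worker lifecycle in
> // an abstract base; each language binding overrides only what differs.
> abstract class BaseInteropRunner {
>   /** Command that starts the external worker process. */
>   protected def workerCommand: Seq[String]
>   /** Language-specific handshake / command serialization. */
>   protected def writeCommand(out: DataOutputStream): Unit
>   /** Language-specific result deserialization. */
>   protected def readResults(in: DataInputStream): Iterator[Array[Byte]]
>
>   /** Shared lifecycle: connect to worker, send command and rows, read results. */
>   final def compute(rows: Iterator[Array[Byte]]): Iterator[Array[Byte]] = {
>     val (in, out) = connectToWorker(workerCommand)
>     writeCommand(out)
>     rows.foreach { r => out.writeInt(r.length); out.write(r) }
>     out.flush()
>     readResults(in)
>   }
>
>   // Shared process/socket plumbing that today lives in PythonRunner;
>   // details elided in this sketch.
>   protected def connectToWorker(cmd: Seq[String]): (DataInputStream, DataOutputStream) = ???
> }
> {code}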
> h2. Optional Rejected Designs:
> The clear alternative is the status quo: developers that want to provide a third-party language extension to Spark do so directly, often by extending existing Python classes and overriding the portions that are relevant to the new extension. Not only is this unsound design (e.g., a JuliaRDD is not a PythonRDD, even though PythonRDD contains a lot of reusable code), but it runs the great risk that future revisions will make the subclass implementation obsolete. It would be hard to imagine any third-party language extension succeeding without something in place to guarantee its long-term maintainability. 
> Another alternative is that third-party languages should interact with Spark only via pure SQL, possibly over REST. However, this does not enable UDFs written in the third-party language, a key desideratum of this effort; the most notable case is legacy code/UDFs that would otherwise need to be ported to a supported language, e.g., Scala. That exercise is extremely cumbersome and not always feasible, since the source code may no longer be available (i.e., only the compiled library exists).


