Posted to issues@spark.apache.org by "Derek Tapley (Jira)" <ji...@apache.org> on 2020/12/17 16:56:00 UTC

[jira] [Comment Edited] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

    [ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251202#comment-17251202 ] 

Derek Tapley edited comment on SPARK-24632 at 12/17/20, 4:55 PM:
-----------------------------------------------------------------

I've been running into this problem as well; it seems like it could be solved in a couple of ways.
 # Provide a way to override the line:

{code:python}
stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
{code}

 # Refactor `Pipeline` and `PipelineModel` to use:

{code:python}
py_stages = [s._from_java() for s in java_stage.stages()]
{code}
instead of
{code:python}
py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
{code}
Both approaches would likely require custom (meta) Transformers/Estimators to override both `_to_java` and `_from_java` (a sketch of what that could look like follows below). Is there a preference? I can open a PR for either approach, though I'm leaning towards the latter, since it seems easier to implement provided it doesn't break existing functionality.
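
To make the second option concrete, here is a minimal, hedged sketch of such a custom wrapper built on the usual JavaParams machinery; the names `MyJavaTransformer` and `com.example.ml.MyTransformer` are hypothetical placeholders, not anything in Spark:
{code:python}
# Hypothetical third-party wrapper that implements its own _from_java/_to_java
# so Pipeline persistence can round-trip it without relying on the
# org.apache.spark -> pyspark package-name substitution.
from pyspark.ml.wrapper import JavaTransformer


class MyJavaTransformer(JavaTransformer):
    """Python wrapper around a hypothetical com.example.ml.MyTransformer."""

    def __init__(self):
        super(MyJavaTransformer, self).__init__()
        # Create the backing JVM object, mirroring how pyspark's own
        # wrappers (e.g. Tokenizer) do it in their constructors.
        self._java_obj = self._new_java_obj(
            "com.example.ml.MyTransformer", self.uid)

    @staticmethod
    def _from_java(java_stage):
        # Build the Python wrapper directly from an existing Java stage;
        # the fresh JVM object from __init__ is replaced by java_stage.
        py_stage = MyJavaTransformer()
        py_stage._java_obj = java_stage
        py_stage._resetUid(java_stage.uid())
        py_stage._transfer_params_from_java()
        return py_stage

    def _to_java(self):
        # Push current Param values to the JVM object before handing it back.
        self._transfer_params_to_java()
        return self._java_obj
{code}
With the second proposal, `Pipeline._from_java` would then dispatch to each stage's own `_from_java` instead of guessing the Python class name from the Java one.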



> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24632
>                 URL: https://issues.apache.org/jira/browse/SPARK-24632
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement Python PipelineStages in 3rd-party libraries, include them in Pipelines, and use Pipeline persistence.  This task is to make it easier for 3rd-party libraries to have PipelineStages written in Java and then to use pyspark.ml abstractions to create wrappers around those Java classes.  This is currently possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a proposal in the doc linked below. Summary of the proposal:
> Require that 3rd-party libraries with Java classes with Python wrappers implement a trait which provides the corresponding Python classpath in some field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> class MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  They would ideally test a Java class + Python wrapper class pair sitting outside of pyspark.
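
For illustration, a hedged sketch of how the Python side might consume such a trait when deserializing stages; the helper name and the Py4JError-based fallback are assumptions, not part of the proposal above:
{code:python}
# Hypothetical helper for pyspark.ml.wrapper.JavaParams._from_java: prefer a
# class path advertised by a PythonWrappable Java stage, falling back to the
# current org.apache.spark -> pyspark package-name substitution.
from py4j.protocol import Py4JError


def _resolve_python_class_name(java_stage):
    try:
        # If the Java stage mixes in the proposed PythonWrappable trait,
        # ask it for its Python class path directly. Py4J raises Py4JError
        # at call time when the method does not exist on the JVM object.
        return java_stage.pythonClassPath()
    except Py4JError:
        # MLlib's own classes keep the existing name substitution.
        return java_stage.getClass().getName().replace(
            "org.apache.spark", "pyspark")
{code}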


