Posted to issues@spark.apache.org by "Nicholas Brett Marcott (Jira)" <ji...@apache.org> on 2020/12/26 12:05:00 UTC
[jira] [Comment Edited] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python

    [ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255007#comment-17255007 ] 

Nicholas Brett Marcott edited comment on SPARK-28902 at 12/26/20, 12:04 PM:
----------------------------------------------------------------------------

It seems [this PR|https://github.com/apache/spark/pull/18888/files] and [this PR|https://github.com/apache/spark/commit/7e759b2d95eb3592d62ec010297c39384173a93c#diff-43bf01d52810ead40daf5a967f807a6c6b99d66959ad531617f10c1535503192R291-R295] combined (and possibly others) are breaking this. Both of these PRs add support for Python-only stages.

The [implementation|https://github.com/apache/spark/blob/master/python/pyspark/ml/pipeline.py#L351-L352] treats any stage that doesn't inherit JavaMLWritable as Python-only and writes it in a format that is not valid for Scala. This includes several "meta" stages such as PipelineModel and CrossValidatorModel, and more since the second PR mentioned above.
{code:java}
def checkStagesForJava(stages):
    return all(isinstance(stage, JavaMLWritable) for stage in stages)
{code}
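To make the failure mode concrete, here is a minimal sketch that runs without Spark. The classes below are stand-ins I made up that only mirror the inheritance relationships described above (a meta stage that is writable but does not inherit JavaMLWritable); they are not the real pyspark classes.

```python
# Stand-in classes: they mirror only the inheritance described above and are
# NOT the real pyspark implementations.
class MLWritable:
    """Stand-in for a stage that can be written by Python."""


class JavaMLWritable(MLWritable):
    """Stand-in for a stage that is written through its JVM counterpart."""


class Tokenizer(JavaMLWritable):
    """Stand-in for a JVM-backed leaf stage."""


class PipelineModel(MLWritable):
    """Stand-in for a meta stage: writable, but not a JavaMLWritable."""

    def __init__(self, stages):
        self.stages = stages


def checkStagesForJava(stages):
    # Same check as quoted above: only the top-level stage types are inspected.
    return all(isinstance(stage, JavaMLWritable) for stage in stages)


flat_ok = checkStagesForJava([Tokenizer()])
nested_ok = checkStagesForJava([PipelineModel([Tokenizer()])])

print(flat_ok)    # the flat pipeline passes the check
print(nested_ok)  # the nested one does not, although every leaf is JVM-backed
```

Under these assumptions the nested pipeline is classified as Python-only even though everything inside it has a Java equivalent, which matches the behaviour reported in this issue.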
 

[Similar logic|https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L291-L295] to check if nested stages have java equivalents exists in the second PR mentioned above:

 
{code:java}
def is_java_convertible(instance):
    allNestedStages = MetaAlgorithmReadWrite.getAllNestedStages(instance.getEstimator())
    evaluator_convertible = isinstance(instance.getEvaluator(), JavaParams)
    estimator_convertible = all(map(lambda stage: hasattr(stage, '_to_java'), allNestedStages))
    return estimator_convertible and evaluator_convertible
{code}
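As a rough illustration of how the hasattr-based check behaves on nested stages, here is a Spark-free sketch. get_all_nested_stages is a hypothetical stand-in for MetaAlgorithmReadWrite.getAllNestedStages (I am assuming it yields the stage itself plus everything nested under it), and the three stage classes are invented for the example.

```python
def get_all_nested_stages(stage):
    """Hypothetical stand-in for MetaAlgorithmReadWrite.getAllNestedStages.

    Returns the stage itself followed by every stage nested inside it.
    """
    nested = [stage]
    for child in getattr(stage, "stages", []):
        nested.extend(get_all_nested_stages(child))
    return nested


class JavaBackedStage:
    """Invented leaf stage that has a JVM counterpart."""

    def _to_java(self):
        return object()  # placeholder for a JVM object


class PythonOnlyStage:
    """Invented leaf stage with no JVM counterpart (no _to_java)."""


class NestedPipelineModel:
    """Invented meta stage that holds other stages and can be converted."""

    def __init__(self, stages):
        self.stages = stages

    def _to_java(self):
        return object()  # placeholder for a JVM object


def all_stages_convertible(stages):
    # Duck-typed check over the flattened stage tree, in the spirit of
    # is_java_convertible above (evaluator handling is left out).
    return all(
        hasattr(stage, "_to_java")
        for top in stages
        for stage in get_all_nested_stages(top)
    )


java_pipeline = NestedPipelineModel(
    [JavaBackedStage(), NestedPipelineModel([JavaBackedStage()])]
)
mixed_pipeline = NestedPipelineModel(
    [JavaBackedStage(), NestedPipelineModel([PythonOnlyStage()])]
)

java_result = all_stages_convertible([java_pipeline])
mixed_result = all_stages_convertible([mixed_pipeline])
```

With these stand-ins, the check accepts a nested pipeline whose leaves are all convertible and rejects one that hides a Python-only stage at any depth.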
 

It seems there needs to be a consistent and clean way to check whether all stages can be converted to Java, i.e. support being written in the Java format. Maybe something similar to the is_java_convertible function above can be used instead of checkStagesForJava for Pipelines. Another alternative is to add an abstraction around the '_to_java'/'_from_java' functions (that is, around having a Java equivalent) and check that all stages inherit it.
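The second alternative could look roughly like the sketch below. JavaConvertible, TokenizerLike, PipelineModelLike and can_write_as_java are all hypothetical names I picked for illustration; nothing here exists in pyspark, and the dictionaries merely stand in for JVM objects.

```python
from abc import ABC, abstractmethod


class JavaConvertible(ABC):
    """Hypothetical base class for stages that have a JVM equivalent."""

    @abstractmethod
    def _to_java(self):
        """Return the JVM counterpart of this stage."""

    @classmethod
    @abstractmethod
    def _from_java(cls, java_stage):
        """Build the Python stage from its JVM counterpart."""


class TokenizerLike(JavaConvertible):
    """Invented leaf stage with a JVM equivalent."""

    def _to_java(self):
        return {"class": "TokenizerLike"}  # placeholder for a JVM object

    @classmethod
    def _from_java(cls, java_stage):
        return cls()


class PipelineModelLike(JavaConvertible):
    """Invented meta stage; it is convertible like any other stage."""

    def __init__(self, stages):
        self.stages = stages

    def _to_java(self):
        return {"class": "PipelineModelLike",
                "stages": [stage._to_java() for stage in self.stages]}

    @classmethod
    def _from_java(cls, java_stage):
        # Sketch only: nested stages are not reconstructed here.
        return cls([])


class PythonOnlyStage:
    """Invented stage with no JVM equivalent."""


def can_write_as_java(stages):
    # A stage qualifies only if it is JavaConvertible and, for meta stages,
    # everything nested under it qualifies as well.
    for stage in stages:
        if not isinstance(stage, JavaConvertible):
            return False
        if not can_write_as_java(getattr(stage, "stages", [])):
            return False
    return True


java_only = can_write_as_java([PipelineModelLike([TokenizerLike()])])
mixed = can_write_as_java(
    [PipelineModelLike([TokenizerLike(), PythonOnlyStage()])]
)
```

The point of the explicit base class is that meta stages and leaf stages would answer the same question the same way, so Pipeline and the tuning classes would not need separate checks.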

+ [~ajaysaini95700], [~weichenxu123], [~podongfeng]

 



> Spark ML Pipeline with nested Pipelines fails to load when saved from Python
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-28902
>                 URL: https://issues.apache.org/jira/browse/SPARK-28902
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Saif Addin
>            Priority: Minor
>
> Hi, this error is affecting a bunch of our nested use cases.
> A *PipelineModel* that has another *PipelineModel* as one of its stages fails to load from Scala if it was saved from Python.
> *Python side:*
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import Tokenizer
> t = Tokenizer()
> p = Pipeline().setStages([t])
> d = spark.createDataFrame([["Hello Peter Parker"]])
> pm = p.fit(d)
> np = Pipeline().setStages([pm])
> npm = np.fit(d)
> npm.write().save('./npm_test')
> {code}
>  
>  
> *Scala side:*
>  
> {code:java}
> scala> import org.apache.spark.ml.PipelineModel
> scala> val pp = PipelineModel.load("./npm_test")
> java.lang.IllegalArgumentException: requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name pyspark.ml.pipeline.PipelineModel
>  at scala.Predef$.require(Predef.scala:224)
>  at org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638)
>  at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
>  at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
>  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
>  ... 50 elided
> {code}
>  


