Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/08/12 07:00:05 UTC

[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

    [ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124493#comment-16124493 ] 

Joseph K. Bradley commented on SPARK-17025:
-------------------------------------------

[~nchammas] I just merged https://github.com/apache/spark/pull/18888 which should make this work if the custom Transformer uses simple (JSON-serializable) Params to store all of its data.  Does it meet your use case?  I'd like to make it easier to implement ML persistence for fancier data types in Transformers and Models (like Vectors or DataFrames) in the future, but hopefully this unblocks some use cases for now.
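
To make the "simple (JSON-serializable) Params" condition concrete, here is a minimal plain-Python sketch. It is not Spark's implementation and uses no PySpark API. It only checks which parameter values survive a JSON round trip, which is the kind of data the comment says can be persisted for now. The helper split_persistable, the class DenseVectorStandIn, and all parameter names and values are hypothetical, chosen for illustration only.

```python
import json


def split_persistable(params):
    """Partition param values into JSON-serializable ones and the rest.

    Hypothetical helper for illustration; not part of PySpark.
    """
    simple, fancy = {}, {}
    for name, value in params.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            # json cannot encode this value (e.g. an arbitrary object)
            fancy[name] = value
        else:
            simple[name] = value
    return simple, fancy


class DenseVectorStandIn:
    """Placeholder for a non-JSON type such as an ML Vector (assumption)."""

    def __init__(self, values):
        self.values = values


# Hypothetical params for a custom Transformer
params = {
    "inputCol": "pair",
    "outputCol": "features",
    "threshold": 0.5,
    "weights": DenseVectorStandIn([1.0, 2.0]),
}

simple, fancy = split_persistable(params)

# The simple params survive a round trip through JSON unchanged
restored = json.loads(json.dumps(simple))
```

In this sketch, a Transformer whose state lives entirely in values like inputCol, outputCol, and threshold would fall under the case described above, while a value such as weights stands for the "fancier data types" that the comment defers to future work.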

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17025
>                 URL: https://issues.apache.org/jira/browse/SPARK-17025
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Following the example in [this Databricks blog post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the model, the operation fails because the custom transformer doesn't have a {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in <module>
>     model.bestModel.save('model')
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 222, in save
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 217, in write
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", line 93, in __init__
>   File "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up (i.e. [like this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn up clear results).


