Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/07/13 21:10:00 UTC

[jira] [Commented] (SPARK-20099) Add transformSchema to pyspark.ml

    [ https://issues.apache.org/jira/browse/SPARK-20099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086422#comment-16086422 ] 

Joseph K. Bradley commented on SPARK-20099:
-------------------------------------------

[~holdenk] [~yanboliang] [~yuhaoyan] [~mlnick] CCing a few people since [~WeichenXu123] is interested in working on this.  Do you think it's reasonable to add PipelineStage to Python in order to add transformSchema?

Pro: early schema failure detection in Python

Con: duplication of schema checking logic in Python
* I don't see a good way to do schema checking in Python for Pipelines without this duplication.  The only way would be to convert Pipelines to Scala equivalents before executing them; i.e., the Pipeline implementation would be in Scala only.  The problem is that we need Pipelines implemented in Python as well in order to support Python-only implementations of Transformers and Estimators (for custom use cases).

A reasonable way to do this in a series of PRs would be to:
* Add PipelineStage abstraction, with abstract transformSchema method
* For each Transformer/Estimator/Model in Python, change it to inherit from PipelineStage
* Finally, change Pipeline and PipelineModel to call transformSchema on their sequences of stages
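
To make the three steps above concrete, here is a rough sketch of how they might fit together. It is only an illustration, not a proposed implementation: the schema is stood in for by a plain dict of column name to type name (real code would presumably use pyspark.sql.types.StructType), and ToyTokenizer, ToyHashingTF and ToyPipeline are hypothetical names invented for this example, not existing pyspark.ml classes.

```python
from abc import ABC, abstractmethod


class PipelineStage(ABC):
    """Hypothetical Python counterpart of the Scala PipelineStage abstraction."""

    @abstractmethod
    def transformSchema(self, schema):
        """Return the output schema, or raise ValueError if the input is invalid."""


class ToyTokenizer(PipelineStage):
    """Toy stage (assumption for this sketch): string column -> array<string> column."""

    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transformSchema(self, schema):
        if schema.get(self.inputCol) != "string":
            raise ValueError("column %r must exist and be a string" % self.inputCol)
        if self.outputCol in schema:
            raise ValueError("output column %r already exists" % self.outputCol)
        out = dict(schema)  # copy so the caller's schema is not mutated
        out[self.outputCol] = "array<string>"
        return out


class ToyHashingTF(PipelineStage):
    """Toy stage (assumption for this sketch): array<string> column -> vector column."""

    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transformSchema(self, schema):
        if schema.get(self.inputCol) != "array<string>":
            raise ValueError("column %r must exist and be array<string>" % self.inputCol)
        if self.outputCol in schema:
            raise ValueError("output column %r already exists" % self.outputCol)
        out = dict(schema)
        out[self.outputCol] = "vector"
        return out


class ToyPipeline(PipelineStage):
    """Threads the schema through each stage in order, without touching any data."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transformSchema(self, schema):
        for stage in self.stages:
            schema = stage.transformSchema(schema)
        return schema
```

With something like this, a Pipeline could call transformSchema on the input DataFrame's schema before fitting, so a missing or mistyped column raises immediately rather than partway through execution. The duplication concern is visible here too: each toy stage repeats checks that the Scala side already performs.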

> Add transformSchema to pyspark.ml
> ---------------------------------
>
>                 Key: SPARK-20099
>                 URL: https://issues.apache.org/jira/browse/SPARK-20099
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.0
>            Reporter: Joseph K. Bradley
>
> Python's ML API currently lacks the PipelineStage abstraction.  This abstraction's main purpose is to provide transformSchema(), which detects failures early in a Pipeline.
> As mentioned in https://github.com/apache/spark/pull/17218 it would also be useful in Python for checking Params in Python wrappers for Scala implementations; in these, transformSchema would involve passing Params in Python to Scala, which would then be able to validate the Param values.  This could prevent late failures from bad Param settings in Pipeline execution, while still allowing us to check Param values on only the Scala side.
> This issue is for adding transformSchema() to pyspark.ml.  If it's reasonable, we could create a PipelineStage abstraction.  But it'd probably be fine to add transformSchema() directly to Transformer and Estimator, rather than creating PipelineStage.


