Posted to issues@spark.apache.org by "Zhe Sun (JIRA)" <ji...@apache.org> on 2017/03/02 12:53:45 UTC
[jira] [Comment Edited] (SPARK-19797) ML pipelines document error

    [ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892189#comment-15892189 ] 

Zhe Sun edited comment on SPARK-19797 at 3/2/17 12:52 PM:
----------------------------------------------------------

Hi Sean, thanks for your quick reply. 

bq. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.

Let's use IDF as an example. Suppose the pipeline is:
bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression
When we fit this pipeline, _fit_ is first called on *IDF*, and then _transform_ is called on the fitted IDF model so that its output can be passed to LogisticRegression. This is because LogisticRegression is an Estimator, and the _fit_ of LogisticRegression needs the data produced by the _transform_ of *IDF*.

However, suppose the last stage of the pipeline is a Normalizer (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer):
bq. Tokenizer -> HashingTF -> IDF -> Normalizer 
When fitting this pipeline, only _fit_ is called on *IDF*; _transform_ does not need to be called, because Normalizer is a Transformer and no Estimator follows *IDF*.
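
To make the difference between the two pipelines concrete, here is a toy sketch in plain Python. It is not Spark's code: ToyTransformer, ToyEstimator, toy_pipeline_fit and trace_fit are hypothetical names, and the loop is only my reading of the fit logic in Pipeline.scala (find the last Estimator, and call transform only on stages before it). It records which calls happen, in order, for each pipeline.

```python
class ToyTransformer:
    """Hypothetical stand-in for a Transformer: records each transform() call."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def transform(self, data):
        self.log.append(f"{self.name}.transform")
        return data


class ToyEstimator:
    """Hypothetical stand-in for an Estimator: fit() returns a fitted model."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def fit(self, data):
        self.log.append(f"{self.name}.fit")
        # The fitted model is itself a transformer (e.g. IDF -> IDFModel).
        return ToyTransformer(f"{self.name}Model", self.log)


def toy_pipeline_fit(stages, data=None):
    """Assumed fit logic: transform is only needed before the last Estimator."""
    index_of_last_estimator = -1
    for index, stage in enumerate(stages):
        if isinstance(stage, ToyEstimator):
            index_of_last_estimator = index

    fitted = []
    for index, stage in enumerate(stages):
        if index <= index_of_last_estimator:
            model = stage.fit(data) if isinstance(stage, ToyEstimator) else stage
            # Only stages strictly before the last Estimator must produce data.
            if index < index_of_last_estimator:
                data = model.transform(data)
            fitted.append(model)
        else:
            # Trailing Transformers are kept as-is; nothing is called on them.
            fitted.append(stage)
    return fitted


def trace_fit(spec):
    """spec is a list of (name, is_estimator) pairs; returns the call log."""
    log = []
    stages = [
        (ToyEstimator if is_estimator else ToyTransformer)(name, log)
        for name, is_estimator in spec
    ]
    toy_pipeline_fit(stages)
    return log


with_lr = trace_fit([("Tokenizer", False), ("HashingTF", False),
                     ("IDF", True), ("LogisticRegression", True)])
with_normalizer = trace_fit([("Tokenizer", False), ("HashingTF", False),
                             ("IDF", True), ("Normalizer", False)])
print(with_lr)          # includes "IDFModel.transform"
print(with_normalizer)  # stops after "IDF.fit"
```

In this sketch the first pipeline logs a transform on the fitted IDF model before LogisticRegression is fit, while the second pipeline ends at IDF.fit with no further transform call.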

That's why I think the description would be more accurate if it were modified as follows:
bq. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.



> ML pipelines document error
> ---------------------------
>
>                 Key: SPARK-19797
>                 URL: https://issues.apache.org/jira/browse/SPARK-19797
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Zhe Sun
>            Priority: Trivial
>              Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The description of pipelines in this paragraph is incorrect and misleads users: https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works
> bq. If the Pipeline had more *stages*, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
> The description is not accurate, because a *Transformer* can also be a stage, but only a subsequent Estimator triggers the extra transform call.
> So the description should be corrected to: *If the Pipeline had more _Estimators_*. 
> The code that demonstrates this is here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)