Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:42:21 UTC
[jira] [Resolved] (SPARK-24656) SparkML Transformers and Estimators with multiple columns
[ https://issues.apache.org/jira/browse/SPARK-24656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24656.
----------------------------------
Resolution: Incomplete
> SparkML Transformers and Estimators with multiple columns
> ---------------------------------------------------------
>
> Key: SPARK-24656
> URL: https://issues.apache.org/jira/browse/SPARK-24656
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 2.3.1
> Reporter: Michael Dreibelbis
> Priority: Major
> Labels: bulk-closed
>
> Currently, SparkML Transformers and Estimators operate on single input/output column pairs. This makes pipelines extremely cumbersome (as well as non-performant) when transformations on multiple columns need to be made.
>
> I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model classes that would operate on the input columns in parallel.
>
> {code:java}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{CountVectorizer, IDF}
>
> // old way: one single-column stage per input/output column pair
> val pipeline = new Pipeline().setStages(Array(
>   new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
>   new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
>   new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
>   new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
> ))
>
> // proposed way: one multi-column stage per transformation
> // (ParallelCountVectorizer and ParallelIDF are the proposed new classes)
> val pipeline2 = new Pipeline().setStages(Array(
>   new ParallelCountVectorizer().setInputCols(Array("_1", "_2")).setOutputCols(Array("_1_cv", "_2_cv")),
>   new ParallelIDF().setInputCols(Array("_1_cv", "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf"))
> ))
> {code}
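>
> As a rough illustration only (not part of this ticket's design; the class name ParallelTransformer below is hypothetical), one minimal sketch of such a parallel stage is a wrapper Transformer that applies one pre-built single-column transformer per input/output column pair:
>
> {code:java}
> import org.apache.spark.ml.Transformer
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.ml.util.Identifiable
> import org.apache.spark.sql.{DataFrame, Dataset}
> import org.apache.spark.sql.types.StructType
>
> // Hypothetical sketch: wraps one single-column transformer per column pair
> // and applies them all as a single pipeline stage.
> class ParallelTransformer(override val uid: String, stages: Seq[Transformer])
>     extends Transformer {
>
>   def this(stages: Seq[Transformer]) =
>     this(Identifiable.randomUID("parallel"), stages)
>
>   // Apply each wrapped transformer in turn; each touches only its own columns.
>   override def transform(dataset: Dataset[_]): DataFrame =
>     stages.foldLeft(dataset.toDF) { (df, t) => t.transform(df) }
>
>   override def transformSchema(schema: StructType): StructType =
>     stages.foldLeft(schema) { (s, t) => t.transformSchema(s) }
>
>   override def copy(extra: ParamMap): ParallelTransformer =
>     new ParallelTransformer(uid, stages.map(_.copy(extra)))
> }
> {code}
>
> A real ParallelCountVectorizer/ParallelIDF would presumably also share a single pass over the data when fitting, which is where the performance benefit would come from; this naive wrapper does not do that.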
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org