Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2018/09/10 13:45:00 UTC
[jira] [Commented] (SPARK-24656) SparkML Transformers and Estimators with multiple columns
[ https://issues.apache.org/jira/browse/SPARK-24656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609213#comment-16609213 ]
Wenchen Fan commented on SPARK-24656:
-------------------------------------
I'm removing the target version, since no one is working on it.
> SparkML Transformers and Estimators with multiple columns
> ---------------------------------------------------------
>
> Key: SPARK-24656
> URL: https://issues.apache.org/jira/browse/SPARK-24656
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 2.3.1
> Reporter: Michael Dreibelbis
> Priority: Major
>
> Currently, SparkML Transformers and Estimators operate on single input/output column pairs. This makes pipelines extremely cumbersome (as well as non-performant) when transformations need to be made on multiple columns.
>
> I am proposing to implement ParallelPipelineStage/Transformer/Estimator/Model that would operate on the input columns in parallel.
>
> {code:java}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.feature.{CountVectorizer, IDF}
>
> // old way: one stage per input/output column pair
> val pipeline = new Pipeline().setStages(Array(
>   new CountVectorizer().setInputCol("_1").setOutputCol("_1_cv"),
>   new CountVectorizer().setInputCol("_2").setOutputCol("_2_cv"),
>   new IDF().setInputCol("_1_cv").setOutputCol("_1_idf"),
>   new IDF().setInputCol("_2_cv").setOutputCol("_2_idf")
> ))
>
> // proposed way: ParallelCountVectorizer and ParallelIDF are the classes
> // proposed in this issue; they do not exist in Spark yet
> val pipeline2 = new Pipeline().setStages(Array(
>   new ParallelCountVectorizer().setInputCols(Array("_1", "_2")).setOutputCols(Array("_1_cv", "_2_cv")),
>   new ParallelIDF().setInputCols(Array("_1_cv", "_2_cv")).setOutputCols(Array("_1_idf", "_2_idf"))
> ))
> {code}
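
Until something like the proposed stages exists, the boilerplate of the "old way" can be reduced by generating the per-column stages from a list of column names. The following is only a minimal sketch using the existing single-column API. The column names and the "_cv"/"_idf" suffixes are taken from the example above, and the value names (cols, cvStages, idfStages, generatedPipeline) are made up for illustration. It shortens the pipeline definition but does not address the performance concern, since each stage still processes one column.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Input columns to transform (names taken from the example above).
val cols = Seq("_1", "_2")

// One CountVectorizer per input column, writing to "<col>_cv".
val cvStages: Seq[PipelineStage] = cols.map { c =>
  new CountVectorizer().setInputCol(c).setOutputCol(s"${c}_cv")
}

// One IDF per CountVectorizer output, writing to "<col>_idf".
val idfStages: Seq[PipelineStage] = cols.map { c =>
  new IDF().setInputCol(s"${c}_cv").setOutputCol(s"${c}_idf")
}

// All CountVectorizer stages run before the IDF stages that consume their output.
val generatedPipeline = new Pipeline().setStages((cvStages ++ idfStages).toArray)
```

The resulting pipeline holds the same four stages as the hand-written version, so it can be fitted and applied in the same way.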