Posted to issues@spark.apache.org by "Michael Dreibelbis (JIRA)" <ji...@apache.org> on 2018/06/20 01:07:00 UTC
[jira] [Created] (SPARK-24597) Spark ML Pipeline Should support non-linear models => DAGPipeline
Michael Dreibelbis created SPARK-24597:
------------------------------------------
Summary: Spark ML Pipeline Should support non-linear models => DAGPipeline
Key: SPARK-24597
URL: https://issues.apache.org/jira/browse/SPARK-24597
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 2.3.1
Reporter: Michael Dreibelbis
Currently, Spark ML Pipeline/PipelineModel only supports a linear sequence of stages applied to a single input dataset, despite the documentation describing a pipeline's stages as forming a DAG:
[reference documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details]
I'm proposing a DAGPipeline that arranges stages as a directed acyclic graph and supports multiple input datasets.
The code could look something like this:
{code:scala}
val ds1 = /*dataset 1 creation*/
val ds2 = /*dataset 2 creation*/
// nodes take on uid from estimator/transformer
val i1 = IdentityNode(new IdentityTransformer("i1"))
val i2 = IdentityNode(new IdentityTransformer("i2"))
val bi = TransformerNode(new Binarizer("bi"))
val cv = EstimatorNode(new CountVectorizer("cv"))
val idf = EstimatorNode(new IDF("idf"))
val j1 = JoinerNode(new Joiner("j1"))
val nodes = Array(i1, i2, bi, cv, idf, j1) // j1 must be registered so the edges into it resolve
val edges = Array(
("i1", "cv"), ("cv", "idf"), ("idf", "j1"),
("i2", "bi"), ("bi", "j1"))
val p = new DAGPipeline(nodes, edges)
.setIdentity("i1", ds1)
.setIdentity("i2", ds2)
val m = p.fit(spark.emptyDataFrame)
m.setIdentity("i1", ds1).setIdentity("i2", ds2)
m.transform(spark.emptyDataFrame)
{code}
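To fit such a pipeline, the DAGPipeline would need to evaluate nodes in a valid topological order of the edge list, so that each estimator/transformer runs only after all of its upstream inputs are available. As a minimal sketch (the `DagOrder` object and its `topoSort` method are hypothetical names, not part of any existing Spark API), Kahn's algorithm over the `(src, dst)` uid pairs could produce that order:

```scala
object DagOrder {
  // Kahn's algorithm: return node uids in a valid execution order,
  // failing if the edge list contains a cycle.
  def topoSort(nodes: Seq[String], edges: Seq[(String, String)]): Seq[String] = {
    // Count incoming edges per node.
    val inDegree = scala.collection.mutable.Map(nodes.map(_ -> 0): _*)
    edges.foreach { case (_, dst) => inDegree(dst) += 1 }
    // Nodes with no inputs (e.g. identity nodes) are ready immediately.
    val ready = scala.collection.mutable.Queue(nodes.filter(inDegree(_) == 0): _*)
    val order = scala.collection.mutable.ArrayBuffer.empty[String]
    while (ready.nonEmpty) {
      val n = ready.dequeue()
      order += n
      // Releasing n may make its downstream nodes ready.
      edges.collect { case (`n`, dst) => dst }.foreach { dst =>
        inDegree(dst) -= 1
        if (inDegree(dst) == 0) ready.enqueue(dst)
      }
    }
    require(order.size == nodes.size, "pipeline graph contains a cycle")
    order.toSeq
  }
}
```

For the graph in the example above, any order returned places "i1" before "cv", "cv" before "idf", and joins at "j1" last, which is exactly the fit/transform schedule a DAGPipeline would follow.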
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org