Posted to issues@spark.apache.org by "Michael Dreibelbis (JIRA)" <ji...@apache.org> on 2018/06/20 01:07:00 UTC

[jira] [Created] (SPARK-24597) Spark ML Pipeline Should support non-linear models => DAGPipeline

Michael Dreibelbis created SPARK-24597:
------------------------------------------

             Summary: Spark ML Pipeline Should support non-linear models => DAGPipeline
                 Key: SPARK-24597
                 URL: https://issues.apache.org/jira/browse/SPARK-24597
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Michael Dreibelbis


Currently, Spark ML's Pipeline/PipelineModel only supports a linear chain of transformations over a single input dataset, despite the documentation stating that non-linear, DAG-shaped pipelines are possible:

[reference documentation|https://spark.apache.org/docs/2.3.0/ml-pipeline.html#details]

I'm proposing implementing a DAGPipeline that supports a DAG of stages and multiple datasets as input.
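
For contrast, here is a minimal sketch of what today's linear API allows (assuming ds1 is a DataFrame with a "text" column; stage tuning elided): every stage consumes the single dataset produced by the previous stage, so there is no way to feed a second dataset into the middle of the chain.

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF, Tokenizer}

// A strictly linear chain over ONE dataset: tokenizer -> cv -> idf
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val cv = new CountVectorizer().setInputCol("words").setOutputCol("tf")
val idf = new IDF().setInputCol("tf").setOutputCol("features")

val linear = new Pipeline().setStages(Array(tokenizer, cv, idf))

// fit/transform accept exactly one DataFrame; a second input (e.g. for
// a join between branches) cannot enter the pipeline anywhere.
val model = linear.fit(ds1)
val out = model.transform(ds1)
{code}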

The proposed DAGPipeline API could look something like this:

 
{code:scala}
// Input datasets (creation elided)
val ds1 = /*dataset 1 creation*/
val ds2 = /*dataset 2 creation*/

// Nodes take on the uid of the wrapped estimator/transformer
val i1 = IdentityNode(new IdentityTransformer("i1"))  // source for ds1
val i2 = IdentityNode(new IdentityTransformer("i2"))  // source for ds2
val bi = TransformerNode(new Binarizer("bi"))
val cv = EstimatorNode(new CountVectorizer("cv"))
val idf = EstimatorNode(new IDF("idf"))
val j1 = JoinerNode(new Joiner("j1"))                 // fan-in: joins both branches

// Edges are (fromUid, toUid) pairs; the joiner j1 must be a node as well
val nodes = Array(i1, i2, bi, cv, idf, j1)
val edges = Array(
  ("i1", "cv"), ("cv", "idf"), ("idf", "j1"),
  ("i2", "bi"), ("bi", "j1"))

val p = new DAGPipeline(nodes, edges)
  .setIdentity("i1", ds1)
  .setIdentity("i2", ds2)
val m = p.fit(spark.emptyDataFrame)
m.setIdentity("i1", ds1).setIdentity("i2", ds2)
m.transform(spark.emptyDataFrame)
{code}
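
None of the DAGPipeline types above exist in Spark yet. As a purely hypothetical sketch of how the proposal could hang together (every name below is an assumption taken from the example, not an existing class), the nodes could be thin wrappers that key each stage by its uid, with DAGPipeline holding the named input bindings and walking the graph in topological order:

{code:scala}
import org.apache.spark.ml.{Estimator, Transformer}
import org.apache.spark.sql.DataFrame

// Hypothetical node hierarchy for the proposed DAGPipeline. Each node is
// keyed by the uid of the stage it wraps, so edges can be declared as
// (fromUid, toUid) pairs as in the example above.
sealed trait PipelineNode { def uid: String }

// Source node: bound to an externally supplied dataset via setIdentity.
case class IdentityNode(transformer: Transformer) extends PipelineNode {
  def uid: String = transformer.uid
}

// Pure transformation: one upstream dataset in, one dataset out.
case class TransformerNode(transformer: Transformer) extends PipelineNode {
  def uid: String = transformer.uid
}

// Must be fit first; the fitted model then transforms the upstream dataset.
case class EstimatorNode(estimator: Estimator[_]) extends PipelineNode {
  def uid: String = estimator.uid
}

// Fan-in node: combines the datasets of all incoming edges (e.g. by join).
case class JoinerNode(joiner: Transformer) extends PipelineNode {
  def uid: String = joiner.uid
}

class DAGPipeline(nodes: Array[PipelineNode], edges: Array[(String, String)]) {
  private val inputs = scala.collection.mutable.Map.empty[String, DataFrame]

  // Bind a concrete dataset to an IdentityNode before fit/transform.
  def setIdentity(uid: String, ds: DataFrame): DAGPipeline = {
    inputs(uid) = ds
    this
  }

  // fit() would topologically sort the nodes, then walk the DAG: fitting
  // estimators and transforming edge by edge, keeping a Map[uid, DataFrame]
  // of intermediate results instead of the single DataFrame that the
  // current Pipeline threads through its stages.
  // def fit(df: DataFrame): DAGPipelineModel = ???
}
{code}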