Posted to issues@spark.apache.org by "Patrick Wendell (JIRA)" <ji...@apache.org> on 2014/10/02 08:31:33 UTC

[jira] [Commented] (SPARK-3573) Dataset

    [ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156129#comment-14156129 ] 

Patrick Wendell commented on SPARK-3573:
----------------------------------------

I think people are hung up on the term SQL - SchemaRDD is designed simply to represent richer types on top of the core RDD API. In fact, we originally thought of naming the package "schema" instead of "sql" for exactly this reason. SchemaRDD lives in the sql/core package right now, but we could pull the public interface of SchemaRDD into another package in the future (and maybe we'd drop exposing anything about the logical plan there).
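
For what it's worth, here is a minimal sketch of that idea against the Spark 1.x SQLContext API (the case class and column names are made up for illustration): an ordinary RDD of records picks up column names and types without any SQL being involved.

{code}
// Sketch only. Assumes a spark-shell-style environment where sc is an
// existing SparkContext; the AdEvent case class is invented for this example.
case class AdEvent(eventId: Long, userId: Long, label: Double)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val events = sc.parallelize(Seq(
  AdEvent(1L, 10L, 1.0),
  AdEvent(2L, 11L, 0.0)))

// createSchemaRDD infers the schema from the case class via reflection,
// turning a plain RDD into a SchemaRDD with named, typed columns.
val schemaRDD = sqlContext.createSchemaRDD(events)
schemaRDD.printSchema()               // eventId: long, userId: long, label: double
schemaRDD.registerTempTable("event")  // optional: only needed if you want SQL on top
{code}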

I'd like to see a common representation of typed data used across both SQL and MLlib, and longer term across other libraries as well. I don't see an insurmountable semantic gap between an R-style data frame and a relational table. In fact, if you look across other projects today, almost all of them are trying to unify these kinds of data representations.

So I'd support seeing whether we can enhance or extend SchemaRDD to better support numeric data sets. If we find the gap is just too large, then we could look at implementing a second dataset abstraction. If nothing else, this is a test of whether SchemaRDD is sufficiently extensible to be useful in contexts beyond SQL (its original design).
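
To make the "extend SchemaRDD" direction concrete, here is a rough sketch of ML-specific column metadata embedded in the schema itself. The metadata slot on StructField and the attribute keys are assumptions for illustration (the API names follow what later Spark releases expose), not something this ticket settles.

{code}
// Sketch only: assumes a metadata field on StructField for carrying
// ML-specific column attributes (categorical vs. continuous, number of
// categories, etc.). Attribute keys like "mlType" are hypothetical.
import org.apache.spark.sql.types._

val countryMeta = new MetadataBuilder()
  .putString("mlType", "categorical")   // hypothetical attribute keys
  .putLong("numCategories", 100L)
  .build()

val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  StructField("userFeatures", ArrayType(DoubleType), nullable = false),
  StructField("userCountryIndex", DoubleType, nullable = false, metadata = countryMeta)
))

// Downstream components (indexer, tree classifier, ...) could read this
// metadata to decide how to treat each column, instead of requiring a
// separate ML-only dataset type.
{code}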

> Dataset
> -------
>
>                 Key: SPARK-3573
>                 URL: https://issues.apache.org/jira/browse/SPARK-3573
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This JIRA is for discussion of the ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.
> Sample code:
> Suppose we have training events stored on HDFS and user/ad features in Hive. We want to assemble features for training and then apply a decision tree.
> The proposed pipeline with the dataset looks like the following (needs more refinement):
> {code}
> sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
> val training = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()
> val indexer = new Indexer()
> val interactor = new Interactor()
> val fvAssembler = new FeatureVectorAssembler()
> val treeClassifier = new DecisionTreeClassifier()
> val paramMap = new ParamMap()
>   .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
>   .put(indexer.sortByFrequency, true)
>   .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
>   .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
>   .put(fvAssembler.dense, true)
>   .put(treeClassifier.maxDepth, 4) // By default, the classifier recognizes the "features" and "label" columns.
> val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
> val model = pipeline.fit(training, paramMap)
> sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
> val test = sqlContext.sql("""
>   SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
>          user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
>          ad.targetGender AS targetGender
>     FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")
> val prediction = model.transform(test).select('eventId, 'prediction)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org