Posted to issues@spark.apache.org by "Kai Sasaki (JIRA)" <ji...@apache.org> on 2015/09/15 15:08:45 UTC
[jira] [Commented] (SPARK-10388) Public dataset loader interface

    [ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745430#comment-14745430 ] 

Kai Sasaki commented on SPARK-10388:
------------------------------------

This seems very useful for beginners who want to try Spark ML in their projects or see how the Pipeline API behaves. I have several comments.

* It might be better to download lazily. Some datasets are very large, so it would be good to download them only when they are really needed. In the example above, the download would happen at {{datasets.show()}}.
* Once datasets are downloaded, it would be better to cache them locally. This requires a repository API that publishes the latest update, so that the public dataset loader can refresh its local cache properly.
* I agree with the idea of allowing third parties to create their own repositories. This requires settling the design of the repository itself. We could create a specification and, if possible, an SDK. (Should these be included in the Spark project?)
* We should not restrict the formats that the public dataset loader can load. The current {{DataFrameReader}} can read formats such as json, libsvm, and orc. Public datasets may come in many other formats, so it may be reasonable to also support, in the future, formats that are not currently supported.
* Although this is just a passing idea, integrating the public dataset loader with Kaggle datasets would increase the use cases for Spark ML.
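
To make the first two points concrete, here is a minimal sketch of a lazy download with a local cache. It is only an illustration under stated assumptions, not a proposed implementation: {{DatasetRepository}}, {{LazyDataset}}, {{HttpFetch}}, the cache layout and the example.org URL are all hypothetical names that do not exist in Spark, and the sketch uses only the JDK so that it does not depend on any Spark API.

```scala
import java.net.URI
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical repository abstraction (not a Spark API): it only maps a
// dataset name to the location it can be downloaded from.
trait DatasetRepository {
  def name: String
  def uriFor(dataset: String): URI
}

// Default fetch strategy: stream the remote content into `target`.
object HttpFetch {
  def download(uri: URI, target: Path): Unit = {
    val in = uri.toURL.openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}

// Hypothetical handle to one dataset. Creating it downloads nothing.
final class LazyDataset(
    repo: DatasetRepository,
    dataset: String,
    cacheDir: Path,
    fetch: (URI, Path) => Unit = (u, p) => HttpFetch.download(u, p)) {

  // `lazy val` defers the work until first access and runs it at most once
  // per handle. A file already present in the cache directory is reused.
  lazy val localPath: Path = {
    val target = cacheDir.resolve(repo.name).resolve(dataset)
    if (!Files.exists(target)) {
      Files.createDirectories(target.getParent)
      // Download to a ".part" file first so that an interrupted download
      // is never mistaken for a complete cached copy.
      val partial = target.resolveSibling(target.getFileName.toString + ".part")
      fetch(repo.uriFor(dataset), partial)
      Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING)
    }
    target
  }
}

// Example repository; the base URL is a placeholder, not a real endpoint.
object ExampleLibsvmRepo extends DatasetRepository {
  val name = "libsvm"
  def uriFor(dataset: String): URI = new URI("https://example.org/datasets/" + dataset)
}
```

With this shape, {{new LazyDataset(ExampleLibsvmRepo, "rcv1_train.binary", cacheDir)}} would cost nothing until {{localPath}} is read, and a second handle pointing at the same cache directory would reuse the file. The sketch does not cover refreshing a stale cache; that would need the repository to publish some update marker, as mentioned above.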

In general, searching for data and loading it are troublesome. This feature would make both easier for developers. I would like to help with the design and implementation. Thank you.

> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the API and implementation are pending discussion. Note that this requires http and https support.


