Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2016/01/15 00:29:39 UTC

[jira] [Commented] (SPARK-10388) Public dataset loader interface

    [ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099118#comment-15099118 ] 

Xiangrui Meng commented on SPARK-10388:
---------------------------------------

[~zjffdu] Thanks for posting the design doc! There might be some miscommunication in my description. We shouldn't assume any additional work on the server side. LIBSVM and UCI repos are out of our control, and we cannot mirror the repos and implement servers because of license issues and maintenance cost. We should only consider what we can do on the Spark side. Essentially, we need the following:

1) a catalog of public datasets
2) how to fetch datasets into Spark (via http/ftp)
3) how to expand the catalog
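
As a rough illustration of these three pieces on the Spark side only, here is a minimal sketch. Everything in it is an assumption for discussion, not an existing Spark API: the names `DatasetEntry`, `DatasetCatalog`, `register`, `ls`, and `fetchTo` are hypothetical, and the URL in the usage note is a placeholder. It relies only on the JDK to download over whatever protocol the URL names (http, https, ftp).

```scala
import java.net.URI
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical catalog entry: which repo, which dataset, and where to download it.
final case class DatasetEntry(repo: String, name: String, url: String)

// Hypothetical immutable catalog keyed by (repo, dataset name).
final case class DatasetCatalog(
    entries: Map[(String, String), DatasetEntry] = Map.empty) {

  // 3) Expand the catalog: returns a new catalog containing the extra entry.
  def register(e: DatasetEntry): DatasetCatalog =
    copy(entries = entries + ((e.repo, e.name) -> e))

  // 1) Query the catalog.
  def find(repo: String, name: String): Option[DatasetEntry] =
    entries.get((repo, name))

  def ls(repo: String): Seq[String] =
    entries.values.filter(_.repo == repo).map(_.name).toSeq.sorted

  // 2) Fetch a dataset to a local path; returns None if it is not in the catalog.
  def fetchTo(repo: String, name: String, target: Path): Option[Path] =
    find(repo, name).map { e =>
      val in = URI.create(e.url).toURL.openStream()
      try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      target
    }
}
```

For example, `DatasetCatalog().register(DatasetEntry("libsvm", "rcv1_train.binary", "http://example.org/rcv1_train.binary.bz2"))` would make the dataset visible to `ls("libsvm")`; loading the fetched file into a DataFrame is left out of the sketch.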

For example, we can host the catalog as a resource file inside the Spark repo, but then it won't be updated frequently because it is tied to the Spark release cycle. Alternatively, we can put the catalog file on spark.apache.org, in which case we need to make sure it is compatible across Spark versions (or maintain a catalog file for each Spark release).
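
To make the compatibility concern concrete, one option would be for the catalog file to carry a format version that each Spark release checks before parsing. The sketch below assumes a hypothetical line-based format (a `version=N` header followed by `repo,name,url` lines, with `#` comments); the format itself, `CatalogParser`, and `SupportedVersion` are illustrative assumptions, not a proposal for the final design.

```scala
object CatalogParser {
  // Hypothetical: the highest catalog format version this Spark release understands.
  val SupportedVersion = 1

  final case class Entry(repo: String, name: String, url: String)

  // Returns Left(error) for a malformed or too-new catalog, Right(entries) otherwise.
  def parse(text: String): Either[String, Seq[Entry]] = {
    val lines = text.split("\n").map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#")).toList
    lines match {
      case header :: rest if header.startsWith("version=") =>
        scala.util.Try(header.stripPrefix("version=").toInt).toOption match {
          case Some(v) if v <= SupportedVersion => parseEntries(rest)
          case Some(v) =>
            Left(s"catalog version $v is newer than supported version $SupportedVersion")
          case None => Left(s"invalid version header: $header")
        }
      case _ => Left("missing version header")
    }
  }

  // Each remaining line must have exactly three comma-separated fields.
  private def parseEntries(lines: List[String]): Either[String, Seq[Entry]] = {
    val parsed: List[Either[String, Entry]] = lines.map { l =>
      l.split(",", 3) match {
        case Array(repo, name, url) => Right(Entry(repo.trim, name.trim, url.trim))
        case _ => Left(s"malformed catalog line: $l")
      }
    }
    parsed.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None => Right(parsed.collect { case Right(e) => e })
    }
  }
}
```

With this kind of check, an older Spark release that downloads a newer catalog fails with a clear message instead of misreading it; the alternative of one catalog file per Spark release would not need the header at all.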

I think this should be the main focus of the design. Do you have time to check the details and update the doc? Thanks!



> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>         Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the API and implementation are pending discussion. Note that this requires http and https support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)