Posted to issues@spark.apache.org by "Konrad Tendera (JIRA)" <ji...@apache.org> on 2019/04/13 01:51:00 UTC

[jira] [Commented] (SPARK-10388) Public dataset loader interface

    [ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816789#comment-16816789 ] 

Konrad Tendera commented on SPARK-10388:
----------------------------------------

[~mengxr] Is anyone currently working on this feature? I'd like to start contributing, and this looks like a nice and, at the same time, challenging starter task.

> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Priority: Major
>         Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.:
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the API and implementation are pending discussion. Note that this requires HTTP and HTTPS support.
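
One possible shape for the 3rd-party registration point mentioned in the description is sketched below. This is only an illustration, not the design proposed in the JIRA or any existing Spark API: the names DatasetInfo, DatasetRepo, DatasetRepos, and ExampleRepo, as well as the example.org URL, are hypothetical. The sketch covers registration and listing only; downloading over HTTP/HTTPS and converting the result to a DataFrame are left out.

```scala
import scala.collection.mutable

// Hypothetical metadata for one dataset in a repo (not a Spark class).
final case class DatasetInfo(name: String, url: String)

// Hypothetical extension point that a 3rd-party package would implement.
trait DatasetRepo {
  // Short identifier used by callers, e.g. "libsvm".
  def name: String

  // All datasets this repo knows about.
  def list(): Seq[DatasetInfo]

  // Look up a single dataset by name; None if the repo does not have it.
  def resolve(dataset: String): Option[DatasetInfo] =
    list().find(_.name == dataset)
}

// Hypothetical process-wide registry of repos, keyed by repo name.
object DatasetRepos {
  private val repos = mutable.Map.empty[String, DatasetRepo]

  // Register a repo; fails fast if the name is already taken.
  def register(repo: DatasetRepo): Unit = synchronized {
    require(!repos.contains(repo.name), s"repo '${repo.name}' is already registered")
    repos(repo.name) = repo
  }

  def lookup(name: String): Option[DatasetRepo] = synchronized {
    repos.get(name)
  }

  // List datasets for a repo; an unknown repo yields an empty list.
  def ls(repoName: String): Seq[DatasetInfo] =
    lookup(repoName).map(_.list()).getOrElse(Seq.empty)
}

// Toy repo with a placeholder URL, used only to exercise the registry.
object ExampleRepo extends DatasetRepo {
  val name = "example"

  def list(): Seq[DatasetInfo] = Seq(
    DatasetInfo("toy.binary", "https://example.org/datasets/toy.binary")
  )
}
```

With this shape, a package would call DatasetRepos.register(ExampleRepo) once at startup, and a loader like the one in the description could delegate loader.ls("example") to DatasetRepos.ls("example"). Keeping the registry independent of Spark types is a choice made for this sketch so that it runs without a SparkSession; a real implementation would likely return DataFrames as the description suggests.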



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org