Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2014/09/15 12:13:33 UTC

[jira] [Commented] (SPARK-3530) Pipeline and Parameters

    [ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133760#comment-14133760 ] 

Sean Owen commented on SPARK-3530:
----------------------------------

A few high-level questions:

Is this a rewrite of MLlib? I see the old code will be deprecated. I assume the algorithms will come along, but in a fairly different form. I think that's actually a good thing. But is this targeted at a 2.x release, or sooner?

How does this relate to MLI and MLbase? I had thought they would, in theory, handle things like grid search, but I haven't seen activity or mention of them in a while. Is this at all a merge of the two, or is MLlib going to take over these concerns?

I don't think you will need or want to use this code, but the Oryx project already has an implementation of grid search on Spark; at the least, it's another take on the API for such a thing to consider. https://github.com/OryxProject/oryx/tree/master/oryx-ml/src/main/java/com/cloudera/oryx/ml/param
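
To make that concrete, here is a rough sketch in Scala of what a grid-search parameter API could look like. All names here are hypothetical -- this is neither the Oryx API nor the proposed MLlib one:

    // Each parameter ranges over a small set of candidate values.
    case class ParamRange(name: String, values: Seq[Double])

    def gridSearch(ranges: Seq[ParamRange],
                   train: Map[String, Double] => Double  // returns a validation metric
                  ): (Map[String, Double], Double) = {
      // Expand the cross product of all candidate values.
      val combos = ranges.foldLeft(Seq(Map.empty[String, Double])) { (acc, r) =>
        for (m <- acc; v <- r.values) yield m + (r.name -> v)
      }
      // Train one model per combination; keep the best by metric (higher is better).
      combos.map(m => (m, train(m))).maxBy(_._2)
    }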

Big +1 for parameter tuning; that belongs as a first-class citizen. I'm also intrigued by doing better than trying every possible combination of parameters separately, maybe by sharing partial results to speed up training of several models. Is this realistic for any parameter besides things like # iterations, which isn't really a hyperparameter? I don't know of a way, for example, to build N models with N different overfitting params and share some of the work. I would love to learn that it's possible; it's good to design for it anyway. (A sketch of the # iterations case follows below.)
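
For # iterations, the sharing is easy to picture: warm-start each longer run from the previous model instead of training from scratch. A minimal Scala sketch, where Model and trainFrom are hypothetical stand-ins for a real trainer:

    // Sketch of sharing work across iteration counts via warm starts.
    // Model and trainFrom are hypothetical, not existing MLlib calls.
    type Model = Array[Double]
    def trainFrom(m: Model, extraIters: Int): Model = m  // stub: continue optimizing m

    val iterCounts = Seq(10, 50, 100)
    var model: Model = Array.fill(10)(0.0)
    var done = 0
    val snapshots = for (n <- iterCounts) yield {
      model = trainFrom(model, n - done)  // run only the incremental iterations
      done = n
      (n, model)
    }
    // One 100-iteration run produces all three models, vs. 10 + 50 + 100 = 160
    // iterations when each is trained from scratch.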

I see mention of a Dataset abstraction, which I assume carries some type information, like distinguishing categorical from numeric features. I think that's very good!
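
Even something minimal would go a long way. A strawman in Scala (my own sketch, not the Dataset abstraction from the design doc):

    // Strawman only -- not the proposed Dataset API.
    sealed trait ColumnType
    case object Numeric extends ColumnType
    case class Categorical(arity: Int) extends ColumnType

    case class Schema(columns: Seq[(String, ColumnType)])

    // Rows plus a schema, so a downstream stage can decide on its own to,
    // say, 1-hot encode Categorical columns and standardize Numeric ones.
    case class TypedData(schema: Schema, rows: org.apache.spark.rdd.RDD[Array[Double]])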

I've always found the 'pipeline' part hard to build. It's tempting to construct a framework for feature extraction, and to some degree you can, by providing transformations, 1-hot encoding, etc. But a framework for understanding arbitrary databases, fields, and so on quickly becomes endlessly large in scope. To me, Spark Core is already the right abstraction for upstream ETL of data before it enters an ML framework. I mention it only because it's in the first picture, but I don't see later discussion of actually doing user/product attribute selection, so maybe it's not meant to be part of the proposal.
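
For instance, 1-hot encoding is already a few lines of plain Spark Core, which is part of why I'd leave arbitrary upstream ETL there. A minimal sketch, assuming one categorical string column of low cardinality:

    import org.apache.spark.rdd.RDD

    // Minimal 1-hot encoding as plain RDD transformations -- no framework needed.
    def oneHot(col: RDD[String]): RDD[Array[Double]] = {
      // Build an index of distinct categories on the driver (assumes low cardinality).
      val index = col.distinct().collect().sorted.zipWithIndex.toMap
      val n = index.size
      col.map { v =>
        val vec = Array.fill(n)(0.0)
        vec(index(v)) = 1.0
        vec
      }
    }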

I'd certainly like to keep up more with your work here. This is a big step forward in making MLlib relevant to production deployments rather than just pure algorithm implementations.

> Pipeline and Parameters
> -----------------------
>
>                 Key: SPARK-3530
>                 URL: https://issues.apache.org/jira/browse/SPARK-3530
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This part of the design doc is for pipelines and parameters. I put the design doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org