You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2015/04/20 05:25:55 UTC

[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/5585

    [SPARK-7000] [ml] Refactor prediction and tree abstractions to be under ml.prediction subpackage

    From JIRA:
    > spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree.  Instead, they should be gathered into ml.prediction.  This will become more important as more abstractions, such as ensembles, are added.
    
    I refactored using IntelliJ.
    
    The only additional changes I made were:
    * Better doc for DecisionTreeExample to warn users that it can require more memory than run-example provides.
    * line 120 of treeParams.scala (a small correction for a warning)
    
    CC: @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-api-dt3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5585.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5585
    
----
commit 775cecf3b56bab8c665ae431e42acf01891c45a1
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2015-04-19T22:19:37Z

    Refactored prediction and tree abstractions to be under ml.prediction sub-package

commit b2cb6ad8651b56e7ad5f33c84bafb2388ccb8a0c
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2015-04-20T03:24:48Z

    Fixed build issues after refactor, and improved doc for DecisionTreeExample

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94570451
  
    Ok, so you'd vote for having separate subpackages for each type of classification/prediction abstraction?
    * ml.prediction.Predictor (once it is public)
    * ml.tree.*
    * ml.ensembles.* (once we add general boosting, bagging)
    * (There may be more which are not on the roadmap.)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94580875
  
    `ml.tree` and `ml.ensemble` look good. If we want to distinguish decision tree from tree elements used in hierarchical clustering, we can put them under separate packages, e.g., `ml.tree` and `ml.clustering.hierachical`. It is not necessary to create common base classes if the subclasses are not expected to be called in a generic way.
    
    What do we want to put under `ml.prediction` beside `Predictor`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94510319
  
    Oh, I misread one thing you wrote: ```tree.Node``` lives under ```ml.prediction``` (not ```ml.prediction.impl```) since it's a public interface.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94895731
  
    I'm copying the 2 small edits to [https://github.com/apache/spark/pull/5567]


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94358565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30574/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94358561
  
      [Test build #30574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30574/consoleFull) for   PR 5585 at commit [`b2cb6ad`](https://github.com/apache/spark/commit/b2cb6ad8651b56e7ad5f33c84bafb2388ccb8a0c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94347422
  
      [Test build #30574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30574/consoleFull) for   PR 5585 at commit [`b2cb6ad`](https://github.com/apache/spark/commit/b2cb6ad8651b56e7ad5f33c84bafb2388ccb8a0c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley closed the pull request at:

    https://github.com/apache/spark/pull/5585


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94566462
  
    `prediction` sounds too general here, and I don't know what should go into this package. Many models can make predictions, but only tree nodes are under `prediction` now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94895292
  
    Closing this pending discussions


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94582904
  
    I'm not sure what else would go under ```ml.prediction```.  I have, however, started to wonder if evaluation metrics should sit under the relevant subpackage (to make it easier for users to matches evaluators with models), in which case there might be an evaluation abstraction under ```ml.prediction```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94506032
  
    @jkbradley I don't quite understand the purpose of this PR, e.g., why `tree.Node` should live under `ml.prediction.impl`. On the high-level, what classes do we want to put under `ml.prediction`, and what do we mean by `impl`, developer API or private classes that implements some interfaces?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94509488
  
    It's mainly to reduce clutter in the spark.ml namespace.  We'll get more and more items shared between classification and regression:
    * public interfaces
      * Predictor (private now, but should be public later)
      * tree abstractions: Node, Split, models
      * ensembles: boosting & bagging
    * impl
      * tree params
      * ensemble params
    
    Once the prediction Dev APIs are made public (Predictor, etc.), then we'll have a spark.ml.prediction subpackage anyways.  At that point, tree and ensemble abstractions seem like they would belong in that subpackage, rather than in the spark.ml namespace.
    
    I'm OK if you prefer to keep these items in the .ml namespace, but if you're ambivalent, then I'd prefer fewer subpackages under spark.ml


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7000] [ml] Refactor prediction and tree...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5585#issuecomment-94510621
  
    One more thought: Later on, I could imagine us having other types of trees, such as for hierarchical clustering.  Those would live under the ```ml.clustering``` namespace


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org