You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/07/25 03:01:01 UTC

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1582

    [mllib] Decision Tree API update and multiclass bug fix

    Summary:
     (1) Split DecisionTree API into separate Classifier and Regressor classes.
     (2) Bug fixes for recent multiclass PR (https://github.com/apache/spark/pull/886)
    
    Details on (1) API:
    
    (1a) Split classes:  E.g.: DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
    (1b) Included print() function for human-readable model descriptions
    (1c) Renamed Strategy to *Params.  Changed to take strings instead of special types.
    (1d) Made configuration classes (Impurity, QuantileStrategy) private to mllib.
    (1e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart.
    (1f) Removed static train() functions in favor of using Params classes.
    (1g) Introduced  DatasetInfo class for metadata.
    
    Details on (2) bug fixes:
    
    (2a) Inconsistent aggregate (agg) indexing for unordered features.
    (2b) Fixed gain calculations for edge cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark decisiontree-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1582.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1582
    
----
commit 929f0e648962fd0e0529ac2f40452c7302eed733
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-18T21:34:59Z

    updating DT APIf

commit 29e29b8c132fd08649d71807d60dc1eb369a6ea5
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-18T21:54:07Z

    Merging multiclass DT PR, plus others, into branch with updates to DT API.

commit 20fc8057e912c8cc1266cbb39ce0285907e7356b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-19T23:37:50Z

    Mostly done with DecisionTree API re-config.  Still need to update DecisionTreeRegressor class,object, update docs, tests and examples.

commit 0ced13a5773e2973042a580a03ab4a9457fe3fe8
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-23T20:07:55Z

    Major changes to DecisionTree API and internals.  Unit tests work.  Still need to update documentation.
    
    Split classes:
    * DecisionTree --> DecisionTreeClassifier and DecisionTreeRegressor
    * DecisionTreeModel --> DecisionTreeClassifierModel, DecisionTreeRegressorModel
    * Super-classes DecisionTree, DecisionTreeModel are private to mllib.
    
    Included print() function for human-readable model descriptions
    * For: DecisionTreeClassifierModel, DecisionTreeRegressorModel, Node
    
    parameters (used to be named Strategy)
    * Split into: DTParams, DTClassifierParams, DTRegressorParams.
    * Added defaultParams() method to DecisionTreeClassifier/Regressor.
    * impurity
    ** Made private to mllib package.
    ** Split Impurity into ClassifierImpurity, RegressorImpurity
    ** Added factories: ClassifierImpurities, RegressorImpurities
    * QuantileStrategy: Added factory QuantileStrategies
    * maxDepth: Changed meaning by 1.  Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes.  This matches scikit-learn and rpart.
    
    train() functions:
    * Changed to use DatasetInfo class for metadata.
    * Eliminated many of the static train() functions to prevent users from needing to remember the order of long lists of parameters.
    
    DecisionTree internals:
    * renamed numSplits to numBins (since it was a duplicate name)

commit 4ba347fa2bce4b714478680a10442d26e6972ffc
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-23T20:10:21Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-api

commit a853bfc1929e9d1fb56d955241c827fd2a5c1351
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-23T20:25:05Z

    Last non-merge commit said it changed the maxDepth meaning, but it did not.
    This one implements this change:
    
    maxDepth: Changed meaning by 1.  Previously, depth = 1 meant 1 leaf node; now it means 1 internal and 2 leaf nodes.  This matches scikit-learn and rpart.
    Internally, this meant replacing: maxDepth <— maxDepth+1.
    In tests, decremented maxDepth by 1.

commit 45068442dbcf36548d32001d60f9d4bda68c6a87
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-24T01:06:36Z

    Changed all config/impurity classes/objects to be private[mllib].
    Changed Params classes to take strings instead of special types.
    Made impurity names lists publicly accessible via Params classes.
    Simplified impurity factories.

commit b6b0809249a81e950f87b0a7f2c389f6c5d08f98
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-24T01:07:26Z

    removed mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java since it fails currently

commit a2a93115a1f2106e13bb122589a28669310d19f5
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-24T01:07:26Z

    removed mllib/src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java since it fails currently
    
    Comments which should have been added to previous commit:
    
    Fixed one test in DecisionTreeSuite to undo a change in previous commit (“stump with categorical variables for multiclass classification”).  Reverted impurity from Entropy back to Gini.
    
    Java compatibility:
    * Changed non-static train() methods’ names to run() to avoid conflicts with static train() methods in Java.
    * Added setter functions to *Params classes.

commit 0cb9866ab5a7e70663358263aea3ae1136c3b19b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-24T01:09:39Z

    Merge branch 'decisiontree-api' of github.com:jkbradley/spark into decisiontree-api

commit 3ff5027c8fc7cd3e5a84233ceb763dc905ec6cc0
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-24T22:31:04Z

    Bug fix: Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).
    * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
    ** featureValue was from arr (so it was a feature value)
    ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
    * The rest of the code indexed agg as (node, feature, binIndex, label).
    * Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.
    
    Unit tests in DecisionTreeSuite
    * Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
    * Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.
    
    Bug fix: calculateGainForSplit (for classification):
    * It used to return dummy prediction values when either the right or left children had 0 weight.  These were incorrect for multiclass classification.  It has been corrected.
    
    Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).
    
    Small updates to documentation and coding style.

commit 3ba5b4c5692bc0e769a0fb76f382c32bf3db6292
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-07-25T00:17:07Z

    Merge remote-tracking branch 'upstream/master' into decisiontree-api

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50536367
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17378/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50109448
  
    It looks like the Jenkins failures are MIMA issues; I'll work on fixing them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50111962
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17167/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15498592
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/DTParams.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.configuration
    +
    +import scala.beans.BeanProperty
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Stores configuration options for DecisionTree construction.
    + * @param maxDepth Maximum depth of the tree.
    + *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    + * @param maxBins maximum number of bins used for splitting features
    + * @param quantileStrategy algorithm for calculating quantiles
    + * @param maxMemoryInMB maximum memory in MB allocated to histogram aggregation. Default value is
    + *                      128 MB.
    + */
    +@Experimental
    +abstract class DTParams (
    --- End diff --
    
    Should this class be private to mllib.tree? If yes, do we need bean properties? I am assuming the ```@BeanProperty``` is being used for java interoperability. But we might not need to expose this class directly to the users.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50111307
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17170/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50532970
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17373/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50520143
  
    QA results for PR 1582:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17366/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50097415
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17148/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15539977
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -19,48 +19,60 @@ package org.apache.spark.mllib.tree
     
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.Logging
    +import org.apache.spark.mllib.rdd.DatasetInfo
     import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.tree.configuration.Strategy
    -import org.apache.spark.mllib.tree.configuration.Algo._
    +import org.apache.spark.mllib.tree.configuration.DTParams
     import org.apache.spark.mllib.tree.configuration.FeatureType._
    -import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
    -import org.apache.spark.mllib.tree.impurity.Impurity
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategies
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategy
     import org.apache.spark.mllib.tree.model._
     import org.apache.spark.rdd.RDD
     import org.apache.spark.util.random.XORShiftRandom
     
    +
     /**
      * :: Experimental ::
    - * A class that implements a decision tree algorithm for classification and regression. It
    - * supports both continuous and categorical features.
    - * @param strategy The configuration parameters for the tree algorithm which specify the type
    - *                 of algorithm (classification, regression, etc.), feature type (continuous,
    - *                 categorical), depth of the tree, quantile calculation strategy, etc.
    + * An abstract class for decision tree algorithms for classification and regression.
    + * It supports both continuous and categorical features.
    + * @param params The configuration parameters for the tree algorithm.
      */
     @Experimental
    -class DecisionTree (private val strategy: Strategy) extends Serializable with Logging {
    +private[mllib] abstract class DecisionTree[M <: DecisionTreeModel] (params: DTParams)
    +  extends Serializable with Logging {
     
    -  /**
    +  protected final val InvalidBinIndex = -1
    +
    +  // depth of the decision tree
    +  protected val maxDepth: Int = params.maxDepth
    +
    +  protected val maxBins: Int = params.maxBins
    +
    +  protected val quantileStrategy: QuantileStrategy.QuantileStrategy =
    +    QuantileStrategies.strategy(params.quantileStrategy)
    +
    +  protected val maxMemoryInMB: Int = params.maxMemoryInMB
    +
    +    /**
        * Method to train a decision tree model over an RDD
        * @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] used as training data
    -   * @return a DecisionTreeModel that can be used for prediction
    +   * @param datasetInfo  Dataset metadata.
    +   * @return top node of a DecisionTreeModel
        */
    -  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
    +  protected def trainSub(
    --- End diff --
    
    Not obvious, IMO.  @mengxr and I talked and decided to call the learning functions "run()" in the class and "train()" in the object.  When I tried naming it "train()" in both the class and object, it caused name conflicts when calling code from Java.  I added the suffix "Sub" here because the code returns a node instead of a model, so it does not fit the "run" or "train" signature.  I am renaming it to runSub() since it is called from run().


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley closed the pull request at:

    https://github.com/apache/spark/pull/1582


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50412170
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17306/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50109411
  
    @manishamde  I do realize it's a big change, and I hope it does not cause too much trouble for the other methods!  The functionality should be the same, and the internals are almost identical (mostly moving around code, with no major duplication), so performance should not change much.  (I do have some ideas for future optimizations, but we will push the API update through first.)  I appreciate your thoughts on the update!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50407630
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17306/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50406397
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17305/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50513739
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17366/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15497681
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/DTClassifierParams.scala ---
    @@ -0,0 +1,67 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.configuration
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.tree.impurity.ClassificationImpurities
    +
    +/**
    + * :: Experimental ::
    + * Stores all the configuration options for DecisionTreeClassifier construction
    + * @param impurity Criterion used for information gain calculation.
    + *                 Currently supported: "gini", "entropy"
    + * @param maxDepth Maximum depth of the tree.
    + *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    + * @param maxBins maximum number of bins used for splitting features
    + * @param quantileStrategy algorithm for calculating quantiles
    + * @param maxMemoryInMB maximum memory in MB allocated to histogram aggregation. Default value is
    + *                      128 MB.
    + */
    +@Experimental
    +class DTClassifierParams (
    +    var impurity: String = "gini",
    +    maxDepth: Int = 4,
    +    maxBins: Int = 100,
    +    quantileStrategy: String = "sort",
    +    maxMemoryInMB: Int = 128)
    +  extends DTParams(maxDepth, maxBins, quantileStrategy, maxMemoryInMB) {
    +
    +  def getImpurity: String = this.impurity
    +
    +  def setImpurity(impurity: String) = {
    +    if (!ClassificationImpurities.nameToImpurityMap.contains(impurity)) {
    +      throw new IllegalArgumentException(s"Bad impurity parameter for classification: $impurity")
    --- End diff --
    
    Mentioning supported impurities might help since such errors are generally typos.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50107456
  
      @mengxr This does not have a specific JIRA.  (It does not solve the Python API JIRA [SPARK-2478] yet.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50107147
  
    @jkbradley Could you add the JIRA number to the PR title like `[SPARK-####][MLLIB]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50114354
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17170/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50532688
  
    Just updated with a few improvements.  Main change was for print() methods, to use toString() instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15498917
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -19,48 +19,60 @@ package org.apache.spark.mllib.tree
     
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.Logging
    +import org.apache.spark.mllib.rdd.DatasetInfo
     import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.tree.configuration.Strategy
    -import org.apache.spark.mllib.tree.configuration.Algo._
    +import org.apache.spark.mllib.tree.configuration.DTParams
     import org.apache.spark.mllib.tree.configuration.FeatureType._
    -import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
    -import org.apache.spark.mllib.tree.impurity.Impurity
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategies
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategy
     import org.apache.spark.mllib.tree.model._
     import org.apache.spark.rdd.RDD
     import org.apache.spark.util.random.XORShiftRandom
     
    +
     /**
      * :: Experimental ::
    - * A class that implements a decision tree algorithm for classification and regression. It
    - * supports both continuous and categorical features.
    - * @param strategy The configuration parameters for the tree algorithm which specify the type
    - *                 of algorithm (classification, regression, etc.), feature type (continuous,
    - *                 categorical), depth of the tree, quantile calculation strategy, etc.
    + * An abstract class for decision tree algorithms for classification and regression.
    + * It supports both continuous and categorical features.
    + * @param params The configuration parameters for the tree algorithm.
      */
     @Experimental
    -class DecisionTree (private val strategy: Strategy) extends Serializable with Logging {
    +private[mllib] abstract class DecisionTree[M <: DecisionTreeModel] (params: DTParams)
    +  extends Serializable with Logging {
     
    -  /**
    +  protected final val InvalidBinIndex = -1
    +
    +  // depth of the decision tree
    +  protected val maxDepth: Int = params.maxDepth
    +
    +  protected val maxBins: Int = params.maxBins
    +
    +  protected val quantileStrategy: QuantileStrategy.QuantileStrategy =
    +    QuantileStrategies.strategy(params.quantileStrategy)
    +
    +  protected val maxMemoryInMB: Int = params.maxMemoryInMB
    +
    +    /**
        * Method to train a decision tree model over an RDD
        * @param input RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] used as training data
    -   * @return a DecisionTreeModel that can be used for prediction
    +   * @param datasetInfo  Dataset metadata.
    +   * @return top node of a DecisionTreeModel
        */
    -  def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
    +  protected def trainSub(
    --- End diff --
    
    Not sure why the ```train``` method needs to be renamed. Actually, I couldn't understand what ```Sub``` in ```trainSub``` means. :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50513428
  
    I just submitted an update based on the comments from @manishamde (Thank you for them!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15539922
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/DTParams.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.configuration
    +
    +import scala.beans.BeanProperty
    +
    +import org.apache.spark.annotation.Experimental
    +
    +/**
    + * :: Experimental ::
    + * Stores configuration options for DecisionTree construction.
    + * @param maxDepth Maximum depth of the tree.
    + *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    + * @param maxBins maximum number of bins used for splitting features
    + * @param quantileStrategy algorithm for calculating quantiles
    + * @param maxMemoryInMB maximum memory in MB allocated to histogram aggregation. Default value is
    + *                      128 MB.
    + */
    +@Experimental
    +abstract class DTParams (
    --- End diff --
    
    I agree that users should not really need to use DTParams, so I will make it private[mllib].  But @BeanProperty is nice since it generates getters/setters for the subclasses DTClassifierParams and DTRegressorParams as well (for Java, as you figured).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50107957
  
    @jkbradley Awesome! 
    
    A couple of quick thoughts:
    + I am not completely convinced about the strategy for1a (I was expecting thin wrappers for regression and classification tree) but I guess that was expected considering I am very familiar with the existing code. I will sleep over it and get back. :-) To give a historical perspective, we had a similar split implementations for regression and classification in the beginning that we decided to combine into one. Perhaps, it's the right time to split them again. @etrain was also hinting at that in the multiclass review.
    + I have ensemble RF and Boosting implementations close-to-ready which will need major refactoring or rewriting from scratch considering the magnitude of this PR. That's fine but we should try and get it accepted ASAP. I promise prompt piecemeal reviews.
    + We should perform regression testing and compare with the 1.0 release.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50097374
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17148/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50406491
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17305/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50533389
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17373/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50542099
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17378/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50169783
  
    QA results for PR 1582:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17184/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15539970
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -19,48 +19,60 @@ package org.apache.spark.mllib.tree
     
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.Logging
    +import org.apache.spark.mllib.rdd.DatasetInfo
     import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.tree.configuration.Strategy
    -import org.apache.spark.mllib.tree.configuration.Algo._
    +import org.apache.spark.mllib.tree.configuration.DTParams
     import org.apache.spark.mllib.tree.configuration.FeatureType._
    -import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
    -import org.apache.spark.mllib.tree.impurity.Impurity
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategies
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategy
     import org.apache.spark.mllib.tree.model._
     import org.apache.spark.rdd.RDD
     import org.apache.spark.util.random.XORShiftRandom
     
    +
     /**
      * :: Experimental ::
    - * A class that implements a decision tree algorithm for classification and regression. It
    - * supports both continuous and categorical features.
    - * @param strategy The configuration parameters for the tree algorithm which specify the type
    - *                 of algorithm (classification, regression, etc.), feature type (continuous,
    - *                 categorical), depth of the tree, quantile calculation strategy, etc.
    + * An abstract class for decision tree algorithms for classification and regression.
    + * It supports both continuous and categorical features.
    + * @param params The configuration parameters for the tree algorithm.
      */
     @Experimental
    -class DecisionTree (private val strategy: Strategy) extends Serializable with Logging {
    +private[mllib] abstract class DecisionTree[M <: DecisionTreeModel] (params: DTParams)
    --- End diff --
    
    I agree; I will leave it out.  We may put those parameters back in once the MLlib class hierarchy becomes more developed (for standardized run/train functions).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15498862
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -19,48 +19,60 @@ package org.apache.spark.mllib.tree
     
     import org.apache.spark.annotation.Experimental
     import org.apache.spark.Logging
    +import org.apache.spark.mllib.rdd.DatasetInfo
     import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.tree.configuration.Strategy
    -import org.apache.spark.mllib.tree.configuration.Algo._
    +import org.apache.spark.mllib.tree.configuration.DTParams
     import org.apache.spark.mllib.tree.configuration.FeatureType._
    -import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
    -import org.apache.spark.mllib.tree.impurity.Impurity
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategies
    +import org.apache.spark.mllib.tree.configuration.QuantileStrategy
     import org.apache.spark.mllib.tree.model._
     import org.apache.spark.rdd.RDD
     import org.apache.spark.util.random.XORShiftRandom
     
    +
     /**
      * :: Experimental ::
    - * A class that implements a decision tree algorithm for classification and regression. It
    - * supports both continuous and categorical features.
    - * @param strategy The configuration parameters for the tree algorithm which specify the type
    - *                 of algorithm (classification, regression, etc.), feature type (continuous,
    - *                 categorical), depth of the tree, quantile calculation strategy, etc.
    + * An abstract class for decision tree algorithms for classification and regression.
    + * It supports both continuous and categorical features.
    + * @param params The configuration parameters for the tree algorithm.
      */
     @Experimental
    -class DecisionTree (private val strategy: Strategy) extends Serializable with Logging {
    +private[mllib] abstract class DecisionTree[M <: DecisionTreeModel] (params: DTParams)
    --- End diff --
    
    Couldn't find usage for the generic parameter ```M```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50108750
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50098446
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50100883
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50110901
  
    QA results for PR 1582:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50109829
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17167/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50415207
  
    QA results for PR 1582:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17309/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50164374
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17184/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50108039
  
    @jkbradley You might want to create a JIRA for this one and ask Matei to assign to you. It's a big enough change to require one. :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50685573
  
    Closing this PR since we decided not to change the public DT API on this release.  Will submit bug fixes as another PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1582#issuecomment-50410838
  
    QA tests have started for PR 1582. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17309/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2692] [mllib] Decision Tree API update ...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1582#discussion_r15497700
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/DTRegressorParams.scala ---
    @@ -0,0 +1,67 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.configuration
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.tree.impurity.RegressionImpurities
    +
    +/**
    + * :: Experimental ::
    + * Stores all the configuration options for DecisionTreeRegressor construction
    + * @param impurity Criterion used for information gain calculation.
    + *                 Currently supported: "variance"
    + * @param maxDepth Maximum depth of the tree.
    + *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    + * @param maxBins maximum number of bins used for splitting features
    + * @param quantileStrategy algorithm for calculating quantiles
    + * @param maxMemoryInMB maximum memory in MB allocated to histogram aggregation. Default value is
    + *                      128 MB.
    + */
    +@Experimental
    +class DTRegressorParams (
    +    var impurity: String = "variance",
    +    maxDepth: Int = 4,
    +    maxBins: Int = 100,
    +    quantileStrategy: String = "sort",
    +    maxMemoryInMB: Int = 128)
    +  extends DTParams(maxDepth, maxBins, quantileStrategy, maxMemoryInMB) {
    +
    +  def getImpurity: String = this.impurity
    +
    +  def setImpurity(impurity: String) = {
    +    if (!RegressionImpurities.nameToImpurityMap.contains(impurity)) {
    +      throw new IllegalArgumentException(s"Bad impurity parameter for regression: $impurity")
    --- End diff --
    
    Mentioning supported impurities might help since such errors are generally typos.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---