Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/08/06 04:39:25 UTC

[GitHub] spark pull request: DecisionTree Python consistency update

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1798

    DecisionTree Python consistency update

    Added 6 static train methods to match the Python API, but without default arguments (the Python defaults are noted in the docs).
    
    Added factory classes for Algo and Impurity, but made private[mllib].
    
    CC: @mengxr @dorx Please let me know if there are other changes that would help with API consistency. Thanks!
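
    A rough sketch (for illustration only) of how one of the new string-typed train
    overloads might be called from Scala, based on the signatures shown in the diffs
    later in this thread; `trainingData` and the argument values here are examples,
    not defaults:

    ~~~
    val model = DecisionTree.train(
      trainingData,       // RDD[LabeledPoint]
      "classification",   // algo
      2,                  // numClassesForClassification
      Map[Int, Int](),    // categoricalFeaturesInfo (empty = no categorical features)
      "gini",             // impurity
      4,                  // maxDepth
      100)                // maxBins
    ~~~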

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-python-consistency

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1798.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1798
    
----
commit eaf84c0cdcf2eee5c00addbc4c73d24aa90e68b8
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-05T23:36:21Z

    Added DecisionTree static train() methods to match the Python API, but without default parameters

commit c69985063c22b1802d7755bd6b12e1e8cd76ba74
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-05T23:46:05Z

    a few doc comments

commit e35866176f69ac408fa422cd6c09d6a2db6e9435
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-06T00:43:22Z

    DecisionTree API change:
    * Added 6 static train methods to match the Python API, but without default arguments (the Python defaults are noted in the docs).
    
    Added factory classes for Algo and Impurity, but made private[mllib].

commit fe6dbfad5f2734eb6cb54ef1708312df3e67b8fc
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-06T02:34:06Z

    removed unnecessary imports

----




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1798




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51401271
  
    QA tests have started for PR 1798. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18054/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15853908
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurities.scala ---
    @@ -0,0 +1,32 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +/**
    + * Factory class for Impurity types.
    + */
    +private[mllib] object Impurities {
    +
    +  def stringToImpurity(name: String): Impurity = name match {
    --- End diff --
    
    ditto: `fromString`
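
    For context, a minimal sketch of what the suggested `fromString` factory might
    look like (the match cases use the standard MLlib impurities; the exact body is
    an assumption, not part of this diff):

    ~~~
    private[mllib] object Impurities {
      def fromString(name: String): Impurity = name match {
        case "gini" => Gini
        case "entropy" => Entropy
        case "variance" => Variance
        case _ => throw new IllegalArgumentException(s"Did not recognize Impurity name: $name")
      }
    }
    ~~~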




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15892945
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurities.scala ---
    @@ -0,0 +1,32 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +/**
    + * Factory class for Impurity types.
    --- End diff --
    
    maybe this is better: `Factory for Impurity instances.`




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51287686
  
    QA tests have started for PR 1798. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17980/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51379307
  
    QA tests have started for PR 1798. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18037/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854074
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val algoType = Algo.stringToAlgo(algo)
    +    val impurityType = Impurities.stringToImpurity(impurity)
    +    train(input, algoType, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   * This version is Java-friendly, taking a Java map for categoricalFeaturesInfo.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    train(input, algo, numClassesForClassification,
    +      categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
    +      impurity, maxDepth, maxBins)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model for binary or multiclass classification.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              Labels should take values {0, 1, ..., numClasses-1}.
    +   * @param numClassesForClassification number of classes for classification.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   *                                 (default Python value = {}, i.e., no categorical features)
    --- End diff --
    
    We don't need to mention Python's default value here. Btw, having the empty map as the default value may be error-prone. What do you think?




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15892856
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val algoType = Algo.stringToAlgo(algo)
    +    val impurityType = Impurities.stringToImpurity(impurity)
    +    train(input, algoType, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   * This version is Java-friendly, taking a Java map for categoricalFeaturesInfo.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    train(input, algo, numClassesForClassification,
    +      categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
    +      impurity, maxDepth, maxBins)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model for binary or multiclass classification.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              Labels should take values {0, 1, ..., numClasses-1}.
    +   * @param numClassesForClassification number of classes for classification.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   *                                 (default Python value = {}, i.e., no categorical features)
    --- End diff --
    
    Yes, that is helpful. It is something like the schema inference in SQL.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51386956
  
    LGTM except minor inline comments.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51406843
  
    QA results for PR 1798:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18054/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15898657
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +309,93 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model for binary or multiclass classification.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              Labels should take values {0, 1, ..., numClasses-1}.
    +   * @param numClassesForClassification number of classes for classification.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity Criterion used for information gain calculation.
    +   *                 Supported values: "gini" (recommended) or "entropy".
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   *                  (suggested value: 4)
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (suggested value: 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def trainClassifier(
    +      input: RDD[LabeledPoint],
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val impurityType = Impurities.fromString(impurity)
    +    train(input, Classification, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
    +   */
    +  def trainClassifier(
    +      input: RDD[LabeledPoint],
    --- End diff --
    
    `RDD` -> `JavaRDD`. Sorry I missed this in the first pass. 
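
    A hedged sketch of what the Java-friendly overload might look like after that
    change; the delegation via `input.rdd` and the imports are assumptions about the
    implementation, not part of the diff:

    ~~~
    import scala.collection.JavaConverters._
    import org.apache.spark.api.java.JavaRDD

    def trainClassifier(
        input: JavaRDD[LabeledPoint],
        numClassesForClassification: Int,
        categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
        impurity: String,
        maxDepth: Int,
        maxBins: Int): DecisionTreeModel = {
      // JavaRDD wraps a Scala RDD (exposed via .rdd), so delegate to the Scala overload.
      trainClassifier(input.rdd, numClassesForClassification,
        categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
        impurity, maxDepth, maxBins)
    }
    ~~~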




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854129
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -1334,10 +1519,10 @@ object DecisionTree extends Serializable with Logging {
        *   (a) For multiclass classification with a low-arity feature
        *       (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
        *       the feature is split based on subsets of categories.
    -   *       There are 2^(maxFeatureValue - 1) - 1 splits.
    +   *       There are math.pow(2, (maxFeatureValue - 1) - 1) splits.
        *   (b) For regression and binary classification,
        *       and for multiclass classification with a high-arity feature,
    --- End diff --
    
    `,` -> `.`




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854029
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val algoType = Algo.stringToAlgo(algo)
    +    val impurityType = Impurities.stringToImpurity(impurity)
    +    train(input, algoType, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   * This version is Java-friendly, taking a Java map for categoricalFeaturesInfo.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    --- End diff --
    
    ditto: remove `Python` and change `default` to `recommended` (and in other places)




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51403014
  
    QA results for PR 1798:
    - This patch PASSES unit tests.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18044/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51397327
  
    @jkbradley Could you try to merge the latest master? It seems that there are conflicts.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854160
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurities.scala ---
    @@ -0,0 +1,32 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +/**
    + * Factory class for Impurity types.
    --- End diff --
    
    minor: `Factory for Impurities.`




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15891638
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -1334,10 +1519,10 @@ object DecisionTree extends Serializable with Logging {
        *   (a) For multiclass classification with a low-arity feature
        *       (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
        *       the feature is split based on subsets of categories.
    -   *       There are 2^(maxFeatureValue - 1) - 1 splits.
    +   *       There are math.pow(2, (maxFeatureValue - 1) - 1) splits.
        *   (b) For regression and binary classification,
        *       and for multiclass classification with a high-arity feature,
    -   *       there is one split per category.
    +   *
     
    --- End diff --
    
    Oops, this was an accidental deletion; I'll fix the doc.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15853946
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    --- End diff --
    
    provide a list of candidates?




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51392451
  
    Hopefully all done now, after Jenkins.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15853902
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Algo.scala ---
    @@ -27,4 +27,10 @@ import org.apache.spark.annotation.Experimental
     object Algo extends Enumeration {
       type Algo = Value
       val Classification, Regression = Value
    +
    +  private[mllib] def stringToAlgo(name: String): Algo = name match {
    --- End diff --
    
    minor: `stringToAlgo` -> `fromString`? In the code, it might be more natural to read
    
    ~~~
    Algo.fromString("classification")
    ~~~
    
    than
    
    ~~~
    Algo.stringToAlgo("classification")
    ~~~




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854139
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -1334,10 +1519,10 @@ object DecisionTree extends Serializable with Logging {
        *   (a) For multiclass classification with a low-arity feature
        *       (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
        *       the feature is split based on subsets of categories.
    -   *       There are 2^(maxFeatureValue - 1) - 1 splits.
    +   *       There are math.pow(2, (maxFeatureValue - 1) - 1) splits.
        *   (b) For regression and binary classification,
        *       and for multiclass classification with a high-arity feature,
    -   *       there is one split per category.
    +   *
     
    --- End diff --
    
    Could you help remove this empty line? Thanks!




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854238
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    --- End diff --
    
    minor: Consistency across languages is not the only reason to add this method; using simpler types is also more user-friendly. I would suggest removing `, for consistency with Python API` and adding `We recommend using trainClassifier and trainRegressor` to the doc of the previous `train` method that takes `Algo` and `Impurity`.
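
    For example, the recommended entry points would look roughly like this in user
    code (signatures taken from the diffs elsewhere in this thread; `trainingData`
    and the argument values are illustrative assumptions):

    ~~~
    // Binary classification with one categorical feature (feature 0 has 3 categories).
    val classifier = DecisionTree.trainClassifier(
      trainingData, 2, Map(0 -> 3), "gini", 4, 100)

    // Regression with no categorical features.
    val regressor = DecisionTree.trainRegressor(
      trainingData, Map.empty[Int, Int], "variance", 4, 100)
    ~~~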




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15898661
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +309,93 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model for binary or multiclass classification.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              Labels should take values {0, 1, ..., numClasses-1}.
    +   * @param numClassesForClassification number of classes for classification.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity Criterion used for information gain calculation.
    +   *                 Supported values: "gini" (recommended) or "entropy".
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   *                  (suggested value: 4)
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (suggested value: 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def trainClassifier(
    +      input: RDD[LabeledPoint],
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val impurityType = Impurities.fromString(impurity)
    +    train(input, Classification, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#trainClassifier]]
    +   */
    +  def trainClassifier(
    +      input: RDD[LabeledPoint],
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    trainClassifier(input, numClassesForClassification,
    +      categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
    +      impurity, maxDepth, maxBins)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model for regression.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              Labels are real numbers.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity Criterion used for information gain calculation.
    +   *                 Supported values: "variance".
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   *                  (suggested value: 4)
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (suggested value: 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def trainRegressor(
    +      input: RDD[LabeledPoint],
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val impurityType = Impurities.fromString(impurity)
    +    train(input, Regression, impurityType, maxDepth, 0, maxBins, Sort, categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#trainRegressor]]
    +   */
    +  def trainRegressor(
    +      input: RDD[LabeledPoint],
    --- End diff --
    
    ditto: `RDD` -> `JavaRDD`.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15891644
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurities.scala ---
    @@ -0,0 +1,32 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impurity
    +
    +/**
    + * Factory class for Impurity types.
    --- End diff --
    
    Impurity?  or Impurities?




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15891632
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val algoType = Algo.stringToAlgo(algo)
    +    val impurityType = Impurities.stringToImpurity(impurity)
    +    train(input, algoType, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   * This version is Java-friendly, taking a Java map for categoricalFeaturesInfo.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    train(input, algo, numClassesForClassification,
    +      categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
    +      impurity, maxDepth, maxBins)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model for binary or multiclass classification.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              Labels should take values {0, 1, ..., numClasses-1}.
    +   * @param numClassesForClassification number of classes for classification.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   *                                 (default Python value = {}, i.e., no categorical features)
    --- End diff --
    
I agree it is error-prone; I can remove that default from Python.  At some point, though, I'd like us to keep an empty map as the default and intelligently guess whether a feature is categorical based on its number of distinct values, in order to support users who are very new to data analysis.
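
For readers following the thread, here is a rough sketch of how the new string-based overload quoted above would be called from Scala. The SparkContext `sc`, the toy data, and the parameter values are illustrative assumptions, not part of the patch:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    // Toy training set: feature 0 is categorical with 3 categories {0, 1, 2},
    // feature 1 is continuous.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.2)),
      LabeledPoint(1.0, Vectors.dense(2.0, 0.5)),
      LabeledPoint(0.0, Vectors.dense(1.0, 0.9)),
      LabeledPoint(1.0, Vectors.dense(2.0, 0.3)),
      LabeledPoint(0.0, Vectors.dense(0.0, 1.5)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.4))))

    val model = DecisionTree.train(
      data,
      "classification",  // algo, mapped via Algo.stringToAlgo
      2,                 // numClassesForClassification
      Map(0 -> 3),       // categoricalFeaturesInfo: feature 0 has arity 3
      "gini",            // impurity, mapped via Impurities.stringToImpurity
      4,                 // maxDepth
      4)                 // maxBins (Python default is 100; kept small for the tiny toy set)

    // The returned DecisionTreeModel can then be used for prediction:
    val prediction = model.predict(Vectors.dense(1.0, 0.7))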




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51386191
  
    QA results for PR 1798:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18037/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15853962
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    --- End diff --
    
    It is a little weird to reference the Python default value here. We can change it to `suggested value: 100`.
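
    For concreteness, a sketch of the revised doc lines under that suggestion (my reading of it, not committed wording):

        * @param maxBins maximum number of bins used for splitting features
        *                (suggested value: 100)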




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51393123
  
    QA tests have started for PR 1798. This patch DID NOT merge cleanly! <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18044/consoleFull




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854094
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -1334,10 +1519,10 @@ object DecisionTree extends Serializable with Logging {
        *   (a) For multiclass classification with a low-arity feature
        *       (i.e., if isMulticlass && isSpaceSufficientForAllCategoricalSplits),
        *       the feature is split based on subsets of categories.
    -   *       There are 2^(maxFeatureValue - 1) - 1 splits.
    +   *       There are math.pow(2, (maxFeatureValue - 1) - 1) splits.
        *   (b) For regression and binary classification,
        *       and for multiclass classification with a high-arity feature,
    --- End diff --
    
    `,` -> `.`
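
    As a small worked example of the split count referenced above: following the original doc line, a low-arity categorical feature with maxFeatureValue = 4 in the multiclass case yields 2^(4 - 1) - 1 = 7 candidate subset splits. A quick check in the Scala REPL (illustrative only):

        // Subset splits for a low-arity categorical feature, multiclass case:
        // 2^(maxFeatureValue - 1) - 1, per the original doc line.
        val maxFeatureValue = 4
        val numSplits = (1 << (maxFeatureValue - 1)) - 1  // = 7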




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51434319
  
    LGTM. Merged into both master and branch-1.1.




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1798#discussion_r15854011
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -300,6 +293,198 @@ object DecisionTree extends Serializable with Logging {
         new DecisionTree(strategy).train(input)
       }
     
    +  /**
    +   * Method to train a decision tree model.
    +   * The method supports binary and multiclass classification and regression.
    +   * This version takes basic types, for consistency with Python API.
    +   *
    +   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    +   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    +   *              For regression, labels are real numbers.
    +   * @param algo "classification" or "regression"
    +   * @param numClassesForClassification number of classes for classification. Default value of 2.
    +   * @param categoricalFeaturesInfo Map storing arity of categorical features.
    +   *                                E.g., an entry (n -> k) indicates that feature n is categorical
    +   *                                with k categories indexed from 0: {0, 1, ..., k-1}.
    +   * @param impurity criterion used for information gain calculation
    +   * @param maxDepth Maximum depth of the tree.
    +   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    +   * @param maxBins maximum number of bins used for splitting features
    +   *                 (default Python value = 100)
    +   * @return DecisionTreeModel that can be used for prediction
    +   */
    +  def train(
    +      input: RDD[LabeledPoint],
    +      algo: String,
    +      numClassesForClassification: Int,
    +      categoricalFeaturesInfo: Map[Int, Int],
    +      impurity: String,
    +      maxDepth: Int,
    +      maxBins: Int): DecisionTreeModel = {
    +    val algoType = Algo.stringToAlgo(algo)
    +    val impurityType = Impurities.stringToImpurity(impurity)
    +    train(input, algoType, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
    +      categoricalFeaturesInfo)
    +  }
    +
    +  /**
    +   * Method to train a decision tree model.
    --- End diff --
    
    minor: We can say `Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#train]]` and remove the doc, so we only need to maintain a single copy of the doc.
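
    A sketch of what the trimmed Java-friendly overload could look like under that suggestion; the one-line doc is the wording proposed above, and the body is copied from the patch as quoted earlier in the thread (assuming the existing JavaConverters import in DecisionTree.scala):

        /**
         * Java-friendly API for [[org.apache.spark.mllib.tree.DecisionTree$#train]]
         */
        def train(
            input: RDD[LabeledPoint],
            algo: String,
            numClassesForClassification: Int,
            categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
            impurity: String,
            maxDepth: Int,
            maxBins: Int): DecisionTreeModel = {
          train(input, algo, numClassesForClassification,
            categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
            impurity, maxDepth, maxBins)
        }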




[GitHub] spark pull request: [SPARK-2851] [mllib] DecisionTree Python consi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1798#issuecomment-51290594
  
    QA results for PR 1798:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17980/consoleFull

