Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/08/20 20:36:13 UTC

[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2063

    [SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples)

    Updated DecisionTree documentation, with examples for Java and Python.
    Added the same Java example to the code as well.
    CC: @mengxr  @manishamde

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2063.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2063
    
----
commit d939a926203fa443305078cb3caf573111b75359
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-20T04:02:25Z

    Updated DecisionTree documentation.  Added Java, Python examples.

commit 57eee9fa174fa3435f69c38785d9c757f3744fd9
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-20T17:34:57Z

    Created JavaDecisionTree example from example in docs, and corrected doc example as needed.

commit b9bee04d8ef538912b54a739f47ceeb11e4582b5
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-20T18:34:06Z

    Updated DT examples

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52831943
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18972/consoleFull) for PR 2063 at commit [`b9bee04`](https://github.com/apache/spark/commit/b9bee04d8ef538912b54a739f47ceeb11e4582b5).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `is used for ordering. In multiclass classification, all `$2^`
      * `public final class JavaDecisionTree `





[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16496871
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -114,35 +135,122 @@ perform classification using a decision tree using Gini impurity as an impurity
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    +import org.apache.spark.mllib.util.MLUtils
     
     // Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
     
    -// Run training algorithm to build the model
    +// Train a DecisionTree model.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +val numClasses = 2
    +val categoricalFeaturesInfo = Map[Int, Int]()
    +val impurity = "gini"
     val maxDepth = 5
    -val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
    +val maxBins = 100
    +
    +val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
    +  maxDepth, maxBins)
     
    -// Evaluate model on training examples and compute training error
    -val labelAndPreds = parsedData.map { point =>
    +// Evaluate model on training instances and compute training error
    +val labelAndPreds = data.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
    -val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    +val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
     println("Training Error = " + trainErr)
    +println("Learned classification tree model:\n" + model)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java">
    +{% highlight java %}
    +import java.util.HashMap;
    +import scala.Tuple2;
    +import org.apache.spark.api.java.function.Function2;
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.api.java.function.PairFunction;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.tree.DecisionTree;
    +import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +
    +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
    +JavaSparkContext sc = new JavaSparkContext(sparkConf);
    +
    +String datapath = "data/mllib/sample_libsvm_data.txt";
    +JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
    --- End diff --
    
    cache the data before computation?
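
    A minimal sketch of what that caching would look like in the Scala example
    (illustrative only; whether it pays off depends on the data being reused
    after training):

        import org.apache.spark.mllib.util.MLUtils

        // Cache the input: tree training makes multiple passes over the data,
        // and the example reuses the same RDD afterwards for the training error.
        val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()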




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52850446
  
    @mengxr  @manishamde  Thanks for the feedback!  I believe I've addressed all of the comments.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16505721
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,33 +85,46 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
     Section 9.2.4 in
     [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
     details). For example, for a binary classification problem with one categorical feature with three
    -categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
    -features are ordered as A followed by C followed B or A, C, B. The two split candidates are A \| C, B
    +categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical
    +features are ordered as A, C, B. The two split candidates are A \| C, B
     and A , C \| B where \| denotes the split. A similar heuristic is used for multiclass classification
    -when `$2^(M-1)-1$` is greater than the number of bins -- the impurity for each categorical feature value
    -is used for ordering.
    +when `$2^{M-1}-1$` is greater than the `maxBins` parameter: the impurity for each categorical feature value
    +is used for ordering. In multiclass classification, all `$2^{M-1}-1$` possible splits are used
    +whenever possible.
    +
    +Note that the `maxBins` parameter must be at least `$M_{max}$`, the maximum number of categories for
    --- End diff --
    
    Suggestion: M_{max} can be eliminated unless it's used elsewhere in the document. 
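
    To make the constraint concrete, a small hypothetical Scala sketch (the map
    and its values are illustrative, not from the patch): with a categorical
    feature of 10 categories, `maxBins` must be at least 10.

        // Feature 0 is categorical with 10 categories; feature 3 with 4.
        val categoricalFeaturesInfo = Map(0 -> 10, 3 -> 4)
        // maxBins must be at least the largest category count (here, 10).
        val maxBins = math.max(32, categoricalFeaturesInfo.values.max)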




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52866876
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19022/consoleFull) for PR 2063 at commit [`2dd2c19`](https://github.com/apache/spark/commit/2dd2c191233be76b445683fa8aa65fa1cc426b37).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16511804
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,109 +85,316 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
     Section 9.2.4 in
     [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
     details). For example, for a binary classification problem with one categorical feature with three
    -categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
    -features are ordered as A followed by C followed B or A, C, B. The two split candidates are A \| C, B
    -and A , C \| B where \| denotes the split. A similar heuristic is used for multiclass classification
    -when `$2^(M-1)-1$` is greater than the number of bins -- the impurity for each categorical feature value
    -is used for ordering.
    +categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical
    +features are ordered as A, C, B. The two split candidates are A \| C, B
    +and A , C \| B where \| denotes the split.
    +
    +In multiclass classification, all `$2^{M-1}-1$` possible splits are used whenever possible.
    +When `$2^{M-1}-1$` is greater than the `maxBins` parameter, we use a (heuristic) method
    +similar to the method used for binary classification and regression.
    +The `$M$` categorical feature values are ordered by impurity,
    +and the resulting `$M-1$` split candidates are considered.
     
     ### Stopping rule
     
     The recursive tree construction is stopped at a node when one of the two conditions is met:
     
    -1. The node depth is equal to the `maxDepth` training parameter
    +1. The node depth is equal to the `maxDepth` training parameter.
     2. No split candidate leads to an information gain at the node.
     
    +## Implementation details
    +
     ### Max memory requirements
     
    -For faster processing, the decision tree algorithm performs simultaneous histogram computations for all nodes at each level of the tree. This could lead to high memory requirements at deeper levels of the tree leading to memory overflow errors. To alleviate this problem, a 'maxMemoryInMB' training parameter is provided which specifies the maximum amount of memory at the workers (twice as much at the master) to be allocated to the histogram computation. The default value is conservatively chosen to be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements for a level-wise computation crosses the `maxMemoryInMB` threshold, the node training tasks at each subsequent level is split into smaller tasks.
    +For faster processing, the decision tree algorithm performs simultaneous histogram computations for
    +all nodes at each level of the tree. This could lead to high memory requirements at deeper levels
    +of the tree, leading to memory overflow errors. To alleviate this problem, a `maxMemoryInMB`
    --- End diff --
    
    "...tree, leading to..." -> "...tree, potentially leading to..."




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52862344
  
    @atalwalkar  Thanks for the comments!  I believe I've fixed the issues.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16507924
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -62,14 +69,15 @@ datasets `$D_{left}$` and `$D_{right}$` of sizes `$N_{left}$` and `$N_{right}$`,
     
     **Continuous features**
     
    -For small datasets in single machine implementations, the split candidates for each continuous
    +For small datasets in single-machine implementations, the split candidates for each continuous
     feature are typically the unique values for the feature. Some implementations sort the feature
     values and then use the ordered unique values as split candidates for faster tree calculations.
     
    -Finding ordered unique feature values is computationally intensive for large distributed
    -datasets. One can get an approximate set of split candidates by performing a quantile calculation
    -over a sampled fraction of the data. The ordered splits create "bins" and the maximum number of such
    -bins can be specified using the `maxBins` parameters.
    +Sorting feature values is expensive for large distributed datasets.
    --- End diff --
    
    That sounds a bit detailed for this overview, but it could be interesting for the blog post.  I'm about to push an update with more comments on maxBins, which is related to this but more about how the user should set parameters.
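
    As a toy illustration of the quantile idea in the quoted passage (this is
    not Spark's internal code; `featureValues` is a hypothetical RDD[Double]
    holding one continuous feature):

        // Sample a fraction of the values, sort the sample locally, and take
        // maxBins - 1 evenly spaced order statistics as split thresholds.
        // Assumes the sample is much larger than maxBins.
        val maxBins = 32
        val sampled = featureValues.sample(false, 0.1, 42).collect().sorted
        val splits = (1 until maxBins).map(i => sampled(i * sampled.length / maxBins))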




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16496981
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -114,35 +135,122 @@ perform classification using a decision tree using Gini impurity as an impurity
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    +import org.apache.spark.mllib.util.MLUtils
     
     // Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    --- End diff --
    
    Need to update the text description and change `CSV file` to `LIBSVM file`.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52846609
  
    Thanks @jkbradley 
    
    I had some minor comments that I have noted above. Apart from those, LGTM!




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16505298
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -62,14 +69,15 @@ datasets `$D_{left}$` and `$D_{right}$` of sizes `$N_{left}$` and `$N_{right}$`,
     
     **Continuous features**
     
    -For small datasets in single machine implementations, the split candidates for each continuous
    +For small datasets in single-machine implementations, the split candidates for each continuous
     feature are typically the unique values for the feature. Some implementations sort the feature
     values and then use the ordered unique values as split candidates for faster tree calculations.
     
    -Finding ordered unique feature values is computationally intensive for large distributed
    -datasets. One can get an approximate set of split candidates by performing a quantile calculation
    -over a sampled fraction of the data. The ordered splits create "bins" and the maximum number of such
    -bins can be specified using the `maxBins` parameters.
    +Sorting feature values is expensive for large distributed datasets.
    --- End diff --
    
    One could also highlight that a high cardinality of split candidates per feature (many unique values) can slow down training without a significant gain in accuracy.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16504941
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -7,20 +7,26 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Decision Tree
     * Table of contents
     {:toc}
     
    -Decision trees and their ensembles are popular methods for the machine learning tasks of
    +[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
    +and their ensembles are popular methods for the machine learning tasks of
     classification and regression. Decision trees are widely used since they are easy to interpret,
    -handle categorical variables, extend to the multiclass classification setting, do not require
    +handle categorical features, extend to the multiclass classification setting, do not require
     feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble
    -algorithms such as decision forest and boosting are among the top performers for classification and
    +algorithms such as decision forests and boosting are among the top performers for classification and
    --- End diff --
    
    should we just call them random forests instead? :-)




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16500175
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -114,35 +135,122 @@ perform classification using a decision tree using Gini impurity as an impurity
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    +import org.apache.spark.mllib.util.MLUtils
     
     // Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
     
    -// Run training algorithm to build the model
    +// Train a DecisionTree model.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +val numClasses = 2
    +val categoricalFeaturesInfo = Map[Int, Int]()
    +val impurity = "gini"
     val maxDepth = 5
    -val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
    +val maxBins = 100
    +
    +val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
    +  maxDepth, maxBins)
     
    -// Evaluate model on training examples and compute training error
    -val labelAndPreds = parsedData.map { point =>
    +// Evaluate model on training instances and compute training error
    +val labelAndPreds = data.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
    -val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    +val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
     println("Training Error = " + trainErr)
    +println("Learned classification tree model:\n" + model)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java">
    +{% highlight java %}
    +import java.util.HashMap;
    +import scala.Tuple2;
    +import org.apache.spark.api.java.function.Function2;
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.api.java.function.PairFunction;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.tree.DecisionTree;
    +import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +
    +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
    +JavaSparkContext sc = new JavaSparkContext(sparkConf);
    +
    +String datapath = "data/mllib/sample_libsvm_data.txt";
    +JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
    --- End diff --
    
    It is cached by tree training, but should we cache it here too, since it is used again for testing?




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16505882
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,33 +85,46 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
     Section 9.2.4 in
     [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
     details). For example, for a binary classification problem with one categorical feature with three
    -categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
    -features are ordered as A followed by C followed B or A, C, B. The two split candidates are A \| C, B
    +categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical
    +features are ordered as A, C, B. The two split candidates are A \| C, B
     and A , C \| B where \| denotes the split. A similar heuristic is used for multiclass classification
    -when `$2^(M-1)-1$` is greater than the number of bins -- the impurity for each categorical feature value
    -is used for ordering.
    +when `$2^{M-1}-1$` is greater than the `maxBins` parameter: the impurity for each categorical feature value
    +is used for ordering. In multiclass classification, all `$2^{M-1}-1$` possible splits are used
    +whenever possible.
    +
    +Note that the `maxBins` parameter must be at least `$M_{max}$`, the maximum number of categories for
    +any categorical feature.
     
     ### Stopping rule
     
     The recursive tree construction is stopped at a node when one of the two conditions is met:
     
    -1. The node depth is equal to the `maxDepth` training parameter
    +1. The node depth is equal to the `maxDepth` training parameter.
     2. No split candidate leads to an information gain at the node.
     
     ### Max memory requirements
     
    -For faster processing, the decision tree algorithm performs simultaneous histogram computations for all nodes at each level of the tree. This could lead to high memory requirements at deeper levels of the tree leading to memory overflow errors. To alleviate this problem, a 'maxMemoryInMB' training parameter is provided which specifies the maximum amount of memory at the workers (twice as much at the master) to be allocated to the histogram computation. The default value is conservatively chosen to be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements for a level-wise computation crosses the `maxMemoryInMB` threshold, the node training tasks at each subsequent level is split into smaller tasks.
    +For faster processing, the decision tree algorithm performs simultaneous histogram computations for
    +all nodes at each level of the tree. This could lead to high memory requirements at deeper levels
    +of the tree, leading to memory overflow errors. To alleviate this problem, a `maxMemoryInMB`
    +training parameter specifies the maximum amount of memory at the workers (twice as much at the
    +master) to be allocated to the histogram computation. The default value is conservatively chosen to
    +be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements
    +for a level-wise computation cross the `maxMemoryInMB` threshold, the node training tasks at each
    +subsequent level are split into smaller tasks.
     
     ### Practical limitations
     
     1. The implemented algorithm reads both sparse and dense data. However, it is not optimized for sparse input.
    -2. Python is not supported in this release.
    +2. Computation scales approximately linearly in the number of training instances,
    --- End diff --
    
    I think it's a feature and not a limitation! We should highlight this scaling in other sections.
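
    For readers wondering where `maxMemoryInMB` is set, a hedged sketch against
    the `Strategy` configuration class (parameter names are assumed from the
    1.x API and may differ across versions):

        import org.apache.spark.mllib.tree.DecisionTree
        import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
        import org.apache.spark.mllib.tree.impurity.Gini

        // With ample worker memory, a larger maxMemoryInMB lets each level-wise
        // pass train more nodes at once, so fewer passes over the data are needed.
        val strategy = new Strategy(algo = Algo.Classification, impurity = Gini,
          maxDepth = 5, numClassesForClassification = 2, maxMemoryInMB = 512)
        val model = DecisionTree.train(data, strategy)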




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16505453
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,33 +85,46 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
     Section 9.2.4 in
     [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
     details). For example, for a binary classification problem with one categorical feature with three
    -categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
    -features are ordered as A followed by C followed B or A, C, B. The two split candidates are A \| C, B
    +categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical
    +features are ordered as A, C, B. The two split candidates are A \| C, B
     and A , C \| B where \| denotes the split. A similar heuristic is used for multiclass classification
    -when `$2^(M-1)-1$` is greater than the number of bins -- the impurity for each categorical feature value
    -is used for ordering.
    +when `$2^{M-1}-1$` is greater than the `maxBins` parameter: the impurity for each categorical feature value
    --- End diff --
    
    My fault but this sentence looks awkward. Feel free to rephrase it. :-)




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52862719
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19022/consoleFull) for PR 2063 at commit [`2dd2c19`](https://github.com/apache/spark/commit/2dd2c191233be76b445683fa8aa65fa1cc426b37).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16506299
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -114,35 +135,122 @@ perform classification using a decision tree using Gini impurity as an impurity
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    +import org.apache.spark.mllib.util.MLUtils
     
     // Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
     
    -// Run training algorithm to build the model
    +// Train a DecisionTree model.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +val numClasses = 2
    +val categoricalFeaturesInfo = Map[Int, Int]()
    +val impurity = "gini"
     val maxDepth = 5
    -val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
    +val maxBins = 100
    +
    +val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
    +  maxDepth, maxBins)
     
    -// Evaluate model on training examples and compute training error
    -val labelAndPreds = parsedData.map { point =>
    +// Evaluate model on training instances and compute training error
    +val labelAndPreds = data.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
    -val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    +val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
     println("Training Error = " + trainErr)
    +println("Learned classification tree model:\n" + model)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java">
    +{% highlight java %}
    +import java.util.HashMap;
    +import scala.Tuple2;
    +import org.apache.spark.api.java.function.Function2;
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.api.java.function.PairFunction;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.tree.DecisionTree;
    +import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +
    +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
    +JavaSparkContext sc = new JavaSparkContext(sparkConf);
    +
    +String datapath = "data/mllib/sample_libsvm_data.txt";
    +JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
    --- End diff --
    
    We are calculating numClasses only in the Java example. Should we eliminate it, since it makes the already verbose Java code even more verbose? Otherwise, we need to make the same change to the Scala and Python examples.
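
    For reference, the numClasses calculation being discussed amounts to
    something like this one-liner (shown in Scala for brevity; `data` is the
    loaded RDD of LabeledPoint):

        // Infer the number of classes from the distinct labels instead of
        // hardcoding it; this costs one extra pass over the data.
        val numClasses = data.map(_.label).distinct().count().toInt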




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52856913
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19001/consoleFull) for PR 2063 at commit [`9dd1b6b`](https://github.com/apache/spark/commit/9dd1b6b6edd11035d081c425b2cc1af06a2d8442).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `In multiclass classification, all `$2^`
      * `public final class JavaDecisionTree `





[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16506492
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -114,35 +135,122 @@ perform classification using a decision tree using Gini impurity as an impurity
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    +import org.apache.spark.mllib.util.MLUtils
     
     // Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
     
    -// Run training algorithm to build the model
    +// Train a DecisionTree model.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +val numClasses = 2
    +val categoricalFeaturesInfo = Map[Int, Int]()
    +val impurity = "gini"
     val maxDepth = 5
    -val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
    +val maxBins = 100
    +
    +val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
    +  maxDepth, maxBins)
     
    -// Evaluate model on training examples and compute training error
    -val labelAndPreds = parsedData.map { point =>
    +// Evaluate model on training instances and compute training error
    +val labelAndPreds = data.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
    -val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    +val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
     println("Training Error = " + trainErr)
    +println("Learned classification tree model:\n" + model)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java">
    +{% highlight java %}
    +import java.util.HashMap;
    +import scala.Tuple2;
    +import org.apache.spark.api.java.function.Function2;
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.api.java.function.PairFunction;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.tree.DecisionTree;
    +import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +
    +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
    +JavaSparkContext sc = new JavaSparkContext(sparkConf);
    +
    +String datapath = "data/mllib/sample_libsvm_data.txt";
    +JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
    --- End diff --
    
    If we decide to cache, let's note why we are doing it via a comment. Otherwise, some users might get confused and decide to always cache before calling the tree algorithm.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16508593
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -62,14 +69,15 @@ datasets `$D_{left}$` and `$D_{right}$` of sizes `$N_{left}$` and `$N_{right}$`,
     
     **Continuous features**
     
    -For small datasets in single machine implementations, the split candidates for each continuous
    +For small datasets in single-machine implementations, the split candidates for each continuous
     feature are typically the unique values for the feature. Some implementations sort the feature
     values and then use the ordered unique values as split candidates for faster tree calculations.
     
    -Finding ordered unique feature values is computationally intensive for large distributed
    -datasets. One can get an approximate set of split candidates by performing a quantile calculation
    -over a sampled fraction of the data. The ordered splits create "bins" and the maximum number of such
    -bins can be specified using the `maxBins` parameters.
    +Sorting feature values is expensive for large distributed datasets.
    --- End diff --
    
    Sounds good.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16511872
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,109 +85,316 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
     Section 9.2.4 in
     [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
     details). For example, for a binary classification problem with one categorical feature with three
    -categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
    -features are ordered as A followed by C followed B or A, C, B. The two split candidates are A \| C, B
    -and A , C \| B where \| denotes the split. A similar heuristic is used for multiclass classification
    -when `$2^(M-1)-1$` is greater than the number of bins -- the impurity for each categorical feature value
    -is used for ordering.
    +categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical
    +features are ordered as A, C, B. The two split candidates are A \| C, B
    +and A , C \| B where \| denotes the split.
    +
    +In multiclass classification, all `$2^{M-1}-1$` possible splits are used whenever possible.
    +When `$2^{M-1}-1$` is greater than the `maxBins` parameter, we use a (heuristic) method
    +similar to the method used for binary classification and regression.
    +The `$M$` categorical feature values are ordered by impurity,
    +and the resulting `$M-1$` split candidates are considered.
     
     ### Stopping rule
     
     The recursive tree construction is stopped at a node when one of the two conditions is met:
     
    -1. The node depth is equal to the `maxDepth` training parameter
    +1. The node depth is equal to the `maxDepth` training parameter.
     2. No split candidate leads to an information gain at the node.
     
    +## Implementation details
    +
     ### Max memory requirements
     
    -For faster processing, the decision tree algorithm performs simultaneous histogram computations for all nodes at each level of the tree. This could lead to high memory requirements at deeper levels of the tree leading to memory overflow errors. To alleviate this problem, a 'maxMemoryInMB' training parameter is provided which specifies the maximum amount of memory at the workers (twice as much at the master) to be allocated to the histogram computation. The default value is conservatively chosen to be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements for a level-wise computation crosses the `maxMemoryInMB` threshold, the node training tasks at each subsequent level is split into smaller tasks.
    +For faster processing, the decision tree algorithm performs simultaneous histogram computations for
    +all nodes at each level of the tree. This could lead to high memory requirements at deeper levels
    +of the tree, leading to memory overflow errors. To alleviate this problem, a `maxMemoryInMB`
    +training parameter specifies the maximum amount of memory at the workers (twice as much at the
    +master) to be allocated to the histogram computation. The default value is conservatively chosen to
    +be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements
    +for a level-wise computation cross the `maxMemoryInMB` threshold, the node training tasks at each
    +subsequent level are split into smaller tasks.
    +
    +Note that, if you have a large amount of memory, increasing `maxMemoryInMB` can lead to faster
    +training by requiring fewer passes over the data.
    +
    +### Binning feature values
    +
    +Increasing `maxBins` allows the algorithm to consider more split candidates and make fine-grained
    +split decisions.  However, it also increases computation and communication.
    +
    +Note that the `maxBins` parameter must be at least the maximum number of categories `$M$` for
    +any categorical feature.
    +
    +### Scaling
     
    -### Practical limitations
    +Computation scales approximately linearly in the number of training instances,
    +in the number of features, and in the `maxBins` parameter.
    +Communication scales approximately linearly in the number of features and in `maxBins`.
     
    -1. The implemented algorithm reads both sparse and dense data. However, it is not optimized for sparse input.
    -2. Python is not supported in this release.
    +The implemented algorithm reads both sparse and dense data. However, it is not optimized for sparse input.
     
     ## Examples
     
     ### Classification
     
    -The example below demonstrates how to load a CSV file, parse it as an RDD of `LabeledPoint` and then
    +The example below demonstrates how to load a
    +[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
    +parse it as an RDD of `LabeledPoint` and then
     perform classification using a decision tree using Gini impurity as an impurity measure and a
    --- End diff --
    
    "decision tree using Gini..." ->  "decision tree with Gini..."




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2063




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16511894
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,109 +85,316 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
     Section 9.2.4 in
     [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
     details). For example, for a binary classification problem with one categorical feature with three
    -categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
    -features are ordered as A followed by C followed B or A, C, B. The two split candidates are A \| C, B
    -and A , C \| B where \| denotes the split. A similar heuristic is used for multiclass classification
    -when `$2^(M-1)-1$` is greater than the number of bins -- the impurity for each categorical feature value
    -is used for ordering.
    +categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical
    +features are ordered as A, C, B. The two split candidates are A \| C, B
    +and A, C \| B where \| denotes the split.
    +
    +In multiclass classification, all `$2^{M-1}-1$` possible splits are used whenever possible.
    +When `$2^{M-1}-1$` is greater than the `maxBins` parameter, we use a (heuristic) method
    +similar to the method used for binary classification and regression.
    +The `$M$` categorical feature values are ordered by impurity,
    +and the resulting `$M-1$` split candidates are considered.
     
     ### Stopping rule
     
     The recursive tree construction is stopped at a node when one of the two conditions is met:
     
    -1. The node depth is equal to the `maxDepth` training parameter
    +1. The node depth is equal to the `maxDepth` training parameter.
     2. No split candidate leads to an information gain at the node.
     
    +## Implementation details
    +
     ### Max memory requirements
     
    -For faster processing, the decision tree algorithm performs simultaneous histogram computations for all nodes at each level of the tree. This could lead to high memory requirements at deeper levels of the tree leading to memory overflow errors. To alleviate this problem, a 'maxMemoryInMB' training parameter is provided which specifies the maximum amount of memory at the workers (twice as much at the master) to be allocated to the histogram computation. The default value is conservatively chosen to be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements for a level-wise computation crosses the `maxMemoryInMB` threshold, the node training tasks at each subsequent level is split into smaller tasks.
    +For faster processing, the decision tree algorithm performs simultaneous histogram computations for
    +all nodes at each level of the tree. This could lead to high memory requirements at deeper levels
    +of the tree, leading to memory overflow errors. To alleviate this problem, a `maxMemoryInMB`
    +training parameter specifies the maximum amount of memory at the workers (twice as much at the
    +master) to be allocated to the histogram computation. The default value is conservatively chosen to
    +be 128 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements
    +for a level-wise computation cross the `maxMemoryInMB` threshold, the node training tasks at each
    +subsequent level are split into smaller tasks.
    +
    +Note that, if you have a large amount of memory, increasing `maxMemoryInMB` can lead to faster
    +training by requiring fewer passes over the data.
    +
    +### Binning feature values
    +
    +Increasing `maxBins` allows the algorithm to consider more split candidates and make fine-grained
    +split decisions.  However, it also increases computation and communication.
    +
    +Note that the `maxBins` parameter must be at least the maximum number of categories `$M$` for
    +any categorical feature.
    +
    +### Scaling
     
    -### Practical limitations
    +Computation scales approximately linearly in the number of training instances,
    +in the number of features, and in the `maxBins` parameter.
    +Communication scales approximately linearly in the number of features and in `maxBins`.
     
    -1. The implemented algorithm reads both sparse and dense data. However, it is not optimized for sparse input.
    -2. Python is not supported in this release.
    +The implemented algorithm reads both sparse and dense data. However, it is not optimized for sparse input.
     
     ## Examples
     
     ### Classification
     
    -The example below demonstrates how to load a CSV file, parse it as an RDD of `LabeledPoint` and then
    +The example below demonstrates how to load a
    +[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
    +parse it as an RDD of `LabeledPoint` and then
     perform classification using a decision tree using Gini impurity as an impurity measure and a
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    -
    -// Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +import org.apache.spark.mllib.util.MLUtils
     
    -// Run training algorithm to build the model
    +// Load and parse the data file.
    +// Cache the data since we will use it again to compute training error.
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()
    +
    +// Train a DecisionTree model.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +val numClasses = 2
    +val categoricalFeaturesInfo = Map[Int, Int]()
    +val impurity = "gini"
     val maxDepth = 5
    -val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
    +val maxBins = 100
    +
    +val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
    +  maxDepth, maxBins)
     
    -// Evaluate model on training examples and compute training error
    -val labelAndPreds = parsedData.map { point =>
    +// Evaluate model on training instances and compute training error
    +val labelAndPreds = data.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
    -val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    +val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
     println("Training Error = " + trainErr)
    +println("Learned classification tree model:\n" + model)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java">
    +{% highlight java %}
    +import java.util.HashMap;
    +import scala.Tuple2;
    +import org.apache.spark.api.java.function.Function2;
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.api.java.function.PairFunction;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.tree.DecisionTree;
    +import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +
    +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
    +JavaSparkContext sc = new JavaSparkContext(sparkConf);
    +
    +// Load and parse the data file.
    +// Cache the data since we will use it again to compute training error.
    +String datapath = "data/mllib/sample_libsvm_data.txt";
    +JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD().cache();
    +
    +// Set parameters.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +Integer numClasses = 2;
    +HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
    +String impurity = "gini";
    +Integer maxDepth = 5;
    +Integer maxBins = 100;
    +
    +// Train a DecisionTree model for classification.
    +final DecisionTreeModel model = DecisionTree.trainClassifier(data, numClasses,
    +  categoricalFeaturesInfo, impurity, maxDepth, maxBins);
    +
    +// Evaluate model on training instances and compute training error
    +JavaPairRDD<Double, Double> predictionAndLabel =
    +  data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    +    @Override public Tuple2<Double, Double> call(LabeledPoint p) {
    +      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    +    }
    +  });
    +Double trainErr =
    +  1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    +    @Override public Boolean call(Tuple2<Double, Double> pl) {
    +      return !pl._1().equals(pl._2());
    +    }
    +  }).count() / data.count();
    +System.out.println("Training error: " + trainErr);
    +System.out.println("Learned classification tree model:\n" + model);
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark.mllib.regression import LabeledPoint
    +from pyspark.mllib.tree import DecisionTree
    +from pyspark.mllib.util import MLUtils
    +
    +# Load and parse the data file into an RDD of LabeledPoint.
    +# Cache the data since we will use it again to compute training error.
    +data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt').cache()
    +
    +# Train a DecisionTree model.
    +#  Empty categoricalFeaturesInfo indicates all features are continuous.
    +model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
    +                                     impurity='gini', maxDepth=5, maxBins=100)
    +
    +# Evaluate model on training instances and compute training error
    +predictions = model.predict(data.map(lambda x: x.features))
    +labelsAndPredictions = data.map(lambda lp: lp.label).zip(predictions)
    +trainErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(data.count())
    +print('Training Error = ' + str(trainErr))
    +print('Learned classification tree model:')
    +print(model)
     {% endhighlight %}
    +
    +Note: When making predictions for a dataset, it is more efficient to do batch prediction rather
    +than separately calling `predict` on each data point.  This is because the Python code makes calls
    +to an underlying `DecisionTree` model in Scala.
     </div>
    +
     </div>
     
     ### Regression
     
    -The example below demonstrates how to load a CSV file, parse it as an RDD of `LabeledPoint` and then
    +The example below demonstrates how to load a
    +[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
    +parse it as an RDD of `LabeledPoint` and then
     perform regression using a decision tree using variance as an impurity measure and a maximum tree
    --- End diff --
    
    using (second instance) -> with
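    Since the `categoricalFeaturesInfo` map recurs in all three examples above, a
    short Scala sketch of the non-empty case may help. The schema here is
    hypothetical: feature 0 is taken to be categorical with M = 3 values
    (say A, B, C encoded as 0.0, 1.0, 2.0), while all remaining features stay
    continuous:

        import org.apache.spark.mllib.tree.DecisionTree
        import org.apache.spark.mllib.util.MLUtils

        val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()

        // Feature index -> number of categories. An empty map means all continuous.
        val categoricalFeaturesInfo = Map(0 -> 3)

        // maxBins (the last argument) must be >= M for every categorical feature,
        // so any value >= 3 works here; 100 matches the surrounding examples.
        val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo,
          "gini", 5, 100)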




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52886441
  
    I've merged this into master and branch-1.1. Thanks!!




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52846027
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18996/consoleFull) for   PR 2063 at commit [`d802369`](https://github.com/apache/spark/commit/d80236916413f5102654280fec131530674774b6).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16511708
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,109 +85,316 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
    --- End diff --
    
    This explanation is specific to binary classification, though I think it's supposed to explain a strategy that's applicable to both binary classification and regression.
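    The reason the same trick covers both cases: on 0/1 labels, the proportion of
    label 1 in a category is exactly the mean label, and the mean label is also the
    natural ordering statistic for regression. An illustrative Scala sketch of the
    ordering step (emphatically not MLlib's internal code; the helper name is made
    up for this example):

        import org.apache.spark.SparkContext._  // pair-RDD operations
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.rdd.RDD

        // Orders the values of categorical feature `featureIndex` by mean label,
        // so only the M-1 splits along this ordering need to be evaluated
        // instead of all 2^(M-1)-1 subsets.
        def orderedCategories(data: RDD[LabeledPoint], featureIndex: Int): Array[Double] = {
          data.map(lp => (lp.features(featureIndex), (lp.label, 1L)))
            .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
            .mapValues { case (sum, n) => sum / n }   // mean label per category
            .collect()
            .sortBy(_._2)
            .map(_._1)  // e.g. A, C, B in the doc's running example
        }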




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52822524
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18972/consoleFull) for   PR 2063 at commit [`b9bee04`](https://github.com/apache/spark/commit/b9bee04d8ef538912b54a739f47ceeb11e4582b5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52852471
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18996/consoleFull) for   PR 2063 at commit [`d802369`](https://github.com/apache/spark/commit/d80236916413f5102654280fec131530674774b6).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52851069
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19001/consoleFull) for   PR 2063 at commit [`9dd1b6b`](https://github.com/apache/spark/commit/9dd1b6b6edd11035d081c425b2cc1af06a2d8442).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16511789
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -77,109 +85,316 @@ bins if the condition is not satisfied.
     
     **Categorical features**
     
    -For `$M$` categorical feature values, one could come up with `$2^(M-1)-1$` split candidates. For
    -binary classification, we can reduce the number of split candidates to `$M-1$` by ordering the
    +For a categorical feature with `$M$` possible values (categories), one could come up with
    +`$2^{M-1}-1$` split candidates. For binary classification and regression,
    +we can reduce the number of split candidates to `$M-1$` by ordering the
     categorical feature values by the proportion of labels falling in one of the two classes (see
    --- End diff --
    
    Correct.




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by atalwalkar <gi...@git.apache.org>.
Github user atalwalkar commented on the pull request:

    https://github.com/apache/spark/pull/2063#issuecomment-52864422
  
    LGTM




[GitHub] spark pull request: [SPARK-2840] [mllib] DecisionTree doc update (...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2063#discussion_r16503947
  
    --- Diff: docs/mllib-decision-tree.md ---
    @@ -114,35 +135,122 @@ perform classification using a decision tree using Gini impurity as an impurity
     maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
     
     <div class="codetabs">
    +
     <div data-lang="scala">
     {% highlight scala %}
    -import org.apache.spark.SparkContext
     import org.apache.spark.mllib.tree.DecisionTree
    -import org.apache.spark.mllib.regression.LabeledPoint
    -import org.apache.spark.mllib.linalg.Vectors
    -import org.apache.spark.mllib.tree.configuration.Algo._
    -import org.apache.spark.mllib.tree.impurity.Gini
    +import org.apache.spark.mllib.util.MLUtils
     
     // Load and parse the data file
    -val data = sc.textFile("data/mllib/sample_tree_data.csv")
    -val parsedData = data.map { line =>
    -  val parts = line.split(',').map(_.toDouble)
    -  LabeledPoint(parts(0), Vectors.dense(parts.tail))
    -}
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
     
    -// Run training algorithm to build the model
    +// Train a DecisionTree model.
    +//  Empty categoricalFeaturesInfo indicates all features are continuous.
    +val numClasses = 2
    +val categoricalFeaturesInfo = Map[Int, Int]()
    +val impurity = "gini"
     val maxDepth = 5
    -val model = DecisionTree.train(parsedData, Classification, Gini, maxDepth)
    +val maxBins = 100
    +
    +val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
    +  maxDepth, maxBins)
     
    -// Evaluate model on training examples and compute training error
    -val labelAndPreds = parsedData.map { point =>
    +// Evaluate model on training instances and compute training error
    +val labelAndPreds = data.map { point =>
       val prediction = model.predict(point.features)
       (point.label, prediction)
     }
    -val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    +val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
     println("Training Error = " + trainErr)
    +println("Learned classification tree model:\n" + model)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java">
    +{% highlight java %}
    +import java.util.HashMap;
    +import scala.Tuple2;
    +import org.apache.spark.api.java.function.Function2;
    +import org.apache.spark.api.java.JavaPairRDD;
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.api.java.function.PairFunction;
    +import org.apache.spark.mllib.regression.LabeledPoint;
    +import org.apache.spark.mllib.tree.DecisionTree;
    +import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    +import org.apache.spark.mllib.util.MLUtils;
    +import org.apache.spark.SparkConf;
    +
    +SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
    +JavaSparkContext sc = new JavaSparkContext(sparkConf);
    +
    +String datapath = "data/mllib/sample_libsvm_data.txt";
    +JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
    --- End diff --
    
    We cached the binned features in training. But in this example, we visit the raw features twice. Since it is reading from disk, it should help if we cache the data.
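    The fix is visible in the updated diff quoted elsewhere in this thread; in
    Scala it is a one-line change:

        // Cache the parsed input so the two passes over the raw features
        // (training, then training-error computation) do not both hit disk.
        val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()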

