You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/08/01 23:54:12 UTC

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/1720

    [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features

    Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    
    Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"
    
    Bug fix: Modified upper bound discussed above.
    
    Also: Small improvements to coding style in DecisionTree.
    
    CC @mengxr @manishamde

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark decisiontree-bugfix2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1720.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1720
    
----
commit 225822fe38762596b8c917a867e5cdbb2d9b4b55
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-01T21:50:42Z

    Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
    
    Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"
    
    Bug fix: Modified upper bound discussed above.
    
    Also: Small improvements to coding style in DecisionTree.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by manishamde <gi...@git.apache.org>.

Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1720#discussion_r15722106
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -42,6 +42,18 @@ class DecisionTreeSuite extends FunSuite with LocalSparkContext {
         assert(accuracy >= requiredAccuracy)
       }
     
    +  def validateRegressor(
    +      model: DecisionTreeModel,
    +      input: Seq[LabeledPoint],
    +      requiredMSE: Double) {
    +    val predictions = input.map(x => model.predict(x.features))
    +    val squaredError = predictions.zip(input).map { case (prediction, expected) =>
    +      (prediction - expected.label) * (prediction - expected.label)
    --- End diff --
    
    super minor style nit: could the case statement go on the next line?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1720#discussion_r15722196
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -522,28 +522,36 @@ object DecisionTree extends Serializable with Logging {
           }
     
           /**
    -       * Sequential search helper method to find bin for categorical feature.
    +       * Sequential search helper method to find bin for categorical feature
    +       * (for classification and regression).
            */
    -      def sequentialBinSearchForOrderedCategoricalFeatureInClassification(): Int = {
    +      def sequentialBinSearchForOrderedCategoricalFeature(): Int = {
             val featureCategories = strategy.categoricalFeaturesInfo(featureIndex)
    -        val numCategoricalBins = math.pow(2.0, featureCategories - 1).toInt - 1
    +        val featureValue = labeledPoint.features(featureIndex)
             var binIndex = 0
    -        while (binIndex < numCategoricalBins) {
    +        while (binIndex < featureCategories) {
               val bin = bins(featureIndex)(binIndex)
               val categories = bin.highSplit.categories
    -          val features = labeledPoint.features
    -          if (categories.contains(features(featureIndex))) {
    +          if (categories.contains(featureValue)) {
                 return binIndex
               }
               binIndex += 1
             }
    +        if (featureValue < 0 || featureValue >= featureCategories) {
    --- End diff --
    
    We could do this verification as part of dataset functions. It doesn't hurt to double check here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1720#issuecomment-50939553
  
    QA tests have started for PR 1720. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17702/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by manishamde <gi...@git.apache.org>.

Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1720#discussion_r15722046
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -522,28 +522,36 @@ object DecisionTree extends Serializable with Logging {
           }
     
           /**
    -       * Sequential search helper method to find bin for categorical feature.
    +       * Sequential search helper method to find bin for categorical feature
    +       * (for classification and regression).
            */
    -      def sequentialBinSearchForOrderedCategoricalFeatureInClassification(): Int = {
    +      def sequentialBinSearchForOrderedCategoricalFeature(): Int = {
             val featureCategories = strategy.categoricalFeaturesInfo(featureIndex)
    -        val numCategoricalBins = math.pow(2.0, featureCategories - 1).toInt - 1
    +        val featureValue = labeledPoint.features(featureIndex)
             var binIndex = 0
    -        while (binIndex < numCategoricalBins) {
    +        while (binIndex < featureCategories) {
               val bin = bins(featureIndex)(binIndex)
               val categories = bin.highSplit.categories
    -          val features = labeledPoint.features
    -          if (categories.contains(features(featureIndex))) {
    +          if (categories.contains(featureValue)) {
                 return binIndex
               }
               binIndex += 1
             }
    +        if (featureValue < 0 || featureValue >= featureCategories) {
    --- End diff --
    
    I like this idea. Do you think it might be better handled in a verification step before training? The idea would be run the tree in '''diagnostic''' mode so that such errors are caught before training.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1720#discussion_r15722274
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -42,6 +42,18 @@ class DecisionTreeSuite extends FunSuite with LocalSparkContext {
         assert(accuracy >= requiredAccuracy)
       }
     
    +  def validateRegressor(
    +      model: DecisionTreeModel,
    +      input: Seq[LabeledPoint],
    +      requiredMSE: Double) {
    +    val predictions = input.map(x => model.predict(x.features))
    +    val squaredError = predictions.zip(input).map { case (prediction, expected) =>
    +      (prediction - expected.label) * (prediction - expected.label)
    --- End diff --
    
    I think this is the correct Spark style. But it maybe better if we use
    
    ~~~
    val err = prediction - expected.label
    err * err
    ~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1720#issuecomment-50943393
  
    QA results for PR 1720:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17702/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1720#issuecomment-50942241
  
    LGTM. Waiting for Jenkins ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1720#issuecomment-50943502
  
    Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by manishamde <gi...@git.apache.org>.

Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/1720#issuecomment-50941616
  
    LGTM!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2796] [mllib] DecisionTree bug fix: ord...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1720


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---