You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2016/03/08 19:38:03 UTC

[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/11581

    [SPARK-11888] [ML] Decision tree persistence in spark.ml

    ### What changes were proposed in this pull request?
    
    Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel
    * The shared implementation is in treeModels.scala
    * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files.
    
    Other changes:
    * Made CategoricalSplit.numCategories public (to use in persistence)
    * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument.  This caused an error in LDASuite, which I fixed.
    
    ### How was this patch tested?
    
    Persistence is tested via unit tests.  For each algorithm, there are 2 non-trivial trees (depth 2).  One is built with continuous features, and one with categorical; this ensures that both types of splits are tested.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-io

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11581.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11581
    
----
commit 1eb8f4118d1dbeba9fac2e00adb20b468f7ec668
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2015-11-20T19:22:23Z

    partly done adding save/load to DecisionTreeClassifier and Model

commit 24a418873702042b5456423dd77d7672abc8aa7b
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-08T06:41:25Z

    DecisionTreeClassifier,Regressor and Models support save,load.  Fixed bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite where it never called checkModelData function.

commit ea1e02ff69494e7eed6742c5234432610d4b5ede
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-08T08:03:43Z

    Fixed issue in LDA not copying doc,topicConcentration values

commit 7a1013a3cfc7c4a195cea714c4612e9d3e946b4d
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-08T18:04:10Z

    Reverted annoying style mistakes from IntelliJ.  Fixed my fix to LDASuite.  Some more docs.

commit 207cec29f4017c1c4d4a12fb382ed3a604f60303
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-08T18:36:59Z

    tiny cleanups

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194991336
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-195007843
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56014633
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -179,3 +179,21 @@ private[spark] abstract class ImpurityCalculator(val stats: Array[Double]) exten
       }
     
     }
    +
    +private[spark] object ImpurityCalculator {
    +
    +  /**
    +   * Create an [[ImpurityCalculator]] instance of the given impurity type and with
    +   * the given stats.
    +   */
    +  def getCalculator(impurity: String, stats: Array[Double]): ImpurityCalculator = {
    +    impurity match {
    +      case "gini" => new GiniCalculator(stats)
    +      case "entropy" => new EntropyCalculator(stats)
    +      case "variance" => new VarianceCalculator(stats)
    --- End diff --
    
    String references are error-prone. It's better to add a string name for each Calculator, then use ```GiniCalculator.name``` rather than directly use "gini" here. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193919747
  
    **[Test build #52677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52677/consoleFull)** for PR 11581 at commit [`e94772d`](https://github.com/apache/spark/commit/e94772dd193acc9ccc3999e2d4aa623f1a73cf81).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56345217
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -101,3 +109,127 @@ private[ml] trait TreeEnsembleModel {
       /** Total number of nodes, summed over all trees in the ensemble. */
       lazy val totalNumNodes: Int = trees.map(_.numNodes).sum
     }
    +
    +/** Helper classes for tree model persistence */
    +private[ml] object DecisionTreeModelReadWrite {
    +
    +  /**
    +   * Info for a [[org.apache.spark.ml.tree.Split]]
    +   *
    +   * @param featureIndex  Index of feature split on
    +   * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +   *                                   For continuous feature, threshold.
    +   * @param numCategories  For categorical feature, number of categories.
    +   *                       For continuous feature, -1.
    +   */
    +  case class SplitData(
    +      featureIndex: Int,
    +      leftCategoriesOrThreshold: Array[Double],
    +      numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +   * Info for a [[Node]]
    +   *
    +   * @param id  Index used for tree reconstruction.  Indices follow a pre-order traversal.
    +   * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +   * @param gain  Gain, or arbitrary value if leaf node.
    +   * @param leftChild  Left child index, or arbitrary value if leaf node.
    +   * @param rightChild  Right child index, or arbitrary value if leaf node.
    +   * @param split  Split info, or arbitrary value if leaf node.
    +   */
    +  case class NodeData(
    +    id: Int,
    +    prediction: Double,
    +    impurity: Double,
    +    impurityStats: Array[Double],
    +    gain: Double,
    +    leftChild: Int,
    +    rightChild: Int,
    +    split: SplitData)
    +
    +  object NodeData {
    +    /**
    +     * Create [[NodeData]] instances for this node and all children.
    +     *
    +     * @param id  Current ID.  IDs are assigned via a pre-order traversal.
    +     * @return (sequence of nodes in pre-order traversal order, largest ID in subtree)
    +     *         The nodes are returned in pre-order traversal (root first) so that it is easy to
    +     *         get the ID of the subtree's root node.
    +     */
    +    def build(node: Node, id: Int): (Seq[NodeData], Int) = node match {
    +      case n: InternalNode =>
    +        val (leftNodeData, leftIdx) = build(n.leftChild, id + 1)
    +        val (rightNodeData, rightIdx) = build(n.rightChild, leftIdx + 1)
    +        val thisNodeData = NodeData(id, n.prediction, n.impurity, n.impurityStats.stats,
    +          n.gain, leftNodeData.head.id, rightNodeData.head.id, SplitData(n.split))
    +        (thisNodeData +: (leftNodeData ++ rightNodeData), rightIdx)
    +      case _: LeafNode =>
    +        (Seq(NodeData(id, node.prediction, node.impurity, node.impurityStats.stats,
    +          -1.0, -1, -1, SplitData(-1, Array.empty[Double], -1))),
    +          id)
    +    }
    +  }
    +
    +  def loadTreeNodes(
    +      path: String,
    +      metadata: DefaultParamsReader.Metadata,
    +      sqlContext: SQLContext): Node = {
    +    import sqlContext.implicits._
    +    implicit val format = DefaultFormats
    +
    +    // Get impurity to construct ImpurityCalculator for each node
    +    val impurityType: String = {
    +      val impurityJson: JValue = metadata.getParamValue("impurity")
    +      Param.jsonDecode[String](compact(render(impurityJson)))
    +    }
    +
    +    val dataPath = new Path(path, "data").toString
    +    val data = sqlContext.read.format("parquet").load(dataPath).as[NodeData]
    --- End diff --
    
    nit: ```sqlContext.read.parquet(dataPath).as[NodeData]```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194600443
  
    Sure thing.
    
    On Wednesday, March 9, 2016, jkbradley <no...@github.com> wrote:
    
    > @yanboliang <https://github.com/yanboliang> @yinxusen
    > <https://github.com/yinxusen> Would you have time to review this patch?
    > Thanks!
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11581#issuecomment-193909221>.
    >
    
    
    -- 
    Cheers,
    Xusen Yin



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193992317
  
    **[Test build #52689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52689/consoleFull)** for PR 11581 at commit [`40f224d`](https://github.com/apache/spark/commit/40f224d9c5d03b9b6c5fb0ef97326391f7b4ba8f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-196250449
  
    Sorry for late response, make a pass now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-196542261
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53122/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56012675
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -101,3 +109,127 @@ private[ml] trait TreeEnsembleModel {
       /** Total number of nodes, summed over all trees in the ensemble. */
       lazy val totalNumNodes: Int = trees.map(_.numNodes).sum
     }
    +
    +/** Helper classes for tree model persistence */
    +private[ml] object DecisionTreeModelReadWrite {
    +
    +  /**
    +   * Info for a [[org.apache.spark.ml.tree.Split]]
    +   *
    +   * @param featureIndex  Index of feature split on
    +   * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +   *                                   For continuous feature, threshold.
    +   * @param numCategories  For categorical feature, number of categories.
    +   *                       For continuous feature, -1.
    +   */
    +  case class SplitData(
    +      featureIndex: Int,
    +      leftCategoriesOrThreshold: Array[Double],
    +      numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +   * Info for a [[Node]]
    +   *
    +   * @param id  Index used for tree reconstruction.  Indices follow an in-order traversal.
    +   * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +   * @param gain  Gain, or arbitrary value if leaf node.
    +   * @param leftChild  Left child index, or arbitrary value if leaf node.
    +   * @param rightChild  Right child index, or arbitrary value if leaf node.
    +   * @param split  Split info, or arbitrary value if leaf node.
    +   */
    +  case class NodeData(
    +    id: Int,
    +    prediction: Double,
    +    impurity: Double,
    +    impurityStats: Array[Double],
    +    gain: Double,
    +    leftChild: Int,
    +    rightChild: Int,
    +    split: SplitData)
    +
    +  object NodeData {
    +    /**
    +     * Create [[NodeData]] instances for this node and all children.
    +     *
    +     * @param id  Current ID.  IDs are assigned via an in-order traversal.
    +     * @return (sequence of nodes in in-order traversal order, largest ID in subtree)
    +     *         The nodes are returned in in-order traversal (root first) so that it is easy to
    +     *         get the ID of the subtree's root node.
    --- End diff --
    
    I think here we use pre-order traversal because root first + left subtree + right subtree in L186. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r55724843
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala ---
    @@ -117,17 +126,18 @@ final class DecisionTreeRegressionModel private[ml] (
         override val rootNode: Node,
         override val numFeatures: Int)
       extends PredictionModel[Vector, DecisionTreeRegressionModel]
    -  with DecisionTreeModel with DecisionTreeRegressorParams with Serializable {
    +  with DecisionTreeModel with DecisionTreeRegressorParams with MLWritable with Serializable {
     
       /** @group setParam */
       def setVarianceCol(value: String): this.type = set(varianceCol, value)
     
       require(rootNode != null,
    -    "DecisionTreeClassificationModel given null rootNode, but it requires a non-null rootNode.")
    +    "DecisionTreeRegressionModel given null rootNode, but it requires a non-null rootNode.")
     
       /**
        * Construct a decision tree regression model.
    -   * @param rootNode  Root node of tree, with other nodes attached.
    +    *
    +    * @param rootNode  Root node of tree, with other nodes attached.
    --- End diff --
    
    Thanks!  I can't figure out how to fix IntelliJ imports ever since I updated it...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193919818
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52677/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194991341
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52837/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193909847
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52676/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197552019
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193919815
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194991328
  
    **[Test build #52837 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52837/consoleFull)** for PR 11581 at commit [`76943bd`](https://github.com/apache/spark/commit/76943bd9d109e4765e50d5b4c639f547f67215cb).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194994053
  
    OK that should fix the merge.  I also added a unit test for depth 0 as a corner case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197553266
  
    Got the LGTM from @mengxr offline.  I'll merge this with master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56081484
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -101,3 +109,127 @@ private[ml] trait TreeEnsembleModel {
       /** Total number of nodes, summed over all trees in the ensemble. */
       lazy val totalNumNodes: Int = trees.map(_.numNodes).sum
     }
    +
    +/** Helper classes for tree model persistence */
    +private[ml] object DecisionTreeModelReadWrite {
    +
    +  /**
    +   * Info for a [[org.apache.spark.ml.tree.Split]]
    +   *
    +   * @param featureIndex  Index of feature split on
    +   * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +   *                                   For continuous feature, threshold.
    +   * @param numCategories  For categorical feature, number of categories.
    +   *                       For continuous feature, -1.
    +   */
    +  case class SplitData(
    +      featureIndex: Int,
    +      leftCategoriesOrThreshold: Array[Double],
    +      numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +   * Info for a [[Node]]
    +   *
    +   * @param id  Index used for tree reconstruction.  Indices follow an in-order traversal.
    +   * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +   * @param gain  Gain, or arbitrary value if leaf node.
    +   * @param leftChild  Left child index, or arbitrary value if leaf node.
    +   * @param rightChild  Right child index, or arbitrary value if leaf node.
    +   * @param split  Split info, or arbitrary value if leaf node.
    +   */
    +  case class NodeData(
    +    id: Int,
    +    prediction: Double,
    +    impurity: Double,
    +    impurityStats: Array[Double],
    +    gain: Double,
    +    leftChild: Int,
    +    rightChild: Int,
    +    split: SplitData)
    +
    +  object NodeData {
    +    /**
    +     * Create [[NodeData]] instances for this node and all children.
    +     *
    +     * @param id  Current ID.  IDs are assigned via an in-order traversal.
    +     * @return (sequence of nodes in in-order traversal order, largest ID in subtree)
    +     *         The nodes are returned in in-order traversal (root first) so that it is easy to
    +     *         get the ID of the subtree's root node.
    --- End diff --
    
    Whoops!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197553356
  
    Thanks everyone for the reviews!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193913339
  
    **[Test build #52677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52677/consoleFull)** for PR 11581 at commit [`e94772d`](https://github.com/apache/spark/commit/e94772dd193acc9ccc3999e2d4aa623f1a73cf81).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-195007630
  
    **[Test build #52839 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52839/consoleFull)** for PR 11581 at commit [`1ba8507`](https://github.com/apache/spark/commit/1ba85071464c3f27f77256c6987718a2ed888d35).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56081469
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
    @@ -179,3 +179,21 @@ private[spark] abstract class ImpurityCalculator(val stats: Array[Double]) exten
       }
     
     }
    +
    +private[spark] object ImpurityCalculator {
    +
    +  /**
    +   * Create an [[ImpurityCalculator]] instance of the given impurity type and with
    +   * the given stats.
    +   */
    +  def getCalculator(impurity: String, stats: Array[Double]): ImpurityCalculator = {
    +    impurity match {
    +      case "gini" => new GiniCalculator(stats)
    +      case "entropy" => new EntropyCalculator(stats)
    +      case "variance" => new VarianceCalculator(stats)
    --- End diff --
    
    I agree, but that goes beyond this PR.  Let's do it in a follow-up.  (It could be part of removing the old spark.mllib implementation.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194675931
  
    @jkbradley Make a pass. LGTM except for the minor issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56107175
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
    @@ -273,7 +273,29 @@ private[ml] object DefaultParamsReader {
           sparkVersion: String,
           params: JValue,
           metadata: JValue,
    -      metadataJson: String)
    +      metadataJson: String) {
    +
    +    /**
    +     * Get the JSON value of the [[org.apache.spark.ml.param.Param]] of the given name.
    +     * This can be useful for getting a Param value before an instance of [[Params]]
    +     * is available.
    +     */
    +    def getParamValue(paramName: String): JValue = {
    --- End diff --
    
    It's in class Metadata, which is in ```private[ml] object DefaultParamsReader```, so it shouldn't be publicly accessible, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194994694
  
    **[Test build #52839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52839/consoleFull)** for PR 11581 at commit [`1ba8507`](https://github.com/apache/spark/commit/1ba85071464c3f27f77256c6987718a2ed888d35).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-196542230
  
    **[Test build #53122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53122/consoleFull)** for PR 11581 at commit [`ce2d824`](https://github.com/apache/spark/commit/ce2d824b54d7c7ad45f133ff13f44c9aee12df28).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-196537012
  
    **[Test build #53122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53122/consoleFull)** for PR 11581 at commit [`ce2d824`](https://github.com/apache/spark/commit/ce2d824b54d7c7ad45f133ff13f44c9aee12df28).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197552020
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53346/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193909843
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193909182
  
    **[Test build #52676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52676/consoleFull)** for PR 11581 at commit [`207cec2`](https://github.com/apache/spark/commit/207cec29f4017c1c4d4a12fb382ed3a604f60303).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197535912
  
    **[Test build #53346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53346/consoleFull)** for PR 11581 at commit [`cfb770b`](https://github.com/apache/spark/commit/cfb770ba4da92de63e9f4a64517c4d9d17de7d02).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56012073
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -101,3 +109,127 @@ private[ml] trait TreeEnsembleModel {
       /** Total number of nodes, summed over all trees in the ensemble. */
       lazy val totalNumNodes: Int = trees.map(_.numNodes).sum
     }
    +
    +/** Helper classes for tree model persistence */
    +private[ml] object DecisionTreeModelReadWrite {
    +
    +  /**
    +   * Info for a [[org.apache.spark.ml.tree.Split]]
    +   *
    +   * @param featureIndex  Index of feature split on
    +   * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +   *                                   For continuous feature, threshold.
    +   * @param numCategories  For categorical feature, number of categories.
    +   *                       For continuous feature, -1.
    +   */
    +  case class SplitData(
    +      featureIndex: Int,
    +      leftCategoriesOrThreshold: Array[Double],
    +      numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +   * Info for a [[Node]]
    +   *
    +   * @param id  Index used for tree reconstruction.  Indices follow an in-order traversal.
    +   * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +   * @param gain  Gain, or arbitrary value if leaf node.
    --- End diff --
    
    In leaf node, the value of ```gain``` should be ```-1.0```?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193992822
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56082025
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -101,3 +109,127 @@ private[ml] trait TreeEnsembleModel {
       /** Total number of nodes, summed over all trees in the ensemble. */
       lazy val totalNumNodes: Int = trees.map(_.numNodes).sum
     }
    +
    +/** Helper classes for tree model persistence */
    +private[ml] object DecisionTreeModelReadWrite {
    +
    +  /**
    +   * Info for a [[org.apache.spark.ml.tree.Split]]
    +   *
    +   * @param featureIndex  Index of feature split on
    +   * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +   *                                   For continuous feature, threshold.
    +   * @param numCategories  For categorical feature, number of categories.
    +   *                       For continuous feature, -1.
    +   */
    +  case class SplitData(
    +      featureIndex: Int,
    +      leftCategoriesOrThreshold: Array[Double],
    +      numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +   * Info for a [[Node]]
    +   *
    +   * @param id  Index used for tree reconstruction.  Indices follow an in-order traversal.
    +   * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +   * @param gain  Gain, or arbitrary value if leaf node.
    --- End diff --
    
    I like having this weaker public guarantee: people should not assume anything about the values


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-195123894
  
    **[Test build #2631 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2631/consoleFull)** for PR 11581 at commit [`1ba8507`](https://github.com/apache/spark/commit/1ba85071464c3f27f77256c6987718a2ed888d35).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r55627061
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala ---
    @@ -117,17 +126,18 @@ final class DecisionTreeRegressionModel private[ml] (
         override val rootNode: Node,
         override val numFeatures: Int)
       extends PredictionModel[Vector, DecisionTreeRegressionModel]
    -  with DecisionTreeModel with DecisionTreeRegressorParams with Serializable {
    +  with DecisionTreeModel with DecisionTreeRegressorParams with MLWritable with Serializable {
     
       /** @group setParam */
       def setVarianceCol(value: String): this.type = set(varianceCol, value)
     
       require(rootNode != null,
    -    "DecisionTreeClassificationModel given null rootNode, but it requires a non-null rootNode.")
    +    "DecisionTreeRegressionModel given null rootNode, but it requires a non-null rootNode.")
     
       /**
        * Construct a decision tree regression model.
    -   * @param rootNode  Root node of tree, with other nodes attached.
    +    *
    +    * @param rootNode  Root node of tree, with other nodes attached.
    --- End diff --
    
    Wrong indent and extra blank line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-196542259
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193909833
  
    **[Test build #52676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52676/consoleFull)** for PR 11581 at commit [`207cec2`](https://github.com/apache/spark/commit/207cec29f4017c1c4d4a12fb382ed3a604f60303).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193908984
  
    I'll rebase this after [https://github.com/apache/spark/pull/11513] gets merged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-195007849
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52839/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193992827
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52689/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194989852
  
    @yinxusen Fixed.  Does it look ready now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-195138159
  
    **[Test build #2631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2631/consoleFull)** for PR 11581 at commit [`1ba8507`](https://github.com/apache/spark/commit/1ba85071464c3f27f77256c6987718a2ed888d35).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56099459
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
    @@ -273,7 +273,29 @@ private[ml] object DefaultParamsReader {
           sparkVersion: String,
           params: JValue,
           metadata: JValue,
    -      metadataJson: String)
    +      metadataJson: String) {
    +
    +    /**
    +     * Get the JSON value of the [[org.apache.spark.ml.param.Param]] of the given name.
    +     * This can be useful for getting a Param value before an instance of [[Params]]
    +     * is available.
    +     */
    +    def getParamValue(paramName: String): JValue = {
    --- End diff --
    
    This is a public API but it returns a type from a 3rd-party package. Shall we mark it private?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193909221
  
    @yanboliang Would you have time to review this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-193977620
  
    **[Test build #52689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52689/consoleFull)** for PR 11581 at commit [`40f224d`](https://github.com/apache/spark/commit/40f224d9c5d03b9b6c5fb0ef97326391f7b4ba8f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r55725405
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala ---
    @@ -117,17 +126,18 @@ final class DecisionTreeRegressionModel private[ml] (
         override val rootNode: Node,
         override val numFeatures: Int)
       extends PredictionModel[Vector, DecisionTreeRegressionModel]
    -  with DecisionTreeModel with DecisionTreeRegressorParams with Serializable {
    +  with DecisionTreeModel with DecisionTreeRegressorParams with MLWritable with Serializable {
     
       /** @group setParam */
       def setVarianceCol(value: String): this.type = set(varianceCol, value)
     
       require(rootNode != null,
    -    "DecisionTreeClassificationModel given null rootNode, but it requires a non-null rootNode.")
    +    "DecisionTreeRegressionModel given null rootNode, but it requires a non-null rootNode.")
     
       /**
        * Construct a decision tree regression model.
    -   * @param rootNode  Root node of tree, with other nodes attached.
    +    *
    +    * @param rootNode  Root node of tree, with other nodes attached.
    --- End diff --
    
    Same here, I don't know either.
    
    On Thursday, March 10, 2016, jkbradley <no...@github.com> wrote:
    
    > In
    > mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala
    > <https://github.com/apache/spark/pull/11581#discussion_r55724843>:
    >
    > >
    > >    /**
    > >     * Construct a decision tree regression model.
    > > -   * @param rootNode  Root node of tree, with other nodes attached.
    > > +    *
    > > +    * @param rootNode  Root node of tree, with other nodes attached.
    >
    > Thanks! I can't figure out how to fix IntelliJ imports ever since I
    > updated it...
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11581/files#r55724843>.
    >
    
    
    -- 
    Cheers,
    Xusen Yin



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-196552383
  
    @yanboliang Thanks for reviewing it!  That test failure is from another PR breaking the build.  I'll wait until it's cleared up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11581


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194990545
  
    **[Test build #52837 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52837/consoleFull)** for PR 11581 at commit [`76943bd`](https://github.com/apache/spark/commit/76943bd9d109e4765e50d5b4c639f547f67215cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-194991139
  
    LGTM now.
    
    On Thursday, March 10, 2016, jkbradley <no...@github.com> wrote:
    
    > @yinxusen <https://github.com/yinxusen> Fixed. Does it look ready now?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/11581#issuecomment-194989852>.
    >
    
    
    -- 
    Cheers,
    Xusen Yin



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197361857
  
    @jkbradley Looks good for me now, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11581#issuecomment-197551836
  
    **[Test build #53346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53346/consoleFull)** for PR 11581 at commit [`cfb770b`](https://github.com/apache/spark/commit/cfb770ba4da92de63e9f4a64517c4d9d17de7d02).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11888] [ML] Decision tree persistence i...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11581#discussion_r56293008
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
    @@ -273,7 +273,29 @@ private[ml] object DefaultParamsReader {
           sparkVersion: String,
           params: JValue,
           metadata: JValue,
    -      metadataJson: String)
    +      metadataJson: String) {
    +
    +    /**
    +     * Get the JSON value of the [[org.apache.spark.ml.param.Param]] of the given name.
    +     * This can be useful for getting a Param value before an instance of [[Params]]
    +     * is available.
    +     */
    +    def getParamValue(paramName: String): JValue = {
    --- End diff --
    
    Sorry, I thought this was in a `private[ml]` trait that could be inherited by a public class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org