Posted to reviews@spark.apache.org by GayathriMurali <gi...@git.apache.org> on 2016/03/29 04:23:28 UTC

[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

GitHub user GayathriMurali opened a pull request:

    https://github.com/apache/spark/pull/12023

    [Spark-13784][ML][WIP] Model export/import for spark.ml: RandomForests

    Please help review the code. I have included the WIP tag to make sure the changes look correct.
    
    ## What changes were proposed in this pull request?
    
    Model export/import for spark.ml RandomForests
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GayathriMurali/spark SPARK-13784

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12023.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12023
    
----
commit 03bb8880e73b6c107d9a13ab90ce7f61a8756c8f
Author: GayathriMurali <ga...@gmail.com>
Date:   2016-03-23T21:09:35Z

    SPARK-13784 Model export/import for Spark ml RandomForests

commit 68b9358f128c365d573c5881b06f420276fd44ff
Author: GayathriMurali <ga...@gmail.com>
Date:   2016-03-29T02:17:41Z

    SPARK-13783 Model export/import for spark.ml:RandomForests

----




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-204576825
  
    @jkbradley I was just about to ping you regarding this. I would definitely love to help out. I was out at Strata all week and couldn't get to this. Please let me know if you need anything else from me. 




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57674663
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -136,7 +144,8 @@ object RandomForestClassifier {
      * [[http://en.wikipedia.org/wiki/Random_forest  Random Forest]] model for classification.
      * It supports both binary and multiclass labels, as well as both continuous and categorical
      * features.
    - * @param _trees  Decision trees in the ensemble.
    +  *
    --- End diff --
    
    Update your IntelliJ settings: Editor -> Code Style -> Scala -> ScalaDoc tab -> uncheck "Use scaladoc indent for leading asterisk."
    
    You will probably need to manually correct these indentation changes for this PR, though.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57674667
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
    @@ -199,21 +210,71 @@ final class RandomForestRegressionModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Regression, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestRegressionModel.RandomForestRegressionModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader[RandomForestRegressionModel] =
    +    new RandomForestRegressionModel.RandomForestRegressionModelReader(this)
     }
     
    -private[ml] object RandomForestRegressionModel {
    -
    -  /** (private[ml]) Convert a model from the old API */
    -  def fromOld(
    -      oldModel: OldRandomForestModel,
    -      parent: RandomForestRegressor,
    -      categoricalFeatures: Map[Int, Int],
    -      numFeatures: Int = -1): RandomForestRegressionModel = {
    -    require(oldModel.algo == OldAlgo.Regression, "Cannot convert RandomForestModel" +
    -      s" with algo=${oldModel.algo} (old API) to RandomForestRegressionModel (new API).")
    -    val newTrees = oldModel.trees.map { tree =>
    -      // parent for each tree is null since there is no good way to set this.
    -      DecisionTreeRegressionModel.fromOld(tree, null, categoricalFeatures)
    +@Since("2.0.0")
    +object RandomForestRegressionModel extends MLReadable[RandomForestRegressionModel] {
    +
    +    @Since("2.0.0")
    +    override def load(path: String): RandomForestRegressionModel = super.load(path)
    +
    +    private[RandomForestRegressionModel]
    +    class RandomForestRegressionModelWriter(instance: RandomForestRegressionModel)
    +      extends MLWriter {
    +
    +          override protected def saveImpl(path: String): Unit = {
    +            val extraMetadata: JObject = Map(
    +                "numFeatures" -> instance.numFeatures)
    +            DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +            for ( treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    Can you share this implementation between all tree ensembles?  How about making a generic method with this signature in TreeEnsembleParams?
    ```
    private[ml] def saveImpl(trees: Array[DecisionTreeModel], path: String, sql: SQLContext)
    ```
    and a similar loadImpl method.
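
    A minimal sketch of the shape of that sharing, whether it ends up on
    TreeEnsembleParams or in a helper object next to DecisionTreeModelReadWrite
    (the object name, placement, and the reuse of the existing NodeData are
    assumptions here, not the final API):
    ```
    import org.apache.spark.ml.tree.{DecisionTreeModel, Node}
    import org.apache.spark.sql.SQLContext

    // Sketch only: one shared save/load path for all tree ensembles, so
    // RandomForestClassificationModel, RandomForestRegressionModel (and later
    // the GBT models) do not each keep their own copy of this logic.
    private[ml] object EnsembleModelReadWrite {

      /** Save all trees of an ensemble as a single DataFrame under path/data. */
      private[ml] def saveImpl(
          trees: Array[DecisionTreeModel],
          path: String,
          sql: SQLContext): Unit = {
        ???  // flatMap every tree into (treeID, NodeData) rows; see the save sketch below
      }

      /** Inverse of saveImpl: rebuild one root Node per tree, ordered by treeID. */
      private[ml] def loadImpl(path: String, sql: SQLContext): Array[Node] = {
        ???  // read path/data, group by treeID, rebuild each tree; see the load sketch below
      }
    }
    ```
    Each concrete writer would then only save its own metadata (numFeatures,
    numClasses) with DefaultParamsWriter.saveMetadata before delegating to the
    shared helper.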




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203122213
  
    **[Test build #2708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2708/consoleFull)** for PR 12023 at commit [`68b9358`](https://github.com/apache/spark/commit/68b9358f128c365d573c5881b06f420276fd44ff).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57674664
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -240,12 +250,66 @@ final class RandomForestClassificationModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader =
    +    new RandomForestClassificationModel.RandomForestClassificationModelReader(this)
     }
     
    -private[ml] object RandomForestClassificationModel {
    +@Since("2.0.0")
    +object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] {
    +
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestClassificationModel = super.load(path)
    +
    +  private[RandomForestClassificationModel]
    +  class RandomForestClassificationModelWriter(instance: RandomForestClassificationModel)
    +    extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      val extraMetadata: JObject = Map(
    +        "numFeatures" -> instance.numFeatures,
    +        "numClasses" -> instance.numClasses)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +      for(treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    This is writing each tree separately.  Based on our JIRA discussion, it would be better to write all trees in a single DataFrame.  You could create an RDD of trees, then flatMap that to an RDD of NodeData, and then convert that to a DataFrame.
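
    A sketch of that layout (EnsembleNodeData is the per-ensemble wrapper suggested
    elsewhere in this review, and NodeData.build is assumed to mirror the helper
    that already exists in DecisionTreeModelReadWrite; both are assumptions rather
    than existing API):
    ```
    // Defined alongside NodeData in treeModels.scala (sketch):
    case class EnsembleNodeData(treeID: Int, nodeData: NodeData)

    // Inside the writer's saveImpl, after DefaultParamsWriter.saveMetadata.
    // sc and sqlContext come from MLWriter; Path is org.apache.hadoop.fs.Path.
    val nodeDataRDD = sc.parallelize(instance.trees.zipWithIndex).flatMap {
      case (tree, treeID) =>
        // Pre-order NodeData rows for this tree, each tagged with its tree index.
        NodeData.build(tree.rootNode, 0)._1.map(n => EnsembleNodeData(treeID, n))
    }
    sqlContext.createDataFrame(nodeDataRDD)
      .write.parquet(new Path(path, "data").toString)
    ```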




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203118358
  
    **[Test build #2708 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2708/consoleFull)** for PR 12023 at commit [`68b9358`](https://github.com/apache/spark/commit/68b9358f128c365d573c5881b06f420276fd44ff).




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-204101033
  
    Sure, I'll send a PR to update this PR.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57968732
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -240,12 +250,66 @@ final class RandomForestClassificationModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader =
    +    new RandomForestClassificationModel.RandomForestClassificationModelReader(this)
     }
     
    -private[ml] object RandomForestClassificationModel {
    +@Since("2.0.0")
    +object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] {
    +
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestClassificationModel = super.load(path)
    +
    +  private[RandomForestClassificationModel]
    +  class RandomForestClassificationModelWriter(instance: RandomForestClassificationModel)
    +    extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      val extraMetadata: JObject = Map(
    +        "numFeatures" -> instance.numFeatures,
    +        "numClasses" -> instance.numClasses)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +      for(treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    @jkbradley Are you thinking of an Array of RDDs that we then flatten into a single RDD?  Should I use SparkContext.union to combine the multiple RDDs?




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57674668
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -238,3 +238,128 @@ private[ml] object DecisionTreeModelReadWrite {
         finalNodes.head
       }
     }
    +
    +private[ml] object RandomForestModelReadWrite {
    +
    +  /**
    +    * Info for a [[org.apache.spark.ml.tree.Split]]
    +    *
    +    * @param featureIndex  Index of feature split on
    +    * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +    *                                   For continuous feature, threshold.
    +    * @param numCategories  For categorical feature, number of categories.
    +    *                       For continuous feature, -1.
    +    */
    +  case class SplitData(
    --- End diff --
    
    Why is this copied here?  Can you reuse the existing SplitData?




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203763228
  
    @jkbradley I am sorry, I am afraid I will not be able to complete this tonight. Can you please help me with reusing the SplitData/build code from DecisionTrees in RandomForests?




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57904716
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
    @@ -199,21 +210,71 @@ final class RandomForestRegressionModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Regression, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestRegressionModel.RandomForestRegressionModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader[RandomForestRegressionModel] =
    +    new RandomForestRegressionModel.RandomForestRegressionModelReader(this)
     }
     
    -private[ml] object RandomForestRegressionModel {
    -
    -  /** (private[ml]) Convert a model from the old API */
    -  def fromOld(
    -      oldModel: OldRandomForestModel,
    -      parent: RandomForestRegressor,
    -      categoricalFeatures: Map[Int, Int],
    -      numFeatures: Int = -1): RandomForestRegressionModel = {
    -    require(oldModel.algo == OldAlgo.Regression, "Cannot convert RandomForestModel" +
    -      s" with algo=${oldModel.algo} (old API) to RandomForestRegressionModel (new API).")
    -    val newTrees = oldModel.trees.map { tree =>
    -      // parent for each tree is null since there is no good way to set this.
    -      DecisionTreeRegressionModel.fromOld(tree, null, categoricalFeatures)
    +@Since("2.0.0")
    +object RandomForestRegressionModel extends MLReadable[RandomForestRegressionModel] {
    +
    +    @Since("2.0.0")
    +    override def load(path: String): RandomForestRegressionModel = super.load(path)
    +
    +    private[RandomForestRegressionModel]
    +    class RandomForestRegressionModelWriter(instance: RandomForestRegressionModel)
    +      extends MLWriter {
    +
    +          override protected def saveImpl(path: String): Unit = {
    +            val extraMetadata: JObject = Map(
    +                "numFeatures" -> instance.numFeatures)
    +            DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +            for ( treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    +1




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57993953
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
    @@ -199,21 +210,71 @@ final class RandomForestRegressionModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Regression, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestRegressionModel.RandomForestRegressionModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader[RandomForestRegressionModel] =
    +    new RandomForestRegressionModel.RandomForestRegressionModelReader(this)
     }
     
    -private[ml] object RandomForestRegressionModel {
    -
    -  /** (private[ml]) Convert a model from the old API */
    -  def fromOld(
    -      oldModel: OldRandomForestModel,
    -      parent: RandomForestRegressor,
    -      categoricalFeatures: Map[Int, Int],
    -      numFeatures: Int = -1): RandomForestRegressionModel = {
    -    require(oldModel.algo == OldAlgo.Regression, "Cannot convert RandomForestModel" +
    -      s" with algo=${oldModel.algo} (old API) to RandomForestRegressionModel (new API).")
    -    val newTrees = oldModel.trees.map { tree =>
    -      // parent for each tree is null since there is no good way to set this.
    -      DecisionTreeRegressionModel.fromOld(tree, null, categoricalFeatures)
    +@Since("2.0.0")
    +object RandomForestRegressionModel extends MLReadable[RandomForestRegressionModel] {
    +
    +    @Since("2.0.0")
    +    override def load(path: String): RandomForestRegressionModel = super.load(path)
    +
    +    private[RandomForestRegressionModel]
    +    class RandomForestRegressionModelWriter(instance: RandomForestRegressionModel)
    +      extends MLWriter {
    +
    +          override protected def saveImpl(path: String): Unit = {
    +            val extraMetadata: JObject = Map(
    +                "numFeatures" -> instance.numFeatures)
    +            DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +            for ( treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    @jkbradley Should the saveImpl and load methods in RandomForestClassifier and Regressor override this method? I assume loadImpl will also have the same signature.





[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-202671513
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-202722392
  
    ok to test




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203603276
  
    @GayathriMurali Will you be able to update this soon?  If not, I'd be happy to help out.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57788687
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -240,12 +250,66 @@ final class RandomForestClassificationModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader =
    +    new RandomForestClassificationModel.RandomForestClassificationModelReader(this)
     }
     
    -private[ml] object RandomForestClassificationModel {
    +@Since("2.0.0")
    +object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] {
    +
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestClassificationModel = super.load(path)
    +
    +  private[RandomForestClassificationModel]
    +  class RandomForestClassificationModelWriter(instance: RandomForestClassificationModel)
    +    extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      val extraMetadata: JObject = Map(
    +        "numFeatures" -> instance.numFeatures,
    +        "numClasses" -> instance.numClasses)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +      for(treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    Ok, thanks. That should be much more efficient.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203655431
  
    Yes, that sounds fine. Thanks!




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203645597
  
    @jkbradley I should be able to update this by tonight. Would that work? 




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57787829
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -240,12 +250,66 @@ final class RandomForestClassificationModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader =
    +    new RandomForestClassificationModel.RandomForestClassificationModelReader(this)
     }
     
    -private[ml] object RandomForestClassificationModel {
    +@Since("2.0.0")
    +object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] {
    +
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestClassificationModel = super.load(path)
    +
    +  private[RandomForestClassificationModel]
    +  class RandomForestClassificationModelWriter(instance: RandomForestClassificationModel)
    +    extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      val extraMetadata: JObject = Map(
    +        "numFeatures" -> instance.numFeatures,
    +        "numClasses" -> instance.numClasses)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +      for(treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    @jkbradley Sorry for the confusion. In the JIRA discussion, I meant that each tree would be stored in its own DataFrame. I can work on storing all of them in a single DataFrame instead.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57904365
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -240,12 +250,66 @@ final class RandomForestClassificationModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
    +
    +  @Since("2.0.0")
    +  override def read: MLReader =
    +    new RandomForestClassificationModel.RandomForestClassificationModelReader(this)
     }
     
    -private[ml] object RandomForestClassificationModel {
    +@Since("2.0.0")
    +object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] {
    +
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestClassificationModel = super.load(path)
    +
    +  private[RandomForestClassificationModel]
    +  class RandomForestClassificationModelWriter(instance: RandomForestClassificationModel)
    +    extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      val extraMetadata: JObject = Map(
    +        "numFeatures" -> instance.numFeatures,
    +        "numClasses" -> instance.numClasses)
    +      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
    +      for(treeIndex <- 1 to instance.getNumTrees) {
    --- End diff --
    
    +1 @jkbradley




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-202671126
  
    @yanboliang @jkbradley Please help review the code. 




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali closed the pull request at:

    https://github.com/apache/spark/pull/12023




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-203117967
  
    add to whitelist




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57674670
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -238,3 +238,128 @@ private[ml] object DecisionTreeModelReadWrite {
         finalNodes.head
       }
     }
    +
    +private[ml] object RandomForestModelReadWrite {
    +
    +  /**
    +    * Info for a [[org.apache.spark.ml.tree.Split]]
    +    *
    +    * @param featureIndex  Index of feature split on
    +    * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +    *                                   For continuous feature, threshold.
    +    * @param numCategories  For categorical feature, number of categories.
    +    *                       For continuous feature, -1.
    +    */
    +  case class SplitData(
    +                        featureIndex: Int,
    +                        leftCategoriesOrThreshold: Array[Double],
    +                        numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +    * Info for a [[Node]]
    +    *
    +    * @param treeID  Index used for tree identification in RandomForest
    +    * @param id  Index used for tree reconstruction.  Indices follow a pre-order traversal.
    +    * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +    * @param gain  Gain, or arbitrary value if leaf node.
    +    * @param leftChild  Left child index, or arbitrary value if leaf node.
    +    * @param rightChild  Right child index, or arbitrary value if leaf node.
    +    * @param split  Split info, or arbitrary value if leaf node.
    +    */
    +  case class NodeData(
    --- End diff --
    
    Ditto for NodeData.  You can create a new class for ensembles if needed:
    ```
    case class EnsembleNodeData(treeID: Int, node: NodeData)
    ```




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-204591688
  
    @GayathriMurali Thanks a lot.  I just opened [https://github.com/apache/spark/pull/12118]
    Would you mind closing this PR?




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-202724996
  
    I made an initial pass with high-level comments.  It will be important to reuse more code.  I'll stay tuned for updates.  Thanks!




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12023#issuecomment-204569102
  
    @GayathriMurali  This turns out to be messier than I expected because the RF Estimators and Models do not share the same Params.  I'd like to take over this PR if that's OK.  I'll use your commits too, so you'll be an author on it.
    
    I can send a PR soon.  Would you mind helping to review the PR once I send it?  Thanks a lot for your understanding.




[GitHub] spark pull request: [Spark-13784][ML][WIP] Model export/import for...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12023#discussion_r57674671
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -238,3 +238,128 @@ private[ml] object DecisionTreeModelReadWrite {
         finalNodes.head
       }
     }
    +
    +private[ml] object RandomForestModelReadWrite {
    +
    +  /**
    +    * Info for a [[org.apache.spark.ml.tree.Split]]
    +    *
    +    * @param featureIndex  Index of feature split on
    +    * @param leftCategoriesOrThreshold  For categorical feature, set of leftCategories.
    +    *                                   For continuous feature, threshold.
    +    * @param numCategories  For categorical feature, number of categories.
    +    *                       For continuous feature, -1.
    +    */
    +  case class SplitData(
    +                        featureIndex: Int,
    +                        leftCategoriesOrThreshold: Array[Double],
    +                        numCategories: Int) {
    +
    +    def getSplit: Split = {
    +      if (numCategories != -1) {
    +        new CategoricalSplit(featureIndex, leftCategoriesOrThreshold, numCategories)
    +      } else {
    +        assert(leftCategoriesOrThreshold.length == 1, s"DecisionTree split data expected" +
    +          s" 1 threshold for ContinuousSplit, but found thresholds: " +
    +          leftCategoriesOrThreshold.mkString(", "))
    +        new ContinuousSplit(featureIndex, leftCategoriesOrThreshold(0))
    +      }
    +    }
    +  }
    +
    +  object SplitData {
    +    def apply(split: Split): SplitData = split match {
    +      case s: CategoricalSplit =>
    +        SplitData(s.featureIndex, s.leftCategories, s.numCategories)
    +      case s: ContinuousSplit =>
    +        SplitData(s.featureIndex, Array(s.threshold), -1)
    +    }
    +  }
    +
    +  /**
    +    * Info for a [[Node]]
    +    *
    +    * @param treeID  Index used for tree identification in RandomForest
    +    * @param id  Index used for tree reconstruction.  Indices follow a pre-order traversal.
    +    * @param impurityStats  Stats array.  Impurity type is stored in metadata.
    +    * @param gain  Gain, or arbitrary value if leaf node.
    +    * @param leftChild  Left child index, or arbitrary value if leaf node.
    +    * @param rightChild  Right child index, or arbitrary value if leaf node.
    +    * @param split  Split info, or arbitrary value if leaf node.
    +    */
    +  case class NodeData(
    +                       treeID: Int,
    +                       id: Int,
    +                       prediction: Double,
    +                       impurity: Double,
    +                       impurityStats: Array[Double],
    +                       gain: Double,
    +                       leftChild: Int,
    +                       rightChild: Int,
    +                       split: SplitData)
    +
    +  object NodeData {
    +    /**
    +      * Create [[NodeData]] instances for this node and all children.
    +      *
    +      * @param id  Current ID.  IDs are assigned via a pre-order traversal.
    +      * @return (sequence of nodes in pre-order traversal order, largest ID in subtree)
    +      *         The nodes are returned in pre-order traversal (root first) so that it is easy to
    +      *         get the ID of the subtree's root node.
    +      */
    +    def build(node: Node, treeID: Int, id: Int): (Seq[NodeData], Int) = node match {
    +      case n: InternalNode =>
    +        val (leftNodeData, leftIdx) = build(n.leftChild, treeID, id + 1)
    +        val (rightNodeData, rightIdx) = build(n.rightChild, treeID, leftIdx + 1)
    +        val thisNodeData = NodeData(treeID, id, n.prediction, n.impurity, n.impurityStats.stats,
    +          n.gain, leftNodeData.head.id, rightNodeData.head.id, SplitData(n.split))
    +        (thisNodeData +: (leftNodeData ++ rightNodeData), rightIdx)
    +      case _: LeafNode =>
    +        (Seq(NodeData(treeID, id, node.prediction, node.impurity, node.impurityStats.stats,
    +          -1.0, -1, -1, SplitData(-1, Array.empty[Double], -1))),
    +          id)
    +    }
    +  }
    +
    +  def loadTreeNodes(
    --- End diff --
    
    This should not be copied either.  Please reuse the one for trees.
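
    For the load side, a sketch that reuses the EnsembleNodeData shape from the
    save sketch above; buildTree is a stand-in for whatever per-tree
    reconstruction DecisionTreeModelReadWrite already provides (the name and
    signature are assumed):
    ```
    import org.apache.hadoop.fs.Path
    import org.apache.spark.ml.tree.Node
    import org.apache.spark.sql.SQLContext

    def loadTrees(path: String, sql: SQLContext): Array[Node] = {
      import sql.implicits._
      val dataPath = new Path(path, "data").toString
      sql.read.parquet(dataPath)
        .as[EnsembleNodeData].rdd              // one row per node, tagged with treeID
        .groupBy(_.treeID)                     // gather each tree's nodes together
        .mapValues(nodes => buildTree(nodes.map(_.nodeData).toSeq))  // hypothetical helper
        .collect()
        .sortBy(_._1)                          // keep the original tree order
        .map(_._2)
    }
    ```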

