You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2016/04/02 00:13:25 UTC

[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/12118

    [SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor

    ## What changes were proposed in this pull request?
    
    **Main change**: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below)
    
    Modified numTrees method (*deprecation*)
    * Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
    * What this PR does: Moves method numTrees outside of trait TreeEnsembleModel.  Adds it to GBT and RF Models.  Deprecates it in RF Models in favor of new method getNumTrees.  In Spark 2.1, we can have RF Models include Param numTrees.
    
    Minor items
    * Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.
    
    **Implementation details**
    * Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
    * Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
      * These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
    * Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
    * Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame.  Same for DefaultParamsReader.loadMetadata
    
    ## How was this patch tested?
    
    Adds standard unit tests for RF save/load

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark GayathriMurali-SPARK-13784

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12118.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12118
    
----
commit 03bb8880e73b6c107d9a13ab90ce7f61a8756c8f
Author: GayathriMurali <ga...@gmail.com>
Date:   2016-03-23T21:09:35Z

    SPARK-13784 Model export/import for Spark ml RandomForests

commit 68b9358f128c365d573c5881b06f420276fd44ff
Author: GayathriMurali <ga...@gmail.com>
Date:   2016-03-29T02:17:41Z

    SPARK-13783 Model export/import for spark.ml:RandomForests

commit 2d89b4c1ad08290a7118a49413e9ed2b4722471e
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-04-01T19:00:33Z

    Implemented read/write for RandomForestClassifier, Regressor.

commit f2f89eb74d8dd1f20915ebe4db254aa2e529ce9d
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-04-01T19:54:45Z

    PR cleanup

commit 623b309e1a70a33acc1e03791f0450147d37eb16
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-04-01T21:45:19Z

    Moved numTrees Param outside of RandomForest*Model Params, and deprecated current numTrees val

commit de7ef9d1700c6653e035c3bff273165f7b73c24a
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-04-01T21:58:53Z

    PR cleanups

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12118


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205403548
  
    @GayathriMurali Thanks for the initial commits, and @hhbyyh thanks for the review!  I'll go ahead and merge this with master since you gave a LGTM pending minor changes.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204590237
  
    **[Test build #54727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54727/consoleFull)** for PR 12118 at commit [`607ed4e`](https://github.com/apache/spark/commit/607ed4e10e6a6377c106b19e2fa2b7e5698e95bb).
     * This patch **fails Scala style tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204591519
  
    @GayathriMurali @hhbyyh  Could you please help review this PR?  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204594592
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54730/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204635416
  
    @jkbradley Thanks for this. This looks great and clarifies a lot of things I was trying to do. I had one minor comment, except that it looks fine to me. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204602373
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205175829
  
    **[Test build #54836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54836/consoleFull)** for PR 12118 at commit [`e0306a9`](https://github.com/apache/spark/commit/e0306a9e861b6e70b5608d3add11a37173e20885).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204594515
  
    **[Test build #54730 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54730/consoleFull)** for PR 12118 at commit [`cd07aff`](https://github.com/apache/spark/commit/cd07aff885c1f9c9403a44f81b64b4ecac1f3388).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205175939
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54836/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205175937
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12118#discussion_r58312620
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
    @@ -117,13 +121,18 @@ object RandomForestRegressor {
       @Since("1.4.0")
       final val supportedFeatureSubsetStrategies: Array[String] =
         RandomForestParams.supportedFeatureSubsetStrategies
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestRegressor = super.load(path)
    +
     }
     
     /**
      * :: Experimental ::
      * [[http://en.wikipedia.org/wiki/Random_forest  Random Forest]] model for regression.
      * It supports both continuous and categorical features.
    - * @param _trees  Decision trees in the ensemble.
    +  *
    --- End diff --
    
    indent correctly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204589688
  
    **[Test build #54727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54727/consoleFull)** for PR 12118 at commit [`607ed4e`](https://github.com/apache/spark/commit/607ed4e10e6a6377c106b19e2fa2b7e5698e95bb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205151409
  
    **[Test build #54828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54828/consoleFull)** for PR 12118 at commit [`afed67b`](https://github.com/apache/spark/commit/afed67b02aaa5440c50c24c656bab0ff3302a4ff).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205160960
  
    **[Test build #54836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54836/consoleFull)** for PR 12118 at commit [`e0306a9`](https://github.com/apache/spark/commit/e0306a9e861b6e70b5608d3add11a37173e20885).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12118#discussion_r58296743
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -358,3 +376,100 @@ private[ml] object DecisionTreeModelReadWrite {
         finalNodes.head
       }
     }
    +
    +private[ml] object EnsembleModelReadWrite {
    +
    +  /**
    +   * Helper method for saving a tree ensemble to disk.
    +   *
    +   * @param instance  Tree ensemble model
    +   * @param path  Path to which to save the ensemble model.
    +   * @param extraMetadata  Metadata such as numFeatures, numClasses, numTrees.
    +   */
    +  def saveImpl[M <: Params with TreeEnsembleModel](
    +      instance: M,
    +      path: String,
    +      sql: SQLContext,
    +      extraMetadata: JObject): Unit = {
    +    DefaultParamsWriter.saveMetadata(instance, path, sql.sparkContext, Some(extraMetadata))
    +    val treesMetadataJson: Array[(Int, String)] = instance.trees.zipWithIndex.map {
    +      case (tree, treeID) =>
    +        treeID -> DefaultParamsWriter.getMetadataToSave(tree.asInstanceOf[Params], sql.sparkContext)
    +    }
    +    val treesMetadataPath = new Path(path, "treesMetadata").toString
    +    sql.createDataFrame(treesMetadataJson).toDF("treeID", "metadata")
    +      .write.parquet(treesMetadataPath)
    +    val dataPath = new Path(path, "data").toString
    +    val nodeDataRDD = sql.sparkContext.parallelize(instance.trees.zipWithIndex).flatMap {
    --- End diff --
    
    This is a single RDD.  The flatMap maps every element of the original RDD to multiple elements in a new RDD.  It should be fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204591234
  
    **[Test build #54728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54728/consoleFull)** for PR 12118 at commit [`b13b11f`](https://github.com/apache/spark/commit/b13b11fef094d0946f1201dff76321846fed93ac).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204590244
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204592564
  
    **[Test build #54730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54730/consoleFull)** for PR 12118 at commit [`cd07aff`](https://github.com/apache/spark/commit/cd07aff885c1f9c9403a44f81b64b4ecac1f3388).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204594587
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205151430
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204608632
  
    **[Test build #2727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2727/consoleFull)** for PR 12118 at commit [`cd07aff`](https://github.com/apache/spark/commit/cd07aff885c1f9c9403a44f81b64b4ecac1f3388).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204602379
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54728/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205007321
  
    just some minor comments. LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205151432
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54828/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-205149388
  
    **[Test build #54828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54828/consoleFull)** for PR 12118 at commit [`afed67b`](https://github.com/apache/spark/commit/afed67b02aaa5440c50c24c656bab0ff3302a4ff).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204601514
  
    That seems like a spurious or unrelated failure.  I'll retest


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204601587
  
    **[Test build #2727 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2727/consoleFull)** for PR 12118 at commit [`cd07aff`](https://github.com/apache/spark/commit/cd07aff885c1f9c9403a44f81b64b4ecac1f3388).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204763366
  
    @GayathriMurali Thanks for taking a look!  @hhbyyh It would be great if you could take a look too since you're familiar with some of this code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12118#discussion_r58312723
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -236,12 +255,66 @@ final class RandomForestClassificationModel private[ml] (
       private[ml] def toOld: OldRandomForestModel = {
         new OldRandomForestModel(OldAlgo.Classification, _trees.map(_.toOld))
       }
    +
    +  @Since("2.0.0")
    +  override def write: MLWriter =
    +    new RandomForestClassificationModel.RandomForestClassificationModelWriter(this)
     }
     
    -private[ml] object RandomForestClassificationModel {
    +@Since("2.0.0")
    +object RandomForestClassificationModel extends MLReadable[RandomForestClassificationModel] {
    +
    +  @Since("2.0.0")
    +  override def read: MLReader[RandomForestClassificationModel] =
    +    new RandomForestClassificationModelReader
    +
    +  @Since("2.0.0")
    +  override def load(path: String): RandomForestClassificationModel = super.load(path)
    +
    +  private[RandomForestClassificationModel]
    +  class RandomForestClassificationModelWriter(instance: RandomForestClassificationModel)
    +    extends MLWriter {
    +
    +    override protected def saveImpl(path: String): Unit = {
    +      // Note: numTrees is not currently used, but could be nice to store for fast querying.
    +      val extraMetadata: JObject = Map(
    +        "numFeatures" -> instance.numFeatures,
    +        "numClasses" -> instance.numClasses,
    +        "numTrees" -> instance.getNumTrees)
    +      EnsembleModelReadWrite.saveImpl(instance, path, sqlContext, extraMetadata)
    +    }
    +  }
    +
    +  private class RandomForestClassificationModelReader
    +    extends MLReader[RandomForestClassificationModel] {
    +
    +    /** Checked against metadata when loading model */
    +    private val className = classOf[RandomForestClassificationModel].getName
    +    private val treeClassName = classOf[DecisionTreeClassificationModel].getName
    +
    +    override def load(path: String): RandomForestClassificationModel = {
    +      implicit val format = DefaultFormats
    +      val (metadata: Metadata, treesData: Array[(Metadata, Node)]) =
    +        EnsembleModelReadWrite.loadImpl(path, sqlContext, className, treeClassName)
    +      val numFeatures = (metadata.metadata \ "numFeatures").extract[Int]
    +      val numClasses = (metadata.metadata \ "numClasses").extract[Int]
    +
    --- End diff --
    
    just IMO, maybe check numTrees == trees.length since there's redundant information.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204602107
  
    **[Test build #54728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54728/consoleFull)** for PR 12118 at commit [`b13b11f`](https://github.com/apache/spark/commit/b13b11fef094d0946f1201dff76321846fed93ac).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204590248
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54727/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by GayathriMurali <gi...@git.apache.org>.
Github user GayathriMurali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12118#discussion_r58287249
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala ---
    @@ -358,3 +376,100 @@ private[ml] object DecisionTreeModelReadWrite {
         finalNodes.head
       }
     }
    +
    +private[ml] object EnsembleModelReadWrite {
    +
    +  /**
    +   * Helper method for saving a tree ensemble to disk.
    +   *
    +   * @param instance  Tree ensemble model
    +   * @param path  Path to which to save the ensemble model.
    +   * @param extraMetadata  Metadata such as numFeatures, numClasses, numTrees.
    +   */
    +  def saveImpl[M <: Params with TreeEnsembleModel](
    +      instance: M,
    +      path: String,
    +      sql: SQLContext,
    +      extraMetadata: JObject): Unit = {
    +    DefaultParamsWriter.saveMetadata(instance, path, sql.sparkContext, Some(extraMetadata))
    +    val treesMetadataJson: Array[(Int, String)] = instance.trees.zipWithIndex.map {
    +      case (tree, treeID) =>
    +        treeID -> DefaultParamsWriter.getMetadataToSave(tree.asInstanceOf[Params], sql.sparkContext)
    +    }
    +    val treesMetadataPath = new Path(path, "treesMetadata").toString
    +    sql.createDataFrame(treesMetadataJson).toDF("treeID", "metadata")
    +      .write.parquet(treesMetadataPath)
    +    val dataPath = new Path(path, "data").toString
    +    val nodeDataRDD = sql.sparkContext.parallelize(instance.trees.zipWithIndex).flatMap {
    --- End diff --
    
    Is it alright to use flatMap to combine RDDs? Can we use sparkContext.union instead? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13784][ML] Persistence for RandomForest...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/12118#issuecomment-204988756
  
    Sorry for the delay. I'll start now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org