You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sethah <gi...@git.apache.org> on 2015/09/10 01:29:58 UTC

[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/8675

    [SPARK-9715][ML] Store numFeatures in all ML PredictionModel types

    All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark SPARK-9715

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8675
    
----
commit 43deffc52b1b5b5da86608fb8b860cb7d602b05b
Author: sethah <se...@gmail.com>
Date:   2015-09-08T21:42:16Z

    Adding abstract definition numFeatures for PredictionModel

commit b04f8f9adc334b3d0b5cfb09d35d80415e70b048
Author: sethah <se...@gmail.com>
Date:   2015-09-08T23:34:51Z

    Changing trees numFeatures declaration and cleanup

commit 6eb4bc7c0f058832450552bfcb68c13ad549b961
Author: sethah <se...@gmail.com>
Date:   2015-09-09T23:01:30Z

    adding Since tag and default value

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39671205
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
    @@ -145,6 +145,10 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType,
       /** @group setParam */
       def setPredictionCol(value: String): M = set(predictionCol, value).asInstanceOf[M]
     
    +  /** Returns the number of features the model was trained on. Default: -1 */
    --- End diff --
    
    "Default: -1" --> "If unknown, returns -1."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-142743829
  
    LGTM  Thank you!  Merging with master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39671219
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -179,22 +179,28 @@ private[ml] object RandomForest extends Logging {
           }
         }
     
    +    val numFeatures = metadata.numFeaturesPerNode
    --- End diff --
    
    Change to ```metadata.numFeatures``` (numFeaturesPerNode is different)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39671227
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
    @@ -191,6 +191,10 @@ class LogisticRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
     
         val model = lr.fit(dataset)
         assert(model.numClasses === 2)
    +    val datasetFeatureSize = dataset.rdd.first() match {
    --- End diff --
    
    You shouldn't assume a column ordering.  Instead:
    ```
    dataset.select("features").first().getAs[Vector](0).size
    ```
    or
    ```
    dataset.select("features").first() match {
      case features: Vector => features.size
    }
    ```
    
    (same issue in other test suites)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39671174
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala ---
    @@ -166,6 +167,7 @@ private[ml] object DecisionTreeClassificationModel {
             s" DecisionTreeClassificationModel (new API).  Algo is: ${oldModel.algo}")
         val rootNode = Node.fromOld(oldModel.topNode, categoricalFeatures)
         val uid = if (parent != null) parent.uid else Identifiable.randomUID("dtc")
    -    new DecisionTreeClassificationModel(uid, rootNode, -1)
    +    // Can't infer number of features from old model, so default to -1
    +    new DecisionTreeClassificationModel(uid, rootNode, -1, -1)
    --- End diff --
    
    I like @holdenk 's suggestion.  It's a private API anyways.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39110298
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala ---
    @@ -166,6 +167,7 @@ private[ml] object DecisionTreeClassificationModel {
             s" DecisionTreeClassificationModel (new API).  Algo is: ${oldModel.algo}")
         val rootNode = Node.fromOld(oldModel.topNode, categoricalFeatures)
         val uid = if (parent != null) parent.uid else Identifiable.randomUID("dtc")
    -    new DecisionTreeClassificationModel(uid, rootNode, -1)
    +    // Can't infer number of features from old model, so default to -1
    +    new DecisionTreeClassificationModel(uid, rootNode, -1, -1)
    --- End diff --
    
    There isn't a way to infer the number of features from an old MLlib model since we only have access to the DT root node. I give it a dummy value of -1 here, but would appreciate feedback to see if there is a better way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-140493169
  
    That's interesting, I don't know why `GBTRegressionModel` has a public constructor (IMO it shouldn't since users should not be directly instantiationg `Model`s), maybe @jkbradley can help


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39798759
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
    @@ -167,7 +168,8 @@ object GBTClassifier {
     final class GBTClassificationModel(
    --- End diff --
    
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39698374
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
    @@ -175,6 +177,14 @@ final class GBTClassificationModel(
       require(_trees.length == _treeWeights.length, "GBTClassificationModel given trees, treeWeights" +
         s" of non-matching lengths (${_trees.length}, ${_treeWeights.length}, respectively).")
     
    +  /**
    +   * Construct a GBTClassificationModel
    +   * @param _trees  Decision trees in the ensemble.
    +   * @param _treeWeights  Weights for the decision trees in the ensemble.
    +   */
    +  def this(uid: String, _trees: Array[DecisionTreeRegressionModel], _treeWeights: Array[Double]) =
    +    this(uid, _trees, _treeWeights, -1)
    +
    --- End diff --
    
    @jkbradley I want to make sure I interpreted your note correctly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by feynmanliang <gi...@git.apache.org>.
Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-140235205
  
    This introduces break changes by modifying public constructor APIs; is there any way we can refactor this to use a trait mixin to avoid the duplication and modification of constructors?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-142365103
  
    @jkbradley Fixed the merge conflicts. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39671224
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
    @@ -318,5 +319,6 @@ private[ml] object DecisionTreeClassifierSuite extends SparkFunSuite {
         val oldTreeAsNew = DecisionTreeClassificationModel.fromOld(
           oldTree, newTree.parent.asInstanceOf[DecisionTreeClassifier], categoricalFeatures)
         TreeTests.checkEqual(oldTreeAsNew, newTree)
    +    assert(newTree.numFeatures == numFeatures)
    --- End diff --
    
    Use triple equals in unit tests ```===```  (same for other test suites)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39110354
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
    @@ -145,6 +145,10 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType,
       /** @group setParam */
       def setPredictionCol(value: String): M = set(predictionCol, value).asInstanceOf[M]
     
    +  /** Returns the number of features the model was trained on. Default: -1 */
    +  @Since("1.5.1")
    --- End diff --
    
    Not sure if this is necessary and what version it should be.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-140843703
  
    Nice PR, thanks!  Those are the only issues I spotted.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39298668
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala ---
    @@ -166,6 +167,7 @@ private[ml] object DecisionTreeClassificationModel {
             s" DecisionTreeClassificationModel (new API).  Algo is: ${oldModel.algo}")
         val rootNode = Node.fromOld(oldModel.topNode, categoricalFeatures)
         val uid = if (parent != null) parent.uid else Identifiable.randomUID("dtc")
    -    new DecisionTreeClassificationModel(uid, rootNode, -1)
    +    // Can't infer number of features from old model, so default to -1
    +    new DecisionTreeClassificationModel(uid, rootNode, -1, -1)
    --- End diff --
    
    I think this works well. This allows me make the tests for GBTs sensible. Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39783539
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
    @@ -175,6 +177,14 @@ final class GBTClassificationModel(
       require(_trees.length == _treeWeights.length, "GBTClassificationModel given trees, treeWeights" +
         s" of non-matching lengths (${_trees.length}, ${_treeWeights.length}, respectively).")
     
    +  /**
    +   * Construct a GBTClassificationModel
    +   * @param _trees  Decision trees in the ensemble.
    +   * @param _treeWeights  Weights for the decision trees in the ensemble.
    +   */
    +  def this(uid: String, _trees: Array[DecisionTreeRegressionModel], _treeWeights: Array[Double]) =
    +    this(uid, _trees, _treeWeights, -1)
    +
    --- End diff --
    
    Yep, this is what I had in mind.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8675


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39671186
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
    @@ -145,6 +145,10 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType,
       /** @group setParam */
       def setPredictionCol(value: String): M = set(predictionCol, value).asInstanceOf[M]
     
    +  /** Returns the number of features the model was trained on. Default: -1 */
    +  @Since("1.5.1")
    --- End diff --
    
    Well, we're adding those incrementally, so you're right to go ahead and add it.  The version should be the next release 1.6.0 (not the little patch releases).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-140843611
  
    GBT models had public constructors as a mistake in our QA for 1.4.  I want all model constructors to be public eventually, but only after we are pretty certain we can keep them stable.
    
    The public constructor for GBT* should not really be a problem.  You can just add a new constructor which takes numFeatures (which can be private) as well as the old one (which will be public and set numFeatures to the default -1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-142358046
  
    Apologies for the delayed review.  Could you please fix the merge conflicts before I merge it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39783544
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
    @@ -167,7 +168,8 @@ object GBTClassifier {
     final class GBTClassificationModel(
    --- End diff --
    
    make private


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-140926689
  
    @jkbradley Thanks for the feedback. I made the changes noted above and fixed the GBT constructors. Let me know if you see anything else.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39112136
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala ---
    @@ -166,6 +167,7 @@ private[ml] object DecisionTreeClassificationModel {
             s" DecisionTreeClassificationModel (new API).  Algo is: ${oldModel.algo}")
         val rootNode = Node.fromOld(oldModel.topNode, categoricalFeatures)
         val uid = if (parent != null) parent.uid else Identifiable.randomUID("dtc")
    -    new DecisionTreeClassificationModel(uid, rootNode, -1)
    +    // Can't infer number of features from old model, so default to -1
    +    new DecisionTreeClassificationModel(uid, rootNode, -1, -1)
    --- End diff --
    
    What about having numFeatures passed in (default value of -1)? (Just a thought).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-139072325
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8675#discussion_r39111865
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
    @@ -145,6 +145,10 @@ abstract class PredictionModel[FeaturesType, M <: PredictionModel[FeaturesType,
       /** @group setParam */
       def setPredictionCol(value: String): M = set(predictionCol, value).asInstanceOf[M]
     
    +  /** Returns the number of features the model was trained on. Default: -1 */
    --- End diff --
    
    Maybe include a mention that # of features is not inferred for old models?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-140489654
  
    @feynmanliang One thing I'm curious about is if this would still be a problem if all the constructors were private to ml? Right now, GBTs are the only one of the `PredictionModel` subclasses that is not designated as `private[ml]`, and I'm not exactly sure why it isn't. Other constructors are not public. This is important because it makes the most sense to pass this value as a constructor argument when it can't be inferred (i.e. for tree based predictors). The number of features must be passed, in this case, when the model is created (when the estimator that created it is trained). If I understand correctly, using a trait would require the value to be set after the model is created, but that doesn't make a ton of sense since the `numFeatures` for a given model should never change (should be a val). I believe this is why the only model that currently stores `numFeatures` (RF) passes it through the constructor. 
    
    If we can make the GBT models private, then would it be permissible to change the constructors? Open to suggestions...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9715][ML] Store numFeatures in all ML P...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/8675#issuecomment-141184531
  
    Thanks for the updates.  That 1 comment should be it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org