Posted to reviews@spark.apache.org by rotationsymmetry <gi...@git.apache.org> on 2015/10/07 06:07:37 UTC

[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

GitHub user rotationsymmetry opened a pull request:

    https://github.com/apache/spark/pull/9008

    [SPARK-9478] [ml] Add class weights to Random Forest

    This PR adds weight support to:
    
    DecisionTreeClassifier
    DecisionTreeRegressor
    RandomForestClassifier
    RandomForestRegressor

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rotationsymmetry/spark SPARK-9478

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9008
    
----
commit 367a443761f46a4acbd911f5c3e84902079596ac
Author: Meihua Wu <me...@umich.edu>
Date:   2015-10-07T02:55:35Z

    Add weight support for DecisionTree and RandomForest

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53089695
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala ---
    @@ -40,7 +42,7 @@ import org.apache.spark.sql.DataFrame
     @Experimental
     final class DecisionTreeRegressor @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       extends Predictor[Vector, DecisionTreeRegressor, DecisionTreeRegressionModel]
    -  with DecisionTreeParams with TreeRegressorParams {
    +  with DecisionTreeParams with TreeRegressorParams with HasWeightCol{
    --- End diff --
    
    Space after `HasWeightCol`




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147122387
  
    **[Test build #43531 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/console)** for PR 9008 at commit [`8f35057`](https://github.com/apache/spark/commit/8f350577ca7ceeadd9ea74570d19407784e49fa4) after a configured wait of `250m`.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-209136374
  
    @sethah To avoid the overhead of computing stats for both of these params, one option would be to selectively compute only the stats that are required (e.g. if they request `minInstancesPerNode`, compute the stats needed for that, and if they request `min_weight_fraction_leaf`, compute the stats needed for that).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r41648955
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1211,4 +1213,28 @@ private[ml] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Inject the sample weight to sub-sample weights of the baggedPoints
    +   */
    +  private[impl] def reweightSubSampleWeights(
    +      baggedTreePoints: RDD[BaggedPoint[TreePoint]]): RDD[BaggedPoint[TreePoint]] = {
    +    baggedTreePoints.map {bagged =>
    +      val treePoint = bagged.datum
    +      val adjustedSubSampleWeights = bagged.subsampleWeights.map(w => w * treePoint.weight)
    +      new BaggedPoint[TreePoint](treePoint, adjustedSubSampleWeights)
    +    }
    +  }
    +
    +  /**
    +   * A thin adaptor to [[org.apache.spark.mllib.tree.impl.DecisionTreeMetadata.buildMetadata]]
    +   */
    +  private[impl] def buildWeightedMetadata(
    --- End diff --
    
    I am working on another PR where it is an issue that the ML and MLlib implementations share the `DecisionTreeMetadata` class. I'm not sure what the usual protocol is around this type of thing, but since the MLlib implementation will be phased out, I wonder if we can't just copy the `DecisionTreeMetadata` code to ML so we can separate it from the MLlib implementation. When the initial ML implementation was done, @jkbradley [mentioned](https://github.com/apache/spark/pull/7294) that the shared classes could be ported to ML lazily. Doing that now would avoid having to build this thin wrapper around `buildMetadata`. Any feedback would be appreciated.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53092866
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
    @@ -73,6 +76,56 @@ class DecisionTreeRegressorSuite extends SparkFunSuite with MLlibTestSparkContex
         MLTestingUtils.checkCopy(model)
       }
     
    +  test("training with weighted data") {
    +    val (dataset, weightedDataset) = {
    +      val testData1 = TreeTests.generateNoisyData(5, 123)
    +      val testData2 = TreeTests.generateNoisyData(5, 456)
    +
    +      // Over-sample the 1st dataset twice.
    +      val overSampledTestData1 = testData1.flatMap {
    +        labeledPoint => Iterator(labeledPoint, labeledPoint)
    +      }
    +
    +      val rnd = new Random(8392)
    +      val weightedTestData1 = testData1.flatMap {
    +        case LabeledPoint(label: Double, features: Vector) => {
    +          if (rnd.nextGaussian() > 0.0) {
    +            Iterator(
    +              Instance(label, 1.2, features),
    +              Instance(label, 0.8, features),
    +              Instance(0.0, 0.0, features))
    +          } else {
    +            Iterator(
    +              Instance(label, 0.3, features),
    +              Instance(1.0, 0.0, features),
    +              Instance(label, 1.1, features),
    +              Instance(label, 0.6, features))
    +          }
    +        }
    +      }
    +      val weightedTestData2 = testData2.map {
    +        p: LabeledPoint => Instance(p.label, 1, p.features)
    +      }
    +
    +      (sqlContext.createDataFrame(sc.parallelize(overSampledTestData1 ++ testData2, 2)),
    +        sqlContext.createDataFrame(sc.parallelize(weightedTestData1 ++ weightedTestData2, 2)))
    +    }
    +
    +    val featureIndexer = new VectorIndexer()
    +      .setInputCol("features")
    +      .setOutputCol("indexedFeatures")
    +      .setMaxCategories(4)
    +      .fit(dataset)
    +
    +    val dt = new DecisionTreeRegressor()
    +      .setFeaturesCol("indexedFeatures")
    +
    +    val model1 = dt.fit(featureIndexer.transform(dataset))
    +    val model2 = dt.fit(featureIndexer.transform(weightedDataset),
    +      dt.weightCol->"weight")
    --- End diff --
    
    ditto (need spaces around `->`)




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r41597118
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -87,8 +86,10 @@ private[ml] object RandomForest extends Logging {
     
         val withReplacement = numTrees > 1
     
    -    val baggedInput = BaggedPoint
    +    val crudeBaggedInput = BaggedPoint
    --- End diff --
    
    Perhaps `unWeightedBaggedInput` is more descriptive?
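
For context, here is a minimal sketch of the two-stage flow that the suggested name describes. It uses simplified stand-ins for `TreePoint` and `BaggedPoint` and plain Scala collections instead of RDDs; the field layouts are assumptions for illustration, not the actual Spark classes. Bagging first yields per-tree subsample weights that ignore the instance weight, and the reweighting step from the diff then scales each of them by that weight.

```scala
// Simplified stand-ins for Spark's internal classes (assumed shapes, illustration only).
final case class TreePoint(label: Double, features: Array[Int], weight: Double)
final case class BaggedPoint[T](datum: T, subsampleWeights: Array[Double])

object ReweightSketch {
  // Input: per-tree subsample weights that ignore the instance weight.
  // Output: the same points with every per-tree weight scaled by the instance weight.
  def reweightSubSampleWeights(
      unweighted: Seq[BaggedPoint[TreePoint]]): Seq[BaggedPoint[TreePoint]] = {
    unweighted.map { bagged =>
      val w = bagged.datum.weight
      BaggedPoint(bagged.datum, bagged.subsampleWeights.map(_ * w))
    }
  }

  def main(args: Array[String]): Unit = {
    val point = TreePoint(label = 1.0, features = Array(0, 2), weight = 0.5)
    // The point was drawn twice for tree 0, never for tree 1, and once for tree 2.
    val unWeightedBaggedInput = Seq(BaggedPoint(point, Array(2.0, 0.0, 1.0)))
    val baggedInput = reweightSubSampleWeights(unWeightedBaggedInput)
    println(baggedInput.head.subsampleWeights.mkString(", ")) // prints "1.0, 0.0, 0.5"
  }
}
```

Keeping the two stages as separately named values makes it explicit which collection still carries raw sampling counts and which one already includes the instance weights.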




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by rotationsymmetry <gi...@git.apache.org>.
Github user rotationsymmetry commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r41591990
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1211,4 +1212,34 @@ private[ml] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Inject the sample weight to sub-sample weights of the baggedPoints
    +   */
    +  private[impl] def reweightSubSampleWeights(
    +      baggedTreePoints: RDD[BaggedPoint[TreePoint]]): RDD[BaggedPoint[TreePoint]] = {
    +    baggedTreePoints.map {bagged =>
    +      val treePoint = bagged.datum
    +      val adjustedSubSampleWeights = bagged.subsampleWeights.map(w => w * treePoint.weight)
    +      new BaggedPoint[TreePoint](treePoint, adjustedSubSampleWeights)
    +    }
    +  }
    +
    +  /**
    +   * A thin adaptor to [[org.apache.spark.mllib.tree.impl.DecisionTreeMetadata.buildMetadata]]
    +   */
    +  private[impl] def buildWeightedMetadata(
    +      input: RDD[WeightedLabeledPoint],
    +      strategy: OldStrategy,
    +      numTrees: Int,
    +      featureSubsetStrategy: String) = {
    --- End diff --
    
    Thank you very much for your comment.
    
    1) I will add the return type in my next push.
    
    2) Yes, you are right: I don't want to change the MLlib implementation yet. I will leave it as a TODO for after we have a standard way to represent a weighted labeled point.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-210002149
  
    @holdenk Thanks for the feedback. Upon some further thought, I think that a.) we need to compute the statistics needed for both `minInstancesPerNode` and `minWeightFractionPerNode`, and b.) it will not be too hard to compute them both, nor will it add a ton of extra memory overhead. Selectively computing one or the other could get complicated very quickly.
    
    I am going to have a PR for this ready soon, which will incorporate changes submitted in this PR. I created two JIRAs for issues that I encountered when preparing this PR and submitted patches for each. They are:
    
    * [SPARK-14610](https://issues.apache.org/jira/browse/SPARK-14610) - [PR 12374](https://github.com/apache/spark/pull/12374)
    * [SPARK-14599](https://issues.apache.org/jira/browse/SPARK-14599) - [PR 12370](https://github.com/apache/spark/pull/12370)
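
The trade-off discussed above can be sketched in plain Scala with no Spark dependency: accumulate both the raw instance count and the weight sum for a candidate child node in a single pass, then check both constraints. This is only an illustration under stated assumptions. The names `NodeStats`, `collect` and `isValidChild` are hypothetical, and reading `minWeightFractionPerNode` as a fraction of the total training weight is an assumption borrowed from how scikit-learn defines `min_weight_fraction_leaf`.

```scala
// Hypothetical per-node statistics: both fields are updated in the same pass.
final case class NodeStats(count: Long, weightSum: Double) {
  def add(weight: Double): NodeStats = NodeStats(count + 1, weightSum + weight)
}

object SplitValiditySketch {
  // Accumulate count and weight sum together over the instance weights reaching a node.
  def collect(weights: Seq[Double]): NodeStats =
    weights.foldLeft(NodeStats(0L, 0.0))((stats, w) => stats.add(w))

  // A child is valid only if it satisfies both constraints.
  def isValidChild(
      child: NodeStats,
      totalWeight: Double,
      minInstancesPerNode: Int,
      minWeightFractionPerNode: Double): Boolean = {
    child.count >= minInstancesPerNode &&
      child.weightSum >= minWeightFractionPerNode * totalWeight
  }

  def main(args: Array[String]): Unit = {
    val left = collect(Seq(0.1, 0.1, 0.1)) // several instances, very little weight
    val right = collect(Seq(20.0))         // one instance carrying most of the weight
    val total = left.weightSum + right.weightSum
    // left fails the weight-fraction check; right passes both checks.
    println(isValidChild(left, total, minInstancesPerNode = 1, minWeightFractionPerNode = 0.05))
    println(isValidChild(right, total, minInstancesPerNode = 1, minWeightFractionPerNode = 0.05))
  }
}
```

In this sketch the extra cost of tracking both statistics is one additional `Double` per node, which is consistent with the expectation that the memory overhead stays small.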




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146942918
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53089894
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1171,4 +1173,28 @@ private[ml] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Inject the sample weight to sub-sample weights of the baggedPoints
    +   */
    +  private[impl] def reweightSubSampleWeights(
    +      baggedTreePoints: RDD[BaggedPoint[TreePoint]]): RDD[BaggedPoint[TreePoint]] = {
    +    baggedTreePoints.map {bagged =>
    +      val treePoint = bagged.datum
    +      val adjustedSubSampleWeights = bagged.subsampleWeights.map(w => w * treePoint.weight)
    +      new BaggedPoint[TreePoint](treePoint, adjustedSubSampleWeights)
    +    }
    +  }
    +
    +  /**
    +   * A thin adaptor to [[org.apache.spark.mllib.tree.impl.DecisionTreeMetadata.buildMetadata]]
    --- End diff --
    
    nit: "adaptor" -> "adapter". I'm not sure it is currently _incorrect_ but "adapter" is the significantly more common spelling.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146775911
  
    [Test build #43460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43460/console) for PR 9008 at commit [`0ffbdd0`](https://github.com/apache/spark/commit/0ffbdd06951bc62b85483580ec68369b0d2a7191).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154381876
  
    [Test build #45206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45206/console) for PR 9008 at commit [`c1785a8`](https://github.com/apache/spark/commit/c1785a8f3055bc48ce480b827befcb27812f0449).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154858995
  
    **[Test build #45315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45315/consoleFull)** for PR 9008 at commit [`32f4548`](https://github.com/apache/spark/commit/32f4548a22aaf2079dabef0743342d81bef7750f).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147530643
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147013601
  
    Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147122395
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147273096
  
    [Test build #43553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/consoleFull) for PR 9008 at commit [`bd316d6`](https://github.com/apache/spark/commit/bd316d6bf6bfb2cbe811c1b0b937c494c1acf273).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147553262
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146775989
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43460/




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53093008
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala ---
    @@ -73,6 +76,56 @@ class DecisionTreeRegressorSuite extends SparkFunSuite with MLlibTestSparkContex
         MLTestingUtils.checkCopy(model)
       }
     
    +  test("training with weighted data") {
    --- End diff --
    
    Since the data generation process for this test is identical to the one in the classifier test, perhaps we could reuse the code and put it in `TreeTests`? It's basically just copy-pasted as is.
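
One way such a shared helper could look, sketched in plain Scala with simplified stand-ins for `LabeledPoint` and `Instance` (the helper name `overSampledAndWeighted` is hypothetical, and the case class shapes are assumptions rather than the Spark definitions). It mirrors the idea behind the test data in the diff: duplicating each point twice should be equivalent to splitting it into weighted copies whose weights sum to 2, while zero-weight rows carry arbitrary labels that should be ignored.

```scala
import scala.util.Random

// Simplified stand-ins for Spark's LabeledPoint and Instance (assumed shapes).
final case class LabeledPoint(label: Double, features: Seq[Double])
final case class Instance(label: Double, weight: Double, features: Seq[Double])

object WeightedDataSketch {
  // Returns (over-sampled data, weighted data) built from the same base points.
  def overSampledAndWeighted(
      data: Seq[LabeledPoint],
      seed: Long): (Seq[LabeledPoint], Seq[Instance]) = {
    // Each point appears twice in the over-sampled data.
    val overSampled = data.flatMap(p => Seq(p, p))
    val rnd = new Random(seed)
    // Each point is split into weighted copies whose weights sum to 2.
    val weighted = data.flatMap { case LabeledPoint(label, features) =>
      if (rnd.nextGaussian() > 0.0) {
        Seq(Instance(label, 1.2, features), Instance(label, 0.8, features),
          Instance(0.0, 0.0, features)) // zero-weight row: its label should not matter
      } else {
        Seq(Instance(label, 0.3, features), Instance(1.0, 0.0, features),
          Instance(label, 1.1, features), Instance(label, 0.6, features))
      }
    }
    (overSampled, weighted)
  }

  def main(args: Array[String]): Unit = {
    val base = Seq(LabeledPoint(1.0, Seq(0.5)), LabeledPoint(3.0, Seq(1.5)))
    val (overSampled, weighted) = overSampledAndWeighted(base, seed = 8392L)
    // The two label sums should agree up to floating-point rounding.
    println(overSampled.map(_.label).sum)
    println(weighted.map(i => i.label * i.weight).sum)
  }
}
```

With a helper of this kind, the classifier and regressor suites would only need to wrap the two returned collections in DataFrames before fitting.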




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146995030
  
    **[Test build #43482 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43482/console)** for PR 9008 at commit [`33982fb`](https://github.com/apache/spark/commit/33982fb2a852a33f90a67bef602fef0cc494655f) after a configured wait of `250m`.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53090000
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
    @@ -275,6 +278,63 @@ class DecisionTreeClassifierSuite extends SparkFunSuite with MLlibTestSparkConte
         val model = dt.fit(df)
       }
     
    +  test("training with weighted data") {
    +    val (dataset, weightedDataset) = {
    +      val testData1 = TreeTests.generateNoisyData(5, 123)
    +      val testData2 = TreeTests.generateNoisyData(5, 456)
    +
    +      // Over-sample the 1st dataset twice.
    +      val overSampledTestData1 = testData1.flatMap {
    +        labeledPoint => Iterator(labeledPoint, labeledPoint)
    +      }
    +
    +      val rnd = new Random(8392)
    +      val weightedTestData1 = testData1.flatMap {
    +        case LabeledPoint(label: Double, features: Vector) => {
    +          if (rnd.nextGaussian() > 0.0) {
    +            Iterator(
    +              Instance(label, 1.2, features),
    +              Instance(label, 0.8, features),
    +              Instance(0.0, 0.0, features))
    +          } else {
    +            Iterator(
    +              Instance(label, 0.3, features),
    +              Instance(1, 0.0, features),
    +              Instance(label, 1.1, features),
    +              Instance(label, 0.6, features))
    +          }
    +        }
    +      }
    +      val weightedTestData2 = testData2.map {
    +        p: LabeledPoint => Instance(p.label, 1, p.features)
    +      }
    +
    +      (sqlContext.createDataFrame(sc.parallelize(overSampledTestData1 ++ testData2, 2)),
    +        sqlContext.createDataFrame(sc.parallelize(weightedTestData1 ++ weightedTestData2, 2)))
    +    }
    +
    +    val labelIndexer = new StringIndexer()
    +      .setInputCol("label")
    +      .setOutputCol("indexedLabel")
    +      .fit(dataset)
    +
    +    val featureIndexer = new VectorIndexer()
    +      .setInputCol("features")
    +      .setOutputCol("indexedFeatures")
    +      .setMaxCategories(4)
    +      .fit(dataset)
    +
    +    val dt = new DecisionTreeClassifier()
    +      .setLabelCol("indexedLabel")
    +      .setFeaturesCol("indexedFeatures")
    +
    +    val model1 = dt.fit(featureIndexer.transform(labelIndexer.transform(dataset)))
    +    val model2 = dt.fit(featureIndexer.transform(labelIndexer.transform(weightedDataset)),
    +      dt.weightCol->"weight")
    --- End diff --
    
    need spaces around `->`




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147461393
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53091878
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala ---
    @@ -182,6 +184,53 @@ class RandomForestClassifierSuite extends SparkFunSuite with MLlibTestSparkConte
         assert(mostImportantFeature === 1)
       }
     
    +  test("training with weighted data") {
    +    val (dataset, testDataset) = {
    +      val keyFeature = Vectors.dense(0, 1.0, 2, 1.2)
    +      val data0 = Array.fill(20)(Instance(0, 0.1, keyFeature))
    +      val data1 = Array.fill(10)(Instance(1, 20.0, keyFeature))
    +
    +      val testData = Seq(Instance(0, 0.1, keyFeature))
    +      (sqlContext.createDataFrame(sc.parallelize(data0 ++ data1, 2)),
    +        sqlContext.createDataFrame(sc.parallelize(testData, 2)))
    +    }
    +
    +    val labelIndexer = new StringIndexer()
    +      .setInputCol("label")
    +      .setOutputCol("indexedLabel")
    +      .fit(dataset)
    +
    +    val featureIndexer = new VectorIndexer()
    +      .setInputCol("features")
    +      .setOutputCol("indexedFeatures")
    +      .setMaxCategories(4)
    +      .fit(dataset)
    +
    +    val rf = new RandomForestClassifier()
    +      .setLabelCol("indexedLabel")
    +      .setFeaturesCol("indexedFeatures")
    +      .setSeed(1)
    +
    +    val labelConverter = new IndexToString()
    +      .setInputCol("prediction")
    +      .setOutputCol("predictedLabel")
    +      .setLabels(labelIndexer.labels)
    +
    +    val pipeline = new Pipeline()
    +      .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
    +
    +    val model1 = pipeline.fit(dataset)
    +    val model2 = pipeline.fit(dataset, rf.weightCol->"weight")
    --- End diff --
    
    ditto: space around `->`
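
The intent of the weighted test in the diff above can be illustrated without Spark. The sketch below is only an illustration, not the PR's code: `WeightedExample` and `majorityLabel` are hypothetical names, and the counts and weights are copied from `data0` and `data1` in the test. It shows why the weighted and unweighted models should disagree on this data: label 0 has more rows, but label 1 carries far more total weight.

```scala
// Hypothetical stand-in for a weighted training row; this is not Spark's Instance class.
final case class WeightedExample(label: Int, weight: Double)

object WeightedMajoritySketch {
  /** Returns the label with the largest total weight. */
  def majorityLabel(data: Seq[WeightedExample]): Int =
    data.groupBy(_.label)
      .map { case (label, rows) => label -> rows.map(_.weight).sum }
      .maxBy(_._2)._1

  // Same counts and weights as data0 and data1 in the test above.
  val data0: Seq[WeightedExample] = Seq.fill(20)(WeightedExample(0, 0.1))
  val data1: Seq[WeightedExample] = Seq.fill(10)(WeightedExample(1, 20.0))

  // Ignoring the weight column is the same as giving every row weight 1.0.
  val unweighted: Int = majorityLabel((data0 ++ data1).map(_.copy(weight = 1.0)))

  // With weights, label 1 dominates even though it has fewer rows.
  val weighted: Int = majorityLabel(data0 ++ data1)
}
```

Under these assumptions `unweighted` is label 0 and `weighted` is label 1, which is the kind of difference a test comparing `model1` and `model2` can check for.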




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147272648
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147282536
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147021257
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43503/




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by rotationsymmetry <gi...@git.apache.org>.
Github user rotationsymmetry commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147461337
  
    The Jenkins test failures are unrelated to this patch. Let's try again.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r41573997
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1211,4 +1212,34 @@ private[ml] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Inject the sample weight to sub-sample weights of the baggedPoints
    +   */
    +  private[impl] def reweightSubSampleWeights(
    +      baggedTreePoints: RDD[BaggedPoint[TreePoint]]): RDD[BaggedPoint[TreePoint]] = {
    +    baggedTreePoints.map {bagged =>
    +      val treePoint = bagged.datum
    +      val adjustedSubSampleWeights = bagged.subsampleWeights.map(w => w * treePoint.weight)
    +      new BaggedPoint[TreePoint](treePoint, adjustedSubSampleWeights)
    +    }
    +  }
    +
    +  /**
    +   * A thin adaptor to [[org.apache.spark.mllib.tree.impl.DecisionTreeMetadata.buildMetadata]]
    +   */
    +  private[impl] def buildWeightedMetadata(
    +      input: RDD[WeightedLabeledPoint],
    +      strategy: OldStrategy,
    +      numTrees: Int,
    +      featureSubsetStrategy: String) = {
    --- End diff --
    
    Should specify return type here.
    
    Is the reason you can't just modify `buildMetadata` to accept an `RDD[WeightedLabeledPoint]` that you are trying not to change the MLlib implementation?
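
A thin adaptor with an explicit return type might look like the sketch below. Everything in it is hypothetical: the `*Sketch` types and both function bodies are simplified stand-ins on plain collections rather than RDDs, and the real `buildWeightedMetadata` is truncated in the diff, so this does not reproduce it. The sketch only shows the shape being discussed: drop the weight, delegate to the existing unweighted entry point, and declare the result type on the signature.

```scala
// Hypothetical simplified types; the real Spark classes differ.
final case class LabeledPointSketch(label: Double, features: Vector[Double])
final case class WeightedLabeledPointSketch(
    label: Double,
    weight: Double,
    features: Vector[Double])
final case class MetadataSketch(numExamples: Long, numFeatures: Int)

object MetadataAdaptorSketch {
  // Stand-in for an existing unweighted entry point that is left unchanged.
  def buildMetadata(input: Seq[LabeledPointSketch]): MetadataSketch =
    MetadataSketch(
      numExamples = input.size.toLong,
      numFeatures = input.headOption.map(_.features.size).getOrElse(0))

  // Thin adaptor: strips the weight and delegates.
  // The return type is written out instead of being left to inference.
  def buildWeightedMetadata(input: Seq[WeightedLabeledPointSketch]): MetadataSketch =
    buildMetadata(input.map(p => LabeledPointSketch(p.label, p.features)))
}
```

Writing the return type explicitly keeps the signature stable if the body changes later, which is the usual motivation for this kind of review comment.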




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147493385
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43572/




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146098411
  
      [Test build #43318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43318/console) for   PR 9008 at commit [`367a443`](https://github.com/apache/spark/commit/367a443761f46a4acbd911f5c3e84902079596ac).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146995069
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147013614
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146773792
  
      [Test build #43460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43460/consoleFull) for   PR 9008 at commit [`0ffbdd0`](https://github.com/apache/spark/commit/0ffbdd06951bc62b85483580ec68369b0d2a7191).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154858834
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146098500
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53089711
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
    @@ -41,7 +41,7 @@ import org.apache.spark.sql.functions._
     @Experimental
     final class RandomForestRegressor @Since("1.4.0") (@Since("1.4.0") override val uid: String)
       extends Predictor[Vector, RandomForestRegressor, RandomForestRegressionModel]
    -  with RandomForestParams with TreeRegressorParams {
    +  with RandomForestParams with TreeRegressorParams with HasWeightCol{
    --- End diff --
    
    Space after `HasWeightCol`




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146942850
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-198428763
  
    cc @MLnick thoughts on the above comments?




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53097743
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1171,4 +1173,28 @@ private[ml] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Inject the sample weight to sub-sample weights of the baggedPoints
    +   */
    +  private[impl] def reweightSubSampleWeights(
    --- End diff --
    
    There is a TODO in _BaggedPoint.scala_ for accepting weighted instances. This might be a good time to address that. If not, we will have to implement this in this JIRA, fix `BaggedPoint` in another JIRA, and then return to this, likely in a third JIRA. Thoughts?
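
For reference, the reweighting step in the diff above can be sketched on plain Scala collections. The types below are hypothetical, simplified stand-ins (the real `TreePoint` and `BaggedPoint` carry more fields and live in RDDs); the only behaviour shown is the one visible in the diff, where each per-tree subsample weight is multiplied by the instance weight.

```scala
// Hypothetical simplified types, not the Spark classes of similar names.
final case class TreePointSketch(label: Double, weight: Double)
final case class BaggedPointSketch(datum: TreePointSketch, subsampleWeights: Array[Double])

object BaggedWeightSketch {
  /** Multiplies every per-tree subsample weight by the point's instance weight. */
  def reweight(points: Seq[BaggedPointSketch]): Seq[BaggedPointSketch] =
    points.map { bagged =>
      val instanceWeight = bagged.datum.weight
      bagged.copy(subsampleWeights = bagged.subsampleWeights.map(_ * instanceWeight))
    }
}
```

For example, a point with instance weight 2.0 that was drawn 1, 0 and 3 times for three trees ends up with subsample weights 2.0, 0.0 and 6.0. If `BaggedPoint` accepted weighted instances directly, this separate pass would presumably no longer be needed.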




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146073241
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147102028
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147531959
  
      [Test build #43588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43588/consoleFull) for   PR 9008 at commit [`c1785a8`](https://github.com/apache/spark/commit/c1785a8f3055bc48ce480b827befcb27812f0449).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146073233
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53092821
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala ---
    @@ -275,6 +278,63 @@ class DecisionTreeClassifierSuite extends SparkFunSuite with MLlibTestSparkConte
         val model = dt.fit(df)
       }
     
    +  test("training with weighted data") {
    +    val (dataset, weightedDataset) = {
    +      val testData1 = TreeTests.generateNoisyData(5, 123)
    +      val testData2 = TreeTests.generateNoisyData(5, 456)
    +
    +      // Over-sample the 1st dataset twice.
    +      val overSampledTestData1 = testData1.flatMap {
    +        labeledPoint => Iterator(labeledPoint, labeledPoint)
    +      }
    +
    +      val rnd = new Random(8392)
    +      val weightedTestData1 = testData1.flatMap {
    --- End diff --
    
    It is not obvious what the code is doing here. A comment might be useful. Is the idea that you are effectively weighting each sample by 2x by using instance weights?
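
The equivalence this test appears to rely on can be checked in isolation. The sketch below is an assumption-laden illustration, not the PR's code: `gini` is a hypothetical helper computing a weighted Gini impurity over `(label, weight)` pairs, and the three labels are made up. The piece weights 0.3, 0.0, 1.1 and 0.6 are the ones that appear in the test's `Instance(...)` calls; they sum to 2.0, so splitting a row into those pieces should give the same impurity as duplicating the row.

```scala
object WeightEquivalenceSketch {
  /** Weighted Gini impurity over (label, weight) pairs. */
  def gini(data: Seq[(Int, Double)]): Double = {
    val total = data.map(_._2).sum
    val perLabel = data.groupBy(_._1).values.map(_.map(_._2).sum)
    1.0 - perLabel.map(w => (w / total) * (w / total)).sum
  }

  // Made-up labels for illustration only.
  val labels: Seq[Int] = Seq(0, 0, 1)

  // Over-sampling: every row appears twice with unit weight.
  val overSampled: Seq[(Int, Double)] =
    labels.flatMap(l => Seq((l, 1.0), (l, 1.0)))

  // Weighting: every row is split into four pieces whose weights sum to 2.0.
  val pieceWeights: Seq[Double] = Seq(0.3, 0.0, 1.1, 0.6)
  val weighted: Seq[(Int, Double)] =
    labels.flatMap(l => pieceWeights.map(w => (l, w)))

  // Compared with a tolerance because the weights are floating-point sums.
  val sameImpurity: Boolean =
    math.abs(gini(overSampled) - gini(weighted)) < 1e-9
}
```

Because a split criterion of this kind depends only on per-label weight totals, a tree trained on the weighted rows would be expected to match one trained on the over-sampled rows, up to floating-point error.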




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146773042
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147122396
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147461420
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146943400
  
      [Test build #43482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43482/consoleFull) for   PR 9008 at commit [`33982fb`](https://github.com/apache/spark/commit/33982fb2a852a33f90a67bef602fef0cc494655f).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154341728
  
    Build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147272653
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147493291
  
      [Test build #43572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43572/console) for   PR 9008 at commit [`822382e`](https://github.com/apache/spark/commit/822382e42c323d98dfdf9a23cbb8f92c5708e053).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147102033
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154875214
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by rotationsymmetry <gi...@git.apache.org>.
Github user rotationsymmetry commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-185364988
  
    @sethah Thank you very much for your review. I will incorporate the changes in the next few days. Regarding the TODO in BaggedPoint.scala, I want to look into the details to find out the scope of the change. 




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154341714
  
     Build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-187920267
  
    Another issue is that the information gain for candidate splits is not computed correctly with fractional samples. This is because the information gain calculation [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L636) uses the sample counts, which are converted to `Long`. This produces incorrect results in general, and `NaN` values when the total count is less than 1. The `count` function [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala#L149) should return a `Double` instead. Can we add a test to ensure that the trees are invariant under constant multiplication of the weights?
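
The truncation problem described in this comment can be illustrated with a minimal, self-contained sketch. This is not Spark's code: `WeightedGainSketch`, `gini`, `gainWithLongCounts`, and `gainWithDoubleCounts` are hypothetical names chosen for illustration, and Gini impurity is assumed as the impurity measure.

```scala
// Illustrative sketch only; object and method names are hypothetical, not Spark APIs.
object WeightedGainSketch {

  /** Gini impurity of a node from its per-class weighted counts. */
  def gini(classWeights: Array[Double]): Double = {
    val total = classWeights.sum
    if (total == 0.0) 0.0
    else 1.0 - classWeights.map(w => (w / total) * (w / total)).sum
  }

  /** Gain of a split when the child counts are truncated to Long before weighting. */
  def gainWithLongCounts(left: Array[Double], right: Array[Double]): Double = {
    val parent = left.zip(right).map { case (l, r) => l + r }
    val leftCount: Long = left.sum.toLong
    val rightCount: Long = right.sum.toLong
    val totalCount: Long = leftCount + rightCount
    // With a total weight below 1, both counts truncate to 0, so this is 0.0 / 0.0 = NaN.
    val leftWeight = leftCount / totalCount.toDouble
    val rightWeight = rightCount / totalCount.toDouble
    gini(parent) - leftWeight * gini(left) - rightWeight * gini(right)
  }

  /** Gain of the same split when the child counts stay as Double. */
  def gainWithDoubleCounts(left: Array[Double], right: Array[Double]): Double = {
    val parent = left.zip(right).map { case (l, r) => l + r }
    val leftCount = left.sum
    val rightCount = right.sum
    val totalCount = leftCount + rightCount
    gini(parent) - (leftCount / totalCount) * gini(left) - (rightCount / totalCount) * gini(right)
  }
}
```

For a two-class split with left weights `(0.3, 0.0)` and right weights `(0.0, 0.3)`, the `Long` variant yields `NaN`, while the `Double` variant yields the same gain whether or not all weights are multiplied by a constant, which is the invariance the requested test would check.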




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147021256
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146073764
  
      [Test build #43318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43318/consoleFull) for   PR 9008 at commit [`367a443`](https://github.com/apache/spark/commit/367a443761f46a4acbd911f5c3e84902079596ac).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154858824
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147553264
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43588/
    Test PASSed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146995071
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43482/
    Test FAILed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146773059
  
    Merged build started.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53092853
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala ---
    @@ -101,6 +104,43 @@ class RandomForestRegressorSuite extends SparkFunSuite with MLlibTestSparkContex
         assert(mostImportantFeature === 1)
       }
     
    +  test("training with weighted data") {
    +    val (dataset, testDataset) = {
    +      val keyFeature = Vectors.dense(0, 1.0, 2, 1.2)
    +      val data0 = Array.fill(10)(Instance(10, 0.1, keyFeature))
    +      val data1 = Array.fill(10)(Instance(20, 20.0, keyFeature))
    +
    +      val testData = Seq(Instance(0, 1, keyFeature))
    +      (sqlContext.createDataFrame(sc.parallelize(data0 ++ data1, 2)),
    +        sqlContext.createDataFrame(sc.parallelize(testData, 2)))
    +    }
    +
    +    val featureIndexer = new VectorIndexer()
    +      .setInputCol("features")
    +      .setOutputCol("indexedFeatures")
    +      .setMaxCategories(4)
    +      .fit(dataset)
    +
    +    val rf = new RandomForestRegressor()
    +      .setFeaturesCol("indexedFeatures")
    +      .setPredictionCol("predictedLabel")
    +      .setSeed(1)
    +
    +    val pipeline = new Pipeline()
    +      .setStages(Array(featureIndexer, rf))
    +
    +    val model1 = pipeline.fit(dataset)
    +    val model2 = pipeline.fit(dataset, rf.weightCol->"weight")
    --- End diff --
    
    ditto




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by rotationsymmetry <gi...@git.apache.org>.
Github user rotationsymmetry commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147574091
  
    @sethah I have incorporated your comments in the latest patch. Thank you!
    
    @jkbradley Do you have any comments or suggestions? Much appreciated. 




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147463491
  
      [Test build #43572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43572/consoleFull) for   PR 9008 at commit [`822382e`](https://github.com/apache/spark/commit/822382e42c323d98dfdf9a23cbb8f92c5708e053).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by fabboe <gi...@git.apache.org>.
Github user fabboe commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-184901410
  
    Thanks for working on this!
    
    Minor: the PR title says `class weights`, but what is actually implemented is `sample weights`.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154381940
  
    Build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146098501
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43318/
    Test PASSed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by rotationsymmetry <gi...@git.apache.org>.
Github user rotationsymmetry commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154606846
  
    retest this please




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154343014
  
      [Test build #45206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45206/consoleFull) for   PR 9008 at commit [`c1785a8`](https://github.com/apache/spark/commit/c1785a8f3055bc48ce480b827befcb27812f0449).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147102431
  
      [Test build #43531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/consoleFull) for   PR 9008 at commit [`8f35057`](https://github.com/apache/spark/commit/8f350577ca7ceeadd9ea74570d19407784e49fa4).




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-193978528
  
    @rotationsymmetry: Will you have time to work on this? I am more than happy to send a PR to your PR if you do not have time.
    
    @jkbradley @dbtsai Would you mind chiming in on the issue mentioned above about minimum instances per node?




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-146775984
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53089674
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.scala ---
    @@ -17,18 +17,20 @@
     
     package org.apache.spark.ml.regression
     
    +
    --- End diff --
    
    nit: remove extra line




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147021241
  
      [Test build #43503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43503/console) for   PR 9008 at commit [`3273ed4`](https://github.com/apache/spark/commit/3273ed4c770072bf6fbc0127c57c96673f4d9d23).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ChildProcAppHandle implements SparkAppHandle `
      * `abstract class LauncherConnection implements Closeable, Runnable `
      * `final class LauncherProtocol `
      * `  static class Message implements Serializable `
      * `  static class Hello extends Message `
      * `  static class SetAppId extends Message `
      * `  static class SetState extends Message `
      * `  static class Stop extends Message `
      * `class LauncherServer implements Closeable `
      * `class NamedThreadFactory implements ThreadFactory `
      * `class OutputRedirector `





[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-187524117
  
    I noticed a problem with the current implementation regarding the `minInstancesPerNode` parameter. The number of _instances_ in each node is now a weighted count, and the weights can have an arbitrary scale. For example, training with uniform weights of 1.0 will build a different tree than training with uniform weights of 1.0 / N (where N is the number of samples). I suppose there are a number of ways to mitigate this.
    
    I checked scikit-learn, and they track the actual raw (unweighted) sample counts as well as the sample weights. They use `min_samples_leaf` to compute validity based on raw counts, and `min_weight_fraction_leaf` to compute validity based on weighted counts. This will not be possible under the current implementation here because we lose the raw counts when we convert `unadjustedBaggedInput` to `baggedInput`. We could compare weighted split counts against `minInstancesPerNode / N`, where N is the number of training samples, or we could adjust the `BaggedPoint` class to store both counts and weights and proceed à la scikit-learn. I'm not sure which is best. Thoughts?
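
The scikit-learn-style option mentioned in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the PR's or Spark's implementation: `SplitValiditySketch`, `ChildStats`, `isValidSplit`, and the `minWeightFractionPerNode` argument are hypothetical names, and each child node is assumed to carry both its raw instance count and its summed sample weight.

```scala
// Illustrative sketch only; all names here are hypothetical, not Spark APIs.
object SplitValiditySketch {

  /** Per-child statistics: raw (unweighted) instance count and summed sample weight. */
  final case class ChildStats(rawCount: Long, weightSum: Double)

  /**
   * A split is accepted only if both children satisfy two independent checks:
   * a minimum number of raw instances, and a minimum fraction of the total weight.
   */
  def isValidSplit(
      left: ChildStats,
      right: ChildStats,
      minInstancesPerNode: Int,
      minWeightFractionPerNode: Double,
      totalWeight: Double): Boolean = {
    // The raw-count check does not depend on the scale of the weights.
    val enoughInstances =
      left.rawCount >= minInstancesPerNode && right.rawCount >= minInstancesPerNode
    // The weight check is relative to the total weight, so rescaling all weights
    // by a constant leaves the outcome unchanged.
    val minWeight = minWeightFractionPerNode * totalWeight
    val enoughWeight = left.weightSum >= minWeight && right.weightSum >= minWeight
    enoughInstances && enoughWeight
  }
}
```

Because the instance check uses raw counts and the weight check is expressed as a fraction of the total weight, uniform weights of 1.0 and uniform weights of 1.0 / N would give the same accept or reject decision for every candidate split in this sketch.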




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147553172
  
      [Test build #43588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43588/console) for   PR 9008 at commit [`c1785a8`](https://github.com/apache/spark/commit/c1785a8f3055bc48ce480b827befcb27812f0449).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-184920013
  
    @rotationsymmetry I made a pass on this, mostly minor comments. Thanks for working on this, it would be great to get it merged in! 




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147493384
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #9008: [SPARK-9478] [ml] Add class weights to Random Fore...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9008




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147282474
  
      [Test build #43553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/console) for   PR 9008 at commit [`bd316d6`](https://github.com/apache/spark/commit/bd316d6bf6bfb2cbe811c1b0b937c494c1acf273).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9008#discussion_r53089599
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala ---
    @@ -41,7 +41,7 @@ import org.apache.spark.sql.functions._
     @Experimental
     final class RandomForestClassifier(override val uid: String)
       extends ProbabilisticClassifier[Vector, RandomForestClassifier, RandomForestClassificationModel]
    -  with RandomForestParams with TreeClassifierParams {
    +  with RandomForestParams with TreeClassifierParams with HasWeightCol{
    --- End diff --
    
    nit: Space after `HasWeightCol`




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147530617
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147282535
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-147014205
  
    [Test build #43503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43503/consoleFull) for PR 9008 at commit [`3273ed4`](https://github.com/apache/spark/commit/3273ed4c770072bf6fbc0127c57c96673f4d9d23).


[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154875163
  
    **[Test build #45315 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45315/consoleFull)** for PR 9008 at commit [`32f4548`](https://github.com/apache/spark/commit/32f4548a22aaf2079dabef0743342d81bef7750f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `abstract class Writer extends BaseReadWrite`
       * `trait Writable`
       * `abstract class Reader[T] extends BaseReadWrite`
       * `trait Readable[T]`
       * `case class GetInternalRowField(child: Expression, ordinal: Int, dataType: DataType)`
       * `case class Expand(`


[GitHub] spark issue #9008: [SPARK-9478] [ml] Add class weights to Random Forest

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/9008
  
    @rotationsymmetry Could you please close this?


[GitHub] spark pull request: [SPARK-9478] [ml] Add class weights to Random ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9008#issuecomment-154381942
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45206/

