Posted to reviews@spark.apache.org by hhbyyh <gi...@git.apache.org> on 2016/02/06 09:26:53 UTC

[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/11102

    [SPARK-13223] [ML] Add stratified sampling to ML feature engineering

    jira: https://issues.apache.org/jira/browse/SPARK-13223
    
    I found it useful to add a sampling transformer while working on a fraud detection case. It can be used for resampling or oversampling, which in turn are required by ensemble methods and by unbalanced-data processing.
    Internally, it invokes sampleByKey from the pair RDD operations.
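
As a rough illustration of the underlying call, here is a minimal sketch of per-key sampling with `sampleByKey` on a pair RDD. It is not the code from this pull request: `SampleByKeySketch`, `stratifiedSample`, the labels, and the fractions are hypothetical names and values chosen for the example, and a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SampleByKeySketch {
  // Sample every stratum (key) at its own rate. With replacement, a fraction
  // above 1.0 is the expected number of copies of each element, which is what
  // oversampling a rare class relies on.
  def stratifiedSample(
      records: RDD[(String, Double)],
      withReplacement: Boolean,
      fractions: Map[String, Double],
      seed: Long): RDD[(String, Double)] =
    records.sampleByKey(withReplacement, fractions, seed)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sampleByKey-sketch")
    val sc = new SparkContext(conf)
    try {
      // Illustrative unbalanced data: many "legit" records, few "fraud" records.
      val records = sc.parallelize(
        Seq.fill(1000)(("legit", 0.0)) ++ Seq.fill(20)(("fraud", 1.0)))
      // Downsample the majority class and oversample the minority class.
      val sampled = stratifiedSample(
        records, withReplacement = true, Map("legit" -> 0.1, "fraud" -> 5.0), seed = 42L)
      sampled.countByKey().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The sketch supplies a fraction for every key that occurs in the data; sampled counts are approximate, so no exact output is shown.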

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark stratifiedSampling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11102.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11102
    
----
commit 022c8367a28ade4529d522e4fffe0896e75336da
Author: Yuhao Yang <hh...@gmail.com>
Date:   2016-02-06T08:11:26Z

    add stratifiedSampling and ut

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696679
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    --- End diff --
    
    Import this inside the class.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696687
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + *
    + * Stratified sampling on the DataFrame according to the keys in a specific label column. User
    + * can set 'fraction' to set different sampling rate for each key.
    + *
    + * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
    + * @param fraction expected size of the sample as a fraction of the items
    + *  without replacement: probability that each element is chosen; fraction must be [0, 1]
    + *  with replacement: expected number of times each element is chosen; fraction must be >= 0
    + */
    +@Experimental
    +final class StratifiedSampling private(
    --- End diff --
    
    * `StratifiedSampler`
    * space after `private`




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183861726
  
    **[Test build #51260 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51260/consoleFull)** for PR 11102 at commit [`853fb96`](https://github.com/apache/spark/commit/853fb967ae845aabb6f70b58b339d2e2152c1906).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696695
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + *
    + * Stratified sampling on the DataFrame according to the keys in a specific label column. User
    + * can set 'fraction' to set different sampling rate for each key.
    + *
    + * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
    + * @param fraction expected size of the sample as a fraction of the items
    + *  without replacement: probability that each element is chosen; fraction must be [0, 1]
    + *  with replacement: expected number of times each element is chosen; fraction must be >= 0
    + */
    +@Experimental
    +final class StratifiedSampling private(
    +    override val uid: String,
    +    val withReplacement: Boolean,
    +    val fraction: Map[String, Double])
    +  extends Transformer with HasLabelCol with HasSeed with DefaultParamsWritable {
    +
    +  @Since("2.0.0")
    +  def this(withReplacement: Boolean, fraction: Map[String, Double]) =
    +    this(Identifiable.randomUID("stratifiedSampling"), withReplacement, fraction)
    +
    +  /** @group setParam */
    +  @Since("2.0.0")
    +  def setSeed(value: Long): this.type = set(seed, value)
    +
    +  /** @group setParam */
    +  @Since("2.0.0")
    +  def setLabel(value: String): this.type = set(labelCol, value)
    +
    +  setDefault(seed -> Utils.random.nextLong)
    +
    +  @Since("2.0.0")
    +  override def transform(data: DataFrame): DataFrame = {
    +    transformSchema(data.schema)
    +    val schema = data.schema
    +    val colId = schema.fieldIndex($(labelCol))
    +    val result = data.rdd.map(r => (r.get(colId), r))
    --- End diff --
    
    `DataFrame` has `sampleBy` implemented. We can just use that.
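
For reference, a minimal sketch of what delegating to `sampleBy` might look like. It assumes Spark 2.0 or later (for `SparkSession`); `SampleBySketch`, `stratifiedSample`, and the column names are hypothetical and not part of this pull request.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleBySketch {
  // Stratified sampling without replacement: a row is kept with the
  // probability listed for its label. Labels missing from `fractions`
  // are treated as fraction 0.0 by sampleBy.
  def stratifiedSample(
      df: DataFrame,
      labelCol: String,
      fractions: Map[String, Double],
      seed: Long): DataFrame =
    df.stat.sampleBy(labelCol, fractions, seed)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sampleBy-sketch").getOrCreate()
    try {
      // Illustrative data: (label, value) rows with an unbalanced label column.
      val rows = Seq.fill(1000)(("legit", 0.0)) ++ Seq.fill(20)(("fraud", 1.0))
      val df = spark.createDataFrame(rows).toDF("label", "value")
      val sampled = stratifiedSample(df, "label", Map("legit" -> 0.1, "fraud" -> 1.0), seed = 42L)
      sampled.groupBy("label").count().show()
    } finally {
      spark.stop()
    }
  }
}
```

Because `sampleBy` takes no `withReplacement` flag, this route covers downsampling only.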




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183632084
  
    **[Test build #51239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51239/consoleFull)** for PR 11102 at commit [`78c80d7`](https://github.com/apache/spark/commit/78c80d78afa94c6522f81339ffb6a6bef315eee0).




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-209558647
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-209546381
  
    I tried to make it support multiple data types. As shown in the code, the main constraint is the save/load implementation. It would be great if save/load could support keys of type `Any`.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183976261
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51276/




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-205640056
  
    @mengxr I'd appreciate it if you could find time to look at this.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183976257
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-209558521
  
    **[Test build #55728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55728/consoleFull)** for PR 11102 at commit [`8f8e747`](https://github.com/apache/spark/commit/8f8e74797c8699881741be2d492560e4c49ebb7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183625102
  
    @mengxr Thanks for the review. Sorry for the late response, I was on a flight.
    
    It's great to know about `DataFrame.stat.sampleBy`. One concern is that it does not allow `withReplacement = true`, which means oversampling is not supported.
    
    For prediction, are you worried that users need to use the same PipelineModel for both the training and the evaluation dataset? I would propose explicitly allowing each stage of a PipelineModel to be enabled or disabled, so that users can skip specific steps in a pipeline.





[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-209546847
  
    **[Test build #55728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55728/consoleFull)** for PR 11102 at commit [`8f8e747`](https://github.com/apache/spark/commit/8f8e74797c8699881741be2d492560e4c49ebb7f).




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696690
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + *
    + * Stratified sampling on the DataFrame according to the keys in a specific label column. User
    + * can set 'fraction' to set different sampling rate for each key.
    + *
    + * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
    + * @param fraction expected size of the sample as a fraction of the items
    + *  without replacement: probability that each element is chosen; fraction must be [0, 1]
    + *  with replacement: expected number of times each element is chosen; fraction must be >= 0
    + */
    +@Experimental
    +final class StratifiedSampling private(
    +    override val uid: String,
    +    val withReplacement: Boolean,
    +    val fraction: Map[String, Double])
    +  extends Transformer with HasLabelCol with HasSeed with DefaultParamsWritable {
    +
    +  @Since("2.0.0")
    +  def this(withReplacement: Boolean, fraction: Map[String, Double]) =
    +    this(Identifiable.randomUID("stratifiedSampling"), withReplacement, fraction)
    +
    +  /** @group setParam */
    +  @Since("2.0.0")
    +  def setSeed(value: Long): this.type = set(seed, value)
    +
    +  /** @group setParam */
    +  @Since("2.0.0")
    +  def setLabel(value: String): this.type = set(labelCol, value)
    +
    +  setDefault(seed -> Utils.random.nextLong)
    --- End diff --
    
    This is not necessary. We assign a fixed seed based on the hash value of the class name.
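
The idea of a fixed, class-name-based seed can be pictured with the following plain-Scala sketch. It is not Spark's own code: `SeedSketch` and `defaultSeed` are hypothetical names, and the exact expression Spark uses should be checked against its shared-params source.

```scala
object SeedSketch {
  // Derive a seed from the fully qualified class name. String.hashCode is
  // specified by the Java platform, so the value is identical on every run
  // and every machine, unlike a seed drawn from a random generator.
  def defaultSeed(cls: Class[_]): Long = cls.getName.hashCode.toLong

  def main(args: Array[String]): Unit = {
    val first = defaultSeed(classOf[java.util.Random])
    val second = defaultSeed(classOf[java.util.Random])
    println(s"seed is stable across calls: ${first == second}")
  }
}
```

A default of this kind keeps results reproducible unless the user explicitly calls `setSeed`, which is why an extra random default in the class is redundant.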




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183861970
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696683
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + *
    + * Stratified sampling on the DataFrame according to the keys in a specific label column. User
    + * can set 'fraction' to set different sampling rate for each key.
    + *
    + * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
    + * @param fraction expected size of the sample as a fraction of the items
    + *  without replacement: probability that each element is chosen; fraction must be [0, 1]
    + *  with replacement: expected number of times each element is chosen; fraction must be >= 0
    + */
    --- End diff --
    
    Add `@see` to link to `DataFrame.sampleBy` so we don't need to document the behavior for keys not appearing in the map.




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696689
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + *
    + * Stratified sampling on the DataFrame according to the keys in a specific label column. User
    + * can set 'fraction' to set different sampling rate for each key.
    + *
    + * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
    + * @param fraction expected size of the sample as a fraction of the items
    + *  without replacement: probability that each element is chosen; fraction must be [0, 1]
    + *  with replacement: expected number of times each element is chosen; fraction must be >= 0
    + */
    +@Experimental
    +final class StratifiedSampling private(
    +    override val uid: String,
    +    val withReplacement: Boolean,
    +    val fraction: Map[String, Double])
    +  extends Transformer with HasLabelCol with HasSeed with DefaultParamsWritable {
    +
    +  @Since("2.0.0")
    +  def this(withReplacement: Boolean, fraction: Map[String, Double]) =
    --- End diff --
    
    Add a Java-friendly constructor. The `Map` here is the Scala `Map`, which is awkward to build from Java.
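
One possible shape for such a constructor, as a minimal sketch in plain Scala. `SamplerSketch` is a hypothetical stand-in for the class under review, and it assumes `scala.collection.JavaConverters`, which is deprecated from Scala 2.13 on but still available.

```scala
import scala.collection.JavaConverters._

// Hypothetical stand-in for the transformer; only the constructors matter here.
final class SamplerSketch(
    val withReplacement: Boolean,
    val fraction: Map[String, Double]) {

  // Java-friendly auxiliary constructor: accepts java.util.Map with boxed
  // doubles and converts it to an immutable Scala Map before delegating
  // to the primary constructor.
  def this(withReplacement: Boolean, fraction: java.util.Map[String, java.lang.Double]) =
    this(withReplacement, fraction.asScala.map { case (k, v) => k -> v.doubleValue }.toMap)
}

object SamplerSketchDemo {
  def main(args: Array[String]): Unit = {
    // Build the map the way a Java caller would.
    val javaFractions = new java.util.HashMap[String, java.lang.Double]()
    javaFractions.put("fraud", 0.5)
    val sampler = new SamplerSketch(false, javaFractions)
    println(sampler.fraction)
  }
}
```

The two constructors do not clash after erasure because `scala.collection.immutable.Map` and `java.util.Map` are distinct types.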




[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-180720697
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183144960
  
    @hhbyyh I think this is good to add. However, we need to think about the behavior during prediction. We certainly don't want to apply sampling during prediction, but the current pipeline implementation does not address this. This need not be solved in this PR; I just want to collect some ideas about the issue.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183966788
  
    **[Test build #51276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51276/consoleFull)** for PR 11102 at commit [`525a80d`](https://github.com/apache/spark/commit/525a80d116ee9bf106118bfbfb1bcd8c67a5d62e).


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-209558650
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55728/
    Test PASSed.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183632594
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51239/
    Test FAILed.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183632593
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183632590
  
    **[Test build #51239 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51239/consoleFull)** for PR 11102 at commit [`78c80d7`](https://github.com/apache/spark/commit/78c80d78afa94c6522f81339ffb6a6bef315eee0).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-180720696
  
    **[Test build #50867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50867/consoleFull)** for PR 11102 at commit [`022c836`](https://github.com/apache/spark/commit/022c8367a28ade4529d522e4fffe0896e75336da).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  class StratifiedSamplingWriter(instance: StratifiedSampling) extends MLWriter `


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-180720332
  
    **[Test build #50867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50867/consoleFull)** for PR 11102 at commit [`022c836`](https://github.com/apache/spark/commit/022c8367a28ade4529d522e4fffe0896e75336da).


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-180720698
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50867/
    Test FAILed.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183975932
  
    **[Test build #51276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51276/consoleFull)** for PR 11102 at commit [`525a80d`](https://github.com/apache/spark/commit/525a80d116ee9bf106118bfbfb1bcd8c67a5d62e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #11102: [SPARK-13223] [ML] Add stratified sampling to ML ...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh closed the pull request at:

    https://github.com/apache/spark/pull/11102


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11102#discussion_r52696693
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StratifiedSampling.scala ---
    @@ -0,0 +1,129 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.hadoop.fs.Path
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.Transformer
    +import org.apache.spark.ml.feature.StratifiedSampling.StratifiedSamplingWriter
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.types.{StringType, StructType}
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + *
    + * Stratified sampling on the DataFrame according to the keys in a specific label column. User
    + * can set 'fraction' to set different sampling rate for each key.
    + *
    + * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
    + * @param fraction expected size of the sample as a fraction of the items
    + *  without replacement: probability that each element is chosen; fraction must be [0, 1]
    + *  with replacement: expected number of times each element is chosen; fraction must be >= 0
    + */
    +@Experimental
    +final class StratifiedSampling private(
    +    override val uid: String,
    +    val withReplacement: Boolean,
    +    val fraction: Map[String, Double])
    +  extends Transformer with HasLabelCol with HasSeed with DefaultParamsWritable {
    +
    +  @Since("2.0.0")
    +  def this(withReplacement: Boolean, fraction: Map[String, Double]) =
    +    this(Identifiable.randomUID("stratifiedSampling"), withReplacement, fraction)
    +
    +  /** @group setParam */
    +  @Since("2.0.0")
    +  def setSeed(value: Long): this.type = set(seed, value)
    +
    +  /** @group setParam */
    +  @Since("2.0.0")
    +  def setLabel(value: String): this.type = set(labelCol, value)
    +
    +  setDefault(seed -> Utils.random.nextLong)
    +
    +  @Since("2.0.0")
    +  override def transform(data: DataFrame): DataFrame = {
    +    transformSchema(data.schema)
    --- End diff --
    
    Turn on logging here: `transformSchema(data.schema, logging = true)`.
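
For readers unfamiliar with the per-key `fraction` semantics described in the doc comment of the diff above, the without-replacement case can be sketched in plain Scala collections. This is an illustration only, not the PR's implementation and not Spark's `sampleByKey`: the object name, the method signature, and the choice to drop rows whose key has no entry in `fractions` are all assumptions made for the example.

```scala
import scala.util.Random

// Hypothetical, Spark-free illustration of stratified sampling without replacement.
object StratifiedSampleSketch {
  // Bernoulli sampling: each row is kept independently with the probability
  // registered for its key. Rows whose key is missing from `fractions` are
  // dropped (an assumption of this sketch).
  def sampleByKey[T](
      rows: Seq[(String, T)],
      fractions: Map[String, Double],
      seed: Long): Seq[(String, T)] = {
    require(fractions.values.forall(f => f >= 0.0 && f <= 1.0),
      "without replacement, every fraction must be in [0, 1]")
    val rng = new Random(seed) // fixed seed makes the sample reproducible
    rows.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
  }
}
```

With `Map("fraud" -> 1.0, "normal" -> 0.1)`, every fraud row is kept and roughly one in ten normal rows survives, which is the kind of rebalancing the PR description mentions for unbalanced data. The with-replacement case (expected number of draws per element) is not covered here.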


[GitHub] spark issue #11102: [SPARK-13223] [ML] Add stratified sampling to ML feature...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/11102
  
    Closing this since it has been overlooked for some time and can be implemented easily with https://github.com/apache/spark/pull/17583. Thanks for the review and comments.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183861974
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51260/
    Test PASSed.


[GitHub] spark pull request: [SPARK-13223] [ML] Add stratified sampling to ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11102#issuecomment-183855587
  
    **[Test build #51260 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51260/consoleFull)** for PR 11102 at commit [`853fb96`](https://github.com/apache/spark/commit/853fb967ae845aabb6f70b58b339d2e2152c1906).

