
Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/09/08 16:16:32 UTC

[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/22365

    [SPARK-25381][SQL] Stratified sampling by Column argument

    ## What changes were proposed in this pull request?
    
    In this PR, I propose adding an overloaded `sampleBy` method whose first argument is of type `Column`. This allows sampling by arbitrary complex columns as well as by multiple columns. For example:
    
    ```Scala
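    // `$"name"` requires `import spark.implicits._`, which spark-shell provides automatically.
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.struct

    spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),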
      ("Alice", 10))).toDF("name", "age")
      .stat
      .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
      .show()
    
    +-----+---+
    | name|age|
    +-----+---+
    | Nico|  8|
    |Alice| 10|
    +-----+---+
    ```
    
    ## How was this patch tested?
    
    Added a new Scala test for sampling by multiple columns, plus Java and Python tests that check that `sampleBy` accepts a `Column` argument.
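
The stratified sampling described in this pull request can be illustrated with a small plain-Python sketch. It is not Spark's implementation: it assumes that each row is kept independently with the probability assigned to its stratum, and that strata missing from `fractions` get probability zero (as the Scaladoc in the review states). The names `sample_by` and `key` are hypothetical, and Python's random number generator differs from Spark's, so the rows selected for a given seed will not match the output shown in the description.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum.

    Hypothetical helper for illustration only; it mimics the documented
    semantics (sampling without replacement, missing strata -> 0.0).
    """
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # Strata that are not listed in `fractions` are never sampled.
        fraction = fractions.get(key(row), 0.0)
        # rng.random() is in [0.0, 1.0), so 1.0 always keeps and 0.0 never keeps.
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
# The stratum is the (name, age) pair, mirroring struct($"name", $"age").
fractions = {("Alice", 10): 0.3, ("Nico", 8): 1.0}
sample = sample_by(rows, key=lambda r: (r[0], r[1]), fractions=fractions, seed=36)
```

Using a tuple as the dictionary key plays the role that `Row("Alice", 10)` plays in the Scala example: the stratum is identified by the combination of column values rather than by a single column.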


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 sample-by-column

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22365
    
----
commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk <ma...@...>
Date:   2018-09-08T13:30:49Z

    Adding overloaded sampleBy with Column type

commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk <ma...@...>
Date:   2018-09-08T13:39:30Z

    Adding overloaded sampleBy with Column type for Java

commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk <ma...@...>
Date:   2018-09-08T14:56:36Z

    Adding overloaded sampleBy with Column type for Python

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #96358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96358/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r217256279
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    --- End diff --
    
    I'm +1 for it, but we probably need to send an email to the dev list to get more feedback.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95900/testReport)** for PR 22365 at commit [`e85175e`](https://github.com/apache/spark/commit/e85175e18e95d7751748d4615792579375859786).


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged to master.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)** for PR 22365 at commit [`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539).


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95900/testReport)** for PR 22365 at commit [`e85175e`](https://github.com/apache/spark/commit/e85175e18e95d7751748d4615792579375859786).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged build finished. Test FAILed.


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r217252035
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    --- End diff --
    
    @cloud-fan, WDYT about starting to deprecate the String method?


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95900/
    Test PASSed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95835/
    Test FAILed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95834/
    Test FAILed.


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22365


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    @HyukjinKwon May I ask you to look at this PR one more time?


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)** for PR 22365 at commit [`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    retest this please


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r216482340
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
             |  0|    5|
             |  1|    9|
             +---+-----+
    +        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
    --- End diff --
    
    Added


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #96343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96343/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)** for PR 22365 at commit [`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3).


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r219034294
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    +    sampleBy(Column(col), fractions, seed)
    +  }
    +
    +  /**
    +   * Returns a stratified sample without replacement based on the fraction given on each stratum.
    +   * @param col column that defines strata
    +   * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
    +   *                  its fraction as zero.
    +   * @param seed random seed
    +   * @tparam T stratum type
    +   * @return a new `DataFrame` that represents the stratified sample
    +   *
    +   * @since 1.5.0
    +   */
    +  def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
    +    sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
    +  }
    +
    +  /**
    +   * Returns a stratified sample without replacement based on the fraction given on each stratum.
    +   * @param col column that defines strata
    +   * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
    +   *                  its fraction as zero.
    +   * @param seed random seed
    +   * @tparam T stratum type
    +   * @return a new `DataFrame` that represents the stratified sample
    +   *
    +   * The stratified sample can be performed over multiple columns:
    +   * {{{
    +   *    import org.apache.spark.sql.Row
    +   *    import org.apache.spark.sql.functions.struct
    +   *
    +   *    val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
    +   *      ("Alice", 10))).toDF("name", "age")
    +   *    val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
    +   *    df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
    +   *    +-----+---+
    +   *    | name|age|
    +   *    +-----+---+
    +   *    | Nico|  8|
    +   *    |Alice| 10|
    +   *    +-----+---+
    +   * }}}
    +   *
    +   * @since 3.0.0
    --- End diff --
    
    the next release is 2.5.0


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Can one of the admins verify this patch?


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #96358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96358/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    > Seems fine but I or someone else should take a closer look before getting this in.
    
    @HyukjinKwon Whom can I ask to look at this? @gatorsmile Please advise.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Seems fine but I or someone else should take a closer look before getting this in.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r216233066
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
             |  0|    5|
             |  1|    9|
             +---+-----+
    +        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
    --- End diff --
    
    @MaxGekk, shall we add:
    
    ```python
            .. versionchanged:: 3.0
               blah blah blah
    ```


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    LGTM


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96358/
    Test PASSed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by MaxGekk <gi...@git.apache.org>.

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    @HyukjinKwon @cloud-fan Are there any objections from you that could block the PR?


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)** for PR 22365 at commit [`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96343/
    Test FAILed.


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r217257137
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        * @since 1.5.0
        */
       def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
    --- End diff --
    
    Will probably send an email after 2.4.0 since it's not going to be super urgent.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged build finished. Test FAILed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95836/
    Test PASSed.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #96343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96343/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)** for PR 22365 at commit [`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472).


---


[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22365#discussion_r216233575
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
             |  0|    5|
             |  1|    9|
             +---+-----+
    +        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
    +        33
     
             """
    -        if not isinstance(col, basestring):
    -            raise ValueError("col must be a string, but got %r" % type(col))
    +        if isinstance(col, basestring):
    +            col = Column(col)
    +        elif not isinstance(col, Column):
    +            raise ValueError("col must be a string or a column, but got %r" % type(col))
             if not isinstance(fractions, dict):
                 raise ValueError("fractions must be a dict but got %r" % type(fractions))
             for k, v in fractions.items():
                 if not isinstance(k, (float, int, long, basestring)):
                     raise ValueError("key must be float, int, long, or string, but got %r" % type(k))
                 fractions[k] = float(v)
             seed = seed if seed is not None else random.randint(0, sys.maxsize)
    -        return DataFrame(self._jdf.stat().sampleBy(col, self._jmap(fractions), seed), self.sql_ctx)
    +        return DataFrame(self._jdf.stat()
    +                         .sampleBy(col._jc, self._jmap(fractions), seed), self.sql_ctx)
    --- End diff --
    
    I would just do `col = col._jc`
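
The argument handling in the diff above, together with the suggestion to rebind `col` to its JVM handle, can be sketched in isolation roughly as follows. This is only an illustration under stated assumptions: `Column` here is a hypothetical stand-in rather than `pyspark.sql.Column`, `normalize_sample_by_args` is an invented helper name, and Python 3's `str` replaces the `basestring` and `long` checks used in the diff.

```python
class Column:
    """Hypothetical stand-in for a column wrapper; `_jc` mimics the JVM handle."""

    def __init__(self, name):
        self._jc = "jcol:" + name


def normalize_sample_by_args(col, fractions):
    """Validate the arguments and return (jvm_column, float_fractions)."""
    # Accept either a column name or a Column, as in the proposed change.
    if isinstance(col, str):
        col = Column(col)
    elif not isinstance(col, Column):
        raise ValueError("col must be a string or a column, but got %r" % type(col))
    # Unwrap once, so later code passes `col` straight through.
    col = col._jc

    if not isinstance(fractions, dict):
        raise ValueError("fractions must be a dict but got %r" % type(fractions))
    normalized = {}
    for k, v in fractions.items():
        if not isinstance(k, (float, int, str)):
            raise ValueError("key must be float, int, or string, but got %r" % type(k))
        normalized[k] = float(v)
    return col, normalized


jc, fr = normalize_sample_by_args("key", {2: 1})
```

Unlike the code in the diff, this sketch copies the fractions into a new dict instead of mutating the caller's argument; that is a design choice of the sketch, not something the pull request does.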


---


[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22365
  
    **[Test build #95835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)** for PR 22365 at commit [`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
