Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/09/08 16:16:32 UTC
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/22365
[SPARK-25381][SQL] Stratified sampling by Column argument
## What changes were proposed in this pull request?
In this PR, I propose adding an overloaded `sampleBy` method whose first argument is of type `Column`. This makes it possible to sample by any complex column, as well as by multiple columns. For example:
```Scala
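import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

// `$"..."` needs `import spark.implicits._` (already in scope in spark-shell).
spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
    ("Alice", 10))).toDF("name", "age")
  .stat
  .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
  .show()
+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+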
```
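To make the semantics concrete, here is a minimal plain-Python sketch of stratified sampling without replacement keyed by a composite (multi-column) stratum. It is not Spark's implementation and will not reproduce Spark's output for a given seed; the function name `sample_by` and the tuple-based rows are assumptions made for illustration. Each row is kept independently with the probability assigned to its stratum, and strata missing from `fractions` are treated as fraction zero, as the `sampleBy` documentation states.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability given for its stratum.

    rows      -- list of tuples (stand-in for DataFrame rows)
    key       -- function mapping a row to its stratum (may combine several fields)
    fractions -- dict: stratum -> sampling fraction in [0, 1]
    seed      -- seed for the local random generator
    """
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # Strata that are not listed get fraction 0.0 and are never kept.
        fraction = fractions.get(key(row), 0.0)
        # One draw per row, so the result depends only on the seed and row order.
        if rng.random() < fraction:
            sampled.append(row)
    return sampled


rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
fractions = {("Alice", 10): 0.3, ("Nico", 8): 1.0}

# The stratum is the (name, age) pair, mirroring struct($"name", $"age").
result = sample_by(rows, key=lambda r: (r[0], r[1]), fractions=fractions, seed=36)
print(result)
```

In this sketch the `Nico` row is always kept (fraction 1.0), the `Bob` rows are never kept (no entry in `fractions`), and each `Alice` row is kept with probability 0.3.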
## How was this patch tested?
Added a new Scala test for sampling by multiple columns, plus Java and Python tests that check that `sampleBy` can sample by an argument of type `Column`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 sample-by-column
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22365.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22365
----
commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk <ma...@...>
Date: 2018-09-08T13:30:49Z
Adding overloaded sampleBy with Column type
commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk <ma...@...>
Date: 2018-09-08T13:39:30Z
Adding overloaded sampleBy with Column type for Java
commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk <ma...@...>
Date: 2018-09-08T14:56:36Z
Adding overloaded sampleBy with Column type for Python
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #96358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96358/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r217256279
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
* @since 1.5.0
*/
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
--- End diff --
I'm +1 for it, but we probably need to send an email to the dev list to get more feedback.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95900/testReport)** for PR 22365 at commit [`e85175e`](https://github.com/apache/spark/commit/e85175e18e95d7751748d4615792579375859786).
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22365
Merged to master.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)** for PR 22365 at commit [`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539).
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Merged build finished. Test FAILed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95900/testReport)** for PR 22365 at commit [`e85175e`](https://github.com/apache/spark/commit/e85175e18e95d7751748d4615792579375859786).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Merged build finished. Test FAILed.
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r217252035
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
* @since 1.5.0
*/
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
--- End diff --
@cloud-fan, WDYT about starting to deprecate the String method?
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95900/
Test PASSed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95835/
Test FAILed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95834/
Test FAILed.
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/22365
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/22365
@HyukjinKwon May I ask you to look at this PR one more time?
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)** for PR 22365 at commit [`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22365
retest this please
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r216482340
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
| 0| 5|
| 1| 9|
+---+-----+
+ >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
--- End diff --
Added
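A fraction of 1.0 keeps every row of a stratum, so a doctest like the one in this hunk has a deterministic result regardless of the seed. As a plain-Python check, assume (the hunk does not show it) that `dataset` holds the ids 0 to 99 with `key = id % 3`; the variable names below are illustrative only.

```python
# Assumed stand-in for the doctest's dataset: ids 0..99 with key = id % 3.
keys = [i % 3 for i in range(100)]

# With fractions={2: 1.0}, every row whose key is 2 is kept and every
# other stratum (fraction treated as zero) is dropped.
kept = [k for k in keys if k == 2]

print(len(kept))
```

Under that assumption the count is 33, the number of ids in 0..99 that are congruent to 2 modulo 3.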
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #96343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96343/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Can one of the admins verify this patch?
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)** for PR 22365 at commit [`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3).
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r219034294
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
* @since 1.5.0
*/
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
+ sampleBy(Column(col), fractions, seed)
+ }
+
+ /**
+ * Returns a stratified sample without replacement based on the fraction given on each stratum.
+ * @param col column that defines strata
+ * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
+ * its fraction as zero.
+ * @param seed random seed
+ * @tparam T stratum type
+ * @return a new `DataFrame` that represents the stratified sample
+ *
+ * @since 1.5.0
+ */
+ def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
+ sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
+ }
+
+ /**
+ * Returns a stratified sample without replacement based on the fraction given on each stratum.
+ * @param col column that defines strata
+ * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat
+ * its fraction as zero.
+ * @param seed random seed
+ * @tparam T stratum type
+ * @return a new `DataFrame` that represents the stratified sample
+ *
+ * The stratified sample can be performed over multiple columns:
+ * {{{
+ * import org.apache.spark.sql.Row
+ * import org.apache.spark.sql.functions.struct
+ *
+ * val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
+ * ("Alice", 10))).toDF("name", "age")
+ * val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
+ * df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
+ * +-----+---+
+ * | name|age|
+ * +-----+---+
+ * | Nico| 8|
+ * |Alice| 10|
+ * +-----+---+
+ * }}}
+ *
+ * @since 3.0.0
--- End diff --
the next release is 2.5.0
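For the example in the Scaladoc above, the expected number of sampled rows per stratum is the stratum size times its fraction, assuming each row is kept independently with its stratum's probability. The short calculation below is a plain-Python sketch with illustrative names; it suggests why output containing one `Nico` row and one `Alice` row is a plausible outcome but, for `Alice`, not a guaranteed one.

```python
from collections import Counter

rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
fractions = {("Alice", 10): 0.3, ("Nico", 8): 1.0}

# Count how many rows fall into each (name, age) stratum.
sizes = Counter(rows)

# Expected sampled rows = stratum size * fraction; unlisted strata use 0.0.
expected = {stratum: n * fractions.get(stratum, 0.0) for stratum, n in sizes.items()}

for stratum, value in expected.items():
    print(stratum, value)
```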
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Can one of the admins verify this patch?
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #96358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96358/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/22365
> Seems fine but I or someone else should take a closer look before getting this in.
@HyukjinKwon Whom can I ask to look at this? @gatorsmile Please advise.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22365
Seems fine but I or someone else should take a closer look before getting this in.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Merged build finished. Test PASSed.
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r216233066
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
| 0| 5|
| 1| 9|
+---+-----+
+ >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
--- End diff --
@MaxGekk, shall we add:
```python
.. versionchanged:: 3.0
blah blah blah
```
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22365
LGTM
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96358/
Test PASSed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/22365
@HyukjinKwon @cloud-fan Are there any objections from you that could block the PR?
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)** for PR 22365 at commit [`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96343/
Test FAILed.
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r217257137
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
* @since 1.5.0
*/
def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = {
--- End diff --
Will probably send an email after 2.4.0 since it's not going to be super urgent.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Merged build finished. Test FAILed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22365
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95836/
Test PASSed.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #96343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96343/testReport)** for PR 22365 at commit [`1740d60`](https://github.com/apache/spark/commit/1740d60a9bdc1c84b1d74d7637411396b9fbff75).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)** for PR 22365 at commit [`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472).
---
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22365#discussion_r216233575
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -880,18 +880,23 @@ def sampleBy(self, col, fractions, seed=None):
         |  0|    5|
         |  1|    9|
         +---+-----+
+        >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
+        33
         """
-        if not isinstance(col, basestring):
-            raise ValueError("col must be a string, but got %r" % type(col))
+        if isinstance(col, basestring):
+            col = Column(col)
+        elif not isinstance(col, Column):
+            raise ValueError("col must be a string or a column, but got %r" % type(col))
         if not isinstance(fractions, dict):
             raise ValueError("fractions must be a dict but got %r" % type(fractions))
         for k, v in fractions.items():
             if not isinstance(k, (float, int, long, basestring)):
                 raise ValueError("key must be float, int, long, or string, but got %r" % type(k))
             fractions[k] = float(v)
         seed = seed if seed is not None else random.randint(0, sys.maxsize)
-        return DataFrame(self._jdf.stat().sampleBy(col, self._jmap(fractions), seed), self.sql_ctx)
+        return DataFrame(self._jdf.stat()
+                         .sampleBy(col._jc, self._jmap(fractions), seed), self.sql_ctx)
--- End diff --
I would just do `col = col._jc`
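The argument handling in this hunk can be sketched outside Spark. The following is a hedged Python 3 illustration: `Column` here is a hypothetical stand-in class (not `pyspark.sql.Column`), and `normalize_sample_by_args` is an invented helper name. It accepts either a column name or a column object, validates `fractions`, and builds a new dict of floats rather than mutating the caller's dict, which is a design choice of this sketch and not of the patch.

```python
class Column:
    """Hypothetical stand-in for a column expression; it only stores a name."""

    def __init__(self, name):
        self.name = name


def normalize_sample_by_args(col, fractions):
    """Validate sampleBy-style arguments and return (Column, dict of floats)."""
    if isinstance(col, str):
        # A plain string is promoted to a column, so both forms share one code path.
        col = Column(col)
    elif not isinstance(col, Column):
        raise ValueError("col must be a string or a column, but got %r" % type(col))
    if not isinstance(fractions, dict):
        raise ValueError("fractions must be a dict but got %r" % type(fractions))
    normalized = {}
    for key, value in fractions.items():
        if not isinstance(key, (float, int, str)):
            raise ValueError("key must be float, int, or string, but got %r" % type(key))
        # Coerce every fraction to float, as the patched method does.
        normalized[key] = float(value)
    return col, normalized


col, fractions = normalize_sample_by_args("key", {2: 1})
print(col.name, fractions)
```

Promoting the string to a column first means everything after the type check only has to deal with one kind of argument, which is the same idea as the reviewer's suggestion to reassign `col` once and use it afterwards.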
---
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22365
**[Test build #95835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)** for PR 22365 at commit [`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---