Posted to issues@spark.apache.org by "Maxim Gekk (JIRA)" <ji...@apache.org> on 2018/09/08 16:07:00 UTC
[jira] [Created] (SPARK-25381) Stratified sampling by Column argument
Maxim Gekk created SPARK-25381:
----------------------------------
Summary: Stratified sampling by Column argument
Key: SPARK-25381
URL: https://issues.apache.org/jira/browse/SPARK-25381
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk
Currently, the sampleBy method accepts only a string (a column name) as its first argument. An overloaded method that also accepts a Column should be provided. This would allow sampling by multiple columns, for example:
{code:scala}
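import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
// Per-stratum sampling fractions, keyed by (name, age)
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
// Proposed overload: the first argument is a Column rather than a column name
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()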
+-----+---+
| name|age|
+-----+---+
| Nico| 8|
|Alice| 10|
+-----+---+
{code}
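To illustrate the intended semantics without depending on Spark, here is a minimal plain-Scala sketch of stratified sampling keyed by a composite (name, age) value. The helper name sampleByKey is hypothetical and is not part of Spark. The sketch assumes that each row is kept independently with the probability assigned to its stratum, and that strata missing from the fractions map get a fraction of 0.0. It is not Spark's implementation, and for the same seed it will not necessarily select the same rows as the example above.
{code:scala}
import scala.util.Random

// Hypothetical helper (not a Spark API): keep each element independently with
// the probability assigned to its stratum. Strata absent from `fractions`
// are assumed to have fraction 0.0 and are therefore always dropped.
def sampleByKey[A, K](rows: Seq[A], fractions: Map[K, Double], seed: Long)(key: A => K): Seq[A] = {
  val rng = new Random(seed)
  // nextDouble() is in [0.0, 1.0), so fraction 1.0 always keeps and 0.0 never keeps
  rows.filter(row => rng.nextDouble() < fractions.getOrElse(key(row), 0.0))
}

val rows = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))
// The stratum key is the whole (name, age) pair, mirroring struct($"name", $"age")
val fractions = Map(("Alice", 10) -> 0.3, ("Nico", 8) -> 1.0)
val sample = sampleByKey(rows, fractions, 36L)(identity)
{code}
With these fractions, ("Nico", 8) is always kept and both ("Bob", 17) rows are always dropped. Whether an ("Alice", 10) row is kept depends on the random draws for the given seed.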