Posted to issues@spark.apache.org by "Maxim Gekk (JIRA)" <ji...@apache.org> on 2018/09/08 16:07:00 UTC

[jira] [Created] (SPARK-25381) Stratified sampling by Column argument

Maxim Gekk created SPARK-25381:
----------------------------------

             Summary: Stratified sampling by Column argument
                 Key: SPARK-25381
                 URL: https://issues.apache.org/jira/browse/SPARK-25381
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Maxim Gekk


Currently, the sampleBy method accepts only a String (the column name) as its first argument. An overloaded method that also accepts a Column should be provided. This would allow sampling by multiple columns, for example:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct
val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
// Output:
// +-----+---+
// | name|age|
// +-----+---+
// | Nico|  8|
// |Alice| 10|
// +-----+---+
{code} 
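
For comparison, a similar result can be approximated with the existing String-based sampleBy by first collapsing the columns into a single key column. The sketch below is only an illustration of that workaround, not part of the proposed change: the "stratum" column name, the "|" separator and the local SparkSession settings are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

// Assumption: a local session for illustration; getOrCreate() reuses an existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sampleBy-workaround")
  .getOrCreate()

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")

// Hypothetical helper column: one string key per (name, age) stratum.
// The "|" separator is arbitrary and must not occur in the data itself.
val keyed = df.withColumn("stratum",
  concat_ws("|", col("name"), col("age").cast("string")))

// Strata that are missing from the map are sampled with fraction 0.0.
val fractions = Map("Alice|10" -> 0.3, "Nico|8" -> 1.0)

// Existing overload: sampleBy(col: String, fractions: Map[T, Double], seed: Long)
val sampled = keyed.stat.sampleBy("stratum", fractions, 36L).drop("stratum")
sampled.show()
```

The extra column and the string encoding of the keys are what a Column-based overload would make unnecessary, since the stratum could then be expressed directly as struct($"name", $"age").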
