Posted to issues@spark.apache.org by "Shafique Jamal (JIRA)" <ji...@apache.org> on 2016/06/10 23:40:21 UTC
[jira] [Created] (SPARK-15890) Support Stata-like tabulation of values in a single column, optionally with weights

Shafique Jamal created SPARK-15890:
--------------------------------------

             Summary: Support Stata-like tabulation of values in a single column, optionally with weights
                 Key: SPARK-15890
                 URL: https://issues.apache.org/jira/browse/SPARK-15890
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Shafique Jamal
            Priority: Minor


In Stata, one can tabulate the values in a single column of a dataset and optionally apply weights. For example, if your data looks like this:

     +-----------------+
     | id   gender   w |
     |-----------------|
  1. |  1        M   2 |
  2. |  2        M   4 |
  3. |  3        M   1 |
  4. |  4        F   1 |
  5. |  5        F   3 |
     +-----------------+
(where w is the weight), you can tabulate the values of gender and get this result:

. tab gender

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          F |          2       40.00       40.00
          M |          3       60.00      100.00
------------+-----------------------------------
      Total |          5      100.00

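You can apply weights to this tabulation as follows (aw denotes analytic weights; the weighted frequencies are rescaled so that they still sum to the number of observations, 5):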

. tab gender [aw=w]

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          F | 1.81818182       36.36       36.36
          M | 3.18181818       63.64      100.00
------------+-----------------------------------
      Total |          5      100.00
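
The weighted frequencies above can be reproduced by hand: sum the weights within each group, then rescale by (number of rows) / (sum of all weights). Below is a minimal sketch in plain Scala, assuming that normalization. The names Obs, WeightedTab, tabulate and sample are hypothetical and used only for illustration.

```scala
// Hypothetical record type mirroring the example data (id, gender, w).
final case class Obs(id: Int, gender: String, w: Double)

object WeightedTab {
  // Returns (value, weighted frequency, proportion) for each distinct gender,
  // sorted by value. Assumes the weights are rescaled so that they sum to the
  // number of rows, and that the total weight is non-zero.
  def tabulate(data: Seq[Obs]): Seq[(String, Double, Double)] = {
    val n = data.size.toDouble        // number of observations
    val totalW = data.map(_.w).sum    // sum of all weights
    data.groupBy(_.gender).toSeq.sortBy(_._1).map { case (g, rows) =>
      val wSum = rows.map(_.w).sum    // sum of weights within this group
      (g, wSum * n / totalW, wSum / totalW)
    }
  }

  // The five observations from the example table.
  val sample: Seq[Obs] = Seq(
    Obs(1, "M", 2.0), Obs(2, "M", 4.0), Obs(3, "M", 1.0),
    Obs(4, "F", 1.0), Obs(5, "F", 3.0)
  )
}
```

For the sample data, F has a weight sum of 4 and M has 7, out of 11 in total, so the frequencies are 4 * 5 / 11 (about 1.818) and 7 * 5 / 11 (about 3.182), in line with the table.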

I would like to have the same capability with Spark DataFrames. Here is what I have done:

https://github.com/shafiquejamal/spark/commit/24ed3151db1ed2188ad67b2b5ccbf2883adf7af2

This allows me to do the following:

    // Assumes the session's implicits are in scope so that toDF is available (as in spark-shell).
    val obs1 = ("1", "M", 10, "P", 2d)
    val obs2 = ("2", "M", 12, "S", 4d)
    val obs3 = ("3", "M", 13, "B", 1d)
    val obs4 = ("4", "F", 11, "P", 1d)
    val obs5 = ("5", "F", 13, "M", 3d)
    val df = Seq(obs1, obs2, obs3, obs4, obs5).toDF("id", "gender", "age", "educ", "w")

    val tabWithoutWeights = df.stat.tab("gender")
    val tabWithWeights = df.stat.tab("gender", "w")

    tabWithoutWeights.show()
    tabWithWeights.show()

This yields the following:

+------+-------------+---------+----------+
|gender|count(gender)|Frequency|Proportion|
+------+-------------+---------+----------+
|     F|            2|      2.0|       0.4|
|     M|            3|      3.0|       0.6|
+------+-------------+---------+----------+

+------+-------------+------------------+-------------------+
|gender|count(gender)|         Frequency|         Proportion|
+------+-------------+------------------+-------------------+
|     F|            2|1.8181818181818181|0.36363636363636365|
|     M|            3|3.1818181818181817| 0.6363636363636364|
+------+-------------+------------------+-------------------+
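
For comparison, similar numbers can be obtained with the existing DataFrame API by combining groupBy with aggregate functions. The sketch below is not the implementation from the linked commit. TabSketch and tab are hypothetical names, the output columns only approximate the ones shown above, and the input is assumed to be non-empty with non-null numeric weights.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, lit, sum}

object TabSketch {
  // Hypothetical helper: tabulates `column`, weighting each row by `weight`.
  // Pass lit(1.0) as the weight for an unweighted tabulation.
  def tab(df: DataFrame, column: String, weight: Column): DataFrame = {
    val w = weight.cast("double")

    // One pass for the overall row count and the overall weight sum.
    val totals = df.agg(count(lit(1)), sum(w)).first()
    val n = totals.getLong(0).toDouble
    val totalW = totals.getDouble(1)

    df.groupBy(column)
      .agg(count(lit(1)).as("count"), sum(w).as("wsum"))
      // Rescale so that the weighted frequencies sum to the row count.
      .withColumn("Frequency", col("wsum") * n / totalW)
      .withColumn("Proportion", col("wsum") / totalW)
      .drop("wsum")
      .orderBy(column)
  }
}
```

With the DataFrame from the example, TabSketch.tab(df, "gender", col("w")) would correspond to the weighted call, and TabSketch.tab(df, "gender", lit(1.0)) to the unweighted one. Computing the totals first costs an extra job, which keeps the sketch simple at the expense of a second pass over the data.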
