Posted to issues@spark.apache.org by "Shafique Jamal (JIRA)" <ji...@apache.org> on 2016/06/10 23:40:21 UTC
[jira] [Created] (SPARK-15890) Support Stata-like tabulation of values in a single column, optionally with weights
Shafique Jamal created SPARK-15890:
--------------------------------------
Summary: Support Stata-like tabulation of values in a single column, optionally with weights
Key: SPARK-15890
URL: https://issues.apache.org/jira/browse/SPARK-15890
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Shafique Jamal
Priority: Minor
In Stata, one can tabulate the values in a single column of a dataset and optionally provide weights. For example, if your data looks like this:
     +-----------------+
     | id   gender   w |
     |-----------------|
  1. |  1        M   2 |
  2. |  2        M   4 |
  3. |  3        M   1 |
  4. |  4        F   1 |
  5. |  5        F   3 |
     +-----------------+
(where w is weight), you can tabulate the values of gender and get this result:
. tab gender
     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          F |          2       40.00       40.00
          M |          3       60.00      100.00
------------+-----------------------------------
      Total |          5      100.00
You can apply weights to this tabulation as follows:
. tab gender [aw=w]
     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          F | 1.81818182       36.36       36.36
          M | 3.18181818       63.64      100.00
------------+-----------------------------------
      Total |          5      100.00
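The weighted frequencies above are consistent with rescaling the weights so that they sum to the number of observations: the weights total 11 over 5 rows, so F gets (1 + 3) * 5 / 11 and M gets (2 + 4 + 1) * 5 / 11. A minimal plain-Scala sketch of that arithmetic follows; the object and method names (AweightTab, tabulate) are made up for illustration and are not part of Stata or Spark.

```scala
object AweightTab {
  // (gender, w) pairs taken from the example dataset above.
  val rows: Seq[(String, Double)] =
    Seq(("M", 2.0), ("M", 4.0), ("M", 1.0), ("F", 1.0), ("F", 3.0))

  // Returns (value, frequency, percent) per distinct value, sorted by value.
  // Assumes a non-empty input whose weights sum to a non-zero total.
  def tabulate(rows: Seq[(String, Double)]): Seq[(String, Double, Double)] = {
    val n = rows.size.toDouble
    val totalWeight = rows.map(_._2).sum
    rows.groupBy(_._1).toSeq.sortBy(_._1).map { case (value, group) =>
      val weightSum = group.map(_._2).sum
      // Rescale so that the frequencies add up to the number of rows.
      (value, weightSum * n / totalWeight, 100.0 * weightSum / totalWeight)
    }
  }

  def main(args: Array[String]): Unit =
    tabulate(rows).foreach { case (value, freq, pct) =>
      println(f"$value%-6s $freq%.8f $pct%.2f")
    }
}
```

Setting every weight to 1 reduces this to the unweighted tabulation, since the total weight then equals the row count.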
I would like to have the same capability with Spark dataframes. Here is what I have done:
https://github.com/shafiquejamal/spark/commit/24ed3151db1ed2188ad67b2b5ccbf2883adf7af2
This allows me to do the following:
val obs1 = ("1", "M", 10, "P", 2d)
val obs2 = ("2", "M", 12, "S", 4d)
val obs3 = ("3", "M", 13, "B", 1d)
val obs4 = ("4", "F", 11, "P", 1d)
val obs5 = ("5", "F", 13, "M", 3d)
val df = Seq(obs1, obs2, obs3, obs4, obs5).toDF("id", "gender", "age", "educ", "w")
val tabWithoutWeights = df.stat.tab("gender")
val tabWithWeights = df.stat.tab("gender", "w")
tabWithoutWeights.show()
tabWithWeights.show()
This yields the following:
+------+-------------+---------+----------+
|gender|count(gender)|Frequency|Proportion|
+------+-------------+---------+----------+
|     F|            2|      2.0|       0.4|
|     M|            3|      3.0|       0.6|
+------+-------------+---------+----------+
+------+-------------+------------------+-------------------+
|gender|count(gender)|         Frequency|         Proportion|
+------+-------------+------------------+-------------------+
|     F|            2|1.8181818181818181|0.36363636363636365|
|     M|            3|3.1818181818181817| 0.6363636363636364|
+------+-------------+------------------+-------------------+
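For comparison, roughly the same table can be built from existing DataFrame operations (groupBy, agg, sum, count). The sketch below is an assumption-laden illustration, not the code in the linked commit: TabSketch and its tab method are hypothetical names, and it assumes a non-empty DataFrame with no nulls in the tabulated or weight columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, sum}

object TabSketch {
  // Hypothetical helper; not the df.stat.tab method proposed in the issue.
  def tab(df: DataFrame, column: String, weight: Option[String] = None): DataFrame = {
    // Use a constant weight of 1.0 when no weight column is given.
    val w = weight.map(name => col(name).cast("double")).getOrElse(lit(1.0))
    val n = df.count().toDouble
    // One extra pass over the data to get the total weight.
    val totalWeight = df.agg(sum(w)).first().getDouble(0)

    df.groupBy(column)
      .agg(count(col(column)).as("count"), sum(w).as("weightSum"))
      // Rescale so that the frequencies add up to the number of rows.
      .withColumn("Frequency", col("weightSum") * n / totalWeight)
      .withColumn("Proportion", col("weightSum") / totalWeight)
      .drop("weightSum")
      .orderBy(column)
  }
}
```

With the df defined earlier, TabSketch.tab(df, "gender", Some("w")) would be the weighted call and TabSketch.tab(df, "gender") the unweighted one. The helper triggers two actions before the grouped query runs, which a built-in implementation might avoid.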